Open-source large language models (LLMs) such as Falcon, (Open-)LLaMA, X-Gen, StarCoder, and RedPajama have advanced significantly in recent months, rivaling proprietary models like ChatGPT or GPT-4 for specific tasks. However, deploying these models efficiently remains challenging.
This article demonstrates how to deploy open-source LLMs using Hugging Face Inference Endpoints, a managed SaaS solution that simplifies model deployment. You'll also learn to stream responses and test endpoint performance.
What is Hugging Face Inference Endpoints?
Hugging Face Inference Endpoints provides a secure, straightforward way to deploy machine learning models in production. Key benefits for LLM deployment include:
- Easy Deployment: Convert models into production-ready APIs with a few clicks, without handling infrastructure or MLOps.
- Cost Efficiency: Automatic scaling down to zero reduces costs when not in use, with billing based on uptime.
- Enterprise Security: Deploy models in secure offline endpoints with VPC connections, SOC 2 Type 2 certification, and GDPR compliance.
- LLM Optimization: Built on Text Generation Inference, providing high throughput via Paged Attention and low latency via Flash Attention.
- Comprehensive Task Support: Out-of-the-box support for Transformers, Sentence-Transformers, and Diffusers, plus customization for any ML task.
Get started at ui.endpoints.huggingface.co.
1. Deploy Falcon 40B Instruct
Log in with a User or Organization account that has a payment method on file. Navigate to Inference Endpoints and click "New endpoint." Select the repository tiiuae/falcon-40b-instruct, choose cloud and region, adjust instance and security settings, then deploy.
Inference Endpoints suggests an instance based on model size (e.g., 4x NVIDIA T4). For best LLM performance, change the instance to GPU [xlarge] · 1x Nvidia A100. If unavailable, request a quota increase.
Click "Create Endpoint" and wait about 10 minutes for it to go live.
2. Test the LLM Endpoint
The Endpoint overview includes an Inference Widget for manual requests, as well as a cURL command. For example:
curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
-X POST \
-d '{"inputs":"Once upon a time,"}' \
-H "Authorization: Bearer <hf_token>" \
-H "Content-Type: application/json"
Supported generation parameters include:
- temperature (default 1.0)
- max_new_tokens (default 20, max 512)
- repetition_penalty
- seed
- stop (list of stop tokens)
- top_k and top_p (default null)
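These all go into a parameters object in the JSON payload. Here is a small Python sketch, reusing the example endpoint URL and a placeholder token, that sends a request with a few of these parameters and reads back the generated text:

import requests

endpoint_url = "https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer <hf_token>", "Content-Type": "application/json"}
payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.95,
    },
}
response = requests.post(endpoint_url, json=payload, headers=headers)
data = response.json()
# Depending on the serving container, the result is either a dict or a
# single-element list containing {"generated_text": ...}.
result = data[0] if isinstance(data, list) else data
print(result["generated_text"])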
3. Stream Responses in JavaScript and Python
Streaming reduces perceived latency by returning tokens one by one. Use server-sent events (SSE) with the stream parameter set to true.
Streaming with Python
import requests

endpoint_url = "https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer <hf_token>", "Content-Type": "application/json"}
# "stream": True asks the endpoint to return tokens as server-sent events.
payload = {"inputs": "Once upon a time,", "parameters": {"max_new_tokens": 100}, "stream": True}
# stream=True on the request keeps the connection open so events can be read as they arrive.
response = requests.post(endpoint_url, json=payload, headers=headers, stream=True)
# Each non-empty line is one server-sent event carrying a generated token.
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))
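This prints the raw server-sent events. If you would rather not parse them yourself, the huggingface_hub library's InferenceClient can do it for you; the sketch below assumes it is installed and reuses the endpoint_url and placeholder token from above:

from huggingface_hub import InferenceClient

# Point the client at the endpoint URL and pass your token.
client = InferenceClient(model=endpoint_url, token="<hf_token>")

# With stream=True, text_generation yields the generated tokens one by one.
for token in client.text_generation("Once upon a time,", max_new_tokens=100, stream=True):
    print(token, end="", flush=True)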
Streaming with JavaScript
const endpointUrl = "https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud";
const headers = { "Authorization": "Bearer <hf_token>", "Content-Type": "application/json" };
// stream: true asks the endpoint to return tokens as server-sent events.
const payload = { inputs: "Once upon a time,", parameters: { max_new_tokens: 100 }, stream: true };

fetch(endpointUrl, { method: "POST", headers, body: JSON.stringify(payload) }).then(async (response) => {
  // Read the response body incrementally instead of waiting for the full answer.
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each chunk contains one or more server-sent events with generated tokens.
    console.log(decoder.decode(value));
  }
});
Conclusion
Hugging Face Inference Endpoints make deploying open-source LLMs simple, cost-effective, and secure. With built-in optimization and streaming support, you can build responsive AI applications without managing infrastructure.