Hugging Face has announced a new purpose-built Inference Container for Amazon SageMaker, designed to simplify the deployment of Large Language Models (LLMs) in a secure, managed environment. The new container, powered by Text Generation Inference (TGI), brings to AWS customers the optimized performance and features that were previously available only in high-traffic LLM services such as HuggingChat and OpenAssistant.
TGI is an open-source solution for serving LLMs, supporting Tensor Parallelism and custom CUDA kernels for popular architectures such as BLOOM, GPT-NeoX, Llama, StarCoder, and Falcon. It also includes quantization via bitsandbytes, continuous batching of incoming requests for increased total throughput, accelerated weight loading with safetensors, logits warpers (temperature scaling, top-k, repetition penalty), watermarking as proposed in "A Watermark for Large Language Models," and token streaming using Server-Sent Events (SSE).
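As a sketch of what token streaming over Server-Sent Events looks like in practice, the snippet below reads tokens from a TGI server's `/generate_stream` route with the `requests` library. The server URL is a placeholder, and the payload and event shapes follow TGI's documented generate API; treat this as a minimal illustration rather than a complete client.

```python
import json
import requests

# Placeholder URL for a running TGI server; replace with your own host.
TGI_STREAM_URL = "http://localhost:8080/generate_stream"

payload = {
    "inputs": "What is Text Generation Inference?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

# TGI streams tokens as Server-Sent Events: each event arrives as a line
# prefixed with "data:" whose body is a JSON object describing one token.
with requests.post(TGI_STREAM_URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue  # skip keep-alive blanks and non-data lines
        event = json.loads(line[len(b"data:"):])
        token = event["token"]
        if not token["special"]:  # omit special tokens like <|endoftext|>
            print(token["text"], end="", flush=True)
```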
The container officially supports models such as BLOOM/BLOOMZ, MT0-XXL, Galactica, SantaCoder/StarCoder, GPT-NeoX 20B, FLAN-T5-XXL, Llama (and its variants Vicuna, Alpaca, and Koala), and Falcon 7B/40B.
To demonstrate the container, Hugging Face provides a step-by-step guide to deploying the 12B Pythia Open Assistant model, an open-source chat LLM trained on the Open Assistant dataset. The process involves setting up the development environment with the SageMaker Python SDK, retrieving the container URI, deploying the model, running inference, and optionally building a Gradio chatbot backed by the SageMaker endpoint, as sketched below.
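A condensed sketch of that workflow with the SageMaker Python SDK might look like the following. The container version (`0.8.2`), Hub model ID (`OpenAssistant/pythia-12b-sft-v8-7k-steps`), instance type, and prompt format are assumptions for illustration; verify them against the official guide and the releases available in your region.

```python
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # requires a SageMaker execution role

# Step 1: retrieve the URI of the Hugging Face LLM (TGI) container.
# The version string is an assumption; check the SDK for the latest release.
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

# Step 2: configure and deploy the model. The Hub model ID below is an
# assumed checkpoint for the 12B Pythia Open Assistant model.
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "OpenAssistant/pythia-12b-sft-v8-7k-steps",
        "SM_NUM_GPUS": json.dumps(4),          # shard across 4 GPUs via tensor parallelism
        "MAX_INPUT_LENGTH": json.dumps(1024),  # max prompt length in tokens
        "MAX_TOTAL_TOKENS": json.dumps(2048),  # prompt + generated tokens
    },
)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",              # 4x NVIDIA A10G, fits the 12B model
    container_startup_health_check_timeout=300,  # allow time to download weights
)

# Step 3: run inference against the endpoint. The <|prompter|>/<|assistant|>
# markers follow the Open Assistant prompt convention.
response = llm.predict({
    "inputs": "<|prompter|>What is Amazon SageMaker?<|endoftext|><|assistant|>",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response[0]["generated_text"])
```

The deployed `llm` predictor can then serve as the backend for a Gradio chatbot, with each user turn forwarded to `llm.predict` and the generated text appended to the chat history.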
With this release, businesses can leverage the same technologies driving low-latency, high-concurrency LLM experiences directly within AWS's robust infrastructure.