Continuous batching is a technique used in deep learning inference to improve throughput and reduce latency. Unlike static batching, which groups a fixed set of requests and processes the whole batch to completion before accepting new work, continuous batching allows requests to join and leave the batch dynamically while it is being processed. This approach is particularly beneficial for serving large language models (LLMs), where request lengths and arrival times vary widely.
The core idea is to schedule at the granularity of individual decoding steps rather than whole batches, treating each request as an independent sequence. During inference, the model processes tokens from multiple sequences in parallel, but the sequences do not need to start or end at the same time. When one sequence finishes (generates an end-of-sequence token or reaches its length limit), its slot in the batch becomes available for a new request immediately, without waiting for the other sequences to complete.
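To make the scheduling loop concrete, here is a minimal Python sketch of one way a continuous-batching server can be organized. Everything in it is illustrative: `decode_step` is a stub standing in for the model's forward pass, and the token ids, batch-size limit, and length cap are assumed values, not the API of vLLM or any other framework.

```python
import random
from collections import deque
from dataclasses import dataclass, field

EOS_TOKEN = 2          # assumed end-of-sequence token id
MAX_BATCH_SIZE = 8     # assumed number of batch slots
MAX_NEW_TOKENS = 64    # assumed per-request generation cap

@dataclass
class Request:
    request_id: int
    prompt_tokens: list[int]
    generated: list[int] = field(default_factory=list)

def decode_step(batch: list[Request]) -> list[int]:
    # Stand-in for one forward pass: produce the next token for every
    # active sequence. A real server would run the model here.
    return [EOS_TOKEN if random.random() < 0.1 else 100 for _ in batch]

def serve(waiting: deque[Request]) -> list[Request]:
    active: list[Request] = []
    finished: list[Request] = []

    while waiting or active:
        # Admit queued requests into free slots right away instead of
        # waiting for the whole batch to drain (the "continuous" part).
        while waiting and len(active) < MAX_BATCH_SIZE:
            active.append(waiting.popleft())

        # One decode step advances every active sequence by one token.
        next_tokens = decode_step(active)

        still_active = []
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)
            # A sequence leaves the batch as soon as it emits EOS or hits
            # its length cap; its slot is reusable on the very next step.
            if tok == EOS_TOKEN or len(req.generated) >= MAX_NEW_TOKENS:
                finished.append(req)
            else:
                still_active.append(req)
        active = still_active

    return finished

if __name__ == "__main__":
    queue = deque(Request(i, [1, 5, 7]) for i in range(20))
    done = serve(queue)
    print(f"completed {len(done)} requests")
```

The essential property is that admission and retirement both happen inside the per-step loop, so a finished sequence frees its slot for the next queued request on the very next decoding iteration rather than at the end of the batch.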
This improves hardware utilization, especially on GPUs, because compute capacity is not left idle while the scheduler waits for the slowest sequences to finish. It also reduces tail latency for short requests, since they can complete quickly instead of being blocked behind longer ones in the same batch.
Continuous batching is now a standard feature in many LLM serving frameworks like vLLM, TensorRT-LLM, and Hugging Face TGI. It represents a key optimization for deploying large models in production environments where efficiency and responsiveness are critical.