Hugging Face, in collaboration with Intel, has introduced a new optimization for SetFit (Sentence Transformer Fine-tuning) inference on Intel Xeon processors via the Optimum Intel library. This enhancement leverages Intel's Advanced Vector Extensions (AVX) and the OpenVINO toolkit to significantly speed up inference, achieving up to 2x performance gains over standard PyTorch implementations.
The optimization focuses on reducing latency and improving throughput for SetFit models, which are commonly used for few-shot text classification tasks. By integrating with the Optimum Intel library, developers can seamlessly apply Intel's hardware-aware optimizations without modifying their existing code.
Key benefits include:
- Up to 50% reduction in inference latency on Xeon Platinum 8380 processors.
- Support for dynamic quantization and operator fusion.
- Compatibility with Hugging Face's Transformers and Datasets libraries.
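The idea behind dynamic quantization is that the quantization parameters are derived at runtime from each tensor's observed range, rather than calibrated offline. A minimal pure-Python sketch of the per-tensor affine int8 scheme (illustrative only; Optimum Intel's actual kernels run in optimized native code):

```python
# Per-tensor affine int8 quantization: scale and zero point are computed
# dynamically from the observed min/max of the values. Illustrative sketch,
# not Optimum Intel's implementation.

def quantize(values, num_bits=8):
    """Map floats to signed int codes plus (scale, zero_point)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid zero scale for constants
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]


acts = [-1.5, -0.2, 0.0, 0.7, 2.3]
codes, s, z = quantize(acts)
recovered = dequantize(codes, s, z)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

Storing activations and weights as int8 shrinks memory traffic by 4x versus float32, which is where much of the Xeon speedup comes from.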
This development is particularly relevant for enterprise applications requiring low-latency responses, such as real-time content moderation, customer support routing, and document classification. The optimized SetFit models can be deployed behind standard serving stacks, for example a FastAPI web service or a dedicated inference server.
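Latency figures like the ones above are usually reported as percentiles over repeated timed runs. A small stdlib-only harness for collecting them; the `predict` function is a placeholder standing in for a real SetFit pipeline:

```python
# Generic latency-percentile harness, stdlib only. `predict` is a hypothetical
# stand-in for model.predict(texts); the timing logic is the point.
import time
import statistics


def measure_latency_ms(fn, batch, warmup=5, runs=50):
    """Return (p50, p95) wall-clock latency in milliseconds for fn(batch)."""
    for _ in range(warmup):            # warm caches before timing
        fn(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return p50, p95


def predict(texts):                    # placeholder for a SetFit pipeline
    return ["label"] * len(texts)


p50, p95 = measure_latency_ms(predict, ["example text"] * 8)
```

Running the same harness against a baseline PyTorch pipeline and an Optimum Intel one gives a like-for-like latency comparison.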
The open-source Optimum Intel library is available on GitHub, with documentation and examples for reproducing the benchmarks.