Hugging Face Transformers offers state-of-the-art models across domains, but out-of-the-box inference speed and memory usage often leave room for improvement. The Hugging Face ecosystem provides ready-to-use tools that reduce the memory footprint and speed up inference with minimal code changes.
In this tutorial, we demonstrate how to optimize Bark, a text-to-speech (TTS) model, using three simple techniques from the Transformers, Optimum, and Accelerate libraries. We also show how to benchmark the model before and after optimization.
Bark Architecture
Bark, developed by Suno AI, is a transformer-based TTS model that generates speech, music, background noise, and sound effects. It consists of four sub-models:
- BarkSemanticModel: A causal autoregressive transformer that predicts semantic text tokens from tokenized text.
- BarkCoarseModel: A causal autoregressive transformer that takes the semantic tokens as input and predicts the first two audio codebooks required by EnCodec.
- BarkFineModel: A non-causal autoencoder transformer that iteratively predicts the remaining codebooks from the previous codebook embeddings.
- EncodecModel: Decodes the predicted codebooks into the output audio waveform.
Two checkpoints are available: small (suno/bark-small) and large (suno/bark).
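After loading a checkpoint (shown in the next section), you can verify this structure by counting the parameters of each sub-model. The attribute names below (semantic, coarse_acoustics, fine_acoustics, codec_model) are taken from the current Transformers implementation of BarkModel; treat this as a sketch rather than a stable API.

# Sketch: assumes `model` is a loaded BarkModel (see the next section)
def count_params(module):
    return sum(p.numel() for p in module.parameters())

for name, sub_model in [
    ("semantic", model.semantic),
    ("coarse_acoustics", model.coarse_acoustics),
    ("fine_acoustics", model.fine_acoustics),
    ("codec_model", model.codec_model),
]:
    print(f"{name}: {count_params(sub_model) / 1e6:.1f}M parameters")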
Loading the Model
import torch
from transformers import AutoProcessor, BarkModel

# Load the small Bark checkpoint and move it to the GPU if one is available
model = BarkModel.from_pretrained("suno/bark-small")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The processor tokenizes the text prompt (and handles optional voice presets)
processor = AutoProcessor.from_pretrained("suno/bark-small")
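The benchmark below calls model.generate(**inputs), so we first prepare the inputs with the processor. The prompt text here is only illustrative; any sentence works.

text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt).to(device)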
Optimization Techniques
We apply optimizations from Optimum and Accelerate with minimal code changes.
Base Case
Before applying any optimization, we measure latency and peak GPU memory with a small benchmarking helper:
import torch
from transformers import set_seed

def measure_latency_and_memory_use(model, inputs, nb_loops=5):
    # CUDA events measure time spent on the GPU accurately
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # Reset peak-memory statistics and clear the cache before measuring
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    start_event.record()
    for _ in range(nb_loops):
        # Fix the seed so every iteration generates the same output
        set_seed(0)
        output = model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
    end_event.record()
    torch.cuda.synchronize()

    max_memory = torch.cuda.max_memory_allocated(device)
    elapsed_time = start_event.elapsed_time(end_event) * 1.0e-3  # milliseconds -> seconds

    print('Execution time:', elapsed_time / nb_loops, 'seconds')
    print('Max memory footprint', max_memory * 1e-9, 'GB')

    return output

with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output: Execution time: 9.38 seconds, Max memory footprint: 1.91 GB.
Audio output is available here.
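To listen to the result locally, the generated array can be written to a WAV file. The snippet below is a sketch: it assumes the sampling rate is exposed on the model's generation config, as in the Bark documentation, and that speech_output is the tensor returned by the benchmark above.

from scipy.io import wavfile

# Bark exposes its output sampling rate on the generation config
sample_rate = model.generation_config.sample_rate
audio_array = speech_output[0].cpu().numpy().squeeze()
wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)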
For a streamlined version with full code, see the Google Colab.