Hugging Face Transformers offers state-of-the-art models across domains, but out-of-the-box inference speed and memory usage often leave room for improvement. The Hugging Face ecosystem provides ready-to-use tools that reduce the memory footprint and speed up inference with minimal code changes.
In this tutorial, we demonstrate how to optimize Bark, a text-to-speech (TTS) model, using three simple techniques from the Transformers, Optimum, and Accelerate libraries. We also show how to benchmark the model before and after optimization.
Bark Architecture
Bark, developed by Suno AI, is a transformer-based TTS model that generates speech, music, background noise, and sound effects. It consists of four sub-models:
- BarkSemanticModel: A causal autoregressive transformer that predicts semantic text tokens from tokenized text.
- BarkCoarseModel: A causal autoregressive transformer that takes the semantic tokens as input and predicts the first two audio codebooks required by EnCodec.
- BarkFineModel: A non-causal autoencoder transformer that iteratively predicts the remaining codebooks from the previous codebook embeddings.
- EncodecModel: Decodes the predicted codebooks into the output audio waveform.
Two checkpoints are available: small (suno/bark-small) and large (suno/bark).
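After loading a checkpoint (shown in the next section), you can verify this structure by counting the parameters of each sub-model. The attribute names below (semantic, coarse_acoustics, fine_acoustics, codec_model) are taken from the current Transformers implementation of BarkModel; treat this as a sketch rather than a stable API.

# Sketch: assumes `model` is a loaded BarkModel (see the next section)
def count_params(module):
    return sum(p.numel() for p in module.parameters())

for name, sub_model in [
    ("semantic", model.semantic),
    ("coarse_acoustics", model.coarse_acoustics),
    ("fine_acoustics", model.fine_acoustics),
    ("codec_model", model.codec_model),
]:
    print(f"{name}: {count_params(sub_model) / 1e6:.1f}M parameters")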
Loading the Model
import torch
from transformers import AutoProcessor, BarkModel

# Load the small Bark checkpoint and move it to the GPU if one is available
model = BarkModel.from_pretrained("suno/bark-small")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The processor tokenizes the text prompt (and handles optional voice presets)
processor = AutoProcessor.from_pretrained("suno/bark-small")
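The benchmark below calls model.generate(**inputs), so we first prepare the inputs with the processor. The prompt text here is only illustrative; any sentence works.

text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt).to(device)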
Optimization Techniques
We apply optimizations from Optimum and Accelerate with minimal code changes.
Base Case
Before applying any optimization, we measure latency and peak GPU memory with a small benchmarking helper:
import torch
from transformers import set_seed

def measure_latency_and_memory_use(model, inputs, nb_loops=5):
    # CUDA events measure time spent on the GPU accurately
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # Reset peak-memory statistics and clear the cache before measuring
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    start_event.record()
    for _ in range(nb_loops):
        # Fix the seed so every iteration generates the same output
        set_seed(0)
        output = model.generate(**inputs, do_sample=True, fine_temperature=0.4, coarse_temperature=0.8)
    end_event.record()
    torch.cuda.synchronize()

    max_memory = torch.cuda.max_memory_allocated(device)
    elapsed_time = start_event.elapsed_time(end_event) * 1.0e-3  # milliseconds -> seconds

    print('Execution time:', elapsed_time / nb_loops, 'seconds')
    print('Max memory footprint', max_memory * 1e-9, 'GB')

    return output

with torch.inference_mode():
    speech_output = measure_latency_and_memory_use(model, inputs, nb_loops=5)
Output: Execution time: 9.38 seconds, Max memory footprint: 1.91 GB.
Audio output is available here.
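To listen to the result locally, the generated array can be written to a WAV file. The snippet below is a sketch: it assumes the sampling rate is exposed on the model's generation config, as in the Bark documentation, and that speech_output is the tensor returned by the benchmark above.

from scipy.io import wavfile

# Bark exposes its output sampling rate on the generation config
sample_rate = model.generation_config.sample_rate
audio_array = speech_output[0].cpu().numpy().squeeze()
wavfile.write("bark_output.wav", rate=sample_rate, data=audio_array)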
For a streamlined version with full code, see the Google Colab.