
Hugging Face Integrates PyTorch/XLA for Faster and Cheaper Training on Cloud TPUs

AI
April 26, 2026 · 5:52 PM

The PyTorch-TPU project, a collaboration between the Facebook PyTorch and Google TPU teams, has brought first-class support for training on Cloud TPUs using PyTorch/XLA. This integration lets PyTorch users run and scale their models on Cloud TPUs while keeping the familiar Hugging Face Trainer interface.

Key Features of the Integration

XLA:TPU Device Type

PyTorch/XLA adds a new xla device type that works just like other PyTorch device types. Users can create tensors on XLA devices with minimal code changes:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.randn(2, 2, device=xm.xla_device())
print(t.device)
print(t)

The Hugging Face Trainer automatically detects an XLA:TPU device and sets up the training environment accordingly.
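As a rough sketch of what that looks like from the user's side (the model name, training arguments, and train_dataset below are illustrative placeholders, not from the article), the setup is the same code one would write for CPU or GPU; the Trainer picks up the XLA:TPU device on its own:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
args = TrainingArguments(output_dir="output", per_device_train_batch_size=8, num_train_epochs=1)

# train_dataset is assumed to be prepared elsewhere; the Trainer places the
# model and batches on the XLA:TPU device automatically when one is available
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

On an 8-core device this script is typically launched once per core, for example with the xla_spawn.py launcher that ships with the transformers examples.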

XLA Device Step Computation

Since a Cloud TPU device consists of 8 cores, gradients must be consolidated across replicas before each update. The integration uses xm.optimizer_step(optimizer) to handle this, ensuring seamless multi-core training.
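For readers writing their own loop rather than using the Trainer, a minimal training step might look like the sketch below (model, loss_fn, optimizer, and train_loader are assumed to exist, with the model already moved to the XLA device; this is illustrative, not the Trainer's internal code):

import torch_xla.core.xla_model as xm

for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    # consolidate (all-reduce) gradients across the TPU cores, then step
    xm.optimizer_step(optimizer)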

Input Pipeline Optimization

To keep the TPU cores from sitting idle while the host CPU prepares the next batch, PyTorch/XLA provides an MpDeviceLoader that pipelines data loading and graph execution:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
dataloader = pl.MpDeviceLoader(dataloader, device)
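Iterating over the wrapped loader then yields batches that are already placed on the XLA device, with the next batch staged in the background while the current step executes, so the TPU cores are not left waiting on the host.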

Checkpoint Handling

Checkpointing is handled via xm.save(), which moves tensors to the CPU before saving so that checkpoints load correctly regardless of the device they are restored on. The save_pretrained method of PreTrainedModel and the Trainer's train method have been updated to use this API.
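A small checkpointing sketch under these assumptions (the model variable and file name are illustrative, not from the article):

import torch
import torch_xla.core.xla_model as xm

# xm.save() moves XLA tensors to the CPU before writing and, by default,
# writes from a single process so the 8 cores do not clobber the same file
xm.save(model.state_dict(), "checkpoint.pt")

# the saved checkpoint contains ordinary CPU tensors, so it can be restored
# on any device with the usual torch.load / load_state_dict pattern
model.load_state_dict(torch.load("checkpoint.pt"))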

How PyTorch/XLA Works

PyTorch/XLA uses lazy tensor execution: operations are recorded in a graph until results are needed, allowing XLA to optimize and fuse operations. This remains transparent to the user, as synchronization happens automatically when copying data between devices or taking optimizer steps.
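A brief illustration of the lazy behavior (a sketch, not taken from the article): the arithmetic below is only recorded, and XLA compiles and runs a fused graph when the value is actually materialized, for example when it is copied back to the CPU:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.ones(2, 2, device=device)
b = a + 1            # recorded in the lazy graph; nothing runs on the TPU yet
c = (b * 3).sum()    # still just building the graph
print(c.cpu())       # materializing the result triggers compilation and execution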

Performance

The integration delivers faster and cheaper training for Transformer models on Cloud TPUs, making it an attractive option for scaling deep learning workloads.