Google DeepMind has unveiled a groundbreaking training architecture called Decoupled DiLoCo, designed to train large language models across multiple data centers worldwide. The method dramatically reduces communication overhead and enhances fault tolerance by decoupling synchronization from computation.
Traditional distributed training relies on frequent synchronization between GPUs, which becomes a bottleneck when spanning geographically distant data centers. Decoupled DiLoCo addresses this by letting each compute island run many local optimization steps independently and exchange updates only at infrequent synchronization points. This sharply reduces inter-site bandwidth demands and makes the system resilient to hardware failures: if one data center goes down, the others can continue training.
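The inner/outer loop behind this idea can be illustrated with a toy simulation. The sketch below is an assumption-laden simplification, not DeepMind's implementation: a few "islands" each run many cheap local gradient steps on their own data shard with no communication, then a single synchronization averages their accumulated updates (pseudo-gradients) and applies one outer momentum step to the shared parameters. All names, the least-squares problem, and the hyperparameters are illustrative.

```python
import numpy as np

# Toy DiLoCo-style training loop on a shared least-squares problem.
# Each island runs H local SGD steps independently (no communication),
# then one synchronization averages pseudo-gradients and applies an
# outer momentum update. Illustrative only, not DeepMind's code.

rng = np.random.default_rng(0)
dim, n_islands = 8, 4
A = rng.normal(size=(64, dim))           # full dataset (design matrix)
x_true = rng.normal(size=dim)
y = A @ x_true                           # consistent targets

def local_grad(params, island):
    """Gradient of 0.5 * ||A_i @ p - y_i||^2 on one island's data shard."""
    Ai, yi = A[island::n_islands], y[island::n_islands]
    return Ai.T @ (Ai @ params - yi)

global_params = np.zeros(dim)
outer_velocity = np.zeros(dim)
inner_lr, outer_lr, momentum = 0.01, 0.7, 0.6
H = 50                                   # local steps between syncs

for outer_step in range(30):
    deltas = []
    for i in range(n_islands):           # islands work independently
        p = global_params.copy()
        for _ in range(H):               # H cheap steps, zero comms
            p -= inner_lr * local_grad(p, i)
        deltas.append(global_params - p) # this island's pseudo-gradient
    # The only communication: average pseudo-gradients, one outer step.
    g = np.mean(deltas, axis=0)
    outer_velocity = momentum * outer_velocity + g
    global_params -= outer_lr * outer_velocity

print(np.linalg.norm(global_params - x_true))  # error shrinks toward 0
```

The key property the sketch shows is the communication pattern: parameters cross the (simulated) network once every H steps instead of every step, which is what makes wide-area training feasible.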
In experiments with the Gemma 4 model, Decoupled DiLoCo achieved performance comparable to standard methods while using significantly less inter-data-center bandwidth and achieving higher goodput (the share of time spent on useful training computation rather than on communication or waiting). The architecture also enables flexible use of heterogeneous hardware across different locations.
As AI models scale to trillions of parameters, Decoupled DiLoCo is seen as a key innovation for building next-generation training infrastructure, allowing organizations to leverage distributed compute resources more efficiently and reliably.