In a recent keynote, Anirban Roy, Principal Engineer at Amazon Web Services, challenged conventional metrics for evaluating large language model training efficiency. Instead of focusing solely on throughput—the raw number of tokens processed per second—Roy introduced the concept of 'goodput,' which measures real progress toward model convergence.
Roy breaks end-to-end training efficiency down into three contributing factors:
- Infrastructure availability (downtime, failures)
- Framework overhead (checkpointing, recovery)
- Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs spent on useful model computation
Combining these factors yields an end-to-end efficiency metric that translates directly into actionable engineering priorities. The talk emphasizes that raw throughput can be misleading; a system may process many tokens per second yet waste substantial wall-clock time on recovery and idle cycles. Goodput reframes the discussion around what truly matters: how much usable training progress is achieved per unit of time.
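The summary does not spell out exactly how the factors compose; a minimal sketch, assuming they combine multiplicatively (a common convention for efficiency decompositions) and using hypothetical names and numbers, might look like this:

```python
def training_goodput(availability: float,
                     framework_efficiency: float,
                     mfu: float,
                     peak_tokens_per_sec: float) -> float:
    """Hypothetical end-to-end goodput estimate (tokens of real progress/sec).

    availability:         fraction of wall-clock time the cluster is usable
    framework_efficiency: fraction of up-time spent on useful training steps,
                          i.e. not lost to checkpointing, restarts, or
                          recomputing work discarded after a failure
    mfu:                  Model FLOPs Utilization achieved during those steps
    peak_tokens_per_sec:  throughput at 100% hardware utilization
    """
    return availability * framework_efficiency * mfu * peak_tokens_per_sec


# Illustrative numbers only: 95% availability, 90% framework efficiency,
# and 40% MFU leave only ~34% of peak capacity as real training progress,
# even though instantaneous throughput may look close to peak.
print(training_goodput(0.95, 0.90, 0.40, 1_000_000))  # ~342,000 tokens/s
```

Decomposed this way, each factor maps to a distinct engineering lever: availability to fault-tolerant infrastructure, framework efficiency to faster checkpointing and recovery, and MFU to kernel and parallelism optimization.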
This perspective is especially critical as organizations scale LLM training across thousands of accelerators, where even small inefficiencies compound into massive resource waste.
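To make the compounding concrete, a back-of-the-envelope sketch with assumed, unmeasured numbers:

```python
# Hypothetical illustration of how small inefficiencies compound at scale.
# All numbers are assumptions for the sake of arithmetic, not measured values.
cluster_size = 10_000   # accelerators

availability = 0.99     # 1% lost to failures and downtime
framework = 0.97        # 3% lost to checkpointing and recovery
relative_mfu = 0.98     # 2% MFU regression versus an achievable baseline

efficiency = availability * framework * relative_mfu   # ~0.941
wasted = cluster_size * (1 - efficiency)               # ~589 accelerators

print(f"end-to-end efficiency: {efficiency:.3f}")
print(f"accelerator-equivalents producing nothing: {wasted:.0f}")
```

Three seemingly small losses multiply into roughly 6% of the cluster, the full-time output of hundreds of accelerators, which is exactly the waste a goodput metric surfaces and a throughput metric hides.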