BigCodeBench: A New Benchmark for Evaluating Code Generation Models
AI
April 26, 2026 · 4:30 PM
BigCodeBench, a new benchmark for evaluating code generation models, has been released. It aims to address the limitations of HumanEval, whose short, self-contained problems are now largely saturated by current models, by posing tasks that require composing calls to diverse libraries under more complex instructions. The benchmark comprises 1,140 coding problems spanning multiple application domains and difficulty levels, each judged on functional correctness by executing the generated code against test cases. Early results show that even state-of-the-art models struggle on many of these tasks, underscoring how much room remains for improvement in code generation.
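For readers unfamiliar with execution-based benchmarks, the sketch below illustrates the general idea behind functional-correctness scoring: a model's completion is executed against unit tests, and a task counts as solved only if every test passes. The toy task, the completion string, and helper names such as `run_task` are illustrative stand-ins, not BigCodeBench's actual harness or data format.

```python
# Minimal sketch of execution-based functional-correctness scoring,
# the style of evaluation BigCodeBench uses. The task below is a toy
# example; `run_task` and the data layout are illustrative only.
import unittest

# A task pairs a natural-language prompt with unit tests the generated
# code must satisfy.
TASK_PROMPT = "Write a function `mean(xs)` returning the arithmetic mean of a non-empty list."

# Pretend this string came back from a code generation model.
MODEL_COMPLETION = """
def mean(xs):
    return sum(xs) / len(xs)
"""

TEST_CODE = """
class TestMean(unittest.TestCase):
    def test_basic(self):
        self.assertAlmostEqual(mean([1, 2, 3]), 2.0)

    def test_floats(self):
        self.assertAlmostEqual(mean([0.5, 1.5]), 1.0)
"""

def run_task(completion: str, test_code: str) -> bool:
    """Execute the completion plus its tests; return True only if all tests pass."""
    namespace = {"unittest": unittest}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # define the test class in the same namespace
        suite = unittest.defaultTestLoader.loadTestsFromTestCase(namespace["TestMean"])
        result = unittest.TestResult()
        suite.run(result)
        return result.wasSuccessful()
    except Exception:
        # Any crash (syntax error, runtime error) counts as a failure.
        return False

if __name__ == "__main__":
    solved = run_task(MODEL_COMPLETION, TEST_CODE)
    print(f"Task solved: {solved}")
```

In the real benchmark, each of the 1,140 tasks ships with its own test suite, generated code is run in a sandboxed environment, and per-task results are aggregated into metrics such as pass@1.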