BigCodeArena is a new evaluation platform that tests code generation models by executing their output end-to-end. Unlike traditional benchmarks that rely on static analysis or a handful of simple test cases, BigCodeArena runs the generated code in a sandboxed environment and checks it for correctness, efficiency, and adherence to the specification.
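To make the execution model concrete, here is a minimal sketch of how a harness might run untrusted generated code with basic resource limits. The function name `run_generated_code` and the specific limits are illustrative assumptions, not BigCodeArena's actual implementation; a production sandbox would add stronger isolation, such as containers and network restrictions.

```python
import resource
import subprocess
import sys
import tempfile

def run_generated_code(code: str, stdin_data: str = "",
                       timeout_s: int = 5) -> subprocess.CompletedProcess:
    """Execute model-generated Python in a child process with basic limits.

    Illustrative sketch only: it caps CPU time and memory (Unix-only via
    the `resource` module) and relies on a wall-clock timeout. A real
    sandbox would also isolate the filesystem and network.
    """
    def limit_resources():
        # Applied in the child process just before exec.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB

    # Write the generated code to a temp file (cleanup omitted for brevity).
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    return subprocess.run(
        [sys.executable, path],
        input=stdin_data,
        capture_output=True,
        text=True,
        timeout=timeout_s,           # wall-clock kill switch
        preexec_fn=limit_resources,  # CPU/memory caps in the child
    )
```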
The platform covers multiple programming languages and problem domains, from algorithms to real-world API usage. Each submission is executed against a suite of hidden test cases, which makes the evaluation thorough and harder to overfit; a sketch of this grading loop follows.
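As an illustration of hidden-case grading, the sketch below scores a submission by its pass rate over hidden cases, reusing `run_generated_code` from the previous example. The `TestCase` structure and the exact-match comparison of standard output are assumptions made for this example, not the platform's documented grading policy.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TestCase:
    stdin_data: str
    expected_stdout: str

def grade_submission(code: str, hidden_cases: list[TestCase]) -> float:
    """Run a submission against hidden test cases; return pass rate in [0, 1].

    Exact-match stdout comparison is an illustrative policy; real harnesses
    often use per-problem checkers for problems with multiple valid outputs.
    """
    passed = 0
    for case in hidden_cases:
        try:
            result = run_generated_code(code, stdin_data=case.stdin_data)
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed case
        if (result.returncode == 0
                and result.stdout.strip() == case.expected_stdout.strip()):
            passed += 1
    return passed / len(hidden_cases) if hidden_cases else 0.0
```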
Early results show that even top-performing models struggle with complex, multi-step tasks. The creators hope BigCodeArena will drive progress in code generation by providing a more realistic and rigorous evaluation methodology.