Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

Abstract: While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. It can be difficult to accurately assess code generation performance, and there has been surprisingly little work on evaluating code generation in a way that is both flexible and rigorous. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases.
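
As a rough illustration of this style of evaluation (a minimal sketch, not the APPS harness itself; the file name candidate.py, the helper passes_test_cases, and the test case are hypothetical), a generated program can be checked against input/output pairs along these lines:

import subprocess

def passes_test_cases(solution_path, test_cases, timeout=4):
    # Run the candidate program on each (stdin, expected stdout) pair.
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # a timeout counts as a failed test
        if result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical usage: one test case feeding "1 2" on stdin and expecting "3".
print(passes_test_cases("candidate.py", [("1 2\n", "3\n")]))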