Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee
Minseon Kim
Chinmay Singh
Matheus Pereira
Atharv Sonwane
Isadora White
Elias Stengel-Eskin
Mohit Bansal
Zhengyan Shi
Alessandro Sordoni
Marc-Alexandre Côté
Xingdi Yuan
Lucas Caccia
Main: 10 Pages
18 Figures
Bibliography: 3 Pages
11 Tables
Appendix: 12 Pages
Abstract

As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluations becomes central. We propose Gistify, a task in which a coding LLM must create a single, minimal, self-contained file that reproduces a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
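To make the setup concrete, the sketch below shows one way output equivalence could be checked: the same entrypoint command is executed once under the full codebase and once against the directory holding only the generated single file, and the two outputs are compared. This is a minimal illustration, not the paper's actual evaluation harness; the function names, paths, and the stdout-comparison criterion are assumptions.

```python
# Minimal sketch of a Gistify-style success check (illustrative only).
# Assumes success means the gistified file reproduces the stdout of the
# same entrypoint command run under the full codebase.
import subprocess


def run(cmd: str, cwd: str) -> str:
    """Run a shell command in `cwd` and return its stdout (raises on failure)."""
    result = subprocess.run(
        cmd, shell=True, cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout


def gistify_success(entrypoint: str, repo_dir: str, gist_dir: str) -> bool:
    """True if the self-contained file replicates the full codebase's output.

    entrypoint: e.g. "python main.py --config small"  (hypothetical command)
    repo_dir:   path to the full codebase
    gist_dir:   directory containing only the generated single file,
                invoked with the same command (hypothetical setup)
    """
    reference = run(entrypoint, cwd=repo_dir)   # output under the full codebase
    candidate = run(entrypoint, cwd=gist_dir)   # output from the gistified file
    return reference == candidate
```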
