BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Elaine Lau
Markus Dücker
Ronak Chaudhary
Hui Wen Goh
Rosemary Wei
Vaibhav Kumar
Saed Qunbar
Guram Gogia
Yi Liu
Scott Millslagle
Nasim Borazjanizadeh
Ulyana Tkachenko
Samuel Eshun Danquah
Collin Schweiker
Vijay Karumathil
Asrith Devalaraju
Varsha Sandadi
Haemi Nam
Punit Arani
Ray Epps
Abdullah Arif
Sahil Bhaiwala
Curtis Northcutt
Skyler Wang
Anish Athalye
Jonas Mueller
Francisco Guzmán
Main: 19 pages · Appendix: 27 pages · Bibliography: 5 pages · 18 figures · 20 tables
Abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (a market data platform, an SEC filings database), and generating multi-file deliverables, including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a single BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria, and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
