2503.00096
Cited By
v1
v2 (latest)
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
28 February 2025
Ludovico Mitchener
Jon M. Laurent
Benjamin Tenmann
Siddharth Narayanan
Geemi P Wellawatte
A. White
Lorenzo Sani
Samuel G. Rodriques
LLMAG
LM&MA
ELM
Papers citing
"BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology"
22 / 22 papers shown
Title
Robin: A multi-agent system for automating scientific discovery
Ali E. Ghareeb
Benjamin Chang
Ludovico Mitchener
Angela Yiu
Caralyn J. Szostkiewicz
Jon M. Laurent
Muhammed T. Razzak
A. White
Michaela M. Hinks
Samuel G. Rodriques
45
3
0
19 May 2025
Benchmarking AI scientists in omics data-driven biological research
Erpai Luo
Jinmeng Jia
Yifan Xiong
Xiangyu Li
Xiaobo Guo
Baoqi Yu
Lei Wei
Xuegong Zhang
ELM
64
0
0
13 May 2025
LLMs Outperform Experts on Challenging Biology Benchmarks
Lennart Justen
ELM
60
1
0
09 May 2025
Teaching Large Language Models to Reason through Learning and Forgetting
Tianwei Ni
Allen Nie
Sapana Chaudhary
Yao Liu
Huzefa Rangwala
Rasool Fakoor
ReLM
CLL
LRM
457
0
0
15 Apr 2025
Agent Laboratory: Using LLM Agents as Research Assistants
Samuel Schmidgall
Yusheng Su
Zihan Wang
Xingwu Sun
Jialian Wu
Xiaodong Yu
Jiang Liu
Michael Moor
Zicheng Liu
Emad Barsoum
LLMAG
80
60
2
08 Jan 2025
Aviary: training language agents on challenging scientific tasks
Siddharth Narayanan
James D. Braza
Ryan-Rhys Griffiths
Manu Ponnapati
Albert Bou
...
Ori Kabeli
Geemi P Wellawatte
Sam Cox
Samuel G. Rodriques
A. White
48
14
0
31 Dec 2024
GPT-4o System Card
OpenAI
Aaron Hurst
Adam Lerer
Adam P. Goucher
...
Yuchen He
Yuchen Zhang
Yujia Jin
Yunxing Dai
Yury Malkov
MLLM
202
1,019
0
25 Oct 2024
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang
Jianwen Luo
Yan Yu
Yitong Zhang
Fangyu Lei
...
Shizhu He
Lifu Huang
Xiao Liu
Jun Zhao
Kang Liu
ELM
ALM
AI4CE
61
8
0
09 Oct 2024
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
Zachary S. Siegel
Sayash Kapoor
Nitya Nagdir
Benedikt Stroebl
Arvind Narayanan
71
14
0
17 Sep 2024
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Ben Bogin
Kejuan Yang
Shashank Gupta
Kyle Richardson
Erin Bransom
Peter Clark
Ashish Sabharwal
Tushar Khot
ELM
LRM
75
18
0
11 Sep 2024
Language agents achieve superhuman synthesis of scientific knowledge
Michael D. Skarlinski
Sam Cox
Jon M. Laurent
James D. Braza
Michaela M. Hinks
M. Hammerling
Manvitha Ponnapati
Samuel G. Rodriques
Andrew D. White
ELM
HILM
ALM
94
40
0
10 Sep 2024
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Jon M. Laurent
Joseph D. Janizek
Michael Ruzo
Michaela M. Hinks
M. Hammerling
Siddharth Narayanan
Manvitha Ponnapati
Andrew D. White
Samuel G. Rodriques
ELM
56
51
0
14 Jul 2024
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Bodhisattwa Prasad Majumder
Harshit Surana
Dhruv Agarwal
Bhavana Dalvi Mishra
Abhijeetsingh Meena
Aryan Prakhar
Tirth Vora
Tushar Khot
Ashish Sabharwal
Peter Clark
ELM
69
17
0
01 Jul 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
157
192
0
22 Jun 2024
An Evaluation of Large Language Models in Bioinformatics Research
Hengchuang Yin
Zhonghui Gu
Fanhao Wang
Yiparemu Abuduhaibaier
Yanqiao Zhu
Xinming Tu
Xian-Sheng Hua
Xiao Luo
Yizhou Sun
LM&MA
67
8
0
21 Feb 2024
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Xiangru Tang
Yuliang Liu
Zefan Cai
Yan Shao
Junjie Lu
...
Yujia Qin
Wangchunshu Zhou
Yilun Zhao
Arman Cohan
Mark B. Gerstein
ELM
LLMAG
100
29
0
16 Nov 2023
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
103
627
0
10 Oct 2023
BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
Xiangru Tang
Bill Qian
Rick Gao
Jiakang Chen
Xinyun Chen
Mark B. Gerstein
68
16
0
31 Aug 2023
Large Language Models
Michael R Douglas
LLMAG
LM&MA
138
642
0
11 Jul 2023
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
233
5,635
0
07 Jul 2021
Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development
Kexin Huang
Tianfan Fu
Wenhao Gao
Yue Zhao
Yusuf Roohani
J. Leskovec
Connor W. Coley
Cao Xiao
Jimeng Sun
Marinka Zitnik
OOD
LM&MA
75
283
0
18 Feb 2021
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
182
4,526
0
07 Sep 2020