This community introduces new metrics, methodologies, and frameworks for evaluating language models.
| Title | Venue | Authors |
|---|---|---|
| Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams | Communications Chemistry (Commun. Chem.), 2025 | Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu |
| On Assessing the Relevance of Code Reviews Authored by Generative Models | | Robert Heumüller, Frank Ortmeier |
| FrontierCS: Evolving Challenges for Evolving Intelligence | | Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, …, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung |
| Evaluating Large Language Models in Scientific Discovery | | Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, …, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan |
| Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers | | Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, …, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks |
| Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation | | Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa |
| MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers | | Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang |
| Criminal Liability in AI-Enabled Autonomous Vehicles: A Comparative Study | Social Science Research Network (SSRN), 2025 | Sahibpreet Singh, Manjit Singh |
| Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms | | Yang Cao, Yubin Chen, Xuyang Guo, Zhao Song, Song Yue, Jiahao Zhang, Jiale Zhao |
| VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models | | Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu |
| Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction | | Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, …, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He |
| Large-Language Memorization During the Classification of United States Supreme Court Cases | | John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski |
| MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph | | Linjie Mu, Yannian Gu, Zhongzhen Huang, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang |
| LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases | | Yida Cai, Ranjuexiao Hu, Huiyuan Xie, Chenyang Li, Yun Liu, Yuxiao Ye, Zhenghao Liu, Weixing Shen, Zhiyuan Liu |
| Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics | | Abhay Srivastava, Sam Jung, Spencer Mateega |
| LegalRikai: Open Benchmark - A Benchmark for Complex Japanese Corporate Legal Tasks | | Shogo Fujita, Yuji Naraki, Yiqing Zhu, Shinsuke Mori |
| Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases | | Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, …, Erwin Gao, Wenlin Chen, Yilun Du, Minlan Yu, Ying Zhang |
| LLM-Assisted AHP for Explainable Cyber Range Evaluation | | Vyron Kampourakis, Georgios Kavallieratos, Georgios Spathoulas, Vasileios Gkioulos, Sokratis Katsikas |
| Challenges of Evaluating LLM Safety for User Welfare | | Manon Kempermann, Sai Suresh Macharla Vasu, Mahalakshmi Raveenthiran, Theo Farrell, Ingmar Weber |
| CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment | | Yakun Zhu, Zhongzhen Huang, Qianhan Feng, Linjie Mu, Yannian Gu, Shaoting Zhang, Qi Dou, Xiaofan Zhang |
| SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation | | Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, …, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer, Petr Motlicek |
| USCSA: Evolution-Aware Security Analysis for Proxy-Based Upgradeable Smart Contracts | | Xiaoqi Li, Lei Xie, Wenkai Li, Zongwei Li |
| Reasoning Models Ace the CFA Exams | | Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li, Kairong Xiao, Xiao-Yang Liu |
| Large Language Models for Education and Research: An Empirical and User Survey-based Analysis | | Md Mostafizer Rahman, Ariful Islam Shiplu, Md Faizul Ibne Amin, Yutaka Watanobe, Lu Peng |
| Do Large Language Models Truly Understand Cross-cultural Differences? | | Shiwei Guo, Sihang Jiang, Qianxi He, Yanghua Xiao, Jiaqing Liang, Bi Yude, Minggui He, Shimin Tao, Li Zhang |
| Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden's J statistic | | Stephane Collot, Colin Fraser, Justin Zhao, William F. Shen, Timon Willi, Ilias Leontiadis |
| TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation | | Bhavana Akkiraju, Srihari Bandarupalli, Swathi Sambangi, Vasavi Ravuri, R Vijaya Saraswathi, Anil Kumar Vuppala |
| Becoming Experienced Judges: Selective Test-Time Learning for Evaluators | | Seungyeon Jwa, Daechul Ahn, Reokyoung Kim, Dongyeop Kang, Jonghyun Choi |
| OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation | | Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, …, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu |
| SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code | | Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi |
| MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation | | Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung |
| Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models | | Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu, Alex Yang, Varun Notibala, Nishchaie Khanna |
| TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations | | Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, …, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li |
| Generalization Beyond Benchmarks: Evaluating Learnable Protein-Ligand Scoring Functions on Unseen Targets | | Jakub Kopko, David Graber, Saltuk Mustafa Eyrilmez, Stanislav Mazurenko, David Bednar, Jiri Sedlar, Josef Sivic |
| The AI Consumer Index (ACE) | | Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen |
| DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows | | Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng, Xing Chen, Yuanfeng Song, Wentao Zhang |
| Executable Governance for AI: Translating Policies into Rules Using LLMs | | Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan, Saima Rafi |
| LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models | | Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, …, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval |
| Challenging the Abilities of Large Language Models in Italian: a Community Initiative | | Malvina Nissim, Danilo Croce, Viviana Patti, Pierpaolo Basile, Giuseppe Attanasio, …, Andrea Zaninello, Asya Zanollo, Fabio Massimo Zanzotto, Kamyar Zeinalipour, Andrea Zugarini |
| LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence | | Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria |
| Evaluating Hydro-Science and Engineering Knowledge of Large Language Models | | Shiruo Hu, Wenbo Shan, Yingjia Li, Zhiqi Wan, Xinpeng Yu, …, Chee Hui Lai, Wei Luo, Yubin He, Bin Xu, Jianshi Zhao |
| Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage | | Nolan Platt, Ethan Luchs, Sehrish Nizamani |
| Understanding LLM Reasoning for Abstractive Summarization | | Haohan Yuan, Haopeng Zhang |