ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Evaluating Language Models

ELM

The community introduces new metrics, methodologies, or frameworks for evaluating language models.

All papers

50 / 4,745 papers shown
Title
FrontierCS: Evolving Challenges for Evolving Intelligence
Qiuyang Mang
Wenhao Chai
Zhifei Li
Huanzhi Mao
Shang Zhou
...
Sewon Min
Ion Stoica
Joseph E. Gonzalez
Jingbo Shang
Alvin Cheung
ELM, LRM
10
0
0
17 Dec 2025
Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
Yiming Cui
Xin Yao
Yuxuan Qin
Xin Li
Shijin Wang
Guoping Hu
ELM, LRM
5
0
0
17 Dec 2025
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Adam Karvonen
James Chua
Clément Dumas
Kit Fraser-Taliente
Subhash Kantamneni
...
Euan Ong
Arnab Sen Sharma
Daniel Wen
Owain Evans
Samuel Marks
LLMSV, ELM
112
0
0
17 Dec 2025
Evaluating Large Language Models in Scientific Discovery
Zhangde Song
Jieyu Lu
Yuanqi Du
Botao Yu
Thomas M. Pruyn
...
Heather J. Kulik
Haojun Jia
Huan Sun
Seyed Mohamad Moosavi
Chenru Duan
ELM, LRM
0
0
0
17 Dec 2025
Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
Huaying Zhang
Atsushi Hashimoto
Tosho Hirasawa
ELM
0
0
0
17 Dec 2025
MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong
Zhiqi Shen
Lei Wang
Yunshi Lan
Chao Yang
ELM
0
0
0
17 Dec 2025
On Assessing the Relevance of Code Reviews Authored by Generative Models
Robert Heumüller
Frank Ortmeier
ALM, ELM
8
0
0
17 Dec 2025
Criminal Liability in AI-Enabled Autonomous Vehicles: A Comparative Study
Social Science Research Network (SSRN), 2025
Sahibpreet Singh
Manjit Singh
AILaw, ELM
108
0
0
16 Dec 2025
Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms
Yang Cao
Yubin Chen
Xuyang Guo
Zhao Song
Song Yue
Jiahao Zhang
Jiale Zhao
LRM, ELM
29
0
0
16 Dec 2025
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Nguyen Tien Dong
Minh-Anh Nguyen
Thanh Dat Hoang
Nguyen Tuan Ngoc
Dao Xuan Quang Minh
Phan Phi Hai
Nguyen Thi Ngoc Anh
Dang Van Tu
Binh Vu
AILaw, ELM
124
0
0
16 Dec 2025
Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction
Advait Gosai
Tyler Vuong
Utkarsh Tyagi
Steven Li
Wenjia You
...
Arda Uçar
Zhongwang Fang
Brian Jang
Bing Liu
Yunzhong He
AuLLM, ELM
124
0
0
16 Dec 2025
Olmo 3
Team Olmo
Allyson Ettinger
Amanda Bertsch
Bailey Kuehl
David Graham
...
Luke Zettlemoyer
Pang Wei Koh
Ali Farhadi
Noah A. Smith
Hannaneh Hajishirzi
LLMAG, OSLM, KELM, ELM, LRM
128
0
0
15 Dec 2025
CAPE: Capability Achievement via Policy Execution
David Ball
OffRL, ELM
0
0
0
15 Dec 2025
Large-Language Memorization During the Classification of United States Supreme Court Cases
John E. Ortega
Dhruv D. Joshi
Matt P. Borkowski
AILaw, ELM
28
0
0
15 Dec 2025
MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph
Linjie Mu
Yannian Gu
Zhongzhen Huang
Yakun Zhu
Shaoting Zhang
Xiaofan Zhang
LRM, ELM
17
0
0
15 Dec 2025
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases
Yida Cai
Ranjuexiao Hu
Huiyuan Xie
Chenyang Li
Yun Liu
Yuxiao Ye
Zhenghao Liu
Weixing Shen
Zhiyuan Liu
AILaw, ELM
131
0
0
14 Dec 2025
Large language models have learned to use language
Gary Lupyan
ELM, AI4CE
150
0
0
13 Dec 2025
Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics
Abhay Srivastava
Sam Jung
Spencer Mateega
AIFin, ELM
61
0
0
13 Dec 2025
LegalRikai: Open Benchmark - A Benchmark for Complex Japanese Corporate Legal Tasks
Shogo Fujita
Yuji Naraki
Yiqing Zhu
Shinsuke Mori
AILaw, ELM
108
0
0
12 Dec 2025
Challenges of Evaluating LLM Safety for User Welfare
Manon Kempermann
Sai Suresh Macharla Vasu
Mahalakshmi Raveenthiran
Theo Farrell
Ingmar Weber
ELM
57
0
0
11 Dec 2025
CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment
Yakun Zhu
Zhongzhen Huang
Qianhan Feng
Linjie Mu
Yannian Gu
Shaoting Zhang
Qi Dou
Xiaofan Zhang
LLMAG, LM&MA, ELM
185
0
0
11 Dec 2025
Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases
Zhaodong Wang
Zhenting Qi
Sherman Wong
Nathan Hu
Samuel Lin
...
Erwin Gao
Wenlin Chen
Yilun Du
Minlan Yu
Ying Zhang
LLMAG, ELM
40
0
0
11 Dec 2025
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
Eddie Landesberg
ELM, CML
84
0
0
11 Dec 2025
LLM-Assisted AHP for Explainable Cyber Range Evaluation
Vyron Kampourakis
Georgios Kavallieratos
Georgios Spathoulas
Vasileios Gkioulos
Sokratis Katsikas
ELM
88
0
0
11 Dec 2025
Reasoning Models Ace the CFA Exams
Jaisal Patel
Yunzhe Chen
Kaiwen He
Keyi Wang
David Li
Kairong Xiao
Xiao-Yang Liu
ELM, ReLM, LRM
198
0
0
09 Dec 2025
SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
Sergio Burdisso
Séverin Baroudi
Yanis Labrak
David Grunert
Pawel Cyrta
...
Srikanth Madikeri
Esaú Villatoro-Tello
Thomas Schaaf
Ricard Marxer
Petr Motlicek
ELM
96
0
0
09 Dec 2025
A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties
Jinghao Wang
Ping Zhang
Carter Yagemann
ELM
120
0
0
09 Dec 2025
USCSA: Evolution-Aware Security Analysis for Proxy-Based Upgradeable Smart Contracts
Xiaoqi Li
Lei Xie
Wenkai Li
Zongwei Li
ELM
24
0
0
09 Dec 2025
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden's J statistic
Stephane Collot
Colin Fraser
Justin Zhao
William F. Shen
Timon Willi
Ilias Leontiadis
ELM
36
0
0
08 Dec 2025
Do Large Language Models Truly Understand Cross-cultural Differences?
Shiwei Guo
Sihang Jiang
Qianxi He
Yanghua Xiao
Jiaqing Liang
Bi Yude
Minggui He
Shimin Tao
Li Zhang
ELM, LRM
108
0
0
08 Dec 2025
Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
Richard Young
AAML, ELM
76
0
0
08 Dec 2025
TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation
Bhavana Akkiraju
Srihari Bandarupalli
Swathi Sambangi
Vasavi Ravuri
R Vijaya Saraswathi
Anil Kumar Vuppala
ELM
32
0
0
08 Dec 2025
Large Language Models for Education and Research: An Empirical and User Survey-based Analysis
Md Mostafizer Rahman
Ariful Islam Shiplu
Md Faizul Ibne Amin
Yutaka Watanobe
Lu Peng
AI4Ed, ELM
128
0
0
08 Dec 2025
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa
Daechul Ahn
Reokyoung Kim
Dongyeop Kang
Jonghyun Choi
ELM
36
0
0
07 Dec 2025
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
Xiaojun Jia
Jie Liao
Qi Guo
Teng Ma
Simeng Qin
...
Dongxian Wu
Yiming Li
Wenqi Ren
Xiaochun Cao
Yang Liu
AAML, ELM
204
0
0
06 Dec 2025
Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
Qiming Bao
Xiaoxuan Fu
LRM, ELM
132
0
0
06 Dec 2025
Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models
Mahesh Kumar Nandwana
Youngwan Lim
Joseph Liu
Alex Yang
Varun Notibala
Nishchaie Khanna
KELM, ELM
188
0
0
05 Dec 2025
MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation
Zhitao He
Haolin Yang
Zeyu Qin
Yi R Fung
ELM
68
0
0
05 Dec 2025
SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Shima Imani
Seungwhan Moon
Adel Ahmadyan
Lu Zhang
Kirmani Ahmed
Babak Damavandi
AIMat, ReLM, LRM, ELM
172
0
0
05 Dec 2025
Generalization Beyond Benchmarks: Evaluating Learnable Protein-Ligand Scoring Functions on Unseen Targets
Jakub Kopko
David Graber
Saltuk Mustafa Eyrilmez
Stanislav Mazurenko
David Bednar
Jiri Sedlar
Josef Sivic
ELM
152
0
0
05 Dec 2025
TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations
Xiuyuan Chen
Jian Zhao
Yuxiang He
Yuan Xun
Xinwei Liu
...
Ziyan Shi
Yuchen Yuan
Tianle Zhang
Chi Zhang
Xuelong Li
ELM
84
0
0
05 Dec 2025
The AI Consumer Index (ACE)
Julien Benchek
Rohit Shetty
Benjamin Hunsberger
Ajay Arun
Zach Richards
Brendan Foody
Osvald Nitski
Bertie Vidgen
RALM, ELM
220
0
0
04 Dec 2025
LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence
Wenjin Liu
Haoran Luo
Xin Feng
Xiang Ji
Lijuan Zhou
Rui Mao
Jiapu Wang
Shirui Pan
Erik Cambria
AILaw, ELM
53
0
0
04 Dec 2025
LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models
Siddharth Betala
Samuel P. Gleason
Ali Ramlaoui
Andy Xu
Georgia Channing
...
Félix Therrien
Alex Hernandez-Garcia
Rocío Mercado
N. M. Anoop Krishnan
Alexandre Duval
ELM
24
0
0
04 Dec 2025
Executable Governance for AI: Translating Policies into Rules Using LLMs
Gautam Varma Datla
Anudeep Vurity
Tejaswani Dash
Tazeem Ahmad
Mohd Adnan
Saima Rafi
ELM
80
0
0
04 Dec 2025
DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
Zhou Liu
Zhaoyang Han
Guochen Yan
Hao Liang
Bohan Zeng
Xing Chen
Yuanfeng Song
Wentao Zhang
ELM
133
0
0
04 Dec 2025
Challenging the Abilities of Large Language Models in Italian: a Community Initiative
Malvina Nissim
Danilo Croce
Viviana Patti
Pierpaolo Basile
Giuseppe Attanasio
...
Andrea Zaninello
Asya Zanollo
Fabio Massimo Zanzotto
Kamyar Zeinalipour
Andrea Zugarini
ELM
32
0
0
04 Dec 2025
Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
Shiruo Hu
Wenbo Shan
Yingjia Li
Zhiqi Wan
Xinpeng Yu
...
Chee Hui Lai
Wei Luo
Yubin He
Bin Xu
Jianshi Zhao
ELM, AI4CE
78
0
0
03 Dec 2025
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem
Kaushal Kumar Maurya
Kseniia Petukhova
Ekaterina Kochmar
ELM
92
0
0
03 Dec 2025
Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
Nolan Platt
Ethan Luchs
Sehrish Nizamani
ELM
32
0
0
03 Dec 2025