Neighbor communities
Top Contributors
| Name | # Papers | # Citations |
|---|---|---|
Social Events
| Date | Location | Event |
|---|---|---|
The community introduces new metrics, methodologies, or frameworks for evaluating language models.
| Title | Authors |
|---|---|
| OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education | Min Zhang Hao Chen Hao Chen Wenqi Zhang Didi Zhu Xin Lin Bo Jiang Aimin Zhou Fei Wu Kun Kuang |
| Gistify! Codebase-Level Understanding via Runtime Execution | Hyunji Lee Minseon Kim Chinmay Singh Matheus Pereira Atharv Sonwane ...Zhengyan Shi Alessandro Sordoni Marc-Alexandre Côté Xingdi Yuan Lucas Caccia |
| SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level | Hitomi Jin Ling Tee Chaoren Wang Zijie Zhang Zhizheng Wu |
| EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge | Jack FitzGerald Aristotelis Lazaridis Dylan Bates Aman Sharma Jonnathan Castillo ...Dave Anderson Jonathan Beck Jamie Cuticello Colton Malkerson Tyler Saltsman |
| QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback | Taku Mikuriya Tatsuya Ishigaki Masayuki Kawarada Shunya Minami Tadashi Kadowaki ...Shunya Takata Takumi Kato Tamotsu Basseda Reo Yamada Hiroya Takamura |
| ReaKase-8B: Legal Case Retrieval via Knowledge and Reasoning Representations with LLMs | Yanran Tang Ruihong Qiu Xue Li Zi Huang |
| Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models | J. de Curtò I. de Zarzà Pablo García Jordi Cabot |
| Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis | Dong Huang Mingzhe Du Jie M. Zhang Zheng Lin Meng Luo Qianru Zhang See-Kiong Ng |
| Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation | Musfiqur Rahman SayedHassan Khatoonabadi Emad Shihab |
| AMO-Bench: Large Language Models Still Struggle in High School Math Competitions | Shengnan An Xunliang Cai Xuezhi Cao Xiaoyu Li Yehao Lin ...Xinxuan Lv Dan Ma Xuanlin Wang Ziwen Wang Shuang Zhou |
| Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments | Abhishek Purushothama Junghyun Min Brandon Waldon Nathan Schneider |
| Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning | Nissan Yaron Dan Bystritsky Ben-Etzion Yaron |
| SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications | Emily Herron Junqi Yin Feiyi Wang |
| Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction | Ritesh Sunil Chavan Jack Mostow |
| BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains | Vijay Devane Mohd Nauman Bhargav Patel Aniket Mahendra Wakchoure Yogeshkumar Sant ...Ajay Nagpal Piyush Sawarkar Kundeshwar Vijayrao Pundalik Rohit Saluja Ganesh Ramakrishnan |
| Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish | Lujun Li Yewei Song Lama Sleem Yiqun Wang Yangjie Xu Cedric Lothritz Niccolo Gentile Radu State Tegawende F. Bissyande Jacques Klein |
| S3C2 Summit 2025-03: Industry Secure Supply Chain Summit | Elizabeth Lin Jonah Ghebremichael William Enck Yasemin Acar Michel Cukier Alexandros Kapravelos Christian Kastner Laurie Williams |
| RiddleBench: A New Generative Reasoning Benchmark for LLMs | Deepon Halder Alan Saji Thanmay Jayakumar Ratish Puduppully Anoop Kunchukuttan Raj Dabre |
| PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination | Hyunseung Lim Sooyohn Nam Sungmin Na Ji Yong Cho June Yong Yang Hyungyu Shin Yoonjoo Lee Juho Kim Moontae Lee Hwajung Hong |