THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

16 May 2025

Cibi Chakravarthy Senthilkumar

Rafael Castrillo

Papers citing "THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering"

On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag
32 3 0 | 01 Oct 2024

Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering
Zhentao Xu, Mark Jerome Cruz, Matthew Guevara, Tie Wang, Manasi Deshpande, Xiaofeng Wang, Zheng Li
RALM | 43 73 0 | 26 Apr 2024

The Power of Noise: Redefining Retrieval for RAG Systems
Florin Cuconasu, Giovanni Trappolini, F. Siciliano, Simone Filice, Cesare Campagnano, Y. Maarek, Nicola Tonellotto, Fabrizio Silvestri
RALM | 89 169 0 | 26 Jan 2024

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei A. Zaharia
RALM | 68 116 0 | 16 Nov 2023

Lost in the Middle: How Language Models Use Long Contexts
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
RALM | 84 1,570 0 | 06 Jul 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
ALM, OSLM, ELM | 312 4,253 0 | 09 Jun 2023

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min, Kalpesh Krishna, Xinxi Lyu, M. Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
HILM, ALM | 113 678 0 | 23 May 2023

BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
275 5,764 0 | 21 Apr 2019