2305.18201
Cited By
A Critical Evaluation of Evaluations for Long-form Question Answering
29 May 2023
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
ELM
Papers citing
"A Critical Evaluation of Evaluations for Long-form Question Answering"
38 / 38 papers shown
Title
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
Yunjia Xi
Jianghao Lin
Menghui Zhu
Yongzhao Xiao
Zhuoying Ou
...
Weiwen Liu
Yasheng Wang
Ruiming Tang
Weinan Zhang
Yong Yu
96
1
0
21 May 2025
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Saad Obaid ul Islam
Anne Lauscher
Goran Glavaš
HILM
LRM
160
2
0
21 Feb 2025
Prompt-based Depth Pruning of Large Language Models
Juyun Wee
Minjae Park
Jaeho Lee
VLM
162
0
0
04 Feb 2025
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
Kyungjae Lee
Y. Jang
Ji Yong Cho
Gangwoo Kim
Minseok Cho
Moontae Lee
258
1
0
28 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAG
ALM
141
102
0
03 Jan 2025
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
143
32
0
23 Aug 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
95
12
0
02 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
122
15
0
26 May 2024
ir_explain: a Python Library of Explainable IR Methods
Siyang Song
Harsh Agarwal
Venktesh V
Avishek Anand
Swastik Mohanty
Debapriyo Majumdar
Mandar Mitra
XAI
110
1
0
29 Apr 2024
AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Minbeom Kim
Hwanhee Lee
Joonsuk Park
Hwaran Lee
Kyomin Jung
94
3
0
18 Apr 2024
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
95
102
0
19 Dec 2022
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang
L. F. D’Haro
Qiquan Zhang
Thomas Friedrichs
Haizhou Li
63
16
0
25 Oct 2022
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Peng Liu
Chenguang Zhu
Heng Ji
Jiawei Han
ELM
85
276
0
13 Oct 2022
SNaC: Coherence Error Detection for Narrative Summarization
Tanya Goyal
Junyi Jessy Li
Greg Durrett
89
28
0
19 May 2022
Modeling Exemplification in Long-form Question Answering via Retrieval
Shufan Wang
Fangyuan Xu
Laure Thompson
Eunsol Choi
Mohit Iyyer
58
11
0
19 May 2022
Improving Passage Retrieval with Zero-Shot Question Generation
Devendra Singh Sachan
M. Lewis
Mandar Joshi
Armen Aghajanyan
Wen-tau Yih
J. Pineau
Luke Zettlemoyer
OOD
LRM
103
165
0
15 Apr 2022
Read before Generate! Faithful Long Form Question Answering with Machine Reading
Dan Su
Xiaoguang Li
Jindi Zhang
Lifeng Shang
Xin Jiang
Qun Liu
Pascale Fung
HILM
65
61
0
01 Mar 2022
COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics
Lianhui Qin
Sean Welleck
Daniel Khashabi
Yejin Choi
AI4CE
97
151
0
23 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
ELM
AI4CE
138
193
0
14 Feb 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Lavinia Dunagan
Jacob Morrison
Alexander R. Fabbri
Yejin Choi
Noah A. Smith
82
40
0
08 Dec 2021
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
Philippe Laban
Tobias Schnabel
Paul N. Bennett
Marti A. Hearst
HILM
107
396
0
18 Nov 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
348
1,708
0
15 Oct 2021
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
119
849
0
22 Jun 2021
BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation
Yu Jiang
Tianyu Liu
Shuming Ma
Dongdong Zhang
Jian Yang
Haoyang Huang
Rico Sennrich
Ryan Cotterell
Mrinmaya Sachan
M. Zhou
47
60
0
22 Mar 2021
Evaluating Factuality in Generation with Dependency-level Entailment
Tanya Goyal
Greg Durrett
112
151
0
12 Oct 2020
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
Anthony Chen
Gabriel Stanovsky
Sameer Singh
Matt Gardner
76
51
0
07 Oct 2020
Evaluation of Text Generation: A Survey
Asli Celikyilmaz
Elizabeth Clark
Jianfeng Gao
ELM
LM&MA
115
387
0
26 Jun 2020
Adversarial NLI for Factual Correctness in Text Summarisation Models
Mario Barrantes
Benedikt Herudek
Richard Wang
46
17
0
24 May 2020
On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez
Shashi Narayan
Bernd Bohnet
Ryan T. McDonald
HILM
84
1,039
0
02 May 2020
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALM
VLM
179
4,092
0
10 Apr 2020
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi
Yuhao Zhang
Yuhui Zhang
Jason Bolton
Christopher D. Manning
AI4TS
253
1,695
0
16 Mar 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
MoE
329
602
0
12 Mar 2020
REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu
Kenton Lee
Zora Tung
Panupong Pasupat
Ming-Wei Chang
RALM
142
2,116
0
10 Feb 2020
Evaluating the Factual Consistency of Abstractive Text Summarization
Wojciech Kryściński
Bryan McCann
Caiming Xiong
R. Socher
HILM
115
746
0
28 Oct 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
462
20,317
0
23 Oct 2019
ELI5: Long Form Question Answering
Angela Fan
Yacine Jernite
Ethan Perez
David Grangier
Jason Weston
Michael Auli
AI4MH
ELM
103
624
0
22 Jul 2019
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
352
5,860
0
21 Apr 2019
Texygen: A Benchmarking Platform for Text Generation Models
Yaoming Zhu
Sidi Lu
Lei Zheng
Jiaxian Guo
Weinan Zhang
Jun Wang
Yong Yu
107
692
0
06 Feb 2018