ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2305.18201
  4. Cited By
A Critical Evaluation of Evaluations for Long-form Question Answering

A Critical Evaluation of Evaluations for Long-form Question Answering

29 May 2023
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
    ELM
ArXiv (abs)PDFHTML

Papers citing "A Critical Evaluation of Evaluations for Long-form Question Answering"

38 / 38 papers shown
Title
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
Yunjia Xi
Jianghao Lin
Menghui Zhu
Yongzhao Xiao
Zhuoying Ou
...
Weiwen Liu
Yasheng Wang
Ruiming Tang
Weinan Zhang
Yong Yu
96
1
0
21 May 2025
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Saad Obaid ul Islam
Anne Lauscher
Goran Glavaš
HILMLRM
160
2
0
21 Feb 2025
Prompt-based Depth Pruning of Large Language Models
Prompt-based Depth Pruning of Large Language Models
Juyun Wee
Minjae Park
Jaeho Lee
VLM
162
0
0
04 Feb 2025
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
Kyungjae Lee
Y. Jang
Ji Yong Cho
Gangwoo Kim
Minseok Cho
Moontae Lee
258
1
0
28 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAGALM
141
102
0
03 Jan 2025
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
Mei Han
ALMELM
143
32
0
23 Aug 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
95
12
0
02 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
122
15
0
26 May 2024
ir_explain: a Python Library of Explainable IR Methods
ir_explain: a Python Library of Explainable IR Methods
Siyang Song
Harsh Agarwal
Venktesh V
Avishek Anand
Swastik Mohanty
Debapriyo Majumdar
Mandar Mitra
XAI
110
1
0
29 Apr 2024
AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Minbeom Kim
Hwanhee Lee
Joonsuk Park
Hwaran Lee
Kyomin Jung
94
3
0
18 Apr 2024
Evaluating Human-Language Model Interaction
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MAALM
95
102
0
19 Dec 2022
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang
L. F. D’Haro
Qiquan Zhang
Thomas Friedrichs
Haizhou Li
63
16
0
25 Oct 2022
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Peng Liu
Chenguang Zhu
Heng Ji
Jiawei Han
ELM
85
276
0
13 Oct 2022
SNaC: Coherence Error Detection for Narrative Summarization
SNaC: Coherence Error Detection for Narrative Summarization
Tanya Goyal
Junyi Jessy Li
Greg Durrett
89
28
0
19 May 2022
Modeling Exemplification in Long-form Question Answering via Retrieval
Modeling Exemplification in Long-form Question Answering via Retrieval
Shufan Wang
Fangyuan Xu
Laure Thompson
Eunsol Choi
Mohit Iyyer
58
11
0
19 May 2022
Improving Passage Retrieval with Zero-Shot Question Generation
Improving Passage Retrieval with Zero-Shot Question Generation
Devendra Singh Sachan
M. Lewis
Mandar Joshi
Armen Aghajanyan
Wen-tau Yih
J. Pineau
Luke Zettlemoyer
OODLRM
103
165
0
15 Apr 2022
Read before Generate! Faithful Long Form Question Answering with Machine
  Reading
Read before Generate! Faithful Long Form Question Answering with Machine Reading
Dan Su
Xiaoguang Li
Jindi Zhang
Lifeng Shang
Xin Jiang
Qun Liu
Pascale Fung
HILM
65
61
0
01 Mar 2022
COLD Decoding: Energy-based Constrained Text Generation with Langevin
  Dynamics
COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics
Lianhui Qin
Sean Welleck
Daniel Khashabi
Yejin Choi
AI4CE
97
151
0
23 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation
  Practices for Generated Text
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
ELMAI4CE
138
193
0
14 Feb 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Lavinia Dunagan
Jacob Morrison
Alexander R. Fabbri
Yejin Choi
Noah A. Smith
82
40
0
08 Dec 2021
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in
  Summarization
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
Philippe Laban
Tobias Schnabel
Paul N. Bennett
Marti A. Hearst
HILM
107
396
0
18 Nov 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
348
1,708
0
15 Oct 2021
BARTScore: Evaluating Generated Text as Text Generation
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
119
849
0
22 Jun 2021
BlonDe: An Automatic Evaluation Metric for Document-level Machine
  Translation
BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation
Yu Jiang
Tianyu Liu
Shuming Ma
Dongdong Zhang
Jian Yang
Haoyang Huang
Rico Sennrich
Ryan Cotterell
Mrinmaya Sachan
M. Zhou
47
60
0
22 Mar 2021
Evaluating Factuality in Generation with Dependency-level Entailment
Evaluating Factuality in Generation with Dependency-level Entailment
Tanya Goyal
Greg Durrett
112
151
0
12 Oct 2020
MOCHA: A Dataset for Training and Evaluating Generative Reading
  Comprehension Metrics
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
Anthony Chen
Gabriel Stanovsky
Sameer Singh
Matt Gardner
76
51
0
07 Oct 2020
Evaluation of Text Generation: A Survey
Evaluation of Text Generation: A Survey
Asli Celikyilmaz
Elizabeth Clark
Jianfeng Gao
ELMLM&MA
115
387
0
26 Jun 2020
Adversarial NLI for Factual Correctness in Text Summarisation Models
Adversarial NLI for Factual Correctness in Text Summarisation Models
Mario Barrantes
Benedikt Herudek
Richard Wang
46
17
0
24 May 2020
On Faithfulness and Factuality in Abstractive Summarization
On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez
Shashi Narayan
Bernd Bohnet
Ryan T. McDonald
HILM
84
1,039
0
02 May 2020
Longformer: The Long-Document Transformer
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALMVLM
179
4,092
0
10 Apr 2020
Stanza: A Python Natural Language Processing Toolkit for Many Human
  Languages
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi
Yuhao Zhang
Yuhui Zhang
Jason Bolton
Christopher D. Manning
AI4TS
253
1,695
0
16 Mar 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
MoE
329
602
0
12 Mar 2020
REALM: Retrieval-Augmented Language Model Pre-Training
REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu
Kenton Lee
Zora Tung
Panupong Pasupat
Ming-Wei Chang
RALM
142
2,116
0
10 Feb 2020
Evaluating the Factual Consistency of Abstractive Text Summarization
Evaluating the Factual Consistency of Abstractive Text Summarization
Wojciech Kry'sciñski
Bryan McCann
Caiming Xiong
R. Socher
HILM
115
746
0
28 Oct 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
462
20,317
0
23 Oct 2019
ELI5: Long Form Question Answering
ELI5: Long Form Question Answering
Angela Fan
Yacine Jernite
Ethan Perez
David Grangier
Jason Weston
Michael Auli
AI4MHELM
103
624
0
22 Jul 2019
BERTScore: Evaluating Text Generation with BERT
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
352
5,860
0
21 Apr 2019
Texygen: A Benchmarking Platform for Text Generation Models
Texygen: A Benchmarking Platform for Text Generation Models
Yaoming Zhu
Sidi Lu
Lei Zheng
Jiaxian Guo
Weinan Zhang
Jun Wang
Yong Yu
107
692
0
06 Feb 2018
1