ResearchTrend.AI
Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu, JongWoo Kim, MunYong Yi
arXiv:2409.07355 · 21 February 2025

Papers citing "Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation"

45 / 45 papers shown
• THiNK: Can Large Language Models Think-aloud?
  Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski
  LLMAG, LRM, ELM · 21 · 0 · 0 · 26 May 2025

• Integrating Expert Knowledge into Logical Programs via LLMs
  Franciszek Górski, Oskar Wysocki, Marco Valentino, André Freitas
  343 · 0 · 0 · 17 Feb 2025

• Mixture-of-Agents Enhances Large Language Model Capabilities
  Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou
  LLMAG, AIFin · 66 · 116 · 0 · 07 Jun 2024

• DEBATE: Devil's Advocate-Based Assessment and Text Evaluation
  Alex G. Kim, Keonwoo Kim, Sangwon Yoon
  ELM · 39 · 5 · 0 · 16 May 2024
• Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models
  Yukyung Lee, Soonwon Ka, Bokyung Son, Pilsung Kang, Jaewook Kang
  LLMAG · 97 · 6 · 0 · 22 Apr 2024

• CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
  Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Pilsung Kang, Najoung Kim
  ELM · 57 · 5 · 0 · 27 Mar 2024

• Shaping Human-AI Collaboration: Varied Scaffolding Levels in Co-writing with Language Models
  Paramveer S. Dhillon, Somayeh Molaei, Jiaqi Li, Maximilian Golub, Shaochun Zheng, Lionel P. Robert
  LLMAG · 72 · 44 · 0 · 18 Feb 2024

• LLM-based NLG Evaluation: Current Status and Challenges
  Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan
  ELM, LM&MA · 101 · 37 · 0 · 02 Feb 2024
• Towards Optimizing the Costs of LLM Usage
  Shivanshu Shekhar, Tanishq Dubey, Koyel Mukherjee, Apoorv Saxena, Atharv Tyagi, Nishanth Kotla
  33 · 20 · 0 · 29 Jan 2024

• Understanding Nonlinear Collaboration between Human and AI Agents: A Co-design Framework for Creative Design
  Jiayi Zhou, Renzhong Li, Junxiu Tang, Tan Tang, Haotian Li, Weiwei Cui, Yingcai Wu
  63 · 38 · 0 · 14 Jan 2024

• PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
  Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, Tara Safavi
  RALM · 83 · 41 · 0 · 15 Nov 2023
• LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs
  Md. Adnan Arefeen, Biplob K. Debnath, S. Chakradhar
  59 · 54 · 0 · 02 Sep 2023

• ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
  Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shan Zhang, Jie Fu, Zhiyuan Liu
  ELM, LLMAG, ALM · 68 · 467 · 0 · 14 Aug 2023

• Llama 2: Open Foundation and Fine-Tuned Chat Models
  Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
  AI4MH, ALM · 213 · 11,636 · 0 · 18 Jul 2023
• Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
  Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim
  LRM, ReLM · 106 · 211 · 0 · 05 Jul 2023

• FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
  Sewon Min, Kalpesh Krishna, Xinxi Lyu, M. Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
  HILM, ALM · 111 · 649 · 0 · 23 May 2023

• Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
  Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov
  74 · 91 · 0 · 23 May 2023
• FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
  Lingjiao Chen, Matei A. Zaharia, James Zou
  LLMAG · 119 · 224 · 0 · 09 May 2023

• Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks
  Yiming Zhu, Peixian Zhang, Ehsan-ul Haq, Pan Hui, Gareth Tyson
  DeLMO, ALM, AI4MH · 60 · 125 · 0 · 20 Apr 2023

• Supporting Human-AI Collaboration in Auditing LLMs with LLMs
  Charvi Rastogi, Marco Tulio Ribeiro, Nicholas King, Harsha Nori, Saleema Amershi
  ALM · 54 · 70 · 0 · 19 Apr 2023
• Can Large Language Models Transform Computational Social Science?
  Caleb Ziems, William B. Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, Diyi Yang
  LLMAG · 48 · 301 · 0 · 12 Apr 2023

• Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study
  Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu
  LM&MA · 69 · 78 · 0 · 03 Apr 2023

• G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
  ELM, ALM, LM&MA · 148 · 1,138 · 0 · 29 Mar 2023
• GPT-4 Technical Report
  OpenAI: Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, ..., Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph
  LLMAG, MLLM · 761 · 13,788 · 0 · 15 Mar 2023
• Is ChatGPT a Good NLG Evaluator? A Preliminary Study
  Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
  LM&MA, ELM, ALM, AI4MH · 97 · 458 · 0 · 07 Mar 2023

• Large Language Models Are State-of-the-Art Evaluators of Translation Quality
  Tom Kocmi, C. Federmann
  ELM · 80 · 352 · 0 · 28 Feb 2023

• The State of Human-centered NLP Technology for Fact-checking
  Anubrata Das, Houjiang Liu, Venelin Kovatchev, Matthew Lease
  HILM · 76 · 63 · 0 · 08 Jan 2023
• Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
  Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, ..., Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir R. Radev
  ALM · 51 · 133 · 0 · 15 Dec 2022

• Towards a Unified Multi-Dimensional Evaluator for Text Generation
  Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Peng Liu, Chenguang Zhu, Heng Ji, Jiawei Han
  ELM · 64 · 263 · 0 · 13 Oct 2022

• PEER: A Collaborative Language Model
  Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, Sebastian Riedel
  ALM · 82 · 95 · 0 · 24 Aug 2022
• Scholastic: Graphical Human-AI Collaboration for Inductive and Interpretive Text Analysis
  Matt-Heun Hong, Lauren A. Marsh, Jessica L. Feuston, Joan H Ruppert, Jed R. Brubaker, D. Szafir
  39 · 27 · 0 · 12 Aug 2022
• CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities
  Mina Lee, Percy Liang, Qian Yang
  HAI · 62 · 370 · 0 · 18 Jan 2022

• A Survey of Human-in-the-loop for Machine Learning
  Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, Liangbo He
  SyDa · 83 · 513 · 0 · 02 Aug 2021

• BARTScore: Evaluating Generated Text as Text Generation
  Weizhe Yuan, Graham Neubig, Pengfei Liu
  86 · 829 · 0 · 22 Jun 2021
• OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
  Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, Minlie Huang
  42 · 61 · 0 · 19 May 2021

• GRUEN for Evaluating Linguistic Quality of Generated Text
  Wanzheng Zhu, S. Bhat
  88 · 60 · 0 · 06 Oct 2020

• Learning to summarize from human feedback
  Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
  ALM · 189 · 2,071 · 0 · 02 Sep 2020
• Perception Score, A Learned Metric for Open-ended Text Generation Evaluation
  Jing Gu, Qingyang Wu, Zhou Yu
  37 · 12 · 0 · 07 Aug 2020

• SummEval: Re-evaluating Summarization Evaluation
  Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, R. Socher, Dragomir R. Radev
  HILM · 88 · 701 · 0 · 24 Jul 2020
• BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  M. Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdel-rahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer
  AIMat, VLM · 159 · 10,720 · 0 · 29 Oct 2019

• Fine-Tuning Language Models from Human Preferences
  Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
  ALM · 429 · 1,664 · 0 · 18 Sep 2019

• MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
  Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger
  128 · 592 · 0 · 05 Sep 2019
• Why Didn't You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models
  Varun Kumar, Alison Smith-Renner, Leah Findlater, Kevin Seppi, Jordan L. Boyd-Graber
  30 · 24 · 0 · 23 May 2019

• BERTScore: Evaluating Text Generation with BERT
  Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
  218 · 5,668 · 0 · 21 Apr 2019

• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
  VLM, SSL, SSeg · 1.1K · 93,936 · 0 · 11 Oct 2018