With Little Power Comes Great Responsibility
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky
13 October 2020 (arXiv:2010.06595)

Papers citing "With Little Power Comes Great Responsibility" (33 papers)
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu
24 Mar 2025 (ALM)

Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
04 Feb 2025

Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Soskuthy, Vered Shwartz, Jian Zhu
05 Oct 2024

Can Unconfident LLM Annotations Be Used for Confident Conclusions?
Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky
27 Aug 2024

Annotation alignment: Comparing LLM and human annotations of conversational safety
Rajiv Movva, Pang Wei Koh, Emma Pierson
10 Jun 2024 (ALM)

The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
Bhashithe Abeysinghe, Ruhan Circi
05 Jun 2024 (ELM)

How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, A. Nenkova
28 Feb 2024

Anchor Points: Benchmarking Models with Much Fewer Examples
Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
14 Sep 2023 (ALM)

A Call for Standardization and Validation of Text Style Transfer Evaluation
Phil Ostheimer, M. Nagda, Marius Kloft, Sophie Fellenz
01 Jun 2023

Document-Level Machine Translation with Large Language Models
Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, Zhaopeng Tu
05 Apr 2023 (ELM)

A Two-Sided Discussion of Preregistration of NLP Research
Anders Søgaard, Daniel Hershcovich, Miryam de Lhoneux
20 Feb 2023 (OnRL, AI4CE)

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, ..., Simeng Han, Shafiq R. Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir R. Radev
15 Dec 2022 (ALM)

Probing with Noise: Unpicking the Warp and Weft of Embeddings
Filip Klubicka, John D. Kelleher
21 Oct 2022

Searching for a higher power in the human evaluation of MT
Johnny Tian-Zheng Wei, Tom Kocmi, C. Federmann
20 Oct 2022

Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, Andrew McCallum
10 Oct 2022

Resolving the Human Subjects Status of Machine Learning's Crowdworkers
Divyansh Kaushik, Zachary Chase Lipton, A. London
08 Jun 2022

Life after BERT: What do Other Muppets Understand about Language?
Vladislav Lialin, Kevin Zhao, Namrata Shivagunde, Anna Rumshisky
21 May 2022

Private Hypothesis Testing for Social Sciences
Ajinkya Mulay, Sean M. Lane, Erin P. Hennes
07 May 2022

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz, Gabriel Stanovsky
27 Apr 2022

Modular Domain Adaptation
Junshen K. Chen, Dallas Card, Dan Jurafsky
26 Apr 2022

deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
Dennis Ulmer, Christian Hardmeier, J. Frellsen
14 Apr 2022

Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation
Zoey Liu, Emily Tucker Prud'hommeaux
05 Jan 2022

Automatic Text Evaluation through the Lens of Wasserstein Barycenters
Pierre Colombo, Guillaume Staerman, Chloé Clavel, Pablo Piantanida
27 Aug 2021

Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg, Miruna Clinciu, Ondrej Dusek, Dimitra Gkatzia, Stephanie Inglis, ..., Saad Mahamood, Emma Manning, S. Schoch, Craig Thomson, Luou Wen
02 Aug 2021

Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
Alexander Miserlis Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan L. Boyd-Graber, Philip Resnik
05 Jul 2021

Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi
02 Jul 2021 (DeLMO)

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith
30 Jun 2021 (DeLMO)

The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam, Steve Yadlowsky, Jason W. Wei, Naomi Saphra, Alexander D'Amour, ..., Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick
30 Jun 2021

CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
18 Apr 2021 (CLIP)

What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman, George E. Dahl
05 Apr 2021 (ELM, ALM)

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
Anastasia Shimorina, Anya Belz
17 Mar 2021

Nearest Neighbor Machine Translation
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, M. Lewis
01 Oct 2020 (RALM)

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
20 Apr 2018 (ELM)