With Little Power Comes Great Responsibility
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky
13 October 2020 (arXiv:2010.06595)

Papers citing "With Little Power Comes Great Responsibility" (33 papers)
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu
24 Mar 2025 (ALM)

Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
04 Feb 2025

Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
Farhan Samir, Emily P. Ahn, Shreya Prakash, Márton Soskuthy, Vered Shwartz, Jian Zhu
05 Oct 2024

Can Unconfident LLM Annotations Be Used for Confident Conclusions?
Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky
27 Aug 2024

Annotation alignment: Comparing LLM and human annotations of conversational safety
Rajiv Movva, Pang Wei Koh, Emma Pierson
10 Jun 2024 (ALM)

The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
Bhashithe Abeysinghe, Ruhan Circi
05 Jun 2024 (ELM)

How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, A. Nenkova
28 Feb 2024

Anchor Points: Benchmarking Models with Much Fewer Examples
Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
14 Sep 2023 (ALM)

A Call for Standardization and Validation of Text Style Transfer Evaluation
Phil Ostheimer, M. Nagda, Marius Kloft, Sophie Fellenz
01 Jun 2023

Document-Level Machine Translation with Large Language Models
Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, Zhaopeng Tu
05 Apr 2023 (ELM)

A Two-Sided Discussion of Preregistration of NLP Research
Anders Søgaard, Daniel Hershcovich, Miryam de Lhoneux
20 Feb 2023 (OnRL, AI4CE)

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, ..., Simeng Han, Shafiq R. Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir R. Radev
15 Dec 2022 (ALM)

Probing with Noise: Unpicking the Warp and Weft of Embeddings
Filip Klubicka, John D. Kelleher
21 Oct 2022

Searching for a higher power in the human evaluation of MT
Johnny Tian-Zheng Wei, Tom Kocmi, C. Federmann
20 Oct 2022

Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, Andrew McCallum
10 Oct 2022

Resolving the Human Subjects Status of Machine Learning's Crowdworkers
Divyansh Kaushik, Zachary Chase Lipton, A. London
08 Jun 2022

Life after BERT: What do Other Muppets Understand about Language?
Vladislav Lialin, Kevin Zhao, Namrata Shivagunde, Anna Rumshisky
21 May 2022

Private Hypothesis Testing for Social Sciences
Ajinkya Mulay, Sean M. Lane, Erin P. Hennes
07 May 2022

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz, Gabriel Stanovsky
27 Apr 2022

Modular Domain Adaptation
Junshen K. Chen, Dallas Card, Dan Jurafsky
26 Apr 2022

deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
Dennis Ulmer, Christian Hardmeier, J. Frellsen
14 Apr 2022

Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation
Zoey Liu, Emily Tucker Prud'hommeaux
05 Jan 2022

Automatic Text Evaluation through the Lens of Wasserstein Barycenters
Pierre Colombo, Guillaume Staerman, Chloé Clavel, Pablo Piantanida
27 Aug 2021

Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg, Miruna Clinciu, Ondrej Dusek, Dimitra Gkatzia, Stephanie Inglis, ..., Saad Mahamood, Emma Manning, S. Schoch, Craig Thomson, Luou Wen
02 Aug 2021

Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
Alexander Miserlis Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan L. Boyd-Graber, Philip Resnik
05 Jul 2021

Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi
02 Jul 2021 (DeLMO)

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith
30 Jun 2021 (DeLMO)

The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam, Steve Yadlowsky, Jason W. Wei, Naomi Saphra, Alexander D'Amour, ..., Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick
30 Jun 2021

CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
18 Apr 2021 (CLIP)

What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman, George E. Dahl
05 Apr 2021 (ELM, ALM)

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
Anastasia Shimorina, Anya Belz
17 Mar 2021

Nearest Neighbor Machine Translation
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, M. Lewis
01 Oct 2020 (RALM)

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
20 Apr 2018 (ELM)