Show Your Work: Improved Reporting of Experimental Results

6 September 2019

Papers citing "Show Your Work: Improved Reporting of Experimental Results"

Title
N-Gram Induction Heads for In-Context RL: Improving Stability and Reducing Data Needs Ilya Zisman Alexander Nikulin Nikita Lyubaykin Andrei Polubarov Igor Kiselev Vladislav Kurenkov OffRL 56 2 0 04 Nov 2024
Can Unconfident LLM Annotations Be Used for Confident Conclusions? Kristina Gligorić Tijana Zrnic Cinoo Lee Emmanuel J. Candès Dan Jurafsky 72 6 0 27 Aug 2024
Better than classical? The subtle art of benchmarking quantum machine learning models Joseph Bowles Shahnawaz Ahmed Maria Schuld 42 65 0 11 Mar 2024
Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions Will Aitken Mohamed Abdalla K. Rudie Catherine Stinson 33 0 0 06 Dec 2023
torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free Deep Learning Studies: A Case Study on NLP Yoshitomo Matsubara VLM 34 1 0 26 Oct 2023
Target Variable Engineering Jessica Clark 35 0 0 13 Oct 2023
GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models B. Silva Leonardo Nunes Roberto Estevão Vijay Aski Ranveer Chandra ELM LM&MA 43 12 0 10 Oct 2023
An Easy Rejection Sampling Baseline via Gradient Refined Proposals Edward Raff Mark McLean James Holt 20 0 0 30 Sep 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes Anna Rogers A. Luccioni 53 19 0 14 Aug 2023
On the Limitations of Simulating Active Learning Katerina Margatina Nikolaos Aletras 31 11 0 21 May 2023
Measuring and Mitigating Local Instability in Deep Neural Networks Arghya Datta Subhrangshu Nandi Jingcheng Xu Greg Ver Steeg He Xie Anoop Kumar Aram Galstyan 20 3 0 18 May 2023
When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP Sara Papi Marco Gaido Andrea Pilzer Matteo Negri 59 10 0 28 Mar 2023
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation Yaoming Zhu Zewei Sun Shanbo Cheng Yuyang Huang Liwei Wu Mingxuan Wang 28 10 0 20 Dec 2022
We need to talk about random seeds Steven Bethard 31 8 0 24 Oct 2022
Towards a Standardised Performance Evaluation Protocol for Cooperative MARL R. Gorsane Omayma Mahjoub Ruan de Kock Roland Dubb Siddarth S. Singh Arnu Pretorius OffRL 39 49 0 21 Sep 2022
Making Intelligence: Ethical Values in IQ and ML Benchmarks Borhane Blili-Hamelin Leif Hancox-Li 41 16 0 01 Sep 2022
Efficient Methods for Natural Language Processing: A Survey Marcos Vinícius Treviso Ji-Ung Lee Tianchu Ji Betty van Aken Qingqing Cao ... Emma Strubell Niranjan Balasubramanian Leon Derczynski Iryna Gurevych Roy Schwartz 33 109 0 31 Aug 2022
Resolving the Human Subjects Status of Machine Learning's Crowdworkers Divyansh Kaushik Zachary Chase Lipton A. London 25 2 0 08 Jun 2022
deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks Dennis Ulmer Christian Hardmeier J. Frellsen 48 42 0 14 Apr 2022
Reducing Model Jitter: Stable Re-training of Semantic Parsers in Production Environments Christopher Hidey Fei Liu Rahul Goel 32 4 0 10 Apr 2022
Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization Brandon Trabucco Xinyang Geng Aviral Kumar Sergey Levine OffRL 32 95 0 17 Feb 2022
Adaptive Fine-Tuning of Transformer-Based Language Models for Named Entity Recognition Felix Stollenwerk 12 3 0 05 Feb 2022
Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation Zoey Liu Emily Tucker Prudhommeaux 43 4 0 05 Jan 2022
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research Bernard Koch Emily L. Denton A. Hanna J. Foster 53 140 0 03 Dec 2021
How not to Lie with a Benchmark: Rearranging NLP Leaderboards Tatiana Shavrina Valentin Malykh ALM ELM 423 10 0 02 Dec 2021
AI and the Everything in the Whole Wide World Benchmark Inioluwa Deborah Raji Emily M. Bender Amandalynne Paullada Emily L. Denton A. Hanna 30 291 0 26 Nov 2021
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP Anna Rogers Timothy Baldwin Kobi Leins 104 64 0 14 Sep 2021
Greenformers: Improving Computation and Memory Efficiency in Transformer Models via Low-Rank Approximation Samuel Cahyawijaya 26 12 0 24 Aug 2021
Underreporting of errors in NLG output, and what to do about it Emiel van Miltenburg Miruna Clinciu Ondrej Dusek Dimitra Gkatzia Stephanie Inglis ... Saad Mahamood Emma Manning S. Schoch Craig Thomson Luou Wen 27 38 0 02 Aug 2021
The Benchmark Lottery Mostafa Dehghani Yi Tay A. Gritsenko Zhe Zhao N. Houlsby Fernando Diaz Donald Metzler Oriol Vinyals 42 89 0 14 Jul 2021
Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling Emily Dinan Gavin Abercrombie A. S. Bergman Shannon L. Spruit Dirk Hovy Y-Lan Boureau Verena Rieser 43 105 0 07 Jul 2021
Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence Alexander Miserlis Hoyle Pranav Goel Denis Peskov Andrew Hian-Cheong Jordan L. Boyd-Graber Philip Resnik 41 128 0 05 Jul 2021
How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation Swaroop Mishra Anjana Arunkumar 34 24 0 10 Jun 2021
Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation Zhiyong Wu Lingpeng Kong W. Bi Xiang Li B. Kao LRM 23 77 0 30 May 2021
Measuring Shifts in Attitudes Towards COVID-19 Measures in Belgium Using Multilingual BERT Kristen M. Scott Pieter Delobelle Bettina Berendt 26 3 0 20 Apr 2021
Perspectives on Machine Learning from Psychology's Reproducibility Crisis Samuel J. Bell Onno P. Kampman 17 15 0 18 Apr 2021
Making Attention Mechanisms More Robust and Interpretable with Virtual Adversarial Training Shunsuke Kitada Hitoshi Iyatomi AAML 28 8 0 18 Apr 2021
Multilingual and Cross-Lingual Intent Detection from Spoken Data D. Gerz Pei-hao Su Razvan Kusztos Avishek Mondal M. Lis Eshan Singhal N. Mrksic Tsung-Hsien Wen Ivan Vulić 17 35 0 17 Apr 2021
What Will it Take to Fix Benchmarking in Natural Language Understanding? Samuel R. Bowman George E. Dahl ELM ALM 30 156 0 05 Apr 2021
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark Nicholas Lourie Ronan Le Bras Chandra Bhagavatula Yejin Choi LRM 30 137 0 24 Mar 2021
Dutch Humor Detection by Generating Negative Examples Thomas Winters Pieter Delobelle 16 10 0 26 Oct 2020
Dynamic Contextualized Word Embeddings Valentin Hofmann J. Pierrehumbert Hinrich Schütze 39 51 0 23 Oct 2020
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus George Michalopoulos Yuanxin Wang H. Kaka Helen H. Chen Alexander Wong 28 122 0 20 Oct 2020
Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings Phillip Keung Julian Salazar Y. Lu Noah A. Smith SSL 27 25 0 15 Oct 2020
Extracting a Knowledge Base of Mechanisms from COVID-19 Papers Tom Hope Aida Amini David Wadden Madeleine van Zuylen Sravanthi Parasa Eric Horvitz Daniel S. Weld Roy Schwartz Hannaneh Hajishirzi 34 29 0 08 Oct 2020
Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations Emily Allaway Kathleen McKeown 19 177 0 07 Oct 2020
Easy, Reproducible and Quality-Controlled Data Collection with Crowdaq Qiang Ning Hao Wu Pradeep Dasigi Dheeru Dua Matt Gardner Robert L Logan IV Ana Marasović Zhenjin Nie 30 16 0 06 Oct 2020
Understanding tables with intermediate pre-training Julian Martin Eisenschlos Syrine Krichene Thomas Müller LMTD 15 119 0 01 Oct 2020
Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank Ethan C. Chau Lucy H. Lin Noah A. Smith 19 15 0 29 Sep 2020
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation Charles F Welch Rada Mihalcea Jonathan K. Kummerfeld AI4CE 19 4 0 29 Sep 2020