Robustness Gym: Unifying the NLP Evaluation Landscape

13 January 2021
Karan Goel
Nazneen Rajani
Jesse Vig
Samson Tan
Jason M. Wu
Stephan Zheng
Caiming Xiong
Mohit Bansal
Christopher Ré
    AAML
    OffRL
    OOD

Papers citing "Robustness Gym: Unifying the NLP Evaluation Landscape"

45 papers shown
Orbit: A Framework for Designing and Evaluating Multi-objective Rankers
Chenyang Yang
Tesi Xiao
Michael Shavlovsky
Christian Kastner
Tongshuang Wu
07 Nov 2024
Enhancing adversarial robustness in Natural Language Inference using explanations
Alexandros Koulakos
Maria Lymperaiou
Giorgos Filandrianos
Giorgos Stamou
SILM
AAML
11 Sep 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
Nitay Calderon
Roi Reichart
27 Jul 2024
Evaluating Model Performance Under Worst-case Subpopulations
Mike Li
Hongseok Namkoong
Shangzhou Xia
01 Jul 2024
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem
Katerina Marazopoulou
Charlotte Siska
James Bono
25 Apr 2024
Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs
Chenyang Yang
Rishabh Rustogi
Rachel A. Brower-Sinning
Grace A. Lewis
Christian Kastner
Tongshuang Wu
KELM
14 Oct 2023
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Nils Feldhus
Qianli Wang
Tatiana Anikina
Sahil Chopra
Cennet Oguz
Sebastian Möller
09 Oct 2023
The Trickle-down Impact of Reward (In-)consistency on RLHF
Lingfeng Shen
Sihao Chen
Linfeng Song
Lifeng Jin
Baolin Peng
Haitao Mi
Daniel Khashabi
Dong Yu
28 Sep 2023
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
Lifan Yuan
Yangyi Chen
Ganqu Cui
Hongcheng Gao
Fangyuan Zou
Xingyi Cheng
Heng Ji
Zhiyuan Liu
Maosong Sun
07 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip H. S. Torr
Volker Tresp
VPVLM
VLM
03 Jun 2023
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Yangyi Chen
Hongcheng Gao
Ganqu Cui
Lifan Yuan
Dehan Kong
...
Longtao Huang
H. Xue
Zhiyuan Liu
Maosong Sun
Heng Ji
AAML
ELM
29 May 2023
An Overview on Language Models: Recent Developments and Outlook
Chengwei Wei
Yun Cheng Wang
Bin Wang
C.-C. Jay Kuo
10 Mar 2023
Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience
Q. V. Liao
Hariharan Subramonyam
Jennifer Wang
Jennifer Wortman Vaughan
HAI
21 Feb 2023
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
16 Feb 2023
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Shuaichen Chang
J. Wang
Mingwen Dong
Lin Pan
Henghui Zhu
...
William Yang Wang
Zhiguo Wang
Vittorio Castelli
Patrick K. L. Ng
Bing Xiang
OOD
21 Jan 2023
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
20 Dec 2022
Azimuth: Systematic Error Analysis for Text Classification
Gabrielle Gauthier Melançon
Orlando Marquez Ayala
Lindsay D. Brin
Chris Tyler
Frederic Branchaud-Charron
Joseph Marinier
Karine Grande
Dieu-Thu Le
16 Dec 2022
A Survey on Medical Document Summarization
Raghav Jain
Anubhav Jangra
S. Saha
Adam Jatowt
3DGS
MedIm
03 Dec 2022
Capabilities for Better ML Engineering
Chenyang Yang
Rachel A. Brower-Sinning
Grace A. Lewis
Christian Kastner
Tongshuang Wu
11 Nov 2022
Enhancing Tabular Reasoning with Pattern Exploiting Training
Abhilash Shankarampeta
Vivek Gupta
Shuo Zhang
LMTD
RALM
ReLM
21 Oct 2022
TCAB: A Large-Scale Text Classification Attack Benchmark
Kalyani Asthana
Zhouhang Xie
Wencong You
Adam Noack
Jonathan Brophy
Sameer Singh
Daniel Lowd
21 Oct 2022
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro von Werra
Lewis Tunstall
A. Thakur
A. Luccioni
Tristan Thrush
...
Julien Chaumond
Margaret Mitchell
Alexander M. Rush
Thomas Wolf
Douwe Kiela
ELM
30 Sep 2022
Perturbations and Subpopulations for Testing Robustness in Token-Based Argument Unit Recognition
Jonathan Kamp
Lisa Beinborn
Antske Fokkens
29 Sep 2022
Shortcut Learning of Large Language Models in Natural Language Understanding
Mengnan Du
Fengxiang He
Na Zou
Dacheng Tao
Xia Hu
KELM
OffRL
25 Aug 2022
ferret: a Framework for Benchmarking Explainers on Transformers
Giuseppe Attanasio
Eliana Pastor
C. Bonaventura
Debora Nozza
02 Aug 2022
Interactive Model Cards: A Human-Centered Approach to Model Documentation
Anamaria Crisan
Margaret Drouhard
Jesse Vig
Nazneen Rajani
HAI
05 May 2022
What do we Really Know about State of the Art NER?
Sowmya Vajjala
Ramya Balasubramaniam
29 Apr 2022
Adaptor: Objective-Centric Adaptation Framework for Language Models
Michal Štefánik
Vít Novotný
Nikola Groverová
Petr Sojka
08 Mar 2022
Identifying Adversarial Attacks on Text Classifiers
Zhouhang Xie
Jonathan Brophy
Adam Noack
Wencong You
Kalyani Asthana
Carter Perkins
Sabrina Reis
Sameer Singh
Daniel Lowd
AAML
21 Jan 2022
Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang
Haohan Wang
Diyi Yang
15 Dec 2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh D. Dhole
Varun Gangal
Sebastian Gehrmann
Aadesh Gupta
Zhenhao Li
...
Tianbao Xie
Usama Yaseen
Michael A. Yee
Jing Zhang
Yue Zhang
06 Dec 2021
NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation
David Alfonso-Hermelo
Ahmad Rashid
Abbas Ghaddar
Mehdi Rezagholizadeh
09 Nov 2021
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Boxin Wang
Chejian Xu
Shuohang Wang
Zhe Gan
Yu Cheng
Jianfeng Gao
Ahmed Hassan Awadallah
B. Li
VLM
ELM
AAML
04 Nov 2021
Metadata Shaping: Natural Language Annotations for the Tail
Simran Arora
Sen Wu
Enci Liu
Christopher Ré
16 Oct 2021
SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems
Harrison Lee
Raghav Gupta
Abhinav Rastogi
Yuan Cao
Bin Zhang
Yonghui Wu
13 Oct 2021
Datasets: A Community Library for Natural Language Processing
Quentin Lhoest
Albert Villanova del Moral
Yacine Jernite
A. Thakur
Patrick von Platen
...
Thibault Goehringer
Victor Mustar
François Lagunas
Alexander M. Rush
Thomas Wolf
07 Sep 2021
Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization
Chujie Zheng
Kunpeng Zhang
Harry J. Wang
Ling Fan
Zhe Wang
26 Aug 2021
Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems
Laurel J. Orr
Atindriyo Sanyal
Xiao Ling
Karan Goel
Megan Leszczynski
11 Aug 2021
Mandoline: Model Evaluation under Distribution Shift
Mayee F. Chen
Karan Goel
N. Sohoni
Fait Poms
Kayvon Fatahalian
Christopher Ré
01 Jul 2021
Exploring Robust Architectures for Deep Artificial Neural Networks
Asim Waqas
Ghulam Rasool
Hamza Farooq
N. Bouaynaya
OOD
AAML
30 Jun 2021
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Simon Mille
Kaustubh D. Dhole
Saad Mahamood
Laura Perez-Beltrachini
Varun Gangal
Mihir Kale
Emiel van Miltenburg
Sebastian Gehrmann
ELM
16 Jun 2021
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
Jiaao Chen
Derek Tam
Colin Raffel
Mohit Bansal
Diyi Yang
14 Jun 2021
Reliability Testing for Natural Language Processing Systems
Samson Tan
Shafiq R. Joty
K. Baxter
Araz Taeihagh
G. Bennett
Min-Yen Kan
06 May 2021
What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization
Griffin Adams
Emily Alsentzer
Mert Ketenci
Jason Zucker
Noémie Elhadad
12 Apr 2021
Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation
Laurel J. Orr
Megan Leszczynski
Simran Arora
Sen Wu
Neel Guha
Xiao Ling
Christopher Ré
20 Oct 2020