Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2410.19406
Cited By
v1
v2 (latest)
An Auditing Test To Detect Behavioral Shift in Language Models
25 October 2024
Leo Richter
Xuanli He
Pasquale Minervini
Matt J. Kusner
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"An Auditing Test To Detect Behavioral Shift in Language Models"
50 / 80 papers shown
Title
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley
Daniel Tan
Niels Warncke
Anna Sztyber-Betley
Xuchan Bao
Martín Soto
Nathan Labenz
Owain Evans
AAML
150
22
0
24 Feb 2025
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang
Xueguang Ma
Ge Zhang
Yuansheng Ni
Abhranil Chandra
...
Kai Wang
Alex Zhuang
Rongqi Fan
Xiang Yue
Wenhu Chen
LRM
ELM
129
463
0
03 Jun 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
146
55
0
31 May 2024
Evaluating Frontier Models for Dangerous Capabilities
Mary Phuong
Matthew Aitchison
Elliot Catt
Sarah Cogan
Alex Kaskasoli
...
Sasha Brown
Anca Dragan
Rohin Shah
Allan Dafoe
Toby Shevlane
ELM
50
74
0
20 Mar 2024
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team
Gemma Team Thomas Mesnard
Cassidy Hardin
Robert Dadashi
Surya Bhupatiraju
...
Armand Joulin
Noah Fiedel
Evan Senter
Alek Andreev
Kathleen Kenealy
VLM
LLMAG
227
506
0
13 Mar 2024
Holding Secrets Accountable: Auditing Privacy-Preserving Machine Learning
Hidde Lycklama
Alexander Viand
Nicolas Küchler
Christian Knabenhans
Anwar Hithnawi
111
7
0
24 Feb 2024
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li
Tianle Liu
Naomi Bashkansky
David Bau
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
96
12
0
13 Feb 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Ahmet Üstün
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM
ELM
SyDa
LRM
93
229
0
12 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
192
40
0
02 Feb 2024
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
S.M. Towhidul Islam Tonmoy
S. M. M. Zaman
Vinija Jain
Anku Rani
Vipula Rawte
Aman Chadha
Amitava Das
HILM
110
206
0
02 Jan 2024
Testing Closeness of Multivariate Distributions via Ramsey Theory
Ilias Diakonikolas
Daniel M. Kane
Sihan Liu
62
3
0
22 Nov 2023
Deep anytime-valid hypothesis testing
T. Pandeva
Patrick Forré
Aaditya Ramdas
S. Shekhar
72
5
0
30 Oct 2023
Towards Understanding Sycophancy in Language Models
Mrinank Sharma
Meg Tong
Tomasz Korbak
David Duvenaud
Amanda Askell
...
Oliver Rausch
Nicholas Schiefer
Da Yan
Miranda Zhang
Ethan Perez
354
244
0
20 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
AAML
139
707
0
12 Oct 2023
Mistral 7B
Albert Q. Jiang
Alexandre Sablayrolles
A. Mensch
Chris Bamford
Devendra Singh Chaplot
...
Teven Le Scao
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoE
LRM
110
2,238
0
10 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
129
633
0
05 Oct 2023
Gender bias and stereotypes in Large Language Models
Hadas Kotek
Rikker Dockum
David Q. Sun
118
238
0
28 Aug 2023
Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
LLMAG
88
88
0
31 Jul 2023
Robust Distortion-free Watermarks for Language Models
Rohith Kuditipudi
John Thickstun
Tatsunori Hashimoto
Percy Liang
WaLM
90
184
0
28 Jul 2023
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Ethan Chern
Steffi Chern
Shiqi Chen
Weizhe Yuan
Kehua Feng
Chunting Zhou
Junxian He
Graham Neubig
Pengfei Liu
HILM
68
207
0
25 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
96
503
0
10 Jul 2023
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
Neeraj Varshney
Wenlin Yao
Hongming Zhang
Jianshu Chen
Dong Yu
HILM
111
175
0
08 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
99
125
0
06 Jul 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wei Ping
Weixin Chen
Hengzhi Pei
Chulin Xie
Mintong Kang
...
Zinan Lin
Yuk-Kit Cheng
Sanmi Koyejo
Basel Alomair
Yue Liu
119
430
0
20 Jun 2023
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper
Jason Lin
Joe Kwon
Gatlen Culp
Dylan Hadfield-Menell
AAML
49
99
0
15 Jun 2023
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang
Hamish Ivison
Pradeep Dasigi
Jack Hessel
Tushar Khot
...
David Wadden
Kelsey MacMillan
Noah A. Smith
Iz Beltagy
Hannaneh Hajishirzi
ALM
ELM
113
393
0
07 Jun 2023
Robust Multi-bit Natural Language Watermarking through Invariant Features
Kiyoon Yoo
Wonhyuk Ahn
Jiho Jang
Nojun Kwak
WaLM
216
83
0
03 May 2023
Sequential Predictive Two-Sample and Independence Testing
Aleksandr Podkopaev
Aaditya Ramdas
83
15
0
29 Apr 2023
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf
Noam Wies
Oshri Avnery
Yoav Levine
Amnon Shashua
ALM
116
147
0
19 Apr 2023
Generative Agents: Interactive Simulacra of Human Behavior
J. Park
Joseph C. O'Brien
Carrie J. Cai
Meredith Ringel Morris
Percy Liang
Michael S. Bernstein
LM&Ro
AI4CE
407
1,975
0
07 Apr 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
199
1,211
0
29 Mar 2023
The Stable Signature: Rooting Watermarks in Latent Diffusion Models
Pierre Fernandez
Guillaume Couairon
Hervé Jégou
Matthijs Douze
Teddy Furon
WIGM
117
197
0
27 Mar 2023
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
1.5K
14,748
0
15 Mar 2023
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
LM&MA
ELM
ALM
AI4MH
138
471
0
07 Mar 2023
On Provable Copyright Protection for Generative Models
Nikhil Vyas
Sham Kakade
Boaz Barak
75
95
0
21 Feb 2023
Benchmarking Large Language Models for News Summarization
Tianyi Zhang
Faisal Ladhak
Esin Durmus
Percy Liang
Kathleen McKeown
Tatsunori B. Hashimoto
ELM
100
527
0
31 Jan 2023
A Watermark for Large Language Models
John Kirchenbauer
Jonas Geiping
Yuxin Wen
Jonathan Katz
Ian Miers
Tom Goldstein
VLM
WaLM
110
508
0
24 Jan 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
79
404
0
19 Dec 2022
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
214
1,646
0
15 Dec 2022
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Zhexin Zhang
Jiale Cheng
Hao Sun
Jiawen Deng
Fei Mi
Yasheng Wang
Lifeng Shang
Minlie Huang
SILM
141
9
0
04 Dec 2022
Game-theoretic statistics and safe anytime-valid inference
Aaditya Ramdas
Peter Grünwald
V. Vovk
Glenn Shafer
96
130
0
04 Oct 2022
CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks
Xuanli He
Xingliang Yuan
Yi Zeng
Lingjuan Lyu
Fangzhao Wu
Jiwei Li
R. Jia
WaLM
234
75
0
19 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
133
192
0
30 Aug 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
303
489
0
23 Aug 2022
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Aarohi Srivastava
Abhinav Rastogi
Abhishek Rao
Abu Awal Md Shoeb
Abubakar Abid
...
Zhuoye Zhao
Zijian Wang
Zijie J. Wang
Zirui Wang
Ziyi Wu
ELM
211
1,777
0
09 Jun 2022
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang
Swaroop Mishra
Pegah Alipoormolabashi
Yeganeh Kordi
Amirreza Mirzaei
...
Chitta Baral
Yejin Choi
Noah A. Smith
Hannaneh Hajishirzi
Daniel Khashabi
ELM
123
859
0
16 Apr 2022
From Concept Drift to Model Degradation: An Overview on Performance-Aware Drift Detectors
Firas Bayram
Bestoun S. Ahmed
A. Kassler
49
221
0
21 Mar 2022
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
Alyssa Lees
Vinh Q. Tran
Yi Tay
Jeffrey Scott Sorensen
Jai Gupta
Donald Metzler
Lucy Vasserman
88
192
0
22 Feb 2022
Quantifying Memorization Across Neural Language Models
Nicholas Carlini
Daphne Ippolito
Matthew Jagielski
Katherine Lee
Florian Tramèr
Chiyuan Zhang
PILM
124
630
0
15 Feb 2022
Deduplicating Training Data Mitigates Privacy Risks in Language Models
Nikhil Kandpal
Eric Wallace
Colin Raffel
PILM
MU
124
295
0
14 Feb 2022
1
2
Next