Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.10464
Cited By
v1
v2 (latest)
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
16 April 2024
Yu Li
Zhihua Wei
Han Jiang
Chuanyang Gong
LLMSV
Re-assign community
ArXiv (abs)
PDF
HTML
Github (6★)
Papers citing
"DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion"
16 / 16 papers shown
Title
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Ziang Xiao
Shu Wang
Xing Xie
ELM
ALM
134
8
0
20 Jun 2024
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
Huiping Zhuang
Jianwei Wang
Zhengdong Lu
Huiping Zhuang
Haoran Li
Huiping Zhuang
Cen Chen
RALM
KELM
105
8
0
03 Jun 2024
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
59
226
0
09 Dec 2023
Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li
Ari Holtzman
Daniel Fried
Percy Liang
Jason Eisner
Tatsunori Hashimoto
Luke Zettlemoyer
M. Lewis
125
374
0
27 Oct 2022
Language Detoxification with Attribute-Discriminative Latent Space
Jin Myung Kwak
Minseon Kim
Sung Ju Hwang
52
14
0
19 Oct 2022
Spurious Correlations in Reference-Free Evaluation of Text Generation
Esin Durmus
Faisal Ladhak
Tatsunori Hashimoto
62
31
0
21 Apr 2022
Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
Wei Ping
Ming-Yu Liu
Chaowei Xiao
Peng Xu
M. Patwary
Mohammad Shoeybi
Yue Liu
Anima Anandkumar
Bryan Catanzaro
100
71
0
08 Feb 2022
Ethical and social risks of harm from Language Models
Laura Weidinger
John F. J. Mellor
Maribeth Rauh
Conor Griffin
J. Uesato
...
Lisa Anne Hendricks
William S. Isaac
Sean Legassick
G. Irving
Iason Gabriel
PILM
122
1,044
0
08 Dec 2021
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Alisa Liu
Maarten Sap
Ximing Lu
Swabha Swayamdipta
Chandra Bhagavatula
Noah A. Smith
Yejin Choi
MU
115
376
0
07 May 2021
Knowledge Neurons in Pretrained Transformers
Damai Dai
Li Dong
Y. Hao
Zhifang Sui
Baobao Chang
Furu Wei
KELM
MU
110
464
0
18 Apr 2021
FUDGE: Controlled Text Generation With Future Discriminators
Kevin Kaichuang Yang
Dan Klein
107
336
0
12 Apr 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
313
387
0
28 Feb 2021
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith
168
1,221
0
24 Sep 2020
GeDi: Generative Discriminator Guided Sequence Generation
Ben Krause
Akhilesh Deepak Gotmare
Bryan McCann
N. Keskar
Shafiq Joty
R. Socher
Nazneen Rajani
141
407
0
14 Sep 2020
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan
Ana Marasović
Swabha Swayamdipta
Kyle Lo
Iz Beltagy
Doug Downey
Noah A. Smith
VLM
AI4CE
CLL
167
2,440
0
23 Apr 2020
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
Sumanth Dathathri
Andrea Madotto
Janice Lan
Jane Hung
Eric Frank
Piero Molino
J. Yosinski
Rosanne Liu
KELM
151
979
0
04 Dec 2019
1