2406.02577
Cited By
Are PPO-ed Language Models Hackable?
28 May 2024
Suraj Anand, David Getzen
Papers citing "Are PPO-ed Language Models Hackable?"
3 / 3 papers shown
Title
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
77
96
0
03 Jan 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat
282
1,996
0
31 Dec 2020
The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
223
618
0
03 Sep 2019