2302.10291
Can Large Language Models Change User Preference Adversarially?
5 January 2023
Varshini Subhash
AAML
Papers citing "Can Large Language Models Change User Preference Adversarially?"
11 / 11 papers shown
Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering
Yotam Wolf
Noam Wies
Dorin Shteyman
Binyamin Rothberg
Yoav Levine
Amnon Shashua
LLMSV
31
13
0
29 Jan 2024
RoleCraft-GLM: Advancing Personalized Role-Playing in Large Language Models
Meiling Tao
Xuechen Liang
Tianyu Shi
Lei Yu
Yiting Xie
37
4
0
17 Dec 2023
Training Socially Aligned Language Models on Simulated Social Interactions
Ruibo Liu
Ruixin Yang
Chenyan Jia
Ge Zhang
Denny Zhou
Andrew M. Dai
Diyi Yang
Soroush Vosoughi
ALM
37
45
0
26 May 2023
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf
Noam Wies
Oshri Avnery
Yoav Levine
Amnon Shashua
ALM
11
139
0
19 Apr 2023
Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models
Yi-Hsien Liu
Tianle Han
Siyuan Ma
Jia-Yu Zhang
Yuanyu Yang
...
Xiang Li
Ning Qiang
Dingang Shen
Tianming Liu
Bao Ge
ALM
ELM
AI4CE
LM&MA
LLMAG
38
461
0
04 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
494
0
01 Nov 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
225
444
0
23 Aug 2022
Natural Language Descriptions of Deep Visual Features
Evan Hernandez
Sarah Schwettmann
David Bau
Teona Bagashvili
Antonio Torralba
Jacob Andreas
MILM
201
117
0
26 Jan 2022
Analyzing Dynamic Adversarial Training Data in the Limit
Eric Wallace
Adina Williams
Robin Jia
Douwe Kiela
198
30
0
16 Oct 2021
Tailor: Generating and Perturbing Text with Semantic Controls
Alexis Ross
Tongshuang Wu
Hao Peng
Matthew E. Peters
Matt Gardner
136
77
0
15 Jul 2021
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
98
227
0
15 Apr 2021