
Refusal in Language Models Is Mediated by a Single Direction
Papers citing "Refusal in Language Models Is Mediated by a Single Direction"
50 / 55 papers shown
Title |
---|
Qwen Technical Report. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, ... Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu |