
Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Papers citing "Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning"
30 / 30 papers shown
Title |
---|
![]() Secrets of RLHF in Large Language Models Part II: Reward Modeling Bing Wang Rui Zheng Luyao Chen Yan Liu Shihan Dou ...Qi Zhang Xipeng Qiu Xuanjing Huang Zuxuan Wu Yuanyuan Jiang |