Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning

16 October 2024
Ruimeng Ye
Yang Xiao
Bo Hui
    ALM
    ELM
    OffRL

Papers citing "Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning"

30 / 30 papers shown
How to Mitigate Overfitting in Weak-to-strong Generalization?
Junhao Shi
Qinyuan Cheng
Zhaoye Fei
Y. Zheng
Qipeng Guo
Xipeng Qiu
91
0
0
06 Mar 2025
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration
Wei Yao
Wenkai Yang
Ziyi Wang
Yankai Lin
Yong Liu
ELM
211
3
0
03 Feb 2025
Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Yue Guo
Yi Yang
59
9
0
27 Jun 2024
Theoretical Analysis of Weak-to-Strong Generalization
Hunter Lang
David Sontag
Aravindan Vijayaraghavan
105
26
0
25 May 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Zhiqing Sun
Longhui Yu
Yikang Shen
Weiyang Liu
Yiming Yang
Sean Welleck
Chuang Gan
70
68
0
14 Mar 2024
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views
Yuji Roh
Qingyun Liu
Huan Gui
Zhe Yuan
Yujin Tang
...
Liang Liu
Shuchao Bi
Lichan Hong
Ed H. Chi
Zhe Zhao
102
2
0
07 Feb 2024
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Jianyuan Guo
Hanting Chen
Chengcheng Wang
Kai Han
Chang Xu
Yunhe Wang
VLM
47
22
0
06 Feb 2024
Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Jitao Sang
Yuhang Wang
Jing Zhang
Yanxu Zhu
Chao Kong
Junhong Ye
Shuyu Wei
Jinlin Xiao
78
10
0
01 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li
Yong Zhang
Shwai He
Zhitao Li
Hongyu Zhao
Jianzong Wang
Ning Cheng
Dinesh Manocha
74
74
0
01 Feb 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang
Rui Zheng
Luyao Chen
Yan Liu
Shihan Dou
...
Qi Zhang
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yuanyuan Jiang
ALM
88
107
0
11 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
81
289
0
14 Dec 2023
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
Shiyao Cui
Zhenyu Zhang
Yilong Chen
Wenyuan Zhang
Tianyun Liu
Siqi Wang
Tingwen Liu
62
15
0
30 Nov 2023
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
Chao Jiang
Bo Hui
Bohan Liu
Da Yan
DiffM
56
14
0
28 Oct 2023
Large Language Model Alignment: A Survey
Tianhao Shen
Renren Jin
Yufei Huang
Chuang Liu
Weilong Dong
Zishan Guo
Xinwei Wu
Yan Liu
Deyi Xiong
LM&MA
54
198
0
26 Sep 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar
Chirag Agarwal
Suraj Srinivas
Aaron Jiaxun Li
Soheil Feizi
Himabindu Lakkaraju
AAML
74
191
0
06 Sep 2023
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
Neel Guha
Julian Nyarko
Daniel E. Ho
Christopher Ré
Adam Chilton
...
Spencer Williams
Sunny G. Gandhi
Tomer Zur
Varun J. Iyer
Zehua Li
AILaw
LRM
ELM
54
169
0
20 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
287
1,449
0
27 Jul 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri
Daniel Paleka
Florian Tramèr
ELM
82
44
0
16 Jun 2023
A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models
Hayeon Lee
Rui Hou
Jongpil Kim
Davis Liang
Sung Ju Hwang
Alexander Min
38
7
0
26 May 2023
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Andreas Kopf
Yannic Kilcher
Dimitri von Rutte
Sotiris Anagnostidis
Zhi Rui Tam
...
Arnav Dantuluri
Andrew Maguire
Christoph Schuhmann
Huu Nguyen
A. Mattick
ALM
LM&MA
114
628
0
14 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
54
132
0
06 Apr 2023
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
Omar Shaikh
Hongxin Zhang
William B. Held
Michael S. Bernstein
Diyi Yang
ReLM
LRM
95
196
0
15 Dec 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
780
12,893
0
04 Mar 2022
SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
Megan Ung
Jing Xu
Y-Lan Boureau
56
47
0
14 Oct 2021
Denoising Diffusion Implicit Models
Jiaming Song
Chenlin Meng
Stefano Ermon
VLM
DiffM
216
7,350
0
06 Oct 2020
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman
Suchin Gururangan
Maarten Sap
Yejin Choi
Noah A. Smith
133
1,194
0
24 Sep 2020
Constrained Labeling for Weakly Supervised Learning
Chidubem Arachie
Bert Huang
72
17
0
15 Sep 2020
Denoising Diffusion Probabilistic Models
Jonathan Ho
Ajay Jain
Pieter Abbeel
DiffM
535
18,008
0
19 Jun 2020
Aligning Superhuman AI with Human Behavior: Chess as a Model System
Reid McIlroy-Young
S. Sen
Jon M. Kleinberg
Ashton Anderson
GNN
112
102
0
02 Jun 2020
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Hossein Hosseini
Sreeram Kannan
Baosen Zhang
Radha Poovendran
AAML
60
328
0
27 Feb 2017