ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2212.08073
  4. Cited By
Constitutional AI: Harmlessness from AI Feedback

Constitutional AI: Harmlessness from AI Feedback

15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
    SyDaMoMe
ArXiv (abs)PDFHTML

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

50 / 1,202 papers shown
Title
Prompt Optimization with Logged Bandit Data
Prompt Optimization with Logged Bandit Data
Haruka Kiyohara
Daniel Yiming Cao
Yuta Saito
Thorsten Joachims
234
0
0
03 Apr 2025
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Liangjie Huang
Dawei Li
Huan Liu
Lu Cheng
LRM
112
0
0
03 Apr 2025
HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations
HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations
Yiran Xu
Siqi Xie
Zhuofang Li
Harris Shadmany
Yinxiao Li
...
Jesse Berent
Ming-Hsuan Yang
Irfan Essa
Jia-Bin Huang
Feng Yang
VOS
154
1
0
03 Apr 2025
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
Nishad Singhi
Hritik Bansal
Arian Hosseini
Aditya Grover
Kai-Wei Chang
Marcus Rohrbach
Anna Rohrbach
OffRLLRM
125
6
0
01 Apr 2025
Learning a Canonical Basis of Human Preferences from Binary Ratings
Learning a Canonical Basis of Human Preferences from Binary Ratings
Kailas Vodrahalli
Wei Wei
James Zou
85
0
0
31 Mar 2025
$\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Agents Under Siege\textit{Agents Under Siege}Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Rana Muhammad Shahroz Khan
Zhen Tan
Sukwon Yun
Charles Flemming
Tianlong Chen
AAMLLLMAG
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025
196
4
0
31 Mar 2025
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Wei Shen
Guanlin Liu
Zheng Wu
Ruofei Zhu
Qingping Yang
Chao Xin
Yu Yue
Lin Yan
154
14
0
28 Mar 2025
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Weiqi Li
Xinyu Zhang
Shijie Zhao
Yize Zhang
Junlin Li
Li Zhang
Jian Zhang
109
11
0
28 Mar 2025
Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
Syrine Belakaria
Joshua Kazdan
Charles Marx
Chris Cundy
Willie Neiswanger
Sanmi Koyejo
Barbara Engelhardt
Stefano Ermon
110
0
0
28 Mar 2025
SWI: Speaking with Intent in Large Language Models
SWI: Speaking with Intent in Large Language Models
Yuwei Yin
EunJeong Hwang
Giuseppe Carenini
LRM
131
0
0
27 Mar 2025
Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Tong Nie
Jian Sun
Wei Ma
226
4
0
27 Mar 2025
Multi-head Reward Aggregation Guided by Entropy
Multi-head Reward Aggregation Guided by Entropy
Xiaomin Li
Xupeng Chen
Jingxuan Fan
Eric Hanchen Jiang
Mingye Gao
AAML
99
3
0
26 Mar 2025
Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success
Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success
Sophie Hao
ELMAI4CE
105
0
0
25 Mar 2025
Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
Yunhao Tang
Kunhao Zheng
Gabriel Synnaeve
Rémi Munos
64
3
0
25 Mar 2025
RL-finetuning LLMs from on- and off-policy data with a single algorithm
RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang
Taco Cohen
David W. Zhang
Michal Valko
Rémi Munos
OffRL
92
4
0
25 Mar 2025
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Yaojie Lu
Qichao Wang
H. Cao
Xierui Wang
Xiaoyin Xu
Min Zhang
120
1
0
24 Mar 2025
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Wenhao You
Bryan Hooi
Yiwei Wang
Yansen Wang
Zong Ke
Ming Yang
Zi Huang
Yujun Cai
AAML
100
0
0
24 Mar 2025
AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models
AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models
Le Qiu
Zelai Xu
Qixin Tan
Wenhao Tang
Chao Yu
Yu Wang
AAML
87
0
0
24 Mar 2025
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou
Kevin E. Wu
Francesco Pinto
Zhongfu Chen
Yi Zeng
Yu Yang
Shuang Yang
Sanmi Koyejo
James Zou
Bo Li
LLMAGAAML
136
1
0
20 Mar 2025
Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models
Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models
M. Wong
C. Tan
ALM
125
5
0
19 Mar 2025
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai
Shitian Zhao
Ming Li
Jike Zhong
Xiaofeng Yang
OffRLLRMLM&MAVLM
186
31
0
18 Mar 2025
Augmented Adversarial Trigger Learning
Augmented Adversarial Trigger Learning
Zhe Wang
Yanjun Qi
92
0
0
16 Mar 2025
OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses
OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses
Angela Lopez-Cardona
Sebastian Idesis
Miguel Barreda-Ángeles
Sergi Abadal
Ioannis Arapakis
140
0
0
13 Mar 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran
Hieu Minh "Jord" Nguyen
Akash Kundu
Sami Jawhar
Jinsuk Park
Mateusz Maria Jurewicz
105
3
0
13 Mar 2025
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Xudong Tan
Peng Ye
Chongjun Tu
Jianjian Cao
Yaoxin Yang
Lin Zhang
Dongzhan Zhou
Tao Chen
VLM
159
3
0
13 Mar 2025
RankPO: Preference Optimization for Job-Talent Matching
Yize Zhang
Ming Wang
Yu Wang
Xiaohui Wang
117
0
0
13 Mar 2025
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
Shahnewaz Karim Sakib
Anindya Bijoy Das
Shibbir Ahmed
AAML
96
2
0
12 Mar 2025
Got Compute, but No Data: Lessons From Post-training a Finnish LLM
Elaine Zosa
Ville Komulainen
S. Pyysalo
97
1
0
12 Mar 2025
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola
Aaron Gokaslan
Justin T Chiu
Zhihan Yang
Zhixuan Qi
Jiaqi Han
Subham Sekhar Sahoo
Volodymyr Kuleshov
DiffM
272
25
0
12 Mar 2025
Robust Multi-Objective Controlled Decoding of Large Language Models
Seongho Son
William Bankes
Sangwoong Yoon
Shyam Sundhar Ramesh
Xiaohang Tang
Ilija Bogunovic
127
2
0
11 Mar 2025
Backtracking for Safety
Bilgehan Sel
Dingcheng Li
Phillip Wallis
Vaishakh Keshava
Ming Jin
Siddhartha Reddy Jonnalagadda
KELM
96
0
0
11 Mar 2025
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
Huilin Deng
Ding Zou
Rui Ma
Hongchen Luo
Yang Cao
Yu Kang
LRMVLM
115
22
0
10 Mar 2025
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Jiacheng Ruan
Wenzhen Yuan
Xian Gao
Ye Guo
Daoxin Zhang
Zhe Xu
Yao Hu
Ting Liu
Yuzhuo Fu
LRMVLM
165
6
0
10 Mar 2025
Combinatorial Optimization via LLM-driven Iterated Fine-tuning
Pranjal Awasthi
Sreenivas Gollapudi
Ravi Kumar
Kamesh Munagala
146
1
0
10 Mar 2025
Safety Guardrails for LLM-Enabled Robots
Zachary Ravichandran
Alexander Robey
Vijay Kumar
George Pappas
Hamed Hassani
120
5
0
10 Mar 2025
Language Model Personalization via Reward Factorization
Idan Shenfeld
Felix Faltings
Pulkit Agrawal
Aldo Pacchiano
109
1
0
08 Mar 2025
Superintelligence Strategy: Expert Version
Superintelligence Strategy: Expert Version
Dan Hendrycks
Eric Schmidt
Alexandr Wang
118
3
0
07 Mar 2025
LLMs Can Generate a Better Answer by Aggregating Their Own Responses
LLMs Can Generate a Better Answer by Aggregating Their Own Responses
Zichong Li
Xinyu Feng
Yuheng Cai
Zixuan Zhang
Tianyi Liu
Chen Liang
Weizhu Chen
Haoyu Wang
Tiejun Zhao
LRM
115
2
0
06 Mar 2025
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
Erik Jones
Arjun Patrawala
Jacob Steinhardt
72
1
0
06 Mar 2025
Improving LLM Safety Alignment with Dual-Objective Optimization
Improving LLM Safety Alignment with Dual-Objective Optimization
Xuandong Zhao
Will Cai
Tianneng Shi
David Huang
Licong Lin
Song Mei
Dawn Song
AAMLMU
207
5
0
05 Mar 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
172
5
0
05 Mar 2025
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Borong Zhang
Yuhao Zhang
Yalan Qin
Yingshan Lei
Josef Dai
Yuanpei Chen
Yaodong Yang
126
4
0
05 Mar 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALMELM
485
2
0
04 Mar 2025
Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Xingzuo Li
Kehai Chen
Yunfei Long
X. Bai
Yong-mei Xu
Min Zhang
LLMAGLRM
125
1
0
04 Mar 2025
MPO: Boosting LLM Agents with Meta Plan Optimization
Weimin Xiong
Yifan Song
Qingxiu Dong
Bingchan Zhao
Feifan Song
Xun Wang
Sujian Li
LLMAG
144
3
0
04 Mar 2025
Proportionality in Thumbs Up and Down Voting
Sonja Kraiczy
Georgios Papasotiropoulos
Grzegorz Pierczynski
P. Skowron
98
0
0
03 Mar 2025
Offline RLAIF: Piloting VLM Feedback for RL via SFO
Offline RLAIF: Piloting VLM Feedback for RL via SFO
Jacob Beck
OffRL
114
0
0
02 Mar 2025
Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content
Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content
Hongyuan Shen
Min Zheng
Jincheng Wang
Yang Zhao
80
0
0
28 Feb 2025
A database to support the evaluation of gender biases in GPT-4o output
A database to support the evaluation of gender biases in GPT-4o output
Luise Mehner
Lena Alicija Philine Fiedler
Sabine Ammon
Dorothea Kolossa
78
0
0
28 Feb 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stañczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
474
1
0
27 Feb 2025
Previous
12345...232425
Next