ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2310.18512
  4. Cited By
Preventing Language Models From Hiding Their Reasoning

Preventing Language Models From Hiding Their Reasoning

27 October 2023
Fabien Roger
Ryan Greenblatt
    LRM
ArXivPDFHTML

Papers citing "Preventing Language Models From Hiding Their Reasoning"

8 / 8 papers shown
Title
Reasoning Models Don't Always Say What They Think
Reasoning Models Don't Always Say What They Think
Yanda Chen
Joe Benton
Ansh Radhakrishnan
Jonathan Uesato
Carson E. Denison
...
Vlad Mikulik
Samuel R. Bowman
Jan Leike
Jared Kaplan
E. Perez
ReLM
LRM
67
12
1
08 May 2025
The Steganographic Potentials of Language Models
The Steganographic Potentials of Language Models
Artem Karpov
Tinuade Adeleke
Seong Hah Cho
Natalia Perez-Campanero
32
0
0
06 May 2025
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Iván Arcuschin
Jett Janiak
Robert Krzyzanowski
Senthooran Rajamanoharan
Neel Nanda
Arthur Conmy
LRM
ReLM
62
6
0
11 Mar 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar
Vikrant Varma
David Lindner
David Elson
Caleb Biddulph
Ian Goodfellow
Rohin Shah
82
1
0
22 Jan 2025
AI Control: Improving Safety Despite Intentional Subversion
AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt
Buck Shlegeris
Kshitij Sachan
Fabien Roger
29
38
0
12 Dec 2023
Robust Multi-bit Natural Language Watermarking through Invariant
  Features
Robust Multi-bit Natural Language Watermarking through Invariant Features
Kiyoon Yoo
Wonhyuk Ahn
Jiho Jang
Nojun Kwak
WaLM
142
77
0
03 May 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
355
8,457
0
28 Jan 2022
Fine-Tuning Language Models from Human Preferences
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
280
1,587
0
18 Sep 2019
1