arXiv: 2212.03827
Cited By
Discovering Latent Knowledge in Language Models Without Supervision
7 December 2022
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
Papers citing "Discovering Latent Knowledge in Language Models Without Supervision"
50 / 268 papers shown
ICLR: In-Context Learning of Representations
Core Francisco Park
Andrew Lee
Ekdeep Singh Lubana
Yongyi Yang
Maya Okawa
Kento Nishi
Martin Wattenberg
Hidenori Tanaka
AIFin
118
3
0
29 Dec 2024
Does Representation Matter? Exploring Intermediate Layers in Large Language Models
Oscar Skean
Md Rifat Arefin
Yann LeCun
Ravid Shwartz-Ziv
81
7
0
12 Dec 2024
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice
Philipp Alexander Kreer
Nathan Helm-Burger
Prithviraj Singh Shahani
Fedor Ryzhenkov
Jacob Haimes
Felix Hofstätter
Teun van der Weij
82
1
0
02 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
69
0
0
02 Dec 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
54
3
0
17 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch
Constantin Weisser
Severin Field
Helen Yannakoudakis
Stephen Casper
39
2
0
02 Nov 2024
Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings
Yashvir S. Grewal
Edwin V. Bonilla
Thang D. Bui
UQCV
30
3
0
30 Oct 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
69
3
0
30 Oct 2024
Distinguishing Ignorance from Error in LLM Hallucinations
Adi Simhi
Jonathan Herzig
Idan Szpektor
Yonatan Belinkov
HILM
53
2
0
29 Oct 2024
Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models
Mohammad Beigi
Sijia Wang
Ying Shen
Zihao Lin
Adithya Kulkarni
...
Ming Jin
Jin-Hee Cho
Dawei Zhou
Chang-Tien Lu
Lifu Huang
29
1
0
26 Oct 2024
Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination
Salman Rakin
Md. A. R. Shibly
Zahin M. Hossain
Zeeshan Khan
Md. Mostofa Akbar
23
1
0
23 Oct 2024
DEAN: Deactivating the Coupled Neurons to Mitigate Fairness-Privacy Conflicts in Large Language Models
Chen Qian
Dongrui Liu
Jie Zhang
Yong Liu
Jing Shao
37
1
0
22 Oct 2024
Chatting with Bots: AI, Speech Acts, and the Edge of Assertion
Iwan Williams
Tim Bayne
34
1
0
22 Oct 2024
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
51
3
0
18 Oct 2024
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova
Lovisa Hagström
Moa Johansson
Richard Johansson
Marco Kuhlmann
HILM
43
0
0
18 Oct 2024
Enhancing Fact Retrieval in PLMs through Truthfulness
Paul Youssef
Jörg Schlötterer
C. Seifert
KELM
HILM
25
0
0
17 Oct 2024
Anchored Alignment for Self-Explanations Enhancement
Luis Felipe Villa-Arenas
Ata Nizamoglu
Qianli Wang
Sebastian Möller
Vera Schmitt
24
0
0
17 Oct 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen
Nora Belrose
32
1
0
17 Oct 2024
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
Yiming Wang
Pei Zhang
Baosong Yang
Derek F. Wong
Rui-cang Wang
LRM
48
4
0
17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
62
14
0
15 Oct 2024
Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi
Xuefeng Du
Yixuan Li
37
11
0
13 Oct 2024
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models
Zheng Yi Ho
Siyuan Liang
Sen Zhang
Yibing Zhan
Dacheng Tao
26
2
0
11 Oct 2024
Chip-Tuning: Classify Before Language Models Say
Fangwei Zhu
Dian Li
Jiajun Huang
Gang Liu
Hui Wang
Zhifang Sui
25
0
0
09 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
48
12
0
08 Oct 2024
Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore
Yejin Choi
Sydney Levine
33
0
0
07 Oct 2024
Evaluating Language Model Character Traits
Francis Rhys Ward
Zejia Yang
Alex Jackson
Randy Brown
Chandler Smith
Grace Colverd
Louis Thomson
Raymond Douglas
Patrik Bartak
Andrew Rowan
42
0
0
05 Oct 2024
Understanding Reasoning in Chain-of-Thought from the Hopfieldian View
Lijie Hu
Liang Liu
Shu Yang
Xin Chen
Zhen Tan
Muhammad Asif Ali
Mengdi Li
Di Wang
LRM
41
1
0
04 Oct 2024
FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
Deema Alnuhait
Neeraja Kirtane
Muhammad Khalifa
Hao Peng
HILM
LRM
34
2
0
03 Oct 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad
Michael Toker
Zorik Gekhman
Roi Reichart
Idan Szpektor
Hadas Kotek
Yonatan Belinkov
HILM
AIFin
49
25
0
03 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
67
1
0
03 Oct 2024
Integrative Decoding: Improve Factuality via Implicit Self-consistency
Yi Cheng
Xiao Liang
Yeyun Gong
Wen Xiao
Song Wang
...
Wenjie Li
Jian Jiao
Qi Chen
Peng Cheng
Wayne Xiong
HILM
56
1
0
02 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee
Shaona Ghosh
Traian Rebedea
Christopher Parisien
LLMSV
31
3
0
02 Oct 2024
Attention layers provably solve single-location regression
P. Marion
Raphael Berthier
Gérard Biau
Claire Boyer
128
2
0
02 Oct 2024
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du
Reshmi Ghosh
Robert Sim
Ahmed Salem
Vitor Carvalho
Emily Lawton
Yixuan Li
Jack W. Stokes
VLM
AAML
38
6
0
01 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
62
10
0
30 Sep 2024
A Survey on the Honesty of Large Language Models
Siheng Li
Cheng Yang
Taiqiang Wu
Chufan Shi
Yuji Zhang
...
Jie Zhou
Yujiu Yang
Ngai Wong
Xixin Wu
Wai Lam
HILM
32
4
0
27 Sep 2024
HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection
Xuefeng Du
Chaowei Xiao
Yixuan Li
HILM
29
16
0
26 Sep 2024
Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
Van-Cuong Pham
Thien Huu Nguyen
LLMSV
37
3
0
16 Sep 2024
HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making
Sumera Anjum
Hanzhi Zhang
Wenjun Zhou
Eun Jin Paek
Xiaopeng Zhao
Yunhe Feng
31
1
0
16 Sep 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
47
2
0
16 Sep 2024
On the Relationship between Truth and Political Bias in Language Models
S. Fulay
William Brannon
Shrestha Mohanty
Cassandra Overney
Elinor Poole-Dayan
Deb Roy
Jad Kabbara
HILM
24
1
0
09 Sep 2024
Modularity in Transformers: Investigating Neuron Separability & Specialization
Nicholas Pochinkov
Thomas Jones
Mohammed Rashidur Rahman
27
0
0
30 Aug 2024
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
ALM
64
5
0
21 Aug 2024
A Little Confidence Goes a Long Way
J. Scoville
Shang Gao
Devanshu Agrawal
Javed Qadrud-Din
24
0
0
20 Aug 2024
Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused
Dingwei Chen
Feiteng Fang
Shiwen Ni
Feng Liang
Ruifeng Xu
Min Yang
Chengming Li
HILM
21
1
0
16 Aug 2024
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shiping Liu
Kecheng Zheng
Wei Chen
MLLM
44
33
0
31 Jul 2024
Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito
Sharan Maiya
Grégoire Dhimoïla
Owen Yeung
Kaarel Hänni
27
2
0
26 Jul 2024
Internal Consistency and Self-Feedback in Large Language Models: A Survey
Xun Liang
Shichao Song
Zifan Zheng
Hanyu Wang
Qingchen Yu
...
Rong-Hua Li
Peng Cheng
Zhonghao Wang
Feiyu Xiong
Zhiyu Li
HILM
LRM
65
25
0
19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
41
1
0
18 Jul 2024