Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

25 June 2024

Papers citing "Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective"

7 / 7 papers shown

Title
FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation Chaitali Bhattacharyya Yeseong Kim 45 0 0 01 May 2025
Multi-Faceted Multimodal Monosemanticity Hanqi Yan Xiangxiang Cui Lu Yin Paul Pu Liang Yulan He Yifei Wang 44 0 0 16 Feb 2025
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT Zhengfu He Xuyang Ge Qiong Tang Tianxiang Sun Qinyuan Cheng Xipeng Qiu 39 20 0 19 Feb 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity Andrew Lee Xiaoyan Bai Itamar Pres Martin Wattenberg Jonathan K. Kummerfeld Rada Mihalcea 77 96 0 03 Jan 2024
Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction Ashish Sharma Kevin Rushton Inna Wanyin Lin David Wadden Khendra G. Lucas Adam S. Miner Theresa Nguyen Tim Althoff 73 71 0 04 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 160 188 0 02 May 2023
On Feature Decorrelation in Self-Supervised Learning Tianyu Hua Wenxiao Wang Zihui Xue Sucheng Ren Yue Wang Hang Zhao SSL OOD 133 187 0 02 May 2021