Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

11 March 2026

Md Muntaqim Meherab

Noor Islam S. Mohammad

Faiza Feroz

BDL

CML

LRM


Main:10 Pages

8 Figures

Bibliography:3 Pages

8 Tables

Appendix:2 Pages

Abstract

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n=15$ paired runs), CCG achieves $\mathrm{CFS}=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
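For readers unfamiliar with DAGMA-style structure learning, the sketch below illustrates the log-determinant acyclicity function that DAGMA uses to keep a learned weighted adjacency matrix acyclic: $h(W) = -\log\det(sI - W \circ W) + d\log s$, which is zero when $W$ encodes a DAG and positive when it contains a cycle. This is only an illustration of the general technique, not the authors' implementation; the function name `dagma_acyclicity`, the toy 3-concept matrices, and the choice $s=1$ are assumptions made for the example.

```python
import numpy as np


def dagma_acyclicity(W, s=1.0):
    """Log-det acyclicity value h(W) = -logdet(sI - W*W) + d*log(s).

    W is a (d, d) weighted adjacency matrix over concepts (hypothetical here);
    W*W is the elementwise square, as in DAGMA.
    """
    d = W.shape[0]
    M = s * np.eye(d) - W * W
    sign, logabsdet = np.linalg.slogdet(M)
    # Cheap necessary check that W lies in the region where h is defined.
    if sign <= 0:
        raise ValueError("sI - W*W has non-positive determinant; increase s")
    return -logabsdet + d * np.log(s)


# Toy example: a 3-concept chain 0 -> 1 -> 2 (acyclic).
W_dag = np.array([
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

# Same graph with an extra edge 2 -> 0, which closes a cycle.
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.6

h_dag = dagma_acyclicity(W_dag)  # zero (up to floating point) for a DAG
h_cyc = dagma_acyclicity(W_cyc)  # strictly positive for a cyclic graph
print(f"h(DAG) = {h_dag:.6f}, h(cyclic) = {h_cyc:.6f}")
```

In a differentiable structure learner, a term like this is typically added to the fitting loss as a penalty, so gradient descent is pushed toward adjacency matrices with no directed cycles.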
