Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback
- AIFinLRM
Language model agents reason from scratch on every query, discarding their chain of thought after each run. This yields lower accuracy and high variance: the same query type can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists per-evidence chains of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that retrieve distilled strategies by query similarity, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, the two graphs form a self-improving feedback loop: accuracy improves systematically and verdict-level variance collapses. No retraining is required; the base model remains frozen, and all gains come from context engineering via graph traversal. We evaluate on MuSiQue and HotpotQA using a sequential cluster protocol, a high-reuse deployment simulation, and a determinism experiment. At 50%+ evidence-profile coverage, our system reduces errors by 47% relative to vanilla RAG on the same questions (controlled dose-response, p < 0.0001). On 4-hop questions, accuracy improves by +11.0pp (p = 0.0001). In high-reuse settings, the system achieves Pareto dominance: highest accuracy, 47% lower cost, and 46% lower latency. Evidence profiles improve verdict consistency by 7-8 percentage points (p = 0.007, Wilcoxon); the full system drives all 11 hard probes to perfect consistency at both temperature 0 and 0.5 (p = 0.004).
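The evidence-centric lookup the abstract describes can be sketched as follows. This is an illustrative minimal implementation, not the paper's actual system: the class name, edge schema, and method names (`record`, `feedback`) are assumptions chosen for clarity.

```python
from collections import defaultdict

class ReasoningGraph:
    """Illustrative sketch (hypothetical API, not the paper's code):
    persists per-evidence evaluation edges across runs so that later
    runs can traverse how each evidence item was judged before."""

    def __init__(self):
        # evidence_id -> list of (run_id, verdict, rationale) edges
        self.eval_edges = defaultdict(list)

    def record(self, run_id, evidence_id, verdict, rationale):
        # Persist one chain-of-thought judgment as a structured edge
        # attached to the evidence item it evaluates.
        self.eval_edges[evidence_id].append((run_id, verdict, rationale))

    def feedback(self, candidate_ids):
        # Evidence-centric feedback: for each candidate in the new set,
        # traverse all incoming evaluation edges from all prior runs.
        return {eid: self.eval_edges.get(eid, []) for eid in candidate_ids}

# Two prior runs judged doc42; one judged doc7; doc99 is unseen.
g = ReasoningGraph()
g.record("run1", "doc42", "supports", "states founding year directly")
g.record("run2", "doc42", "supports", "corroborates hop-1 answer")
g.record("run1", "doc7", "irrelevant", "different entity, same name")

fb = g.feedback(["doc42", "doc7", "doc99"])
# doc42 carries two prior "supports" edges, doc7 one "irrelevant"
# edge, and doc99 has no history, so it must be judged from scratch.
```

The key design point is that feedback is keyed by evidence item, not by query similarity: a new question over a previously seen document inherits every prior judgment of that document.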
View on arXiv