Provable Adversarial Robustness in In-Context Learning
Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking the adversarial perturbation strength ($\epsilon$), the model capacity ($d$), and the number of in-context examples ($n$). The analysis reveals that model robustness scales with the square root of its capacity ($\sqrt{d}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($\epsilon^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.
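As a rough numerical illustration of the two stated scaling laws (robustness growing with $\sqrt{d}$, and a sample-complexity penalty growing with $\epsilon^2$), the sketch below encodes them in a toy form. The functional forms, constants, and function names are assumptions for illustration only, not the bound derived in the paper.

```python
import numpy as np

# Toy sketch of the abstract's two scaling claims; not the paper's actual bound.

def robustness_gain(d, c=1.0):
    # Claim 1: robustness scales with the square root of model capacity d.
    return c * np.sqrt(d)

def samples_needed(eps, n_clean=100, c=1.0):
    # Claim 2: an adversarial Wasserstein shift of radius eps adds a
    # sample-complexity penalty proportional to eps**2 (hypothetical constants).
    return n_clean * (1.0 + c * eps**2)

for d in (16, 64, 256):
    print(f"capacity d={d:>3}: relative robustness ~ {robustness_gain(d):.1f}")

for eps in (0.0, 0.5, 1.0, 2.0):
    print(f"eps={eps:.1f}: in-context examples needed ~ {samples_needed(eps):.0f}")
```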