ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2412.01784
84
1

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

2 December 2024
Cameron Tice
Philipp Alexander Kreer
Nathan Helm-Burger
Prithviraj Singh Shahani
Fedor Ryzhenkov
Jacob Haimes
Felix Hofstätter
Teun van der Weij
ArXivPDFHTML
Abstract

Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance or ``sandbagging.'' We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise injected sandbagging models show performance improvements compared to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.

View on arXiv
Comments on this paper