ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.17089
21
0

Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

20 May 2025
Md Rafi Ur Rashid
Vishnu Asutosh Dasu
Ye Wang
Gang Tan
Shagufta Mehnaz
    AAML
    ELM
ArXivPDFHTML
Abstract

Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

View on arXiv
@article{rashid2025_2505.17089,
  title={ Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models },
  author={ Md Rafi Ur Rashid and Vishnu Asutosh Dasu and Ye Wang and Gang Tan and Shagufta Mehnaz },
  journal={arXiv preprint arXiv:2505.17089},
  year={ 2025 }
}
Comments on this paper