ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.16947
50
0

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

22 May 2025
Csaba Dékány
Stefan Balauca
Robin Staab
Dimitar I. Dimitrov
Martin Vechev
    AAML
ArXiv (abs)PDFHTML
Main:10 Pages
7 Figures
Bibliography:3 Pages
19 Tables
Appendix:10 Pages
Abstract

Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models atthis https URL.

View on arXiv
@article{dékány2025_2505.16947,
  title={ MixAT: Combining Continuous and Discrete Adversarial Training for LLMs },
  author={ Csaba Dékány and Stefan Balauca and Robin Staab and Dimitar I. Dimitrov and Martin Vechev },
  journal={arXiv preprint arXiv:2505.16947},
  year={ 2025 }
}
Comments on this paper