ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.03230
  4. Cited By
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

5 June 2024
Amelia Kawasaki
Andrew Davis
Houssam Abbas
    AAML
    KELM
ArXivPDFHTML

Papers citing "Defending Large Language Models Against Attacks With Residual Stream Activation Analysis"

13 / 13 papers shown
Title
AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks
AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks
Charlotte Siska
Anush Sankaran
AAML
77
0
0
10 Apr 2025
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially)
  Safer Language Models
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang
Kavel Rao
Seungju Han
Allyson Ettinger
Faeze Brahman
...
Niloofar Mireshghallah
Ximing Lu
Maarten Sap
Yejin Choi
Nouha Dziri
26
58
0
26 Jun 2024
TinyLlama: An Open-Source Small Language Model
TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang
Guangtao Zeng
Tianduo Wang
Wei Lu
ALM
LRM
96
374
0
04 Jan 2024
Jailbreaking Black Box Large Language Models in Twenty Queries
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
AAML
70
642
0
12 Oct 2023
Mistral 7B
Mistral 7B
Albert Q. Jiang
Alexandre Sablayrolles
A. Mensch
Chris Bamford
Devendra Singh Chaplot
...
Teven Le Scao
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoE
LRM
36
2,102
0
10 Oct 2023
Baseline Defenses for Adversarial Attacks Against Aligned Language
  Models
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain
Avi Schwarzschild
Yuxin Wen
Gowthami Somepalli
John Kirchenbauer
Ping Yeh-Chiang
Micah Goldblum
Aniruddha Saha
Jonas Geiping
Tom Goldstein
AAML
92
373
0
01 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language
  Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
152
1,376
0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
197
11,484
0
18 Jul 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
220
4,085
0
09 Jun 2023
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
Mrigank Raman
Pratyush Maini
J. Zico Kolter
Zachary Chase Lipton
Danish Pruthi
AAML
48
17
0
13 Mar 2023
Ignore Previous Prompt: Attack Techniques For Language Models
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez
Ian Ribeiro
SILM
69
420
0
17 Nov 2022
LoRA: Low-Rank Adaptation of Large Language Models
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
223
9,946
0
17 Jun 2021
Explaining and Harnessing Adversarial Examples
Explaining and Harnessing Adversarial Examples
Ian Goodfellow
Jonathon Shlens
Christian Szegedy
AAML
GAN
158
18,922
0
20 Dec 2014
1