ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward (arXiv:2404.08517)

12 April 2024
Xuan Xie
Jiayang Song
Zhehua Zhou
Yuheng Huang
Da Song
Lei Ma
    OffRL

Papers citing "Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward"

24 papers shown
ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs
Lu Yan
Siyuan Cheng
Xuan Chen
Kaiyuan Zhang
Guangyu Shen
Zhuo Zhang
Xiangyu Zhang
AAML
SILM
18
0
0
05 Oct 2024
LeCov: Multi-level Testing Criteria for Large Language Models
Xuan Xie
Jiayang Song
Yuheng Huang
Da Song
Fuyuan Zhang
Felix Juefei-Xu
Lei Ma
ELM
29
0
0
20 Aug 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
37
6
0
10 Jul 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
38
33
0
31 May 2024
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Swapnaja Achintalwar
Adriana Alvarado Garcia
Ateret Anaby-Tavor
Ioana Baldini
Sara E. Berger
...
Aashka Trivedi
Kush R. Varshney
Dennis L. Wei
Shalisha Witherspoon
Marcel Zalmanovici
25
10
0
09 Mar 2024
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Kevin Liu
Stephen Casper
Dylan Hadfield-Menell
Jacob Andreas
HILM
54
36
0
27 Nov 2023
Summarization is (Almost) Dead
Xiao Pu
Mingqi Gao
Xiaojun Wan
HILM
73
39
0
18 Sep 2023
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
ELM
ALM
180
787
0
02 May 2023
The Internal State of an LLM Knows When It's Lying
A. Azaria
Tom Michael Mitchell
HILM
218
299
0
26 Apr 2023
LINe: Out-of-Distribution Detection by Leveraging Important Neurons
Yong Hyun Ahn
Gyeong-Moon Park
Seong Tae Kim
OODD
111
31
0
24 Mar 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
286
2,232
0
22 Mar 2023
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Potsawee Manakul
Adian Liusie
Mark J. F. Gales
HILM
LRM
150
391
0
15 Mar 2023
Unifying Evaluation of Machine Learning Safety Monitors
Joris Guérin
Raul Sena Ferreira
Kevin Delmas
Jérémie Guiochet
33
11
0
31 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
311
11,915
0
04 Mar 2022
PatchCensor: Patch Robustness Certification for Transformers via Exhaustive Testing
Yuheng Huang
L. Ma
Yuanchun Li
ViT
AAML
28
8
0
19 Nov 2021
Trustworthy AI: From Principles to Practices
Bo-wen Li
Peng Qi
Bo Liu
Shuai Di
Jingen Liu
Jiquan Pei
Jinfeng Yi
Bowen Zhou
117
355
0
04 Oct 2021
Challenges in Detoxifying Language Models
Johannes Welbl
Amelia Glaese
J. Uesato
Sumanth Dathathri
John F. J. Mellor
Lisa Anne Hendricks
Kirsty Anderson
Pushmeet Kohli
Ben Coppin
Po-Sen Huang
LM&MA
242
193
0
15 Sep 2021
Types of Out-of-Distribution Texts and How to Detect Them
Udit Arora
William Huang
He He
OODD
225
97
0
14 Sep 2021
A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation
Tianyu Liu
Yizhe Zhang
Chris Brockett
Yi Mao
Zhifang Sui
Weizhu Chen
W. Dolan
HILM
219
143
0
18 Apr 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,986
0
31 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
228
4,460
0
23 Jan 2020
Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks
Guy Katz
Clark W. Barrett
D. Dill
Kyle D. Julian
Mykel Kochenderfer
AAML
226
1,835
0
03 Feb 2017
Safety Verification of Deep Neural Networks
Xiaowei Huang
M. Kwiatkowska
Sen Wang
Min Wu
AAML
178
932
0
21 Oct 2016
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Y. Gal
Zoubin Ghahramani
UQCV
BDL
282
9,136
0
06 Jun 2015