Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

9 March 2024
Swapnaja Achintalwar, Adriana Alvarado Garcia, Ateret Anaby-Tavor, Ioana Baldini, Sara E. Berger, Bishwaranjan Bhattacharjee, Djallel Bouneffouf, Subhajit Chaudhury, Pin-Yu Chen, Lamogha Chiazor, Elizabeth M. Daly, Kirushikesh DB, Rogério Abreu de Paula, Pierre L. Dognin, E. Farchi, Soumya Ghosh, Michael Hind, R. Horesh, George Kour, Ja Young Lee, Nishtha Madaan, Sameep Mehta, Erik Miehling, K. Murugesan, Manish Nagireddy, Inkit Padhi, David Piorkowski, Ambrish Rawat, Orna Raz, P. Sattigeri, Hendrik Strobelt, Sarathkrishna Swaminathan, Christoph Tillmann, Aashka Trivedi, Kush R. Varshney, Dennis L. Wei, Shalisha Witherspoon, Marcel Zalmanovici
Abstract

Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models, from acting as guardrails to enabling effective AI governance. We also take a deep dive into the inherent challenges of their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
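The guardrail use case mentioned in the abstract can be illustrated with a short sketch: a compact classifier screens an LLM's output before it reaches the user. The snippet below is a minimal illustration, not the authors' detector library; the Hugging Face pipeline API is real, but the checkpoint (`unitary/toxic-bert`), the `toxic` label name, and the threshold are placeholder assumptions covering a single harm dimension.

```python
# Minimal sketch of the detector-as-guardrail pattern described in the abstract.
# Assumptions: the "unitary/toxic-bert" checkpoint, the "toxic" label name, and the
# 0.5 threshold are illustrative placeholders, not the detectors from this paper.
from transformers import pipeline

# A compact classification model that scores one harm dimension (toxicity).
detector = pipeline("text-classification", model="unitary/toxic-bert")

def guarded_generate(generate_fn, prompt: str, threshold: float = 0.5) -> str:
    """Call an LLM, then withhold its output if the detector flags it as harmful."""
    candidate = generate_fn(prompt)      # any text-in/text-out LLM callable
    top = detector(candidate)[0]         # e.g. {"label": "toxic", "score": 0.87}
    if top["label"].lower() == "toxic" and top["score"] >= threshold:
        return "[response withheld: flagged by toxicity detector]"
    return candidate

# Example usage (my_llm is hypothetical):
# reply = guarded_generate(my_llm.generate, "Summarize this customer complaint ...")
```

For the governance use also noted in the abstract, the same detector scores could be logged and aggregated across requests rather than used to block individual responses.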

View on arXiv: https://arxiv.org/abs/2403.06009