What Makes an Evaluation Useful? Common Pitfalls and Best Practices

30 March 2025
Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
Abstract

Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.

@article{gekker2025_2503.23424,
  title={What Makes an Evaluation Useful? Common Pitfalls and Best Practices},
  author={Gil Gekker and Meirav Segal and Dan Lahav and Omer Nevo},
  journal={arXiv preprint arXiv:2503.23424},
  year={2025}
}