arXiv:2505.21733

Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study

27 May 2025
Taein Kim
Karstan Bock
Claire Luo
Amanda Liswood
Emily Wenger
Main: 13 pages
Figures: 10
Bibliography: 2 pages
Tables: 10
Appendix: 1 page
Abstract

Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots.txt experiments. We find that bots are less likely to comply with stricter robots.txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots.txt at all. These findings suggest that relying on robots.txt files to prevent unwanted scraping is risky and highlight the need for alternative approaches.
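
For readers unfamiliar with the REP, the following is a minimal sketch of the check a compliant crawler would perform before fetching a page. It is not the study's methodology. The robots.txt content, the bot names (`ExampleAIBot`, `OtherBot`), the `example.org` URLs, and the helper `is_allowed` are all hypothetical and chosen only for illustration; the parsing itself uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not taken from the paper):
# one named bot is barred from the whole site, while every other bot is
# barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.org/index.html"),
        ("OtherBot", "https://example.org/index.html"),
        ("OtherBot", "https://example.org/private/report.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Compliance is voluntary: nothing in this check stops a scraper from skipping it, which is why the abstract's observation that some bots rarely request robots.txt at all matters.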
