Movie Facts and Fibs (MF²): A Benchmark for Long Movie Understanding

6 June 2025
Emmanouil Zaranis
António Farinhas
Saul Santos
Beatriz Canaverde
Miguel Moura Ramos
Aditya K Surikuchi
André Viveiros
Baohao Liao
Elena Bueno-Benito
Nithin Sivakumaran
Pavlo Vasylenko
Shoubin Yu
Sonal Sannigrahi
Wafaa Mohammed
Ben Peters
Danae Sánchez Villegas
Elias Stengel-Eskin
Giuseppe Attanasio
Jaehong Yoon
Stella Frank
Alessandro Suglia
Chrysoula Zerva
Desmond Elliott
Mariella Dimiccoli
Mohit Bansal
Oswald Lanz
Raffaella Bernardi
Raquel Fernández
Sandro Pezzelle
Vlad Niculae
André F. T. Martins
arXiv (abs) · PDF · HTML
Main: 16 pages · Appendix: 8 pages · Bibliography: 4 pages · 14 figures · 6 tables
Abstract

Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF², a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF² includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs, one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information, an ability current VLMs lack.
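
To make the pair-level protocol concrete, here is a minimal Python sketch of the scoring rule described in the abstract: a model earns credit for a (fact, fib) pair only when it labels both claims correctly. The model interface (is_claim_true) and the field names are hypothetical stand-ins for illustration, not the API of the MF² release.

# Minimal sketch of pair-level scoring: a pair counts as correct only
# when the model accepts the fact AND rejects the fib.
# model.is_claim_true is a hypothetical interface assumed to wrap a VLM
# prompted with the movie and one claim, returning a boolean verdict.

from dataclasses import dataclass

@dataclass
class ClaimPair:
    fact: str  # the true claim about the movie's narrative
    fib: str   # the plausible but false counterpart

def pair_accuracy(model, dataset: list[tuple[str, ClaimPair]]) -> float:
    """Fraction of pairs where the fact is accepted and the fib rejected."""
    if not dataset:
        return 0.0
    correct = 0
    for movie_id, pair in dataset:
        fact_ok = model.is_claim_true(movie_id, pair.fact)    # should be True
        fib_ok = not model.is_claim_true(movie_id, pair.fib)  # should be False
        correct += fact_ok and fib_ok
    return correct / len(dataset)

One consequence of scoring at the pair level: a random guesser that is right on each individual claim 50% of the time gets the whole pair right only 25% of the time, which helps separate chance performance from genuine understanding.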

@article{zaranis2025_2506.06275,
  title={Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding},
  author={Emmanouil Zaranis and António Farinhas and Saul Santos and Beatriz Canaverde and Miguel Moura Ramos and Aditya K Surikuchi and André Viveiros and Baohao Liao and Elena Bueno-Benito and Nithin Sivakumaran and Pavlo Vasylenko and Shoubin Yu and Sonal Sannigrahi and Wafaa Mohammed and Ben Peters and Danae Sánchez Villegas and Elias Stengel-Eskin and Giuseppe Attanasio and Jaehong Yoon and Stella Frank and Alessandro Suglia and Chrysoula Zerva and Desmond Elliott and Mariella Dimiccoli and Mohit Bansal and Oswald Lanz and Raffaella Bernardi and Raquel Fernández and Sandro Pezzelle and Vlad Niculae and André F. T. Martins},
  journal={arXiv preprint arXiv:2506.06275},
  year={2025}
}