CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Abstract

Evaluating creative text such as human-written stories with language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the human thinking process, chain-of-thought (CoT) prompting generates free-text explanations that guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that widely used self-consistency reasoning methods yield suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually producing a good rating prediction for an aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then generate a diverse set of such keyword sequences and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe with our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
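The sketch below illustrates, in Python, the inference loop the abstract describes: sample several keyword sequences, condition a rationale and a rating on each, and aggregate the resulting scores. It is a minimal illustration, not the authors' implementation; the evaluator function, prompts, and the 1-5 rating scale are assumptions standing in for the paper's fine-tuned evaluation model.

import random
from collections import Counter

def evaluator(prompt, temperature=1.0):
    # Hypothetical LM call; replace with the actual evaluation model or API.
    # Dummy behaviour so the sketch runs end to end: return a rating string.
    return str(random.randint(1, 5))

def coke_score(story, aspect, num_samples=8):
    """Aggregate ratings over diverse keyword-conditioned rationales."""
    ratings = []
    for _ in range(num_samples):
        # Step 1: generate keywords first, sampled with temperature for diversity.
        keywords = evaluator(
            f"List keywords relevant to the {aspect} of this story:\n{story}",
            temperature=1.0,
        )
        # Step 2: generate a free-text rationale conditioned on the keywords.
        rationale = evaluator(
            f"Keywords: {keywords}\nExplain the {aspect} of the story:\n{story}"
        )
        # Step 3: predict a rating conditioned on keywords and rationale.
        rating = evaluator(
            f"Keywords: {keywords}\nRationale: {rationale}\n"
            f"Rate the {aspect} of the story from 1 to 5:\n{story}"
        )
        ratings.append(int(rating))
    # Aggregate the sampled ratings; averaging is shown here, and a majority
    # vote (e.g. Counter(ratings).most_common(1)) is an alternative.
    return sum(ratings) / len(ratings)

if __name__ == "__main__":
    print(coke_score("Once upon a time ...", "character development"))

Averaging over keyword-conditioned samples mirrors how Self-Consistency marginalizes over explanations, except that diversity is introduced at the keyword stage rather than in the free-text rationale itself.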

@article{joshi2025_2503.17136,
  title={CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization},
  author={Brihi Joshi and Sriram Venkatapathy and Mohit Bansal and Nanyun Peng and Haw-Shiuan Chang},
  journal={arXiv preprint arXiv:2503.17136},
  year={2025}
}