ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv: 2212.07919
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

15 December 2022
O. Yu. Golovneva
Moya Chen
Spencer Poff
Martin Corredor
Luke Zettlemoyer
Maryam Fazel-Zarandi
Asli Celikyilmaz
    ReLM
    LRM

Papers citing "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning"

50 / 111 papers shown
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani
Sachit Menon
Ahmet Iscen
Shyamal Buch
Ramin Mehran
...
Yukun Zhu
Carl Vondrick
Mikhail Sirotenko
Cordelia Schmid
Tobias Weyand
58
0
0
01 May 2025
Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics
Zena Al-Khalili
Nick Howell
Dietrich Klakow
LRM
29
0
0
24 Apr 2025
ChartQA-X: Generating Explanations for Charts
Shamanthak Hegde
Pooyan Fazli
H. Seifi
27
0
0
17 Apr 2025
Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification
Cristina Cornelio
Flavio Petruzzellis
Pietro Lio
33
0
0
06 Apr 2025
LLMs for Explainable AI: A Comprehensive Survey
Ahsan Bilal
David Ebert
Beiyu Lin
72
1
0
31 Mar 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
A. Vosoughi
Cheng Chen
Chenliang Xu
LRM
74
1
0
14 Mar 2025
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Steve Yang
C. Wang
Yidong Wang
Xiaotao Gu
Minlie Huang
J. Tang
LRM
LLMAG
64
0
0
13 Mar 2025
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Ayesha Ishaq
Jean Lahoud
Ketan More
Omkar Thawakar
Ritesh Thawkar
...
F. Khan
Hisham Cholakkal
Ivan Laptev
Rao Muhammad Anwer
Salman Khan
LRM
71
0
0
13 Mar 2025
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Dongzhi Jiang
Renrui Zhang
Ziyu Guo
Yanwei Li
Yu Qi
...
Shen Yan
Bo Zhang
Chaoyou Fu
Peng Gao
Hongsheng Li
MLLM
LRM
91
21
0
13 Feb 2025
Examining False Positives under Inference Scaling for Mathematical Reasoning
Yu Guang Wang
Nan Yang
Liang Wang
Furu Wei
LRM
67
3
0
10 Feb 2025
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Chengwen Qi
Ren Ma
Bowen Li
He Du
Binyuan Hui
Jinwang Wu
Yuanjun Laili
Conghui He
ReLM
LRM
86
2
0
10 Feb 2025
Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg
Gen Suzuki
Wei Liu
Murat Sensoy
ALM
82
0
0
07 Feb 2025
Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Zheqi Lv
Wenkai Wang
Jiawei Wang
Shengyu Zhang
Fei Wu
LRM
ReLM
51
0
0
10 Jan 2025
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Mingyang Song
Zhaochen Su
Xiaoye Qu
Jiawei Zhou
Yu-Xi Cheng
LRM
53
29
0
06 Jan 2025
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
Zihui Cheng
Qiguang Chen
Jin Zhang
Hao Fei
Xiaocheng Feng
Wanxiang Che
Min Li
L. Qin
VLM
MLLM
LRM
75
4
0
17 Dec 2024
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang
Jingdi Lei
Junxian Li
Xunzhi Wang
Y. Liu
...
Steve Yang
Jianbo Wu
Peng Ye
Wanli Ouyang
Dongzhan Zhou
OffRL
LRM
107
6
0
27 Nov 2024
Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality
Viktoriia Chekalina
Anton Razzigaev
Elizaveta Goncharova
Andrey Kuznetsov
KELM
71
0
0
18 Nov 2024
Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen
Mohammad Taher
Dongwan Hong
Vinicius Konkolics Possobom
Vibha Thirunellayi Gopalakrishnan
...
Zihang Li
H. J. Soled
Michael L. Birnbaum
Srijan Kumar
M. D. Choudhury
ELM
LM&MA
AI4MH
39
3
0
29 Oct 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
L. Qin
Qiguang Chen
Hao Fei
Zhi Chen
Min Li
Wanxiang Che
41
5
0
27 Oct 2024
ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
Xiaodong Yu
Ben Zhou
Hao Cheng
Dan Roth
ReLM
LRM
38
1
0
24 Oct 2024
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Xiongtao Zhou
Jie He
Lanyu Chen
Jingyu Li
Haojing Chen
Víctor Gutiérrez-Basulto
Jeff Z. Pan
H. Chen
LRM
57
1
0
18 Oct 2024
FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning
Ruosen Li
Ziming Luo
Xinya Du
LRM
29
0
0
08 Oct 2024
Rationale-Aware Answer Verification by Pairwise Self-Evaluation
Akira Kawabata
Saku Sugawara
LRM
33
3
0
07 Oct 2024
Defining Knowledge: Bridging Epistemology and Large Language Models
Constanza Fierro
Ruchira Dhar
Filippos Stamatiou
Nicolas Garneau
Anders Søgaard
KELM
23
4
0
03 Oct 2024
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen
Archiki Prasad
Elias Stengel-Eskin
Joey Tianyi Zhou
23
3
0
02 Oct 2024
Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning
Daniele Rege Cambrin
Giuseppe Gallipoli
Irene Benedetto
Luca Cagliero
Paolo Garza
28
0
0
20 Sep 2024
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Negreanu
Chris Parnin
Advait Sarkar
ELM
ALM
35
13
0
16 Aug 2024
CoverBench: A Challenging Benchmark for Complex Claim Verification
Alon Jacovi
Moran Ambar
Eyal Ben-David
Uri Shaham
Amir Feder
Mor Geva
Dror Marcus
Avi Caciularu
LMTD
49
3
0
06 Aug 2024
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Yunhong Wang
Alan Yuille
Zhuowan Li
Zilong Zheng
LRM
36
3
0
05 Aug 2024
Leveraging LLM Reasoning Enhances Personalized Recommender Systems
Zhe Wang
Adam Kraft
Long Jin
Chenwei Cai
Anahita Hosseini
Yuhua Ru
Zemin Zhang
Lichan Hong
Ed H. Chi
Xinyang Yi
LRM
26
7
0
22 Jul 2024
XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models
Erik Cambria
Lorenzo Malandri
Fabio Mercorio
Navid Nobani
Andrea Seveso
50
11
0
21 Jul 2024
Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
Nemika Tyagi
Mihir Parmar
Mohith Kulkarni
Aswin Rrv
Nisarg Patel
Mutsumi Nakamura
Arindam Mitra
Chitta Baral
LRM
37
6
0
20 Jul 2024
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors
Nico Daheim
Jakub Macina
Manu Kapur
Iryna Gurevych
Mrinmaya Sachan
LRM
40
5
0
12 Jul 2024
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
Mingqian He
Yongliang Shen
Wenqi Zhang
Zeqi Tan
Weiming Lu
LRM
35
5
0
29 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
54
62
0
26 Jun 2024
FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain
Jin Liu
Steffen Thoma
LRM
44
2
0
14 Jun 2024
Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning
Xiaohu Du
Ming Wen
Jiahao Zhu
Zifan Xie
Bin Ji
Huijun Liu
Xuanhua Shi
Hai Jin
37
14
0
06 Jun 2024
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction
Xiaoyuan Li
Wenjie Wang
Moxin Li
Junrong Guo
Yang Zhang
Fuli Feng
ELM
LRM
33
15
0
02 Jun 2024
M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Qiguang Chen
Libo Qin
Jin Zhang
Zhi Chen
Xiao Xu
Wanxiang Che
LRM
34
35
0
26 May 2024
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time
Jikun Kang
Xin Zhe Li
Xi Chen
Amirreza Kazemi
Qianyi Sun
...
Xu He
Quan He
Feng Wen
Jianye Hao
Jun Yao
LRM
ReLM
34
16
0
25 May 2024
Can LLMs Solve longer Math Word Problems Better?
Xin Xu
Tong Xiao
Zitong Chao
Zhenya Huang
Can Yang
Yang Wang
70
10
0
23 May 2024
LLMChain: Blockchain-based Reputation System for Sharing and Evaluating Large Language Models
Mouhamed Amine Bouchiha
Quentin Telnoff
Souhail Bakkali
R. Champagnat
Mourad Rabah
Mickael Coustaty
Y. Ghamri-Doudane
LRM
34
3
0
20 Apr 2024
Evaluating Mathematical Reasoning Beyond Accuracy
Shijie Xia
Xuefeng Li
Yixin Liu
Tongshuang Wu
Pengfei Liu
LRM
ReLM
47
21
0
08 Apr 2024
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
Shibo Hao
Yi Gu
Haotian Luo
Tianyang Liu
Xiyan Shao
...
Haodi Ma
Adithya Samavedhi
Qiyue Gao
Zhen Wang
Zhiting Hu
LRM
ELM
92
22
0
08 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Hai-Tao Zheng
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
55
36
0
07 Apr 2024
Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought
Jooyoung Lee
Fan Yang
Thanh Tran
Qian Hu
Emre Barut
Kai-Wei Chang
Chengwei Su
ReLM
LLMAG
LRM
21
10
0
04 Apr 2024
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
Philipp Mondorf
Barbara Plank
ELM
LRM
LM&MA
33
35
0
02 Apr 2024
Towards Generalizable and Faithful Logic Reasoning over Natural Language via Resolution Refutation
Zhouhao Sun
Xiao Ding
LI DU
Bibo Cai
Jin-Fang Gao
Ting Liu
Bing Qin
LRM
ReLM
34
0
0
02 Apr 2024
Recover: A Neuro-Symbolic Framework for Failure Detection and Recovery
Cristina Cornelio
Mohammed Diab
OffRL
30
9
0
31 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
56
7
0
21 Mar 2024