Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2403.14743
Cited By
v1
v2
v3 (latest)
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
21 March 2024
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding"
50 / 65 papers shown
Title
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models
Vishal Narnaware
Ashmal Vayani
Rohit Gupta
Swetha Sirnam
Mubarak Shah
169
3
0
12 Feb 2025
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs
Tongshuang Wu
Haiyi Zhu
Maya Albayrak
Alexis Axon
Amanda Bertsch
...
Ying-Jui Tseng
Patricia Vaidos
Zhijin Wu
Wei Wu
Chenyang Yang
154
34
0
10 Jan 2025
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu
Yuheng Ding
Bingxuan Li
Pan Lu
Da Yin
Kai-Wei Chang
Nanyun Peng
LRM
129
4
0
03 Dec 2024
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Ashmal Vayani
Dinura Dissanayake
Hasindri Watawana
Noor Ahsan
Nevasini Sasikumar
...
Monojit Choudhury
Ivan Laptev
Mubarak Shah
Salman Khan
Fahad A Khan
219
16
0
25 Nov 2024
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B. Tenenbaum
Chuang Gan
128
195
0
15 May 2024
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
Omkar Thawakar
Ashmal Vayani
Salman Khan
Hisham Cholakal
Rao M. Anwer
Michael Felsberg
Timothy Baldwin
Eric P. Xing
Fahad Shahbaz Khan
102
35
0
26 Feb 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
176
99
0
29 Dec 2023
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Jianing Yang
Xuweiyi Chen
Shengyi Qian
Nikhil Madaan
Madhavan Iyengar
David Fouhey
Joyce Chai
LM&Ro
LLMAG
132
99
0
21 Sep 2023
Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?
Kai Sun
Yongjun Xu
Hanwen Zha
Yue Liu
Xinhsuai Dong
AI4MH
105
148
0
20 Aug 2023
UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin Qinghong Lin
Pengchuan Zhang
Joya Chen
Shraman Pramanick
Difei Gao
Alex Jinpeng Wang
Rui Yan
Mike Zheng Shou
82
122
0
31 Jul 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Lei Li
Gaoang Wang
VLM
MLLM
101
300
0
31 Jul 2023
Learning to Retrieve In-Context Examples for Large Language Models
Liang Wang
Nan Yang
Furu Wei
RALM
68
43
0
14 Jul 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
131
660
0
08 Jun 2023
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou
Zhihong Shao
Yeyun Gong
Yelong Shen
Yujiu Yang
Nan Duan
Weizhu Chen
KELM
LRM
87
396
0
19 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
134
140
0
11 May 2023
Teaching Large Language Models to Self-Debug
Xinyun Chen
Maxwell Lin
Nathanael Scharli
Denny Zhou
LRM
121
702
0
11 Apr 2023
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan
Niket Tandon
Prakhar Gupta
Skyler Hallinan
Luyu Gao
...
Bodhisattwa Prasad Majumder
Katherine Hermann
Sean Welleck
Amir Yazdanbakhsh
Peter Clark
ReLM
LRM
DiffM
178
1,678
0
30 Mar 2023
Language Models can Solve Computer Tasks
Geunwoo Kim
Pierre Baldi
Stephen Marcus McAleer
LLMAG
LM&Ro
143
374
0
30 Mar 2023
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances
Arun V. Reddy
Ketul Shah
William Paul
Rohita Mocharla
Judy Hoffman
Kapil D. Katyal
Dinesh Manocha
Celso M. de Melo
Ramalingam Chellappa
66
18
0
17 Mar 2023
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
1.5K
14,748
0
15 Mar 2023
ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís
Sachit Menon
Carl Vondrick
MLLM
LRM
ReLM
113
466
0
14 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
429
4,656
0
30 Jan 2023
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
O. Yu. Golovneva
Moya Chen
Spencer Poff
Martin Corredor
Luke Zettlemoyer
Maryam Fazel-Zarandi
Asli Celikyilmaz
ReLM
LRM
91
152
0
15 Dec 2022
Visual Programming: Compositional visual reasoning without training
Tanmay Gupta
Aniruddha Kembhavi
ReLM
VLM
LRM
141
439
0
18 Nov 2022
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
Hung Le
Yue Wang
Akhilesh Deepak Gotmare
Silvio Savarese
Guosheng Lin
SyDa
ALM
209
265
0
05 Jul 2022
Self-critiquing models for assisting human evaluators
William Saunders
Catherine Yeh
Jeff Wu
Steven Bills
Ouyang Long
Jonathan Ward
Jan Leike
ALM
ELM
109
306
0
12 Jun 2022
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Xudong Lin
Simran Tiwari
Shiyuan Huang
Manling Li
Mike Zheng Shou
Heng Ji
Shih-Fu Chang
90
21
0
05 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
418
3,607
0
29 Apr 2022
Training Language Models with Language Feedback
Jérémy Scheurer
Jon Ander Campos
Jun Shern Chan
Angelica Chen
Kyunghyun Cho
Ethan Perez
ALM
114
51
0
29 Apr 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
101
95
0
30 Mar 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
228
1,206
0
23 Mar 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
85
41
0
20 Jan 2022
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
153
693
0
02 Dec 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
111
221
0
24 Nov 2021
Truthful AI: Developing and governing AI that does not lie
Owain Evans
Owen Cotton-Barratt
Lukas Finnveden
Adam Bales
Avital Balwit
Peter Wills
Luca Righetti
William Saunders
HILM
289
117
0
13 Oct 2021
MURAL: Multimodal, Multitask Retrieval Across Languages
Aashi Jain
Mandy Guo
Krishna Srinivasan
Ting-Li Chen
Sneha Kudugunta
Chao Jia
Yinfei Yang
Jason Baldridge
VLM
157
52
0
10 Sep 2021
Anticipative Video Transformer
Rohit Girdhar
Kristen Grauman
ViT
65
211
0
03 Jun 2021
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
97
506
0
18 May 2021
Visual Semantic Role Labeling for Video Understanding
Arka Sadhu
Tanmay Gupta
Mark Yatskar
Ram Nevatia
Aniruddha Kembhavi
VLM
70
71
0
02 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
225
2,168
0
29 Mar 2021
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
Li Xu
He Huang
Jun Liu
ViT
LRM
91
88
0
29 Mar 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
116
380
0
31 Dec 2020
Deep Learning-Based Human Pose Estimation: A Survey
Ce Zheng
Wenhan Wu
Chong Chen
Taojiannan Yang
Sijie Zhu
Ju Shen
N. Kehtarnavaz
M. Shah
3DH
201
592
0
24 Dec 2020
SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos
Adrien Deliège
A. Cioppa
Silvio Giancola
M. J. Seikavandi
J. Dueholm
Kamal Nasrollahi
Guohao Li
T. Moeslund
Marc Van Droogenbroeck
82
153
0
26 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
127
422
0
14 Nov 2020
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou
Graham Neubig
Jiatao Gu
Mona T. Diab
P. Guzmán
Luke Zettlemoyer
Marjan Ghazvininejad
HILM
111
200
0
05 Nov 2020
MovieNet: A Holistic Dataset for Movie Understanding
Qingqiu Huang
Yu Xiong
Anyi Rao
Jiaze Wang
Dahua Lin
VGen
95
244
0
21 Jul 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
882
42,463
0
28 May 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
140
1,947
0
13 Apr 2020
Dense Regression Network for Video Grounding
Runhao Zeng
Haoming Xu
Wenbing Huang
Peihao Chen
Mingkui Tan
Chuang Gan
84
283
0
07 Apr 2020
1
2
Next