ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.02252
  4. Cited By
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
v1v2v3 (latest)

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

4 June 2023
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
ArXiv (abs)PDFHTML

Papers citing "MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning"

50 / 61 papers shown
Title
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic
  Understanding with Scene and Topic Transitions
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang
Zilong Zheng
Xueliang Zhao
Jinpeng Li
Yueqian Wang
Dongyan Zhao
VGen
66
9
0
30 May 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.5K
14,748
0
15 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLMMLLM
429
4,642
0
30 Jan 2023
An Empirical Study of End-to-End Video-Language Transformers with Masked
  Visual Modeling
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
83
65
0
04 Sep 2022
Revealing Single Frame Bias for Video-and-Language Learning
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
75
114
0
07 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLMVLM
418
3,607
0
29 Apr 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for
  Self-Supervised Video Pre-Training
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
228
1,206
0
23 Mar 2022
MViTv2: Improved Multiscale Vision Transformers for Classification and
  Detection
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
153
693
0
02 Dec 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
  Modeling
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
111
221
0
24 Nov 2021
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViTTPM
477
7,827
0
11 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLMCLIP
108
643
0
09 Nov 2021
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
221
1,975
0
16 Jul 2021
MERLOT: Multimodal Neural Script Knowledge Models
MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLMLRM
104
383
0
04 Jun 2021
Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
Shixing Chen
Xiaohan Nie
David D. Fan
Dongqing Zhang
Vimal Bhat
Raffay Hamid
SSL
62
62
0
28 Apr 2021
VGNMN: Video-grounded Neural Module Network to Video-Grounded Language
  Tasks
VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
Hung Le
Nancy F. Chen
Guosheng Lin
MLLM
70
19
0
16 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
163
1,186
0
01 Apr 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIPVLM
978
29,871
0
26 Feb 2021
Learning Temporal Dynamics from Cycles in Narrated Video
Learning Temporal Dynamics from Cycles in Narrated Video
Dave Epstein
Jiajun Wu
Cordelia Schmid
Chen Sun
AI4TS
95
14
0
07 Jan 2021
ActBERT: Learning Global-Local Video-Text Representations
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
127
422
0
14 Nov 2020
Self-supervised pre-training and contrastive representation learning for
  multiple-choice video QA
Self-supervised pre-training and contrastive representation learning for multiple-choice video QA
Seonhoon Kim
Seohyeong Jeong
Eunbyul Kim
Inho Kang
Nojun Kwak
SSL
93
40
0
17 Sep 2020
MovieNet: A Holistic Dataset for Movie Understanding
MovieNet: A Holistic Dataset for Movie Understanding
Qingqiu Huang
Yu Xiong
Anyi Rao
Jiaze Wang
Dahua Lin
VGen
95
242
0
21 Jul 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through
  Scene Graph
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
101
380
0
30 Jun 2020
Self-Supervised MultiModal Versatile Networks
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
146
375
0
29 Jun 2020
Video-Grounded Dialogues with Pretrained Generation Language Models
Video-Grounded Dialogues with Pretrained Generation Language Models
Hung Le
Guosheng Lin
72
28
0
27 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation
  Learning
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjDVLM
81
498
0
11 Jun 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation
  Pre-training
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLMVLMOffRLAI4TS
129
504
0
01 May 2020
X3D: Expanding Architectures for Efficient Video Recognition
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
146
1,024
0
09 Apr 2020
A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
Anyi Rao
Linning Xu
Yu Xiong
Guodong Xu
Qingqiu Huang
Bolei Zhou
Dahua Lin
67
112
0
06 Apr 2020
Deep Multimodal Feature Encoding for Video Ordering
Deep Multimodal Feature Encoding for Video Ordering
Vivek Sharma
Makarand Tapaswi
Rainer Stiefelhagen
60
10
0
05 Apr 2020
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen
Yida Zhao
Qin Jin
Qi Wu
96
314
0
01 Mar 2020
Hierarchical Conditional Relation Networks for Video Question Answering
Hierarchical Conditional Relation Networks for Video Question Answering
T. Le
Vuong Le
Svetha Venkatesh
T. Tran
79
260
0
25 Feb 2020
A Graph-Based Framework to Bridge Movies and Synopses
A Graph-Based Framework to Bridge Movies and Synopses
Yu Xiong
Chengyi Zhang
Lingfeng Guo
Hang Zhou
Bolei Zhou
Dahua Lin
74
63
0
24 Oct 2019
UNITER: UNiversal Image-TExt Representation Learning
UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
VLMOT
116
448
0
25 Sep 2019
LXMERT: Learning Cross-Modality Encoder Representations from
  Transformers
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLMMLLM
252
2,493
0
20 Aug 2019
Fusion of Detected Objects in Text for Visual Question Answering
Fusion of Detected Objects in Text for Visual Question Answering
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter
67
173
0
14 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
153
1,965
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSLVLM
246
3,695
0
06 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
686
24,557
0
26 Jul 2019
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang
Zihang Dai
Yiming Yang
J. Carbonell
Ruslan Salakhutdinov
Quoc V. Le
AI4CE
236
8,451
0
19 Jun 2019
TVQA+: Spatio-Temporal Grounding for Video Question Answering
TVQA+: Spatio-Temporal Grounding for Video Question Answering
Jie Lei
Licheng Yu
Tamara L. Berg
Joey Tianyi Zhou
73
229
0
25 Apr 2019
Relation-Aware Graph Attention Network for Visual Question Answering
Relation-Aware Graph Attention Network for Visual Question Answering
Linjie Li
Zhe Gan
Yu Cheng
Jingjing Liu
GNN
175
345
0
29 Mar 2019
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
169
3,282
0
10 Dec 2018
Auto-Encoding Scene Graphs for Image Captioning
Auto-Encoding Scene Graphs for Image Captioning
Xu Yang
Kaihua Tang
Hanwang Zhang
Jianfei Cai
165
699
0
06 Dec 2018
From Recognition to Cognition: Visual Commonsense Reasoning
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers
Yonatan Bisk
Ali Farhadi
Yejin Choi
LRMBDLOCLReLM
178
883
0
27 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.8K
95,229
0
11 Oct 2018
TVQA: Localized, Compositional Video Question Answering
TVQA: Localized, Compositional Video Question Answering
Muhammad Abdul Wahab
Licheng Yu
Mounir Nasr Allah
Tamara L. Berg
99
642
0
05 Sep 2018
Representation Learning with Contrastive Predictive Coding
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRLSSL
351
10,364
0
10 Jul 2018
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
Paul Vicol
Makarand Tapaswi
Lluis Castrejon
Sanja Fidler
71
142
0
19 Dec 2017
Non-local Neural Networks
Non-local Neural Networks
Xinyu Wang
Ross B. Girshick
Abhinav Gupta
Kaiming He
OffRL
300
8,918
0
21 Nov 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual
  Question Answering
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
123
4,223
0
25 Jul 2017
12
Next