Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2005.00200
Cited By
v1
v2 (latest)
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
1 May 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
50 / 328 papers shown
Title
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
95
1
0
10 Mar 2023
Video Question Answering Using CLIP-Guided Visual-Text Attention
Shuhong Ye
Weikai Kong
Chenglin Yao
Jianfeng Ren
Xudong Jiang
64
11
0
06 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
175
242
0
27 Feb 2023
Localizing Moments in Long Video Via Multimodal Guidance
Wayner Barrios
Mattia Soldan
Alberto M. Ceballos-Arroyo
Fabian Caba Heilbron
Guohao Li
91
21
0
26 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
75
18
0
24 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
164
216
0
20 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
112
8
0
20 Feb 2023
Interactive Video Corpus Moment Retrieval using Reinforcement Learning
Zhixin Ma
Chong-Wah Ngo
73
3
0
19 Feb 2023
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Yimu Wang
Peng Shi
79
6
0
19 Feb 2023
COVID-VTS: Fact Extraction and Verification on Short Video Platforms
Fuxiao Liu
Yaser Yacoob
Abhinav Shrivastava
85
28
0
15 Feb 2023
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Min Peng
Chongyang Wang
Yu Shi
Xiang-Dong Zhou
ViT
93
7
0
04 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
104
32
0
01 Feb 2023
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
Pengzhen Ren
Changlin Li
Hang Xu
Yi Zhu
Guangrun Wang
Jian-zhuo Liu
Xiaojun Chang
Xiaodan Liang
106
45
0
31 Jan 2023
Multi-video Moment Ranking with Multimodal Clue
Danyang Hou
Liang Pang
Yanyan Lan
Huawei Shen
Xueqi Cheng
53
1
0
29 Jan 2023
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim
Jin-Hwa Kim
Jiyoung Lee
Minjoon Seo
VGen
80
14
0
27 Jan 2023
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
97
15
0
18 Jan 2023
In Defense of Structural Symbolic Representation for Video Event-Relation Prediction
Andrew Lu
Xudong Lin
Yulei Niu
Shih-Fu Chang
92
2
0
06 Jan 2023
HierVL: Learning Hierarchical Video-Language Embeddings
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
VLM
AI4TS
126
59
0
05 Jan 2023
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
117
4
0
05 Jan 2023
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Santhosh Kumar Ramakrishnan
Ziad Al-Halah
Kristen Grauman
207
42
0
02 Jan 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM
AI4TS
225
75
0
30 Dec 2022
Prototype-guided Cross-task Knowledge Distillation for Large-scale Models
Deng Li
Aming Wu
Yahong Han
Qingwen Tian
VLM
96
3
0
26 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
125
81
0
09 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
92
27
0
07 Dec 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
77
12
0
02 Dec 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
84
9
0
21 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
109
15
0
21 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
79
6
0
14 Nov 2022
Watching the News: Towards VideoQA Models that can Read
Soumya Jahagirdar
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
93
20
0
10 Nov 2022
Going for GOAL: A Resource for Grounded Football Commentaries
Alessandro Suglia
José Lopes
E. Bastianelli
Andrea Vanzo
Shubham Agarwal
Malvina Nikandrou
Lu Yu
Ioannis Konstas
Verena Rieser
69
5
0
08 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
50
2
0
07 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zhangyang Wang
Cong Yu
75
10
0
03 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
108
13
0
28 Oct 2022
Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung
Seongho Choi
Joo-Kyung Kim
Jin-Hwa Kim
Byoung-Tak Zhang
95
10
0
23 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
118
8
0
19 Oct 2022
Temporal Action Segmentation: An Analysis of Modern Techniques
Guodong Ding
Fadime Sener
Angela Yao
188
80
0
19 Oct 2022
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Jihyeon Janel Lee
Wooyoung Kang
Eun-Sol Kim
CoGe
54
4
0
19 Oct 2022
Selective Query-guided Debiasing for Video Corpus Moment Retrieval
Sunjae Yoon
Jiajing Hong
Eunseop Yoon
Dahyun Kim
Junyeong Kim
Hee Suk Yoon
Changdong Yoo
142
23
0
17 Oct 2022
Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy
Shiyuan Huang
Robinson Piramuthu
Shih-Fu Chang
Gunnar Sigurdsson
53
1
0
15 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
64
8
0
13 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
96
72
0
12 Oct 2022
Voteñ'Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin
Vladislav Mikhailov
Mikhail Florinskiy
A. Kravchenko
E. Tutubalina
Tatiana Shavrina
Daniel Karabekyan
Ekaterina Artemova
87
11
0
11 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
97
12
0
10 Oct 2022
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
Pedro Rodriguez
Mahmoud Azab
Becka Silvert
Renato Sanchez
Linzy Labson
Hardik Shah
Seungwhan Moon
110
1
0
10 Oct 2022
Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi
Mirella Lapata
VGen
99
13
0
10 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
95
2
0
08 Oct 2022
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Baoxiong Jia
Ting Lei
Song-Chun Zhu
Siyuan Huang
EgoV
89
65
0
08 Oct 2022
Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings
Zhihuan Kuang
Shi Zong
Jianbing Zhang
Jiajun Chen
Hongfu Liu
69
5
0
02 Oct 2022
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Ziyun Zeng
Yuying Ge
Xihui Liu
Bin Chen
Ping Luo
Shutao Xia
Yixiao Ge
AI4TS
98
8
0
30 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
137
31
0
28 Sep 2022
Previous
1
2
3
4
5
6
7
Next