2005.00200
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
1 May 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
Papers citing
"HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
50 / 328 papers shown
Title
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming-Hsuan Yang
Yuan Qi
115
11
0
20 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
79
4
0
16 Sep 2023
EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
Yue Xu
Yong-Lu Li
Zhemin Huang
Michael Xu Liu
Cewu Lu
Yu-Wing Tai
Chi-Keung Tang
EgoV
62
10
0
05 Sep 2023
ATM: Action Temporality Modeling for Video Question Answering
Junwen Chen
Jie Zhu
Yu Kong
69
1
0
05 Sep 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
137
61
0
04 Sep 2023
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains
Divyanshu Raj
Chitta Baral
N. Gopalan
128
1
0
01 Sep 2023
Distraction-free Embeddings for Robust VQA
Atharvan Dogra
Deeksha Varshney
Ashwin Kalyan
Ameet Deshpande
Neeraj Kumar
102
0
0
31 Aug 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
84
29
0
28 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu
Yanyang Li
Michael R. Lyu
Liwei Wang
VLM
90
16
0
18 Aug 2023
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
Dohwan Ko
Ji Soo Lee
M. Choi
Jaewon Chu
Jihwan Park
Hyunwoo J. Kim
55
6
0
18 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
119
19
0
16 Aug 2023
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
Chaorui Deng
Qi Chen
Pengda Qin
Dave Zhenyu Chen
Qi Wu
VLM
CLIP
83
34
0
15 Aug 2023
JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery
Jiahao Li
Zongxin Yang
Xiaohan Wang
Jianxin Ma
Chang Zhou
Yi Yang
100
13
0
31 Jul 2023
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Sarah Ibrahimi
Xiaohang Sun
Pichao Wang
Amanmeet Garg
Ashutosh Sanan
Mohamed Omar
101
18
0
24 Jul 2023
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
Peng Wu
Jing Liu
Xiangteng He
Yuxin Peng
Peng Wang
Yanning Zhang
124
34
0
24 Jul 2023
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Qi Zhang
S. Zheng
Qin Jin
90
0
0
20 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
221
1
0
14 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
124
100
0
11 Jul 2023
Reading Between the Lanes: Text VideoQA on the Road
George Tom
Minesh Mathew
Sergi Garcia
Dimosthenis Karatzas
C. V. Jawahar
88
8
0
08 Jul 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM
CLIP
83
9
0
15 Jun 2023
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Weizhen He
Yihe Deng
Shixiang Tang
Qihao Chen
Qingsong Xie
...
Feng Zhu
Rui Zhao
Wanli Ouyang
Donglian Qi
Yunfeng Yan
125
25
0
13 Jun 2023
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Fuxiao Liu
Hao Tan
Chris Tensmeyer
CLIP
VLM
103
18
0
09 Jun 2023
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
Haiyang Xu
Qinghao Ye
Xuan-Wei Wu
Mingshi Yan
Yuan Miao
...
Qingfang Qian
Maofei Que
Ji Zhang
Xiaoyan Zeng
Feiyan Huang
VLM
MLLM
101
25
0
07 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
85
24
0
06 Jun 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
96
1
0
04 Jun 2023
Learning Emotion Representations from Verbal and Nonverbal Communication
Sitao Zhang
Yimu Pan
Jianmin Wang
VLM
135
24
0
22 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
108
18
0
22 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Joey Tianyi Zhou
Heng Ji
KELM
VGen
51
32
0
18 May 2023
TG-VQA: Ternary Game of Video Question Answering
Hao Li
Peng Jin
Ze-Long Cheng
Songyang Zhang
Kai-xiang Chen
Zhennan Wang
Chang-rui Liu
Jie Chen
90
10
0
17 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
72
3
0
13 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
147
142
0
11 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Wen-tau Yih
VGen
58
3
0
04 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
90
2
0
03 May 2023
In-Context Learning Unlocked for Diffusion Models
Zhendong Wang
Yi Ding
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zhangyang Wang
Mingyuan Zhou
VLM
DiffM
150
78
0
01 May 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
63
13
0
18 Apr 2023
Delving into Shape-aware Zero-shot Semantic Segmentation
Xinyu Liu
Beiwen Tian
Zhen Wang
Rui Wang
Kehua Sheng
Bo Zhang
Hao Zhao
Guyue Zhou
VLM
106
20
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
VLM
136
112
0
17 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
111
38
0
09 Apr 2023
Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines
Yaochen Zhu
Xiangqing Shen
Rui Xia
121
4
0
05 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
96
6
0
04 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
125
40
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
125
50
0
31 Mar 2023
Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
Yiwu Zhong
Licheng Yu
Yang Bai
Shangwen Li
Xueting Yan
Yin Li
AI4TS
106
34
0
31 Mar 2023
Hierarchical Video-Moment Retrieval and Step-Captioning
Abhaysinh Zala
Jaemin Cho
Satwik Kottur
Xilun Chen
Barlas Oğuz
Yashar Mehdad
Joey Tianyi Zhou
3DV
95
54
0
29 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
62
0
0
28 Mar 2023
SEM-POS: Grammatically and Semantically Correct Video Captioning
Asmar Nadeem
A. Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
73
8
0
26 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
121
23
0
23 Mar 2023
VideoXum: Cross-modal Visual and Textural Summarization of Videos
Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jenhao Hsiao
C. Ho
Jiebo Luo
109
33
0
21 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
78
10
0
12 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
82
11
0
11 Mar 2023