ResearchTrend.AI
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

1 May 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
    MLLM, VLM, OffRL, AI4TS

Papers citing "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"

50 / 328 papers shown
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
103
57
0
14 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
79
2
0
12 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
117
11
0
09 May 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
117
2
0
22 Apr 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
111
41
0
05 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
87
1
0
01 Apr 2024
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
Tianming Liang
Chaolei Tan
Beihao Xia
Wei-Shi Zheng
Jianfang Hu
81
1
0
21 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL, MoMe
90
11
0
13 Mar 2024
AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors
Saelyne Yang
Jo Vermeulen
G. Fitzmaurice
Justin Matejka
47
8
0
08 Mar 2024
TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning
Kate Sanders
Nathaniel Weir
Benjamin Van Durme
LRM
97
11
0
29 Feb 2024
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
106
5
0
21 Feb 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
84
2
0
21 Feb 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen, VLM
114
50
0
20 Feb 2024
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
David Romero
Thamar Solorio
146
4
0
16 Feb 2024
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-yang Liu
Chien-Yi Wang
Hongxu Yin
Pavlo Molchanov
Yu-Chiang Frank Wang
Kwang-Ting Cheng
Min-Hung Chen
157
422
0
14 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
60
5
0
14 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
80
3
0
31 Jan 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Jianlong Wu
Xiangyuan Ren
Yuan Cheng
Wei Chu
63
5
0
31 Jan 2024
YTCommentQA: Video Question Answerability in Instructional Videos
Saelyne Yang
Sunghyun Park
Yunseok Jang
Moontae Lee
114
3
0
30 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
80
0
0
22 Jan 2024
ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition
Jiaming Zhou
Junwei Liang
Kun-Yu Lin
Jinrui Yang
Wei-Shi Zheng
VLM
94
8
0
22 Jan 2024
Detours for Navigating Instructional Videos
Kumar Ashutosh
Zihui Xue
Tushar Nagarajan
Kristen Grauman
127
6
0
03 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
222
100
0
29 Dec 2023
Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain
Hsiu-yu Yang
Carina Silberer
63
1
0
18 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
107
15
0
13 Dec 2023
Grounded Question-Answering in Long Egocentric Videos
Shangzhe Di
Weidi Xie
134
27
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM, MLLM
74
44
0
11 Dec 2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding
Hongjie Zhang
Yi Liu
Lu Dong
Yifei Huang
Z. Ling
Yali Wang
Limin Wang
Yu Qiao
99
31
0
08 Dec 2023
Generating Illustrated Instructions
Sachit Menon
Ishan Misra
Rohit Girdhar
DiffM
86
5
0
07 Dec 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
95
37
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
157
0
0
28 Nov 2023
Characterizing Video Question Answering with Sparsified Inputs
Shiyuan Huang
Robinson Piramuthu
Vicente Ordonez
Shih-Fu Chang
Gunnar Sigurdsson
55
0
0
27 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
94
19
0
13 Nov 2023
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Fatema Hasan
Yulong Li
James R. Foulds
Shimei Pan
Bishwaranjan Bhattacharjee
79
2
0
13 Nov 2023
Active Reasoning in an Open-World Environment
Manjie Xu
Guangyuan Jiang
Weihan Liang
Fangqiu Yi
Yixin Zhu
LLMAG, LRM
67
11
0
03 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
116
8
0
02 Nov 2023
Modular Blended Attention Network for Video Question Answering
Mingjie Zhou
62
0
0
02 Nov 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Qinghong Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
86
65
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP, VLM, VGen
111
2
0
30 Oct 2023
Exploring Iterative Refinement with Diffusion Models for Video Grounding
Xiao Liang
Tao Shi
Yaoyuan Liang
Te Tao
Shao-Lun Huang
DiffM
96
2
0
26 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
80
10
0
25 Oct 2023
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
Zhecan Wang
Long Chen
Haoxuan You
Keyang Xu
Yicheng He
Wenhao Li
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
107
3
0
23 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Yimu Wang
Xiangru Jian
Bo Xue
55
11
0
17 Oct 2023
SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval
Sunjae Yoon
Gwanhyeong Koo
Dahyun Kim
Changdong Yoo
93
12
0
08 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
74
0
0
07 Oct 2023
Human-centric Behavior Description in Videos: New Benchmark and Model
Lingru Zhou
Yi-Meng Gao
Manqing Zhang
Peng Wu
Peng Wang
Yanning Zhang
56
1
0
04 Oct 2023
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
54
1
0
28 Sep 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Seungwhan Moon
Andrea Madotto
Zhaojiang Lin
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
89
94
0
27 Sep 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Deniz Engin
Yannis Avrithis
MLLM, VLM
92
3
0
27 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
102
28
0
25 Sep 2023