HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

arXiv 2005.00200 · v1, v2 (latest)

1 May 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM, VLM, OffRL, AI4TS

Papers citing "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"

50 / 328 papers shown
InstructionBench: An Instructional Video Understanding Benchmark
Haiwan Wei
Yitian Yuan
Xiaohan Lan
Wei Ke
Lin Ma
ELM
90
3
0
01 Jul 2025
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Daeun Lee
Jaehong Yoon
Jaemin Cho
Mohit Bansal
LRM
91
0
0
04 Jun 2025
MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection
Yixian Shen
Qi Bi
Jia-Hong Huang
Hongyi Zhu
Andy D. Pimentel
Anuj Pathania
29
0
0
29 May 2025
Robust Relevance Feedback for Interactive Known-Item Video Search
Zhixin Ma
Chong-Wah Ngo
67
0
0
21 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
82
0
0
16 May 2025
HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu
Irfan Essa
371
0
0
25 Apr 2025
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Tinh-Anh Nguyen-Nhu
H. Tran
Nguyen-Khang Le
Minh-Nhat Nguyen
T. Nguyen
...
Huu-Phong Phan-Nguyen
Huy-Thach Pham
Quan Nguyen
Hoang M. Le
Quang-Vinh Dinh
99
0
0
12 Apr 2025
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking
H. Tran
Tinh-Anh Nguyen-Nhu
Huu-Phong Phan-Nguyen
T. Nguyen
Nhat-Minh Nguyen-Dich
Anh Dao
Huy-Duc Do
Quan Nguyen
Hoang M. Le
Quang-Vinh Dinh
73
0
0
11 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
394
0
0
07 Apr 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
57
0
0
07 Apr 2025
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
Alkesh Patel
Vibhav Chitalia
Yinfei Yang
58
2
0
06 Apr 2025
Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
Chi Hsuan Wu
Kumar Ashutosh
Kristen Grauman
DiffM
107
0
0
18 Mar 2025
ALLVB: All-in-One Long Video Understanding Benchmark
Xichen Tan
Yuanjing Luo
Yunfan Ye
Fang Liu
Zhiping Cai
MLLM, VLM
169
1
0
10 Mar 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Shengsheng Qian
Jiahong Wu
Fan Yang
W. Dong
Changsheng Xu
113
9
0
15 Feb 2025
Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search
J. He
Shengeng Tang
Ao Liu
Lechao Cheng
Jingjing Wu
Yanyan Wei
79
0
0
05 Feb 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
254
134
0
10 Jan 2025
When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
VLM
87
0
0
26 Dec 2024
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Yunbin Tu
Liang-Sheng Li
Li Su
Qingming Huang
119
0
0
18 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
335
2
0
18 Dec 2024
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Yanjie Wang
Zhikang Zhang
Jue Wang
D. Fan
Zhenlin Xu
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
VLM
122
1
0
10 Dec 2024
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Meng Cao
Haoran Tang
Haoze Zhao
Hangyu Guo
Jing Liu
Ge Zhang
Ruyang Liu
Qiang Sun
Ian Reid
Xiaodan Liang
215
3
0
02 Dec 2024
SparrowVQE: Visual Question Explanation for Course Content Understanding
Jialu Li
Manish Kumar Thota
Ruslan Gokhman
Radek Holik
Youshan Zhang
107
1
0
12 Nov 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
67
0
0
11 Nov 2024
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa
Wiradee Imrattanatrai
Zhi-Qi Cheng
Masaki Asada
Susan Holm
Yuran Wang
Ken Fukuda
Teruko Mitamura
48
1
0
29 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
99
0
0
23 Oct 2024
LocoMotion: Learning Motion-Focused Video-Language Representations
Hazel Doughty
Fida Mohammad Thoker
Cees G. M. Snoek
119
2
0
15 Oct 2024
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
Ting Yu
Kunhao Fu
Jian Zhang
Qingming Huang
Jun Yu
76
2
0
12 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
72
0
0
12 Oct 2024
VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding
Houlun Chen
Xin Wang
Hong Chen
Zeyang Zhang
Wei Feng
Bin Huang
Jia Jia
Wenwu Zhu
VGen
100
4
0
11 Oct 2024
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Yongqi Wang
Xinxiao Wu
Shuo Yang
Jiebo Luo
458
1
0
19 Sep 2024
QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems
Zhixian He
Pengcheng Zhao
Fuwei Zhang
Shujin Lin
77
0
0
14 Sep 2024
VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
Zaiwei Zhang
Gregory P. Meyer
Zhichao Lu
Ashish Shrivastava
Avinash Ravichandran
Eric M. Wolff
VLM
99
3
0
29 Aug 2024
QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval
Chenghua Gao
Min Li
Jianshuo Liu
Junxing Ren
Lin Chen
Haoyu Liu
Bo Meng
Jitao Fu
Wenwen Su
50
0
0
23 Aug 2024
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Jingyun Wang
Guoliang Kang
VLM, SSL
107
7
0
13 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
104
14
0
08 Aug 2024
Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection
Taichi Nishimura
Shota Nakada
Hokuto Munakata
Tatsuya Komatsu
VLM
89
2
0
06 Aug 2024
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
108
6
0
31 Jul 2024
Causal Understanding For Video Question Answering
Bhanu Prakash Reddy Guda
Tanmay Kulkarni
Adithya Sampath
Swarnashree Mysore Sathyendra
CML
88
0
0
23 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
66
0
0
18 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
144
6
0
04 Jul 2024
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen
Yu-Chien Liao
Hsi-Che Lin
Yu-Chu Yu
Yen-Chun Chen
Yu-Chiang Frank Wang
88
13
0
27 Jun 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Hao Fei
Tat-Seng Chua
Shuicheng Yan
AI4TS
128
43
0
27 Jun 2024
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
Hao Yang
Yanyan Zhao
Yang Wu
Shilong Wang
Tian Zheng
Hongbo Zhang
Zongyang Ma
Wanxiang Che
Bing Qin
133
14
0
12 Jun 2024
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang
Zehai He
Wenyi Hong
Yean Cheng
Xiaohan Zhang
...
Shiyu Huang
Bin Xu
Yuxiao Dong
Ming Ding
Jie Tang
ELM, VLM
146
91
0
12 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
173
13
1
09 Jun 2024
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
116
0
0
06 Jun 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
Xing Sun
VLM, MLLM
179
421
0
31 May 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
150
70
0
29 May 2024
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Weizhen He
Yiheng Deng
Yunfeng Yan
Feng Zhu
Yizhou Wang
Lei Bai
Qingsong Xie
Donglian Qi
Wanli Ouyang
Shixiang Tang
164
3
0
28 May 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
108
10
0
22 May 2024