Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2011.00597
Cited By
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
1 November 2020
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
Re-assign community
ArXiv
PDF
HTML
Papers citing
"COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning"
50 / 95 papers shown
Title
Hadamard product in deep learning: Introduction, Advances and Challenges
Grigorios G. Chrysos
Yongtao Wu
Razvan Pascanu
Philip Torr
V. Cevher
AAML
98
0
0
17 Apr 2025
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
Zijian Győző Yang
Zhongwei Qiu
Chang Xu
Dongmei Fu
50
2
0
28 Jan 2025
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
37
0
0
12 Oct 2024
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
Yuchen Yang
Yingxuan Duan
VGen
32
0
0
19 Jun 2024
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Weizhen He
Yiheng Deng
Yunfeng Yan
Feng Zhu
Yizhou Wang
Lei Bai
Qingsong Xie
Donglian Qi
Wanli Ouyang
Shixiang Tang
95
2
0
28 May 2024
An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
CLIP
39
1
0
25 May 2024
Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Sishuo Chen
Lei Li
Shuhuai Ren
Rundong Gao
Yuanxin Liu
Xiaohan Bi
Xu Sun
Lu Hou
42
3
0
28 Mar 2024
Partial Federated Learning
Tiantian Feng
Anil Ramakrishna
Jimit Majmudar
Charith Peris
Jixuan Wang
Clement Chung
Richard Zemel
Morteza Ziyadi
Rahul Gupta
44
1
0
03 Mar 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
28
1
0
21 Feb 2024
Video Editing for Video Retrieval
Bin Zhu
Kevin Flanagan
A. Fragomeni
Michael Wray
Dima Damen
CLIP
34
0
0
04 Feb 2024
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection
Hao Sun
Mingyao Zhou
Wenjing Chen
Wei Xie
PINN
3DGS
ViT
21
32
0
04 Jan 2024
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
21
0
0
01 Dec 2023
Robust Domain Misinformation Detection via Multi-modal Feature Alignment
Hui Liu
Wenya Wang
Hao Sun
Anderson de Rezende Rocha
Haoliang Li
43
11
0
24 Nov 2023
CLearViD: Curriculum Learning for Video Description
Cheng-Yu Chuang
Pooyan Fazli
38
1
0
08 Nov 2023
Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving
J. Echterhoff
An Yan
Kyungtae Han
Amr Abdelraouf
Rohit Gupta
Julian McAuley
21
7
0
25 Oct 2023
Collaborative Three-Stream Transformers for Video Captioning
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
36
6
0
18 Sep 2023
Opening the Vocabulary of Egocentric Actions
Dibyadip Chatterjee
Fadime Sener
Shugao Ma
Angela Yao
VLM
41
16
0
22 Aug 2023
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Sarah Ibrahimi
Xiaohang Sun
Pichao Wang
Amanmeet Garg
Ashutosh Sanan
Mohamed Omar
46
14
0
24 Jul 2023
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Kumar Ashutosh
Santhosh Kumar Ramakrishnan
Triantafyllos Afouras
Kristen Grauman
23
24
0
17 Jul 2023
MomentDiff: Generative Video Moment Retrieval from Random to Real
P. Li
Chen-Wei Xie
Hongtao Xie
Liming Zhao
Lei Zhang
Yun Zheng
Deli Zhao
Yongdong Zhang
DiffM
VGen
39
56
0
06 Jul 2023
Hierarchical Matching and Reasoning for Multi-Query Image Retrieval
Zhong Ji
Zhihao Li
Yan Zhang
Haoran Wang
Yanwei Pang
Xuelong Li
24
11
0
26 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
18
2
0
21 Jun 2023
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Md Zahid Hasan
Jiajing Chen
Jiyang Wang
Mohammed Shaiqur Rahman
Ameya Joshi
Senem Velipasalar
C. Hegde
Anuj Sharma
S. Sarkar
VLM
46
18
0
16 Jun 2023
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Weizhen He
Yihe Deng
Shixiang Tang
Qihao Chen
Qingsong Xie
...
Feng Zhu
Rui Zhao
Wanli Ouyang
Donglian Qi
Yunfeng Yan
74
19
0
13 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
33
22
0
06 Jun 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
24
43
0
31 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
52
19
0
23 Mar 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
27
47
0
22 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
33
11
0
11 Mar 2023
Models See Hallucinations: Evaluating the Factuality in Video Captioning
Hui Liu
Xiaojun Wan
HILM
34
10
0
06 Mar 2023
Deep Learning for Video-Text Retrieval: a Review
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
24
14
0
24 Feb 2023
HierVL: Learning Hierarchical Video-Language Embeddings
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
VLM
AI4TS
22
52
0
05 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
83
36
0
05 Jan 2023
Contextual Explainable Video Representation: Human Perception-based Understanding
Khoa T. Vo
Kashu Yamazaki
Phong H. Nguyen
Pha Nguyen
Khoa Luu
Ngan Le
13
9
0
12 Dec 2022
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim
Nam-Won Kim
Suha Kwak
24
37
0
30 Nov 2022
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Kashu Yamazaki
Khoa T. Vo
Sang Truong
Bhiksha Raj
Ngan Le
29
35
0
28 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
75
12
0
28 Oct 2022
Linear Video Transformer with Feature Fixation
Kaiyue Lu
Zexia Liu
Jianyuan Wang
Weixuan Sun
Zhen Qin
...
Xuyang Shen
Huizhong Deng
Xiaodong Han
Yuchao Dai
Yiran Zhong
30
4
0
15 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
20
68
0
12 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
49
11
0
10 Oct 2022
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
A. Fragomeni
Michael Wray
Dima Damen
CLIP
ViT
25
3
0
09 Oct 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
31
32
0
27 Sep 2022
Leveraging Self-Supervised Training for Unintentional Action Recognition
Enea Duka
Anna Kukleva
Bernt Schiele
30
1
0
23 Sep 2022
Can Offline Reinforcement Learning Help Natural Language Understanding?
Ziqi Zhang
Yile Wang
Yue Zhang
Donglin Wang
OffRL
33
0
0
15 Sep 2022
Hierarchical Local-Global Transformer for Temporal Sentence Grounding
Xiang Fang
Daizong Liu
Pan Zhou
Zichuan Xu
Rui Li
19
28
0
31 Aug 2022
Partially Relevant Video Retrieval
Jianfeng Dong
Xianke Chen
Minsong Zhang
Xun Yang
Shujie Chen
Xirong Li
Xun Wang
17
39
0
26 Aug 2022
LocVTP: Video-Text Pre-training for Temporal Localization
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
22
64
0
21 Jul 2022
Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Motonari Kambara
K. Sugiura
22
6
0
19 Jul 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
24
48
0
16 Jul 2022
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Madeline Chantry Schiappa
Shruti Vyas
Hamid Palangi
Y. S. Rawat
Vibhav Vineet
VLM
120
17
0
05 Jul 2022
1
2
Next