2208.00934
Cited By
Video Question Answering with Iterative Video-Text Co-Tokenization
1 August 2022
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
Papers citing "Video Question Answering with Iterative Video-Text Co-Tokenization"
19 / 19 papers shown
Title
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
Hao Li
Li Yuan
Shuicheng Yan
Jie Chen
54
1
0
31 Dec 2024
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Mona Ahmadian
Frank Guerin
Andrew Gilbert
44
1
0
05 Jun 2024
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
Tianming Liang
Chaolei Tan
Beihao Xia
Wei-Shi Zheng
Jianfang Hu
36
1
0
21 Mar 2024
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
36
19
0
09 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
51
2
0
30 Oct 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
33
46
0
04 Sep 2023
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja
A. Piergiovanni
A. Angelova
19
0
0
06 Jun 2023
Joint Adaptive Representations for Image-Language Learning
A. Piergiovanni
A. Angelova
VLM
26
0
0
31 May 2023
TG-VQA: Ternary Game of Video Question Answering
Hao Li
Peng Jin
Ze-Long Cheng
Songyang Zhang
Kai-xiang Chen
Zhennan Wang
Chang-rui Liu
Jie Chen
26
10
0
17 May 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
29
23
0
29 Mar 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin
Jinfa Huang
Pengfei Xiong
Shangxuan Tian
Chang-rui Liu
Xiang Ji
Li-ming Yuan
Jie Chen
42
49
0
25 Mar 2023
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
36
54
0
06 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
19
1
0
02 Dec 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David A. Clifton
Jing Chen
VLM
42
63
0
21 Nov 2022
Video Question Answering: Datasets, Algorithms and Challenges
Yaoyao Zhong
Junbin Xiao
Wei Ji
Yicong Li
Wei Deng
Tat-Seng Chua
24
85
0
02 Mar 2022
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
165
100
0
29 Apr 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
280
1,982
0
09 Feb 2021
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
424
596
0
21 Jul 2020
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
130
496
0
24 Apr 2018