ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1904.03493
  4. Cited By
VATEX: A Large-Scale, High-Quality Multilingual Dataset for
  Video-and-Language Research

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

6 April 2019
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
ArXivPDFHTML

Papers citing "VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research"

50 / 339 papers shown
Title
Video ReCap: Recursive Captioning of Hour-Long Videos
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
35
44
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
41
29
0
20 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
  Vision-language Adapter
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan L. Yuille
Shen Yan
Boyu Wang
33
3
0
16 Feb 2024
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
Zhicheng Zheng
Xin Yan
Zhenfang Chen
Jingzhou Wang
Qin Zhi Eddie Lim
Joshua B. Tenenbaum
Chuang Gan
LRM
40
6
0
09 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
  Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
36
2
0
31 Jan 2024
Knowledge Guided Entity-aware Video Captioning and A Basketball
  Benchmark
Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark
Zeyu Xi
Ge Shi
Xuefen Li
Junchi Yan
Zun Li
Lifang Wu
Zilin Liu
Liang Wang
44
0
0
25 Jan 2024
ActionHub: A Large-scale Action Video Description Dataset for Zero-shot
  Action Recognition
ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition
Jiaming Zhou
Junwei Liang
Kun-Yu Lin
Jinrui Yang
Wei-Shi Zheng
VLM
21
8
0
22 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
34
23
0
19 Jan 2024
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
  Video Search Scenarios
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Xiangshuo Qiao
Xianxin Li
Xiaozhe Qu
Jie M. Zhang
Yang Liu
Yu Luo
Cihang Jin
Jin Ma
VLM
33
0
0
19 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
34
13
0
11 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
  Pre-Training
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
K. Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
27
12
0
01 Jan 2024
Towards Efficient and Effective Text-to-Video Retrieval with
  Coarse-to-Fine Visual Representation Learning
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
Kaibin Tian
Yanhua Cheng
Yi Liu
Xinglin Hou
Quan Chen
Han Li
22
3
0
01 Jan 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
54
83
0
29 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
  Language, Audio, and Action
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
37
144
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A
  Survey
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
41
22
0
27 Dec 2023
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Yifan Lu
Ziqi Zhang
Chunfen Yuan
Peng Li
Yan Wang
Bing Li
Weiming Hu
34
3
0
25 Dec 2023
Open-Vocabulary Video Relation Extraction
Open-Vocabulary Video Relation Extraction
Wentao Tian
Zheng Wang
Yu Fu
Jingjing Chen
Lechao Cheng
23
2
0
25 Dec 2023
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling
  Vision-Language Models Through Open-Vocabulary Knowledge
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
Huy Le
Tung Kieu
Anh Nguyen
Ngan Le
VGen
26
1
0
15 Dec 2023
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual
  Knowledge Transfer
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
Yabing Wang
Fan Wang
Jianfeng Dong
Hao Luo
VLM
24
9
0
14 Dec 2023
Leveraging Generative Language Models for Weakly Supervised Sentence
  Component Analysis in Video-Language Joint Learning
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
17
1
0
10 Dec 2023
Vision-Language Models Learn Super Images for Efficient Partially
  Relevant Video Retrieval
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
21
0
0
01 Dec 2023
ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with
  Diffusion Models
ART⋅\boldsymbol{\cdot}⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
Wenming Weng
Ruoyu Feng
Yanhui Wang
Qi Dai
Chunyu Wang
...
Jianmin Bao
Yuhui Yuan
Chong Luo
Yueyi Zhang
Zhiwei Xiong
VGen
25
32
0
30 Nov 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
  representations to LLMs and Emergent Cross-modal Reasoning
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq R. Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
38
45
0
30 Nov 2023
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
  Learning
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang
K. Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
21
28
0
29 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of
  Video-Language Models
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
34
29
0
29 Nov 2023
Side4Video: Spatial-Temporal Side Network for Memory-Efficient
  Image-to-Video Transfer Learning
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao
Wenhao Wu
Zhiheng Li
VLM
95
9
0
27 Nov 2023
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision
  Language Models in Open-Ended Video Question Answering
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
Xiuyuan Chen
Yuan Lin
Yuchen Zhang
Weiran Huang
ELM
MLLM
23
26
0
25 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
37
14
0
15 Nov 2023
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video
  Retrieval
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Konstantin Yakovlev
Gregory Polyakov
I. Alimova
Alexander Podolskiy
A. Bout
Sergey I. Nikolenko
Irina Piontkovskaya
CLIP
16
1
0
14 Nov 2023
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation
  Protocols
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
Iqra Qasim
Alexander Horsch
Dilip K. Prasad
19
5
0
05 Nov 2023
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
Jieming Cui
Ziren Gong
Baoxiong Jia
Siyuan Huang
Zilong Zheng
Jianzhu Ma
Yixin Zhu
34
3
0
01 Nov 2023
Video-Helpful Multimodal Machine Translation
Video-Helpful Multimodal Machine Translation
Yihang Li
Shuichiro Shimizu
Chenhui Chu
Sadao Kurohashi
Wei Li
24
2
0
31 Oct 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Qinghong Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
28
63
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
51
2
0
30 Oct 2023
Semantic and Expressive Variation in Image Captions Across Languages
Semantic and Expressive Variation in Image Captions Across Languages
Andre Ye
Sebastin Santy
Jena D. Hwang
Amy X. Zhang
Ranjay Krishna
VLM
53
3
0
22 Oct 2023
Grounded and Well-rounded: A Methodological Approach to the Study of
  Cross-modal and Cross-lingual Grounding
Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding
Timothee Mickus
Elaine Zosa
Denis Paperno
36
0
0
18 Oct 2023
A Survey on Video Diffusion Models
A Survey on Video Diffusion Models
Zhen Xing
Qijun Feng
Haoran Chen
Qi Dai
Hang-Rui Hu
Hang Xu
Zuxuan Wu
Yu-Gang Jiang
EGVM
VGen
57
116
0
16 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM
VLM
30
93
0
13 Oct 2023
Learning Interactive Real-World Simulators
Learning Interactive Real-World Simulators
Mengjiao Yang
Yilun Du
Kamyar Ghasemipour
Jonathan Tompson
Leslie Kaelbling
Dale Schuurmans
Pieter Abbeel
LM&Ro
PINN
16
177
0
09 Oct 2023
Human-centric Behavior Description in Videos: New Benchmark and Model
Human-centric Behavior Description in Videos: New Benchmark and Model
Lingru Zhou
Yi-Meng Gao
Manqing Zhang
Peng Wu
Peng Wang
Yanning Zhang
36
1
0
04 Oct 2023
A Spatio-Temporal Attention-Based Method for Detecting Student Classroom
  Behaviors
A Spatio-Temporal Attention-Based Method for Detecting Student Classroom Behaviors
Fan Yang
27
2
0
04 Oct 2023
Video-adverb retrieval with compositional adverb-action embeddings
Video-adverb retrieval with compositional adverb-action embeddings
Thomas Hummel
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
17
1
0
26 Sep 2023
Towards Surveillance Video-and-Language Understanding: New Dataset,
  Baselines, and Challenges
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges
Tongtong Yuan
Xuange Zhang
Kun Liu
Bo Liu
Chen Chen
Jian Jin
Zhenzhen Jiao
AI4TS
19
13
0
25 Sep 2023
Accurate and Fast Compressed Video Captioning
Accurate and Fast Compressed Video Captioning
Yaojie Shen
Xin Gu
Kai Xu
Hengrui Fan
Longyin Wen
Libo Zhang
ViT
23
26
0
22 Sep 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Joey Tianyi Zhou
98
44
0
18 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for
  Text-Video Retrieval
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
25
3
0
16 Sep 2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
  Retrieval
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
22
9
0
11 Sep 2023
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Jiaxi Gu
Shicong Wang
Haoyu Zhao
Tianyi Lu
Xing Zhang
Zuxuan Wu
Songcen Xu
Wei Zhang
Yu-Gang Jiang
Hang Xu
DiffM
VGen
39
43
0
07 Sep 2023
Distraction-free Embeddings for Robust VQA
Distraction-free Embeddings for Robust VQA
Atharvan Dogra
Deeksha Varshney
A. Kalyan
A. Deshpande
Neeraj Kumar
14
0
0
31 Aug 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual
  Captioning
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLM
CLIP
44
13
0
25 Aug 2023
Previous
1234567
Next