ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2307.06942
  4. Cited By
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
  and Generation

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

13 July 2023
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
X. Ma
Xinhao Li
Guo Chen
Xinyuan Chen
Yaohui Wang
Conghui He
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
    VLM
    VGen
ArXivPDFHTML

Papers citing "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation"

50 / 55 papers shown
Title
InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
Xueji Fang
Liyuan Ma
Zhiyang Chen
Mingyuan Zhou
Guo-Jun Qi
VGen
118
0
0
23 May 2025
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
39
0
0
19 May 2025
Video-GPT via Next Clip Diffusion
Video-GPT via Next Clip Diffusion
Shaobin Zhuang
Zhipeng Huang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Binxin Yang
Chong Sun
Chen Li
Yali Wang
DiffM
VGen
111
0
0
18 May 2025
SafeVid: Toward Safety Aligned Video Large Multimodal Models
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang
Jiaxin Song
Yifeng Gao
Xin Wang
Yang Yao
Yan Teng
Xingjun Ma
Yingchun Wang
Yu-Gang Jiang
71
0
0
17 May 2025
Generative Pre-trained Autoregressive Diffusion Transformer
Generative Pre-trained Autoregressive Diffusion Transformer
Yuan Zhang
Jiacheng Jiang
Guoqing Ma
Zhiying Lu
Haoyang Huang
Jianlong Yuan
Nan Duan
VGen
70
1
0
12 May 2025
Video-Bench: Human-Aligned Video Generation Benchmark
Video-Bench: Human-Aligned Video Generation Benchmark
Hui Han
Siyuan Li
Jiaqi Chen
Yiwen Yuan
Yuling Wu
...
You Li
Jing Zhang
Chi Zhang
Li Li
Yongxin Ni
EGVM
VGen
93
0
0
07 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
82
0
0
04 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
104
0
0
03 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
87
0
0
02 Apr 2025
ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
Haolin Yang
Feilong Tang
Ming Hu
Yulong Li
Junjie Guo
...
Zelin Peng
Junjun He
Junjun He
Zongyuan Ge
Imran Razzak
DiffM
VGen
167
2
0
20 Mar 2025
Concat-ID: Towards Universal Identity-Preserving Video Synthesis
Concat-ID: Towards Universal Identity-Preserving Video Synthesis
Yong Zhong
Zhuoyi Yang
Jiayan Teng
Xiaotao Gu
Chongxuan Li
VGen
77
2
0
18 Mar 2025
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
Wanhua Li
Renping Zhou
Jiawei Zhou
Yingwei Song
Johannes Herter
Minghan Qin
Gao Huang
Hanspeter Pfister
3DGS
VLM
79
1
0
13 Mar 2025
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
Wenhao Wang
Yue Yang
DiffM
VGen
119
0
0
03 Mar 2025
FreqPrior: Improving Video Diffusion Models with Frequency Filtering Gaussian Noise
FreqPrior: Improving Video Diffusion Models with Frequency Filtering Gaussian Noise
Yunlong Yuan
Yuanfan Guo
Chunwei Wang
Wei Zhang
Hang Xu
L. Zhang
DiffM
VGen
143
3
0
20 Feb 2025
AdaDiff: Adaptive Step Selection for Fast Diffusion Models
AdaDiff: Adaptive Step Selection for Fast Diffusion Models
Hui Zhang
Zuxuan Wu
Zhen Xing
Jie Shao
Yu-Gang Jiang
77
9
0
31 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
142
1
0
16 Dec 2024
Can video generation replace cinematographers? Research on the cinematic language of generated video
Can video generation replace cinematographers? Research on the cinematic language of generated video
Xuelong Li
Kai WU
Siyi Yang
YiZhan Qu
Guohua. Zhang
...
Mingliang Xiong
Hao Deng
Qingwen Liu
Gang Li
Bin He
VGen
DiffM
108
1
0
16 Dec 2024
Mojito: Motion Trajectory and Intensity Control for Video Generation
Mojito: Motion Trajectory and Intensity Control for Video Generation
Xuehai He
Shuohang Wang
Jianwei Yang
Xiaoxia Wu
Yansen Wang
Kuan-Chieh Wang
Z. Zhan
Olatunji Ruwase
Yelong Shen
Xinze Wang
VGen
147
1
0
12 Dec 2024
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
Qiyao Xue
Xiangyu Yin
Boyuan Yang
Wei Gao
DiffM
VGen
126
11
0
30 Nov 2024
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
Kaifeng Gao
Jiaxin Shi
Hanwang Zhang
Chunping Wang
Jun Xiao
Long Chen
VGen
DiffM
139
1
0
25 Nov 2024
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Weijia Wu
Mingyu Liu
Zeyu Zhu
Xi Xia
Haoen Feng
Wen Wang
Kevin Qinghong Lin
Chunhua Shen
Mike Zheng Shou
DiffM
VGen
140
2
0
22 Nov 2024
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Xiufeng Song
Xiao Guo
Junxuan Zhang
Qirui Li
Lei Bai
Xiaoming Liu
Guangtao Zhai
Xiaohong Liu
DiffM
VGen
101
9
0
31 Oct 2024
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour
David Harrison
Maxwell Horton
Jeffrey Marker
Houman Bedayat
Sachin Mehta
Mohammad Rastegari
Mahyar Najibi
Saman Naderiparizi
MQ
77
3
0
14 Oct 2024
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Qiuheng Wang
Yukai Shi
Jiarong Ou
Ruoxin Chen
Ke Lin
...
Mingwu Zheng
Xin Tao
Fei Yang
Pengfei Wan
Di Zhang
VGen
111
23
0
10 Oct 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
91
3
0
18 Jul 2024
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Kepan Nan
Rui Xie
Penghao Zhou
Tiehan Fan
Zhenheng Yang
Zhijie Chen
Xiang Li
Jian Yang
Ying Tai
111
76
0
02 Jul 2024
OpenDataLab: Empowering General Artificial Intelligence with Open
  Datasets
OpenDataLab: Empowering General Artificial Intelligence with Open Datasets
Conghui He
Wei Li
Zhenjiang Jin
Chao Xu
Bin Wang
Dahua Lin
AI4CE
31
11
0
04 Jun 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
107
35
0
07 Apr 2024
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel
Levon Khachatryan
Daniil Hayrapetyan
Hayk Poghosyan
Vahram Tadevosyan
Zhangyang Wang
Shant Navasardyan
Humphrey Shi
DiffM
VGen
112
81
0
21 Mar 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
76
71
0
13 Feb 2024
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
M. S. Seyfioglu
Wisdom O. Ikezogwo
Fatemeh Ghezloo
Ranjay Krishna
Linda G. Shapiro
82
41
0
07 Dec 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and
  Dataset
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
98
101
0
29 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
  Scale
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
52
9
0
23 May 2023
VideoLLM: Modeling Video Sequence with Large Language Models
VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen
Yin-Dong Zheng
Jiahao Wang
Jilan Xu
Yifei Huang
...
Yi Wang
Yali Wang
Yu Qiao
Tong Lu
Limin Wang
MLLM
107
78
0
22 May 2023
EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
CLIP
VLM
101
478
0
27 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
78
74
0
10 Mar 2023
VindLU: A Recipe for Effective Video-and-Language Pretraining
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
49
79
0
09 Dec 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video
  UniFormer
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Limin Wang
Yu Qiao
ViT
66
108
0
17 Nov 2022
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
Guo Chen
Sen Xing
Zhe Chen
Yi Wang
Kunchang Li
...
Hongjie Zhang
Tong Lu
Yali Wang
Liming Wang
Yu Qiao
46
46
0
17 Nov 2022
Revealing Single Frame Bias for Video-and-Language Learning
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
47
112
0
07 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
268
3,458
0
29 Apr 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
  Transformer
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
Songwei Ge
Thomas Hayes
Harry Yang
Xiaoyue Yin
Guan Pang
David Jacobs
Jia-Bin Huang
Devi Parikh
ViT
72
215
0
07 Apr 2022
Learning Audio-Video Modalities from Image Captions
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
41
85
0
01 Apr 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and
  Sound
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
70
208
0
07 Jan 2022
Scaling Up Vision-Language Pre-training for Image Captioning
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
89
247
0
24 Nov 2021
RedCaps: web-curated image-text data created by the people, for the
  people
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
76
166
0
22 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale
  Video Transcriptions
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
54
191
0
19 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
57
627
0
09 Nov 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text
  Understanding
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIP
VLM
286
567
0
28 Sep 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Joey Tianyi Zhou
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
242
407
0
13 Jul 2021
12
Next