ResearchTrend.AI
InternVideo: General Video Foundation Models via Generative and Discriminative Learning

6 December 2022
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
Zhiyu Zhao
Hongjie Zhang
Jilan Xu
Yi Liu
Zun Wang
Sen Xing
Guo Chen
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
    VLM
    VGen

Papers citing "InternVideo: General Video Foundation Models via Generative and Discriminative Learning"

50 / 241 papers shown
TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
24
2
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
MLLM
30
10
0
04 Dec 2023
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
Min Yang
Huan Gao
Ping Guo
Limin Wang
ViT
30
5
0
04 Dec 2023
Zero-Shot Video Question Answering with Procedural Programs
Rohan Choudhury
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
19
21
0
01 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
15
1
1
30 Nov 2023
Motion-Conditioned Image Animation for Video Editing
Wilson Yan
Andrew Brown
Pieter Abbeel
Rohit Girdhar
S. Azadi
DiffM
VGen
58
12
0
30 Nov 2023
CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee
Jongseo Lee
Jinwoo Choi
EgoV
35
12
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
53
1
0
30 Nov 2023
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang
K. Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
21
28
0
29 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
34
29
0
29 Nov 2023
PALM: Predicting Actions through Language Models
Sanghwan Kim
Daoji Huang
Yongqin Xian
Otmar Hilliges
Luc Van Gool
Xi Wang
VLM
22
10
0
29 Nov 2023
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu
Chen-Da Liu-Zhang
Chen Zhao
Bernard Ghanem
33
25
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
56
399
0
28 Nov 2023
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Zhixi Cai
Shreya Ghosh
Aman Pankaj Adatia
Munawar Hayat
Abhinav Dhall
Kalin Stefanov
21
27
0
26 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
29
3
0
25 Nov 2023
Multi-modal Instance Refinement for Cross-domain Action Recognition
Yuan Qing
Naixing Wu
Shaohua Wan
Lixin Duan
14
0
0
24 Nov 2023
Vamos: Versatile Action Models for Video Understanding
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
27
19
0
22 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
19
2
0
21 Nov 2023
VLM-Eval: A General Evaluation on Video Large Language Models
Shuailin Li
Yuang Zhang
Yucheng Zhao
Qiuyue Wang
Fan Jia
Yingfei Liu
Tiancai Wang
MLLM
ELM
34
2
0
20 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
37
14
0
15 Nov 2023
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Konstantin Yakovlev
Gregory Polyakov
I. Alimova
Alexander Podolskiy
A. Bout
Sergey I. Nikolenko
Irina Piontkovskaya
CLIP
16
1
0
14 Nov 2023
Semantic-aware Video Representation for Few-shot Action Recognition
Yutao Tang
Benjamin Bejar
René Vidal
42
7
0
10 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
36
19
0
09 Nov 2023
LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong
Kai Zhang
Jiuxiang Gu
Sai Bi
Yang Zhou
Difan Liu
Feng Liu
Kalyan Sunkavalli
Trung Bui
Hao Tan
3DV
3DH
42
412
0
08 Nov 2023
OmniVec: Learning robust representations with cross modal sharing
Siddharth Srivastava
Gaurav Sharma
SSL
27
64
0
07 Nov 2023
Large Language Models are Temporal and Causal Reasoners for Video Question Answering
Dohwan Ko
Ji Soo Lee
Wooyoung Kang
Byungseok Roh
Hyunwoo J. Kim
LRM
33
31
0
24 Oct 2023
Can Language Models Laugh at YouTube Short-form Videos?
Dayoon Ko
Sangho Lee
Gunhee Kim
34
6
0
22 Oct 2023
Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding
Yuanxing Xu
Yuting Wei
Bin Wu
25
0
0
19 Oct 2023
Unifying Image Processing as Visual Prompting Question Answering
Yihao Liu
Xiangyu Chen
Xianzheng Ma
Xintao Wang
Jiantao Zhou
Yu Qiao
Chao Dong
MLLM
22
18
0
16 Oct 2023
Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook
Ming Jin
Qingsong Wen
Yuxuan Liang
Chaoli Zhang
Siqiao Xue
...
Shirui Pan
Vincent S. Tseng
Yu Zheng
Lei Chen
Hui Xiong
AI4TS
SyDa
35
117
0
16 Oct 2023
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
Haoyi Zhu
Honghui Yang
Xiaoyang Wu
Di Huang
Sha Zhang
...
Hengshuang Zhao
Chunhua Shen
Yu Qiao
Tong He
Wanli Ouyang
SSL
71
43
0
12 Oct 2023
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
CLIP
VLM
36
21
0
08 Oct 2023
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
...
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLM
MLLM
27
202
0
03 Oct 2023
Training a Large Video Model on a Single Machine in a Day
Yue Zhao
Philipp Krahenbuhl
VLM
29
15
0
28 Sep 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Avamarie Brueggeman
Andrea Madotto
Zhaojiang Lin
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
32
93
0
27 Sep 2023
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu
Chen Li
Yixiao Ge
Ying Shan
Thomas H. Li
Ge Li
25
28
0
27 Sep 2023
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios
Francesco Ragusa
Rosario Leonardi
Michele Mazzamuto
Claudia Bonanno
Rosario Scavo
Antonino Furnari
G. Farinella
30
7
0
26 Sep 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
33
46
0
04 Sep 2023
Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
Dezhao Luo
Jiabo Huang
Shaogang Gong
Hailin Jin
Yang Liu
VLM
19
9
0
01 Sep 2023
Language Reward Modulation for Pretraining Reinforcement Learning
Ademi Adeniji
Amber Xie
Carmelo Sferrazza
Younggyo Seo
Stephen James
Pieter Abbeel
39
26
0
23 Aug 2023
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Emanuele Bugliarello
Hernan Moraldo
Ruben Villegas
Mohammad Babaeizadeh
M. Saffar
Han Zhang
D. Erhan
V. Ferrari
Pieter-Jan Kindermans
P. Voigtlaender
VGen
35
10
0
22 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
19
13
0
22 Aug 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
25
247
0
17 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
26
19
0
15 Aug 2023
Memory-and-Anticipation Transformer for Online Action Understanding
Jiahao Wang
Guo Chen
Yifei Huang
Liming Wang
Tong Lu
OffRL
59
37
0
15 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Jenq-Neng Hwang
Gaoang Wang
VLM
MLLM
22
262
0
31 Jul 2023
Scaling Data Generation in Vision-and-Language Navigation
Zun Wang
Jialu Li
Yicong Hong
Yi Wang
Qi Wu
Mohit Bansal
Stephen Gould
Hao Tan
Yu Qiao
LM&Ro
34
56
0
28 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
167
1
0
14 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
33
244
0
13 Jul 2023