Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2102.08981
Cited By
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 850 papers shown
Title
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
Zehan Wang
Haifeng Huang
Yang Zhao
Ziang Zhang
Zhou Zhao
19
62
0
17 Aug 2023
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans
Yangyi Huang
Hongwei Yi
Yuliang Xiu
Tingting Liao
Jiaxiang Tang
Deng Cai
Justus Thies
DiffM
63
81
0
16 Aug 2023
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Hongguang Zhu
Yunchao Wei
Xiaodan Liang
Chunjie Zhang
Yao-Min Zhao
VLM
32
28
0
14 Aug 2023
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation
Kaixin Cai
Pengzhen Ren
Yi Zhu
Hang Xu
Jian-zhuo Liu
Changlin Li
Guangrun Wang
Xiaodan Liang
VLM
29
14
0
09 Aug 2023
Distributionally Robust Classification on a Data Budget
Ben Feuer
Ameya Joshi
Minh Pham
Chinmay Hegde
OOD
39
2
0
07 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
60
622
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
53
85
0
03 Aug 2023
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
Qiang-feng Zhou
Chaohui Yu
Shaofeng Zhang
Sitong Wu
Zhibin Wang
Fan Wang
34
27
0
03 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
32
10
0
02 Aug 2023
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei
Teng Wang
Jinrui Zhang
Zhenyu He
Chengjie Wang
Feng Zheng
VLM
36
34
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
25
2
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
61
43
0
30 Jul 2023
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
Zhiyuan Li
Dongnan Liu
Heng Wang
Chaoyi Zhang
Weidong (Tom) Cai
RALM
44
0
0
27 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
Pietro Mascagni
Pietro Mascagni
N. Padoy
Nicolas Padoy
39
21
0
27 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
43
119
0
25 Jul 2023
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang
Zhulin An
Libo Huang
Junyu Bi
Xinqiang Yu
Hansheng Yang
Boyu Diao
Yongjun Xu
VLM
37
27
0
24 Jul 2023
Improving Multimodal Datasets with Image Captioning
Thao Nguyen
S. Gadre
Gabriel Ilharco
Sewoong Oh
Ludwig Schmidt
VLM
19
71
0
19 Jul 2023
Generative Prompt Model for Weakly Supervised Object Localization
Yuzhong Zhao
QiXiang Ye
Weijia Wu
Chunhua Shen
Fang Wan
WSOL
VLM
DiffM
37
28
0
19 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
36
33
0
18 Jul 2023
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao
Zhijie Lin
Daquan Zhou
Zilong Huang
Jiashi Feng
Bingyi Kang
MLLM
44
108
0
17 Jul 2023
PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
Zixin Guo
Tong Wang
Selen Pehlivan
Abduljalil Radman
Jorma T. Laaksonen
VLM
33
2
0
14 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
35
251
0
13 Jul 2023
SVIT: Scaling up Visual Instruction Tuning
Bo Zhao
Boya Wu
Muyang He
Tiejun Huang
MLLM
44
120
0
09 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
VLM
MLLM
92
225
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
31
5
0
06 Jul 2023
Several categories of Large Language Models (LLMs): A Short Survey
Saurabh Pahune
Manoj Chandrasekharan
AILaw
25
14
0
05 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
27
73
0
05 Jul 2023
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Yanzhe Zhang
Ruiyi Zhang
Jiuxiang Gu
Yufan Zhou
Nedim Lipka
Diyi Yang
Tongfei Sun
VLM
MLLM
33
219
0
29 Jun 2023
CLIPAG: Towards Generator-Free Text-to-Image Generation
Roy Ganz
Michael Elad
VLM
38
7
0
29 Jun 2023
Federated Generative Learning with Foundation Models
Jie Zhang
Xiaohua Qi
Bo Zhao
FedML
44
21
0
28 Jun 2023
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a \
10,000 Budget; An Extra \
4,000 Unlocks 81.8% Accuracy
Xianhang Li
Zeyu Wang
Cihang Xie
CLIP
VLM
56
19
0
27 Jun 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu
Kevin Qinghong Lin
Linjie Li
Jianfeng Wang
Yaser Yacoob
Lijuan Wang
VLM
MLLM
40
246
0
26 Jun 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
62
562
0
23 Jun 2023
TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter
Binjie Zhang
Yixiao Ge
Xuyuan Xu
Ying Shan
Mike Zheng Shou
52
8
0
22 Jun 2023
DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation
Yukun Huang
Jianan Wang
Yukai Shi
Zhengjun Zha
Xianbiao Qi
Lei Zhang
51
63
0
21 Jun 2023
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurenccon
Lucile Saulnier
Léo Tronchon
Stas Bekman
Amanpreet Singh
...
Siddharth Karamcheti
Alexander M. Rush
Douwe Kiela
Matthieu Cord
Victor Sanh
25
231
0
21 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Jianwei Yin
DiffM
VLM
32
56
0
20 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Hongsheng Li
MLLM
39
26
0
15 Jun 2023
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Peng Xu
Wenqi Shao
Kaipeng Zhang
Peng Gao
Shuo Liu
Meng Lei
Fanqing Meng
Siyuan Huang
Yu Qiao
Ping Luo
ELM
MLLM
41
159
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Jiaheng Liu
VLM
CLIP
30
8
0
15 Jun 2023
Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training
Alyssa Huang
Peihan Liu
Ryumei Nakada
Linjun Zhang
Wanrong Zhang
VLM
76
5
0
13 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
32
54
0
13 Jun 2023
Scalable 3D Captioning with Pretrained Models
Tiange Luo
C. Rockwell
Honglak Lee
Justin Johnson
32
153
0
12 Jun 2023
Retrieval-Enhanced Contrastive Vision-Text Models
Ahmet Iscen
Mathilde Caron
Alireza Fathi
Cordelia Schmid
CLIP
VLM
31
26
0
12 Jun 2023
Sticker820K: Empowering Interactive Retrieval with Stickers
Sijie Zhao
Yixiao Ge
Zhongang Qi
Lin Song
Xiaohan Ding
Zehua Xie
Ying Shan
34
6
0
12 Jun 2023
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Bo Li
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Fanyi Pu
Jingkang Yang
C. Li
Ziwei Liu
MLLM
VLM
45
224
0
08 Jun 2023
ScaleDet: A Scalable Multi-Dataset Object Detector
Yanbei Chen
Manchen Wang
Abhay Mittal
Zhenlin Xu
Paolo Favaro
Joseph Tighe
Davide Modolo
ObjD
19
19
0
08 Jun 2023
On the Generalization of Multi-modal Contrastive Learning
Qi Zhang
Yifei Wang
Yisen Wang
32
24
0
07 Jun 2023
Recognize Anything: A Strong Image Tagging Model
Youcai Zhang
Xinyu Huang
Jinyu Ma
Zhaoyang Li
Zhaochuan Luo
...
Tong Luo
Yaqian Li
Siyi Liu
Yandong Guo
Lei Zhang
VLM
47
225
0
06 Jun 2023
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja
A. Piergiovanni
A. Angelova
29
0
0
06 Jun 2023
Previous
1
2
3
...
10
11
12
...
15
16
17
Next