Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2102.08981
Cited By
v1
v2 (latest)
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 871 papers shown
Title
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
Haoxiang Wang
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Mehrdad Farajtabar
Sachin Mehta
Mohammad Rastegari
Oncel Tuzel
Hadi Pouransari
VLM
128
72
0
23 Oct 2023
Matryoshka Diffusion Models
Jiatao Gu
Shuangfei Zhai
Yizhen Zhang
Joshua M. Susskind
Navdeep Jaitly
DiffM
102
47
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
87
35
0
23 Oct 2023
Data Pruning via Moving-one-Sample-out
Haoru Tan
Sitong Wu
Fei Du
Yukang Chen
Zhibin Wang
Fan Wang
Xiaojuan Qi
121
39
0
23 Oct 2023
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track
Shuhei Yokoo
Peifei Zhu
Yuchi Ishikawa
Mikihiro Tanaka
Masayoshi Kondo
Hirokatsu Kataoka
26
1
0
23 Oct 2023
Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
Shawn Shan
Wenxin Ding
Josephine Passananti
Stanley Wu
Haitao Zheng
Ben Y. Zhao
SILM
DiffM
106
53
0
20 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
82
4
0
20 Oct 2023
MarineGPT: Unlocking Secrets of Ocean to the Public
Ziqiang Zheng
Jipeng Zhang
Tuan-Anh Vu
Shizhe Diao
Yue Him Wong Tim
Sai-Kit Yeung
130
13
0
20 Oct 2023
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
74
12
0
20 Oct 2023
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
Zhiyuan Liu
Changhao Nai
Yancheng Luo
Hao Fei
Yixin Cao
Kenji Kawaguchi
Xiang Wang
Tat-Seng Chua
92
93
0
19 Oct 2023
Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
Zhaozheng Chen
Qianru Sun
VLM
138
9
0
19 Oct 2023
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection
Lingchen Meng
Xiyang Dai
Jianwei Yang
Dongdong Chen
Yinpeng Chen
Mengchen Liu
Yi-Ling Chen
Zuxuan Wu
Lu Yuan
Yu-Gang Jiang
74
7
0
18 Oct 2023
TOSS:High-quality Text-guided Novel View Synthesis from a Single Image
Yukai Shi
Jianan Wang
He Cao
Boshi Tang
Xianbiao Qi
Tianyu Yang
Yukun Huang
Shilong Liu
Lei Zhang
H. Shum
DiffM
66
20
0
16 Oct 2023
Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification
Sravanti Addepalli
Ashish Ramayee Asokan
Lakshay Sharma
R. V. Babu
VLM
59
22
0
12 Oct 2023
Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
Junyu Lu
Di Zhang
Xiaojun Wu
Xinyu Gao
Ruyi Gan
Jiaxing Zhang
Yan Song
Pingjian Zhang
VLM
MLLM
55
7
0
12 Oct 2023
CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping
Tim Lebailly
Thomas Stegmüller
Behzad Bozorgtabar
Jean-Philippe Thiran
Tinne Tuytelaars
SSL
127
8
0
11 Oct 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
Zhengfeng Lai
Haotian Zhang
Bowen Zhang
Wentao Wu
Haoping Bai
...
Zhe Gan
Jiulong Shan
Chen-Nee Chuah
Yinfei Yang
Meng Cao
CLIP
VLM
105
31
0
11 Oct 2023
On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets
Ning Liao
Shaofeng Zhang
Renqiu Xia
Min Cao
Yu Qiao
Junchi Yan
MLLM
64
0
0
10 Oct 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
159
17
0
10 Oct 2023
Implicit Concept Removal of Diffusion Models
Zhili Liu
Kai Chen
Yifan Zhang
Jianhua Han
Lanqing Hong
Hang Xu
Zhenguo Li
Dit-Yan Yeung
James T. Kwok
74
14
0
09 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
65
2
0
08 Oct 2023
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
CLIP
VLM
95
22
0
08 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
71
0
0
07 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
105
26
0
07 Oct 2023
On the Performance of Multimodal Language Models
Utsav Garg
Erhan Bas
MLLM
29
0
0
04 Oct 2023
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Xichen Pan
Li Dong
Shaohan Huang
Zhiliang Peng
Wenhu Chen
Furu Wei
VLM
152
68
0
04 Oct 2023
Sieve: Multimodal Dataset Pruning Using Image Captioning Models
Anas Mahmoud
Mostafa Elhoushi
Amro Abbas
Yu Yang
Newsha Ardalani
Hugh Leather
Ari S. Morcos
VLM
CLIP
80
21
0
03 Oct 2023
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
...
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLM
MLLM
185
229
0
03 Oct 2023
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Yiyang Zhou
Chenhang Cui
Jaehong Yoon
Linjun Zhang
Zhun Deng
Chelsea Finn
Mohit Bansal
Huaxiu Yao
MLLM
167
186
0
01 Oct 2023
GeRA: Label-Efficient Geometrically Regularized Alignment
Dustin Klebe
Tal Shnitzer
Mikhail Yurochkin
Leonid Karlinsky
Justin Solomon
83
2
0
01 Oct 2023
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Tianyu Yu
Jinyi Hu
Yuan Yao
Haoye Zhang
Yue Zhao
...
Jiao Xue
Dahai Li
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
VLM
MLLM
45
20
0
01 Oct 2023
Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study
Myeongseob Ko
Ming Jin
Chenguang Wang
Ruoxi Jia
99
29
0
29 Sep 2023
Data Filtering Networks
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
134
144
0
29 Sep 2023
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
Hao Chen
Jindong Wang
Ankit Shah
Ran Tao
Hongxin Wei
Berfin cSimcsek
Masashi Sugiyama
Bhiksha Raj
108
31
0
29 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
69
3
0
28 Sep 2023
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Hai-ping Yu
Yu Tian
Sateesh Kumar
Linjie Yang
Heng Wang
VLM
64
19
0
27 Sep 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang
Xiaoyi Wang
Bin Wang
Yuhang Cao
Chao Xu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Da Lin
Jiaqi Wang
MLLM
191
241
0
26 Sep 2023
BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning
Ching-Yu Chiang
I-Hua Chang
Shih-Wei Liao
83
1
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
63
10
0
26 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
102
28
0
25 Sep 2023
Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Yun Xing
Jian Kang
Aoran Xiao
Jiahao Nie
Ling Shao
Shijian Lu
VLM
89
13
0
24 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
77
4
0
16 Sep 2023
PatFig: Generating Short and Long Captions for Patent Figures
Dana Aubakirova
Kim Gerdes
Lufei Liu
48
11
0
15 Sep 2023
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Haozhe Zhao
Zefan Cai
Shuzheng Si
Xiaojian Ma
Kaikai An
Liang Chen
Zixuan Liu
Sheng Wang
Wenjuan Han
Baobao Chang
MLLM
VLM
128
143
0
14 Sep 2023
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
Huayang Li
Siheng Li
Deng Cai
Longyue Wang
Lemao Liu
Taro Watanabe
Yujiu Yang
Shuming Shi
MLLM
140
18
0
14 Sep 2023
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping
Gi-Cheon Kang
Junghyun Kim
Jaein Kim
Byoung-Tak Zhang
105
5
0
14 Sep 2023
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu
Bingchen Zhao
Chen Wei
Cihang Xie
MLLM
71
15
0
13 Sep 2023
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
117
507
0
11 Sep 2023
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Yang Jin
Kun Xu
Kun Xu
Liwei Chen
Chao Liao
...
Xiaoqiang Lei
Di Zhang
Wenwu Ou
Kun Gai
Yadong Mu
MLLM
VLM
79
50
0
09 Sep 2023
Previous
1
2
3
...
9
10
11
...
16
17
18
Next