2102.08981
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Papers citing
"Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 871 papers shown
Title
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
77
31
0
11 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
142
4
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
80
32
0
10 Apr 2024
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
Jan Held
Hani Itani
A. Cioppa
Silvio Giancola
Guohao Li
Marc Van Droogenbroeck
84
17
0
07 Apr 2024
To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Zi-Hao Qiu
Siqi Guo
Mao Xu
Tuo Zhao
Lijun Zhang
Tianbao Yang
AI4TS
AI4CE
121
4
0
06 Apr 2024
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Ji-Jia Wu
Andy Chia-Hao Chang
Chieh-Yu Chuang
Chun-Pei Chen
Yu-Lun Liu
Min-Hung Chen
Hou-Ning Hu
Yung-Yu Chuang
Yen-Yu Lin
VLM
117
10
0
05 Apr 2024
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model
Amrin Kareem
Jean Lahoud
Hisham Cholakkal
LRM
92
4
0
04 Apr 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao
Ameya Prabhu
Adhiraj Ghosh
Yash Sharma
Philip Torr
Adel Bibi
Samuel Albanie
Matthias Bethge
VLM
221
55
0
04 Apr 2024
Would Deep Generative Models Amplify Bias in Future Models?
Tianwei Chen
Yusuke Hirota
Mayu Otani
Noa Garcia
Yuta Nakashima
90
15
0
04 Apr 2024
Many-to-many Image Generation with Auto-regressive Diffusion Models
Ying Shen
Yizhe Zhang
Shuangfei Zhai
Lifu Huang
J. Susskind
Jiatao Gu
122
6
0
03 Apr 2024
Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution
Fengyuan Liu
Haochen Luo
Yiming Li
Philip Torr
Jindong Gu
VLM
63
7
0
03 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLM
MLLM
73
3
0
02 Apr 2024
MotionChain: Conversational Motion Controllers via Multimodal Prompts
Biao Jiang
Xin Chen
C. Zhang
Fukun Yin
Zhuoyuan Li
Gang Yu
Jiayuan Fan
VGen
LRM
98
11
0
02 Apr 2024
Release of Pre-Trained Models for the Japanese Language
Kei Sawada
Tianyu Zhao
Makoto Shing
Kentaro Mitsui
Akio Kaga
Yukiya Hono
Toshiaki Wakatsuki
Koh Mitsuda
62
15
0
02 Apr 2024
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee
Gabriela Ben-Melech Stan
Estelle Aflalo
Sayak Paul
Dhruba Ghosh
...
Ludwig Schmidt
Hanna Hajishirzi
Vasudev Lal
Chitta Baral
Yezhou Yang
EGVM
VLM
116
18
0
01 Apr 2024
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
105
6
0
01 Apr 2024
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias
Sang-Kee Jo
Soohyun Ryu
Sungyub Kim
Eunho Yang
Kyungsu Kim
98
2
0
30 Mar 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLM
CoGe
86
20
0
29 Mar 2024
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra
Iqra Ali
Tatsuya Hiraoka
Hidetaka Kamigaito
Tomoya Iwakura
Taro Watanabe
108
1
0
29 Mar 2024
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
Donghyun Kim
Byeongho Heo
Dongyoon Han
85
17
0
28 Mar 2024
Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang
Hao Luo
Jungang Xu
Yingfei Sun
Fan Wang
VLM
76
0
0
28 Mar 2024
Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Reza Abbasi
Mohammad Samiei
M. Rohban
M. Baghshah
VLM
CoGe
62
0
0
27 Mar 2024
Centered Masking for Language-Image Pre-Training
Mingliang Liang
Martha Larson
VLM
CLIP
58
4
0
23 Mar 2024
Long-CLIP: Unlocking the Long-Text Capability of CLIP
Beichen Zhang
Pan Zhang
Xiao-wen Dong
Yuhang Zang
Jiaqi Wang
CLIP
VLM
118
142
0
22 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
66
1
0
22 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM
MLLM
127
47
0
21 Mar 2024
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
66
2
0
21 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
117
125
0
19 Mar 2024
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
Siddharth Joshi
Arnav Jain
Ali Payani
Baharan Mirzasoleiman
VLM
CLIP
111
8
0
18 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
113
36
0
18 Mar 2024
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Xiaojie Li
Yibo Yang
Hefei Ling
Jianlong Wu
Yue Yu
Guohao Li
Min Zhang
SSL
101
6
0
18 Mar 2024
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Yasufumi Kawano
Yoshimitsu Aoki
VLM
58
4
0
17 Mar 2024
Reward Guided Latent Consistency Distillation
Jiachen Li
Weixi Feng
Wenhu Chen
William Y. Wang
EGVM
82
15
0
16 Mar 2024
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Zhe Kong
Yong Zhang
Tianyu Yang
Tao Wang
Kaihao Zhang
Bizhu Wu
Guanying Chen
Wei Liu
Wenhan Luo
DiffM
105
31
0
16 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
62
2
0
16 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
127
209
0
14 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
76
3
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
101
13
0
14 Mar 2024
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu
Chi-Pin Huang
Jr-Jen Chen
Kai-Po Chang
Yung-Hsuan Lai
Fu-En Yang
Yu-Chiang Frank Wang
CLL
VLM
97
9
0
14 Mar 2024
A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu
Kaiming He
94
37
0
13 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP
VLM
94
17
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
97
12
0
12 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat
Longju Bai
Joan Nwatu
Rada Mihalcea
76
6
0
12 Mar 2024
Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Tarun Kalluri
Bodhisattwa Prasad Majumder
Manmohan Chandraker
VLM
80
5
0
08 Mar 2024
Face2Diffusion for Fast and Editable Face Personalization
Kaede Shiohara
Toshihiko Yamasaki
DiffM
60
12
0
08 Mar 2024
Controllable Generation with Text-to-Image Diffusion Models: A Survey
Pu Cao
Feng Zhou
Qing-Huang Song
Lu Yang
132
38
0
07 Mar 2024
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision
Yajie Liu
Pu Ge
Qingjie Liu
Di Huang
125
2
0
06 Mar 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser
Sumith Kulal
A. Blattmann
Rahim Entezari
Jonas Muller
...
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
DiffM
321
1,410
0
05 Mar 2024
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Zheng Li
Xiang Li
Xinyi Fu
Xing Zhang
Weiqiang Wang
Shuo Chen
Jian Yang
VLM
104
43
0
05 Mar 2024
What do we learn from inverting CLIP models?
Hamid Kazemi
Atoosa Malemir Chegini
Jonas Geiping
Soheil Feizi
Tom Goldstein
55
6
0
05 Mar 2024