2102.08981
Cited By
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Papers citing
"Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 850 papers shown
Title
Long-CLIP: Unlocking the Long-Text Capability of CLIP
Beichen Zhang
Pan Zhang
Xiao-wen Dong
Yuhang Zang
Jiaqi Wang
CLIP
VLM
45
110
0
22 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
36
1
0
22 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng-Wei Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM
MLLM
42
32
0
21 Mar 2024
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
45
2
0
21 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
49
106
0
19 Mar 2024
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
Siddharth Joshi
Arnav Jain
Ali Payani
Baharan Mirzasoleiman
VLM
CLIP
38
8
0
18 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
50
34
0
18 Mar 2024
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Xiaojie Li
Yibo Yang
Hefei Ling
Jianlong Wu
Yue Yu
Guohao Li
Min Zhang
SSL
39
6
0
18 Mar 2024
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Yasufumi Kawano
Yoshimitsu Aoki
VLM
30
2
0
17 Mar 2024
Reward Guided Latent Consistency Distillation
Jiachen Li
Weixi Feng
Wenhu Chen
William Y. Wang
EGVM
36
11
0
16 Mar 2024
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Zhe Kong
Yong Zhang
Tianyu Yang
Tao Wang
Kaihao Zhang
Bizhu Wu
Guanying Chen
Wei Liu
Wenhan Luo
DiffM
51
27
0
16 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
35
2
0
16 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
43
189
0
14 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
39
3
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
51
10
0
14 Mar 2024
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu
Chi-Pin Huang
Jr-Jen Chen
Kai-Po Chang
Yung-Hsuan Lai
Fu-En Yang
Yu-Chiang Frank Wang
CLL
VLM
50
7
0
14 Mar 2024
A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu
Kaiming He
50
28
0
13 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP
VLM
63
15
0
12 Mar 2024
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
51
9
0
12 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat
Longju Bai
Joan Nwatu
Rada Mihalcea
44
6
0
12 Mar 2024
Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Tarun Kalluri
Bodhisattwa Prasad Majumder
Manmohan Chandraker
VLM
47
4
0
08 Mar 2024
Face2Diffusion for Fast and Editable Face Personalization
Kaede Shiohara
Toshihiko Yamasaki
DiffM
22
11
0
08 Mar 2024
Controllable Generation with Text-to-Image Diffusion Models: A Survey
Pu Cao
Feng Zhou
Qing-Huang Song
Lu Yang
78
37
0
07 Mar 2024
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision
Yajie Liu
Pu Ge
Qingjie Liu
Di Huang
75
2
0
06 Mar 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser
Sumith Kulal
A. Blattmann
Rahim Entezari
Jonas Muller
...
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
DiffM
147
1,089
0
05 Mar 2024
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Zheng Li
Xiang Li
Xinyi Fu
Xing Zhang
Weiqiang Wang
Shuo Chen
Jian Yang
VLM
47
36
0
05 Mar 2024
What do we learn from inverting CLIP models?
Hamid Kazemi
Atoosa Malemir Chegini
Jonas Geiping
S. Feizi
Tom Goldstein
38
3
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi Ma
Kamalika Chaudhuri
Chuan Guo
76
3
0
04 Mar 2024
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Gagan Bhatia
Abdelrahman Mohamed
Muhammad Abdul-Mageed
VLM
LRM
35
11
0
01 Mar 2024
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
A. Soroa
Eneko Agirre
Frank Keller
EGVM
40
2
0
01 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
89
180
0
29 Feb 2024
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
51
9
0
28 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
33
7
0
25 Feb 2024
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
Chaoya Jiang
Wei Ye
Mengfan Dong
Hongrui Jia
Haiyang Xu
Mingshi Yan
Ji Zhang
Shikun Zhang
VLM
MLLM
48
15
0
24 Feb 2024
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
Yunxin Li
Xinyu Chen
Baotian Hu
Haoyuan Shi
Min-Ling Zhang
44
3
0
21 Feb 2024
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning
Jihai Zhang
Xiang Lan
Xiaoye Qu
Yu Cheng
Mengling Feng
Bryan Hooi
SSL
26
4
0
19 Feb 2024
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction
Jie Xu
Hanbo Zhang
Xinghang Li
Huaping Liu
Xuguang Lan
Tao Kong
LM&Ro
43
3
0
19 Feb 2024
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Guiming Hardy Chen
Shunian Chen
Ruifei Zhang
Junying Chen
Xiangbo Wu
Zhiyi Zhang
Zhihong Chen
Jianquan Li
Xiang Wan
Benyou Wang
VLM
SyDa
41
129
0
18 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
38
2
0
18 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
37
52
0
14 Feb 2024
Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Yeongjae Cho
Taehee Kim
Heejun Shin
Sungzoon Cho
Dongmyung Shin
15
2
0
14 Feb 2024
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
Shengfang Zhai
Weilong Wang
Jiajun Li
Yinpeng Dong
Hang Su
Qingni Shen
EGVM
49
3
0
12 Feb 2024
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images
Kathleen C. Fraser
S. Kiritchenko
54
34
0
08 Feb 2024
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao
Christian So
Theodoros Tsiligkaridis
Brian Kulis
41
0
0
06 Feb 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Yang Jin
Zhicheng Sun
Kun Xu
Kun Xu
Liwei Chen
...
Yuliang Liu
Di Zhang
Yang Song
Kun Gai
Yadong Mu
VGen
55
42
0
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
39
2
0
04 Feb 2024
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Kevin G. Jamieson
S. Du
33
5
0
03 Feb 2024
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
Hasan Hammoud
Hani Itani
Fabio Pizzati
Philip Torr
Adel Bibi
Guohao Li
CLIP
VLM
122
37
0
02 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
36
9
0
02 Feb 2024
Towards 3D Molecule-Text Interpretation in Language Models
Sihang Li
Zhiyuan Liu
Yancheng Luo
Xiang Wang
Xiangnan He
Kenji Kawaguchi
Tat-Seng Chua
Qi Tian
AI4CE
40
43
0
25 Jan 2024