2102.08981
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Papers citing
"Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 871 papers shown
Title
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
16
0
0
20 Jun 2025
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie
Zhenheng Yang
Mike Zheng Shou
VGen
44
0
0
18 Jun 2025
Image Corruption-Inspired Membership Inference Attacks against Large Vision-Language Models
Zongyu Wu
Minhua Lin
Zhiwei Zhang
Fali Wang
Xianren Zhang
Xiang Zhang
Suhang Wang
36
0
0
14 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
65
0
0
11 Jun 2025
Info-Coevolution: An Efficient Framework for Data Model Coevolution
Ziheng Qin
Hailun Xu
Wei Chee Yew
Qi Jia
Yang Luo
Kanchan Sarkar
Danhui Guan
Kai Wang
Yang You
32
0
0
09 Jun 2025
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Jiatao Gu
Tianrong Chen
David Berthelot
Huangjie Zheng
Yuyang Wang
Ruixiang Zhang
Laurent Dinh
Miguel Angel Bautista
Josh Susskind
Shuangfei Zhai
45
0
0
06 Jun 2025
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Guangzhao Li
Yanming Yang
Chenxi Song
Chi Zhang
DiffM
VGen
107
0
0
05 Jun 2025
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
Revant Teotia
Candace Ross
Karen Ullrich
S. Chopra
Adriana Romero-Soriano
Melissa Hall
Matthew Muckley
EGVM
VLM
155
0
0
05 Jun 2025
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou
Le Xue
Honglu Zhou
Silvio Savarese
Ran Xu
Caiming Xiong
Chris Callison-Burch
Mark Yatskar
Juan Carlos Niebles
50
0
0
02 Jun 2025
Data Pruning by Information Maximization
Haoru Tan
Sitong Wu
Wei Huang
Shizhen Zhao
Xiaojuan Qi
61
1
0
02 Jun 2025
Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity
Yuya Kobayashi
Yuhta Takida
Takashi Shibuya
Yuki Mitsufuji
DiffM
54
0
0
02 Jun 2025
Entity Image and Mixed-Modal Image Retrieval Datasets
Cristian-Ioan Blaga
Paul Suganthan
Sahil Dua
Krishna Srinivasan
Enrique Alfonseca
Peter Dornbach
Tom Duerig
I. Zitouni
Zhe Dong
VLM
25
0
0
02 Jun 2025
Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space
Si Wu
Sebastian Bruch
59
0
0
29 May 2025
Vid-SME: Membership Inference Attacks against Large Video Understanding Models
Qi Li
Runpeng Yu
Xinchao Wang
24
2
0
29 May 2025
QuARI: Query Adaptive Retrieval Improvement
Eric Xing
Abby Stylianou
Robert Pless
Nathan Jacobs
VLM
27
0
0
27 May 2025
What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
Jianghang Lin
Yue Hu
Jiangtao Shen
Yunhang Shen
Liujuan Cao
Shengchuan Zhang
Chia-Wen Lin
ObjD
VLM
211
0
0
26 May 2025
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun
Yitong Li
Yuchen Zhuang
Niao He
Hanjun Dai
Bo Dai
VLM
86
0
0
25 May 2025
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Yu Zhang
Jialei Zhou
Xinchen Li
Qi Zhang
Zhongwei Wan
Tianyu Wang
Duoqian Miao
Changwei Wang
LongBing Cao
DiffM
63
2
0
25 May 2025
TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP
Yuliang Cai
Jesse Thomason
Mohammad Rostami
VLM
27
0
0
24 May 2025
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
G. Meng
Sunan He
Jinpeng Wang
Tao Dai
Letian Zhang
Jieming Zhu
Qing Li
Gang Wang
Rui Zhang
Yong Jiang
VLM
296
0
0
24 May 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi
Hyomin Kim
Yoonjin Oh
Yongjin Kim
Donghoon Lee
DaeJin Jo
Jongmin Kim
Junyeob Baek
Sungjin Ahn
Sungwoong Kim
MLLM
VLM
480
0
0
23 May 2025
Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
Soh Takahashi
Masaru Sasaki
Ken Takeda
Masafumi Oizumi
58
0
0
22 May 2025
DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
Qirui Jiao
Daoyuan Chen
Yilun Huang
Xika Lin
Ying Shen
Yaliang Li
VLM
66
0
0
22 May 2025
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas
Mohammad Nur Hossain Khan
Bashima Islam
104
0
0
21 May 2025
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang
Ye Tian
Bowen Li
Xinchen Zhang
Ke Shen
Yunhai Tong
Mengdi Wang
VLM
LRM
141
6
0
21 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
74
0
0
16 May 2025
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Bingda Tang
Boyang Zheng
Xichen Pan
Sayak Paul
Saining Xie
78
0
0
15 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
79
0
0
13 May 2025
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws
Xiyuan Wei
Ming Lin
Fanjiang Ye
Fengguang Song
Liangliang Cao
My T. Thai
Tianbao Yang
LLMSV
103
0
0
10 May 2025
Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding
Dawei Huang
Qing Li
Chuan Yan
Zebang Cheng
Jiaming Ji
Xiang Li
Yangqiu Song
Xiaobei Wang
Zheng Lian
Xiaojiang Peng
65
1
0
10 May 2025
Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models
Lennart Stöpler
Rufat Asadli
Mitja Nikolaus
Ryan Cotterell
Alex Warstadt
LRM
80
2
0
09 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
155
1
0
08 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
182
1
0
08 May 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin
Teng Wang
Yixiao Ge
Yuying Ge
Zhichao Lu
Ying Wei
Qingfu Zhang
Zhenan Sun
Ying Shan
MLLM
VLM
146
5
0
08 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
303
1
0
05 May 2025
Using Knowledge Graphs to harvest datasets for efficient CLIP model training
Simon Ging
Sebastian Walter
Jelena Bratulić
Johannes Dienert
Hannah Bast
Thomas Brox
CLIP
63
0
0
05 May 2025
Dynamic Robot Tool Use with Vision Language Models
Noah Trupin
Zixing Wang
A. H. Qureshi
85
0
0
02 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
105
0
0
01 May 2025
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
Quentin Guimard
Moreno D'Incà
Massimiliano Mancini
Elisa Ricci
SSL
138
0
0
29 Apr 2025
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
Jiamin Chang
Haoyang Li
Hammond Pearce
Ruoxi Sun
Yue Liu
Minhui Xue
83
0
0
28 Apr 2025
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
C. Kim
Jihwan Moon
Sangwoo Moon
Heeseung Yun
Sihaeng Lee
Aniruddha Kembhavi
Soonyoung Lee
Gunhee Kim
Sangho Lee
Christopher Clark
101
0
0
21 Apr 2025
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection
Weijun Zhuang
Qizhang Li
Xin Li
Ming-Yu Liu
Xiaopeng Hong
Feng Gao
Fan Yang
W. Zuo
79
0
0
20 Apr 2025
ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis
Andrea Rigo
Luca Stornaiuolo
Mauro Martino
Bruno Lepri
N. Sebe
85
0
0
18 Apr 2025
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Shin'ya Yamaguchi
Dewei Feng
Sekitoshi Kanai
Kazuki Adachi
Daiki Chijiwa
VLM
88
2
0
17 Apr 2025
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training
Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang
MLLM
VLM
269
2
0
17 Apr 2025
Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis
Shravan Chaudhari
Trilokya Akula
Yoon Kim
Tom Blake
LRM
87
0
0
16 Apr 2025
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
Junke Wang
Zhi Tian
Xinyu Wang
Xinyu Zhang
Weilin Huang
Zuxuan Wu
Yu Jiang
VGen
165
17
0
15 Apr 2025
Enhancing Features in Long-tailed Data Using Large Vision Model
Pengxiao Han
Changkun Ye
Jinguang Tong
Cuicui Jiang
Jie Hong
Li Fang
Xuesong Li
VLM
246
0
0
15 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
Xuelong Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
74
5
0
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Zhikai Wu
Yize Zhang
...
Bohan Zeng
Wei Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen
VLM
136
2
0
14 Apr 2025