ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.08981
  4. Cited By
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize
  Long-Tail Visual Concepts

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
    VLM
ArXivPDFHTML

Papers citing "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"

50 / 850 papers shown
Title
Long-CLIP: Unlocking the Long-Text Capability of CLIP
Long-CLIP: Unlocking the Long-Text Capability of CLIP
Beichen Zhang
Pan Zhang
Xiao-wen Dong
Yuhang Zang
Jiaqi Wang
CLIP
VLM
45
110
0
22 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
36
1
0
22 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng-Wei Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM
MLLM
42
32
0
21 Mar 2024
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
45
2
0
21 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
  Understanding
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
49
106
0
19 Mar 2024
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data
  Quality over Quantity
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
Siddharth Joshi
Arnav Jain
Ali Payani
Baharan Mirzasoleiman
VLM
CLIP
38
8
0
18 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination
  for Simulated-World Control
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
50
34
0
18 Mar 2024
GenView: Enhancing View Quality with Pretrained Generative Model for
  Self-Supervised Learning
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
Xiaojie Li
Yibo Yang
Hefei Ling
Jianlong Wu
Yue Yu
Guohao Li
Min Zhang
SSL
39
6
0
18 Mar 2024
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Yasufumi Kawano
Yoshimitsu Aoki
VLM
30
2
0
17 Mar 2024
Reward Guided Latent Consistency Distillation
Reward Guided Latent Consistency Distillation
Jiachen Li
Weixi Feng
Wenhu Chen
William Y. Wang
EGVM
36
11
0
16 Mar 2024
OMG: Occlusion-friendly Personalized Multi-concept Generation in
  Diffusion Models
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Zhe Kong
Yong Zhang
Tianyu Yang
Tao Wang
Kaihao Zhang
Bizhu Wu
Guanying Chen
Wei Liu
Wenhan Luo
DiffM
51
27
0
16 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for
  Remote Sensing Image-Text Retrival
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
35
2
0
16 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
43
189
0
14 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
39
3
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language
  Interface
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
51
10
0
14 Mar 2024
Select and Distill: Selective Dual-Teacher Knowledge Transfer for
  Continual Learning on Vision-Language Models
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu
Chi-Pin Huang
Jr-Jen Chen
Kai-Po Chang
Yung-Hsuan Lai
Fu-En Yang
Yu-Chiang Frank Wang
CLL
VLM
50
7
0
14 Mar 2024
A Decade's Battle on Dataset Bias: Are We There Yet?
A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu
Kaiming He
50
28
0
13 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with
  Module-wise Pruning Error Metric
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP
VLM
63
15
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and
  Image Embeddings
Synth2^22: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
51
9
0
12 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model
  Performance and Annotation Cost
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat
Longju Bai
Joan Nwatu
Rada Mihalcea
44
6
0
12 Mar 2024
Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in
  Images and Videos
Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Tarun Kalluri
Bodhisattwa Prasad Majumder
Manmohan Chandraker
VLM
47
4
0
08 Mar 2024
Face2Diffusion for Fast and Editable Face Personalization
Face2Diffusion for Fast and Editable Face Personalization
Kaede Shiohara
Toshihiko Yamasaki
DiffM
22
11
0
08 Mar 2024
Controllable Generation with Text-to-Image Diffusion Models: A Survey
Controllable Generation with Text-to-Image Diffusion Models: A Survey
Pu Cao
Feng Zhou
Qing-Huang Song
Lu Yang
78
37
0
07 Mar 2024
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary
  Semantic Segmentation from Text Supervision
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision
Yajie Liu
Pu Ge
Qingjie Liu
Di Huang
75
2
0
06 Mar 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser
Sumith Kulal
A. Blattmann
Rahim Entezari
Jonas Muller
...
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
DiffM
147
1,089
0
05 Mar 2024
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Zheng Li
Xiang Li
Xinyi Fu
Xing Zhang
Weiqiang Wang
Shuo Chen
Jian Yang
VLM
47
36
0
05 Mar 2024
What do we learn from inverting CLIP models?
What do we learn from inverting CLIP models?
Hamid Kazemi
Atoosa Malemir Chegini
Jonas Geiping
S. Feizi
Tom Goldstein
38
3
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi Ma
Kamalika Chaudhuri
Chuan Guo
76
3
0
04 Mar 2024
Peacock: A Family of Arabic Multimodal Large Language Models and
  Benchmarks
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Gagan Bhatia
Abdelrahman Mohamed
Muhammad Abdul-Mageed
VLM
LRM
35
11
0
01 Mar 2024
Improving Explicit Spatial Relationships in Text-to-Image Generation
  through an Automatically Derived Dataset
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
A. Soroa
Eneko Agirre
Frank Keller
EGVM
40
2
0
01 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
89
180
0
29 Feb 2024
Grounding Language Models for Visual Entity Recognition
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
51
9
0
28 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing
  Different Modalities as Different Languages
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
33
7
0
25 Feb 2024
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation
  Framework for Large Vision Language Models
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
Chaoya Jiang
Wei Ye
Mengfan Dong
Hongrui Jia
Haiyang Xu
Mingshi Yan
Ji Zhang
Shikun Zhang
VLM
MLLM
48
15
0
24 Feb 2024
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension
  with Enhanced Visual Knowledge Alignment
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
Yunxin Li
Xinyu Chen
Baotian Hu
Haoyuan Shi
Min-Ling Zhang
44
3
0
21 Feb 2024
Learning the Unlearned: Mitigating Feature Suppression in Contrastive
  Learning
Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning
Jihai Zhang
Xiang Lan
Xiaoye Qu
Yu Cheng
Mengling Feng
Bryan Hooi
SSL
26
4
0
19 Feb 2024
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot
  Interaction
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction
Jie Xu
Hanbo Zhang
Xinghang Li
Huaping Liu
Xuguang Lan
Tao Kong
LM&Ro
43
3
0
19 Feb 2024
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language
  Models
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Guiming Hardy Chen
Shunian Chen
Ruifei Zhang
Junying Chen
Xiangbo Wu
Zhiyi Zhang
Zhihong Chen
Jianquan Li
Xiang Wan
Benyou Wang
VLM
SyDa
41
129
0
18 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
38
2
0
18 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for
  Medical LVLM
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
37
52
0
14 Feb 2024
Pretraining Vision-Language Model for Difference Visual Question
  Answering in Longitudinal Chest X-rays
Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Yeongjae Cho
Taehee Kim
Heejun Shin
Sungzoon Cho
Dongmyung Shin
15
2
0
14 Feb 2024
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
Shengfang Zhai
Weilong Wang
Jiajun Li
Yinpeng Dong
Hang Su
Qingni Shen
EGVM
49
3
0
12 Feb 2024
Examining Gender and Racial Bias in Large Vision-Language Models Using a
  Novel Dataset of Parallel Images
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images
Kathleen C. Fraser
S. Kiritchenko
54
34
0
08 Feb 2024
Multimodal Unsupervised Domain Generalization by Retrieving Across the
  Modality Gap
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao
Christian So
Theodoros Tsiligkaridis
Brian Kulis
41
0
0
06 Feb 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled
  Visual-Motional Tokenization
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Yang Jin
Zhicheng Sun
Kun Xu
Kun Xu
Liwei Chen
...
Yuliang Liu
Di Zhang
Yang Song
Kun Gai
Yadong Mu
VGen
55
42
0
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual
  Question Answering
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
39
2
0
04 Feb 2024
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection
  Method for Multimodal Contrastive Learning
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Kevin G. Jamieson
S. Du
33
5
0
03 Feb 2024
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
Hasan Hammoud
Hani Itani
Fabio Pizzati
Philip Torr
Adel Bibi
Guohao Li
CLIP
VLM
122
37
0
02 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
36
9
0
02 Feb 2024
Towards 3D Molecule-Text Interpretation in Language Models
Towards 3D Molecule-Text Interpretation in Language Models
Sihang Li
Zhiyuan Liu
Yancheng Luo
Xiang Wang
Xiangnan He
Kenji Kawaguchi
Tat-Seng Chua
Qi Tian
AI4CE
40
43
0
25 Jan 2024
Previous
123...678...151617
Next