
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
    VLM
arXiv: 2102.08981

Papers citing "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"

50 / 871 papers shown
Graph Perceiver IO: A General Architecture for Graph Structured Data
Seyun Bae
Hoyoon Byun
Changdae Oh
Yoon-Sik Cho
Kyungwoo Song
GNN
258
3
0
24 Feb 2025
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin
Zehao Xiao
Pan Zhou
Shujian Yu
Jiayi Shen
Jan-Jakob Sonke
E. Gavves
177
1
0
24 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
531
0
0
21 Feb 2025
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
M. Drozdzal
Adriana Romero Soriano
OCL VLM CoGe
161
3
0
19 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Zehan Li
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Yansen Wang
81
0
0
19 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Tongtian Yue
Junxian Cai
Kai Ma
Qingbin Liu
Xi Chen
Jing Liu
125
1
0
17 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begüm Demir
Ioannis Papoutsis
VLM
196
1
0
13 Feb 2025
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
H. Seo
Wongi Jeong
Jae-sun Seo
Se Young Chun
140
0
0
12 Feb 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi
Kuan-Chieh Wang
Guocheng Qian
Hanrong Ye
Runtao Liu
Sergey Tulyakov
Kfir Aberman
Dan Xu
LRM
97
2
0
12 Feb 2025
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate
Isaac Slaughter
Kyra Wilson
Mona Diab
Aylin Caliskan
209
1
0
11 Feb 2025
UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
Weijia Mao
Zhiyong Yang
Mike Zheng Shou
MoE
196
1
0
10 Feb 2025
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Ahmed Masry
Juan A. Rodriguez
Tianyu Zhang
Suyuchen Wang
Chao Wang
...
I. Laradji
David Vazquez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
96
0
0
03 Feb 2025
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Dongwon Kim
Ju He
Qihang Yu
Chenglin Yang
Xiaohui Shen
Suha Kwak
Liang-Chieh Chen
VLM
137
11
0
13 Jan 2025
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Sheng Zhang
Yanbo Xu
Naoto Usuyama
Hanwen Xu
J. Bagga
...
Carlo Bifulco
M. Lungren
Tristan Naumann
Sheng Wang
Hoifung Poon
LM&MA MedIm
244
235
0
10 Jan 2025
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
S. Joshi
Besmira Nushi
Vidhisha Balachandran
Varun Chandrasekaran
Vibhav Vineet
Neel Joshi
Baharan Mirzasoleiman
MLLM VLM
170
0
0
07 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
171
15
0
06 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
229
16
0
02 Jan 2025
Altogether: Image Captioning via Re-aligning Alt-text
Hu Xu
Po-Yao (Bernie) Huang
Xiaoqing Ellen Tan
Ching-Feng Yeh
Jacob Kahn
...
Luke Zettlemoyer
Wen-tau Yih
Shang-Wen Li
Saining Xie
Christoph Feichtenhofer
DiffM
91
9
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
284
5
0
31 Dec 2024
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation
Shijie Zhou
Ruiyi Zhang
Yufan Zhou
Changyou Chen
VLM
117
1
0
20 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP VLM
178
8
0
20 Dec 2024
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Taein Son
Soo Won Seo
Jisong Kim
S. Lee
Jun Won Choi
VGen
135
0
0
18 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
332
2
0
18 Dec 2024
Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model
Dafna Schwartz
Lori Quinn
Nora E. Fritz
Lisa M. Muratori
Jeffery M. Hausdorff
Ran Gilad Bachrach
108
0
0
15 Dec 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
Tong Wu
Yinghao Xu
Ryan Po
Mengchen Zhang
Guandao Yang
Jiaqi Wang
Ziqiang Liu
Dahua Lin
Gordon Wetzstein
113
0
0
10 Dec 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Jiuhai Chen
Jianwei Yang
Haiping Wu
Dianqi Li
Jianfeng Gao
Tianyi Zhou
Bin Xiao
VLM
120
6
0
05 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM CLIP
138
4
0
04 Dec 2024
Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks
Joseph Raj Vishal
Divesh Basina
Aarya Choudhary
Bharatesh Chakravarthi
149
1
0
02 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
346
3
0
02 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
177
9
0
02 Dec 2024
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang
Chen Ju
Weixiong Lin
Shuai Xiao
Mengting Chen
...
Mingshuai Yao
Jinsong Lan
Ying Chen
Qingwen Liu
Yanfeng Wang
VLM CLIP
121
4
0
30 Nov 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM MLLM
159
17
0
27 Nov 2024
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu
Jie Kong
Jiazhi Wang
Xiao Pan
Bo Lin
Qiang Liu
DiffM
128
1
0
26 Nov 2024
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Ruichuan An
Sihan Yang
Ming Lu
Kai Zeng
Yulin Luo
...
Hao Liang
Qi She
Shanghang Zhang
Wentao Zhang
196
11
0
18 Nov 2024
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi
Minjing Dong
Chang Xu
VLM
118
3
0
14 Nov 2024
Boosting Latent Diffusion with Perceptual Objectives
Tariq Berrada
Pietro Astolfi
Jakob Verbeek
Melissa Hall
Marton Havasi
M. Drozdzal
Yohann Benchetrit
Adriana Romero Soriano
Karteek Alahari
77
0
0
06 Nov 2024
VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation
Haochen Zhang
Nader Zantout
Pujith Kachana
Zongyuan Wu
Ji Zhang
Wenshan Wang
3DV LM&Ro
86
6
0
05 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP VLM
122
4
0
05 Nov 2024
On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models
Tariq Berrada Ifriqi
Pietro Astolfi
Melissa Hall
Reyhane Askari Hemmat
Yohann Benchetrit
...
Matthew Muckley
Karteek Alahari
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
AI4CE VLM
139
4
0
05 Nov 2024
Membership Inference Attacks against Large Vision-Language Models
Zhan Li
Yongtao Wu
Yihang Chen
F. Tonin
Elias Abad Rocamora
Volkan Cevher
78
9
0
05 Nov 2024
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe CLIP
143
14
0
04 Nov 2024
SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey
Kien X. Nguyen
Fengchun Qiao
Arthur Trembanis
Xi Peng
84
1
0
31 Oct 2024
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
Tianyu Yang
Lisen Dai
Zheyuan Liu
Minhao Cheng
Meng Jiang
Yapeng Tian
VLM MU
96
5
0
30 Oct 2024
Face-MLLM: A Large Face Perception Model
Haomiao Sun
Mingjie He
Tianheng Lian
Hu Han
Shiguang Shan
VLM CVBM LRM
67
6
0
28 Oct 2024
Rectified Diffusion Guidance for Conditional Generation
Mengfei Xia
Nan Xue
Yujun Shen
Ran Yi
Tieliang Gong
Yang Liu
DiffM
70
6
0
24 Oct 2024
Probabilistic Language-Image Pre-Training
Sanghyuk Chun
Wonjae Kim
Song Park
Sangdoo Yun
MLLM VLM CLIP
489
6
2
24 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
Bin Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
67
1
0
21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift
Haobin Li
Peng Hu
Qianjun Zhang
Xi Peng
Xiting Liu
Mouxing Yang
TTA
87
0
0
21 Oct 2024
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
...
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
Jiankang Deng
MLLM VLM
72
1
0
18 Oct 2024
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
ZiDong Wang
Zeyu Lu
Di Huang
Cai Zhou
Wanli Ouyang
Lei Bai
123
6
0
17 Oct 2024