ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.08981
  4. Cited By
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize
  Long-Tail Visual Concepts

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
    VLM
ArXivPDFHTML

Papers citing "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"

50 / 850 papers shown
Title
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
  Fine-grained Correctional Human Feedback
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
M. Steyvers
Yuan Yao
Haoye Zhang
Taiwen He
Yifeng Han
...
Xinyue Hu
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
Tat-Seng Chua
MLLM
VLM
150
182
0
01 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
  representations to LLMs and Emergent Cross-modal Reasoning
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
41
46
0
30 Nov 2023
MLLMs-Augmented Visual-Language Representation Learning
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
29
11
0
30 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of
  Video-Language Models
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
42
30
0
29 Nov 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced
  Training
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
49
44
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
87
413
0
28 Nov 2023
Large Language Models Meet Computer Vision: A Brief Survey
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
29
4
0
28 Nov 2023
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage
  and Sharing in LLMs
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Baotian Hu
Wei Wang
Xiaochun Cao
Min Zhang
29
4
0
27 Nov 2023
EgoThink: Evaluating First-Person Perspective Thinking Capability of
  Vision-Language Models
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
Sijie Cheng
Zhicheng Guo
Jingwen Wu
Kechen Fang
Peng Li
Huaping Liu
Yang Liu
EgoV
LRM
46
16
0
27 Nov 2023
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image
  Generation
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
Yuhui Zhang
Brandon McKinzie
Zhe Gan
Vaishaal Shankar
Alexander Toshev
31
3
0
27 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online
  Communities
Fully Authentic Visual Question Answering Dataset from Online Communities
Chongyan Chen
Mengchen Liu
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
54
5
0
27 Nov 2023
Beyond Pixels: Exploring Human-Readable SVG Generation for Simple Images
  with Vision Language Models
Beyond Pixels: Exploring Human-Readable SVG Generation for Simple Images with Vision Language Models
Tong Zhang
Haoyang Liu
Peiyan Zhang
Yuxuan Cheng
Haohan Wang
14
4
0
27 Nov 2023
Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective
Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective
Sahil Verma
Gantavya Bhatt
Avi Schwarzschild
Soumye Singhal
Arnav M. Das
Chirag Shah
John P Dickerson
Jeff Bilmes
J. Bilmes
AAML
61
1
0
25 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact
  with Humans via Natural Language Feedback
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
40
59
0
16 Nov 2023
Unlock the Power: Competitive Distillation for Multi-Modal Large
  Language Models
Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models
Xinwei Li
Li Lin
Shuai Wang
Chen Qian
25
3
0
14 Nov 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction
  Tuning
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM
VLM
38
94
0
13 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in
  Video-Language Models
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
.Ilker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
41
15
0
13 Nov 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
  Modality Collaboration
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye
Haiyang Xu
Jiabo Ye
Mingshi Yan
Anwen Hu
Haowei Liu
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM
VLM
129
389
0
07 Nov 2023
Exploring Dataset-Scale Indicators of Data Quality
Exploring Dataset-Scale Indicators of Data Quality
Ben Feuer
Chinmay Hegde
29
1
0
07 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models
  with Generative Negative Mining
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM
CoGe
28
12
0
07 Nov 2023
CoVLM: Composing Visual Entities and Relationships in Large Language
  Models Via Communicative Decoding
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Junyan Li
Delin Chen
Yining Hong
Zhenfang Chen
Peihao Chen
Yikang Shen
Chuang Gan
MLLM
38
15
0
06 Nov 2023
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks
Xinlu Zhang
Yujie Lu
Weizhi Wang
An Yan
Jun Yan
Lianke Qin
Heng Wang
Xifeng Yan
William Y. Wang
Linda R. Petzold
LM&MA
MLLM
ELM
38
77
0
02 Nov 2023
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Yifan Du
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Jinpeng Wang
Chuyuan Wang
Mingchen Cai
Ruihua Song
Ji-Rong Wen
VLM
MLLM
LRM
78
22
0
02 Nov 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language
  Understanding
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
51
28
0
29 Oct 2023
Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
  Segmentation
Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation
Fei Zhang
Tianfei Zhou
Boyang Li
Hao He
Chaofan Ma
Tianjiao Zhang
Jiangchao Yao
Ya Zhang
Yanfeng Wang
VLM
45
17
0
29 Oct 2023
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
Yuchen Suo
Linchao Zhu
Yi Yang
39
13
0
27 Oct 2023
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial
  Understanding
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
Haoxiang Wang
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Mehrdad Farajtabar
Sachin Mehta
Mohammad Rastegari
Oncel Tuzel
Hadi Pouransari
VLM
40
67
0
23 Oct 2023
Matryoshka Diffusion Models
Matryoshka Diffusion Models
Jiatao Gu
Shuangfei Zhai
Yizhen Zhang
Joshua M. Susskind
Navdeep Jaitly
DiffM
29
45
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
35
28
0
23 Oct 2023
Data Pruning via Moving-one-Sample-out
Data Pruning via Moving-one-Sample-out
Haoru Tan
Sitong Wu
Fei Du
Yukang Chen
Zhibin Wang
Fan Wang
Xiaojuan Qi
55
34
0
23 Oct 2023
Leveraging Image-Text Similarity and Caption Modification for the
  DataComp Challenge: Filtering Track and BYOD Track
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track
Shuhei Yokoo
Peifei Zhu
Yuchi Ishikawa
Mikihiro Tanaka
Masayoshi Kondo
Hirokatsu Kataoka
24
0
0
23 Oct 2023
Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image
  Generative Models
Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
Shawn Shan
Wenxin Ding
Josephine Passananti
Stanley Wu
Haitao Zheng
Ben Y. Zhao
SILM
DiffM
36
45
0
20 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
52
4
0
20 Oct 2023
MarineGPT: Unlocking Secrets of Ocean to the Public
MarineGPT: Unlocking Secrets of Ocean to the Public
Ziqiang Zheng
Jipeng Zhang
Tuan-Anh Vu
Shizhe Diao
Yue Him Wong Tim
Sai-Kit Yeung
53
12
0
20 Oct 2023
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
27
10
0
20 Oct 2023
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and
  Uni-Modal Adapter
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
Zhiyuan Liu
Sihang Li
Yancheng Luo
Hao Fei
Yixin Cao
Kenji Kawaguchi
Xiang Wang
Tat-Seng Chua
38
82
0
19 Oct 2023
Weakly-Supervised Semantic Segmentation with Image-Level Labels: from
  Traditional Models to Foundation Models
Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
Zhaozheng Chen
Qianru Sun
VLM
37
7
0
19 Oct 2023
Learning from Rich Semantics and Coarse Locations for Long-tailed Object
  Detection
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection
Lingchen Meng
Xiyang Dai
Jianwei Yang
Dongdong Chen
Yinpeng Chen
Mengchen Liu
Yi-Ling Chen
Zuxuan Wu
Lu Yuan
Yu-Gang Jiang
21
6
0
18 Oct 2023
TOSS:High-quality Text-guided Novel View Synthesis from a Single Image
TOSS:High-quality Text-guided Novel View Synthesis from a Single Image
Yukai Shi
Jianan Wang
He Cao
Boshi Tang
Xianbiao Qi
Tianyu Yang
Yukun Huang
Shilong Liu
Lei Zhang
H. Shum
DiffM
32
20
0
16 Oct 2023
Leveraging Vision-Language Models for Improving Domain Generalization in
  Image Classification
Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification
Sravanti Addepalli
Ashish Ramayee Asokan
Lakshay Sharma
R. V. Babu
VLM
29
15
0
12 Oct 2023
Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task
  Instruction Tuning
Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
Junyu Lu
Di Zhang
Xiaojun Wu
Xinyu Gao
Ruyi Gan
Jiaxing Zhang
Yan Song
Pingjian Zhang
VLM
MLLM
22
7
0
12 Oct 2023
CrIBo: Self-Supervised Learning via Cross-Image Object-Level
  Bootstrapping
CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping
Tim Lebailly
Thomas Stegmüller
Behzad Bozorgtabar
Jean-Philippe Thiran
Tinne Tuytelaars
SSL
57
6
0
11 Oct 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
VeCLIP: Improving CLIP Training via Visual-enriched Captions
Zhengfeng Lai
Haotian Zhang
Bowen Zhang
Wentao Wu
Haoping Bai
...
Zhe Gan
Jiulong Shan
Chen-Nee Chuah
Yinfei Yang
Meng Cao
CLIP
VLM
37
28
0
11 Oct 2023
On the Evaluation and Refinement of Vision-Language Instruction Tuning
  Datasets
On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets
Ning Liao
Shaofeng Zhang
Renqiu Xia
Min Cao
Yu Qiao
Junchi Yan
MLLM
34
0
0
10 Oct 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
59
15
0
10 Oct 2023
Implicit Concept Removal of Diffusion Models
Implicit Concept Removal of Diffusion Models
Zhili Liu
Kai Chen
Yifan Zhang
Jianhua Han
Lanqing Hong
Hang Xu
Zhenguo Li
Dit-Yan Yeung
James T. Kwok
28
13
0
09 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
24
2
0
08 Oct 2023
Building an Open-Vocabulary Video CLIP Model with Better Architectures,
  Optimization and Data
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
CLIP
VLM
41
21
0
08 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video
  Understanding Tasks
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
37
0
0
07 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
50
25
0
07 Oct 2023
Previous
123...8910...151617
Next