Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2111.11432
Cited By
Florence: A New Foundation Model for Computer Vision
22 November 2021
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
Jianfeng Gao
Houdong Hu
Xuedong Huang
Boxin Li
Chunyuan Li
Ce Liu
Mengchen Liu
Zicheng Liu
Yumao Lu
Yu Shi
Lijuan Wang
Jianfeng Wang
Bin Xiao
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Florence: A New Foundation Model for Computer Vision"
50 / 664 papers shown
Title
Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D Shifted Window Transformer
Zehui Li
Akashaditya Das
W. Beardall
Yiren Zhao
Guy-Bart Stan
18
4
0
08 Jun 2023
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
Jielin Qiu
Jiacheng Zhu
William Jongwon Han
Aditesh Kumar
Karthik Mittal
...
Linjie Li
Jianfeng Wang
Ding Zhao
Bo Li
Lijuan Wang
VGen
26
5
0
07 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM
MLLM
32
18
0
06 Jun 2023
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja
A. Piergiovanni
A. Angelova
27
0
0
06 Jun 2023
Understanding Segment Anything Model: SAM is Biased Towards Texture Rather than Shape
Chaoning Zhang
Yu Qiao
Shehbaz Tariq
Sheng Zheng
Chenshuang Zhang
Chenghao Li
Hyundong Shin
Choong Seon Hong
VLM
42
10
0
03 Jun 2023
Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Cristina Menghini
Andrew T. Delworth
Stephen H. Bach
VLM
18
22
0
02 Jun 2023
Towards In-context Scene Understanding
Ivana Balazevic
David Steiner
Nikhil Parthasarathy
Relja Arandjelović
Olivier J. Hénaff
35
28
0
02 Jun 2023
Consistency-guided Prompt Learning for Vision-Language Models
Shuvendu Roy
Ali Etemad
VLM
VPVLM
28
52
0
01 Jun 2023
Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Jun Chen
Deyao Zhu
Guocheng Qian
Guohao Li
Zhicheng Yan
Chenchen Zhu
Fanyi Xiao
Mohamed Elhoseiny
Sean Culatana
VLM
44
11
0
01 Jun 2023
Joint Adaptive Representations for Image-Language Learning
A. Piergiovanni
A. Angelova
VLM
34
0
0
31 May 2023
Multi-modal Queried Object Detection in the Wild
Yifan Xu
Mengdan Zhang
Chaoyou Fu
Peixian Chen
Xiaoshan Yang
Ke Li
Changsheng Xu
ObjD
VLM
38
30
0
30 May 2023
Learning without Forgetting for Vision-Language Models
Da-Wei Zhou
Yuanhan Zhang
Jingyi Ning
Jingyi Ning
De-Chuan Zhan
De-Chuan Zhan
Ziwei Liu
VLM
CLL
74
37
0
30 May 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
40
97
0
29 May 2023
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
VLM
34
22
0
29 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
28
17
0
29 May 2023
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models
Zhiwei Jia
P. Narayana
Arjun Reddy Akula
G. Pruthi
Haoran Su
Sugato Basu
Varun Jampani
VLM
OffRL
15
4
0
28 May 2023
Learning from Children: Improving Image-Caption Pretraining via Curriculum
Hammad A. Ayyubi
R. Lokesh
Alireza Zareian
Bohong Wu
Shih-Fu Chang
VLM
CLIP
38
2
0
27 May 2023
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Jannik Kossen
Mark Collier
Basil Mustafa
Tianlin Li
Xiaohua Zhai
Lucas Beyer
Andreas Steiner
Jesse Berent
Rodolphe Jenatton
Efi Kokiopoulou
VLM
45
12
0
26 May 2023
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Shivam Sharma
S Ramaneswaran
Udit Arora
Md. Shad Akhtar
Tanmoy Chakraborty
38
9
0
25 May 2023
ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers
J. Yao
Xinggang Wang
Shusheng Yang
Baoyuan Wang
ViT
46
57
0
24 May 2023
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li
Binyuan Hui
Zhichao Yin
Min Yang
Fei Huang
Yongbin Li
MoE
35
19
0
24 May 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
26
2
0
23 May 2023
Parts of Speech-Grounded Subspaces in Vision-Language Models
James Oldfield
Christos Tzelepis
Yannis Panagakis
M. Nicolaou
Ioannis Patras
26
9
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
33
25
0
23 May 2023
i-Code Studio: A Configurable and Composable Framework for Integrative AI
Yuwei Fang
Mahmoud Khademi
Chenguang Zhu
Ziyi Yang
Reid Pryzant
...
Yao Qian
Takuya Yoshioka
Lu Yuan
Michael Zeng
Xuedong Huang
38
2
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Ibrahim M. Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
47
59
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
33
8
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
17
2
0
21 May 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
39
161
0
19 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
17
1
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
50
116
0
18 May 2023
Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning
Wenhao Li
Dan Qiao
Baoxiang Wang
Xiangfeng Wang
Bo Jin
H. Zha
46
5
0
18 May 2023
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners
Xuehai He
Weixi Feng
Tsu-Jui Fu
Varun Jampani
Arjun Reddy Akula
P. Narayana
Sugato Basu
William Yang Wang
Junfeng Fang
DiffM
57
7
0
18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Joey Tianyi Zhou
Heng Ji
KELM
VGen
27
26
0
18 May 2023
Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality
Jialing Yuan
Ye Yu
Gaurav Mittal
Matthew Hall
Sandra Sajeev
Mei Chen
27
9
0
17 May 2023
Online Continual Learning Without the Storage Constraint
Ameya Prabhu
Zhipeng Cai
P. Dokania
Philip Torr
V. Koltun
Ozan Sener
CLL
120
31
0
16 May 2023
A Comprehensive Survey on Segment Anything Model for Vision and Beyond
Chunhui Zhang
Li Liu
Yawen Cui
Guanjie Huang
Weilin Lin
Yiqian Yang
Yuehong Hu
VLM
45
90
0
14 May 2023
On the Hidden Mystery of OCR in Large Multimodal Models
Yuliang Liu
Zhang Li
Mingxin Huang
Chunyuan Li
Dezhi Peng
Mingyu Liu
Lianwen Jin
Xiang Bai
VLM
MLLM
34
55
0
13 May 2023
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering
Chaoning Zhang
Fachrina Dewi Puspitasari
Sheng Zheng
Chenghao Li
Yu Qiao
...
Caiyan Qin
François Rameau
Lik-Hang Lee
Sung-Ho Bae
Choong Seon Hong
VLM
84
63
0
12 May 2023
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
Zhaoyang Zhang
Yantao Shen
Kunyu Shi
Zhaowei Cai
Jun Fang
Siqi Deng
Hao Yang
Davide Modolo
Zhuowen Tu
Stefano Soatto
VLM
28
2
0
11 May 2023
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
48
55
0
11 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
ViT
VLM
35
74
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
56
130
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
69
534
0
10 May 2023
Visual Tuning
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
62
38
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
56
855
0
09 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
47
2
0
03 May 2023
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang
Cem Subakan
Krishna Subramani
Junkai Wu
Tiago Tavares
Fabio Ayres
Paris Smaragdis
SSL
27
3
0
03 May 2023
Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples
Chenshuang Zhang
Chaoning Zhang
Taegoo Kang
Donghun Kim
Sung-Ho Bae
In So Kweon
AAML
VLM
47
3
0
01 May 2023
Previous
1
2
3
...
7
8
9
...
12
13
14
Next