ResearchTrend.AI
  • Papers
  • Communities
  • Organizations
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2111.11432
  4. Cited By
Florence: A New Foundation Model for Computer Vision

Florence: A New Foundation Model for Computer Vision

22 November 2021
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
Jianfeng Gao
Houdong Hu
Xuedong Huang
Boxin Li
Chunyuan Li
Ce Liu
Mengchen Liu
Zicheng Liu
Yumao Lu
Yu Shi
Lijuan Wang
Jianfeng Wang
Bin Xiao
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
    VLM
ArXiv (abs)PDFHTML

Papers citing "Florence: A New Foundation Model for Computer Vision"

50 / 668 papers shown
Title
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets
  Prompt Engineering
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering
Chaoning Zhang
Fachrina Dewi Puspitasari
Sheng Zheng
Chenghao Li
Yu Qiao
...
Caiyan Qin
François Rameau
Lik-Hang Lee
Sung-Ho Bae
Choong Seon Hong
VLM
167
67
0
12 May 2023
Musketeer: Joint Training for Multi-task Vision Language Model with Task
  Explanation Prompts
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
Zhaoyang Zhang
Yantao Shen
Kunyu Shi
Zhaowei Cai
Jun Fang
Siqi Deng
Hao Yang
Davide Modolo
Zhuowen Tu
Stefano Soatto
VLM
83
2
0
11 May 2023
An Inverse Scaling Law for CLIP Training
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLMCLIP
125
58
0
11 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with
  Vision Transformers
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim
A. Angelova
Weicheng Kuo
ObjDViTVLM
90
80
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question
  Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
163
142
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
145
586
0
10 May 2023
Visual Tuning
Visual Tuning
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
177
41
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
223
945
0
09 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language
  Models in the Low-data Regime
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLMVPVLM
90
2
0
03 May 2023
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang
Cem Subakan
Krishna Subramani
Junkai Wu
Tiago Tavares
Fabio Ayres
Paris Smaragdis
SSL
81
2
0
03 May 2023
Attack-SAM: Towards Attacking Segment Anything Model With Adversarial
  Examples
Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples
Chenshuang Zhang
Chaoning Zhang
Taegoo Kang
Donghun Kim
Sung-Ho Bae
In So Kweon
AAMLVLM
99
3
0
01 May 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video
  Understanding System
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
173
57
0
27 Apr 2023
DataComp: In search of the next generation of multimodal datasets
DataComp: In search of the next generation of multimodal datasets
S. Gadre
Gabriel Ilharco
Alex Fang
J. Hayase
Georgios Smyrnis
...
A. Dimakis
J. Jitsev
Y. Carmon
Vaishaal Shankar
Ludwig Schmidt
VLM
143
452
0
27 Apr 2023
Vision Conformer: Incorporating Convolutions into Vision Transformer
  Layers
Vision Conformer: Incorporating Convolutions into Vision Transformer Layers
Brian Kenji Iwana
Akihiro Kusuda
ViT
81
2
0
27 Apr 2023
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
Sheng Liu
C. P. Huynh
Congmin Chen
Maxim Arap
Raffay Hamid
114
19
0
25 Apr 2023
Hypernymization of named entity-rich captions for grounding-based
  multi-modal pretraining
Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining
Giacomo Nebbia
Adriana Kovashka
110
0
0
25 Apr 2023
A Strong and Reproducible Object Detector with Only Public Datasets
A Strong and Reproducible Object Detector with Only Public Datasets
Tianhe Ren
Jianwei Yang
Siyi Liu
Ailing Zeng
Feng Li
Hao Zhang
Hongyang Li
Zhaoyang Zeng
Lei Zhang
ObjD
85
11
0
25 Apr 2023
Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards
  Boosted Few-Shot Parameter-Efficient Tuning
Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
Zhongzhi Yu
Shang Wu
Y. Fu
Shunyao Zhang
Yingyan Lin
88
6
0
25 Apr 2023
A Cookbook of Self-Supervised Learning
A Cookbook of Self-Supervised Learning
Randall Balestriero
Mark Ibrahim
Vlad Sobal
Ari S. Morcos
Shashank Shekhar
...
Pierre Fernandez
Amir Bar
Hamed Pirsiavash
Yann LeCun
Micah Goldblum
SyDaFedMLSSL
172
285
0
24 Apr 2023
Implicit Temporal Modeling with Learnable Alignment for Video
  Recognition
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
S. Tu
Qi Dai
Zuxuan Wu
Zhi-Qi Cheng
Hang-Rui Hu
Yu-Gang Jiang
109
37
0
20 Apr 2023
LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
Xianbiao Qi
Jianan Wang
Yihao Chen
Yukai Shi
Lei Zhang
98
21
0
19 Apr 2023
Visual Instruction Tuning
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDaVLMMLLM
587
4,950
0
17 Apr 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP
  Training
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
82
18
0
17 Apr 2023
Progressive Visual Prompt Learning with Contrastive Feature Re-formation
Progressive Visual Prompt Learning with Contrastive Feature Re-formation
C. Xu
Yuhan Zhu
Haocheng Shen
Fengyuan Shi
Boheng Chen
Yixuan Liao
Xiaoxin Chen
Limin Wang
VLM
107
22
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
141
112
0
17 Apr 2023
Leveraging sparse and shared feature activations for disentangled
  representation learning
Leveraging sparse and shared feature activations for disentangled representation learning
Marco Fumero
F. Wenzel
Luca Zancato
Alessandro Achille
Emanuele Rodolà
Stefano Soatto
Bernhard Schölkopf
Francesco Locatello
OODDRL
95
25
0
17 Apr 2023
Chain of Thought Prompt Tuning in Vision Language Models
Chain of Thought Prompt Tuning in Vision Language Models
Jiaxin Ge
Hongyin Luo
Siyuan Qian
Yulu Gan
Jie Fu
Shanghang Zhang
VLMLRMMLLM
109
29
0
16 Apr 2023
On the Opportunities and Challenges of Foundation Models for Geospatial
  Artificial Intelligence
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
122
134
0
13 Apr 2023
Segment Everything Everywhere All at Once
Segment Everything Everywhere All at Once
Xueyan Zou
Jianwei Yang
Hao Zhang
Feng Li
Linjie Li
Jianfeng Wang
Lijuan Wang
Jianfeng Gao
Yong Jae Lee
MLLMVLM
143
493
0
13 Apr 2023
Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
Mohit Sharma
Claudio Fantacci
Yuxiang Zhou
Skanda Koppula
N. Heess
Jonathan Scholz
Y. Aytar
VLM
111
31
0
13 Apr 2023
APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot
  Remote Sensing Image Generalization using CLIP
APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP
Mainak Singha
Ankit Jha
Bhupendra S. Solanki
Shirsha Bose
Biplab Banerjee
VLM
107
27
0
12 Apr 2023
A Billion-scale Foundation Model for Remote Sensing Images
A Billion-scale Foundation Model for Remote Sensing Images
Keumgang Cha
Junghoon Seo
Taekyung Lee
121
71
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM3DV
87
26
0
11 Apr 2023
Detection Transformer with Stable Matching
Detection Transformer with Stable Matching
Siyi Liu
Tianhe Ren
Jia-Yu Chen
Zhaoyang Zeng
Hao Zhang
...
Hongyang Li
Jun Huang
Hang Su
Jun Zhu
Lei Zhang
80
36
0
10 Apr 2023
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action
  Detection
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection
Wei-Jhe Huang
Jheng-Hsien Yeh
Min-Hung Chen
Gueter Josmy Faure
S. Lai
96
4
0
10 Apr 2023
On Robustness in Multimodal Learning
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
70
2
0
10 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
111
38
0
09 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer
  Pre-training
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
122
7
0
09 Apr 2023
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Syed Talal Wasim
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
M. Shah
VLMVPVLM
117
79
0
06 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
109
23
0
05 Apr 2023
Black Box Few-Shot Adaptation for Vision-Language models
Black Box Few-Shot Adaptation for Vision-Language models
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
90
35
0
04 Apr 2023
RegionPLC: Regional Point-Language Contrastive Learning for Open-World
  3D Scene Understanding
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
Jihan Yang
Runyu Ding
Weipeng Deng
Zhe Wang
Xiaojuan Qi
146
69
0
03 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
169
553
0
03 Apr 2023
From Isolated Islands to Pangea: Unifying Semantic Space for Human
  Action Understanding
From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding
Yong-Lu Li
Xiaoqian Wu
Xinpeng Liu
Zehao Wang
Yiming Dou
...
Junyi Zhang
Yixing Li
Jingru Tan
Xudong Lu
Cewu Lu
173
17
0
02 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
109
22
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLMVLM
131
25
0
29 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
138
169
0
28 Mar 2023
Sigmoid Loss for Language Image Pre-Training
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIPVLM
331
1,208
0
27 Mar 2023
Text-to-Image Diffusion Models are Zero-Shot Classifiers
Text-to-Image Diffusion Models are Zero-Shot Classifiers
Kevin Clark
P. Jaini
DiffMVLM
124
116
0
27 Mar 2023
Previous
123...8910...121314
Next