Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.15389
Cited By
EVA-CLIP: Improved Training Techniques for CLIP at Scale
27 March 2023
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"EVA-CLIP: Improved Training Techniques for CLIP at Scale"
50 / 360 papers shown
Title
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Haoyuan Li
Yanpeng Zhou
Tao Tang
Jifei Song
Yihan Zeng
Michael C. Kampffmeyer
Hang Xu
Xiaodan Liang
3DGS
67
1
0
25 Feb 2025
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
Mingkun Zhang
Keping Bi
Wei Chen
J. Guo
Xueqi Cheng
BDL
VLM
52
1
0
25 Feb 2025
Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review
Ufaq Khan
Umair Nawaz
A. Qayyum
Shazad Ashraf
Muhammad Bilal
Junaid Qadir
76
0
0
24 Feb 2025
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
Chengyue Huang
Junjiao Tian
Brisa Maneechotesuwan
Shivang Chopra
Z. Kira
57
0
0
21 Feb 2025
Black Sheep in the Herd: Playing with Spuriously Correlated Attributes for Vision-Language Recognition
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Mengqi He
Jing Zhang
VLM
48
0
0
19 Feb 2025
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models
Xiaofei Yin
Y. Hong
Ya Guo
Yi Tu
Weiqiang Wang
Gongshen Liu
Huijia Zhu
VLM
63
0
0
19 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Junxian Cai
Kai Ma
Xi Chen
Jiaheng Liu
49
0
0
17 Feb 2025
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Ze Liu
Junjie Zhou
Yueze Wang
Zheng Liu
Defu Lian
OffRL
112
0
0
17 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begüm Demir
Ioannis Papoutsis
VLM
86
0
0
13 Feb 2025
Learning Human Skill Generators at Key-Step Levels
Yilu Wu
Chenhui Zhu
Shuai Wang
Hanlin Wang
Jing Wang
Zhaoxiang Zhang
Limin Wang
VGen
119
0
0
12 Feb 2025
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate
Isaac Slaughter
Kyra Wilson
Mona Diab
Aylin Caliskan
86
1
0
11 Feb 2025
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Cristiano Patrício
Isabel Rio-Torto
J. S. Cardoso
Luís F. Teixeira
João C. Neves
VLM
227
1
0
21 Jan 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan L. Yuille
Lu Jiang
VGen
58
2
0
10 Jan 2025
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Lijie Tao
H. Zhang
Haizhao Jing
Yu Liu
Kelu Yao
Guoting Wei
Xizhe Xue
37
0
0
03 Jan 2025
YOLO-UniOW: Efficient Universal Open-World Object Detection
Lihao Liu
Juexiao Feng
Hui Chen
Ao Wang
Lin Song
J. Han
Guiguang Ding
ObjD
VLM
49
2
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
55
4
0
31 Dec 2024
How Panel Layouts Define Manga: Insights from Visual Ablation Experiments
Siyuan Feng
Teruya Yoshinaga
Katsuhiko Hayashi
Koki Washio
Hidetaka Kamigaito
33
0
0
26 Dec 2024
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li
Pin-Yu Chen
Tsung-Yi Ho
AAML
36
0
0
23 Dec 2024
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande
Mohammad Sadegh Norouzzadeh
Devin Willmott
Anna Bair
Madan Ravi Ganesh
J. Zico Kolter
CLIP
VLM
93
0
0
21 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP
VLM
100
6
0
20 Dec 2024
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
J. Zhang
Li Zhang
Shijian Li
VLM
81
0
0
18 Dec 2024
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
Jinghan Sun
Dong-mei Wei
Zhe Xu
Donghuan Lu
Hong Liu
Hong Wang
Sotirios A. Tsaftaris
Jingyu Sun
Yefeng Zheng
Liansheng Wang
MedIm
106
0
0
18 Dec 2024
DINO-Foresight
\texttt{DINO-Foresight}
DINO-Foresight
: Looking into the Future with DINO
Efstathios Karypidis
Ioannis Kakogeorgiou
Spyros Gidaris
N. Komodakis
AI4CE
87
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
91
1
0
16 Dec 2024
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
Xinshuai Song
Weixing Chen
Yong-Jin Liu
Weikai Chen
Guanbin Li
Liang Lin
123
3
0
12 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
110
7
0
11 Dec 2024
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao
Yizeng Han
Jiasheng Tang
Zechao Li
Yibing Song
Kaidi Wang
Zhangyang Wang
Yang You
83
7
0
04 Dec 2024
Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh
Umer Ahmed
Hamza Khan
M. Zia
Quoc-Huy Tran
VLM
89
0
0
04 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
93
0
0
04 Dec 2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Yueqian Wang
Xiaojun Meng
Yijiao Wang
Jianxin Liang
Jiansheng Wei
Huishuai Zhang
Dongyan Zhao
VGen
85
8
0
27 Nov 2024
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
Jiaxuan Li
Junwen Mo
MinhDuc Vo
Akihiro Sugimoto
Hideki Nakayama
87
0
0
26 Nov 2024
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Yaqi Zhao
Yuanyang Yin
Lin Li
Mingan Lin
Victor Shea-Jay Huang
Siwei Chen
Xin Wu
Baoqun Yin
Zenan Zhou
Wentao Zhang
77
0
0
25 Nov 2024
A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models
Manuel Schwonberg
Claus Werner
Hanno Gottschalk
Carsten Meyer
VLM
90
0
0
25 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Jianwei Yin
97
4
0
25 Nov 2024
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
91
11
0
20 Nov 2024
Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts
Yijian Gao
D. C. Marshall
Xiaodan Xing
Junzhi Ning
G. Papanastasiou
G. Yang
M. Komorowski
MedIm
26
0
0
16 Nov 2024
Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
Haojie Zheng
Tianyang Xu
Hanchi Sun
Shu Pu
Ruoxi Chen
Lichao Sun
MLLM
LRM
84
8
0
15 Nov 2024
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi
Minjing Dong
Chang Xu
VLM
43
1
0
14 Nov 2024
Silver medal Solution for Image Matching Challenge 2024
Yian Wang
3DV
3DPC
36
0
0
04 Nov 2024
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Shenghao Xie
Wenqiang Zu
Mingyang Zhao
Duo Su
Shilong Liu
Ruohua Shi
Guoqi Li
Shanghang Zhang
Lei Ma
LRM
47
3
0
29 Oct 2024
Improving Generalization in Visual Reasoning via Self-Ensemble
Tien-Huy Nguyen
Quang-Khai Tran
Anh-Tuan Quang-Hoang
VLM
LRM
58
5
0
28 Oct 2024
Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval
Bin Kang
Bin Chen
J. T. Wang
Yong Xu
24
0
0
26 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
43
0
0
26 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
49
3
0
24 Oct 2024
PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context
Maximilian Augustin
Syed Shakib Sarwar
Mostafa Elhoushi
Sai Qian Zhang
Yuecheng Li
B. D. Salvo
28
0
0
23 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
70
24
0
21 Oct 2024
Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
Junsheng Zhou
Yu-Shen Liu
Zhizhong Han
ViT
34
9
0
21 Oct 2024
Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods
Adam Phillips
Daniel Grandes Rodriguez
Miriam Sánchez-Manzano
Alan Salvadó
Manuel Garin
G. Haro
C. Ballester
24
0
0
21 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
35
3
0
21 Oct 2024
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya
Masanori Suganuma
Takayuki Okatani
ObjD
21
0
0
20 Oct 2024
Previous
1
2
3
4
5
6
7
8
Next