2502.14786
Cited By
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
21 February 2025
Michael Tschannen
A. Gritsenko
Xiao Wang
Muhammad Ferjad Naeem
Ibrahim Alabdulmohsin
Nikhil Parthasarathy
Talfan Evans
Lucas Beyer
Ye Xia
Basil Mustafa
Olivier J. Hénaff
Jeremiah Harmsen
Andreas Steiner
Xiaohua Zhai
VLM
Papers citing
"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features"
45 / 45 papers shown
Title
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Heekyung Lee
Jiaxin Ge
Tsung-Han Wu
Minwoo Kang
Trevor Darrell
David M. Chan
ReLM
CoGe
LRM
9
0
0
29 May 2025
AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring
Mikko Impio
Philipp M. Rehsen
Tiina Laamanen
Arne J. Beermann
Florian Leese
Jenni Raitoharju
19
0
0
28 May 2025
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Kaiyu Yue
Vasu Singla
Menglin Jia
John Kirchenbauer
Rifaa Qadri
Zikui Cai
A. Bhatele
Furong Huang
Tom Goldstein
VLM
8
0
0
28 May 2025
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Sihan Chen
Dan Zhao
Jongwoo Ko
Colby R. Banbury
Huiping Zhuang
Luming Liang
Tianyi Chen
12
0
0
26 May 2025
LlamaSeg: Image Segmentation via Autoregressive Mask Generation
Jiru Deng
Tengjin Weng
Tianyu Yang
Wenhan Luo
Zhiheng Li
Wenhao Jiang
VLM
84
0
0
26 May 2025
ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning
Tuan V. Vo
T. Nguyen
Khang Nguyen
Duy Ho Minh Nguyen
Minh Nhat Vu
LRM
19
0
0
25 May 2025
DocMMIR: A Framework for Document Multi-modal Information Retrieval
Zirui Li
Siwei Wu
Xingyu Wang
Yi Zhou
Yizhi Li
Chenghua Lin
VLM
31
0
0
25 May 2025
REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
Savya Khosla
Sethuraman TV
Barnett Lee
Alexander Schwing
Derek Hoiem
VGen
55
0
0
23 May 2025
RemoteSAM: Towards Segment Anything for Earth Observation
Liang Yao
Fan Liu
Delong Chen
Chuanyi Zhang
Yijun Wang
Ziyun Chen
Wei Xu
Shimin Di
Yuhui Zheng
92
0
0
23 May 2025
MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation
Bohan Zhou
Yi Zhan
Zhongbin Zhang
Zongqing Lu
35
0
0
22 May 2025
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan
Jiaming Han
Joey Tsai
Hongwei Xue
Rongyao Fang
Lingyi Hong
Ziyu Guo
Ray Zhang
VLM
51
3
0
22 May 2025
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Shuhao Han
Haotian Fan
Fangyuan Kong
Wenjie Liao
Chunle Guo
...
Jian Guo
Zhizhuo Shao
Ziyu Feng
Bing Li
Weiming Hu
94
6
0
22 May 2025
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You
Shen Nie
Xiaolu Zhang
Jun Hu
Jun Zhou
Zhiwu Lu
J. Wen
Chongxuan Li
MLLM
VLM
51
0
0
22 May 2025
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
Siting Li
Xiang Gao
Simon Shaolei Du
40
0
0
21 May 2025
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal
Srikumar Sastry
Aayush Dhakal
Adeel Ahmad
Nathan Jacobs
56
0
0
19 May 2025
Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
Bo Zhang
Shuo Li
Runhe Tian
Yang Yang
Jixin Tang
Jinhao Zhou
Lin Ma
VLM
46
0
0
14 May 2025
Symbolically-Guided Visual Plan Inference from Uncurated Video Data
Wenyan Yang
Ahmet Tikna
Yi Zhao
Yuying Zhang
Luigi Palopoli
Marco Roveri
Joni Pajarinen
VGen
43
0
0
13 May 2025
A Vision-Language Foundation Model for Leaf Disease Identification
Khang Nguyen Quoc
Lan Le Thi Thu
Luyl-Da Quach
VLM
79
0
0
11 May 2025
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo
Patrícia Izar
Irene Delval
Victor de Napole Gregolin
Nina S. T. Hirata
VGen
51
0
0
08 May 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin
Teng Wang
Yixiao Ge
Yuying Ge
Zhichao Lu
Ying Wei
Qingfu Zhang
Zhenan Sun
Ying Shan
MLLM
VLM
94
1
0
08 May 2025
Using Knowledge Graphs to harvest datasets for efficient CLIP model training
Simon Ging
Sebastian Walter
Jelena Bratulić
Johannes Dienert
Hannah Bast
Thomas Brox
CLIP
47
0
0
05 May 2025
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation
Volodymyr Havrylov
Haiwen Huang
Dan Zhang
Andreas Geiger
384
0
0
04 May 2025
Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Daniel Bogdoll
Rajanikant Ananta
Abeyankar Giridharan
Isabel Moore
Gregory Stevens
Henry X. Liu
VLM
68
0
0
30 Apr 2025
Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders
Xuwei Yang
Fatemeh Tavakoli
D. B. Emerson
Anastasis Kratsios
FedML
94
0
0
30 Apr 2025
ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery
Anush Lakshman Sivaraman
Kojo Adu-Gyamfi
Ibne Farabi Shihab
Anuj Sharma
33
0
0
28 Apr 2025
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Theodoros Kouzelis
Efstathios Karypidis
Ioannis Kakogeorgiou
Spyros Gidaris
N. Komodakis
DiffM
61
0
0
22 Apr 2025
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Haiwen Huang
Anpei Chen
Volodymyr Havrylov
Andreas Geiger
Dan Zhang
46
1
0
18 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
168
5
0
17 Apr 2025
Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou
Huangyuan Su
Thomas Fel
Naomi Saphra
Sham Kakade
VLM
76
1
0
16 Apr 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu
Mengjie Liu
Jianfei Chen
Jingwei Xu
Tengjiao Wang
Zeang Sheng
Wentao Zhang
MLLM
82
1
0
14 Apr 2025
CollEX -- A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections
Florian Schneider
Narges Baba Ahmadi
Niloufar Baba Ahmadi
Iris Vogel
Martin Semmann
Chris Biemann
50
2
0
10 Apr 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Xinze Wang
Zhiyong Yang
Chao Feng
Hongjin Lu
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Furong Huang
Lijuan Wang
OODD
ReLM
LRM
VLM
120
12
0
10 Apr 2025
Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
Xiao Zhang
Xiangyu Han
Xiwen Lai
Yao Sun
Pei Zhang
Konrad Kording
47
0
0
08 Apr 2025
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff
Erblina Purellku
Jakob Hackstein
Jonas Loos
Leo Pinetzki
Lorenz Hufe
AAML
43
0
0
07 Apr 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
126
1
0
24 Mar 2025
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li
Qi Ma
Runyi Yang
Huapeng Li
Mengjiao Ma
...
E. Konukoglu
Theo Gevers
Luc Van Gool
Martin R. Oswald
Danda Pani Paudel
3DGS
VLM
143
0
0
23 Mar 2025
Beyond Accuracy: What Matters in Designing Well-Behaved Models?
Robin Hesse
Doğukan Bağcı
Bernt Schiele
Simone Schaub-Meyer
Stefan Roth
VLM
84
0
0
21 Mar 2025
TULIP: Towards Unified Language-Image Pretraining
Zineng Tang
Long Lian
Seun Eisape
Xudong Wang
Roei Herzig
Adam Yala
Alane Suhr
Trevor Darrell
David M. Chan
VLM
CLIP
MLLM
122
5
0
19 Mar 2025
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Nvidia
Johan Bjorck
Fernando Castañeda
Nikita Cherniadev
Xingye Da
...
Ao Zhang
Hao Zhang
Yizhou Zhao
Ruijie Zheng
Yuke Zhu
VLM
106
37
0
18 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
370
0
0
16 Mar 2025
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
Tianle Li
Yongming Rao
Winston Hu
Yu Cheng
MLLM
88
0
0
16 Mar 2025
Mellow: a small audio language model for reasoning
Soham Deshmukh
Satvik Dixit
Rita Singh
Bhiksha Raj
AuLLM
ReLM
LRM
88
3
0
11 Mar 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
385
0
0
21 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Tongtian Yue
Junxian Cai
Kai Ma
Qingbin Liu
Xi Chen
Jing Liu
73
1
0
17 Feb 2025
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
68
10
0
29 Aug 2024