arXiv: 2501.04001 (v2, latest)
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
7 January 2025
Haobo Yuan, Xianrui Li, Tao Zhang, Zilong Huang, Shilin Xu, S. Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
VLM
Papers citing "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos" (8 of 108 shown)
Language Models are Few-Shot Learners (28 May 2020)
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL | 1.2K | 42,753 | 0

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers (2 Apr 2020)
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
ViT | 197 | 440 | 0

Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot (20 Mar 2019)
Brent A. Griffin, V. Florence, Jason J. Corso
VOS | 83 | 22 | 0

Video Object Segmentation with Language Referring Expressions (21 Mar 2018)
Anna Khoreva, Anna Rohrbach, Bernt Schiele
VOS | 85 | 197 | 0

MAttNet: Modular Attention Network for Referring Expression Comprehension (24 Jan 2018)
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Joey Tianyi Zhou, Tamara L. Berg
ObjD | 130 | 834 | 0

Modeling Context in Referring Expressions (31 Jul 2016)
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
155 | 1,281 | 0

A Diagram Is Worth A Dozen Images (24 Mar 2016)
Aniruddha Kembhavi, M. Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi
3DV | 106 | 506 | 0

VQA: Visual Question Answering (3 May 2015)
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. L. Zitnick, Dhruv Batra, Devi Parikh
CoGe | 500 | 5,527 | 0