arXiv: 2501.04001
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
7 January 2025
Haobo Yuan
Xianrui Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming-Hsuan Yang
    VLM

Papers citing "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"

50 / 108 papers shown
Title
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan
Wentong Li
Jian Liu
Dongqi Tang
Xinjie Luo
Chi Qin
Lei Zhang
Jianke Zhu
MLLM, VLM
132
87
0
15 Dec 2023
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia
Dongchen Han
Yizeng Han
Xuran Pan
Shiji Song
Gao Huang
VLM
152
68
0
15 Dec 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang
Hongyang Li
Feng Li
Tianhe Ren
Xueyan Zou
...
Shijia Huang
Jianfeng Gao
Lei Zhang
Chun-yue Li
Jianwei Yang
189
76
0
05 Dec 2023
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren
Zhicheng Huang
Yunchao Wei
Yao-Min Zhao
Dongmei Fu
Jiashi Feng
Xiaojie Jin
VLM, MLLM, LRM
124
109
0
04 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM, LRM, MLLM
130
99
0
01 Dec 2023
VTimeLLM: Empower LLM to Grasp Video Moments
Bin Huang
Xin Wang
Hong Chen
Zihan Song
Wenwu Zhu
MLLM
154
132
0
30 Nov 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM, MLLM
96
301
0
28 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM, ELM, VLM
471
960
0
27 Nov 2023
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe
Rusiru Thushara
Muhammad Maaz
H. Rasheed
Salman Khan
Mubarak Shah
Fahad Khan
VLM, MLLM
83
36
0
22 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM, MLLM
390
711
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
153
249
0
14 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM, VLM
121
126
0
09 Nov 2023
GLaMM: Pixel Grounding Large Multimodal Model
H. Rasheed
Muhammad Maaz
Sahal Shaji Mullappilly
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakkal
Rao M. Anwer
Eric Xing
Ming-Hsuan Yang
Fahad S. Khan
MLLM, VLM
173
239
0
06 Nov 2023
Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion
Hao Zhou
Tiancheng Shen
Xu Yang
Hai Huang
Xiangtai Li
Lu Qi
Ming-Hsuan Yang
139
12
0
06 Nov 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Wei-Ge Chen
Irina Spiridonova
Jianwei Yang
Jianfeng Gao
Chun-yue Li
MLLM, VLM
97
37
0
01 Nov 2023
Improved Baselines with Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM, MLLM
249
2,834
0
05 Oct 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang
Xiaoyi Wang
Bin Wang
Yuhang Cao
Chao Xu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Da Lin
Jiaqi Wang
MLLM
198
241
0
26 Sep 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM, VLM, ObjD
205
945
0
24 Aug 2023
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
Henghui Ding
Chang Liu
Shuting He
Xudong Jiang
Chen Change Loy
VOS
134
117
0
16 Aug 2023
LISA: Reasoning Segmentation via Large Language Model
Xin Lai
Zhuotao Tian
Yukang Chen
Yanwei Li
Yuhui Yuan
Shu Liu
Jiaya Jia
LM&Ro, VLM, MLLM, LRM
173
463
0
01 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH, ALM
565
12,138
0
18 Jul 2023
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Yin-Yin He
Menghan Xia
Haoxin Chen
Xiaodong Cun
Yuan Gong
...
Yong Zhang
Xintao Wang
Chao-Liang Weng
Ying Shan
Qifeng Chen
DiffM, VGen
66
79
0
13 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
204
1,060
0
12 Jul 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLM, ObjD, VLM
130
765
0
26 Jun 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Yunhang Shen
Yulei Qin
Mengdan Zhang
...
Xiawu Zheng
Ke Li
Xing Sun
Zhenyu Qiu
Rongrong Ji
ELM, MLLM
161
860
0
23 Jun 2023
DVIS: Decoupled Video Instance Segmentation Framework
Tao Zhang
Xingye Tian
Yuehua Wu
Shunping Ji
Xuebo Wang
Yuan Zhang
Pengfei Wan
91
48
0
06 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang
Xin Li
Lidong Bing
MLLM
226
1,068
0
05 Jun 2023
GRES: Generalized Referring Expression Segmentation
Chang Liu
Henghui Ding
Xudong Jiang
170
167
0
01 Jun 2023
Contextual Object Detection with Multimodal Large Language Models
Yuhang Zang
Wei Li
Jun Han
Kaiyang Zhou
Chen Change Loy
ObjD, VLM, MLLM
121
88
0
29 May 2023
OpenVIS: Open-vocabulary Video Instance Segmentation
Pinxue Guo
Tony Huang
Peiyang He
Xuefeng Liu
Tianjun Xiao
Zhaoyu Chen
Wenqiang Zhang
VLM
101
16
0
26 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM, LRM
370
816
0
17 May 2023
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
Nisha Huang
Yuxin Zhang
Weiming Dong
DiffM, VGen
66
17
0
09 May 2023
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa, VLM, MLLM
587
4,950
0
17 Apr 2023
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Wen Wang
Yan Jiang
K. Xie
Zide Liu
Hao Chen
Yue Cao
Xinlong Wang
Chunhua Shen
DiffM, VGen
115
116
0
30 Mar 2023
Pix2Video: Video Editing using Image Diffusion
Duygu Ceylan
C. Huang
Niloy J. Mitra
DiffM, VGen
164
262
0
22 Mar 2023
Universal Instance Perception as Object Discovery and Retrieval
B. Yan
Yi Jiang
Jiannan Wu
D. Wang
Ping Luo
Zehuan Yuan
Huchuan Lu
VOS, VLM, LRM
157
176
0
12 Mar 2023
Video-P2P: Video Editing with Cross-attention Control
Shaoteng Liu
Yuechen Zhang
Wenbo Li
Zhe Lin
Jiaya Jia
DiffM, VGen
227
221
0
08 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM, PILM
1.7K
13,558
0
27 Feb 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM
640
4,679
0
30 Jan 2023
GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu
Jianfeng Wang
Zhengyuan Yang
Zhe Gan
Zicheng Liu
Junsong Yuan
Lijuan Wang
ObjD, VLM
83
119
0
01 Dec 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM, ReLM, LRM
304
1,303
0
20 Sep 2022
VINet: Visual and Inertial-based Terrain Classification and Adaptive Navigation over Unknown Terrain
Tianrui Guan
Ruitao Song
Zhixian Ye
Liangjun Zhang
89
11
0
16 Sep 2022
Instance As Identity: A Generic Online Paradigm for Video Instance Segmentation
Feng Zhu
Zongxin Yang
Xin Yu
Yezhou Yang
Yunchao Wei
85
10
0
05 Aug 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM, BDL, VLM, CLIP
586
4,443
0
28 Jan 2022
Language as Queries for Referring Video Object Segmentation
Jiannan Wu
Yi Jiang
Pei Sun
Zehuan Yuan
Ping Luo
84
155
0
03 Jan 2022
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip Torr
228
333
0
04 Dec 2021
Masked-attention Mask Transformer for Universal Image Segmentation
Bowen Cheng
Ishan Misra
Alex Schwing
Alexander Kirillov
Rohit Girdhar
ISeg
427
2,407
0
02 Dec 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat
834
10,659
0
17 Jun 2021
Video Instance Segmentation using Inter-Frame Communication Transformers
Sukjun Hwang
Miran Heo
Seoung Wug Oh
Seon Joo Kim
ViT
134
139
0
07 Jun 2021
GANav: Efficient Terrain Segmentation for Robot Navigation in Unstructured Outdoor Environments
Tianrui Guan
D. Kothandaraman
Rohan Chandra
A. Sathyamoorthy
K. Weerakoon
Tianyi Zhou
95
109
0
07 Mar 2021