Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2012.06946
Cited By
MiniVLM: A Smaller and Faster Vision-Language Model
13 December 2020
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
L. Zhang
Jianfeng Gao
Zicheng Liu
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MiniVLM: A Smaller and Faster Vision-Language Model"
42 / 42 papers shown
Title
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Yongchao Feng
Yajie Liu
Shuai Yang
Wenrui Cai
Jingyang Zhang
...
Jiahui Lv
Ziqiang Liu
Tengyuan Shi
Qingjie Liu
Yixuan Wang
MLLM
VLM
63
1
0
13 Apr 2025
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data
Wenhao Wang
Zijie Yu
Rui Ye
Junzhe Zhang
S. Chen
Yanfeng Wang
FedML
51
0
0
07 Mar 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
70
3
0
11 Feb 2025
Shifted Window Fourier Transform And Retention For Image Captioning
J. Hu
Roberto Cavicchioli
Alessandro Capotondi
VLM
36
0
0
25 Aug 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
45
6
0
14 Apr 2024
Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss
Jordan Shipard
Arnold Wiliem
Kien Nguyen Thanh
Wei Xiang
Clinton Fookes
VLM
CLIP
38
2
0
22 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
29
0
0
11 Jan 2024
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
21
4
0
14 Dec 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
38
0
0
28 Nov 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
11
10
0
06 Oct 2023
DLIP: Distilling Language-Image Pre-training
Huafeng Kuang
Jie Wu
Xiawu Zheng
Ming Li
Xuefeng Xiao
Rui Wang
Min Zheng
Rongrong Ji
VLM
38
4
0
24 Aug 2023
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
31
9
0
17 Jul 2023
Linear Alignment of Vision-language Models for Image Captioning
Fabian Paischer
M. Hofmarcher
Sepp Hochreiter
Thomas Adler
CLIP
VLM
47
0
0
10 Jul 2023
Several categories of Large Language Models (LLMs): A Short Survey
Saurabh Pahune
Manoj Chandrasekharan
AILaw
19
14
0
05 Jul 2023
Embodied Executable Policy Learning with Language-based Scene Summarization
Jielin Qiu
Mengdi Xu
William Jongwon Han
Seungwhan Moon
Ding Zhao
LM&Ro
24
7
0
09 Jun 2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao
Bhargavi Paranjape
Hannaneh Hajishirzi
MLLM
VLM
13
21
0
27 May 2023
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Anyi Rao
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
30
22
0
27 May 2023
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Zekun Wang
Jingchang Chen
Wangchunshu Zhou
Haichao Zhu
Jiafeng Liang
Liping Shan
Ming Liu
Dongliang Xu
Qing Yang
Bing Qin
VLM
24
4
0
24 May 2023
Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening
Min Cao
Yang Bai
Wenwen Qiang
Ziqiang Cao
Liqiang Nie
Min Zhang
24
0
0
14 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
63
74
0
10 Mar 2023
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Ying Jin
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
ViT
23
38
0
31 Jan 2023
Towards Models that Can See and Read
Roy Ganz
Oren Nuriel
Aviad Aberdam
Yair Kittenplon
Shai Mazor
Ron Litman
21
13
0
18 Jan 2023
Efficient Image Captioning for Edge Devices
Ning Wang
Jiangrong Xie
Hangzai Luo
Qinglin Cheng
Jihao Wu
Mingbo Jia
Linlin Li
VLM
CLIP
23
20
0
18 Dec 2022
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng-Wei Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
25
17
0
21 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
45
101
0
15 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
75
12
0
28 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
30
36
0
14 Oct 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
A. Luu
VLM
CLIP
21
2
0
05 Jul 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
41
528
0
27 May 2022
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Jie Lei
Xinlei Chen
Ning Zhang
Meng-xing Wang
Joey Tianyi Zhou
Tamara L. Berg
Licheng Yu
31
12
0
10 Mar 2022
Privacy Preserving Visual Question Answering
Cristian-Paul Bara
Q. Ping
Abhinav Mathur
Govind Thattai
M. Rohith
Gaurav Sukhatme
4
1
0
15 Feb 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
MLLM
25
16
0
30 Jan 2022
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM
FedML
29
30
0
16 Dec 2021
Injecting Semantic Concepts into End-to-End Image Captioning
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lin Liang
Zhe Gan
Lijuan Wang
Yezhou Yang
Zicheng Liu
ViT
VLM
21
86
0
09 Dec 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
34
246
0
24 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
29
879
0
22 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
29
57
0
19 Nov 2021
What Vision-Language Models `See' when they See Scenes
Michele Cafagna
Kees van Deemter
Albert Gatt
VLM
31
13
0
15 Sep 2021
Playing Lottery Tickets with Vision and Language
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
106
54
0
23 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
33
96
0
05 Apr 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
Andrew Shin
Masato Ishii
T. Narihira
35
37
0
06 Mar 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
252
927
0
24 Sep 2019
1