HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

HyperAI Team
Yuchen Liu
Kaiyang Han
Zhiqiang Xia
Yuhang Dong
Chen Song
Kangyu Tang
Jiaming Xu
Xiushi Feng
WenXuan Yu
Li Peng
Mingyang Wang
Kai Wang
Changpeng Yang
Yang Li
Haoyu Lu
Hao Wang
Bingna Xu
Guangyao Liu
Long Huang
Kaibin Guo
Jinyang Wu
Dan Wu
Hongzhen Wang
Peng Zhou
Shuai Nie
Shande Wang
Runyu Shi
Ying Huang
14 pages (main), 6 figures, 17 tables; bibliography: 8 pages; appendix: 7 pages
Abstract

Current multimodal large language models possess strong perceptual and reasoning capabilities, but their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution images. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
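
To make the abstract's inference recipe concrete, the sketch below shows one plausible way the pieces could fit together: a lightweight resolution predictor in the spirit of the VRC, followed by fixed-size tiling to cap peak encoder memory, with each tile routed to the ViT branch selected at runtime (the abstract's dynamic switching under DCL). This is a minimal PyTorch sketch under our own assumptions; the class and function names (VisualResolutionCompressor, tile_image, SUPPORTED_RES) and the candidate resolutions are illustrative, not the paper's actual API or design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed candidate encoding resolutions; the paper does not specify these.
SUPPORTED_RES = (224, 336, 448)

class VisualResolutionCompressor(nn.Module):
    """Hypothetical tiny head that scores candidate resolutions per image."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, len(SUPPORTED_RES))

    def forward(self, img: torch.Tensor) -> int:
        # Score a cheap thumbnail so the compressor adds negligible cost.
        thumb = F.interpolate(img, size=(112, 112), mode="bilinear")
        logits = self.head(self.features(thumb))
        return SUPPORTED_RES[int(logits.argmax(dim=-1))]

def tile_image(img: torch.Tensor, tile: int = 224) -> torch.Tensor:
    """Split an image into fixed-size tiles to cap peak encoder memory."""
    _, _, h, w = img.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile
    img = F.pad(img, (0, pad_w, 0, pad_h))          # pad to tile multiples
    tiles = img.unfold(2, tile, tile).unfold(3, tile, tile)  # (B,3,nH,nW,t,t)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).contiguous()
    return tiles.view(-1, 3, tile, tile)

# Usage: predict a resolution, resize once, then encode tile by tile.
img = torch.rand(1, 3, 1024, 768)
vrc = VisualResolutionCompressor()
res = vrc(img)                                      # e.g. 336
img = F.interpolate(img, size=(res, res), mode="bilinear")
for t in tile_image(img):
    pass  # feed t.unsqueeze(0) to the ViT branch selected at this scale
```

Because tiles are encoded one at a time, peak activation memory is bounded by a single tile regardless of the original image size, which matches the abstract's stated motivation for tiling on edge devices.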
