UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Abstract

Despite the recent success of pre-training techniques for NLP and image-linguistic tasks, there are still few works on video-linguistic pre-training. Moreover, most existing multimodal models are pre-trained only for understanding tasks, which leads to a pretrain-finetune discrepancy when they are applied to generation tasks. In this paper, we propose UniViLM: a Unified Video and Language pre-training Model for both multimodal understanding and generation. Our model comprises four components built on a Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. We first pre-train the model on a large instructional video dataset to learn universal representations for both video and language. We then fine-tune it on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Extensive experiments show that our method improves performance on both understanding and generation tasks and achieves state-of-the-art results.
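Below is a minimal PyTorch sketch of the four-component layout the abstract describes: two single-modal encoders, a cross encoder for fusion, and a decoder for generation, all Transformer-based. All dimensions, layer counts, and names (`UniViLMSketch`, `video_feat_dim`, etc.) are illustrative assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class UniViLMSketch(nn.Module):
    """Illustrative four-component encoder/decoder layout (assumed sizes)."""

    def __init__(self, vocab_size=30522, d_model=768, nhead=12,
                 n_enc_layers=6, n_dec_layers=3, video_feat_dim=1024):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

        # Two single-modal encoders: one over text tokens, one over
        # pre-extracted video features (projected into the shared width).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)
        self.video_proj = nn.Linear(video_feat_dim, d_model)
        self.video_encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)

        # Cross encoder fuses the two modalities; its output would serve
        # understanding tasks such as text-based video retrieval.
        self.cross_encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)

        # Decoder attends to the fused representation for generation
        # tasks such as multimodal video captioning.
        self.decoder = nn.TransformerDecoder(dec_layer, n_dec_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, video_feats, caption_ids):
        t = self.text_encoder(self.text_embed(text_ids))
        v = self.video_encoder(self.video_proj(video_feats))
        # Concatenate along the sequence dimension and fuse.
        fused = self.cross_encoder(torch.cat([t, v], dim=1))
        # Real autoregressive decoding would also apply a causal mask;
        # omitted here for brevity.
        out = self.decoder(self.text_embed(caption_ids), fused)
        return self.lm_head(out)  # token logits over the vocabulary
```

In this reading of the abstract, the cross encoder's fused output supports the understanding task (retrieval scoring) while the decoder path supports generation; the single forward pass above shows both for compactness.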
