
M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

7 June 2023
Lei Li
Yuwei Yin
Shicheng Li
Liang Chen
Peiyi Wang
Shuhuai Ren
Mukai Li
Yazheng Yang
Jingjing Xu
Xu Sun
Lingpeng Kong
Qi Liu
Abstract

Instruction tuning has significantly advanced large language models (LLMs) such as ChatGPT, enabling them to follow human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited by the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M³IT) dataset, designed to optimize the alignment of VLMs with human instructions. M³IT comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M³IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a model trained on M³IT, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. To encourage further research, we have open-sourced both the dataset and the trained models.
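For context, here is a minimal sketch in Python of what a single vision-to-text instruction-tuning instance in such a dataset might look like. The field names (instruction, inputs, images, outputs) are illustrative assumptions, not the paper's confirmed schema:

    # Hypothetical schema for one vision-to-text instruction-tuning instance;
    # field names are illustrative assumptions, not the official M3IT format.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class InstructionInstance:
        instruction: str   # a manually written task instruction
        inputs: str        # task-specific textual input, e.g. a question
        images: List[str]  # associated image(s), e.g. base64-encoded
        outputs: str       # target text the model should produce

    # Example instance for a visual question answering task
    example = InstructionInstance(
        instruction="Answer the question based on the image.",
        inputs="What color is the bus?",
        images=["<base64-encoded image>"],
        outputs="The bus is red.",
    )

A uniform structure like this is what lets 40 heterogeneous source datasets be merged into one instruction-tuning corpus: each task reduces to the same (instruction, input, image, output) shape regardless of its original annotation format.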
