Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models

Current vision-language multimodal models are well-adapted to general visual understanding tasks. However, they perform inadequately on complex visual tasks involving human poses and actions, owing to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset of 200,328 samples tailored to fine-tuning models for human-centric tasks, covering three areas: conversation, detailed description, and complex reasoning. We establish the Human Pose and Action Understanding Benchmark (HPAUB) to assess model performance on these tasks. We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate it on the benchmark; experimental results show an overall improvement of 21.18% over the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models. Code is available at this https URL.
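To make the data-generation idea concrete, below is a minimal sketch (not the authors' released code) of how COCO-style keypoint annotations could be combined with image captions and person bounding boxes into a text-only context that a language model is prompted with to produce the three target data types. The annotation field names, keypoint list, and task prompts here are illustrative assumptions; the paper's actual prompting pipeline may differ.

```python
# Sketch: build a symbolic text context from captions, boxes, and keypoints,
# then attach a task prompt for one of the three data types described above.

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

TASK_PROMPTS = {  # three data types targeted by the dataset (wording is assumed)
    "conversation": "Write a multi-turn Q&A about the people, their poses, and their actions.",
    "detailed_description": "Write a detailed description of each person's pose and action.",
    "complex_reasoning": "Write a question requiring reasoning about the poses, plus its answer.",
}


def format_person(bbox, keypoints):
    """Render one person's bounding box and visible keypoints as plain text."""
    x, y, w, h = bbox
    lines = [f"person bbox: x={x:.0f}, y={y:.0f}, w={w:.0f}, h={h:.0f}"]
    # COCO keypoints are flat (x, y, visibility) triplets; keep only visible ones.
    triplets = zip(keypoints[0::3], keypoints[1::3], keypoints[2::3])
    for name, (kx, ky, v) in zip(COCO_KEYPOINT_NAMES, triplets):
        if v > 0:
            lines.append(f"  {name}: ({kx:.0f}, {ky:.0f})")
    return "\n".join(lines)


def build_generation_prompt(captions, people, task):
    """Combine captions, boxes, and keypoints into a single text-only prompt."""
    context = ["Image captions:"] + [f"- {c}" for c in captions] + ["", "People:"]
    context += [format_person(p["bbox"], p["keypoints"]) for p in people]
    return "\n".join(context) + "\n\nTask: " + TASK_PROMPTS[task]


if __name__ == "__main__":
    captions = ["A man lunges to return a tennis ball on an outdoor court."]
    people = [{
        "bbox": [120.0, 80.0, 160.0, 300.0],
        # flat (x, y, visibility) triplets; padded with zeros for brevity
        "keypoints": [190, 95, 2, 0, 0, 0, 185, 93, 2] + [0] * 42,
    }]
    print(build_generation_prompt(captions, people, "complex_reasoning"))
```

The resulting prompt would then be sent to a text-only language model, and its output paired with the original image to form one keypoint-grounded instruction-following sample.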
@article{zhang2025_2409.09306,
  title   = {Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models},
  author  = {Dewen Zhang and Wangpeng An and Hayaru Shouno},
  journal = {arXiv preprint arXiv:2409.09306},
  year    = {2025}
}