Enriching lexical-based approach with external knowledge for Vietnamese multiple-choice reading comprehension

Although over 95 million people worldwide speak Vietnamese, few studies have addressed machine reading comprehension (MRC) for the language, and few language resources have been created for it. This article proposes a lexical-based reading comprehension approach that uses semantic similarity measurement and external knowledge sources to analyze questions and extract answers from Vietnamese reading texts. The method is evaluated on our proposed dataset, which comprises 2,783 pairs of multiple-choice questions and answers based on a set of 417 Vietnamese texts used for teaching reading comprehension to 1st to 5th graders. This research makes two main contributions: (1) a human-generated benchmark dataset for machine reading comprehension in the low-resource Vietnamese language; and (2) an evaluation of machine reading comprehension techniques using lexical-based approaches, neural-based approaches, and our proposed method. Finally, we analyze the results of our proposed model by comparing them with those of the lexical-based and neural-based approaches. Our experiments show that our proposed method outperforms the baseline models, achieving an accuracy of 61.81%, which is 5.51% higher than the best baseline model. In addition, we measure human performance on our dataset and compare it to that of our MRC models. The performance gap between humans and our best experimental model indicates that significant progress can still be made on Vietnamese machine reading comprehension in further research. Our dataset is freely available for research purposes.
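To make the idea of a lexical-based approach concrete, the following is a minimal sketch of a generic word-overlap baseline for multiple-choice questions. It is not the method proposed in this article: it omits the semantic similarity measurement and the external knowledge sources, and it does not perform Vietnamese word segmentation. The names `tokenize` and `choose_answer` and the English example are hypothetical, chosen only for illustration.

```python
import re


def tokenize(text):
    # Lowercase and split on non-word characters. In Python 3, \w is
    # Unicode-aware for str patterns, so letters with diacritics are kept.
    # Assumption: plain token splitting stands in for real word segmentation.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def choose_answer(passage, question, options):
    """Return the index of the option whose words, together with the
    question words, overlap most with a single sentence of the passage."""
    sentences = [
        set(tokenize(s)) for s in re.split(r"[.!?]+", passage) if s.strip()
    ]
    question_tokens = set(tokenize(question))

    best_index, best_score = 0, -1
    for i, option in enumerate(options):
        target = question_tokens | set(tokenize(option))
        # Score an option by its best-matching sentence.
        score = max((len(target & s) for s in sentences), default=0)
        if score > best_score:  # ties keep the earliest option
            best_index, best_score = i, score
    return best_index


passage = "Lan goes to school every morning. Her brother stays at home."
question = "Where does Lan go every morning?"
options = ["to the market", "to school", "to the river", "to the park"]
print(options[choose_answer(passage, question, options)])
```

A scorer of this kind only rewards exact word matches, which is the limitation that semantic similarity and external knowledge are meant to address.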