
Multimodal Hate Speech Detection from Bengali Memes and Texts

Abstract

Numerous machine learning (ML) and deep learning (DL) approaches have been proposed to utilize textual data from social media for analyzing anti-social behavior such as cyberbullying, fake news, and hate speech, mainly for high-resource languages, e.g., English. However, despite having great diversity and millions of native speakers, some languages such as Bengali remain under-resourced due to a lack of computational resources for natural language processing (NLP). Like other languages, Bengali social media content also includes images along with texts (e.g., multimodal content is posted by embedding short texts into images on Facebook); textual data alone is not enough to judge such content, since images might provide extra context needed for a proper judgment. This paper addresses hate speech detection from multimodal Bengali memes and texts. We prepared the first multimodal hate speech dataset of its kind for Bengali, which we use to train state-of-the-art neural architectures (e.g., Bi-LSTM/Conv-LSTM with word embeddings, and ConvNets + pre-trained language models (PLMs) such as monolingual Bangla BERT, multilingual BERT cased/uncased, and XLM-RoBERTa) that jointly analyze textual and visual information for hate speech detection. For texts, the Conv-LSTM and XLM-RoBERTa models performed best, yielding F1 scores of 0.78 and 0.82, respectively. For memes, the ResNet-152 and DenseNet-161 models yield F1 scores of 0.78 and 0.79, respectively. For multimodal fusion, XLM-RoBERTa + DenseNet-161 performed best, yielding an F1 score of 0.83. Our study suggests that the text modality is most useful for hate speech detection, while memes are moderately useful. To foster reproducible research, we plan to make the datasets, source code, models, and notebooks available.
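The multimodal fusion the abstract describes (e.g., XLM-RoBERTa + DenseNet-161) can be sketched as a late-fusion classifier that concatenates the text and image feature vectors before a small classification head. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: the class name `LateFusionClassifier`, the head sizes, and the random stand-in features are assumptions; 768 and 2208 are the standard output dimensions of XLM-RoBERTa-base and DenseNet-161, respectively.

```python
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Hypothetical late-fusion head: concatenate text and image
    feature vectors, then classify hate vs. non-hate."""

    def __init__(self, text_dim=768, image_dim=2208, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),  # fuse both modalities
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat, image_feat):
        # Concatenate along the feature dimension and classify.
        return self.head(torch.cat([text_feat, image_feat], dim=1))


# Random tensors stand in for encoder outputs (assumption: in practice
# these would come from XLM-RoBERTa [CLS] states and pooled DenseNet-161
# features for a batch of 4 memes).
text_feat = torch.randn(4, 768)
image_feat = torch.randn(4, 2208)
logits = LateFusionClassifier()(text_feat, image_feat)
print(logits.shape)  # one logit pair per example
```

Late fusion keeps the two encoders independent, so either branch can be swapped (e.g., Bangla BERT for the text side) without retraining the other.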
