An Annotated Corpus of Arabic Tweets for Hate Speech Analysis

Identifying hate speech in Arabic is challenging because of the language's rich dialectal variation. This study introduces a multilabel hate speech dataset for Arabic. We collected 10,000 Arabic tweets and annotated each one as containing offensive content or not. Tweets with offensive content were further classified by hate speech target: religion, gender, politics, ethnicity, origin, and others. A tweet can have a single target or multiple targets. Multiple annotators took part in the annotation task. The inter-annotator agreement was 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the dataset with several transformer-based models, among which AraBERTv2 performed best, with a micro-F1 score of 0.7865 and an accuracy of 0.786.
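As an illustration of the micro-F1 metric reported above, the following is a minimal sketch of how micro-averaged F1 can be computed for multilabel target predictions. The abstract does not describe the authors' evaluation code, so this is not their implementation. The target names come from the abstract; the function name `micro_f1` and the four example tweets are hypothetical.

```python
from typing import List, Set

# Hate speech targets as listed in the abstract.
TARGETS = ["religion", "gender", "politics", "ethnicity", "origin", "others"]


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all tweets and all labels, then compute a single F1 score."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in gold
        fp += len(p - g)  # labels predicted but absent from gold
        fn += len(g - p)  # gold labels the prediction missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical annotations for four tweets (an empty set means no target).
gold = [{"religion"}, {"gender", "politics"}, set(), {"ethnicity"}]
pred = [{"religion"}, {"gender"}, set(), {"origin"}]

print(round(micro_f1(gold, pred), 4))
```

Because micro-averaging pools counts across labels, frequent targets weigh more heavily in the score than rare ones, which is worth keeping in mind when target classes are imbalanced.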
@article{biswas2025_2505.11969,
  title={An Annotated Corpus of Arabic Tweets for Hate Speech Analysis},
  author={Md. Rafiul Biswas and Wajdi Zaghouani},
  journal={arXiv preprint arXiv:2505.11969},
  year={2025}
}