
Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

31 May 2025
Svetlana Churina
Akshat Gupta
Insyirah Mujtahid
Kokil Jaidka
Main: 19 pages, 6 figures, 8 tables
Abstract

Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.

@article{churina2025_2506.00332,
  title={Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus},
  author={Svetlana Churina and Akshat Gupta and Insyirah Mujtahid and Kokil Jaidka},
  journal={arXiv preprint arXiv:2506.00332},
  year={2025}
}