Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text, even under linguistic constraints, embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to mitigating degradation.
@article{mohamed2025_2506.14012,
  title={Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text},
  author={Amr Mohamed and Yang Zhang and Michalis Vazirgiannis and Guokan Shang},
  journal={arXiv preprint arXiv:2506.14012},
  year={2025}
}