Distinguishing between human- and LLM-generated texts is crucial given the risks associated with misuse of LLMs. This paper investigates the detection and explanation capabilities of current LLMs across two settings: binary (human vs. LLM-generated) and ternary classification (including an "undecided" class). We evaluate six closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs), though both remain suboptimal. Introducing a ternary classification framework improves both detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses using our human-annotated dataset, we identify key explanation failures: reliance on inaccurate features, hallucinations, and flawed reasoning. Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
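To make the two evaluation settings concrete, below is a minimal sketch of how a binary vs. ternary detection prompt and label parsing could be set up. The prompt wording, label names, and helper functions are illustrative assumptions, not the authors' exact protocol.

```python
# Hypothetical sketch of the binary vs. ternary detection setup described above.
# The prompt text, label strings, and parsing logic are assumptions for
# illustration only; they do not reproduce the paper's exact protocol.

BINARY_LABELS = ["human-written", "LLM-generated"]
TERNARY_LABELS = ["human-written", "LLM-generated", "undecided"]


def build_detection_prompt(text: str, labels: list[str]) -> str:
    """Ask a detector LLM to classify a text and explain its decision."""
    options = " / ".join(labels)
    return (
        "Decide whether the following text was written by a human or "
        f"generated by an LLM. Answer with one of: {options}. "
        "Then briefly explain which features of the text support your answer.\n\n"
        f"Text:\n{text}"
    )


def parse_label(response: str, labels: list[str]) -> str:
    """Map a free-form model response to one of the allowed labels."""
    lowered = response.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    # Fall back to "undecided" when available, otherwise flag as unparsed.
    return "undecided" if "undecided" in labels else "unparsed"


if __name__ == "__main__":
    sample = "The mitochondria is the powerhouse of the cell."
    print(build_detection_prompt(sample, TERNARY_LABELS))
    print(parse_label("I think this is LLM-generated because ...", TERNARY_LABELS))
```

In a self-detection run the detector would be the same model that produced the candidate text, while cross-detection would pair a detector with texts produced by a different model; the ternary label set simply adds the "undecided" option to the prompt.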
@article{ji2025_2502.12743,
  title   = {"I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts?},
  author  = {Jiazhou Ji and Jie Guo and Weidong Qiu and Zheng Huang and Yang Xu and Xinru Lu and Xiaoyu Jiang and Ruizhe Li and Shujun Li},
  journal = {arXiv preprint arXiv:2502.12743},
  year    = {2025}
}