This study examined the use of two Large Language Models (LLMs), GPT-4 and Kimi, for systematic reviews. We evaluated their performance by comparing LLM-generated codes with human-generated codes from a peer-reviewed systematic review on assessment. Our findings suggest that LLM performance in systematic reviews varies with data volume and question complexity.
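The comparison of LLM-generated and human-generated codes can be illustrated with a minimal sketch. The abstract does not state which agreement metric was used, so the two measures below (raw percent agreement and Cohen's kappa) are assumptions chosen for illustration, and the function names and example labels are hypothetical rather than taken from the study.

```python
from collections import Counter


def percent_agreement(human, llm):
    """Share of items where the LLM code equals the human code."""
    if len(human) != len(llm) or not human:
        raise ValueError("code lists must be non-empty and of equal length")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def cohens_kappa(human, llm):
    """Chance-corrected agreement between two coders (Cohen's kappa)."""
    n = len(human)
    p_observed = percent_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    # Expected agreement if both coders assigned labels independently,
    # each at their own observed label frequencies.
    p_expected = sum(
        human_counts[label] * llm_counts[label] for label in human_counts
    ) / (n * n)
    if p_expected == 1.0:
        # Both coders used a single identical label for every item.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes for four studies; not data from the paper.
human_codes = ["formative", "summative", "formative", "summative"]
llm_codes = ["formative", "summative", "summative", "summative"]

agreement = percent_agreement(human_codes, llm_codes)
kappa = cohens_kappa(human_codes, llm_codes)
print(f"percent agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Computing such a score separately for each review question and for subsets of different sizes would be one way to see how agreement shifts with question complexity and data volume.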
@article{kaptur2025_2504.20276,
  title={Enhancing Systematic Reviews with Large Language Models: Using GPT-4 and Kimi},
  author={Dandan Chen Kaptur and Yue Huang and Xuejun Ryan Ji and Yanhui Guo and Bradley Kaptur},
  journal={arXiv preprint arXiv:2504.20276},
  year={2025}
}