Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

Cross-domain constituency parsing remains an unsolved challenge in computational linguistics, since available multi-domain constituency treebanks are limited. In this paper, we investigate automatic treebank generation with large language models (LLMs). Because LLMs perform poorly on constituency parsing itself, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing: it takes as input an incomplete cross-domain constituency tree whose leaf nodes contain only domain keywords, and fills in the missing words to generate a cross-domain constituency treebank. In addition, we introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of the LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
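To make the back generation idea concrete, the following minimal Python sketch shows how an incomplete bracketed tree with keyword leaves might be serialized into a prompt for an LLM to complete. The example tree, the prompt wording, and the call_llm helper are our own illustrative assumptions, not the paper's exact setup.

# Sketch of LLM back generation: an incomplete constituency tree, in which
# only domain keyword leaves are filled and the remaining leaves are [MASK]
# placeholders, is turned into a prompt; the LLM fills the masks to yield a
# complete tree for the target domain.

INCOMPLETE_TREE = (
    "(S (NP (NN [MASK]) (NN stock)) "
    "(VP (VBD [MASK]) (NP (CD [MASK]) (NN percent))))"
)

PROMPT_TEMPLATE = (
    "Fill each [MASK] leaf in the constituency tree below with a single word "
    "so that the sentence is fluent in the {domain} domain. "
    "Keep the bracket structure unchanged.\n\nTree: {tree}\n\nCompleted tree:"
)

def back_generate(tree: str, domain: str, call_llm) -> str:
    """Ask an LLM to fill the masked leaves of an incomplete tree.

    call_llm is a hypothetical client (prompt string in, completion string
    out); any LLM API could be plugged in here.
    """
    prompt = PROMPT_TEMPLATE.format(domain=domain, tree=tree)
    return call_llm(prompt)  # returns the completed bracketed tree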
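The span-level contrastive pre-training can likewise be sketched with an InfoNCE-style objective over span representations, assuming constituent spans from the back-generated trees act as positives and non-constituent spans as negatives for a given anchor; the tensor shapes and temperature below are illustrative assumptions, not the paper's reported configuration.

# Minimal PyTorch sketch of a span-level contrastive loss: the anchor span
# is pulled toward its positive (a matching constituent span) and pushed
# away from negatives (non-constituent spans).
import torch
import torch.nn.functional as F

def span_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss over span representations.

    anchor:    (d,)   span representation in one view
    positive:  (d,)   representation of the matching constituent span
    negatives: (k, d) representations of non-constituent spans
    """
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(
        torch.cat([positive.unsqueeze(0), negatives], dim=0), dim=-1
    )
    logits = candidates @ anchor / temperature  # (k+1,) similarity scores
    target = torch.zeros(1, dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)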
@article{guo2025_2505.20976,
  title   = {Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing},
  author  = {Peiming Guo and Meishan Zhang and Jianling Li and Min Zhang and Yue Zhang},
  journal = {arXiv preprint arXiv:2505.20976},
  year    = {2025}
}