Comparing Titles vs. Full-text for Multi-Label Classification of Scientific Papers and News Articles

15 May 2017

Lukas Galke

Abstract

Until today there has been no systematic comparison of how far document classification can be conducted using just the titles of the documents. However, methods using only the titles are very important since automated processing of titles has no legal barriers. Copyright laws often hinder automated document classification on full-text and even abstracts. In this paper, we compare established methods like Bayes, Rocchio, kNN, SVM, and logistic regression as well as recent methods like Learning to Rank and neural networks to the multi-label document classification problem. We demonstrate that classifications solely using the documents' titles can be very good and very close to the classification results using full-text. We use two established news corpora and two scientific document collections. The experiments are large-scale in terms of documents per corpus (up to 100,000) as well as number of labels (up to 10,000). The best method on title data is a modern variant of neural networks. For three datasets, the difference to full-text is very small. For one dataset, a stacking of logistic regression and decision trees performs slightly better than neural networks. Furthermore, we observe that the best methods on titles are even better than several state-of-the-art methods on full-text.

View on arXiv

Comments on this paper