Recent developments in neural information retrieval models have been promising, but a problem remains: human relevance judgments are expensive to produce, while neural models require a considerable amount of training data. In an attempt to fill this gap, we present an approach for generating weak supervision training data for use in a neural IR model. Specifically, we use a news corpus with article headlines acting as pseudo-queries and article content as pseudo-documents, and we propose a measure of interaction similarity to filter these pseudo-documents. Additionally, we employ techniques for addressing problems related to finding effective negative training examples and disregarding headlines that do not work well as queries. By using our approach to train state-of-the-art neural IR models and comparing to established baselines, we find that training data generated by our approach can lead to good results on a benchmark test collection.
View on arXiv