Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes

Abstract

Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a continuous vector space. The primary goal of text embeddings is to enable the processing of text data by machine learning models, which require numerical input. Numerous embedding models have been developed for various applications. This paper presents our work in evaluating different embeddings through a comprehensive comparative analysis of four distinct models, focusing on their text classification efficacy. We employ both K-Nearest Neighbors (KNN) and Logistic Regression (LR) to perform binary classification tasks, specifically determining whether a text snippet is associated with 'delay' or 'not delay' within a labeled dataset. Our research explores the use of text snippet embeddings for training supervised text classification models to identify delay-related statements during the document review process of construction delay disputes. The results of this study highlight the potential of embedding models to enhance the efficiency and accuracy of document analysis in legal contexts, paving the way for more informed decision-making in complex investigative scenarios.
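To make the classification setup concrete, the following is a minimal sketch of how KNN and LR might operate on snippet embeddings. It is not the authors' implementation. The three-dimensional vectors are invented stand-ins for real embeddings (which would come from an embedding model and have far more dimensions), and the function names, the use of cosine similarity, the value of `k`, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math

# Toy stand-ins for snippet embeddings (hypothetical values, not real model output).
# Label 1 = 'delay', label 0 = 'not delay'.
TRAIN_VECTORS = [
    [0.90, 0.10, 0.20], [0.80, 0.20, 0.10], [0.85, 0.15, 0.30],  # 'delay'
    [0.10, 0.90, 0.30], [0.20, 0.80, 0.20], [0.15, 0.70, 0.40],  # 'not delay'
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return 1 if sum(top_labels) * 2 > k else 0


def train_logistic_regression(vectors, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - y  # gradient of log loss with respect to z
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def lr_predict(weights, bias, x):
    """Predict 1 when the linear score is non-negative (probability >= 0.5)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0


if __name__ == "__main__":
    weights, bias = train_logistic_regression(TRAIN_VECTORS, TRAIN_LABELS)
    for query in ([0.88, 0.12, 0.20], [0.10, 0.85, 0.30]):
        print(query, knn_predict(TRAIN_VECTORS, TRAIN_LABELS, query),
              lr_predict(weights, bias, query))
```

In practice one would likely replace the toy vectors with embeddings of labeled review snippets and use a maintained library implementation of both classifiers. The sketch only illustrates that the classifiers see nothing but the vectors, which is why the choice of embedding model can affect classification quality.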
