An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

22 February 2025

Abstract

This paper presents an improved clustering technique for large textual datasets that leverages fine-tuned word embeddings. The WEClustering technique is used as the base model and is further improved by incorporating fine-tuned contextual embeddings, advanced dimensionality reduction methods, and optimized clustering algorithms. Experimental results on benchmark datasets demonstrate significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). Increases of 45% and 67% in median silhouette score are reported for the proposed WEClustering_K++ (based on K-means) and WEClustering_A++ (based on agglomerative clustering) models, respectively. The proposed technique helps bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
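As a rough illustration of how the three metrics named in the abstract can be computed for a K-means and an agglomerative clustering of embedding vectors, the following is a minimal sketch using scikit-learn. It is not the authors' pipeline: the synthetic Gaussian vectors stand in for fine-tuned embeddings, PCA stands in for the unspecified dimensionality reduction step, and the `purity` and `evaluate` helpers, the `k-means++` initialization, and the Ward linkage are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score


def purity(labels_true, labels_pred):
    """Fraction of points belonging to the majority true class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    correct = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        correct += np.bincount(members).max()
    return correct / labels_true.size


def evaluate(features, labels_true, n_clusters, n_components=10, seed=0):
    """Reduce dimensionality, cluster with two algorithms, and score each one."""
    # PCA is a stand-in here; the paper's reduction method is not given in the abstract.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=seed),
        "agglomerative": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
    }
    scores = {}
    for name, model in models.items():
        labels_pred = model.fit_predict(reduced)
        scores[name] = {
            "silhouette": silhouette_score(reduced, labels_pred),
            "purity": purity(labels_true, labels_pred),
            "ari": adjusted_rand_score(labels_true, labels_pred),
        }
    return scores


# Synthetic stand-in for document embeddings: 3 well-separated groups in 50 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 50))
labels_true = np.repeat(np.arange(3), 100)
features = centers[labels_true] + rng.normal(size=(300, 50))

results = evaluate(features, labels_true, n_clusters=3)
for name, metrics in results.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

Silhouette score needs no ground-truth labels, whereas purity and ARI compare the predicted clusters against known classes, so the latter two apply only to labeled benchmark datasets.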


@article{sutrakar2025_2502.16139,
  title={An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets},
  author={Vijay Kumar Sutrakar and Nikhil Mogre},
  journal={arXiv preprint arXiv:2502.16139},
  year={2025}
}
