
An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

Abstract

This paper presents an improved clustering technique for large textual datasets that leverages fine-tuned word embeddings. The WEClustering technique serves as the base model and is further improved by incorporating fine-tuning of contextual embeddings, advanced dimensionality reduction methods, and optimization of clustering algorithms. Experimental results on benchmark datasets demonstrate significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). Increases of 45% and 67% in median silhouette score are reported for the proposed WEClustering_K++ (based on K-means) and WEClustering_A++ (based on agglomerative clustering) models, respectively. The proposed technique helps bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
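Two of the evaluation metrics named in the abstract, purity and ARI, compare predicted cluster labels against ground-truth classes. The following is a minimal sketch of their standard definitions, using only the Python standard library. It is not the authors' evaluation code; the function names `purity` and `adjusted_rand_index` and the toy labels are illustrative assumptions. Silhouette score is omitted because it needs the embedding vectors rather than labels alone.

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance, computed from the contingency table."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: trivially identical partitions

    # Pairs of points that share both a true class and a predicted cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    # Pairs that share a true class, and pairs that share a predicted cluster.
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (hypothetical, not taken from the paper's datasets).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
print(purity(y_true, y_pred))
print(adjusted_rand_index(y_true, y_pred))
```

Both functions are invariant to how cluster IDs are numbered, so a clustering that matches the true classes under a different labeling still scores 1.0.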

@article{sutrakar2025_2502.16139,
  title={An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets},
  author={Vijay Kumar Sutrakar and Nikhil Mogre},
  journal={arXiv preprint arXiv:2502.16139},
  year={2025}
}