An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
This paper presents an improved clustering technique for large textual datasets that leverages fine-tuned word embeddings. The WEClustering technique is used as the base model and is further improved by incorporating fine-tuned contextual embeddings, advanced dimensionality reduction methods, and optimized clustering algorithms. Experimental results on benchmark datasets demonstrate significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). The proposed WEClustering_K++ (based on K-means) and WEClustering_A++ (based on agglomerative clustering) models increase the median silhouette score by 45% and 67%, respectively. The proposed technique helps bridge the gap between semantic understanding and statistical robustness in large-scale text-mining tasks.
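To make two of the reported metrics concrete, the following is a minimal sketch of how silhouette score and purity can be computed for a given cluster assignment. It is not the authors' implementation: the function names, the toy 2-D points standing in for document embeddings, and the labels are all assumptions made for illustration, and Euclidean distance is assumed for the silhouette.

```python
import math
from collections import Counter, defaultdict


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    clusters = defaultdict(list)
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters[cluster_id].append(true_label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def silhouette_score(points, labels_pred):
    """Mean silhouette coefficient using Euclidean distance.

    For each point: a = mean distance to the other points in its own cluster,
    b = smallest mean distance to the points of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters get s = 0.
    """
    clusters = defaultdict(list)
    for idx, cluster_id in enumerate(labels_pred):
        clusters[cluster_id].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, cluster_id in enumerate(labels_pred):
        own = [j for j in clusters[cluster_id] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other_id, members in clusters.items()
            if other_id != cluster_id
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups standing in for embeddings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels_true = ["sports", "sports", "politics", "politics"]
good = [0, 0, 1, 1]  # assignment that matches the true grouping
bad = [0, 1, 0, 1]   # assignment that mixes the two groups

print(purity(labels_true, good), silhouette_score(points, good))
print(purity(labels_true, bad), silhouette_score(points, bad))
```

On this toy data the assignment that matches the true grouping scores higher on both metrics than the mixed one. Silhouette needs only the points and the predicted clusters, whereas purity also requires ground-truth labels.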
@article{sutrakar2025_2502.16139,
  title={An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets},
  author={Vijay Kumar Sutrakar and Nikhil Mogre},
  journal={arXiv preprint arXiv:2502.16139},
  year={2025}
}