A Graph Diffusion Algorithm for Lexical Similarity Evaluation

9 April 2025

Abstract

We present an algorithm for evaluating the lexical similarity between a given language and several reference language clusters. The input is a list of concepts and their translations in all considered languages, with each reference language assigned to one of $c$ language clusters. For each concept, the algorithm computes the distance between each pair of translations. Based on these distances, it constructs a weighted directed graph in which every vertex represents a language. It then solves a graph diffusion equation with a Dirichlet boundary condition, where the unknown is a map from the vertex set to $\mathbb{R}^c$. The resulting coordinates lie in the interval $[0,1]$ and can be interpreted either as probabilities of belonging to each cluster or as a lexical similarity distribution with respect to the reference clusters. The distances between translations are calculated from phonetic transcriptions using a modification of the Damerau-Levenshtein distance. The algorithm can be useful for analyzing relationships between languages spoken in multilingual territories with many mutual influences. We demonstrate this with a case study of various European languages.
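
To make the diffusion step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the steady-state form of the graph diffusion equation, pins each reference language to a one-hot vector for its cluster (the Dirichlet boundary values), and relaxes the remaining vertices by Gauss-Seidel iteration until each one equals the weighted average of its out-neighbours. The function name `diffuse_labels`, its parameters, and the toy weights are hypothetical; in the paper the edge weights are derived from the modified Damerau-Levenshtein distances, which this sketch simply takes as given.

```python
def diffuse_labels(weights, cluster_of, n_clusters, tol=1e-12, max_iter=10_000):
    """Steady-state graph diffusion with Dirichlet boundary values (illustrative sketch).

    weights[i][j] : nonnegative weight of the directed edge i -> j (assumed given).
    cluster_of[i] : cluster index for a reference language, None for an unknown one.
    Returns one coordinate vector of length n_clusters per vertex.
    """
    n = len(weights)

    # Dirichlet data: reference vertices are fixed to one-hot cluster vectors.
    u = [[0.0] * n_clusters for _ in range(n)]
    for i, c in enumerate(cluster_of):
        if c is not None:
            u[i][c] = 1.0

    unknown = [i for i, c in enumerate(cluster_of) if c is None]

    # Gauss-Seidel relaxation: every unknown vertex is repeatedly replaced
    # by the weighted average of the vertices its outgoing edges point to.
    for _ in range(max_iter):
        change = 0.0
        for i in unknown:
            total = sum(weights[i][j] for j in range(n) if j != i)
            if total == 0.0:
                continue  # no outgoing edges: nothing to average over
            for k in range(n_clusters):
                new = sum(weights[i][j] * u[j][k] for j in range(n) if j != i) / total
                change = max(change, abs(new - u[i][k]))
                u[i][k] = new
        if change < tol:
            break
    return u


# Toy example: vertices 0 and 1 are reference languages in clusters 0 and 1;
# vertex 2 is the language being evaluated, tied three times more strongly
# to vertex 0 than to vertex 1.
toy_weights = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [3.0, 1.0, 0.0],
]
coords = diffuse_labels(toy_weights, [0, 1, None], n_clusters=2)
print(coords[2])  # [0.75, 0.25]
```

In this toy case the evaluated language receives coordinates that lie in $[0,1]$ and sum to one, which matches the interpretation as a similarity distribution over the reference clusters. A direct linear solve would serve equally well for small graphs; the iterative form is used here only because it mirrors the diffusion picture.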


@article{mikula2025_2504.06816,
  title={A Graph Diffusion Algorithm for Lexical Similarity Evaluation},
  author={Karol Mikula and Mariana Sarkociová Remešíková},
  journal={arXiv preprint arXiv:2504.06816},
  year={2025}
}
