
Sequence graphs realizations and ambiguity in language models

Main: 32 Pages
16 Figures
Bibliography: 3 Pages
2 Tables
Abstract

Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some sequence graphs are ambiguous, admitting several realizations as a sequence, while others admit no realization at all. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w = 2$, we provide polynomial-time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the #P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case, for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]$-hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate size. This work leaves open the membership in NP of both problems, a non-trivial question due to the existence of minimum realizations whose size is exponential in the instance encoding.
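To make the notions of sequence graph and ambiguity concrete, the following minimal sketch builds a sequence graph from a token sequence and exhibits two distinct sequences with the same graph. It is an illustration, not the paper's implementation: the function name `sequence_graph` is hypothetical, and the windowing convention (positions $i < j$ co-occur when $j - i < w$, so $w = 2$ links consecutive tokens only) is an assumption, as is the choice to record weights as pair counts.

```python
from collections import Counter


def sequence_graph(tokens, w, directed=True):
    """Return (vertices, edges) for a token sequence and window size w.

    Assumed convention: positions i < j co-occur when j - i < w.
    `edges` maps each (u, v) pair to its multiplicity (the edge weight);
    dropping the counts gives the unweighted variant.
    """
    edges = Counter()
    for i, u in enumerate(tokens):
        # Pair position i with every later position inside the window.
        for j in range(i + 1, min(i + w, len(tokens))):
            v = tokens[j]
            # Undirected edges are stored with a canonical endpoint order.
            key = (u, v) if directed else tuple(sorted((u, v)))
            edges[key] += 1
    return set(tokens), edges


# Two different sequences that yield the same directed, weighted graph
# for w = 2, so that graph is ambiguous under the assumed convention.
g1 = sequence_graph("abaca", 2)
g2 = sequence_graph("acaba", 2)
print(g1 == g2)
```

Both sequences produce the edges a→b, b→a, a→c and c→a, each with multiplicity one, which is why the graph alone cannot distinguish them.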
