Machine Translation (MT) systems frequently encounter gender-ambiguous occupational terms, where they must assign gender without explicit contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns, such as consistently translating certain professions with specific genders, can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we introduce GRAPE, a probability-based metric designed to evaluate gender bias by analyzing aggregated model responses. Alongside this, we present GAMBIT-MT, a benchmarking dataset in English with gender-ambiguous occupational terms. Using GRAPE, we evaluate several MT systems and examine whether their gendered translations in Greek and French align with or diverge from societal stereotypes, real-world occupational gender distributions, and normative standards.
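The abstract does not spell out GRAPE's exact formulation, so the sketch below is only an illustration of the general idea it describes: from many aggregated translations, estimate the probability that each occupation is rendered with a given gender, then compare those probabilities against a reference distribution (e.g., parity or real-world occupational statistics). The function names, the mean-absolute-deviation score, and the example data are assumptions for illustration, not the paper's definitions.

from collections import Counter

def masculine_rate(translations: list[str], masculine_forms: set[str]) -> float:
    """Empirical probability that an occupation is translated with a masculine form,
    estimated from many sampled translations of the same ambiguous source sentence."""
    counts = Counter(t in masculine_forms for t in translations)
    total = counts[True] + counts[False]
    return counts[True] / total if total else 0.0

def aggregate_bias(per_occupation: dict[str, float], reference: dict[str, float]) -> float:
    """Hypothetical aggregate bias score: mean absolute deviation of the model's
    masculine-translation probabilities from a reference distribution
    (e.g., parity at 0.5, or real-world occupational gender statistics)."""
    return sum(abs(per_occupation[o] - reference[o]) for o in per_occupation) / len(per_occupation)

# Per-occupation rate from sampled French translations of "the nurse" (illustrative data).
samples = ["infirmière", "infirmière", "infirmier", "infirmière"]
print(masculine_rate(samples, {"infirmier"}))  # 0.25

# Aggregate score for a system that strongly genders two occupations, vs. a parity reference.
rates = {"nurse": 0.10, "engineer": 0.95}
parity = {occ: 0.5 for occ in rates}
print(f"aggregate bias vs. parity: {aggregate_bias(rates, parity):.3f}")  # 0.425

Swapping the parity reference for real-world occupational statistics (or a normative standard) yields the other comparisons the abstract mentions without changing the scoring code.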
@article{mastromichalakis2025_2503.04372,
  title   = {Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms},
  author  = {Orfeas Menis Mastromichalakis and Giorgos Filandrianos and Maria Symeonaki and Giorgos Stamou},
  journal = {arXiv preprint arXiv:2503.04372},
  year    = {2025}
}