Papers
arxiv:2507.12724

TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

Published on Jul 17
Authors:
,
,

Abstract

TransEvalnia, a prompting-based translation evaluation system using reasoning, outperforms MT-Ranker and aligns well with human assessments across multiple language pairs.

AI-generated summary

We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system -- as well as MT-Ranker -- to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system's evaluation and reasoning, human assessments, as well as code is released.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.12724 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.12724 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.