How can someone tell whether MT is accurate or not? This article explores the different ways in which MT accuracy is evaluated, listing a range of metrics and approaches that are used to gauge quality, both based on human expertise and automated metrics.
Additionally, we will touch upon the role of quality assurance models and quality estimation techniques that are gaining ground for enhancing MT accuracy and reliability.
Machine translation (MT) is undergoing a deep transformation, propelled by the advent of large language models (LLMs) and their remarkable ability to generate fluent, contextually relevant translations. Not only are LLM translations praised as increasingly accurate compared with earlier LLM generations and with neural machine translation (NMT), but they are also being integrated into a growing range of applications.
Human Expert Evaluation of Machine Translation
In a human evaluation approach, trained linguists meticulously examine machine-generated translations, comparing them to reference translations or against a set of criteria or evaluation metrics.
Linguists are sometimes engaged as the sole providers of evaluations, usually for short texts. They can also act as post-evaluators of MT output that has already been scored automatically. These experts may also work as part of a group, in which case the scores of the individual evaluators are averaged to produce a measurement of MT accuracy.
Most human evaluation of MT examines the same criteria: “adequacy” (fidelity), “comprehensibility” or “informativeness” (intelligibility), and “fluency” (grammaticality). These evaluations are typically sentence-based. Humans are also usually asked to determine whether the machine translation has captured the context correctly.
Generally speaking, sentence-level scoring is based on criteria applied directly by the human evaluator (i.e., direct MT assessment). When the evaluation compares the output of different MT systems for the same source text (i.e., MT ranking evaluation), it still happens at the sentence level, but the quality of each MT system is ranked against the rest. In either scenario there is a great deal of subjectivity: essentially no two human evaluators score translations in exactly the same way.
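As a minimal sketch of how scores from several evaluators might be combined in practice, the Python snippet below averages hypothetical direct assessment scores from three evaluators and ranks the competing systems; the system names and scores are invented for illustration.

```python
# Minimal sketch: combining direct assessment (DA) scores from several
# human evaluators. The systems and 0-100 scores below are invented
# purely for illustration.
from statistics import mean

# Each evaluator scores the same translated text from each system.
da_scores = {
    "system_A": [78, 85, 72],   # one score per evaluator
    "system_B": [90, 88, 84],
    "system_C": [65, 70, 61],
}

# Direct assessment: average the evaluators' scores per system.
averaged = {system: mean(scores) for system, scores in da_scores.items()}

# Ranking evaluation: order the systems by their averaged score.
ranking = sorted(averaged.items(), key=lambda item: item[1], reverse=True)

for rank, (system, score) in enumerate(ranking, start=1):
    print(f"{rank}. {system}: {score:.1f}")
```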
Human evaluation of MT has limitations beyond subjectivity. It can be expensive and time-consuming when there are large volumes of machine-translated text to review. The quality of the MT output also directly affects the human post-editing process (the step in which a linguist corrects text that has been machine translated).
Example of human evaluation metric: HTER
The human translation error rate (HTER) is the number of editing steps divided by the number of words in an acceptable translation. It is based on how many revisions a machine translation needs to reach the level of a correct reference translation: a linguist counts the insertions, deletions, substitutions, and other changes required to produce an acceptable translation.
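For illustration only, the sketch below shows the underlying arithmetic; a full HTER tool would also handle alignment and tokenization, and the edit counts here are assumed to come from a linguist's post-editing of a single segment.

```python
# Rough sketch of the HTER arithmetic. A real HTER tool aligns the texts and
# counts shifts automatically; here the edit counts are assumed to come from
# a linguist's post-editing of one segment.
def hter(insertions: int, deletions: int, substitutions: int, shifts: int,
         accepted_translation: str) -> float:
    """Edit steps divided by the number of words in the accepted translation."""
    edit_steps = insertions + deletions + substitutions + shifts
    reference_length = len(accepted_translation.split())
    return edit_steps / reference_length

# Hypothetical segment: 3 edits were needed to reach a 12-word accepted translation.
score = hter(insertions=1, deletions=0, substitutions=2, shifts=0,
             accepted_translation="the committee approved the proposal after a long debate on funding rules")
print(f"HTER = {score:.2f}")  # lower is better; 0.0 means no edits were needed
```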
When Should Automated Evaluation Metrics Be Used?
Automated MT evaluation metrics offer a scalable and efficient alternative to human evaluation for large volumes and multiple languages. These metrics leverage computational algorithms to compare MT outputs to reference translations or evaluate them based on various linguistic features.
Some of the best known automated evaluation metrics include:
BLEU (Bilingual Evaluation Understudy): BLEU measures the overlap between the machine-generated translation and one or more reference translations, focusing on the precision of n-gram matches (an n-gram is a sequence of elements, such as words, numbers, symbols, or punctuation appearing in sequence in a text).
METEOR (Metric for Evaluation of Translation with Explicit ORdering): METEOR considers both precision and recall, incorporating synonyms and paraphrases to better capture the meaning of a translation.
TER (Translation Edit Rate): TER measures the number of edits required to transform the machine-generated translation into a reference translation, normalized by the length of the reference; fewer edits mean a better translation.
Other automated metrics include RIBES, chrF, and COMET. Each metric has its strengths and weaknesses, and the choice of metric often depends on the specific application and desired evaluation criteria.
Automated evaluation metrics are fast and cost-effective, but they also have limitations. They may struggle to capture subtle nuances of meaning and style and can also be sensitive to variations in evaluation criteria.
Given that no single automated evaluation metric is perfect, some researchers cross-reference the scores of multiple metrics on a sample of translations. This is a popular approach when measuring the quality of machine translation from an LLM specifically trained on language- and domain-specific data.
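As a hedged sketch of this cross-referencing approach, the snippet below scores the same hypothetical outputs with BLEU, chrF, and TER using the sacrebleu library (assuming it is installed); the sentences are invented for the example.

```python
# Sketch: cross-referencing several automated metrics on the same outputs.
# Assumes the sacrebleu package is installed (pip install sacrebleu);
# the hypotheses and references below are invented for illustration.
import sacrebleu

hypotheses = [
    "The cat sat on the mat.",
    "He did not attend the meeting yesterday.",
]
references = [[
    "The cat sat on the mat.",
    "He didn't attend yesterday's meeting.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # n-gram precision based
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, references)     # edit-rate based (lower is better)

print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
print(f"TER:  {ter.score:.1f}")
```

Comparing the three scores side by side helps reveal when a single metric is rewarding surface overlap rather than genuine adequacy.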
Quality Assurance vs Quality Estimation
Quality assurance (QA) models represent a proactive approach to ensuring translation quality. These models leverage machine learning techniques to predict the quality of machine translations before or during the translation process. By analyzing various features, such as source text complexity, MT model confidence scores (a probability, not an accuracy score), and linguistic patterns, QA models can identify potential errors and inconsistencies, flagging them for further review or correction.
The use of QA models can significantly streamline the post-editing process, even as an intermediate step, because it helps linguists focus their efforts on the most problematic segments. QA models can also help identify systemic issues in MT systems, highlighting patterns and guiding improvements in model training and development.
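The features and thresholds a QA model relies on vary from system to system, but the hypothetical sketch below illustrates the basic flagging idea: it turns token log-probabilities from an MT model into a segment-level confidence and flags low-confidence segments for review. The data and the 0.6 threshold are assumptions made for this example.

```python
# Hypothetical sketch of confidence-based QA flagging. The token
# log-probabilities would come from the MT model's decoder; here they are
# invented, and the 0.6 threshold is an arbitrary assumption.
import math

def segment_confidence(token_logprobs: list[float]) -> float:
    """Average token probability for a segment (a probability, not an accuracy score)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

segments = {
    "seg_001": [-0.05, -0.10, -0.02, -0.08],   # high-confidence segment
    "seg_002": [-0.90, -1.40, -0.70, -1.10],   # low-confidence segment
}

CONFIDENCE_THRESHOLD = 0.6  # arbitrary cut-off for this sketch

for seg_id, logprobs in segments.items():
    confidence = segment_confidence(logprobs)
    status = "flag for review" if confidence < CONFIDENCE_THRESHOLD else "pass"
    print(f"{seg_id}: confidence={confidence:.2f} -> {status}")
```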
Quality estimation (QE) is related to, but distinct from, quality assurance in machine translation. While QA models are useful for predicting the overall quality of a translation, QE focuses on estimating the quality of individual segments or sentences. QE models typically analyze both the source and target texts to generate a quality score for each translation unit.
Despite its usefulness, QE often fails to capture subtle errors, such as contextual ambiguities, or to spot patterns that could point to issues in the training data. However, QE models are continuously improving and have a growing number of applications. For example, QE can be used to prioritize segments for post-editing, to select dynamically among several available MT models, and to provide feedback on the expected quality of a translation, which in turn helps manage projects more efficiently.
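To illustrate the post-editing prioritization use case, the sketch below sorts segments by an assumed QE score and routes the lowest-scoring ones to a linguist first; the qe_score values stand in for the output of a real QE model, and the threshold is arbitrary.

```python
# Sketch: prioritizing segments for post-editing based on QE scores.
# The qe_score values stand in for the output of a real QE model
# (e.g., a 0-1 score per source/target pair); the segments are invented.
segments = [
    {"id": "seg_01", "source": "source text 1", "target": "target text 1", "qe_score": 0.92},
    {"id": "seg_02", "source": "source text 2", "target": "target text 2", "qe_score": 0.41},
    {"id": "seg_03", "source": "source text 3", "target": "target text 3", "qe_score": 0.67},
]

POST_EDIT_THRESHOLD = 0.7  # arbitrary assumption for this sketch

# Lowest estimated quality first, so linguists see the riskiest segments early.
queue = sorted(
    (s for s in segments if s["qe_score"] < POST_EDIT_THRESHOLD),
    key=lambda s: s["qe_score"],
)

for segment in queue:
    print(f"{segment['id']} -> send to post-editing (QE score {segment['qe_score']:.2f})")
```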
Accuracy in Neural Machine Translation vs AI Translation
Hundreds of scientific papers describing experiments with neural machine translation (NMT) systems and LLMs show that LLMs generally outperform NMT in accuracy (measured using the human and automated evaluation criteria explained above). NMT faces challenges in maintaining consistency across longer texts and can sometimes produce translations that appear linguistically correct but are factually inaccurate.
In contrast, LLMs tend to perform at a higher level of accuracy, due in part to their vast training data and their ability to take context into account. LLMs excel at maintaining consistency across larger context windows and at adapting to specific styles or terminology through prompting.
However, LLMs can also produce hallucinations if not properly fine-tuned, and are generally slower and more computationally expensive than traditional NMT models. LLMs may also have a harder time with accuracy for low-resource languages or deeply nuanced language, such as legalese.
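As a hedged illustration of steering terminology through prompting, the sketch below embeds a small glossary in a translation prompt and sends it to an LLM via the OpenAI Python client; the model name, glossary, and example sentence are assumptions chosen for the example, not a recommendation of a particular provider or product.

```python
# Hypothetical sketch: terminology-constrained translation via prompting.
# Assumes the openai package (>=1.0) is installed and OPENAI_API_KEY is set;
# the glossary, sentence, and model name are illustrative choices.
from openai import OpenAI

client = OpenAI()

glossary = {"Kündigungsfrist": "notice period", "Arbeitnehmer": "employee"}
glossary_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())

# German source: "The employee must observe the three-month notice period."
source_text = "Der Arbeitnehmer muss die Kündigungsfrist von drei Monaten einhalten."

prompt = (
    "Translate the following German sentence into English. "
    "Use exactly these glossary terms:\n"
    f"{glossary_lines}\n\n"
    f"Sentence: {source_text}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any capable LLM endpoint works
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Even with this kind of prompting, the caveats above still apply: the output should be checked for hallucinations, especially for low-resource languages or highly specialized text.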