Evaluation of machine translation is usually performed with external tools (for instance ARPA, BLEU, METEOR, LEPOR, …). But let us investigate the idea of self-evaluation, for it seems that the software itself is capable of forming an accurate estimate of its own possible errors.
In the above example, human evaluation yields a score of 1 – 5/88 = 94.31%. Contrast this with self-evaluation, which sums the software's possible errors (unknown words and disambiguation errors), yielding a self-evaluation of 92.05% from 7 hypothesized errors. In this case, self-evaluation computes the maximum error rate. But even here there are some false positives: ‘apellation’ is left untranslated because it is unrecognized; in fact, the correct spelling is ‘appellation’. To sum up: the software identifies an unknown word, leaves it untranslated, and counts it as a possible error.
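The arithmetic above can be made explicit in a short sketch. The counts (88 words, 5 human-judged errors, 7 self-hypothesized errors) come from the example; the function name is illustrative:

```python
def evaluation_score(total_words: int, error_count: int) -> float:
    """Score as a percentage: 1 - errors/words."""
    return 100 * (1 - error_count / total_words)

# Human evaluation: 5 errors out of 88 words.
human_score = evaluation_score(88, 5)   # about 94.3%

# Self-evaluation: 7 hypothesized errors (unknown words and
# disambiguation errors), i.e. the maximum error rate.
self_score = evaluation_score(88, 7)    # about 92.05%
```

Because self-evaluation counts every hypothesized error, its score is a lower bound on the human-evaluated score whenever some flags turn out to be false positives.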
Let us now sketch the pros and cons of MT self-evaluation. To begin with, the pros:
- it could provide a detailed taxonomy of possible errors: unknown words, unresolved grammatical disambiguation, unresolved semantic disambiguation, …
- it could pinpoint the suspected errors precisely
- evaluation would be very fast and inexpensive
- self-evaluation would work on any text or corpus
- self-evaluation could pave the way to further self-improvement and self-correction of errors
- its reliability could prove reasonably good
And the cons:
- MT may be unaware of some types of errors, e.g. errors related to fixed expressions and idioms
- it would sometimes produce false positives, and identifying those false positives would itself be an issue
- MT would be unaware of disambiguations it resolved incorrectly
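As a closing illustration, the unknown-word flagging discussed in the example could look like the following minimal sketch. The lexicon, the input sentence, and the category name are invented for illustration:

```python
# Minimal sketch of a word-by-word translator that flags unknown source
# words as potential errors. LEXICON and the sentence are hypothetical.
LEXICON = {"the": "le", "wine": "vin", "is": "est", "good": "bon"}

def translate_with_flags(words):
    """Translate word by word, flagging unknown words as possible errors."""
    output, flags = [], []
    for w in words:
        if w in LEXICON:
            output.append(LEXICON[w])
        else:
            output.append(w)                    # left untranslated
            flags.append(("unknown_word", w))   # counted as a possible error
    return output, flags

# 'apellation' (a misspelling of 'appellation') is flagged as unknown.
# Leaving a misspelled source word untranslated may well be the right
# behaviour, so counting it as an error is a false positive.
out, flags = translate_with_flags(["the", "apellation", "is", "good"])
```

The flag list is what self-evaluation would sum over; telling true errors from false positives like this one is precisely the issue raised in the cons above.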