Here are a few suggestions on how rule-based and statistical machine translation can help each other:
(This is a follow-up to the previous post)
- to begin with, rule-based and statistical machine translation are often contrasted and compared, but it would be an oversimplification to conclude that one is better than the other. More objectively, each method has its strengths and weaknesses. Let us investigate how the two could collaborate so as to combine their respective strengths
- in the case of an endangered language, the lack of good quality corpora has been pointed out. But one way for rule-based and statistical machine translation to collaborate would be to use rule-based translation for building a better quality corpus for statistical machine translation
- suppose we begin with statistical machine translation software that scores 50% on average for French to Corsican translation
- let us sketch the process of creating these better corpora, taking the example of the French-Corsican diglossic pair (the Corsican language being considered by UNESCO as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to put it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus to use with statistical machine translation.
- let us now sketch the different steps of this collaborative process: (i) create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at 90% on average, then the corpus would be on average 90% reliable. With appropriate training, statistical MT should then perform at, say, 80% on average (to be compared with the previous 50%)
(ii) from this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, French-English, etc. corpora of excellent quality already exist. The performance gain should then extend to these other language pairs.
- with the help of this process, we are finally in a position to combine the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even in the absence of corpora; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (One hand washes the other).
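Steps (i) and (ii) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_synthetic_corpus` and `pivot_corpus` are hypothetical, the rule-based engine is replaced by a placeholder that merely tags its input, and the pivot step assumes that the French side of both corpora matches sentence for sentence.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_synthetic_corpus(
    french_sentences: List[str],
    rule_based_translate: Callable[[str], str],
) -> List[Pair]:
    """Step (i): pair each French sentence with its rule-based translation."""
    return [(fr, rule_based_translate(fr)) for fr in french_sentences]


def pivot_corpus(fr_co: List[Pair], fr_other: List[Pair]) -> List[Pair]:
    """Step (ii): join two corpora on their shared French sentences.

    Returns (other language, Corsican) pairs for every French sentence
    present in both corpora; sentences without a match are dropped.
    """
    corsican_by_french: Dict[str, str] = {fr: co for fr, co in fr_co}
    return [
        (other, corsican_by_french[fr])
        for fr, other in fr_other
        if fr in corsican_by_french
    ]


def placeholder_engine(sentence: str) -> str:
    """Hypothetical stand-in for a real rule-based French-Corsican engine."""
    return f"<co:{sentence}>"


# Toy data: a real run would use a large monolingual French text
# and an existing high-quality French-Italian corpus.
fr_co = build_synthetic_corpus(["bonjour", "merci"], placeholder_engine)
fr_it = [("bonjour", "buongiorno"), ("au revoir", "arrivederci")]
it_co = pivot_corpus(fr_co, fr_it)
print(it_co)
```

Only the sentences found in both corpora survive the join, so in practice it is simplest to run the rule-based engine directly on the French side of the existing French-Italian or French-English corpus.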
Here are some arguments in favor of choosing rule-based translation for the machine translation of endangered languages (this relates to the philosophy of language policy):
- at present, no reliable corpus exists between the given endangered language and other languages
- endangered languages are often polynomic, i.e. several main variants of the language coexist. It is important to preserve these variants and to distinguish between them, since (i) they are a feature of diversity and (ii) they are an inherent feature of the given endangered language. In addition, a translation should not mix these variants. This also complicates the process of building a proper corpus, since the scarce existing corpus is made up of different variants of the language.
- in the absence of an adequate corpus, statistical machine translation is not able to provide quality translation of the given endangered language (while on the other hand it succeeds with common languages where excellent corpora are available): arguably, providing low quality translation (although the attempt is commendable) could harm these endangered languages, which are by definition vulnerable, since people could use and spread the resulting low quality translations. On those grounds, given this vulnerability, it could be argued that a minimum of 80% translation quality is needed for a given pair involving an endangered language.
- in addition, it should be pointed out that endangered languages are usually in a ‘diglossic’ relationship with another language: what is needed as a matter of priority is to provide translation between the two languages of this pair
(to be continued)
French to Corsican: performance on a French Wikipedia test sample currently amounts to 93% on average. Below is a rough typology of the remaining errors (presumably an average of 95% should be attainable by correcting the errors tagged ‘easy’):
- unknown vocabulary: 50% (easy)
- basic disambiguation: 15% (easy)
- erroneous agreement (relates to (i) words that are masculine in French and feminine in Corsican; and (ii) words that are feminine in French and masculine in Corsican): 5% (medium difficulty)
- inadequate locution: 10% (medium difficulty or hard)
- false positives: 5% (medium difficulty or hard)
- semantic disambiguation: 5% (hard). For example, disambiguating French ‘échecs’ = fiaschi/scacchi (failures/chess)
- specific grammatical case: 2% (hard)
- word reference error: 2% (hard)
- unknown, unclassified: 6% (hard)
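The arithmetic behind the 95% estimate can be checked with a short script. It rests on two assumptions that the list does not state explicitly: the percentages are shares of the remaining 7% of errors, and correcting one error introduces no new one. The variable names are illustrative only.

```python
# Shares of the remaining errors, in percent, copied from the typology above.
error_shares = {
    "unknown vocabulary": (50, "easy"),
    "basic disambiguation": (15, "easy"),
    "erroneous agreement": (5, "medium"),
    "inadequate locution": (10, "medium or hard"),
    "false positives": (5, "medium or hard"),
    "semantic disambiguation": (5, "hard"),
    "specific grammatical case": (2, "hard"),
    "word reference error": (2, "hard"),
    "unknown, unclassified": (6, "hard"),
}

baseline = 93.0                # current average performance, in percent
error_rate = 100.0 - baseline  # remaining errors, in percentage points

# The shares should account for all remaining errors.
total_share = sum(share for share, _ in error_shares.values())

# Share of the remaining errors tagged 'easy'.
easy_share = sum(share for share, tag in error_shares.values() if tag == "easy")

# Percentage points gained if every 'easy' error were fixed.
easy_points = error_rate * easy_share / 100
ceiling = baseline + easy_points

# Fraction of the 'easy' errors that must be fixed to reach 95%.
needed_fraction = (95.0 - baseline) / easy_points

print(total_share, easy_share, round(ceiling, 2), round(needed_fraction, 2))
```

Under these assumptions, the ‘easy’ errors make up 65% of the remaining errors, so fixing all of them would bring the average to about 97.55%, and fixing a bit under half of them (about 44%) would be enough to reach 95%.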
Some improvements made to French to Italian translation:
- fixed several contractions (della, dello, …)
- the nice thing is that semantic disambiguation is working: ‘échecs’ = fallimenti/scacchi (failures/chess) and translates properly into scacchi
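To give an idea of what disambiguating ‘échecs’ involves, here is a minimal sketch based on context words. It is not the method used by the translator described in this post: the cue list `CHESS_CUES` and the function `translate_echecs` are hypothetical, and a real system would need far richer context handling.

```python
import re

# Hypothetical cue words suggesting that 'échecs' refers to the game.
CHESS_CUES = {"joueur", "joueurs", "tournoi", "partie", "échiquier", "pion"}


def translate_echecs(sentence: str) -> str:
    """Pick an Italian rendering of 'échecs' from the surrounding words.

    Returns 'scacchi' (chess) if any cue word occurs in the sentence,
    and 'fallimenti' (failures) otherwise.
    """
    words = set(re.findall(r"\w+", sentence.lower()))
    return "scacchi" if words & CHESS_CUES else "fallimenti"


print(translate_echecs("Le tournoi d'échecs commence demain"))
print(translate_echecs("Ses échecs répétés l'ont découragé"))
```

The first sentence contains the cue ‘tournoi’ and is rendered with scacchi; the second contains no cue and falls back to fallimenti.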
Now testing French to Italian translation: it is the very first draft. A rough 80%. A lot of things to fix.