Here is an interesting case of first-name disambiguation for machine translation. Consider the French first name ‘Camille’, which can be given to both men and women. In Corsican (taravese or sartinese variants) it translates into either Cameddu (masculine) or Camedda (feminine). In some cases, the disambiguation can be done on grammatical grounds alone. For example, ‘Camille était beau’ translates into Cameddu era beddu (Camille was beautiful), because the adjective is masculine. The same goes for ‘Camille était belle’, which translates straightforwardly into Camedda era bedda (Camille was beautiful), according to the gender of the adjective.
But the disambiguation can also be a hard case that relies only on semantic context. For instance, ‘Camille était pacifique’ can translate into either Cameddu era pacificu or Camedda era pacifica, depending on the context (which can be text or even an image…). Here the sentence cannot be translated on grammatical grounds alone, since ‘pacifique’ (peaceful) is gender-ambiguous: it can translate into either pacificu or pacifica.
The same goes for the French first name ‘Dominique’ (Dominic), which translates into either ‘Dumenicu’ (masculine) or ‘Dumenica’ (feminine). Hence, ‘Dominique était pacifique’ (Dominic was peaceful) can translate into either ‘Dumenicu era pacificu’ or ‘Dumenica era pacifica’, depending on the context.
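The two cases above (grammatical versus contextual disambiguation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny lexicon, the function name translate_name_was_adj and the context_gender parameter are hypothetical, and the sketch only handles the pattern ‘NAME était ADJECTIVE’.

```python
from typing import List, Optional

# Hypothetical mini-lexicon built from the examples in this post.
NAMES = {
    "Camille": {"m": "Cameddu", "f": "Camedda"},
    "Dominique": {"m": "Dumenicu", "f": "Dumenica"},
}
# French adjective -> Corsican form for each gender the French form allows.
ADJECTIVES = {
    "beau": {"m": "beddu"},
    "belle": {"f": "bedda"},
    "pacifique": {"m": "pacificu", "f": "pacifica"},  # gender-ambiguous in French
}

def translate_name_was_adj(name: str, adjective: str,
                           context_gender: Optional[str] = None) -> List[str]:
    """Return every Corsican candidate for 'NAME était ADJECTIVE'."""
    forms = ADJECTIVES[adjective]
    genders = [g for g in ("m", "f") if g in forms]
    # Context is only consulted when grammar leaves both genders open.
    if len(genders) > 1 and context_gender is not None:
        genders = [context_gender]
    return [f"{NAMES[name][g]} era {forms[g]}" for g in genders]

print(translate_name_was_adj("Camille", "beau"))
print(translate_name_was_adj("Camille", "pacifique"))
print(translate_name_was_adj("Dominique", "pacifique", context_gender="f"))
```

When the function returns two candidates, the sentence is a hard case: the gender has to come from outside the sentence itself.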
Let us consider superintelligence as it relates to machine translation. To fix ideas, we can propose a rough definition: a machine with the ability to translate with 99% accuracy or above from any one of the 8000 human languages to any other. These 8000 present-day languages include some 4000 or 5000 that are at risk of extinction before the end of the 21st century. The set could also relevantly include some extinct languages that are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. The most important appears to be the ability of the machine to self-improve its performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from what it reads or hears: not only single words, but also locutions (noun locutions, adjective locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its readings and enrich its database of grammatical sentence structures. It should be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures.
It seems superintelligence will be reached when the translation machine is able to perform all of this without any human help.
Also relevant to this discussion is the point, argued previously, that rule-based translation is better suited to translating endangered languages than statistical translation. From the above definition of superintelligent machine translation, it follows that rule-based translation is also best suited to it, since the definition massively includes endangered languages.
Let us speculate now on how superintelligent translation could be reached. We can mention here:
- a ‘quantitative scenario’: (i) acquire, first, an ability to translate very accurately, say, 100 languages; (ii) develop, second, the ability to self-improve; (iii) extend, third, the translation ability to the whole set of 8000 human languages.
- alternatively, there could be a ‘qualitative scenario’: (i) acquire, first, an ability to translate the 8000 languages somewhat accurately (the accuracy could vary from language to language, especially with rare endangered languages); (ii) suggest, second, improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc. that are verified and validated by humans; (iii) acquire, third, the ability to self-improve by reading texts or hearing conversations.
- a third alternative would be a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. It is our preferred scenario.
But we should provide more details on how these steps could be achieved.
To fix ideas, let us focus on the vocabulary self-improvement module: it allows the superintelligent translation machine to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, superintelligent machine translation (SMT, for short) should be able to translate it instantly into all the other languages and add it to its vocabulary database.
To give another example, a further module would be the locution self-improvement module: it allows the superintelligent translation machine to extend its knowledge of locutions in any language.
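As a very rough sketch of the first half of such a vocabulary module, the Python below detects words that are missing from a lexicon and queues them for later translation. Everything here is an assumption made for illustration: the function names, the dictionary-based lexicon and the pending queue are hypothetical, and the hard part (actually translating the new word) is deliberately left out.

```python
import re
from typing import Dict, List

def find_unknown_words(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Return the words of `text` missing from `lexicon`, in order of first appearance."""
    unknown: List[str] = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in lexicon and word not in unknown:
            unknown.append(word)
    return unknown

def learn_from_text(text: str, lexicon: Dict[str, str],
                    pending: Dict[str, str]) -> Dict[str, str]:
    """Queue each unknown word with the sentence it was first seen in."""
    for word in find_unknown_words(text, lexicon):
        pending.setdefault(word, text)
    return pending

# Toy Corsican -> French lexicon taken from the examples in this post.
lexicon = {"era": "était", "beddu": "beau"}
pending = learn_from_text("Camedda era bedda", lexicon, {})
print(pending)
```

In the qualitative scenario, the queued entries would be validated by humans; in the fully self-improving case, the machine would fill them in by itself.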
Here are some spelling differences between Corsican and Sardinian Gallurese that result from historical writing habits. These differences prevail even when the words are the same:
- ghj is replaced by gghj: acciaghju (corsu), acciagghju (gallurese), steel
- chj is replaced by cchj: finochju (corsu), finocchju (gallurese), fennel
- tonic accent is marked systematically in Gallurese whereas it is not compulsory in Corsican: apostulu (corsu), apòstulu (gallurese), apostle
- ccu is preferred in Gallurese where Corsican writes cqu: acquistu (corsu), accuistu (gallurese), purchase
- dd in Corsican taravese or sartinese is replaced with ddh in Gallurese: beddu bedda beddi (corsu), beddhu beddha beddhi (gallurese), beautiful
- final è in Corsican is replaced with é in Gallurese: sapè (corsu), sapé (gallurese), know
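Most of the correspondences listed above are regular enough to be written as rewrite rules. The Python sketch below is only an illustration built from the examples in the list: the function name to_gallurese_spelling is hypothetical, restricting the ghj/chj doubling to the position after a vowel is an assumption, and the tonic accent is not handled because it cannot be derived from spelling alone.

```python
import re

VOWELS = "aeiouàèìòù"

# (pattern, replacement) pairs, Corsican spelling -> Gallurese spelling.
# Assumption: ghj/chj are doubled only after a vowel, as in the examples.
RULES = [
    (re.compile(rf"(?<=[{VOWELS}])ghj"), "gghj"),
    (re.compile(rf"(?<=[{VOWELS}])chj"), "cchj"),
    (re.compile(r"cqu"), "ccu"),
    (re.compile(r"dd(?!h)"), "ddh"),
    (re.compile(r"è$"), "é"),
]

def to_gallurese_spelling(word: str) -> str:
    """Apply each spelling rule in turn to a single lowercase word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word

for corsican in ["acciaghju", "finochju", "acquistu", "beddu", "sapè"]:
    print(corsican, "->", to_gallurese_spelling(corsican))
```

The lookbehind and lookahead conditions make the rules safe to apply twice: a word already in Gallurese spelling is left unchanged.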
What are the conditions for a given endangered language to be a candidate for rule-based machine translation? Some requirements are in order. There is notably a need for:
– a dictionary: some specialized lexicons are useful too
– a list of locutions and their translation: to be more accurate, what is needed are noun locutions, adjective locutions, adverbial locutions, verbal locutions and their translations in another language.
– a detailed grammar (in any language): ideally, the grammar should be very detailed, mentioning notably irregular verbs, noun plurals, etc. Subjunctive and conditional tenses must also be accurately described.
– most importantly: a description of the main variants of the language and their differences. This is needed to handle what we can call the ‘variant problem’ (we shall say a bit more about this later): as an effect of diversity, endangered languages are often polynomic and come with variants. But a translation must be coherent, and a mix of several variants is not acceptable as a translation.
Let us mention that an endangered language is commonly associated with another language, the two being in a diglossic relationship. For example, the Corsican language is associated with French. So we consider the French-Corsican pair, and what is relevant is French-Corsican translation. For the Sardinian Gallurese language (‘gaddhuresu’), the relevant pair is Italian-Gallurese.
Quandu da la forza à la raghjoni cuntrasta
Tandu vinci la forza è la raghjoni ùn basta.
This is a rare Corsican proverb. In French, literally: “Lorsque la force et la raison s’opposent, alors la force gagne car la raison ne suffit pas” (When strength and reason are opposed, strength wins because reason is not enough).
But it seems better translated into French as: “La raison du plus fort est toujours la meilleure.” (Jean de La Fontaine). Literally: the reason of the strongest is always the best. This is semantically equivalent to: might makes right. This is the first line of La Fontaine’s fable The Wolf and the Lamb. It can be compared with Aesop, from whom the fable originates; the conclusion of the story, in Aesop’s terms, was: this fable shows that with people determined to do evil, the most righteous defense remains without effect.
This rare Corsican proverb is a small wonder (heard from people of Laretu di Tallà). The proverb is in verse, with the main rhyme cuntrasta/basta, but there is also a secondary rhyme at the beginning of the lines: Quandu/Tandu.
More generally, this raises the problem of equivalence of meaning and of the translation of proverbs from one language to another. If one considers that Quandu da la forza à la raghjoni cuntrasta Tandu vinci la forza è la raghjoni ùn basta is semantically equivalent to “La raison du plus fort est toujours la meilleure” in French, it is somewhat surprising, since there are several differences between the two versions:
- The Corsican proverb is in verse while the French version is in prose
- The phrase is longer in Corsican, and contains more words than the French version (the English version being even more concise but sense-preserving)
Here are a few suggestions on how rule-based and statistical machine translation can help each other:
(This is a follow-up to the previous post)
- to begin with, rule-based and statistical machine translation are often contrasted and compared: it would be oversimplifying to conclude that one is better than the other. From a more objective standpoint, let us consider that each method has its strengths and weaknesses. Let us investigate how one could make them collaborate in order to add up their respective strengths
- in the case of an endangered language, the lack of good-quality corpora has been pointed out. But one way for rule-based and statistical machine translation to collaborate would be to use rule-based translation to build a better-quality corpus for statistical machine translation
- suppose we begin with statistical machine translation software that scores 50% on average for French to Corsican translation
- let us sketch the process of creating these better corpora: let us take the example of the French-Corsican diglossic pair (the Corsican language being considered by UNESCO as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to say it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus to use with statistical machine translation.
- let us sketch now the different steps of this collaborative process: (i) create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at 90% on average, then the corpus would be on average 90% reliable. With appropriate training, statistical MT should now reach, say, 80% on average (to be compared with the previous 50% performance)
(ii) from this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, French-English, etc. corpora of excellent quality already exist. The performance gain should then extend to these other language pairs.
- with the help of this process, we are finally in a position to combine and add up the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even when corpora are lacking; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (One hand washes the other).
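Steps (i) and (ii) above can be sketched as two small Python functions. This is a toy illustration, not an actual pipeline: build_corpus and pivot_corpus are hypothetical names, the rule-based translator is stood in for by a lookup table, and the Italian sentence is an illustrative placeholder for a real French-Italian corpus.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]

def build_corpus(french: Iterable[str], translate: Callable[[str], str]) -> List[Pair]:
    """Step (i): run a rule-based translator over French sentences."""
    return [(sentence, translate(sentence)) for sentence in french]

def pivot_corpus(fr_co: List[Pair], fr_it: List[Pair]) -> List[Pair]:
    """Step (ii): join two corpora on their shared French side."""
    it_by_fr = dict(fr_it)
    return [(it_by_fr[fr], co) for fr, co in fr_co if fr in it_by_fr]

# Stand-in for the rule-based French -> Corsican translator (assumption).
TOY_RULES = {
    "Camille était beau": "Cameddu era beddu",
    "Camille était belle": "Camedda era bedda",
}
fr_co = build_corpus(list(TOY_RULES), TOY_RULES.__getitem__)

# Illustrative French-Italian pair (not taken from a real corpus).
fr_it = [("Camille était beau", "Camillo era bello")]

it_co = pivot_corpus(fr_co, fr_it)
print(it_co)
```

Only sentences present in both corpora survive the join, so the size of the pivoted corpus depends on how much French text the two sides share.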
French to Corsican: accuracy on a test sample from French Wikipedia currently amounts to 93% on average. Below is a rough typology of the remaining errors (presumably an average of 95% should be attainable by correcting the errors tagged ‘easy’):
- unknown vocabulary: 50% (easy)
- basic disambiguation: 15% (easy)
- erroneous agreement (relates to (i) words that are masculine in French and feminine in Corsican; and (ii) words that are feminine in French and masculine in Corsican): 5% (medium difficulty)
- inadequate locution: 10% (medium difficulty or hard)
- false positives: 5% (medium difficulty or hard)
- semantic disambiguation: 5% (hard). For example, disambiguating French ‘échecs’ = fiaschi/scacchi (failures/chess)
- specific grammatical case: 2% (hard)
- word reference error: 2% (hard)
- unknown, unclassified: 6% (hard)
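The 95% estimate can be checked with a little arithmetic. The sketch below assumes that each percentage in the list is a share of the remaining 7% of errors, and that fixing one category introduces no new errors; the function name projected_accuracy is hypothetical.

```python
ACCURACY = 93.0  # current average accuracy, in percent

# category -> (share of remaining errors in percent, difficulty tag)
ERROR_SHARES = {
    "unknown vocabulary": (50, "easy"),
    "basic disambiguation": (15, "easy"),
    "erroneous agreement": (5, "medium"),
    "inadequate locution": (10, "medium or hard"),
    "false positives": (5, "medium or hard"),
    "semantic disambiguation": (5, "hard"),
    "specific grammatical case": (2, "hard"),
    "word reference error": (2, "hard"),
    "unknown, unclassified": (6, "hard"),
}

def projected_accuracy(accuracy, shares, fixed_tags):
    """Accuracy reached if every category with a tag in `fixed_tags` is fully fixed."""
    error_rate = 100.0 - accuracy
    fixed_share = sum(share for share, tag in shares.values() if tag in fixed_tags)
    return accuracy + error_rate * fixed_share / 100.0

# The shares should cover all remaining errors.
assert sum(share for share, _ in ERROR_SHARES.values()) == 100

print(round(projected_accuracy(ACCURACY, ERROR_SHARES, {"easy"}), 2))
```

Under these assumptions the ‘easy’ categories account for 65% of the 7 points of error, i.e. about 4.5 points, so 95% would already be reached with roughly half of the easy errors corrected.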
Some improvements made to French to Italian translation:
- fixed several contractions (della, dello, …)
- the nice thing is that semantic disambiguation is working: ‘échecs’ = fallimenti/scacchi (failures/chess) and translates properly into scacchi
Now testing French to Italian translation: it is the very first draft. A rough 80%. A lot of things to fix.