Interesting case of first name disambiguation

Here is an interesting case of first name disambiguation for machine translation. Consider the French first name ‘Camille’, which can be either masculine or feminine. In Corsican (taravese or sartinese variants) it translates either into Cameddu (masculine) or Camedda (feminine). In some cases, the disambiguation can be made on grammatical grounds alone. For example, ‘Camille était beau’ translates into Cameddu era beddu (Camille was handsome), because the adjective ‘beau’ is masculine. Likewise, ‘Camille était belle’ translates straightforwardly into Camedda era bedda (Camille was beautiful), because the adjective ‘belle’ is feminine.

In other cases, however, the disambiguation is hard and relies only on semantic context. For instance, ‘Camille était pacifique’ can translate either into Cameddu era pacificu or into Camedda era pacifica, depending on the context (which can be text or even an image…). It cannot be translated on grammatical grounds alone, since ‘pacifique’ (peaceful) is gender-ambiguous: it can translate either into pacificu or pacifica.

The same goes for the French first name ‘Dominique’ (Dominic), which translates either into ‘Dumenicu’ (masculine) or ‘Dumenica’ (feminine). Hence, ‘Dominique était pacifique’ (Dominic was peaceful) can translate either into ‘Dumenicu era pacificu’ or into ‘Dumenica era pacifica’, depending on the context.
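The two-step reasoning above (grammar first, context second) can be sketched in a few lines of Python. This is only a minimal illustration, not an actual translation engine: the tables hold just the forms quoted in this post, the function name `translate` and the `context_gender` parameter are hypothetical, and only the sentence pattern ‘<name> était <adjective>’ is handled.

```python
# Corsican (taravese/sartinese) forms taken from the examples in this post.
NAMES = {
    "Camille": {"m": "Cameddu", "f": "Camedda"},
    "Dominique": {"m": "Dumenicu", "f": "Dumenica"},
}

# Each French adjective lists only the genders its form is compatible with.
ADJECTIVES = {
    "beau": {"m": "beddu"},
    "belle": {"f": "bedda"},
    "pacifique": {"m": "pacificu", "f": "pacifica"},  # gender-ambiguous
}


def translate(sentence, context_gender=None):
    """Return every Corsican candidate for '<name> était <adjective>'."""
    name, verb, adjective = sentence.split()
    if verb != "était":
        raise ValueError("only '<name> était <adjective>' is handled")
    forms = ADJECTIVES[adjective]
    # Grammatical step: keep the genders allowed by the adjective form.
    genders = [g for g in ("m", "f") if g in forms]
    # Semantic step: a hint from the context (hypothetical input) narrows them.
    if context_gender is not None:
        genders = [g for g in genders if g == context_gender]
    return [f"{NAMES[name][g]} era {forms[g]}" for g in genders]


print(translate("Camille était beau"))
print(translate("Camille était pacifique"))
print(translate("Dominique était pacifique", context_gender="f"))
```

When the function returns two candidates, grammar alone has not settled the question, and the choice has to come from the surrounding text (or image).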


Superintelligent machine translation

Let us consider superintelligence as it relates to machine translation. To fix ideas, we can propose a rough definition: a machine able to translate with 99% accuracy or above from any one of the 8000 human languages into any other. It seems relevant here to count all of the present 8000 human languages, including some 4000 or 5000 languages which are at risk of extinction before the end of the 21st century. The definition could also relevantly include some extinct languages which are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. The most important appears to be the ability of the machine to self-improve its performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from what it reads or hears: not only single words, but also locutions (noun locutions, adjective locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its readings and enrich its database of grammatical sentence structures. It should also be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures.
It seems that superintelligence will be reached when the superintelligent translation machine is able to perform all of this without any human help.

Also relevant in this discussion is the fact, previously argued, that rule-based translation is better suited to the translation of endangered languages than statistical translation. From the above definition of superintelligent machine translation, it follows that rule-based translation is also best suited to it, since the definition covers a very large number of endangered languages.

Let us now speculate on how the path to superintelligent translation could unfold. We can mention here:

  • a ‘quantitative scenario’: (i) acquire, first, an ability to translate very accurately, say, 100 languages; (ii) develop, second, the ability to self-improve; (iii) extend, third, the translation ability to the whole set of 8000 human languages.
  • alternatively, there could be a ‘qualitative scenario’: (i) acquire, first, an ability to translate somewhat accurately the 8000 languages (the accuracy could vary from language to language, especially with rare endangered languages); (ii) suggest, second, improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc. that are verified and validated by humans; (iii) acquire, third, the ability to self-improve by reading texts or hearing conversations.
  • but a third alternative would be a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. It is our preferred scenario.

But we should provide more details on how these steps could be achieved.

To fix ideas, let us focus on the vocabulary self-improvement module: it allows the superintelligent translation machine to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, superintelligent machine translation (SMT, for short) should be able to translate it instantly into all the other languages and add it to its vocabulary database.

To give another example, a further module would be the locution self-improvement module: it allows the superintelligent translation machine to extend its knowledge of locutions in any language.
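To make the idea of a vocabulary self-improvement module slightly more concrete, here is a minimal sketch in Python. It only shows the bookkeeping: spotting words that are not yet in the database and storing them once translations are available. Everything in it is an assumption made for illustration: the class `Vocabulary`, its methods, and above all the `translate` callback, which stands in for the hard part (actually producing translations) and is not specified here. Lowercasing every word is also a crude simplification.

```python
import re


class Vocabulary:
    """Toy vocabulary database: word -> {language code: translation}."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.pending = []  # new words seen but not yet translated

    def read(self, text):
        """Scan a text and return the words never seen before."""
        found = []
        # [^\W\d_]+ matches runs of letters, accented ones included.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if word not in self.entries and word not in self.pending:
                self.pending.append(word)
                found.append(word)
        return found

    def learn(self, translate):
        """Store pending words using translate(word) -> dict (hypothetical)."""
        for word in self.pending:
            self.entries[word] = translate(word)
        self.pending = []


vocab = Vocabulary({"era": {"fr": "était"}})
print(vocab.read("Camedda era pacifica"))
# Placeholder callback: a real module would return actual translations.
vocab.learn(lambda word: {})
print(sorted(vocab.entries))
```

A locution module could follow the same pattern, with sequences of words instead of single words as keys.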



Writing differences between Corsican and Gallurese

Here are some writing differences between Corsican and Gallurese (spoken in Sardinia) that result from historical writing habits. These differences hold even when the words themselves are the same:

  • ghj is replaced by gghj: acciaghju (Corsican), acciagghju (Gallurese), steel
  • chj is replaced by cchj: finochju (Corsican), finocchju (Gallurese), fennel
  • the tonic accent is marked systematically in Gallurese, whereas it is not compulsory in Corsican: apostulu (Corsican), apòstulu (Gallurese), apostle
  • cc is preferred in Gallurese where Corsican has cq: acquistu (Corsican), accuistu (Gallurese), purchase
  • dd in Corsican taravese or sartinese is replaced with ddh in Gallurese: beddu bedda beddi (Corsican), beddhu beddha beddhi (Gallurese), beautiful
  • final è in Corsican is replaced with é in Gallurese: sapè (Corsican), sapé (Gallurese), to know
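
Most of the correspondences listed above are mechanical, so they can be sketched as a small set of rewrite rules. The Python below is a minimal illustration under stated assumptions: the function name `to_gallurese_spelling` is hypothetical, the rules are applied blindly wherever the pattern occurs (the examples above only show them inside or at the end of a word), and the tonic accent is left out because placing it requires knowing where the stress falls.

```python
import re

# (pattern, replacement) pairs; the lookarounds keep a rule from
# firing on a word that is already in Gallurese spelling.
RULES = [
    (r"(?<!g)ghj", "gghj"),  # acciaghju -> acciagghju
    (r"(?<!c)chj", "cchj"),  # finochju  -> finocchju
    (r"cq", "cc"),           # acquistu  -> accuistu
    (r"dd(?!h)", "ddh"),     # beddu     -> beddhu
    (r"è$", "é"),            # sapè      -> sapé (word-final only)
]


def to_gallurese_spelling(word):
    """Apply the spelling correspondences to a single Corsican word."""
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word


for corsican in ["acciaghju", "finochju", "acquistu", "beddu", "sapè"]:
    print(corsican, "->", to_gallurese_spelling(corsican))
```

Such rules only change the spelling; they do not turn a Corsican word into a Gallurese one when the two languages use different words.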

What are the conditions for a given endangered language to be a candidate for rule-based machine translation?

For a given endangered language to be a candidate for rule-based machine translation, several resources are required. There is notably a need for:

– a dictionary: some specialized lexicons are useful too
– a list of locutions and their translations: more precisely, what is needed are noun locutions, adjective locutions, adverbial locutions and verbal locutions, together with their translations in the other language.
– a detailed grammar (in any language): ideally, the grammar should be very detailed, covering notably irregular verbs, noun plurals, etc. The subjunctive and the conditional must also be accurately described.
– most importantly: a description of the main variants of the language and their differences. This is needed to handle what we can call the ‘variant problem’ (we shall say a bit more about this later): as a result of their diversity, endangered languages are often polynomic and come with variants. But a translation must be coherent, and a mix of several variants is not acceptable as a translation.
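
One simple way to respect the ‘variant problem’ in a rule-based system is to tag every lexicon entry with its variant and to refuse to fall back on another variant when a form is missing. The Python below is a minimal sketch of that idea only. The names `LEXICON`, `VariantError` and `translate_words` are hypothetical, and the entries are illustrative: apart from beddu, the forms and the ‘cismuntincu’ label are assumptions for the example and should be checked against a real dictionary.

```python
# French word -> {variant: Corsican form}. Illustrative entries only.
LEXICON = {
    "cheval": {"cismuntincu": "cavallu", "taravese": "cavaddu"},
    "beau": {"cismuntincu": "bellu", "taravese": "beddu"},
}


class VariantError(LookupError):
    """Raised when a word has no form in the requested variant."""


def translate_words(words, variant):
    """Translate word by word, using one variant and never mixing."""
    output = []
    for word in words:
        forms = LEXICON[word]
        if variant not in forms:
            # Falling back to another variant would produce a mixed text.
            raise VariantError(f"no {variant} form for {word!r}")
        output.append(forms[variant])
    return output


print(translate_words(["cheval", "beau"], "taravese"))
print(translate_words(["cheval", "beau"], "cismuntincu"))
```

The design choice here is to fail loudly: a missing form points to a gap in the description of the variant, which is exactly the resource the last requirement asks for.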

Let us mention that an endangered language is commonly associated with another language, with which it stands in a diglossic relationship. To take an example, the Corsican language is associated with French, so the relevant pair is French-Corsican. If we consider the Gallurese language of Sardinia (‘gaddhuresu’), the relevant pair is Italian-Gallurese.


Quandu da la forza à la raghjoni cuntrasta Tandu vinci la forza è la raghjoni ùn basta

Quandu da la forza à la raghjoni cuntrasta
Tandu vinci la forza è la raghjoni ùn basta.

This is a rare Corsican proverb. In French, literally: “Lorsque la force et la raison s’opposent, alors la force gagne car la raison ne suffit pas” (When strength and reason are opposed, then strength wins because reason is not enough).

But it seems better translated into French as: “La raison du plus fort est toujours la meilleure.” (Jean de La Fontaine). Literally: the reason of the strongest is always the best. This is semantically equivalent to: might makes right. It is the first line of La Fontaine’s fable The Wolf and the Lamb. Compare with Aesop, from whom this fable originates; the conclusion of the story, in Aesop’s terms, was: this fable shows that with people determined to do wrong, even the most just defense remains without effect.

This rare Corsican proverb is a small wonder (heard from people of Laretu di Tallà). The proverb is in verse, with the main rhyme cuntrasta/basta, but there is also a secondary rhyme at the beginning of the lines: Quandu/Tandu.

More generally, this raises the problem of equivalence of meaning and of the translation of proverbs from one language to another. If one considers that Quandu da la forza à la raghjoni cuntrasta Tandu vinci la forza è la raghjoni ùn basta is semantically equivalent to “La raison du plus fort est toujours la meilleure” in French, this is somewhat surprising, since there are several differences between the two versions:

  • The Corsican proverb is in verse while the French version is in prose
  • The phrase is longer in Corsican, and contains more words than the French version (the English version being even more concise but sense-preserving)