Why rule-based translation is (presently) best suited to endangered languages

Here are some arguments in favor of rule-based translation for the machine translation of endangered languages (the question also relates to the philosophy of language policy):
  • at present, no reliable parallel corpus exists between the given endangered language and other languages
  • endangered languages are often polynomic, i.e. several main variants of the language coexist. It is important both to preserve these variants, since (i) they are a feature of diversity and (ii) they are inherent to the given endangered language, and to distinguish between them. In addition, a translation should never mix these variants. This also complicates the building of a proper corpus, since the scarce existing material is made up of different variants of the language.
  • in the absence of an adequate corpus, statistical translation cannot provide quality translation of the given endangered language (whereas it succeeds with widely used languages for which excellent corpora are available). Arguably, providing low-quality translation (although the attempt is commendable) could harm endangered languages, which are by definition vulnerable, since people could use and spread the resulting low-quality translations. On those grounds, given this vulnerability, it could be argued that a minimum translation quality of 80% is needed for any pair involving an endangered language.
  • in addition, endangered languages are usually in a ‘diglossic’ relationship with another language: the priority is to provide translation between the two languages of this pair

(to be continued)


Rough typology of remaining errors

French to Corsican: performance on the French Wikipedia test sample currently amounts to 93% on average. Below is a rough typology of the remaining errors (presumably an average performance of 95% should be attainable by correcting the errors tagged ‘easy’):

 

  • unknown vocabulary: 50% (easy)
  • basic disambiguation: 15% (easy)
  • erroneous agreement (concerns (i) words that are masculine in French and feminine in Corsican; and (ii) words that are feminine in French and masculine in Corsican): 5% (medium difficulty)
  • inadequate locution: 10% (medium difficulty or hard)
  • false positives: 5% (medium difficulty or hard)
  • semantic disambiguation: 5% (hard). For example, disambiguating French ‘échecs’ = fiaschi/scacchi (failures/chess)
  • specific grammatical case: 2% (hard)
  • word reference error: 2% (hard)
  • unknown, unclassified: 6% (hard)
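
The 95% target can be checked against the typology with a little arithmetic. The sketch below is only an illustration: the figures come from the list above, while the function name `projected_accuracy` and the assumption that fixing a category introduces no new errors are mine. Under that assumption, fixing every ‘easy’ error gives an upper bound of about 97.5%, so 95% corresponds to fixing a bit less than half of them.

```python
# Figures taken from the post: 93% current accuracy, and each error
# category's share of the remaining errors with its difficulty tag.
CURRENT_ACCURACY = 93.0

ERROR_SHARES = {
    "unknown vocabulary": (50, "easy"),
    "basic disambiguation": (15, "easy"),
    "erroneous agreement": (5, "medium"),
    "inadequate locution": (10, "medium/hard"),
    "false positives": (5, "medium/hard"),
    "semantic disambiguation": (5, "hard"),
    "specific grammatical case": (2, "hard"),
    "word reference error": (2, "hard"),
    "unknown, unclassified": (6, "hard"),
}


def projected_accuracy(current, shares, fixed_tag, fix_rate=1.0):
    """Accuracy (in %) after fixing `fix_rate` of the errors tagged `fixed_tag`.

    Assumes (hypothetically) that fixes never introduce new errors.
    """
    remaining = 100.0 - current
    fixable = sum(pct for pct, tag in shares.values() if tag == fixed_tag)
    return current + remaining * (fixable / 100.0) * fix_rate


# Sanity check: the shares of the typology should add up to 100.
total_share = sum(pct for pct, _ in ERROR_SHARES.values())

# Upper bound if every 'easy' error were fixed.
upper_bound = projected_accuracy(CURRENT_ACCURACY, ERROR_SHARES, "easy")

print(total_share)
print(round(upper_bound, 2))
```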

Enhancing French to Italian translation

Some improvements made to French to Italian translation:

  • fixed several contractions (della, dello, …)
  • the nice thing is that semantic disambiguation is working: ‘échecs’ is ambiguous between fallimenti and scacchi (failures/chess), and in the chess sense it now translates properly into scacchi
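
As an illustration only, here is one way a rule of this kind could be sketched. The post does not describe how the engine actually decides, so the cue-word list `CHESS_CUES` and the function `translate_echecs` are hypothetical; only the two Italian target forms come from the post.

```python
# Hypothetical cue words suggesting the 'chess' sense of French 'échecs'.
CHESS_CUES = {"jeu", "joueur", "joueurs", "partie", "tournoi",
              "champion", "échiquier"}

# Target forms quoted in the post for the French-to-Italian pair.
SENSES = {"chess": "scacchi", "failures": "fallimenti"}


def translate_echecs(sentence):
    """Pick a rendering of 'échecs' from cue words found in the sentence."""
    # Lowercase each token and strip surrounding punctuation.
    words = {w.strip(".,;:!?()'\"«»").lower() for w in sentence.split()}
    # Default to 'failures' unless a chess cue word is present.
    sense = "chess" if words & CHESS_CUES else "failures"
    return SENSES[sense]


print(translate_echecs("Il est champion du monde d'échecs."))
print(translate_echecs("Après plusieurs échecs, il abandonne."))
```

A cue-word list is about the simplest possible design; a real rule set would presumably also look at word order and grammatical context.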

Very first draft on French to Italian

Now testing French to Italian translation: this is the very first draft. Accuracy is roughly 80%, with a lot of things to fix.


Improvement in grammatical structures: another 100% hit

Progress on grammatical structures: some improvements to be included in the future version 1.2 yield another Feigenbaum hit: 100%. In the present case, the Corsican language variety is taravese.

 


Percentage of ambiguous words in French sentence (from French to Corsican translation perspective)

What is the average percentage of ambiguous words in a French sentence (from a French to Corsican translation perspective)? In the above example, this percentage amounts to 20/99 words, i.e. approximately 20%. Not all semantic ambiguities are taken into account here, so the real average should amount to at least 25%.

  • le = u/lu: definite article or pronoun (the/it)
  • est = livanti/hè: masculine noun or verb (east/is)
  • culminant = culminanti/culminendu: adjective or gerund
  • émerge = emerghju/emerghji: first person or third person verb
  • commence = principiu/principia: first person or third person verb (begin/begins)
  • cesse = cessu/cessa: first person or third person verb (cease/ceases)
  • volcanique = vulcanicu/vulcanica: adjective, masculine or feminine (volcanic; unambiguous from a French to English translation perspective)
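
The ratio above can be computed mechanically once the ambiguous words are listed. The following is a minimal sketch, assuming a lexicon limited to the seven words listed above; the function name `ambiguity_ratio` and the short sample sentence are hypothetical and are not the 99-word example of the post.

```python
# French word -> candidate Corsican renderings, as listed in the post.
AMBIGUOUS = {
    "le": ("u", "lu"),
    "est": ("livanti", "hè"),
    "culminant": ("culminanti", "culminendu"),
    "émerge": ("emerghju", "emerghji"),
    "commence": ("principiu", "principia"),
    "cesse": ("cessu", "cessa"),
    "volcanique": ("vulcanicu", "vulcanica"),
}


def ambiguity_ratio(tokens, lexicon=AMBIGUOUS):
    """Share of tokens that have more than one candidate rendering."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return hits / len(tokens)


# Hypothetical sample sentence, much shorter than the one in the post.
sample = "le point culminant est volcanique".split()
print(ambiguity_ratio(sample))

# The figure quoted in the post: 20 ambiguous words out of 99.
print(round(20 / 99, 3))
```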

Disambiguation of two consecutive ambiguous words: ‘plusieurs mois’

 

Testing the improved disambiguation engine. This is a special case: the disambiguation of two consecutive ambiguous words. French ‘au terme de plusieurs mois’ translates into à u capu di parechji mesa (at the end of several months) in Corsican (taravese variant). In this case, both ‘plusieurs’ and ‘mois’ are ambiguous:

  • ‘plusieurs’ (several) as an indefinite plural pronoun can be either masculine or feminine.
  • ‘mois’ as a noun can be either singular (month, mesi) or plural (months, mesa: the plural with a final –a is reminiscent of the Latin neuter)

There is only one error in the above translation: da latu should be replaced by da cantu.
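
One way to picture how the two ambiguities of ‘plusieurs mois’ resolve each other is to intersect the grammatical features each word allows. This is a sketch under my own assumptions, not the engine's actual method: the tables `FEATURES` and `FORMS` and the functions `agree` and `translate_pair` are hypothetical, and only the forms quoted in the post are included.

```python
# Feature values each ambiguous French word allows, following the post:
# 'plusieurs' is always plural but either gender; 'mois' is masculine
# but either number.
FEATURES = {
    "plusieurs": {"gender": {"m", "f"}, "number": {"pl"}},
    "mois": {"gender": {"m"}, "number": {"sg", "pl"}},
}

# Surface forms quoted in the post (taravese variant). The feminine form of
# 'plusieurs' is not given there, so it is deliberately left out.
FORMS = {
    ("plusieurs", "m", "pl"): "parechji",
    ("mois", "m", "sg"): "mesi",
    ("mois", "m", "pl"): "mesa",
}


def agree(first, second):
    """Intersect the feature sets of two adjacent words that must agree."""
    resolved = {}
    for feat in ("gender", "number"):
        common = FEATURES[first][feat] & FEATURES[second][feat]
        if len(common) != 1:
            raise ValueError(f"cannot resolve {feat} for {first} {second}")
        resolved[feat] = next(iter(common))
    return resolved


def translate_pair(first, second):
    """Translate both words using the features they jointly resolve to."""
    feats = agree(first, second)
    key = (feats["gender"], feats["number"])
    return f"{FORMS[(first, *key)]} {FORMS[(second, *key)]}"


print(translate_pair("plusieurs", "mois"))
```

Each word removes the other's ambiguity: ‘mois’ fixes the gender of ‘plusieurs’, and ‘plusieurs’ fixes the number of ‘mois’.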

 


Disambiguation of ‘vie’

We face here a special case of disambiguation: ‘un général byzantin du vie siècle’ (a Byzantine general of the sixth century) should translate as: un generali bizantinu di u 6esimu seculu. French ‘vie’ is ambiguous between vita and 6esimu or VIesimu (life/sixth). Indeed, lowercase ‘vi’ is sometimes used for the Roman numeral ‘VI’. By contrast, the capitalized form ‘VIe’ is unambiguous.
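
A contextual rule for this case might look like the sketch below. It rests on an assumption of mine, namely that ‘vie’ immediately followed by ‘siècle’ is to be read as an ordinal; the pattern `ORDINAL_CONTEXT` and the function `translate_vie` are hypothetical names, and the target forms are those quoted above.

```python
import re

# 'vie' directly followed by 'siècle' is read as the ordinal VIe (sixth);
# otherwise it is read as the noun 'life'.
ORDINAL_CONTEXT = re.compile(r"\bvie\s+siècle\b", re.IGNORECASE)


def translate_vie(phrase):
    """Return the Corsican rendering of 'vie' for the given French phrase."""
    return "6esimu" if ORDINAL_CONTEXT.search(phrase) else "vita"


print(translate_vie("un général byzantin du vie siècle"))
print(translate_vie("la vie est belle"))
```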

This also raises an interesting and more general issue: are ambiguities a weakness for a language? Is it better for a language to have few ambiguities?


Version 1.1 is available

Okchakko Traduttori: version 1.1 is available. Here is what is new:

– translates from French into the three main varieties of the Corsican language: cismuntincu, sartinesu, taravesu
– improvements to the help file
– improvements to elision
– expanded vocabulary


Light version 1.1 is available

Light version 1.1 is available. New features:

  • translates from French to one of the three main variants of the Corsican language: cismuntincu, sartinesu, taravesu
  • some improvements made to the help file
  • improvements on elision
  • additional vocabulary

 

 
