Today we are conducting the sixth open test. The result is 1 – (4/123) = 96.7%. There are three errors.
The current, provisional average is: (98.61 + 93.75 + 93.93 + 95.34 + 99.42 + 96.74)/6 = 96.29%.
Today we are conducting the fifth open test. The result is 1 – (1/175) = 99.42%. There is only one error (level of difficulty: medium) due to the incorrect translation of the proper name: ‘del Monte’. Otherwise, several sentences are fully translated correctly.
The current, provisional average is: (98.61 + 93.75 + 93.93 + 95.34 + 99.42)/5 = 96.21%.
Today we are conducting the fourth open test. The result is 1 – (8/172) = 95.34%. There are errors (level of difficulty: easy) due to lack of vocabulary (‘escaladées’, ‘valaisan’). There are also some errors (level of difficulty: medium) related to proper nouns (‘Valais’).
The current, provisional average is: (98.61 + 93.75 + 93.93 + 95.34)/4 = 95.40%.
Today we are conducting the third open test. The result is 1 – (8/132) = 93.93%. There are several errors (level of difficulty: easy) due to lack of vocabulary (‘alunissage’, ‘plaqué’). There is also an error (level of difficulty: medium) concerning the grammatical disambiguation of ‘émis’ (singular or plural). Finally, there are 3 occurrences of an error (level of difficulty: hard) concerning the semantic disambiguation of ‘argent’ (silver/money).
The current, provisional average is: (98.61 + 93.75 + 93.93)/3 = 95.43%.
Today we are conducting the second open test. The result is 1 – (7/112) = 93.75%. There are several errors (level of difficulty: easy) due to lack of vocabulary (‘même’, ‘raillerie’, ‘élocution’). There is also an error (level of difficulty: medium) concerning the translation of ‘de’ (of) as a partitive article (‘de nombreuses critiques’, many criticisms). Finally, there is an error (level of difficulty: hard) concerning the semantic disambiguation of ‘tirées’ (taken from/fired).
The current, provisional average is: (93.75 + 98.61)/2 = 96.18%.
We will evaluate the translator’s current performance, using a series of seven open tests. The aim is to translate the first 100 words of the article of the day from the French Wikipedia into Corsican for seven consecutive days. Today, the first day, the test scores 1 – (2/144) = 98.61%. The translation error concerning ‘utilisent’ consists in the use of the present tense ‘apradani’ instead of the subjunctive ‘apradini’.
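The scoring arithmetic used in these open tests can be checked with a short sketch. It assumes that scores and averages are truncated (not rounded) to two decimals, which is what the published figures suggest; the function names are illustrative, and the error and word counts are the ones reported for the six tests so far.

```python
def score_bp(errors: int, words: int) -> int:
    """Score 1 - errors/words in hundredths of a percent, truncated."""
    return (words - errors) * 10000 // words


def running_averages(scores):
    """Truncated mean of the first n scores, for n = 1..len(scores)."""
    total, averages = 0, []
    for n, s in enumerate(scores, start=1):
        total += s
        averages.append(total // n)
    return averages


# (errors, words) for tests one to six, as reported in the posts
tests = [(2, 144), (7, 112), (8, 132), (8, 172), (1, 175), (4, 123)]
scores = [score_bp(e, w) for e, w in tests]
averages = running_averages(scores)

for day, (s, a) in enumerate(zip(scores, averages), start=1):
    print(f"test {day}: score {s / 100:.2f}%  average {a / 100:.2f}%")
```

Working in integer hundredths of a percent avoids floating-point surprises when truncating.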
There are now 500 users for the Traduttore corsu application for Android. There are also 200 users for the Traduttore corsu application for Windows.
Here are a certain number of palindromes, in each of the main variants of the Corsican language:
Let us briefly recall the problem: translating ‘I love you’ might sound trivial, but it’s not. In fact, ‘ti amu’ is not the best translation. The best translation is ‘ti tengu caru’ when addressed to a male person, or ‘ti tengu cara’ when addressed to a female person. Hence the proposed preliminary translation ‘ti tengu caru/cara’. Such a rough translation requires further disambiguation, but on what precise grounds?
Let us look at the issue from an analytical perspective. It appears that we need to assign a reference to the pronoun ‘te’ (you, ti). The latter could be identified according to the context, depending on whether the person ‘te’ refers to is male or female. At this stage, it appears that it is better to consider that the personal object pronoun has an inherent gender: masculine or feminine. This gender does not affect the pronoun itself which remains ‘te’ (you, ti) independently of the gender, but it does have an effect on the words that depend on it, i.e. the adjective caru/cara in Corsican, in the locution ti tengu caru/cara. The upshot is: in this case, ‘te’ (you, ti) is a personal object pronoun, masculine or feminine, whose inherent ambiguity can be solved according to the context.
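The idea of a pronoun carrying an inherent gender that only shows up on dependent words can be sketched as follows. This is a minimal illustration, not the translator's actual code: the class and function names are hypothetical, and the gender is assumed to be supplied by some context-resolution step that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectPronoun:
    form: str                     # surface form, identical for both genders
    gender: Optional[str] = None  # "m", "f", or None when the context is silent


def ti_tengu(pronoun: ObjectPronoun) -> str:
    """Render 'I love you'; only the adjective agrees with the pronoun's gender."""
    adjective = {"m": "caru", "f": "cara", None: "caru/cara"}[pronoun.gender]
    return f"{pronoun.form} tengu {adjective}"


print(ti_tengu(ObjectPronoun("ti", "f")))  # gender resolved from context
print(ti_tengu(ObjectPronoun("ti")))       # ambiguity left visible
```

When the context gives no clue, the sketch keeps the caru/cara alternative rather than guessing.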
If we were to update the priorities for language pairs to be achieved, from the point of view of endangered languages, the result would be as follows:
Pairs such as French to Gallurese, French to Sassarese, English to Gallurese, English to Sassarese and English to Sicilian do not have priority, as they can be handled through an intermediate pair. French to Gallurese, for instance, is done with the French to Italian pair (e.g. with DeepL) followed by the Italian to Gallurese pair, and so on.
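Chaining two pairs through an intermediate language amounts to composing two translation functions. The sketch below only shows that composition: the two engines are placeholders that tag the text instead of translating it, and all names are hypothetical.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot(first: Translator, second: Translator) -> Translator:
    """Build an A>C translator from an A>B and a B>C translator."""
    return lambda text: second(first(text))


# Placeholder engines: they only wrap the text so the chaining is visible.
def fr_to_it(text: str) -> str:
    return f"it[{text}]"


def it_to_gallurese(text: str) -> str:
    return f"gallurese[{text}]"


fr_to_gallurese = pivot(fr_to_it, it_to_gallurese)
print(fr_to_gallurese("bonjour"))
```

In practice, errors made by the first pair are passed on to the second, so the quality of the chained pair is bounded by both.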
Translating ‘I love you’ might sound trivial, but it’s not. In fact, ‘ti amu’ is not the best translation. The best translation is ‘ti tengu caru’ when addressed to a male person, or ‘ti tengu cara’ when addressed to a female person. Hence the proposed translation ‘ti tengu caru/cara’, whose (difficult) disambiguation must be done according to the context.
It is worth sketching a few ideas, in order to get some insight into this issue. First of all, let’s look at the problem synthetically. This underlines the problem inherent in the grammatical status of the sentence ‘je t’aime’ (I love you) in French or in English, as it is not known whether it is addressed to a male or a female person. If one were to assign a gender to this sentence, it would therefore be masculine or feminine, with an inherent ambiguity. Assigning in some way a gender – masculine or feminine – to a sentence may seem strange prima facie, but it could prove useful (to be confirmed). In this case, the gender associated with the sentence would be inherited from the pronoun ‘t’ (short for ‘te’), which remains ambiguous when the sentence ‘je t’aime’ (I love you, ti tengu caru/cara) is taken alone.
Second, let’s look at the issue from an analytical perspective. Another way to solve the problem could be to assign a reference to the pronoun ‘te’ (you). The latter could be identified according to the context. This sounds more promising and more in line with the well-known problem of pronoun resolution.
Let us add further reflections on the remaining 1% problem. As hinted at previously, the remaining 1% problem may only be solvable by general AI (GAI). Let us sketch in a series of posts what features are required for general AI in this context. One feature of GAI would be the ability to solve the ‘taxonomy optimization problem’. Let’s focus on defining it (very roughly, to begin with). Let us consider a given language, defined with a certain number of words, and a corpus of sentences (or a set of rules to define licit sentences in this language). In this context, the ‘taxonomy optimization problem’ is the question of deciding what is the simplest taxonomy, with its associated rules, that resolves the type ambiguities existing in this language. A GAI with this feature would notably be capable of defining the best taxonomy for resolving the type ambiguities existing within this language. And it is possible that such a feature of GAI would revolutionize grammar and our present grammatical taxonomy.
The Traduttore corsu application for Android now has more than a hundred users. Moving on…
Let’s take another look at polymorphic disambiguation. We shall consider the French word sequence ‘nombre de’. The translation into Corsican (the same goes for English and other languages) cannot always be the same, because ‘nombre de’ can be translated in two different ways. In the sequence ‘mais nombre de poissons sont longs’ (but many fish are long), ‘nombre de’ is an indefinite determiner: it translates as bon parechji (many). On the other hand, in the sequence ‘mais le nombre de poissons est supérieur à dix’ (but the number of fish is greater than ten), ‘nombre de’ is a common noun followed by the preposition ‘de’: it is translated by numaru di (number of). Statistical MT usually does better than human-like (rule-based) MT at polymorphic disambiguation (I ran a test on both sentences with DeepL and Google Translate, and both of them successfully solve the relevant polymorphic disambiguation), but it turns out that human-like (rule-based) MT is also capable of handling it.
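A rule of the kind involved here can be sketched in a few lines. The rule itself is an assumption made for illustration (a preceding determiner signals the noun reading), as are the function name and the small determiner list; it is not the rule set of the actual engine.

```python
# Illustrative sample of French determiners that can precede the noun 'nombre'
DETERMINERS = {"le", "un", "ce", "du", "au", "leur", "son"}


def analyse_nombre_de(tokens):
    """Return (reading, Corsican rendering) for each 'nombre de' in tokens."""
    results = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() == "nombre" and tokens[i + 1].lower() == "de":
            previous = tokens[i - 1].lower() if i > 0 else None
            if previous in DETERMINERS:
                # 'le nombre de ...': common noun followed by a preposition
                results.append(("noun+preposition", "numaru di"))
            else:
                # bare 'nombre de ...': indefinite determiner
                results.append(("indefinite determiner", "bon parechji"))
    return results


print(analyse_nombre_de("mais nombre de poissons sont longs".split()))
print(analyse_nombre_de("mais le nombre de poissons est supérieur à dix".split()))
```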
Let us comment on the remaining errors encountered in the above open test:
The result is 1 – (5/169) = 97.04%. Note that the ambiguous French word ‘partie’ (‘durant la première partie’, during the first part) is correctly disambiguated into parti (part), instead of partita (game, match).
It seems that an average result of 95% is currently being consolidated, and that an average result of 96% is a target that should be achievable within a year.
The analysis of the Wikipedia article of the day in French is interesting, in the sense that it sheds light on the skills that will be necessary for a machine translation system to achieve a 100% accurate translation. The error that appears here is characteristic and must probably be placed in the missing 1% to achieve 100% accuracy in the translation (the problem of the remaining 1%). The French sentence (in English: ‘Her father studied at the University of Oregon and then at Yale Law School’) has a definite article with elision: l’. The translation given (u/a, i.e. indeterminate between the masculine definite article u and the feminine definite article a) is not correct in that it fails to determine the gender – masculine or feminine – of Yale Law School, the name of an English-language school. In order to provide the correct translation, it is necessary to know how to translate Yale Law School into Corsican, and thus to determine that school is translated by scola, which is feminine. Therefore the correct translation should have been: po à a Yale Law School prima di ….
This finally shows that a translator capable of translating with 100% performance must be able (i) to determine the language of text parts that are written in a language other than the source language and (ii) to translate those text parts into the target language. In other words, the skills necessary to close the remaining 1% include: (i) the ability to determine the language of a subtext and (ii) the ability to translate a subtext from any language into the target language.
Presently, we can only conjecture that this ability to solve the remaining 1% requires artificial general intelligence (AGI). Providing concrete and detailed examples may help to confirm or disprove that hypothesis.
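As a rough illustration of step (ii), here is a sketch of how the article could be chosen once the embedded name is known to be English. Everything in it is hypothetical: the function name, the assumption that the head noun of an English name is its last word, and the one-entry lexicon (which only contains the school/scola example discussed above). The language identification of step (i) is not shown.

```python
# English head noun -> (Corsican noun, gender); single entry taken from the example
EN_TO_CO = {"school": ("scola", "f")}
ARTICLES = {"m": "u", "f": "a"}


def article_for_english_name(name: str) -> str:
    """Pick the Corsican definite article from the head noun of an English name."""
    head = name.split()[-1].lower()  # assumption: English names are head-final
    entry = EN_TO_CO.get(head)
    if entry is None:
        return "u/a"                 # gender unknown: keep the ambiguity
    _, gender = entry
    return ARTICLES[gender]


print(f"po à {article_for_english_name('Yale Law School')} Yale Law School")
```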
Let us expand the idea of two-sided (from the analytic/synthetic duality standpoint) grammatical analysis: consider, for example, ‘beaucoup et souvent’ (a lot and often) in the sentence ‘il mange beaucoup et souvent’ (he eats a lot and often). Analytically, ‘beaucoup et souvent’ is composed of an adverb (‘beaucoup’), a conjunction (‘et’) and another adverb (‘souvent’). But synthetically, ‘beaucoup et souvent’ is an adverb, the structure of which is ADVERB+CONJUNCTIONCORD+ADVERB, according to the meta-rule ADVERB = ADVERB+CONJUNCTIONCORD+ADVERB. In the same way, ‘beaucoup mais souvent’ (a lot but often) is also, from a synthetic point of view, an adverb. Analogously, ‘rarement ou souvent’ (rarely or often) is also an adverb, from a synthetic viewpoint. In the same way, ‘rarement voire jamais’ (rarely or even never) is also a synthetic adverb. This leads to considering ‘voire’ (even) as a coordinating conjunction.
Now it is patent that we can expand on that. As hinted at earlier, it seems that some progress in rule-based machine translation (we should better speak of, say, ‘human-like’ MT, since it mimics human reasoning) requires revolutionizing grammar.
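The meta-rule ADVERB = ADVERB+CONJUNCTIONCORD+ADVERB can be read as a reduction rule over tagged words. The sketch below applies it with a tiny hand-made lexicon; the lexicon, the OTHER tag and the function names are assumptions made for the example.

```python
ADVERBS = {"beaucoup", "souvent", "rarement", "jamais"}
COORDINATORS = {"et", "mais", "ou", "voire"}


def tag(word: str) -> str:
    if word in ADVERBS:
        return "ADVERB"
    if word in COORDINATORS:
        return "CONJUNCTIONCORD"
    return "OTHER"


def reduce_adverbs(words):
    """Apply ADVERB = ADVERB+CONJUNCTIONCORD+ADVERB until no match remains."""
    items = [(w, tag(w)) for w in words]
    i = 0
    while i + 2 < len(items):
        window = items[i:i + 3]
        if [t for _, t in window] == ["ADVERB", "CONJUNCTIONCORD", "ADVERB"]:
            phrase = " ".join(w for w, _ in window)
            items[i:i + 3] = [(phrase, "ADVERB")]  # synthetic adverb
            # stay at i: the new adverb may combine with what follows
        else:
            i += 1
    return items


print(reduce_adverbs("il mange beaucoup et souvent".split()))
```

The merged item keeps its constituent words, so the analytic view is still recoverable from the synthetic one.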
The application now changes its name on the Android Playstore, and becomes “Traduttore corsu”: the name is not very original, let’s face it, but at least it is easy to understand. “Traduttore corsu” is dedicated especially to the translation from French to Corsican. So we are leaving aside for the moment this beautiful word “okchakko” from the language of the Choctaw Indians.
To find the application Traduttore corsu on Google Play, you have to search with “traduttore_corsu”, because there is a known “bug” in Google Play that means that with “corsu” or “traduttore”, you cannot find the application.
Just powered up the new engine (a prototype, not yet transferred to the API which is used both by the current site translator and the Android application) and made a few tests: it works! Let us take an example with French ‘en fait’: ‘en fait’ (in fact, actually, difatti) from the viewpoint of two-sided grammar is synthetically an adverb, made up – analytically – of a preposition followed by a singular noun. ‘en fait’ is polymorphic in the sense that it may also be part of the prepositional locution ‘en fait de’ (in fact of, in fatti di). Alternatively, ‘en fait’ may also be a pronoun (‘en’, it, ni) followed by the present tense (‘fait’, faci) of the verb ‘faire’ (makes) in the third person singular. So, ‘en fait’ is highly ambiguous and context-sensitive.
As the above screenshot illustrates, the new engine handles adequately the three kinds of ‘en fait’. It could be kind of a breakthrough with regard to rule-based translation, since it is a well-known weakness of this type of MT implementation. Presumably, this progress on polymorphic disambiguation opens the path to some 95% or 96% scoring.
Closely related to my previous post on autonomous MT systems is the article reporting that researchers have developed a machine-learning system capable of deciphering lost languages.
Let us speculate about what an autonomous MT system could be. In the present state of MT, we either provide rules and a dictionary to the software (rule-based translation) or feed it with a corpus for a given pair of languages (statistical MT). But let us imagine that we could do otherwise and build an autonomous MT system. We provide the MT system with a corpus for a given source language. First, it analyses this language thoroughly. It begins by identifying single words. It then creates grammatical types and assigns them to the vocabulary. It also identifies locutions (adverbial, verbal, adjectival, etc.) and assigns them a grammatical type. The MT system also identifies prefixes and suffixes. It also computes elision rules, euphony rules, etc. for that source language.
Now the autonomous MT system should, second, do the same for the target language.
The MT system creates, third, a set of rules for translating the source language into the target one. For that purpose, the MT system could for example assign a structured reference to all these words and locutions. For instance, ‘oak’ in English refers to ‘quercus ilex’, ‘cat’ refers to ‘felis sylvestris’. For abstract entities, we presume it would not be a trivial task… Alternatively but not exclusively, it could use suffixes and exhibit morphing rules from the source language to the target one.
Is it feasible or pure speculation? It could be testable. Prima facie, this sounds like a different approach to AI than the classical one. It operates at a meta-level, since the MT system creates the rules and, in some respects, builds the software.
The classical divide with regard to MT separates statistical from rule-based MT. But this divide is not as clear-cut as one could think at first glance, for rule-based MT can operate statistically. Let us take an example, concerning the disambiguation of French ‘est’: it can be translated either as is or as east, depending on the context. Defining the rules for disambiguating ‘est’ can be somewhat complicated. A rule-based MT could then define a few rules that would cover 90% of the cases, and for the remaining 10%, it could apply a closure rule that translates ‘est’ into is unconditionally. Such a rule would be based on the statistical fact that most often, ‘est’ translates into is and not into east. Such a rule may succeed in most cases. As we see it, such a rule is statistical in essence. Hence the conclusion: the statistical/rule-based divide regarding MT is not as clear-cut as one could think prima facie. A disambiguation system for rule-based MT could be built with closure rules of this type, which would operate statistically.
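A closure rule of this kind is simply the last branch of a rule cascade. The sketch below makes that explicit; the two contextual rules, the word lists and the function name are hypothetical and only serve to show where the statistical default sits.

```python
def translate_est(previous_word):
    """Translate French 'est' given the preceding word.

    Returns (translation, origin) where origin is 'rule' or 'closure'.
    """
    p = (previous_word or "").lower()
    # Contextual rule 1: elided article or directional word -> compass point
    if p in {"l'", "l’", "plein", "vers"}:
        return "east", "rule"
    # Contextual rule 2: subject pronoun -> verb 'être'
    if p in {"il", "elle", "c'", "c’", "ce", "qui"}:
        return "is", "rule"
    # Closure rule: no contextual rule fired, fall back on the most
    # frequent translation (a statistical default)
    return "is", "closure"


print(translate_est("l'"))      # covered by a contextual rule
print(translate_est("maison"))  # falls through to the closure rule
```

Returning the origin alongside the translation makes it easy to measure how often the closure rule is actually used.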
Let us discuss the question of priority pairs with regard to endangered languages. Priority pairs are the most wanted translation pairs for a given endangered language, in keeping with the main language with which it is associated. To take an example: French-Corsican is the priority pair for the Corsican language. In the same way, Italian-Gallurese is the priority pair for the Gallurese language, etc. Now expanding on that idea, priority pairs are:
Let us give some further examples of two-sided grammatical analysis:
Let us call two-sided grammatical analysis the type of grammatical analysis that will be described below. Two-sided grammatical analysis contrasts with one-sided analysis, which sees a sequence of words either as a locution type (adverbial locution, verbal locution, noun locution, etc.) or as the sequence of types of its constituent words. From the standpoint of two-sided grammatical analysis, a given sequence of words can be attributed (synthetically) one single type, and (analytically) several grammatical types corresponding one-by-one to its constituent words. The upshot is that a given sequence of words can be described from two different viewpoints, synthetic and analytic. What is now the status of ‘de fait’, from the viewpoint of two-sided grammatical analysis? From a synthetic standpoint, it is an adverb. And from an analytic viewpoint, it is made up of one preposition (‘de’) followed by a common noun (‘fait’). Both viewpoints are complementary and each casts light on one facet of the same reality. (I lack the time to write a scholarly article, but I hope the main idea is clear…)
Let us investigate an issue that relates to disambiguation. It is a hard case that needs to be addressed: I shall call it in what follows, for reasons that will become clearer later, polymorphic disambiguation. Let us take an example. It relates to the translation of the two consecutive words: ‘de fait’. The first French sentence ‘De fait, il part.’ translates into Difatti, parti. (Actually, he’s leaving.): in this case, ‘de fait’ is considered as an adverbial locution. The second French sentence ‘Il n’y a rien de fait.’ translates correctly into Ùn ci hè nienti di fattu. (There is nothing done.) where ‘fait’ is now identified as a participle. The instance at hand concerns French to Corsican, but it should be clear that it arises in the same way within French to English translation. To sum up: the two consecutive words ‘de fait’ can be identified either as an adverbial locution, or as a preposition (‘de’) followed by a participle (‘fait’, done).
Now we are in a position to formulate the problem in a more general way. It concerns two or more consecutive words that may be grammatically interpreted differently in the sentence and that may, thus, be translated in a different way. Generally speaking, disambiguation may concern one word (in most cases) but also a group of words. Polymorphic disambiguation thus relates to groups of words, i.e. sequences of 2 words, 3 words, 4 words, etc.
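One way to picture the two viewpoints is a record that stores both at once: a single synthetic type for the whole sequence and one analytic type per word. This is only a sketch of such a data structure; the class name, field names and tag labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TwoSided:
    words: Tuple[str, ...]     # the word sequence itself
    synthetic: str             # one type for the sequence as a whole
    analytic: Tuple[str, ...]  # one type per constituent word

    def __post_init__(self):
        # The analytic view must match the words one-by-one
        if len(self.words) != len(self.analytic):
            raise ValueError("one analytic type is required per word")


de_fait = TwoSided(
    words=("de", "fait"),
    synthetic="ADVERB",
    analytic=("PREPOSITION", "COMMON_NOUN"),
)
print(de_fait.synthetic, "=", "+".join(de_fait.analytic))
```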
A try with online translators shows that statistical MT does better with polymorphic disambiguation. That is truly an interesting difference. So it is a gap that should be filled for rule-based MT.
Let us sketch what could be some ethical requirements related to machine translation regarding endangered languages.
Let us consider a hard case for word sense disambiguation, in the context of French to Corsican MT. But the same goes for French to English MT. It relates to French words such as: ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The corresponding verbs ‘accomplir’ (to fulfill, to accomplish), ‘affaiblir’ (to weaken), ‘affranchir’ (to free), ‘alourdir’ (to burden), ‘amortir’ (to damp) have the same form for simple present and simple past in the third person singular: respectively ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The upshot is that a single sentence such as: ‘Il affaiblit sa position.’ can be translated either into he weakens his position or into he weakened his position. If the context is unambiguous with regard to the tense of the discourse, the correct tense can be adequately chosen. But in the absence of informative context, it would be appropriate to let the ambiguity prevail.
It should be pointed out that such verbs are not rare. A more complete list includes: accomplit, affaiblit, affranchit, alourdit, amortit, anéantit, anoblit, aplatit, arrondit, assombrit, bannit, bâtit, blanchit, blondit, démolit, éblouit, emplit, enfouit, enhardit, enlaidit, ennoblit, envahit, épaissit, étourdit, exclut, franchit, glapit, investit, jaunit, jouit, munit, noircit, obéit, obscurcit, occit, périt, réagit, régit, réjouit, remplit, répartit, resplendit, rétrécit, rit, rougit, rouvrit, saisit, sévit, surgit.
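Letting the ambiguity prevail can be expressed quite directly: choose a tense only when the context supplies one, and otherwise output both candidates. The sketch below uses English renderings and a two-entry table for illustration; the function name and the way the context tense is passed in are assumptions.

```python
# French form -> (present rendering, simple past rendering); small sample only
AMBIGUOUS_FORMS = {
    "affaiblit": ("weakens", "weakened"),
    "accomplit": ("accomplishes", "accomplished"),
}


def translate_verb(form, context_tense=None):
    """Pick a tense if the context gives one; otherwise keep both candidates."""
    present, past = AMBIGUOUS_FORMS[form]
    if context_tense == "present":
        return present
    if context_tense == "past":
        return past
    return f"{present}/{past}"  # no informative context: leave it ambiguous


print("he", translate_verb("affaiblit"), "his position")
print("he", translate_verb("affaiblit", "past"), "his position")
```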
Let us focus on grammatical type disambiguation, which is a subproblem of word disambiguation. General grammatical types are: verbs, nouns, adjectives, adverbs, prepositions, gerundives, etc. But for grammatical type disambiguation purposes, more accuracy is in order: instances of grammatical types are then: masculine singular noun, feminine singular noun, masculine plural noun, feminine plural noun, masculine singular adjective, feminine singular adjective, masculine plural adjective, feminine plural adjective, adverb, preposition, gerundive, etc. Now grammatical type disambiguation can occur between two different grammatical types (in the above-mentioned form). For example, an ambiguity can occur between adverb and gerundive. In French, this is notably the case for ‘devant’ and ‘maintenant’. ‘devant’ can either be an adverb (in front) or a gerundive (from the verb ‘devoir’, to have to). Similarly, ‘maintenant’ can either be an adverb (now) or a gerundive (from the verb ‘maintenir’, to maintain). It should be clear now that ‘devant’ and ‘maintenant’ are both ambiguous with regard to their grammatical type. In English, depending on the relevant grammatical type, ‘devant’ is ambiguous between having to and in front. In the same way, ‘maintenant’ is ambiguous between now and maintaining.
In order to disambiguate French words ‘devant’ or ‘maintenant’, rule-based MT needs a disambiguation module that is able to distinguish whether ‘devant’ or ‘maintenant’ are adverbs or gerundives.
(For the sake of clarity, we leave aside here the fact that ‘devant’ can also be a preposition.)
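A disambiguation module for this adverb/gerundive case could start from local context. The sketch below is deliberately naive and its two cues are assumptions chosen for illustration (a preceding ‘en’, or ‘devant’ followed by an infinitive); the infinitive list and the function name are hypothetical, and the preposition reading of ‘devant’ is ignored.

```python
INFINITIVES = {"partir", "payer", "choisir"}  # tiny illustrative sample


def classify(tokens, i):
    """Classify tokens[i] ('devant' or 'maintenant') as 'gerundive' or 'adverb'."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if previous == "en":
        return "gerundive"  # 'en maintenant ...'
    if tokens[i].lower() == "devant" and following in INFINITIVES:
        return "gerundive"  # 'devant partir ...' (having to leave)
    return "adverb"         # default reading


print(classify("en maintenant la pression".split(), 1))
print(classify("il part maintenant".split(), 2))
```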
The issue of pair reversal goes as follows: suppose you have a given translation pair A>B that translates language A into language B; how hard is it to build the reverse pair B>A? The current instance of this problem is: given the French>Italian pair, how hard is it to build an Italian>French pair? To state it more explicitly: could AI help build a reverse pair in a very short time? Arguably, if AI could build such a reverse pair quickly, it would be some kind of breakthrough. Admittedly, we do not expect 100% efficiency and accuracy in this reversal process, but if some 98% or 99% were possible, it would do the job. For AI within MT is not only targeted at translating, it is also targeted at constructing translation engines.
Just tested pair reversal from French-Italian to Italian-French. Well, some 70% can be done automatically, but a big issue remains, which relates to the disambiguation of Italian words. The disambiguation engine seems to be the crux of the matter here. The upshot is that the entire disambiguation module needs to be rewritten, in order (if possible) to be language-related. The new module must be more AI-focused. If successful, it could open the path to the (somewhat) fast construction of a multi-language ecosystem with a rule-based MT architecture.
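Why reversal is only partly automatic can be seen on a toy dictionary. The sketch below inverts a four-entry French>Italian word list (chosen for illustration, not taken from the actual engine): entries that invert one-to-one come for free, while an Italian word such as ‘nipote’, which covers both ‘neveu’ and ‘petit-fils’, needs a disambiguation rule that the forward pair never required.

```python
from collections import defaultdict

# Toy French>Italian dictionary (illustrative sample)
FR_IT = {
    "maison": "casa",
    "chat": "gatto",
    "neveu": "nipote",
    "petit-fils": "nipote",
}


def reverse_pair(forward):
    """Invert a dictionary; separate automatic entries from ambiguous ones."""
    candidates = defaultdict(list)
    for source, target in forward.items():
        candidates[target].append(source)
    automatic = {t: s[0] for t, s in candidates.items() if len(s) == 1}
    ambiguous = {t: sorted(s) for t, s in candidates.items() if len(s) > 1}
    return automatic, ambiguous


automatic, ambiguous = reverse_pair(FR_IT)
print("reversed automatically:", automatic)
print("needs disambiguation:", ambiguous)
```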