Let’s take another look at polymorphic disambiguation. We shall consider the French word sequence ‘nombre de’. Its translation into Corsican (the same goes for English and other languages) cannot always be the same, because ‘nombre de’ can be translated in two different ways. In the sequence ‘mais nombre de poissons sont longs’ (but many fish are long), ‘nombre de’ is an indefinite determiner: it translates as bon parechji (many). On the other hand, in the sequence ‘mais le nombre de poissons est supérieur à dix’ (but the number of fish is greater than ten), ‘nombre de’ is a common noun followed by the preposition ‘de’: it is translated by numaru di (number of). Statistical MT usually does better than human-like (rule-based) MT at polymorphic disambiguation (I tested both sentences with DeepL and Google Translate, and both of them successfully solve the relevant polymorphic disambiguation), but it turns out that human-like (rule-based) MT is also capable of handling it.
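To make the two readings of ‘nombre de’ concrete, here is a minimal sketch of a context rule. It is not the translator's actual rule set, which is not described here: the function name, the determiner list and the "look one word to the left" heuristic are all assumptions made for illustration.

```python
# Toy rule (an assumption, not the engine's real logic): 'nombre de' is a
# noun + preposition when a determiner precedes it, and an indefinite
# determiner otherwise. The determiner list is illustrative and incomplete.
DETERMINERS = {"le", "un", "ce", "du", "au", "son", "leur", "quel"}

def translate_nombre_de(tokens, i):
    """Return the Corsican rendering of the 'nombre de' found at tokens[i:i+2]."""
    if tokens[i:i + 2] != ["nombre", "de"]:
        raise ValueError("no 'nombre de' at this position")
    previous = tokens[i - 1].lower() if i > 0 else None
    if previous in DETERMINERS:
        return "numaru di"    # common noun followed by the preposition 'de'
    return "bon parechji"     # indefinite determiner ('many')

s1 = "mais nombre de poissons sont longs".split()
s2 = "mais le nombre de poissons est supérieur à dix".split()
print(translate_nombre_de(s1, 1))  # bon parechji
print(translate_nombre_de(s2, 2))  # numaru di
```

A single left-context test is enough for these two sentences; real text would presumably need more context than that.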
Let us comment on the remaining errors encountered in the above open test:
French ‘carrière’ remains undisambiguated: either carriera (career) or cava (quarry): two occurrences
‘de’: French ‘de’ is perhaps the most difficult word to translate into another language, due to its general polymorphism
‘national-socialiste’: missing vocabulary
‘l’’ within ‘l’empeche’: pronoun error
It should be pointed out that ‘Etats-Unis’ remains untranslated because it is misspelled in the source text, with an initial E instead of É
The result is 1 – (5/169) = 97.04%. Note that the ambiguous French word ‘partie’ (‘durant la première partie’, during the first part) is correctly disambiguated into parti (part), instead of partita (game, match).
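The score above is simply one minus the ratio of errors to words. The short sketch below recomputes it; the per-item error counts are my reading of the list (two for ‘carrière’, one for each of the other three items, none for the misspelled ‘Etats-Unis’), chosen because they add up to the 5 errors used in the formula.

```python
def accuracy(errors: int, words: int) -> float:
    """Share of correctly translated words."""
    return 1 - errors / words

# Assumed breakdown of the 5 errors listed above.
errors = {"carrière": 2, "de": 1, "national-socialiste": 1, "l'": 1}
score = accuracy(sum(errors.values()), 169)
print(f"{score:.2%}")  # 97.04%
```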
It seems that an average result of 95% is currently being consolidated, and that an average result of 96% is a target that should be achievable within a year.
The analysis of the Wikipedia article of the day in French is interesting, in the sense that it sheds light on the skills that a machine translation system will need in order to achieve a 100% accurate translation. The error that appears here is characteristic and probably belongs to the missing 1% that separates us from 100% accuracy (the problem of the remaining 1%). The French phrase corresponding to ‘Her father studied at the University of Oregon and then at Yale Law School’ contains a definite article with elision: l’. The translation given (u/a, i.e. indeterminate between the masculine definite article u and the feminine definite article a) is not correct in that it fails to determine the gender, masculine or feminine, of Yale Law School, the English name of a school. In order to provide the correct translation, it is necessary to know how to translate Yale Law School into Corsican, and thus to determine that school is translated by scola, which is feminine. Therefore the correct translation should have been: po à a Yale Law School prima di …. This shows that a translator capable of 100% performance must be able (i) to determine the language of the parts of the text that are written in another language and (ii) to translate those parts into the target language. In other words, the skills needed to solve the remaining 1% are: (i) the ability to determine the language of a subtext and (ii) the ability to translate a subtext from any language into the target language.
Presently, we can only conjecture that this ability to solve the remaining 1% requires artificial general intelligence (AGI). Providing concrete and detailed examples may help to confirm or disprove that hypothesis.
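As one such concrete example, the two skills can be mimicked on a toy scale for the Yale Law School case. Everything in this sketch is an assumption made for illustration: the one-entry lexicon, the crude language check and the "head noun comes last" heuristic. It says nothing about whether the general problem requires AGI.

```python
# Hypothetical mini-lexicon: English noun -> (Corsican noun, gender).
# Only 'school' -> 'scola' (feminine) comes from the discussion above.
ENGLISH_TO_CORSICAN = {"school": ("scola", "f")}
CORSICAN_ARTICLE = {"m": "u", "f": "a"}

def looks_english(phrase):
    """Skill (i), crudely: does any word belong to the English lexicon?"""
    return any(w.lower() in ENGLISH_TO_CORSICAN for w in phrase.split())

def article_for_foreign_name(name):
    """Skill (ii), crudely: translate the head noun to find its gender."""
    if not looks_english(name):
        return "u/a"  # language unknown, gender stays undetermined
    # Assumption: in an English noun phrase the head noun comes last.
    head = name.split()[-1].lower()
    entry = ENGLISH_TO_CORSICAN.get(head)
    if entry is None:
        return "u/a"
    _, gender = entry
    return CORSICAN_ARTICLE[gender]

print(article_for_foreign_name("Yale Law School"))  # a
print(article_for_foreign_name("Sorbonne"))         # u/a
```

The hard part is obviously hidden in the lexicon and in the language check, which here cover a single word.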
Let us expand the idea of two-sided (from the analytic/synthetic duality standpoint) grammatical analysis: consider, for example, ‘beaucoup et souvent’ (a lot and often) in the sentence ‘il mange beaucoup et souvent’ (he eats a lot and often). Analytically, ‘beaucoup et souvent’ is composed of an adverb (‘beaucoup’), a conjunction (‘et’) and another adverb (‘souvent’). But synthetically, ‘beaucoup et souvent’ is an adverb, the structure of which is ADVERB+CONJUNCTIONCORD+ADVERB, according to the meta-rule ADVERB = ADVERB+CONJUNCTIONCORD+ADVERB. In the same way, ‘beaucoup mais souvent’ (a lot but often) is also, from a synthetic point of view, an adverb. Analogously, ‘rarement ou souvent’ (rarely or often) is also an adverb, from a synthetic viewpoint. Likewise, ‘rarement voire jamais’ (rarely or even never) is a synthetic adverb. This leads to considering ‘voire’ (even) as a coordinating conjunction.
Now it is patent that we can expand on that. As hinted at earlier, it seems that some progress in rule-based machine translation (we would do better to speak of, say, ‘human-like MT’, since it mimics human reasoning) requires revolutionizing grammar.
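The meta-rule ADVERB = ADVERB+CONJUNCTIONCORD+ADVERB can be sketched as a rewriting step over a sequence of tags. The tag names are the ones used above; the small lexicon, the OTHER tag and the function name are assumptions made for the example, not a description of the actual engine.

```python
# Illustrative lexicon: analytic (word-level) tags for the examples above.
LEXICON = {
    "beaucoup": "ADVERB", "souvent": "ADVERB",
    "rarement": "ADVERB", "jamais": "ADVERB",
    "et": "CONJUNCTIONCORD", "mais": "CONJUNCTIONCORD",
    "ou": "CONJUNCTIONCORD", "voire": "CONJUNCTIONCORD",
}
META_RULE = (("ADVERB", "CONJUNCTIONCORD", "ADVERB"), "ADVERB")

def synthesize(tags):
    """Collapse every ADVERB+CONJUNCTIONCORD+ADVERB run into one ADVERB."""
    pattern, result = META_RULE
    tags = list(tags)
    i = 0
    while i + len(pattern) <= len(tags):
        if tuple(tags[i:i + len(pattern)]) == pattern:
            # Replace the three tags by one; stay at i, since the new
            # ADVERB may itself start another match.
            tags[i:i + len(pattern)] = [result]
        else:
            i += 1
    return tags

phrase = "il mange beaucoup et souvent"
analytic = [LEXICON.get(w, "OTHER") for w in phrase.split()]
print(analytic)              # ['OTHER', 'OTHER', 'ADVERB', 'CONJUNCTIONCORD', 'ADVERB']
print(synthesize(analytic))  # ['OTHER', 'OTHER', 'ADVERB']
```

Because the rule is applied repeatedly, a chain such as ‘rarement ou souvent voire jamais’ also reduces to a single synthetic adverb.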
The application now changes its name on the Google Play Store, and becomes “Traduttore corsu”: the name is not very original, let’s face it, but at least it is easy to understand. “Traduttore corsu” is dedicated especially to translation from French to Corsican. So we are leaving aside for the moment this beautiful word “okchakko” from the language of the Choctaw Indians.
To find the application Traduttore corsu on Google Play, you have to search with “traduttore_corsu”, because there is a known “bug” in Google Play that means that with “corsu” or “traduttore”, you cannot find the application.
Just powered up the new engine (a prototype, not yet transferred to the API which is used both by the current site translator and the Android application) and ran a few tests: it works! Let us take an example with French ‘en fait’: from the viewpoint of two-sided grammar, ‘en fait’ (in fact, actually, difatti) is synthetically an adverb, made up, analytically, of a preposition followed by a singular noun. ‘en fait’ is polymorphic in the sense that it may also be part of the prepositional locution ‘en fait de’ (by way of, in fatti di). Alternatively, ‘en fait’ may also be a pronoun (‘en’, it, ni) followed by the third person singular present tense (‘fait’, makes, faci) of the verb ‘faire’ (to make). So, ‘en fait’ is highly ambiguous and context-sensitive.
As the above screenshot illustrates, the new engine adequately handles the three kinds of ‘en fait’. It could be something of a breakthrough for rule-based translation, since polymorphic disambiguation is a well-known weakness of this type of MT implementation. Presumably, this progress on polymorphic disambiguation opens the path to a score of some 95% or 96%.
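For readers who prefer code to screenshots, the three readings of ‘en fait’ can be separated by a few context tests. This is only a sketch under stated assumptions: the subject list, the order of the tests and the function name are mine, and the rules of the new engine itself are not shown here.

```python
SUBJECTS = {"il", "elle", "on"}  # illustrative, far from complete

def classify_en_fait(tokens, i):
    """Classify the 'en fait' at tokens[i:i+2]; return (reading, Corsican)."""
    words = [t.lower().strip(",.;") for t in tokens]
    if words[i:i + 2] != ["en", "fait"]:
        raise ValueError("no 'en fait' at this position")
    # A subject just before 'en' suggests pronoun + verb ('il en fait').
    if i > 0 and words[i - 1] in SUBJECTS:
        return ("pronoun + verb", "ni faci")
    # A following 'de' suggests the prepositional locution 'en fait de'.
    if i + 2 < len(words) and words[i + 2] == "de":
        return ("prepositional locution", "in fatti di")
    # Otherwise, fall back on the adverb reading.
    return ("adverb", "difatti")

print(classify_en_fait("En fait, il dort".split(), 0))
# ('adverb', 'difatti')
print(classify_en_fait("En fait de dessert, il mange une pomme".split(), 0))
# ('prepositional locution', 'in fatti di')
print(classify_en_fait("Il en fait trop".split(), 1))
# ('pronoun + verb', 'ni faci')
```

The subject test comes first so that a sentence like ‘il en fait de belles’ is not mistaken for the locution ‘en fait de’.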
Let us speculate about what an autonomous MT system could be. In the present state of MT, we provide rules and a dictionary to the software (rule-based translation) or we feed it with a corpus for a given pair of languages (statistical MT). But let us imagine that we could do otherwise and build an autonomous MT system. We provide the MT system with a corpus in a given source language. First, it thoroughly analyses this language. It begins by identifying single words. It then creates grammatical types and assigns them to the vocabulary. It also identifies locutions (adverbial, verbal, adjectival locutions, etc.) and assigns them a grammatical type. The MT system also identifies prefixes and suffixes. It also computes elision rules, euphony rules, etc. for that source language. Second, the autonomous MT system should do the same for the target language. Third, the MT system creates a set of rules for translating the source language into the target one. For that purpose, the MT system could for example assign a structured reference to all these words and locutions. For instance, ‘oak’ in English refers to ‘Quercus ilex’, ‘cat’ refers to ‘Felis silvestris’. For abstract entities, we presume it would not be a trivial task… Alternatively but not exclusively, it could use suffixes and exhibit morphing rules from the source language to the target one.
Is it feasible or pure speculation? It could be testable. Prima facie, this sounds like a different approach to AI than the classical one. It operates at a meta-level, since the MT system creates the rules and, in some respects, builds the software.
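To show in what sense this could be testable, here is a toy version of just one step of the speculated pipeline: proposing candidate suffixes from a raw word list by counting shared word endings. The function name, the thresholds and the tiny corpus are assumptions made for the example; a real system would need far more than frequency counts.

```python
from collections import Counter

def candidate_suffixes(words, min_stem=3, max_len=4, min_count=2):
    """Return word endings shared by at least min_count distinct words."""
    counts = Counter()
    for word in set(words):
        for n in range(1, max_len + 1):
            # Only keep endings that leave a stem of reasonable length.
            if len(word) - n >= min_stem:
                counts[word[-n:]] += 1
    return [s for s, c in counts.most_common() if c >= min_count]

corpus = ["rapidement", "lentement", "rarement", "souvent", "manger", "parler"]
suffixes = candidate_suffixes(corpus)
print("ment" in suffixes, "er" in suffixes)  # True True
```

The output still mixes genuine suffixes with accidental endings, so a further filtering step would be needed before any grammatical type could be assigned.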
A jeweler examines an emerald. “Aha,” he says, “another green emerald. In all my years in this business, I must have seen thousands of emeralds, and every one has been green.” We think the jeweler reasonable to hypothesize that all emeralds are green. Next door is another jeweler having equally comprehensive experience with emeralds. He speaks only the Choctaw Indian language. Color distinctions are not as universal as might be thought. The Choctaw Indians made no distinction between green and blue—the same words applied to both. The Choctaws did make a linguistic distinction between okchamali, a vivid green or blue, and okchakko, a pale green or blue. The Choctaw-speaking jeweler says: All emeralds are okchamali. He maintains that all his years in the jewelry business confirm this hypothesis. (William Poundstone, Labyrinths of reason)