Just powered up the new engine (still a prototype, not yet transferred to the API that is used by both the current site translator and the Android application) and ran a few tests: it works! Let us take an example with French ‘en fait’. From the viewpoint of two-sided grammar, ‘en fait’ (in fact, actually, difatti) is synthetically an adverb, made up analytically of a preposition followed by a singular noun. ‘en fait’ is polymorphic in the sense that it may also be part of the prepositional locution ‘en fait de’ (in fact of, in fatti di). Alternatively, ‘en fait’ may also be a pronoun (‘en’, it, ni) followed by the third person singular of the present tense (‘fait’, makes, faci) of the verb ‘faire’ (to make). So ‘en fait’ is highly ambiguous and context-sensitive.
As the above screenshot illustrates, the new engine adequately handles the three kinds of ‘en fait’. This could be something of a breakthrough for rule-based translation, since this kind of ambiguity is a well-known weakness of that type of MT implementation. Presumably, this progress on polymorphic disambiguation opens the path to a score of some 95% or 96%.
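To make the three readings concrete, here is a minimal sketch in Python. It is not the engine's actual code: the function name `classify_en_fait`, the tiny `SUBJECT_PRONOUNS` list and the two context tests are assumptions chosen only to illustrate how context could select one reading of ‘en fait’.

```python
from typing import List

# Assumption: a deliberately tiny list of subject pronouns, for illustration only.
SUBJECT_PRONOUNS = {"il", "elle", "on"}


def classify_en_fait(tokens: List[str], i: int) -> str:
    """Return one of the three readings of the 'en fait' starting at tokens[i].

    Tokens are expected in lowercase. The two context tests below are
    hypothetical heuristics, not the rules used by the real engine.
    """
    if tokens[i:i + 2] != ["en", "fait"]:
        raise ValueError("tokens[i:i+2] must be ['en', 'fait']")
    prev = tokens[i - 1] if i > 0 else None
    nxt = tokens[i + 2] if i + 2 < len(tokens) else None
    if prev in SUBJECT_PRONOUNS:
        # 'il en fait trop': pronoun 'en' + verb 'fait'
        return "pronoun + verb"
    if nxt == "de":
        # 'en fait de ...': part of the prepositional locution
        return "prepositional locution"
    # Otherwise fall back on the adverbial reading.
    return "adverb"
```

For example, `classify_en_fait(["il", "en", "fait", "trop"], 1)` selects the pronoun-plus-verb reading, while the same two words at the start of ‘en fait, il part’ are read as an adverb.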
Let us speculate about what an autonomous MT system could be. In the present state of MT, we either provide the software with rules and a dictionary (rule-based translation) or feed it a corpus for a given pair of languages (statistical MT). But let us imagine that we could do otherwise and build an autonomous MT system. We provide the MT system with a corpus for a given source language. First, it thoroughly analyses this language. It begins by identifying single words. It then creates grammatical types and assigns them to the vocabulary. It also identifies locutions (adverbial, verbal, adjectival, etc.) and assigns them a grammatical type. The MT system also identifies prefixes and suffixes, and computes elision rules, euphony rules, etc. for that source language. Second, the autonomous MT system should do the same for the target language. Third, the MT system creates a set of rules for translating the source language into the target one. For that purpose, the MT system could for example assign a structured reference to all these words and locutions. For instance, ‘oak’ in English refers to ‘quercus ilex’, and ‘cat’ refers to ‘felis sylvestris’. For abstract entities, we presume it would not be a trivial task… Alternatively but not exclusively, it could use suffixes and derive morphing rules from the source language to the target one.
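The idea of a structured reference can be sketched as a two-step lookup: source word to reference, then reference to target word. The sketch below is only an illustration under stated assumptions: the two references come from the examples above, while the French target entries, the dictionary names and the function `translate_via_reference` are hypothetical.

```python
from typing import Dict, Optional

# Source word -> structured reference (the two examples given in the text).
SOURCE_TO_REFERENCE: Dict[str, str] = {
    "oak": "quercus ilex",
    "cat": "felis sylvestris",
}

# Structured reference -> target word. Assumption: French is used as the
# target language here purely for illustration.
REFERENCE_TO_TARGET: Dict[str, str] = {
    "quercus ilex": "chêne",
    "felis sylvestris": "chat",
}


def translate_via_reference(word: str) -> Optional[str]:
    """Translate a word by pivoting through its structured reference.

    Returns None when the word has no known reference, which is the
    expected outcome for abstract entities in this toy lexicon.
    """
    reference = SOURCE_TO_REFERENCE.get(word.lower())
    if reference is None:
        return None
    return REFERENCE_TO_TARGET.get(reference)
```

With this toy lexicon, `translate_via_reference("cat")` goes through ‘felis sylvestris’ to reach the target word, whereas an abstract word such as ‘justice’ has no reference and yields `None`.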
Is it feasible or pure speculation? It could be testable. Prima facie, this sounds like a different approach to AI than the classical one. It operates at a meta-level, since the MT system creates the rules and, in some respects, builds the software.
The classical divide in MT separates statistical from rule-based MT. But this divide is not as clear-cut as one might think at first glance, for rule-based MT can operate statistically. Let us take an example concerning the disambiguation of French ‘est’: it can be translated either as is or as east, depending on the context. Defining the rules for disambiguating ‘est’ can be somewhat complicated. A rule-based MT system could then define a few rules that would cover 90% of the cases and, for the remaining 10%, apply a closure rule that translates ‘est’ into is unconditionally. Such a rule would be based on the statistical fact that, most often, ‘est’ translates into is and not into east. Such a rule may succeed in most cases. As we see it, such a rule is statistical in essence. Hence the conclusion: the statistical/rule-based divide in MT is not as clear-cut as one could think prima facie, for a disambiguating system for rule-based MT could be built with closure rules of this type, which would operate statistically.
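A closure rule of this kind can be sketched as an ordered list of context rules followed by an unconditional default. The two context rules below (`after_subject_pronoun`, `after_preposition_and_article`) are hypothetical and far from complete; only the final fallback to is reflects the closure rule described above.

```python
from typing import Callable, List, Optional

# A rule looks at the tokens around position i and returns a translation,
# or None when it does not apply.
Rule = Callable[[List[str], int], Optional[str]]


def after_subject_pronoun(tokens: List[str], i: int) -> Optional[str]:
    # Hypothetical rule: 'il est', 'elle est', 'on est' -> verb.
    prev = tokens[i - 1].lower() if i > 0 else ""
    return "is" if prev in {"il", "elle", "on"} else None


def after_preposition_and_article(tokens: List[str], i: int) -> Optional[str]:
    # Hypothetical rule: 'à l'est', 'vers l'est' -> compass point.
    if i >= 2 and tokens[i - 1].lower() == "l'" and tokens[i - 2].lower() in {"à", "vers"}:
        return "east"
    return None


CONTEXT_RULES: List[Rule] = [after_subject_pronoun, after_preposition_and_article]


def translate_est(tokens: List[str], i: int) -> str:
    """Translate the 'est' at tokens[i] with context rules, then a closure rule."""
    for rule in CONTEXT_RULES:
        result = rule(tokens, i)
        if result is not None:
            return result
    # Closure rule: no context rule fired, so pick the statistically
    # most frequent translation unconditionally.
    return "is"
```

In ‘Le chat est noir’ neither context rule fires, so `translate_est(["Le", "chat", "est", "noir"], 2)` is decided by the closure rule alone, which is exactly where the rule-based system behaves statistically.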
Let us discuss the question of priority pairs with regard to endangered languages. Priority pairs are the most wanted translation pairs for a given endangered language, in keeping with the main language with which it is associated. To take an example: French-Corsican is the priority pair for the Corsican language. In the same way, Italian-Gallurese is the priority pair for the Gallurese language, etc. Expanding on that idea, the priority pairs are:
Corsican: (i) French-Corsican (ii) Italian-Corsican (iii) English-Corsican
Sardinian Gallurese: (i) Italian-Gallurese (ii) English-Gallurese
Sardinian Sassarese: (i) Italian-Sassarese (ii) English-Sassarese
Sardinian Logudorese: (i) Italian-Logudorese (ii) English-Logudorese
Sicilian: (i) Italian-Sicilian (ii) English-Sicilian
Manx: (i) English-Manx
Munegascu: (i) French-Munegascu (ii) Italian-Munegascu (iii) English-Munegascu
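Such a list maps naturally onto an ordered table, where the position of a source language encodes its priority. The sketch below is only one possible representation: the names `PRIORITY_PAIRS`, `priority_pairs` and `top_priority_pair` are hypothetical, and only a few of the languages listed above are included.

```python
from typing import Dict, List, Optional, Tuple

# Endangered language -> source languages, in decreasing order of priority.
# Only part of the list above is reproduced here.
PRIORITY_PAIRS: Dict[str, List[str]] = {
    "Corsican": ["French", "Italian", "English"],
    "Gallurese": ["Italian", "English"],
    "Manx": ["English"],
    "Munegascu": ["French", "Italian", "English"],
}


def priority_pairs(language: str) -> List[Tuple[str, str]]:
    """Return the (source, target) pairs for a language, most wanted first."""
    return [(source, language) for source in PRIORITY_PAIRS.get(language, [])]


def top_priority_pair(language: str) -> Optional[Tuple[str, str]]:
    """Return the single most wanted pair, or None for an unlisted language."""
    pairs = priority_pairs(language)
    return pairs[0] if pairs else None
```

Keeping the order inside the list means that adding a new pair for a language only requires choosing its rank, without any separate priority field.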
Let us give some further examples of two-sided grammatical analysis:
“à dessein” (purposely), “à volonté” (at will), “à tort” (mistakenly): from an analytic standpoint, each is a preposition followed by a singular noun. From a synthetic viewpoint, they are adverbs (adverbial locutions).
“à jamais” (forever): from an analytic standpoint, it is a preposition followed by an adverb. From a synthetic viewpoint, it is an adverb (adverbial locution).
“à genoux” (on my/his/her/… knees), “à torrents” (in torrents): from an analytic standpoint, each is a preposition followed by a plural noun. From a synthetic viewpoint, they are adverbs (adverbial locutions).
Let us call two-sided grammatical analysis the type of grammatical analysis that will be described below. Two-sided grammatical analysis contrasts with one-sided analysis, which sees a sequence of words either as a locution type (adverbial locution, verbal locution, noun locution, etc.) or as the sequence of types of its constituent words. From the standpoint of two-sided grammatical analysis, a given sequence of words can be attributed a single type (synthetically) and several grammatical types (analytically), corresponding one-to-one to its constituent words. The upshot is that a given sequence of words can be described from two different viewpoints, synthetic and analytic. What is now the status of ‘de fait’, from the viewpoint of two-sided grammatical analysis? From a synthetic standpoint, it is an adverb. And from an analytic viewpoint, it is made up of one preposition (‘de’) followed by a common noun (‘fait’). Both viewpoints are complementary and each casts light on one facet of the same reality. (I lack the time to write a scholarly article, but I hope the main idea is clear…)
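As a data structure, a two-sided analysis simply stores both descriptions side by side. The following is a minimal sketch, assuming a hypothetical class name `TwoSidedAnalysis` and free-text type labels; the one-type-per-word constraint is the only rule it enforces.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TwoSidedAnalysis:
    """A word sequence described from both viewpoints at once."""

    words: Tuple[str, ...]
    synthetic: str             # single type for the whole sequence
    analytic: Tuple[str, ...]  # one type per constituent word

    def __post_init__(self) -> None:
        # The analytic side must match the words one-to-one.
        if len(self.words) != len(self.analytic):
            raise ValueError("analytic types must match the words one-to-one")


# The 'de fait' example from the text.
de_fait = TwoSidedAnalysis(
    words=("de", "fait"),
    synthetic="adverb",
    analytic=("preposition", "common noun"),
)
```

A one-sided analysis would keep only `synthetic` or only `analytic`; keeping both lets later rules consult whichever viewpoint they need.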
Let us investigate an issue that relates to disambiguation. It is a hard case that needs to be addressed: I shall call it in what follows, for reasons that will become clearer later, polymorphic disambiguation. Let us take an example. It relates to the translation of the two consecutive words ‘de fait’. The first French sentence, ‘De fait, il part.’, translates into ‘Difatti, parti.’ (Actually, he’s leaving.): in this case, ‘de fait’ is considered an adverbial locution. The second French sentence, ‘Il n’y a rien de fait.’, translates correctly into ‘Ùn ci hè nienti di fattu.’ (There is nothing done.), where ‘fait’ is now identified as a past participle. The instance at hand concerns French to Corsican, but it should be clear that it arises in the same way in French to English translation. To sum up: the two consecutive words ‘de fait’ can be identified either as an adverbial locution, or as a preposition (‘de’) followed by a past participle (‘fait’, done).
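The two example sentences can be turned into a small sketch that picks a reading of ‘de fait’ and returns the matching Corsican rendering. The Corsican forms are those of the two sentences above; the function name `read_de_fait`, the regular-expression tokenizer and the single context test (a preceding ‘rien’ or ‘chose’) are assumptions made for illustration, not the engine's rules.

```python
import re
from typing import List, Optional, Tuple

# Corsican renderings taken from the two example sentences.
CORSICAN = {
    "adverbial locution": "difatti",
    "preposition + past participle": "di fattu",
}

# Assumption: words after which 'de fait' is read as 'de' + past participle.
INDEFINITE_HEADS = {"rien", "chose"}


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def read_de_fait(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (reading, Corsican rendering) for the first 'de fait', if any."""
    tokens = tokenize(sentence)
    for i in range(len(tokens) - 1):
        if tokens[i] == "de" and tokens[i + 1] == "fait":
            prev = tokens[i - 1] if i > 0 else None
            if prev in INDEFINITE_HEADS:
                reading = "preposition + past participle"
            else:
                reading = "adverbial locution"
            return reading, CORSICAN[reading]
    return None
```

Calling `read_de_fait` on ‘De fait, il part.’ and on ‘Il n'y a rien de fait.’ selects the two different readings, and a sentence without ‘de fait’ returns `None`.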
Now we are in a position to formulate the problem in a more general way. It concerns two or more consecutive words that may be grammatically interpreted in different ways in the sentence and that may, thus, be translated differently. Generally speaking, disambiguation may concern one word (in most cases) but also a group of words. Polymorphic disambiguation, then, relates to a given group of words, i.e. sequences of 2 words, 3 words, 4 words, etc.
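Before any reading can be chosen, the candidate sequences have to be located. A minimal sketch of that first step, assuming a hypothetical lexicon `POLYMORPHIC` of known polymorphic sequences and a window of 2 to 4 words:

```python
from typing import List, Set, Tuple

# Assumption: a toy lexicon of polymorphic sequences, stored as lowercase tuples.
POLYMORPHIC: Set[Tuple[str, ...]] = {
    ("de", "fait"),
    ("en", "fait"),
    ("en", "fait", "de"),
}


def find_polymorphic_spans(
    tokens: List[str], max_len: int = 4
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (start index, sequence) for every 2- to max_len-word match."""
    lowered = [token.lower() for token in tokens]
    spans = []
    for start in range(len(lowered)):
        for n in range(2, max_len + 1):
            candidate = tuple(lowered[start:start + n])
            # Skip windows that run past the end of the sentence.
            if len(candidate) == n and candidate in POLYMORPHIC:
                spans.append((start, candidate))
    return spans
```

Overlapping matches are kept on purpose: in ‘En fait de dessert’ both the 2-word and the 3-word sequence are returned, and choosing between them is precisely the disambiguation step.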
A test with online translators shows that statistical MT does better with polymorphic disambiguation. That is truly an interesting difference, and it is a gap that rule-based MT should fill.
A jeweler examines an emerald. “Aha,” he says, “another green emerald. In all my years in this business, I must have seen thousands of emeralds, and every one has been green.” We think the jeweler reasonable to hypothesize that all emeralds are green. Next door is another jeweler having equally comprehensive experience with emeralds. He speaks only the Choctaw Indian language. Color distinctions are not as universal as might be thought. The Choctaw Indians made no distinction between green and blue—the same words applied to both. The Choctaws did make a linguistic distinction between okchamali, a vivid green or blue, and okchakko, a pale green or blue. The Choctaw-speaking jeweler says: All emeralds are okchamali. He maintains that all his years in the jewelry business confirm this hypothesis. (William Poundstone, Labyrinths of reason)