In this section we discuss several computerized lexicons that have been developed for Machine Translation applications: Eurotra, Cat-2, Metal, Logos and Systran (see §4.1). They have a high degree of formalization as compared to traditional dictionaries but the information is specifically structured to solve translation problems.
Eurotra is a transfer-based, syntax-driven MT system which deals with 9 languages (Danish, Dutch, German, Greek, English, French, Italian, Spanish and Portuguese). Monolingual and bilingual lexical resources were developed for all the languages involved, with similar size and coverage for each.
We will only supply figures for Spanish, as an illustration, in Table 3.10.
Eurotra dictionaries are organized according to a number of levels of representation: Eurotra Morphological Structure (EMS), Eurotra Constituent Structure (ECS), Eurotra Relational Structure (ERS) and Interface Structure (IS). The IS is the basis for transfer and, although it reflects deep syntactic information, it is also the level where semantic information is present.
The Eurotra IS level is an elaboration of dependency systems in that every phrase is made up of a governor optionally followed by dependants of two types: arguments and modifiers. Arguments of a given governor are encoded in the lexicon. The relations between governors and their arguments are not explicitly stated. The set of arguments is:
arg1: subject (experiencer/causer/agent)
arg2: object (patient/theme/experiencer)
arg_2P: 2nd participant (goal/receiver (non-theme))
arg_2E: 2nd entity (goal/origin/place (non-theme))
arg_AS: secondary stative predication on subject
arg_AO: secondary stative predication on object
arg_Pe: dative perceiver with raising predicates
arg_ORIGIN: oblique
arg_GOAL: oblique
arg_MANNER: oblique
Not all labels have the same theoretical status nor correspond to the same level of depth in analysis. Thus,
John told Mary (arg2P) a story (arg2)
John told a story (arg2) to Mary (arg2P)
Essentially, the semantic information encoded in the E-dictionaries is used for disambiguation. This covers (i) structural ambiguity (i.e., the argument/modifier distinction, especially in the case of PP-attachment) and (ii) lexical ambiguity in lexical transfer, that is, collocations (restricted to support verb constructions), homonymy and polysemy (this is further explained in §4.1.4).
All information is encoded as Feature-Value pairs, in ASCII files. Here is an example:
absoluto_1 = {cat=adj,e_lu=absoluto,e_isrno='1',e_isframe=arg1,e_pformarg2=nil,term='0'}.
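Because the entries are flat feature-value pairs, they are straightforward to read programmatically. The following is a minimal sketch, not part of the Eurotra software: the function name `parse_entry` is hypothetical, and it assumes that values contain no commas or braces, which holds for the flat entries shown in this section.

```python
import re

def parse_entry(text):
    """Parse 'name = {attr=value,...}.' into (name, dict of attribute -> value).

    Assumption: values never contain commas or braces.
    """
    match = re.fullmatch(r"\s*(\w+)\s*=\s*\{(.*)\}\s*\.\s*", text, re.DOTALL)
    if match is None:
        raise ValueError("not a feature-value entry: %r" % text)
    name, body = match.groups()
    features = {}
    for pair in body.split(","):
        attr, _, value = pair.partition("=")
        # Some values (e.g. '1') are quoted in the source files; drop the quotes.
        features[attr.strip()] = value.strip().strip("'")
    return name, features

ENTRY = ("absoluto_1 = {cat=adj,e_lu=absoluto,e_isrno='1',"
         "e_isframe=arg1,e_pformarg2=nil,term='0'}.")

name, features = parse_entry(ENTRY)
print(name, features["cat"], features["e_isframe"])
```

Every value is kept as a string, including `nil`, so that nothing is interpreted beyond what the entry itself states.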
The information encoded depends on the category. For all categories, the category (cat=), lemma (e_lu=) and reading number (e_isrno=) are encoded. Further information for nouns and verbs is: the deep syntactic argument structure (e_isframe=) as explained above, the strongly bound argumental prepositions required by the lexical item (e_pformargX=), the selectional restrictions on all the arguments (semargX=) and the semantic type of the lexical item (sem=).
A reading number refers to a meaning distinction, which is usually also reflected in a difference in the encoding of the other attributes. In the case of centro ("center"), the meaning distinction is recorded in the "sem" attribute: coordinate (coord) vs. place (lug). In addition, the reading "place" has no argument structure, while the reading "coord" can have one argument ("e_isframe=arg1"), which has to be "concrete" as opposed to "abstract entity".
centro_1 = {cat=n,e_lu=centro,e_isrno='1',e_gender=masc,person=third,nform=norm, nclass=common,class=no,e_isframe=arg1,e_pformarg1=de,e_pformarg2=nil, e_pformarg3=nil,sem=coord,semarg1=conc,semarg2=nil,semarg3=nil, exig_mood=nil,e_predic=no,wh=no,whmor=none,e_morphsrce=simple, term='2000000538'}.
centro_2 = {cat=n,e_lu=centro,e_isrno='2',e_gender=masc,person=third,nform=norm, nclass=common,class=no,e_isframe=arg0,e_pformarg1=nil,e_pformarg2=nil, e_pformarg3=nil,sem=lug,semarg1=nil,semarg2=nil,semarg3=nil, exig_mood=nil,e_predic=no,wh=no,whmor=none,e_morphsrce=simple,term='0'}.
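To see which attributes actually separate the two readings of centro, the entries can be compared attribute by attribute. The sketch below is illustrative only: the helper `differing_attributes` is hypothetical, and the two dictionaries copy just a subset of the attributes from the entries above.

```python
# Subset of the attributes of the two readings, copied from the entries above.
CENTRO_1 = {"e_isrno": "1", "e_gender": "masc", "e_isframe": "arg1",
            "e_pformarg1": "de", "sem": "coord", "semarg1": "conc",
            "term": "2000000538"}
CENTRO_2 = {"e_isrno": "2", "e_gender": "masc", "e_isframe": "arg0",
            "e_pformarg1": "nil", "sem": "lug", "semarg1": "nil",
            "term": "0"}

def differing_attributes(a, b):
    """Return {attribute: (value in a, value in b)} for attributes that differ."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}

diff = differing_attributes(CENTRO_1, CENTRO_2)
for attribute, (first, second) in diff.items():
    print(attribute, first, second)
```

In this subset the readings share only the gender; the frame, the bound preposition, the semantic type and the selectional restriction all change together with the reading number.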
Other strictly monolingual information encoded for nouns is: gender, person, type of noun ("nform" and "nclass"), whether the noun requires a specific verbal mood ("exig_mood") when it introduces a subordinate clause, whether the noun is predicative ("e_predic"), information about relatives ("wh" and "whmor"), morphological derivation information ("e_morphsrce", the morphological source, e.g., simple or derived), and terminological identification ("term").
basar_1 = {cat=v,e_lu=basar,e_isrno='1',e_isframe=arg1_2_PLACE,e_pformarg1=nil, e_pformarg2=nil,e_pformarg3=en,e_pformarg4=nil,p1type=nil,p2type=nil, semarg1=anim,semarg2=ent,semarg3=ent,semarg4=nil,e_vtype=main, vfeat=nstat,term='0',erg=yes}.
As noted above, information about strongly bound prepositions is encoded for all the arguments; when the verb's preposition is weakly bound, two features corresponding to two possible complements may refer to a class of prepositions such as "origin", "goal", etc. As with nouns, selectional restrictions are encoded, but there is no semantic typing of the verb itself. Specifically monolingual information is encoded in the following attributes: "e_vtype" refers to the traditional main vs. auxiliary distinction, and "erg" marks ergative verbs. The aspectual characterization of the verb is encoded in "vfeat", whose possible values are stative and non-stative.
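The way selectional restrictions could be checked against candidate arguments can be sketched as follows. This is an illustration under stated assumptions, not the Eurotra mechanism: the is-a links in `PARENT` are hypothetical (the text does not give the type inventory or its hierarchy), and the decision to reject an argument that falls on a `nil` slot is likewise an assumption.

```python
# Hypothetical is-a links between semantic types. Only the labels themselves
# (anim, conc, ent) come from the entries above; the hierarchy is assumed.
PARENT = {"anim": "ent", "conc": "ent", "ent": None}

def is_a(sem_type, required):
    """True if sem_type equals required or is below it in PARENT."""
    while sem_type is not None:
        if sem_type == required:
            return True
        sem_type = PARENT.get(sem_type)
    return False

# Selectional restrictions copied from the basar_1 entry.
BASAR = {"semarg1": "anim", "semarg2": "ent", "semarg3": "ent", "semarg4": "nil"}

def satisfies(entry, arg_types):
    """Check candidate argument types (in argument order) against semargN."""
    for index, sem_type in enumerate(arg_types, start=1):
        required = entry.get("semarg%d" % index, "nil")
        # Assumption: 'nil' means the slot does not exist, so any filler fails.
        if required == "nil" or not is_a(sem_type, required):
            return False
    return True

print(satisfies(BASAR, ["anim", "conc", "ent"]))
print(satisfies(BASAR, ["conc", "conc", "ent"]))
```

Under these assumptions the first call succeeds and the second fails, because a "conc" filler does not meet the "anim" restriction on the first argument.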
The CAT2 system, developed at IAI (Saarbruecken), is a direct descendant of Eurotra and was designed specifically for MT [Ste88], [Zel88], [Mes91]. The CAT2 system exploits linguistic information of different kinds: phrase structure, syntactic functional information and semantic information. The figures supplied in Table 3.11 provide an indication of size and coverage.
Semantic information is essentially used for reducing syntactic ambiguity, disambiguation of lexical entries, semantic interpretation of prepositional phrases, support verb constructions, lexical transfer and calculation of tense and aspect.
An example of a verbal entry at the IS level is:
apply1 = %% He applied the formula to the problem.
{lex=apply,part=nil,VOW}&
({slex=apply,head={VERB}}
;{slex=applying,head={VN_ING}}
;{slex=application,head={TION_R}}
;{slex=applicant,head={ANT_N}}
;{slex=applier,head={VN_AGENT}}
;{slex=application,head={TION_A}}
;{slex=unapplicable,head={UNABLE}}
;{slex=applicable,head={ABLE}}
;{slex=applicable,head={ELL_ABLE}}
;{slex=appliability,head={ABILITY}})&
{sc={a={AGENT},b={THEME},c={GOAL,head={ehead={pf=to}}}},
trans={de=({lex=applizieren};{lex=wenden,head={prf=an}}),fr={lex=appliquer}}}.
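The "trans" part of this entry pairs each target language with one or more alternative feature structures. A minimal sketch of that structure is given below; the rendering as Python dictionaries and lists, and the helper `target_lexemes`, are assumptions made for illustration and do not reflect CAT2's own data format or API.

```python
# Hand transcription of the 'trans' feature of the apply1 entry.
# A disjunction ({...};{...}) is rendered as a list of alternatives.
APPLY_TRANS = {
    "de": [{"lex": "applizieren"},
           {"lex": "wenden", "head": {"prf": "an"}}],
    "fr": [{"lex": "appliquer"}],
}

def target_lexemes(trans, language):
    """Return the 'lex' value of every alternative listed for a language."""
    return [alternative["lex"] for alternative in trans.get(language, [])]

print(target_lexemes(APPLY_TRANS, "de"))
print(target_lexemes(APPLY_TRANS, "fr"))
```

A language with no entry yields an empty list, which keeps the lookup total without inventing a translation.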
Metal is a commercial MT system which is offered in English-German, English-Spanish, German-English, German-Spanish, Dutch-French, French-Dutch, French-English, German-French. It delivers monolingual and transfer system lexicons of up to 200,000 entries for a language pair, as indicated in Table 3.12. Terms are coded for morphological, syntactic, and semantic patterns, including specification of selectional restrictions. Metal offers a sophisticated subject-area code hierarchy.
Possible role values are:
$SUBJ deep subject
$DOBJ deep object
$IOBJ the affected
$POBJ prepositional object
$SOBJ sentential object
$SCOMP attribute of subject
$OCOMP attribute of object
$LOC locative
$TMP temporal
$MEA measure
$MAN manner
Adjectives, nouns and adverbs are semantically classified. Semantic features (attribute/value pairs) include:
Logos is a commercial high-end MT system which is offered in English-German, English-French, English-Spanish, English-Italian, English-Portuguese, German-English, German-French and German-Italian. Its lexical resources contain approximately 50,000 entries for English source and 100,000 for German source, plus an additional semantic rule database with approximately 15,000 rules for English source and 18,000 for German source, as indicated in Table 3.13.
Logos is based on semantic analysis techniques using structural networks. It encodes Logos semantic types, which allow selectional restrictions to be defined on the basis of syntactic patterns. Dictionaries are extensible (the Logos standard dictionary comprises 250 thematic dictionaries), and the system comes with an automatic lexicographic tool (Alex) and a semantic database (Semantha).
Systran is a highly structured MT system whose translation process is based on repeated scanning of the terms in each sentence in order to establish acceptable relationships between forms. Using basic dictionaries, the system is able to define terms by analyzing morphemes (combining their grammatical, syntactic, semantic and prepositional composition).
It is a commercially available system offered with the following pairs:
Nouns are marked too. The inventory of labels is:
Adverbs are also characterized semantically.
In addition, further semantic information is encoded as part of a complex expression. It comprises two types: semantic primitives and terminology codes. The common attribute for both types is SEM.
THINGS, PROCES, LOCATN, QUALITY, BEINGS
Each of the taxons is the root of a tree which branches off to a number of subordinate nodes. For instance:
Following Eurotra-D, CAT2 uses semantic relations as a basis for monolingual and bilingual disambiguation. In addition, the system suggests an extensive semantic encoding of nouns using hierarchical feature structures.
The semantic coding of nouns follows Cognitive Grammar principles [Zel88]. The semantic coding of argument roles follows Systemic Grammar [Ste88]. Support verb constructions follow the analysis of [Mes91].