
Lexicalization

Introduction

One of the basic goals of lexical semantic theory is to provide a specification of word meanings in terms of semantic components and combinatory relations among them. Different works in lexical semantics now converge on the hypothesis that the meaning of every lexeme can be analysed in terms of a set of more general meaning components, some or all of which are common to groups of lexemes in a language or cross-linguistically. In other words, meaning components can be identified which may or may not be lexicalized in particular languages. Identifying the meaning components that characterise classes of words in a language, and the possible combinations of such components within word roots, leads to the identification of lexicalization patterns varying across languages. Moreover, there is a strong correlation between each combination of meaning components and the syntactic constructions allowed by the words displaying them (e.g., [Tal85]; [Jac83], [Jac90]).

A trend has recently emerged towards addressing the issues of

i: identifying meaning components lexicalized within verb roots;
ii: stating a connection between specific components characterizing semantic classes of verbs and syntactic properties of the verbs themselves (e.g. [Lev93]; [Alo94b], [Alo95]).

The basic goals of research on lexicalization of meaning components are:

to define a (finite?) set of (primitive? universal? functionally discrete?) meaning components;
to provide a description of word meanings in terms of meaning components and combinatory relations among them;
to identify `preferences' displayed by (groups of) languages for lexicalization patterns;
to identify linkings between each meaning component's `conflation' pattern and syntactic properties of words.

The main aims of this section are, first, to point out some problematic issues raised in works dealing with the identification and discussion of meaning components and, second, to briefly discuss proposals concerned with the identification of lexicalization patterns of semantic components both within a language and cross-linguistically. Furthermore, we shall point out how information on lexicalization is encoded in lexical databases and how it is useful for LE applications.

Description and comparison of different approaches

Works on meaning components

Representing complex meanings in terms of simpler ones has generally been considered one of the fundamental goals of semantic theory; however, works devoted to the issue have taken different positions on various aspects of it. In any case, the following hypotheses are shared by the various positions:

the meaning of every lexeme can be analysed in terms of a set of more general meaning components;
some or all of these components are common to groups of lexemes in a language/cross-linguistically.

It was [Hje61]'s componential analysis of word meaning which gave rise to various studies of the same type in Europe (e.g., [Gre66]; [Cos67]; [Pot74], etc.). These studies, although different with respect to the specific hypotheses put forward, tried to identify semantic components shared by groups of words by observing paradigmatic relations between words. The semantic components identified in the various proposals differentiate one group of words from another and, by combining in various ways, one word from another. Here is a standard example, showing the kind of analysis usually performed:

woman : man : child :: mare : stallion : foal :: cow : bull : calf

In this set of words, all the words in one group contrast with the words in another group in the same way (i.e. because of the same semantic components), and the first word in each group contrasts with the other words in its group in the same way, etc. Thus, for instance, all the words in the first group will be assigned a component HUMAN, vs. EQUINE and BOVINE assigned respectively to the second and third group. Then, the first word in each group will be characterized by a component FEMALE, vs. MALE characterizing the second word, etc.
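
The proportional analysis above can be sketched in a few lines of Python. This is only an illustration of the idea, not a procedure taken from the works cited: the components HUMAN, EQUINE, BOVINE, FEMALE and MALE come from the text, whereas YOUNG (for the third word of each group) and the helper names `contrast` and `same_proportion` are assumptions introduced here.

```python
# Illustrative component assignments for the woman/man/child example.
# HUMAN, EQUINE, BOVINE, FEMALE, MALE are named in the text;
# YOUNG is an assumption made for the third word of each group.
COMPONENTS = {
    "woman":    {"HUMAN", "FEMALE"},
    "man":      {"HUMAN", "MALE"},
    "child":    {"HUMAN", "YOUNG"},
    "mare":     {"EQUINE", "FEMALE"},
    "stallion": {"EQUINE", "MALE"},
    "foal":     {"EQUINE", "YOUNG"},
    "cow":      {"BOVINE", "FEMALE"},
    "bull":     {"BOVINE", "MALE"},
    "calf":     {"BOVINE", "YOUNG"},
}


def contrast(a, b):
    """Return the components that distinguish a from b: (only in a, only in b)."""
    return COMPONENTS[a] - COMPONENTS[b], COMPONENTS[b] - COMPONENTS[a]


def same_proportion(a, b, c, d):
    """True if a : b :: c : d, i.e. a differs from b exactly as c differs from d."""
    return contrast(a, b) == contrast(c, d)


print(same_proportion("woman", "man", "mare", "stallion"))   # contrast within groups
print(same_proportion("woman", "mare", "man", "stallion"))   # contrast across groups
print(same_proportion("woman", "man", "mare", "foal"))       # a non-matching pair
```

Representing each word as a set makes both kinds of contrast (between groups and within a group) fall out of the same set difference.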

Componential analysis in America developed independently, first among anthropologists (e.g., [Lou56]; [Goo56]) who described and compared kinship terminology in various languages. Their research was taken up and generalized by various linguists and in particular by scholars working within the framework of transformational grammar (cf. [Kat63]), who aimed at integrating componential analyses of words with treatments of the syntactic organization of sentences. Generative semanticists (e.g. [McC68], [Lak70]) tried to determine units of meaning, or 'atomic predicates', by means of syntagmatic considerations. Thus, for instance, the components BECOME and CAUSE were identified by analysing pairs of sentences displaying similar syntactic relationships such as the following:

The soup cooled.
The metal hardened.
John cooled the soup.
John hardened the metal.

Afterwards, scholars working in different fields of language research dealt with various issues connected with the identification/definition of meaning components. Within this survey we do not intend to report on all the similarities/differences among the various hypotheses put forward. We shall instead point out problematic aspects which have been dealt with and which are of interest for our work.

The most important issues raised in the work on semantic components are the following:

first there is the question of whether the meaning components which have been identified/can be identified should be considered `primitives' or not: i.e., whether they are linguistic/conceptual units of some kind from which all possible meanings in a language can be derived, but which in turn are not themselves derivable from any other linguistic unit;
strictly linked to the above issue is that of the `universality' of primitives, i.e. if such primitives are the same across languages;
then, there is the question of whether it is possible to identify a finite set of (universal) primitives;
finally, there is the question of identifying a procedure for the definition of semantic components.

These issues have been explicitly or implicitly dealt with in theoretical semantic research, in computational linguistics, in philosophy, and in psycholinguistics. The `strongest' proposal put forward with respect to them is probably that presented by Wierzbicka in a number of works on semantic primitives (cf. [Wie72], [Wie80], [Wie85], [Wie89a], [Wie89b]). The avowed goal of these works is to arrive at a definition of a complete and stable set of semantic primitives, by means of cross-linguistic research on lexical universals, i.e. concepts which are encoded in the lexica of (nearly or possibly) all natural languages. While lexical universals are not necessarily universal semantic primitives (e.g., a concept such as mother), according to Wierzbicka the converse is true, i.e. all semantic primitives are universal. Large-scale lexicographic studies are decisive for identifying such semantic primitives. These studies should not rely on the most frequent words recurring in the definitions of conventional dictionaries, because of the limitations, inconsistencies and gaps in data typically found in these sources. In the various stages of her research, Wierzbicka postulated different sets of primitives. While the first set included only 14 elements, in [Wie89b] a set of twenty-eight universal semantic primitive candidates was proposed:

I, you, someone, something, this, the same (other), two, all, I want, I don't want, think, say, know, would (I imagine), do, happen, where, when, after, like, can (possible), good, bad, kind (of), part, like, because, and very.

According to the author, this list should not necessarily be considered `final'. In any case, the set `works' in semantic analyses and has been validated through research into lexical universals.

In general, studies dealing with meaning components treat them as `primitives', i.e. as units which cannot be further defined. However, the components proposed as primitives in certain works are not always accepted as such in others (cf. Jackendoff's discussion in [Jac83] of [McC68]'s proposal of a primitive ALIVE).
Sometimes, a strong relation between `primitivity' and `universality' is not explicitly stated. For instance, [Mel89] conceives of semantic primitives simply as 'elementary lexical meanings of a particular language', without asking whether they are the same for all languages. However, others, and especially scholars working within a Chomskyan framework, assume the universality of semantic primitives, in that they share the position that the meaning components which are lexicalized in any language are taken from a finite inventory, the knowledge of which is innate (e.g. [Jac90]).
The main problem remains, then, to decide which the universal semantic primitives are; i.e., to (eventually) define a finite and complete set of them. Indeed, while Wierzbicka proposes a complete and 'stable' (although not necessarily definitive) set of (pure) primitives, strong hypotheses like hers have not in general been presented. Rather, analyses of portions of the lexicon have been proposed: for instance, [Tal76] describes the various semantic elements combining to express causation; [Tal85] discusses the semantics of motion expressions; [Jac83], [Jac90] extend to various semantic fields (e.g., verbs of transfer of possession, verbs of touching, etc.) the semantic analyses provided for motion or location verbs; etc. In any case, analysis and comparison of the different works on the issue does not allow us to circumscribe a shared set of primitives which could also be seen as `complete'.
Finally, no clear procedure for the identification of semantic components has so far been formalized.

An approach which deliberately seeks to avoid a strong theoretical characterization of semantic components is that chosen by [Cru86], which, for this reason, could be taken as the starting point for the Eagles recommendations on the encoding of semantic components. According to Cruse's 'contextual approach', the meaning of a word can be described as composed of the meanings of other words with which it contracts paradigmatic and syntagmatic relations within the lexicon. These words are called semantic traits of the former word. Thus, for instance, animal can be considered a semantic trait of dog, since it is its hyperonym. Moreover, dog is implied in the meaning of to bark, given that it is the typical subject selected by the verb. Cruse clearly states that his 'semantic traits' are not claimed to be "primitive, functionally discrete, universal, or drawn from a finite inventory; nor is it assumed that the meaning of any word can be exhaustively characterised by any finite set of them" ([Cru86], p. 22). A similarly weak theoretical characterisation has been adopted by [Dik78], [Dik80] with his 'stepwise lexical decomposition'. No semantic primitive/universal elements are postulated. Lexical meaning is reduced to a limited set of basic lexical items of the object language, identified by analysing a network of meaning descriptions.

Works on Lexicalization patterns

Relying on the basic assumption that it is possible to identify a discrete set of elements (semantic components) within the domain of meaning and combinatory relations among them, [Tal85] carried out a study on the relationships among such semantic components and morphemes/words/phrases in a sentence/text. In particular, he investigated in depth the regular associations (lexicalization patterns) among meaning components (or sets of meaning components) and the verb, providing a cross-linguistic study of lexicalization patterns connected with the expression of motion. He was mainly interested in bringing out typologies, i.e. small numbers of patterns exhibited by groups of languages, and universals, i.e. single patterns shared cross-linguistically.

According to Talmy, a motion event may be analysed as related, at least, to five basic semantic elements:

MOTION (the event of motion or location),
PATH (the course followed or site occupied),
MANNER (the manner of motion),
FIGURE (the moving object),
GROUND (the reference object).

These may be found either lexicalized independently of one another, or variously conflated in the meaning of single words, as can be seen in the examples below (all taken from [Tal85], except the last one):

The rock      moved      down      the hill      rolling
FIGURE        MOTION     PATH      GROUND        MANNER

The rock      rolled               down      the hill
FIGURE        MOTION + MANNER      PATH      GROUND

La botella      entró              a        la cueva      flotando
(the bottle)    (moved-in)         (to)     (the cave)    (floating)
FIGURE          MOTION + PATH      PATH     GROUND        MANNER

She      powdered                   her nose
         MOTION + PATH + FIGURE     GROUND

I      shelved                    the books
       MOTION + PATH + GROUND     FIGURE

L'uomo        fuggì
(the man)     (escaped)
FIGURE        MOTION + PATH + MANNER
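
One way to make such glosses machine-readable is to pair each expression with the set of semantic elements it carries. The following is a minimal sketch under that assumption; the dictionary layout and the function name `conflated` are hypothetical and do not come from [Tal85], while the element labels and the example sentences are those given above.

```python
# The five semantic elements named in the text.
ELEMENTS = {"MOTION", "PATH", "MANNER", "FIGURE", "GROUND"}

# Each sentence is a list of (expression, elements) pairs. An empty set marks
# an expression that receives no label in the gloss (e.g. the agent "She").
GLOSSES = {
    "The rock moved down the hill rolling": [
        ("The rock", {"FIGURE"}), ("moved", {"MOTION"}), ("down", {"PATH"}),
        ("the hill", {"GROUND"}), ("rolling", {"MANNER"}),
    ],
    "The rock rolled down the hill": [
        ("The rock", {"FIGURE"}), ("rolled", {"MOTION", "MANNER"}),
        ("down", {"PATH"}), ("the hill", {"GROUND"}),
    ],
    "La botella entró a la cueva flotando": [
        ("La botella", {"FIGURE"}), ("entró", {"MOTION", "PATH"}),
        ("a", {"PATH"}), ("la cueva", {"GROUND"}), ("flotando", {"MANNER"}),
    ],
    "She powdered her nose": [
        ("She", set()), ("powdered", {"MOTION", "PATH", "FIGURE"}),
        ("her nose", {"GROUND"}),
    ],
    "I shelved the books": [
        ("I", set()), ("shelved", {"MOTION", "PATH", "GROUND"}),
        ("the books", {"FIGURE"}),
    ],
    "L'uomo fuggì": [
        ("L'uomo", {"FIGURE"}), ("fuggì", {"MOTION", "PATH", "MANNER"}),
    ],
}

# Sanity check: only the five elements listed above are used as labels.
assert all(elems <= ELEMENTS for gloss in GLOSSES.values() for _, elems in gloss)


def conflated(sentence):
    """Return the MOTION-bearing expression and the elements conflated with MOTION in it."""
    for expression, elements in GLOSSES[sentence]:
        if "MOTION" in elements:
            return expression, sorted(elements - {"MOTION"})
    raise ValueError("no MOTION-bearing expression in gloss")


for sentence in GLOSSES:
    print(sentence, "->", conflated(sentence))
```

An empty list for moved reflects the first example, where every element is lexicalized independently.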

Firstly, Talmy presents three basic lexicalization types for verb roots which are used by different languages in their most characteristic expression of motion:

1. MOTION + MANNER/CAUSE
2. MOTION + PATH
3. MOTION + FIGURE

Talmy provides examples of these patterns of conflation:

The first one is found in the roots of e.g.
- stand in The lamp stood on the table;
- roll in The rock rolled down the hill;
- push in I pushed the keg into the storeroom.
This pattern is typical of English but not, for instance, of Spanish (or, we could also say, of Italian), which expresses the same meanings with different constructions, as in e.g. Metí el barril a la bodega rodándolo = I rolled the keg into the storeroom.
The second pattern is typically displayed by Semitic, Polynesian and Romance languages, but not by English: whereas e.g. in Spanish we find El globo bajó por la chimenea flotando and La botella cruzó el canal flotando, in English we would find The balloon floated down the chimney and The bottle floated across the canal.
Finally, the third major typological pattern is displayed in a few English forms (e.g. I spat into the cuspidor), but the example par excellence of this type is Atsugewi, a Hokan language of northern California.
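
The three-way typology can be illustrated with a toy tally over the verb roots mentioned above. This is a sketch only: the table `ROOTS`, the function names, and the use of a simple majority count are assumptions made for illustration and are not a claim about how [Tal85] establishes a language's characteristic pattern. The Spanish roots are given in the infinitive (bajar, cruzar) for the forms bajó and cruzó.

```python
from collections import Counter, defaultdict

# The three lexicalization types as numbered in the text.
# "MANNER/CAUSE" is kept as a single label, as in the text.
TYPES = {"MANNER/CAUSE": 1, "PATH": 2, "FIGURE": 3}

# (language, verb root) -> element conflated with MOTION, read off the
# examples quoted in the text. A handful of roots, for illustration only.
ROOTS = {
    ("English", "stand"): "MANNER/CAUSE",
    ("English", "roll"): "MANNER/CAUSE",
    ("English", "push"): "MANNER/CAUSE",
    ("English", "spit"): "FIGURE",
    ("Spanish", "bajar"): "PATH",
    ("Spanish", "cruzar"): "PATH",
}


def patterns_by_language(roots):
    """Count, per language, how many listed roots fall under each type."""
    counts = defaultdict(Counter)
    for (language, _root), element in roots.items():
        counts[language][TYPES[element]] += 1
    return counts


def most_frequent_type(roots, language):
    """Return the type number with the most listed roots for a language."""
    return patterns_by_language(roots)[language].most_common(1)[0][0]


print(most_frequent_type(ROOTS, "English"))
print(most_frequent_type(ROOTS, "Spanish"))
```

With so few roots the tally merely mirrors the examples in the text; a realistic study would need a much larger verb inventory per language.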

Another interesting issue discussed by Talmy is the possibility of extending the first pattern far beyond the expression of simple motion in English, in which, e.g. MOTION and MANNER can be compounded with mental-event notions (e.g. I waved him away from the building), or with specific material in recurrent semantic complexes (e.g. I slid him another beer), etc.
Other combinatorial possibilities are considered which, however, seem to form minor systems of conflation. Furthermore, a `hierarchy' of conflation types is also proposed, where the conflation involving PATH is considered the most extensively represented, next MANNER/CAUSE, and finally the FIGURE one. Some remarks are added on the possibility of conflating GROUND with MOTION, which is however only sporadically instantiated (e.g. emplane). Further discussion is provided on lexicalization of aspect, causation etc. and of the relations between meaning components and other parts-of-speech apart from the verb. This does not however seem relevant for our purposes and, in any case, we believe that the issues treated raise problems which should not be discussed here.

Interesting discussions of lexicalization patterns are found in [Jac83], [Jac90]. Jackendoff's theory of Conceptual Semantics and the organization of Lexical Conceptual Structure (LCS) are discussed in detail in the following section. We shall only briefly recall some points of interest for our purposes. The main elements of the LCS language are: conceptual constituents, semantic fields and primitives. Then there are other elements, like conceptual variables, semantic features, constants, and lexical functions, which play minor roles. Each conceptual constituent belongs to one of a small set of ontological categories such as Thing, Event, State, Action, Place, Path, etc. Among conceptual primitives the main ones are BE, which represents a state, and GO, which represents any event. Other primitives include: STAY, CAUSE, INCH, EXT, etc. A second larger set of primitives describes prepositions: AT, IN, ON, TOWARD, FROM, TO, etc.
The LCS organization incorporates [Gru67]'s view, according to which the formalism used for encoding concepts of spatial location and motion can be abstracted and generalized to many other semantic fields (cf. next section). Thus, Jackendoff tries to extend semantic analyses provided for motion or location verbs to a wide range of other semantic fields. This turns out to require an additional elaboration of his conceptual system. At the same time, observations are added on the various correspondences between different lexicalization patterns and syntactic expressions. An interesting proposal put forward by Jackendoff (developing a suggestion from [Car88]) concerns a distinction between a MOVE-function and a GO-function: manner-of-motion verbs which cannot occur with complements referring to a PATH (more precisely, a bounded path) should only be linked to a MOVE-function. A rule is then proposed to account for (typically English) sentences containing manner-of-motion verbs allowing directional complements: a sentence like Debbie danced into the room expresses a conceptual structure that includes both a MOVE-function and a GO-function (indicating change of position). What differentiates English manner-of-motion verbs from, e.g., Spanish ones is the possibility of incorporating what Jackendoff calls a GO-Adjunct.
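
The contrast between a bare MOVE-function and a structure combining MOVE and GO can be sketched as nested records. The class `Constituent`, its field names, and the exact bracketing of the two structures are assumptions made for illustration; they are not Jackendoff's own notation. Only the category names (Event, Thing, Path, Place) and the primitives (MOVE, GO, TO, IN) are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    category: str                                   # ontological category, e.g. "Event", "Thing", "Path"
    head: str                                       # primitive function or constant, e.g. "GO", "DEBBIE"
    args: list = field(default_factory=list)        # argument constituents
    modifiers: list = field(default_factory=list)   # assumption: adjunct-like material

    def heads(self):
        """Collect every head occurring in this constituent and below it."""
        found = {self.head}
        for child in self.args + self.modifiers:
            found |= child.heads()
        return found


debbie = Constituent("Thing", "DEBBIE")
room = Constituent("Thing", "ROOM")

# "Debbie danced": manner of motion only, no path.
danced = Constituent("Event", "MOVE", [debbie])

# "Debbie danced into the room": a GO event along a path,
# with the MOVE event attached to it (hypothetical bracketing).
danced_into_room = Constituent(
    "Event",
    "GO",
    args=[debbie, Constituent("Path", "TO", [Constituent("Place", "IN", [room])])],
    modifiers=[danced],
)

print({"MOVE", "GO"} <= danced_into_room.heads())  # both functions are present
print("GO" in danced.heads())                      # the bare manner verb has no GO
```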

Both Talmy and Jackendoff observed a strict correlation between the meaning components clustered within a verb root and the verb's syntactic properties. An extensive study of the correlation between verb semantics and syntax has been provided by [Lev93]. This study shows that verb semantic classes can be identified, each characterized by particular syntactic properties (2.6.2).

Within the Acquilex project (3.10.3) work has been carried out to identify information on lexicalization of meaning components and to connect such information to the syntactic properties of verbs. MRD definitions of some classes of verbs (e.g., verbs referring to motion, to change-of-state-by-cooking, etc.) were analysed in order to link recurrent patterns to specific meaning components characterizing each class in a specific language. Furthermore, connections were stated between single components and syntactic properties displayed by the verbs under analysis (cf. e.g. [Alo94a]; [Tau94]).

Within the EuroWordNet project (3.4.3) relations between words are being encoded which allow data to be gathered on lexicalization. For instance, information on arguments involved in verb meaning is being encoded and compared cross-linguistically (cf. [AloFC]).

Relation to other areas of lexical semantics

The kinds of meaning components 'conflated' within verb roots are strongly correlated with the syntactic properties of the verbs themselves, i.e. with the possibility of verbs occurring with certain arguments (e.g. [Tal85]; [Lev93]; cf. this volume §1.4). Moreover, a clear identification of the semantic components conflated within verb roots in individual languages could be relevant also for isolating semantic classes displaying, or amenable to, similar sense extensions, given that amenability to yield different interpretations in context appears to be connected with semantic characteristics which verbs (words) share (cf. [San94]).

By adopting a strongly 'relational' view of the lexicon, then, we may identify lexicalization patterns by stating paradigmatic/syntagmatic relations between words (cf. work carried out within EuroWordNet). Thus, research on lexicalization is strictly linked to work on lexical relations such as hyponymy, meronymy, etc.

How is information encoded in lexical databases

The work carried out within the Acquilex project led to the identification of semantic components lexicalized within the roots of various verb classes. The information acquired is variously encoded in the language-specific LDBs. Furthermore, part of this information was encoded within the multilingual LKB by linking the relevant meaning components to the participant role types involved in verb meaning. For instance, the subject of the English verb swim was associated with the participant role type proto-agent-cause-move-manner, indicating that the verb involves self-causing, undirected motion for which manner is specified (cf. [San92b]).

Much information on lexicalization patterns is being encoded within the EuroWordNet database for substantial portions of the lexica of various languages. Here, information on semantic components lexicalized within word meanings is encoded by means of lexical relations applying between synsets (3.4.2).

LE applications

Results of research on lexicalization seem necessary for a variety of NLP tasks and applications. Because of

the strict correlation between the meaning components involved in a word root and its syntactic properties,
the cross-linguistic differences in the conflation of meaning components within word roots,

data on lexicalization can be useful for Word Sense Disambiguation (WSD) and all connected applications (ranging from Machine Translation (4.1) to Information Retrieval (4.3)), for NL generation (4.5), etc.


EAGLES Central Secretariat eagles@ilc.cnr.it