One of the first problems encountered by any natural language processing system is lexical ambiguity, whether syntactic or semantic. The resolution of a word's syntactic ambiguity has largely been solved by part-of-speech taggers, which predict the syntactic category of words in text with high accuracy (for example [Bri95]). The problem of resolving semantic ambiguity is generally known as word sense disambiguation, and it has proved more difficult than syntactic disambiguation.
The problem is that words often have more than one meaning, sometimes fairly similar and sometimes completely different. The meaning of a word in a particular usage can only be determined by examining its context. This is, in general, a trivial task for the human language processing system. For example, consider the following two sentences, each with a different sense of the word bank:

He fished from the bank of the river.
She deposited the cheque at the bank.
We immediately recognise that in the first sentence bank refers to the edge of a river and in the second to a building. However, the task has proved difficult for computers, and some have believed it would never be solved. An early sceptic was Bar-Hillel [Bar64], who famously proclaimed that ``sense ambiguity could not be resolved by electronic computer either current or imaginable''. He used the following example, containing the polysemous word pen, as evidence:
Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.
He argued that even if pen were given only two senses, `writing implement' and `enclosure', the computer would have no way to decide between them. Analysis of the example shows that this is a case where selectional restrictions fail to disambiguate pen: both potential senses denote physical objects in which things may be placed (although this is unlikely in the case of the first sense), so the preposition in may apply to either. Disambiguation in this case must make use of world knowledge: the relative sizes and uses of a pen as `writing implement' and a pen as `enclosure'. This shows that word sense disambiguation is an AI-complete problem.
However, the situation is not as bad as Bar-Hillel feared: there have been several advances in word sense disambiguation, and we are now at a stage where lexical ambiguity in text can be resolved with a reasonable degree of accuracy.
We can distinguish ``final'' and ``intermediate'' tasks in language processing. Final tasks are those carried out for their own usefulness; examples are machine translation, automatic summarisation and information extraction. Intermediate tasks are carried out to help final tasks; examples are part-of-speech tagging, parsing, identification of morphological roots and word sense disambiguation. These are tasks in whose results we have little interest per se.
The usefulness of intermediate tasks can be explored by looking at some of the final tasks they are likely to help. We shall now examine three tasks which word sense disambiguation has traditionally been assumed to help: information retrieval, machine translation and parsing.
We can see, then, that word sense disambiguation is likely to benefit several important NLP tasks, although it may not be as widely useful as many researchers have thought. However, the true test of word sense disambiguation technology will come when accurate disambiguation algorithms exist; we shall then be in a position to test experimentally whether or not they add to the effectiveness of these final tasks.
It is useful to distinguish some different approaches to the word sense disambiguation problem. In general we can categorise all approaches to the problem into one of three general strategies: knowledge-based, corpus-based and hybrid. We shall now look at each of these three strategies in turn.
Under this approach disambiguation is carried out using information from an explicit lexicon or knowledge base. The lexicon may be a machine readable dictionary or thesaurus, or it may be hand-crafted. This is one of the most popular approaches to word sense disambiguation; amongst others, work has been done using existing lexical knowledge sources such as WordNet [Agi96,Res95,Ric95,Sus93,Voo93], LDOCE [Cow,Gut91] and Roget's International Thesaurus [Yar92].
The information in these resources has been used in several ways. For example Wilks and Stevenson [Wil97], Harley and Glennon [Har97] and McRoy [McR92] all use large lexicons (generally machine readable dictionaries) and the information associated with the senses (such as part-of-speech tags, topical guides and selectional preferences) to indicate the correct sense. Another approach is to treat the text as an unordered bag of words, where similarity measures are calculated from the semantic similarity (as measured in the knowledge source) between all the words in the context window regardless of their positions, as was done by Yarowsky [Yar92].
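As a minimal illustration of this bag-of-words style of knowledge-based disambiguation, the following sketch scores each sense of a target word by the word overlap between its definition and the surrounding context, a simplified form of the Lesk algorithm; the two-sense inventory for bank is invented for illustration, not taken from any of the lexicons above.

# A minimal, simplified Lesk-style disambiguator: pick the sense whose
# definition shares the most words with the context window.
# The toy sense inventory below is invented for illustration.

SENSES = {
    "bank": {
        "river-edge": "sloping land beside a body of water such as a river",
        "institution": "a building or institution where money is deposited and lent",
    }
}

def disambiguate(word, context):
    """Return the sense of `word` whose definition overlaps most with `context`."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(context_words & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "she deposited the money at the bank"))  # institution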
This approach attempts to disambiguate words using information gained by training on some corpus, rather than taking it directly from an explicit knowledge source. This training can be carried out on either a disambiguated or a raw corpus, where a disambiguated corpus is one in which the sense of each polysemous lexical item is marked, and a raw corpus one without such marking.
Another approach is to use Hidden Markov Models, which have proved very successful in part-of-speech tagging. Recognising that semantic tagging is a much more difficult problem than part-of-speech tagging, [Seg97] nonetheless performed an experiment to see how well words can be semantically disambiguated using techniques that have proven effective in part-of-speech tagging. The experiment involved training such a tagger on a disambiguated corpus and evaluating its sense predictions.
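The machinery involved can be sketched as follows. This is a minimal illustration of an HMM tagger whose hidden states are semantic tags, decoded with the Viterbi algorithm; it is not [Seg97]'s actual system, and the states and probabilities are invented (a real tagger would estimate them from a sense-tagged corpus).

# Toy first-order HMM over coarse semantic tags, decoded with Viterbi.
# States, transition and emission probabilities are invented for
# illustration; a real system would estimate them from a tagged corpus.

states = ["ANIMAL", "ARTEFACT"]
start_p = {"ANIMAL": 0.5, "ARTEFACT": 0.5}
trans_p = {
    "ANIMAL":   {"ANIMAL": 0.7, "ARTEFACT": 0.3},
    "ARTEFACT": {"ANIMAL": 0.3, "ARTEFACT": 0.7},
}
emit_p = {
    "ANIMAL":   {"dog": 0.4, "bark": 0.3, "pen": 0.1},
    "ARTEFACT": {"dog": 0.05, "bark": 0.15, "pen": 0.5},
}

def viterbi(words):
    """Return the most probable semantic tag sequence for `words`."""
    # V[t][s] = best probability of any path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(words[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(words[t], 1e-6), p)
                for p in states
            )
            V[t][s], back[t][s] = prob, prev
    # Trace back the best path from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["dog", "bark"]))  # ['ANIMAL', 'ANIMAL']
print(viterbi(["pen", "bark"]))  # ['ARTEFACT', 'ARTEFACT']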
The general problem with these methods is their reliance on disambiguated corpora, which are expensive and difficult to obtain. This has meant that many of these algorithms have been tested on very small numbers of different words, often as few as 10.
The first type of artificial corpus which has been used extensively is the parallel corpus. A bilingual corpus consists of two corpora which contain the same text in different languages (for example, one may be a translation of the other, or they may have been produced by an organisation such as the United Nations or the European Union, which routinely transcribe meetings in several languages). Sentence alignment is the process of taking such a corpus and matching the sentences which are translations of each other; several algorithms exist to carry this out with a high degree of success (e.g. [Cat89], [Gal92b]). A bilingual corpus which has been sentence aligned becomes an aligned parallel corpus. This is an interesting resource since it consists of many examples of sentences and their translations. These corpora have been used for word sense disambiguation (see [Bro91] and [Gal92b]) by exploiting words whose senses translate differently across languages. These researchers used the Canadian Hansard, the proceedings of the Canadian Parliament, which are published in both French and English, and words such as ``duty'', which translates as ``devoir'' in the sense of `moral duty' and as ``droit'' when it means `tax'. They took all the sentence pairs with ``duty'' in the English sentence and split them into two groups, roughly corresponding to senses, depending upon which word appeared in the French sentence of the pair. In this way a level of disambiguation suitable for a machine translation application could be trained and tested without hand-tagging.
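The core of this technique can be sketched as follows. This is a toy illustration of the grouping step only, not the actual systems of [Bro91] or [Gal92b], and the aligned pairs are invented.

# Toy illustration: use the French translation in aligned sentence pairs
# as a proxy sense label for the ambiguous English word "duty".
# The aligned pairs below are invented for illustration.

aligned_pairs = [
    ("it is my duty to report this", "c'est mon devoir de signaler ceci"),
    ("a duty on imported goods", "un droit sur les marchandises importées"),
    ("he did his duty", "il a fait son devoir"),
]

# Map the French translation word to the sense group it indicates
sense_of = {"devoir": "moral-duty", "droit": "tax"}

groups = {"moral-duty": [], "tax": []}
for english, french in aligned_pairs:
    if "duty" in english.split():
        for word, sense in sense_of.items():
            if word in french.split():
                groups[sense].append(english)

print(groups)  # English sentences grouped by sense, with no hand-tagging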
There are two ways of creating artificial sense-tagged corpora. The first is to disambiguate the words by some means, as happens in the case of parallel corpora; the other is to add ambiguity to the corpus and have the algorithm attempt to resolve this ambiguity to recover the original corpus. Yarowsky [Yar93] used this method by creating a corpus which contained ``pseudo-words''. These are created by choosing two words (``crocodile'' and ``shoes'', for the sake of argument) and replacing each occurrence of either with their concatenation (``crocodile/shoes'').
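Creating such a pseudo-word corpus is straightforward, as the following sketch shows; a disambiguation algorithm is then scored by how often it recovers the original word at each occurrence.

import re

# Replace every occurrence of either member of a word pair with their
# concatenation, creating an artificially ambiguous "pseudo-word".
def make_pseudo_word_corpus(text, word1, word2):
    pseudo = f"{word1}/{word2}"
    pattern = re.compile(rf"\b({re.escape(word1)}|{re.escape(word2)})\b")
    return pattern.sub(pseudo, text)

corpus = "the crocodile slid into the river while she laced her shoes"
print(make_pseudo_word_corpus(corpus, "crocodile", "shoes"))
# -> the crocodile/shoes slid into the river while she laced her crocodile/shoes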
It is often difficult to obtain appropriate lexical resources (especially for texts in a specialised sublanguage), and we have already noted the difficulty of obtaining disambiguated text for supervised disambiguation. This lack of resources has led several researchers to explore the use of unannotated, raw corpora to perform unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot actually label specific terms as referring to a specific concept: that would require more information than is available. What unsupervised disambiguation can achieve is word sense discrimination: it clusters the instances of a word into distinct categories without giving those categories labels from a lexicon (such as LDOCE sense numbers or WordNet synsets).
An example of this is the dynamic matching technique [Rad96], which examines all instances of a given term in a corpus and compares the contexts in which they occur for common words and syntactic patterns. The resulting similarity matrix is then subjected to cluster analysis to determine groups of semantically related instances of terms.
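A rough sketch of this kind of discrimination is given below. It is not the dynamic matching algorithm itself: similarity here is plain word overlap after stop-word removal, whereas [Rad96] also uses syntactic patterns, and the seeding heuristic is a crude stand-in for real cluster analysis. The contexts are invented.

# Discriminate senses of "bank" by clustering contexts of its instances.
# Similarity is the Jaccard coefficient over content words; the two
# clusters are seeded with the least similar pair of instances.

STOP = {"the", "a", "an", "of", "on", "at", "in", "this", "its", "she", "he"}

contexts = [
    "sat on the grassy bank of the river watching the water flow",
    "the bank raised interest rates on deposit accounts this week",
    "fish swam near the muddy bank of the river in the shallow water",
    "she opened a deposit account at the bank in the city",
]

def words(text):
    return set(text.split()) - STOP

def similarity(a, b):
    """Jaccard similarity between the content words of two contexts."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

n = len(contexts)
matrix = [[similarity(contexts[i], contexts[j]) for j in range(n)] for i in range(n)]

# Seed two clusters with the least similar pair, then assign every
# instance to the seed it resembles more.
seed_a, seed_b = min(
    ((i, j) for i in range(n) for j in range(i + 1, n)),
    key=lambda p: matrix[p[0]][p[1]],
)
clusters = {seed_a: [], seed_b: []}
for i in range(n):
    nearest = max(clusters, key=lambda s: matrix[i][s])
    clusters[nearest].append(contexts[i])

for seed, members in clusters.items():
    print(f"cluster around instance {seed}: {members}")

On this toy data the river contexts and the financial contexts fall into separate clusters, but note that the clusters carry no lexicon labels: this is discrimination, not labelling.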
Another example is the work of Pedersen [Ped97], who compared three different unsupervised learning algorithms on 13 different words. Each algorithm was trained on text which was tagged with either the WordNet or LDOCE sense for the word, but the algorithm had no access to the true senses. What it did have access to was the number of senses for each word, and each algorithm split the instances of each word into the appropriate number of clusters. These clusters were then mapped onto the closest sense from the appropriate lexicon. Unfortunately the results are not very encouraging: Pedersen reports 65-66% correct disambiguation depending on the learning algorithm used. This result should be compared against the fact that, in the corpus he used, 73% of the instances could be correctly classified by simply choosing the most frequent sense.
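The baseline Pedersen compares against is worth making explicit: label every instance of a word with its most frequent sense. A sketch, with invented counts chosen to reproduce the 73% figure:

from collections import Counter

# Most-frequent-sense baseline: label every instance of a word with its
# commonest sense in the gold standard. Counts here are invented.
gold_labels = ["sense1"] * 73 + ["sense2"] * 20 + ["sense3"] * 7

most_frequent = Counter(gold_labels).most_common(1)[0][0]
accuracy = gold_labels.count(most_frequent) / len(gold_labels)
print(f"baseline accuracy: {accuracy:.0%}")  # 73%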
These approaches can be properly classified as neither knowledge-based nor corpus-based, but make use of both. A good example is Luk's system [Luk95], which uses the textual definitions of senses from a machine readable dictionary (LDOCE) to identify relations between senses. He then uses a corpus to calculate mutual information scores between these related senses in order to discover the most useful ones. This allowed Luk to produce a system which used the information in lexical resources as a way of reducing the amount of text needed in the training corpus.
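The corpus step can be illustrated with a sketch of pointwise mutual information between two co-occurring items; this is the generic formulation, not Luk's exact scoring, and the toy corpus is invented.

import math

# Pointwise mutual information between two items (here plain words
# standing in for definition-derived senses), estimated from their
# co-occurrence in a list of contexts.

def pmi(x, y, contexts):
    """PMI of x and y over a list of contexts, each a set of items."""
    n = len(contexts)
    p_x = sum(x in c for c in contexts) / n
    p_y = sum(y in c for c in contexts) / n
    p_xy = sum(x in c and y in c for c in contexts) / n
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

corpus = [set(s.split()) for s in [
    "the court heard the case today",
    "the case went to court",
    "he packed a case with clothes",
    "the meeting was adjourned early",
    "she bought new clothes today",
    "they played tennis outside",
]]
print(pmi("case", "court", corpus))  # 1.0: they co-occur more than chance predicts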
Another example of this approach is the unsupervised algorithm of Yarowsky [Yar95]. This takes a small number of seed definitions of the senses of some word (the seeds could be WordNet synsets or definitions from some lexicon) and uses these to classify the ``obvious'' cases in a corpus. Decision lists [Riv87] are then used to make generalisations based on the corpus instances classified so far, and these lists are re-applied to the corpus to classify more instances. The learning proceeds in this way until all corpus instances are classified. Yarowsky reports that the system correctly classifies senses 96% of the time.
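A skeletal version of this bootstrapping loop is sketched below. The seed clues and contexts are invented, and a single content word stands in for the ranked collocational features of a real decision list, which [Yar95] orders by log-likelihood.

# Skeletal Yarowsky-style bootstrapping: classify the obvious cases with
# seed clues, learn new clues from the newly labelled contexts, and
# repeat until every instance is classified or no rule fires.

TARGET = "plant"  # the ambiguous word itself is never used as a clue
STOP = {"the", "a", "and", "on", "at", "in", "it", "its", "she", "to"}

contexts = [
    "plant workers went on strike at the factory",
    "the plant employs two hundred workers",
    "she watered the plant and its green shoots grew",
    "the plant has broad green leaves",
]
rules = {"factory": "industrial", "leaves": "living"}  # seed definitions

labels = {}
while len(labels) < len(contexts):
    new_rules, progress = {}, False
    for i, ctx in enumerate(contexts):
        if i in labels:
            continue
        for clue, sense in rules.items():
            if clue in ctx.split():
                labels[i], progress = sense, True
                # Generalise: other content words in this context become
                # clues for the same sense (a crude stand-in for a
                # log-likelihood-ranked decision list).
                for w in ctx.split():
                    if w != TARGET and w not in STOP:
                        new_rules.setdefault(w, sense)
                break
    if not progress:
        break  # no rule fired for any remaining instance
    rules.update(new_rules)

for i, ctx in enumerate(contexts):
    print(labels.get(i, "?"), "|", ctx)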
One application of semantic tagging is in the framework of an intelligent online dictionary lookup tool such as LocoLex [Bau95]. LocoLex is a tool developed at RXRC which looks up a word in a bilingual dictionary, taking the syntactic context into account. For instance, in a sentence such as They like to swim, the part-of-speech tagger in LocoLex determines that like is a verb and not a preposition. Accordingly, the dictionary lookup component provides the user with the translation for the verb only. LocoLex also detects multi-word expressions. For instance, when stuck appears in the sentence my own parents stuck together, the translation displayed after the user clicks on stuck is the one for the whole phrase stick together and not only for the word stick.
Currently LocoLex is purely syntactic and cannot distinguish between the different meanings of a noun like bark. If, in addition to the current syntactic tags, we had access to the semantic tags provided by WordNet for this word (natural event or plant) and if we were able to include this label in the online dictionary, this would improve the bilingual dictionary access of LocoLex even further.
Current bilingual dictionaries often include some semantic marking. For instance, looking at the OUP-Hachette English-French dictionary under bark, we find the label Bot(anical) attached to one meaning and the collocator (of dog) associated with the other. It is possible that some type of automated matching between these indications and the WordNet semantic tags would allow the integration of a semantic tagger into LocoLex.
Using only existing dictionary labels might still not be completely satisfactory for machine translation purposes. Indeed, looking back at the example my own parents stuck together, even if we retrieved the multi-word expression meaning, it would be difficult to decide which translation to choose with existing dictionary indications. For instance, for stick together the Oxford-Hachette English-French dictionary gives:
stick together
1. (become fixed to each other) (pages) se coller
2. (Coll) (remain loyal) se serrer les coudes (Fam) être solidaire
3. (Coll) (not separate) rester ensemble
One could go one step further by using the sense indicators in the Oxford-Hachette dictionary: (become fixed to each other), (remain loyal), (not separate). These sense indicators are remnants of definitions and often turn out to be synonyms of the entry. There are about 27,000 of them, so building an HMM tagger for them is not possible. We can, however, reduce their number by grouping them into higher-level classes. For instance we could group together old man, old person, young man, old woman, etc. under person. Then we can use a statistical method such as the one described in [Yar95] to choose the most appropriate meaning in context. How to evaluate the results on large corpora remains an open question.
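The grouping step itself can be sketched very simply; the indicator-to-class mapping below is invented for illustration (in practice it might be derived from a thesaurus or from WordNet hypernyms).

# Collapse fine-grained sense indicators into higher-level classes so that
# statistical methods have fewer, better-populated categories to work with.
# The mapping below is invented for illustration.

CLASS_OF = {
    "old man": "person", "old person": "person",
    "young man": "person", "old woman": "person",
    "become fixed to each other": "attach",
    "remain loyal": "social", "not separate": "social",
}

def higher_level_class(indicator):
    """Map a dictionary sense indicator to its higher-level class."""
    return CLASS_OF.get(indicator, indicator)  # fall back to the raw indicator

print(higher_level_class("old woman"))     # person
print(higher_level_class("remain loyal"))  # social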
Another step can be achieved by using the verb subcategorisation frame, together with selectional restrictions for its arguments, and shallow parsing.
At RXRC we have developed a shallow parser for French (Aït-Mokhtar and Chanod, 1997). The advantages of using shallow parsing are many. Consider, for example, the parse of the sentence J'ai assisté à la réunion de ce matin:
[VC [NP j' NP]/SUBJ :v ai assisté v: VC] [PP à la réunion PP] [PP de ce matin PP]

From the above parse we learn that there is a subject, that it is a noun phrase, and that there are two prepositional phrases, one of them introduced by the preposition à. We can therefore select the meaning associated with only the first subcategorisation frame for assister: `to attend (a meeting)'.
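A minimal sketch of this selection step follows; the subcategorisation frames and glosses for assister are invented for illustration, not taken from the parser or the dictionary.

# Pick a verb sense by matching the grammatical functions found in the
# parse against each sense's subcategorisation frame.
# Frames and glosses for "assister" are invented for illustration.

FRAMES = [
    # (required grammatical functions, sense)
    ({"SUBJ", "PP-à"}, "to attend"),
    ({"SUBJ", "OBJ"}, "to assist, to help"),
]

def select_sense(functions):
    """Return the first sense whose frame is satisfied by the parse."""
    for required, sense in FRAMES:
        if required <= functions:
            return sense
    return None

# Functions extracted from the parse above: a subject and a PP in "à"
print(select_sense({"SUBJ", "PP-à"}))  # to attend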
Still, in some cases even the subcategorisation frame is not enough, and one needs access to ontologies in order to express selectional restrictions. In other words, one needs more information regarding the semantic type of the verb's arguments. Consider now the sentence Je bouche le trou avec du ciment and its parse:
[VC [NP je NP]/SUBJ :v bouche v: VC] [NP le trou NP]/OBJ [PP avec du ciment PP]
If we look in the Oxford-Hachette bilingual dictionary, we find the meanings below (in this case translations) associated with the transitive use of the verb boucher:
boucher vtr
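A minimal sketch of how selectional restrictions over a small ontology might drive the choice among such translations follows; the ontology, semantic types and translation table are invented for illustration, not quoted from the dictionary.

# Choose among translations of transitive "boucher" by the semantic type
# of its direct object, looked up in a small hand-built ontology.
# The ontology and translation table below are invented for illustration.

ONTOLOGY = {          # noun -> semantic type
    "trou": "opening",
    "passage": "way",
    "bouteille": "container",
}

TRANSLATION = {       # semantic type of the object -> translation
    "opening": "to fill",
    "way": "to block",
    "container": "to cork",
}

def translate_boucher(object_head):
    """Pick a translation of 'boucher' from its object's semantic type."""
    sem_type = ONTOLOGY.get(object_head)
    return TRANSLATION.get(sem_type, "to block (default)")

# Object head extracted from the parse above: [NP le trou NP]/OBJ
print(translate_boucher("trou"))  # to fill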