One of the first problems encountered by any natural language processing system is lexical ambiguity, whether syntactic or semantic. The resolution of a word's syntactic ambiguity has largely been solved by part-of-speech taggers, which predict the syntactic category of words in text with high accuracy (for example [Bri95]). The problem of resolving semantic ambiguity is generally known as word sense disambiguation, and it has proved more difficult than syntactic disambiguation.
The problem is that words often have more than one meaning, sometimes fairly similar and sometimes completely different. The meaning of a word in a particular usage can only be determined by examining its context. This is, in general, a trivial task for the human language processing system. For example, consider the following two sentences, each with a different sense of the word bank:

He fished from the bank of the river.
She deposited the cheque at the bank.
We immediately recognise that in the first sentence bank refers to the edge of a river and in the second to a building. However, the task has proved difficult for computers, and some have believed it would never be solved. An early sceptic was Bar-Hillel [Bar64], who famously proclaimed that ``sense ambiguity could not be resolved by electronic computer either current or imaginable''. He used the following example, containing the polysemous word pen, as evidence:
Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.
He argued that even if pen were given only two senses, `writing implement' and `enclosure', the computer would have no way to decide between them. Analysis of the example shows that this is a case where selectional restrictions fail to disambiguate pen: both potential senses denote physical objects in which things may be placed (although this is unlikely in the case of the first sense), so the preposition in may apply to either. Disambiguation in this case must make use of world knowledge: the relative sizes and uses of a pen as `writing implement' and a pen as `enclosure'. This shows that word sense disambiguation is an AI-complete problem.
However, the situation is not as bad as Bar-Hillel feared: there have been several advances in word sense disambiguation, and we are now at a stage where lexical ambiguity in text can be resolved with a reasonable degree of accuracy.
We can distinguish ``final'' and ``intermediate'' tasks in language processing. Final tasks are those carried out for their own usefulness; examples are machine translation, automatic summarisation and information extraction. Intermediate tasks are carried out to help final tasks; examples are part-of-speech tagging, parsing, identification of morphological roots and word sense disambiguation. These are tasks in whose results we have little interest per se.
The usefulness of intermediate tasks can be explored by looking at some of the final tasks they are likely to help. We shall now examine three tasks which word sense disambiguation has traditionally been assumed to help: information retrieval, machine translation and parsing.
We can see, then, that word sense disambiguation is likely to benefit several important NLP tasks, although it may not be as widely useful as many researchers have thought. However, the true test of word sense disambiguation technology will come when accurate disambiguation algorithms exist; we shall then be in a position to test experimentally whether or not they add to the effectiveness of these final tasks.
It is useful to distinguish some different approaches to the word sense disambiguation problem. In general we can categorise all approaches to the problem into one of three general strategies: knowledge-based, corpus-based and hybrid. We shall now look at each of these three strategies in turn.
Under this approach disambiguation is carried out using information from an explicit lexicon or knowledge base. The lexicon may be a machine readable dictionary or thesaurus, or it may be hand-crafted. This is one of the most popular approaches to word sense disambiguation; amongst others, work has been done using existing lexical knowledge sources such as WordNet [Agi96,Res95,Ric95,Sus93,Voo93], LDOCE [Cow,Gut91] and Roget's International Thesaurus [Yar92].
The information in these resources has been used in several ways. For example Wilks and Stevenson [Wil97], Harley and Glennon [Har97] and McRoy [McR92] all use large lexicons (generally machine readable dictionaries) and the information associated with the senses (such as part-of-speech tags, topical guides and selectional preferences) to indicate the correct sense. Another approach is to treat the text as an unordered bag of words, where similarity measures are calculated from the semantic similarity (as measured in the knowledge source) between all the words in the context window regardless of their positions, as was done by Yarowsky [Yar92].
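As a minimal illustration of this bag-of-words style of knowledge-based disambiguation, the following sketch scores each sense of a target word by the word overlap between its definition and the surrounding context, a simplified form of the Lesk algorithm; the two-sense inventory for bank is invented for illustration, not taken from any of the lexicons above.

# A minimal, simplified Lesk-style disambiguator: pick the sense whose
# definition shares the most words with the context window.
# The toy sense inventory below is invented for illustration.

SENSES = {
    "bank": {
        "river-edge": "sloping land beside a body of water such as a river",
        "institution": "a building or institution where money is deposited and lent",
    }
}

def disambiguate(word, context):
    """Return the sense of `word` whose definition overlaps most with `context`."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(context_words & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "she deposited the money at the bank"))  # institution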
This approach attempts to disambiguate words using information gained by training on some corpus, rather than taking it directly from an explicit knowledge source. This training can be carried out on either a disambiguated or a raw corpus, where a disambiguated corpus is one in which the sense of each polysemous lexical item is marked, and a raw corpus one without such marking.
Another approach is to use Hidden Markov Models, which have proved very successful in part-of-speech tagging. Recognising that semantic tagging is a much more difficult problem than part-of-speech tagging, [Seg97] nonetheless performed an experiment to see how well words can be semantically disambiguated using techniques that have proven effective in part-of-speech tagging. The experiment involved training such a tagger on a disambiguated corpus and evaluating its sense predictions.
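The machinery involved can be sketched as follows. This is a minimal illustration of an HMM tagger whose hidden states are semantic tags, decoded with the Viterbi algorithm; it is not [Seg97]'s actual system, and the states and probabilities are invented (a real tagger would estimate them from a sense-tagged corpus).

# Toy first-order HMM over coarse semantic tags, decoded with Viterbi.
# States, transition and emission probabilities are invented for
# illustration; a real system would estimate them from a tagged corpus.

states = ["ANIMAL", "ARTEFACT"]
start_p = {"ANIMAL": 0.5, "ARTEFACT": 0.5}
trans_p = {
    "ANIMAL":   {"ANIMAL": 0.7, "ARTEFACT": 0.3},
    "ARTEFACT": {"ANIMAL": 0.3, "ARTEFACT": 0.7},
}
emit_p = {
    "ANIMAL":   {"dog": 0.4, "bark": 0.3, "pen": 0.1},
    "ARTEFACT": {"dog": 0.05, "bark": 0.15, "pen": 0.5},
}

def viterbi(words):
    """Return the most probable semantic tag sequence for `words`."""
    # V[t][s] = best probability of any path ending in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(words[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(words[t], 1e-6), p)
                for p in states
            )
            V[t][s], back[t][s] = prob, prev
    # Trace back the best path from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["dog", "bark"]))  # ['ANIMAL', 'ANIMAL']
print(viterbi(["pen", "bark"]))  # ['ARTEFACT', 'ARTEFACT']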
The general problem with these methods is their reliance on disambiguated corpora, which are expensive and difficult to obtain. This has meant that many of these algorithms have been tested on very small numbers of different words, often as few as 10.
The first type of artificial corpus which has been used extensively is the parallel corpus. A bilingual corpus consists of two corpora which contain the same text in different languages (for example, one may be a translation of the other, or they may have been produced by an organisation such as the United Nations or the European Union, which routinely transcribe meetings in several languages). Sentence alignment is the process of taking such a corpus and matching the sentences which are translations of each other; several algorithms exist to carry this out with a high degree of success (e.g. [Cat89], [Gal92b]). A bilingual corpus which has been sentence aligned becomes an aligned parallel corpus. This is an interesting resource since it consists of many examples of sentences and their translations. These corpora have been used for word sense disambiguation (see [Bro91] and [Gal92b]) by exploiting words whose senses translate differently across languages. These researchers used the Canadian Hansard, the proceedings of the Canadian Parliament, which are published in both French and English, and words such as ``duty'', which translates as ``devoir'' in the sense of `moral duty' and as ``droit'' when it means `tax'. They took all the sentence pairs with ``duty'' in the English sentence and split them into two groups, roughly corresponding to senses, depending upon which word appeared in the French sentence of the pair. In this way a level of disambiguation suitable for a machine translation application could be trained and tested without hand-tagging.
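The core of this technique can be sketched as follows. This is a toy illustration of the grouping step only, not the actual systems of [Bro91] or [Gal92b], and the aligned pairs are invented.

# Toy illustration: use the French translation in aligned sentence pairs
# as a proxy sense label for the ambiguous English word "duty".
# The aligned pairs below are invented for illustration.

aligned_pairs = [
    ("it is my duty to report this", "c'est mon devoir de signaler ceci"),
    ("a duty on imported goods", "un droit sur les marchandises importées"),
    ("he did his duty", "il a fait son devoir"),
]

# Map the French translation word to the sense group it indicates
sense_of = {"devoir": "moral-duty", "droit": "tax"}

groups = {"moral-duty": [], "tax": []}
for english, french in aligned_pairs:
    if "duty" in english.split():
        for word, sense in sense_of.items():
            if word in french.split():
                groups[sense].append(english)

print(groups)  # English sentences grouped by sense, with no hand-tagging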
There are two ways of creating artificial sense-tagged corpora. The first is to disambiguate the words by some means, as happens in the case of parallel corpora; the other is to add ambiguity to the corpus and have the algorithm attempt to resolve this ambiguity to recover the original corpus. Yarowsky [Yar93] used this method by creating a corpus which contained ``pseudo-words''. These are created by choosing two words (``crocodile'' and ``shoes'', for the sake of argument) and replacing each occurrence of either with their concatenation (``crocodile/shoes'').
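Creating such a pseudo-word corpus is straightforward, as the following sketch shows; a disambiguation algorithm is then scored by how often it recovers the original word at each occurrence.

import re

# Replace every occurrence of either member of a word pair with their
# concatenation, creating an artificially ambiguous "pseudo-word".
def make_pseudo_word_corpus(text, word1, word2):
    pseudo = f"{word1}/{word2}"
    pattern = re.compile(rf"\b({re.escape(word1)}|{re.escape(word2)})\b")
    return pattern.sub(pseudo, text)

corpus = "the crocodile slid into the river while she laced her shoes"
print(make_pseudo_word_corpus(corpus, "crocodile", "shoes"))
# -> the crocodile/shoes slid into the river while she laced her crocodile/shoes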
It is often difficult to obtain appropriate lexical resources (especially for texts in a specialised sublanguage), and we have already noted the difficulty of obtaining disambiguated text for supervised disambiguation. This lack of resources has led several researchers to explore the use of unannotated, raw corpora to perform unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot actually label specific terms as referring to a specific concept: that would require more information than is available. What unsupervised disambiguation can achieve is word sense discrimination: it clusters the instances of a word into distinct categories without giving those categories labels from a lexicon (such as LDOCE sense numbers or WordNet synsets).
An example of this is the dynamic matching technique [Rad96], which examines all instances of a given term in a corpus and compares the contexts in which they occur for common words and syntactic patterns. The resulting similarity matrix is then subjected to cluster analysis to determine groups of semantically related instances of terms.
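A rough sketch of this kind of discrimination is given below. It is not the dynamic matching algorithm itself: similarity here is plain word overlap after stop-word removal, whereas [Rad96] also uses syntactic patterns, and the seeding heuristic is a crude stand-in for real cluster analysis. The contexts are invented.

# Discriminate senses of "bank" by clustering contexts of its instances.
# Similarity is the Jaccard coefficient over content words; the two
# clusters are seeded with the least similar pair of instances.

STOP = {"the", "a", "an", "of", "on", "at", "in", "this", "its", "she", "he"}

contexts = [
    "sat on the grassy bank of the river watching the water flow",
    "the bank raised interest rates on deposit accounts this week",
    "fish swam near the muddy bank of the river in the shallow water",
    "she opened a deposit account at the bank in the city",
]

def words(text):
    return set(text.split()) - STOP

def similarity(a, b):
    """Jaccard similarity between the content words of two contexts."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

n = len(contexts)
matrix = [[similarity(contexts[i], contexts[j]) for j in range(n)] for i in range(n)]

# Seed two clusters with the least similar pair, then assign every
# instance to the seed it resembles more.
seed_a, seed_b = min(
    ((i, j) for i in range(n) for j in range(i + 1, n)),
    key=lambda p: matrix[p[0]][p[1]],
)
clusters = {seed_a: [], seed_b: []}
for i in range(n):
    nearest = max(clusters, key=lambda s: matrix[i][s])
    clusters[nearest].append(contexts[i])

for seed, members in clusters.items():
    print(f"cluster around instance {seed}: {members}")

On this toy data the river contexts and the financial contexts fall into separate clusters, but note that the clusters carry no lexicon labels: this is discrimination, not labelling.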
Another example is the work of Pedersen [Ped97], who compared three different unsupervised learning algorithms on 13 different words. Each algorithm was trained on text which was tagged with either the WordNet or LDOCE sense for the word, but the algorithm had no access to the true senses. What it did have access to was the number of senses for each word, and each algorithm split the instances of each word into the appropriate number of clusters. These clusters were then mapped onto the closest sense from the appropriate lexicon. Unfortunately the results are not very encouraging: Pedersen reports 65-66% correct disambiguation depending on the learning algorithm used. This result should be compared against the fact that, in the corpus he used, 73% of the instances could be correctly classified by simply choosing the most frequent sense.
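The baseline Pedersen compares against is worth making explicit: label every instance of a word with its most frequent sense. A sketch, with invented counts chosen to reproduce the 73% figure:

from collections import Counter

# Most-frequent-sense baseline: label every instance of a word with its
# commonest sense in the gold standard. Counts here are invented.
gold_labels = ["sense1"] * 73 + ["sense2"] * 20 + ["sense3"] * 7

most_frequent = Counter(gold_labels).most_common(1)[0][0]
accuracy = gold_labels.count(most_frequent) / len(gold_labels)
print(f"baseline accuracy: {accuracy:.0%}")  # 73%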
These approaches can be properly classified as neither knowledge-based nor corpus-based, but make use of both. A good example is Luk's system [Luk95], which uses the textual definitions of senses from a machine readable dictionary (LDOCE) to identify relations between senses. He then uses a corpus to calculate mutual information scores between these related senses in order to discover the most useful ones. This allowed Luk to produce a system which used the information in lexical resources as a way of reducing the amount of text needed in the training corpus.
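The corpus step can be illustrated with a sketch of pointwise mutual information between two co-occurring items; this is the generic formulation, not Luk's exact scoring, and the toy corpus is invented.

import math

# Pointwise mutual information between two items (here plain words
# standing in for definition-derived senses), estimated from their
# co-occurrence in a list of contexts.

def pmi(x, y, contexts):
    """PMI of x and y over a list of contexts, each a set of items."""
    n = len(contexts)
    p_x = sum(x in c for c in contexts) / n
    p_y = sum(y in c for c in contexts) / n
    p_xy = sum(x in c and y in c for c in contexts) / n
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

corpus = [set(s.split()) for s in [
    "the court heard the case today",
    "the case went to court",
    "he packed a case with clothes",
    "the meeting was adjourned early",
    "she bought new clothes today",
    "they played tennis outside",
]]
print(pmi("case", "court", corpus))  # 1.0: they co-occur more than chance predicts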
Another example of this approach is the unsupervised algorithm of Yarowsky [Yar95]. This takes a small number of seed definitions of the senses of some word (the seeds could be WordNet synsets or definitions from some lexicon) and uses these to classify the ``obvious'' cases in a corpus. Decision lists [Riv87] are then used to make generalisations based on the corpus instances classified so far, and these lists are re-applied to the corpus to classify more instances. The learning proceeds in this way until all corpus instances are classified. Yarowsky reports that the system correctly classifies senses 96% of the time.
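A skeletal version of this bootstrapping loop is sketched below. The seed clues and contexts are invented, and a single content word stands in for the ranked collocational features of a real decision list, which [Yar95] orders by log-likelihood.

# Skeletal Yarowsky-style bootstrapping: classify the obvious cases with
# seed clues, learn new clues from the newly labelled contexts, and
# repeat until every instance is classified or no rule fires.

TARGET = "plant"  # the ambiguous word itself is never used as a clue
STOP = {"the", "a", "and", "on", "at", "in", "it", "its", "she", "to"}

contexts = [
    "plant workers went on strike at the factory",
    "the plant employs two hundred workers",
    "she watered the plant and its green shoots grew",
    "the plant has broad green leaves",
]
rules = {"factory": "industrial", "leaves": "living"}  # seed definitions

labels = {}
while len(labels) < len(contexts):
    new_rules, progress = {}, False
    for i, ctx in enumerate(contexts):
        if i in labels:
            continue
        for clue, sense in rules.items():
            if clue in ctx.split():
                labels[i], progress = sense, True
                # Generalise: other content words in this context become
                # clues for the same sense (a crude stand-in for a
                # log-likelihood-ranked decision list).
                for w in ctx.split():
                    if w != TARGET and w not in STOP:
                        new_rules.setdefault(w, sense)
                break
    if not progress:
        break  # no rule fired for any remaining instance
    rules.update(new_rules)

for i, ctx in enumerate(contexts):
    print(labels.get(i, "?"), "|", ctx)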
One application of semantic tagging is in the framework of an intelligent online dictionary lookup tool such as LocoLex [Bau95]. LocoLex is a tool developed at RXRC which looks up a word in a bilingual dictionary, taking the syntactic context into account. For instance, in a sentence such as They like to swim, the part-of-speech tagger in LocoLex determines that like is a verb and not a preposition. Accordingly, the dictionary lookup component provides the user with the translation for the verb only. LocoLex also detects multi-word expressions. For instance, when stuck appears in the sentence my own parents stuck together, the translation displayed after the user clicks on stuck is the one for the whole phrase stick together and not only for the word stick.
Currently LocoLex is purely syntactic and cannot distinguish between the different meanings of a noun like bark. If, in addition to the current syntactic tags, we had access to the semantic tags provided by WordNet for this word (natural event or plant) and if we were able to include this label in the online dictionary, this would improve the bilingual dictionary access of LocoLex even further.
Current bilingual dictionaries often include some semantic marking. For instance, looking at the OUP-Hachette English-French dictionary under bark, we find the label Bot(anical) attached to one meaning and the collocator (of dog) associated with the other. It is possible that some type of automated matching between these indications and the WordNet semantic tags would allow the integration of a semantic tagger into LocoLex.
Using only existing dictionary labels might still not be completely satisfactory for machine translation purposes. Indeed, looking back at the example my own parents stuck together, even if we retrieved the multi-word expression meaning, it would be difficult to decide which translation to choose with existing dictionary indications. For instance, for stick together the Oxford-Hachette English-French dictionary gives:
stick together
1. (become fixed to each other) (pages) se coller
2. (Coll) (remain loyal) se serrer les coudes (Fam) être solidaire
3. (Coll) (not separate) rester ensemble
One could go one step further by using the sense indicators in the Oxford-Hachette dictionary: (become fixed to each other), (remain loyal), (not separate). These sense indicators are remnants of definitions and often turn out to be synonyms of the entry. There are about 27,000 of them, so building an HMM tagger for them is not possible. We can, however, reduce their number by grouping them into higher-level classes. For instance we could group together old man, old person, young man, old woman, etc. under person. Then we can use a statistical method such as the one described in [Yar95] to choose the most appropriate meaning in context. How to evaluate the results on large corpora remains an open question.
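The grouping step itself can be sketched very simply; the indicator-to-class mapping below is invented for illustration (in practice it might be derived from a thesaurus or from WordNet hypernyms).

# Collapse fine-grained sense indicators into higher-level classes so that
# statistical methods have fewer, better-populated categories to work with.
# The mapping below is invented for illustration.

CLASS_OF = {
    "old man": "person", "old person": "person",
    "young man": "person", "old woman": "person",
    "become fixed to each other": "attach",
    "remain loyal": "social", "not separate": "social",
}

def higher_level_class(indicator):
    """Map a dictionary sense indicator to its higher-level class."""
    return CLASS_OF.get(indicator, indicator)  # fall back to the raw indicator

print(higher_level_class("old woman"))     # person
print(higher_level_class("remain loyal"))  # social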
Another step can be achieved by using the verb subcategorisation frame, together with selectional restrictions for its arguments, and shallow parsing.
At RXRC we have developed a shallow parser for French (Aït-Mokhtar and Chanod, 1997). The advantages of using shallow parsing are many. Consider, for example, the parse of the sentence J'ai assisté à la réunion de ce matin:
[VC [NP j' NP]/SUBJ :v ai assisté v: VC] [PP à la réunion PP] [PP de ce matin PP]

From the above parse we learn that there is a subject, that it is a noun phrase, and that there are two prepositional phrases, one of them introduced by the preposition à. We can therefore select the meaning associated with only the first subcategorisation frame for assister: `to attend (a meeting)'.
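A minimal sketch of this selection step follows; the subcategorisation frames and glosses for assister are invented for illustration, not taken from the parser or the dictionary.

# Pick a verb sense by matching the grammatical functions found in the
# parse against each sense's subcategorisation frame.
# Frames and glosses for "assister" are invented for illustration.

FRAMES = [
    # (required grammatical functions, sense)
    ({"SUBJ", "PP-à"}, "to attend"),
    ({"SUBJ", "OBJ"}, "to assist, to help"),
]

def select_sense(functions):
    """Return the first sense whose frame is satisfied by the parse."""
    for required, sense in FRAMES:
        if required <= functions:
            return sense
    return None

# Functions extracted from the parse above: a subject and a PP in "à"
print(select_sense({"SUBJ", "PP-à"}))  # to attend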
Still, in some cases even the subcategorisation frame is not enough, and one needs access to ontologies in order to express selectional restrictions. In other words, one needs more information regarding the semantic type of the verb's arguments. Consider now the sentence Je bouche le trou avec du ciment and its parse:
[VC [NP je NP]/SUBJ :v bouche v: VC] [NP le trou NP]/OBJ [PP avec du ciment PP]
If we look in the Oxford-Hachette bilingual dictionary, we find the meanings below (in this case translations) associated with the transitive use of the verb boucher:
boucher vtr
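A minimal sketch of how selectional restrictions over a small ontology might drive the choice among such translations follows; the ontology, semantic types and translation table are invented for illustration, not quoted from the dictionary.

# Choose among translations of transitive "boucher" by the semantic type
# of its direct object, looked up in a small hand-built ontology.
# The ontology and translation table below are invented for illustration.

ONTOLOGY = {          # noun -> semantic type
    "trou": "opening",
    "passage": "way",
    "bouteille": "container",
}

TRANSLATION = {       # semantic type of the object -> translation
    "opening": "to fill",
    "way": "to block",
    "container": "to cork",
}

def translate_boucher(object_head):
    """Pick a translation of 'boucher' from its object's semantic type."""
    sem_type = ONTOLOGY.get(object_head)
    return TRANSLATION.get(sem_type, "to block (default)")

# Object head extracted from the parse above: [NP le trou NP]/OBJ
print(translate_boucher("trou"))  # to fill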