Information Retrieval

Introduction

Information retrieval (IR) systems aim to provide mechanisms for users to find information in large electronic document collections (we ignore voice, audio, and image retrieval systems here). Typically this involves retrieving that subset of documents (or portions thereof) in the collection which is deemed relevant by the system in relation to a query issued by the user. The query may be anything from a single word to a paragraph or more of text expressing the user's area of interest. With the proliferation of on-line textual information (especially the World Wide Web) IR technology has become of significant interest both as a research topic and in applications (cf. the sudden emergence of commercially supported Web search engines).

Survey

Generally, IR systems work by associating a set of terms (index terms) with each document, associating a set of terms (query terms) with the query, and then performing some similarity matching operation on these sets of terms to select the documents which are to be returned to the user. There are two main approaches in IR: Boolean and ranked-output (or best-match).

A Boolean query is constructed from atomic query terms (words or phrases) using the logical operators AND, OR and NOT. It divides the database being searched into two parts, one containing documents which are considered to be relevant with respect to the query, and the other containing the remaining documents. In the first category, the user will consider some documents to be more relevant than others and some not to be relevant at all. The same situation is mirrored in the non-relevant set. Within each set, however, the IR system makes no differentiation among the documents - they are all considered to be equally relevant, or not. The user must potentially inspect each and every document with no a priori knowledge as to where in the set the useful documents lie. Neither is it possible to predict the likely size of the retrieved set, except with considerable experience of particular systems.
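The set-based behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the design of any system cited here: the toy collection, the document ids and the helper names (`build_index`, `term`, `AND`, `OR`, `NOT`) are all assumptions made for the example. Each term maps to the set of documents containing it, and the logical operators become set intersection, union and complement.

```python
# Toy collection; ids and texts are invented for illustration.
DOCS = {
    1: "information retrieval from large document collections",
    2: "machine translation of technical documents",
    3: "retrieval of images and audio",
    4: "information extraction from text",
}

def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

def term(word):
    """Documents containing the atomic query term (empty set if unseen)."""
    return INDEX.get(word.lower(), set())

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return ALL_IDS - a

# information AND (retrieval OR extraction) AND NOT images
result = AND(
    AND(term("information"), OR(term("retrieval"), term("extraction"))),
    NOT(term("images")),
)
print(sorted(result))  # [1, 4]
```

The result is a plain set: documents 1 and 4 are both "relevant" and nothing in the output says which of them the user should read first.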

Ranked-output systems rank the documents within a database in decreasing likelihood of relevance with respect to the query. They do this by comparing a set of terms extracted from the query with the sets of terms corresponding to each of the documents in the database. They calculate a measure of similarity between the query and each of the documents using a numerically-based algorithm and then sort the documents by decreasing degree of similarity with the query. The user can then browse down the list just so far as (s)he considers necessary. This approach takes into account the fact that relevance is not an all-or-nothing matter; it depends not only on the query itself, but also on the user's previous knowledge and the items already retrieved and inspected in that search. Ranked-output systems tend to be based on either a vector-based or a probabilistic model. The former treats each index term as a coordinate in an information space, so that both document and query are represented as vectors (perhaps weighted) of term values, between which a similarity is computed using a measure such as the cosine of the angle between the two vectors. The latter associates probabilities with query terms in documents that have been assessed for relevance and then uses these probabilities to judge the probability of relevance of new, unseen documents.
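As a rough illustration of the vector-based model, the following Python sketch represents the query and each document as a sparse vector of raw term frequencies and ranks documents by the cosine measure. It is a sketch under stated assumptions: the weighting (plain counts, no normalisation by collection frequency), the toy documents and the function names `vectorize`, `cosine` and `rank` are all choices made for this example, and real systems typically use more refined term weights.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    q = vectorize(query)
    scored = [(doc_id, cosine(q, vectorize(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy documents, invented for illustration.
docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "boolean retrieval divides the collection in two",
    "d3": "machine translation of text",
}
ranking = rank("information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike the Boolean case, every document receives a score, so the user can stop reading at whatever point in the list the scores stop looking useful.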

See [Sal89] or [Spa97] for a general introduction and background to IR.

Conceptual IR & Indexing

Of particular interest in the context of lexical semantics and IR are IR systems that attempt some form of conceptual indexing - that is, rather than simply indexing the surface words that appear in a text, an attempt is made to identify the concept being expressed, which is then recorded and matched against concepts identified in a user's query.

BADGER [BAD] is a text analysis system which uses linguistic context to identify concepts in a text. The key point of the system is that single words taken out of context may not relate to the same concept as the phrase to which that word belongs. It therefore aims to find linguistic features that reliably identify the relevant concepts, representing the conceptual content of a phrase as a case frame, or concept node (CN).

CRYSTAL [Sod95] automatically induces a dictionary of CNs from a training corpus. The CN definitions describe the local context in which relevant information may be found, specifying a set of syntactic and semantic constraints. When the constraints are satisfied for a portion of text, a CN is instantiated.

W. Woods at Sunlabs [Woo] uses semantic relationships among concepts to improve IR. The work applies NLP and knowledge representation techniques to deal with differences in terminology between query and target, and includes the development of a prototype system for indexing and organising information in structured conceptual taxonomies.

DEDAL [Bau,DED] is a knowledge-based retrieval system which uses a conceptual indexing and query language to describe the content and form of design information. DE-KART is a knowledge acquisition (KA) tool which refines the knowledge of DEDAL by increasing the level of generality and automation.

CRISTAL [Cri96] used the Dicologique semantic dictionary (§ 3.5.2) to conceptually index news stories.

Role of Lexical Semantics

At least two issues in lexical semantics are of immediate relevance to IR applications. The first is polysemy: because many words have multiple meanings, any strategy that simply uses string matching to select documents containing terms also found in the user's query is bound to return many irrelevant documents - those that contain the word used in a sense different from that intended by the user. The second is synonymy: because many equivalent or closely related meanings can be conveyed by distinct words, the same string-matching strategy is bound to miss many relevant documents - those that contain different words expressing meanings similar to that intended by the user.

Moving beyond issues pertaining to the meanings of single words, it would seem that IR could clearly benefit from other NLP techniques and capabilities (though just how much is a subject of current debate). In particular, use of semantic verb frames (see below) and phrasal parsing (§5.5) should assist.

Use of Thesauri

The synonymy problem is frequently tackled by using thesauri to expand query or index terms to permit broader matching.
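A minimal Python sketch of query-side expansion is given below. The thesaurus entries and the names `expand_query` and `matches` are hypothetical and chosen only for illustration; a real thesaurus would be far larger and would usually distinguish synonyms from broader and narrower terms.

```python
# Toy thesaurus; entries are invented for illustration only.
THESAURUS = {
    "car": {"automobile", "motorcar"},
    "doctor": {"physician"},
}

def expand_query(terms, thesaurus):
    """Return the query terms together with their listed synonyms."""
    expanded = set()
    for t in terms:
        expanded.add(t)
        expanded |= thesaurus.get(t, set())
    return expanded

def matches(doc_text, terms):
    """True if the document shares at least one word with the term set."""
    return bool(set(doc_text.lower().split()) & terms)

doc = "the automobile industry in europe"
plain = {"car"}
expanded = expand_query(plain, THESAURUS)

print(matches(doc, plain))     # False: plain string matching misses the synonym
print(matches(doc, expanded))  # True: the expanded query matches "automobile"
```

Expansion of this kind trades precision for recall: every added synonym can itself be polysemous, so broader matching may also bring in documents that use the added word in an unintended sense.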

Use of SNOMED/UMLS

[Ama95] use SNOMED (see §3.8) and a formal grammar to create templates with a combination of syntactic and semantic labels.

[Nel95] use the UMLS metathesaurus (see §3.8) to identify concepts in medical knowledge.

Use of WordNet

[Sta96] present a system to identify relevant concepts in a document and to index documents by concept using WordNet (§ 3.4.2). Lexical chains are generated which describe the context surrounding the concept. Documents are indexed using WordNet synsets as tokens, as opposed to terms.
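The idea of indexing by synset rather than by surface term can be sketched in Python as follows. This is not the method of [Sta96]: the tiny sense inventory stands in for WordNet, the synset ids and cue words are invented, and the sense choice is a crude context-overlap heuristic used here in place of lexical chains. The function names are likewise assumptions made for the example.

```python
# Toy sense inventory standing in for WordNet; ids and cue words are invented.
SYNSETS = {
    "bank.financial": {"words": {"bank"}, "cues": {"money", "loan", "account"}},
    "bank.river": {"words": {"bank", "shore"}, "cues": {"river", "water", "fishing"}},
    "car.vehicle": {"words": {"car", "automobile"}, "cues": set()},
}

def candidate_synsets(word):
    """All synset ids whose word list contains the given word."""
    return [sid for sid, s in SYNSETS.items() if word in s["words"]]

def disambiguate(word, context):
    """Pick the candidate synset whose cue words overlap most with the context."""
    candidates = candidate_synsets(word)
    if not candidates:
        return None
    return max(candidates, key=lambda sid: len(SYNSETS[sid]["cues"] & context))

def concept_index(text):
    """Index a text by synset ids instead of surface words."""
    words = text.lower().split()
    context = set(words)
    tokens = set()
    for w in words:
        sid = disambiguate(w, context)
        if sid is not None:
            tokens.add(sid)
    return tokens

print(concept_index("he sat on the bank of the river"))      # {'bank.river'}
print(concept_index("fishing from the shore of the river"))  # {'bank.river'}
print(concept_index("the bank approved the loan"))           # {'bank.financial'}
```

The first two texts share an index token although they share no content word for the concept, while the third text, which does contain "bank", is kept apart from them.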

Use of Semantic Verb Frames

Semantic verb frames permit more constrained retrieval on entities playing certain roles in verbal case frames; see [Jin97] and other semantic classification programs [Hat].
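To make "constrained retrieval on entities playing certain roles" concrete, here is a small Python sketch. It assumes that some analysis step (not shown) has already turned each document into verb frames; the `Frame` class, the role names `agent` and `theme`, the company names and the `search` function are all hypothetical and are not taken from [Jin97] or [Hat].

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One verbal case frame extracted from a document (hypothetical format)."""
    doc_id: str
    verb: str
    roles: dict = field(default_factory=dict)

# Invented frames, as if produced by an earlier parsing step.
FRAMES = [
    Frame("d1", "acquire", {"agent": "acme", "theme": "globex"}),
    Frame("d2", "acquire", {"agent": "globex", "theme": "initech"}),
    Frame("d3", "sue", {"agent": "acme", "theme": "globex"}),
]

def search(frames, verb, **role_constraints):
    """Ids of documents with a frame for `verb` whose roles satisfy every constraint."""
    return [
        f.doc_id
        for f in frames
        if f.verb == verb
        and all(f.roles.get(role) == filler for role, filler in role_constraints.items())
    ]

# "Which documents report that globex was acquired?"
print(search(FRAMES, "acquire", theme="globex"))  # ['d1']
```

A plain keyword query for "acquire" and "globex" would also return d2, where globex is the acquirer rather than the company acquired; the role constraint is what separates the two.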

Related Areas and Techniques

Word clustering and word sense disambiguation techniques, discussed in §5.1 and §5.3, are also relevant to IR (see also [Kro92,Kro97,Res97]).

EAGLES Central Secretariat eagles@ilc.cnr.it