Text Summarization
Introduction
With the proliferation of online textual resources, an increasingly
pressing need has arisen to improve online access to textual
information. This need has been partly addressed through the
development of tools that automatically select the document
fragments best suited to provide a summary of the document, possibly
with reference to the user's interests. Text summarization
has thus rapidly become a very topical research area.
Survey
Most of the work on summarization carried out to date is geared
towards the extraction of significant text fragments from a document
and can be classified into two broad categories:
- domain dependent approaches where a priori knowledge of the
discourse domain and text structure (e.g. weather, financial, medical)
is exploited to achieve high quality summaries, and
- domain independent approaches where statistical techniques (e.g.
vector space indexing models) as well as linguistic techniques
(e.g. lexical cohesion) are employed to identify key passages and
sentences of the document.
Considerably less effort has been devoted to "text condensation"
treatments, where NLP approaches to text analysis and generation are
used to deliver summary information on the basis of interpreted text
[McK95].
Domain Dependent Approaches
Several domain dependent approaches to summarization use Information
Extraction techniques ([Leh81,Ril93]) in order to identify the
most important information within a document. Work in this area
also includes techniques for Report Generation ([Kit86]) and
Event Summarization ([May93]) from specialized databases.
Domain Independent Approaches
Most domain independent approaches use statistical techniques, often in
combination with robust/shallow language technologies, to extract
salient document fragments. The statistical techniques used are
similar to those employed in Information Retrieval and include vector
space models, term frequency and inverse document frequency
([Pai90,Rau94,Sal97]). The language technologies employed vary
from lexical cohesion techniques ([Hoe91,Bar97]) to robust
anaphora resolution ([Bog97]).
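By way of illustration only, the following Python sketch extracts
sentences using term frequency and inverse document frequency weights
computed over the sentences of a single document; it is not a
reconstruction of any of the cited systems, and the tokenizer, the
stop list and the length normalization are simplifying assumptions.

    import math
    import re
    from collections import Counter

    # Illustrative stop list; real systems use much larger ones.
    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "that"}

    def tokenize(sentence):
        # Lower-cased alphabetic tokens with stop words removed.
        return [w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in STOP_WORDS]

    def summarize(sentences, k=3):
        # Treat each sentence as a pseudo-document when computing idf.
        token_lists = [tokenize(s) for s in sentences]
        n = len(sentences)
        df = Counter(w for tokens in token_lists for w in set(tokens))
        scored = []
        for i, tokens in enumerate(token_lists):
            tf = Counter(tokens)
            # Sum of tf*idf weights, normalized by sentence length.
            weight = sum(freq * math.log(n / df[w]) for w, freq in tf.items())
            scored.append((weight / (len(tokens) or 1), i))
        # Return the k best-scoring sentences in their original order.
        best = sorted(sorted(scored, reverse=True)[:k], key=lambda pair: pair[1])
        return [sentences[i] for _, i in best]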
Role of Lexical Semantics
In many text extraction approaches, the essential step in abridging a
text is to select a portion of the text which is most representative
in that it contains as many of the key concepts defining the text as
possible (textual relevance). This selection must also take into
consideration the degree of textual connectivity among sentences
so as to minimize the danger of producing summaries which contain
poorly linked sentences. Good lexical semantic information can help
achieve better results in the assessment of textual relevance and
connectivity.
For example, computing lexical cohesion for all pair-wise
sentence combinations in a text provides an effective way of assessing
textual relevance and connectivity in parallel [Hoe91]. A simple
way of computing lexical cohesion for a pair of sentences is to count
the non-stop (i.e. open class) words which occur in both
sentences. Sentences which share a greater number of non-stop
words are more likely to provide a better abridgement of the original
text for two reasons (a sketch of this measure follows the list below):
- the more often a word with high informational content occurs in
a text, the more topical and germane to the text the word is
likely to be, and
- the greater the number of words two sentences share, the more
connected they are likely to be.
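The counting measure just described can be sketched in a few lines of
Python; the code below is a minimal illustration of pairwise cohesion
counting rather than the actual procedure of [Hoe91], and the stop
list and punctuation handling are placeholder assumptions.

    from itertools import combinations

    # Placeholder stop list standing in for closed-class (function) words.
    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "that"}

    def content_words(sentence):
        # Lower-cased non-stop words of a sentence, edge punctuation stripped.
        return {w.strip(".,;:!?").lower() for w in sentence.split()} - STOP_WORDS

    def cohesion(sent_a, sent_b):
        # Number of non-stop words the two sentences have in common.
        return len(content_words(sent_a) & content_words(sent_b))

    def rank_by_cohesion(sentences):
        # Rank each sentence by its total cohesion with every other sentence.
        totals = [0] * len(sentences)
        for i, j in combinations(range(len(sentences)), 2):
            link = cohesion(sentences[i], sentences[j])
            totals[i] += link
            totals[j] += link
        return sorted(range(len(sentences)), key=lambda i: totals[i],
                      reverse=True)

Sentences ranked near the top by such a measure are both rich in the
document's recurrent content words and well connected to the rest of
the text, which is precisely the combination the two reasons above
describe.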
The assessment of lexical cohesion between text units can be improved
and enriched by using semantic relations such as synonymy,
hyp(er)onymy [Hoe91,Mor91,Hir97,Bar97] as well as semantic
annotations such as subject domains [SanFCb] in addition to
simple orthographic identity.
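As a hypothetical illustration of one such enrichment, the sketch
below extends word matching with synonyms and direct hypernyms drawn
from WordNet via NLTK; WordNet and the helper names used here are
convenient assumptions and are not prescribed by the works cited
above.

    from nltk.corpus import wordnet as wn  # assumes the WordNet data is installed

    def related_forms(word):
        # The word itself plus its WordNet synonym and direct hypernym lemmas.
        forms = {word.lower()}
        for synset in wn.synsets(word):
            forms.update(lemma.lower() for lemma in synset.lemma_names())
            for hypernym in synset.hypernyms():
                forms.update(lemma.lower() for lemma in hypernym.lemma_names())
        return forms

    def semantic_cohesion(words_a, words_b):
        # Count word pairs linked by identity, synonymy or hyp(er)onymy.
        return sum(1 for a in words_a for b in words_b
                   if b.lower() in related_forms(a))

Substituting such a test for plain string identity would, for
instance, credit a sentence pair that shares only "car" and
"automobile", which orthographic matching alone would miss.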
Related Areas and Techniques
Related areas of research are: Information Retrieval, Information
Extraction and Text Classification.