In the context of natural language processing, parsing may be defined as the process of assigning structural descriptions to sequences of words in a natural language (or to sequences of symbols derived from word sequences). Just what sort of structural description is assigned, and on what grounds, depends on the grammar - a description language plus a set of structural constraints - according to which the parser attempts to analyse the symbol sequences presented to it. Put another way, a parser accepts as input a sequence of words (or their surrogates) in some language, together with an abstract description of the possible structural relations that may hold between words or word sequences in that language, and produces as output zero or more structural descriptions of the input as permitted by the structural rule set. There will be zero descriptions either if the input sequence cannot be analysed by the grammar, i.e. it is ungrammatical, or if the parser is incomplete, i.e. it fails to find all of the structure the grammar permits. There will be more than one description if the input is ambiguous with respect to the grammar, i.e. if the grammar permits more than one analysis of the input.
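The behaviour just described - a grammar plus an input sequence yielding zero, one, or several structural descriptions - can be illustrated with a minimal CYK parser over a toy context-free grammar. The grammar and lexicon below are illustrative inventions, not drawn from any particular system; the classic prepositional-phrase attachment ambiguity gives two analyses, while an ungrammatical input gives none.

```python
from itertools import product

# A toy context-free grammar in Chomsky normal form (rules are illustrative).
BINARY = {
    ("NP", "VP"): ["S"],
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # PP attaches to the noun phrase ...
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # ... or to the verb phrase
    ("P", "NP"): ["PP"],
}
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def cyk_parse(words):
    """Return every parse tree rooted in S, as nested tuples (CYK algorithm)."""
    n = len(words)
    # chart[i][j] maps a category to all trees it derives over words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat in LEXICAL.get(w, []):
            chart[i][i + 1].setdefault(cat, []).append((cat, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point of the span
                for lcat, ltrees in chart[i][k].items():
                    for rcat, rtrees in chart[k][j].items():
                        for parent in BINARY.get((lcat, rcat), []):
                            for lt, rt in product(ltrees, rtrees):
                                chart[i][j].setdefault(parent, []).append(
                                    (parent, lt, rt))
    return chart[0][n].get("S", [])

trees = cyk_parse("I saw the man with the telescope".split())
print(len(trees))  # 2: the PP modifies either the NP or the VP
```

Note that `cyk_parse("saw the man".split())` returns the empty list: the fragment derives a VP but no S, so the grammar assigns it zero descriptions.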
The input symbol sequence to a parser may or may not consist solely of words in a natural language. Leaving aside parsing of artificial languages (e.g. programming languages or logical languages), annotated documents (e.g. SGML), and non-linguistic sequences such as gene sequences, parsing in NLP may operate over word sequences, part-of-speech tag sequences, or sequences of complex symbols such as feature bundles (e.g. where a word may have been replaced by a set of features including its orthographic form, part-of-speech, inflectional class, etc.).
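These three levels of input representation can be sketched as follows; the field names and tag values are illustrative assumptions, not taken from any particular tagset or formalism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureBundle:
    """One input symbol as a bundle of features rather than a bare word
    (the choice of fields here is illustrative)."""
    form: str        # orthographic form
    pos: str         # part-of-speech tag
    inflection: str  # inflectional class, e.g. "pl" or "3sg"

# The same sentence presented to a parser at three levels of description:
words = ["dogs", "bark"]                       # word sequence
tags = ["NNS", "VBP"]                          # part-of-speech tag sequence
bundles = [FeatureBundle("dogs", "NNS", "pl"),  # feature-bundle sequence
           FeatureBundle("bark", "VBP", "non3sg")]
```

A parser over tag sequences sees only `tags`; a parser over feature bundles can condition on any combination of the fields.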
Generally speaking, the motivation for parsing is the belief that grammatical structure contributes to meaning and that discovering the grammatical structure of an NL word sequence is a necessary step in determining the meaning of the sequence. In some parsers the construction of a meaning representation is carried out in parallel with the derivation of a structural analysis according to the grammar.
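One standard way of building meaning in parallel with structure is rule-to-rule semantic composition: each syntactic rule is paired with a semantic operation applied as the rule is used. The sketch below hard-codes a single transitive-clause analysis for brevity; the lexicon and the predicate-argument meaning representation are illustrative assumptions.

```python
# Illustrative lexicon pairing each word with a category and a semantic value.
LEX = {"mary": ("NP", "mary"), "john": ("NP", "john"), "likes": ("V", "like")}

def parse_clause(words):
    """Analyse a three-word transitive clause (NP V NP), returning both the
    syntax tree and a predicate-argument meaning built in step with it.
    A minimal sketch of rule-to-rule composition, not a general parser."""
    (c1, subj), (c2, pred), (c3, obj) = (LEX[w.lower()] for w in words)
    assert (c1, c2, c3) == ("NP", "V", "NP"), "not a transitive clause"
    # VP -> V NP : the syntax node is built together with its partial meaning
    vp_tree = ("VP", ("V", words[1]), ("NP", words[2]))
    # S -> NP VP : completing the tree completes the logical form
    s_tree = ("S", ("NP", words[0]), vp_tree)
    meaning = (pred, subj, obj)
    return s_tree, meaning

tree, meaning = parse_clause("Mary likes John".split())
print(meaning)  # ('like', 'mary', 'john')
```

The point of the sketch is the coupling: the meaning `('like', 'mary', 'john')` is assembled by the same steps that assemble the tree, rather than in a separate pass over a finished analysis.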
Applications of parsing include everything from simple phrase finding, e.g. for proper name recognition (§5.4), to full semantic analysis of text, e.g. for information extraction (§4.3) or machine translation (§4.1).
Parsing has been of central concern in computational linguistics and related areas of application for well over thirty years, and any attempt to review the many approaches here is bound to be both shallow and incomplete. Approaches to parsing may be classified according to a number of characteristics of the parser.
While, as noted, for most grammatical theories some attempt has been made to implement an automatic parser, most current applied computational work is taking place in the phrase structure framework (§5.12) and can be described under two heads:
Depending on the grammar a parser uses to analyse its input, lexical semantics may play no role whatsoever, e.g. in parsing syntactic tag sequences, or an extensive role through, e.g., the application of semantic constraints or preferences supplied via lexical entries. Lexicalised grammars, such as recent versions of HPSG [Pol94] and LTAG [Jos91], provide natural frameworks in which complex semantic constraints may be expressed. But simpler frameworks, e.g. tag-based finite-state or CFG parsing schemes, may be adapted to take simple lexical semantic information into account by using semantic tag classes as well as part-of-speech tags (e.g. tagging nouns as LOCATION, PERSON or ORGANISATION). Many so-called pattern-matching information extraction systems use this sort of lexical semantic information, employing patterns triggered by the presence of key lexical items together with surrounding fillers of the appropriate semantic type. Such approaches tend to be highly domain and genre specific.
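A minimal sketch of this pattern-matching style, under assumed inputs: the semantic lexicon, the pattern, and the extraction rule below are all hypothetical, but they show the mechanism of replacing words by semantic tag classes and matching a pattern anchored on a key lexical item with typed fillers.

```python
# Hypothetical semantic lexicon mapping words to semantic tag classes.
SEM_LEXICON = {
    "smith": "PERSON", "jones": "PERSON",
    "london": "LOCATION", "paris": "LOCATION",
    "acme": "ORGANISATION", "ibm": "ORGANISATION",
}

def sem_tag(word):
    """Replace a word by its semantic class where known, else keep the word."""
    return SEM_LEXICON.get(word.lower(), word.lower())

# An illustrative extraction pattern: "<PERSON> , head of <ORGANISATION>",
# triggered by the key item "head" with fillers of the right semantic type.
PATTERN = ["PERSON", ",", "head", "of", "ORGANISATION"]

def match_employment(words):
    """Return (person, organisation) pairs wherever the semantically tagged
    sequence contains the pattern, scanning left to right."""
    tags = [sem_tag(w) for w in words]
    hits = []
    for i in range(len(tags) - len(PATTERN) + 1):
        if tags[i:i + len(PATTERN)] == PATTERN:
            hits.append((words[i], words[i + 4]))
    return hits

sent = "Smith , head of ACME , spoke in London".split()
print(match_employment(sent))  # [('Smith', 'ACME')]
```

The domain specificity noted above is visible even in this sketch: the lexicon and pattern encode one relation in one phrasing, and a new domain or genre would require new classes and new patterns.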
Machine translation §4.1, information retrieval §4.2, information extraction §4.3.