The literature on ontologies mentions a number of requirements and principles, some of which go back to Aristotle ([Mah96], [Zar95]). Some important examples are the following:
Existing ontologies vary in a number of respects, such as the format they are encoded in, the level of granularity of the semantic information they provide, the extension of their conceptual coverage, and the size of the lexicons which are mapped onto them. The methodology of ontology acquisition and the representational structure are driven by the following motivations [Mah96]:
While NL applications in different domains appear to require domain-specific conceptualisations, there is some hope that a common upper level of domain-independent concepts and relations can be agreed: such a shared resource would greatly reduce the load on individual NL application developers, who would no longer need to reinvent a model of the most general and abstract concepts underlying language and reasoning. A collectively refined resource should also benefit from increased comprehensiveness and accuracy, especially if a standard for the representation of portable ontologies emerges, such as the Knowledge Interchange Format [Gen92], supported by ontology encoding and maintenance tools such as Ontolingua [Gru92]. The development of such a generic top-ontology is the aim of the ANSI committee on Ontology Standards. Their Reference Ontology includes about 3,000 general concepts taken from a variety of existing resources. A description of the work done is given in [HovFC]. The final result should form the standard for the development of more specific ontologies.
This section reviews four current candidates for upper level ontologies: Cyc, Mikrokosmos, the Generalised Upper Model, and Sensus. These are by no means the only candidates, but they give an indication of the work going on in this area, especially work of relevance to NLP. Ontologies are also being explored purely theoretically, and with a view to application in areas other than NLP, such as simulation and modelling (e.g. in molecular biology) and knowledge sharing and reuse, so not all work on ontologies is of relevance here. Recent general review articles on ontologies are [Vic97] and [Noy97].
Ontologies are not lexical resources per se. They are generally regarded as conceptualisations underlying language, so that mappings from lexicons into ontologies need to be provided. One of the advantages of this is that ontologies can serve an interlingual role, providing the semantics for words from multiple languages, as is shown in EuroWordNet (§3.4.3), where the wordnets for different languages share the same Top Ontology. But there are murky philosophical waters here, and there are practical problems for any attempt to evolve standards for lexical semantics: should such semantics be anchored in an underlying ontological framework? If so, which? And would this presuppose arriving first at a standard for ontologies?
Cycorp (the inheritor of Cyc from MCC, which ran the Cyc project for 10 years) has made public an upper ontology of approximately 3,000 terms, a small part of the full Cyc knowledge base (`many tens of thousands' more concepts), but one which they believe contains `the most general concepts of human consensus reality' [Cyc97]. They have not made available most of the (`hundreds of thousands' of) axioms relating concepts, nor any of the domain-specific microtheories implemented in the Cyc KB. Nor have they made available the Cyc lexicon, which contains over 14,000 English word roots with word class and subcategorization information plus their mappings into the KB, or the other components of their NL system: a parser and a semantic interpreter.
Each concept in the KB is represented as a Cyc constant, also called a term or unit. Each term has isa links to the classes of which it is an instance, plus genls links to the superclasses of which it is a subclass. Two of the most important Cyc classes are collections and relations (predicates and functions). In addition to isa and genls links, collections frequently also have links to subsets (usually just illustrative examples in the published version). Associated with predicate terms in the hierarchy is information about the predicate's arity and the types of its arguments. There may also be links to more general and more specific predicates. Functions also have information about their argument and result types.
Here is the textual representation of two sample Cyc constants - a collection and a relation. Each has a heading, an English gloss, then one or more relational attributes indicating links to other constants in the KB.
The Cyc KB is organised as a collection of lattices where the nodes in all the lattices are Cyc constants and the edges are various sorts of relation (isa, genls, genlpred).
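The difference between isa and genls links can be illustrated with a small sketch. It is a toy model only, not Cyc's actual representation or API: the `Constant` class, the helper functions, and all constant names below are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Constant:
    """A toy stand-in for a Cyc constant (not the real Cyc data model)."""
    name: str
    isa: set = field(default_factory=set)    # collections this term is an instance of
    genls: set = field(default_factory=set)  # collections this term is a subclass of


def all_genls(kb, name):
    """Return every superclass reachable from `name` by following genls links."""
    seen = set()
    stack = list(kb[name].genls)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(kb[current].genls)
    return seen


def is_instance_of(kb, name, collection):
    """True if `name` has an isa link to `collection` or to one of its subclasses."""
    for direct in kb[name].isa:
        if direct == collection or collection in all_genls(kb, direct):
            return True
    return False


# Hypothetical constants, chosen for illustration only.
kb = {
    "Thing": Constant("Thing"),
    "Collection": Constant("Collection", genls={"Thing"}),
    "Animal": Constant("Animal", isa={"Collection"}, genls={"Thing"}),
    "Dog": Constant("Dog", isa={"Collection"}, genls={"Animal"}),
    "Fido": Constant("Fido", isa={"Dog"}),
}

print(all_genls(kb, "Dog"))
print(is_instance_of(kb, "Fido", "Animal"))
```

In this sketch instance membership is propagated upwards along genls links, so an instance of a subclass also counts as an instance of each superclass, while the genls links themselves never turn a subclass into an instance.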
The Mikrokosmos ontology [Mik97,Mah95a,Mah95b] is part of the Mikrokosmos knowledge-based machine translation system currently under development at the Computing Research Laboratory, New Mexico State University. It is meant to provide a language-neutral repository of concepts in the world to assist in the process of deriving an interlingual text meaning representation for texts in a variety of input languages. It is derived from earlier work on the ONTOS ontology [Car90].
The ontology divides at the top level into object, event, and property. Nodes occurring beneath these divisions in the hierarchy constitute the concepts in the ontology and are represented as frames consisting of slots with facets and fillers. Concepts have slots for an NL definition, a time-stamp, links to superordinate and subordinate concepts, and an arbitrary number of other properties (local or inherited). These slots have facets, each of which in turn has a filler. Facets capture such things as the permissible semantic types or ranges of values for the slot (sem), the actual value if known (value), and default values (default). Where necessary, inheritance has been blocked by using the filler *NOTHING* (see example).
An example of a MikroKosmos concept:
(MAKE-FRAME ARTIFACT
  (IS-A (VALUE (COMMON INANIMATE)))
  (SUBCLASSES (VALUE (COMMON ANIMAL-RELATED-ARTIFACT AIRPORT-ARTIFACT
                             ARTIFACT-PART BUILDING-ARTIFACT
                             DECORATIVE-ARTIFACT DEVICE
                             EARTH-RESOURCE-ARTIFACT ENGINEERED-ARTIFACT
                             EVERYDAY-ARTIFACT MEASURING-ARTIFACT
                             MEDIA-ARTIFACT MUSICAL-INSTRUMENT
                             PACKAGING-MATERIAL PROTECTION-OBJECT
                             RESTAURANT-ARTIFACT RESTRAINING-ARTIFACT
                             SMOKING-DEVICE VEHICLE)))
  (PART-OF (SEM (COMMON ARTIFACT)))
  (PRODUCTION-MODE (SEM (COMMON MANUAL MECHANICAL AUTOMATED)))
  (DEFINITION (VALUE (COMMON "physical objects intentionally made by humans")))
  (AGE (SEM (COMMON (>= 0) (<> 0 20))))
  (TIME-STAMP (VALUE (COMMON "lcarlson at Monday, 6/4/90 12:49:46 pm"
                             "updated by lori at 14:02:32 on 03/15/95"
                             .... etc....
                             "lori at 09:47:09 on 07/20/95")))
  (COLOR (SEM (COMMON RED BLUE YELLOW ORANGE PURPLE GREEN GRAY TAN CYAN MAGENTA)))
  (OWNED-BY (SEM (COMMON HUMAN)))
  (MADE-OF (SEM (COMMON MATERIAL)))
  (PRODUCT-TYPE-OF (SEM (COMMON ORGANIZATION)))
  (PRODUCED-BY (SEM (COMMON HUMAN)))
  (THEME-OF (SEM (COMMON EVENT)))
  (MATERIAL-OF (SEM (COMMON *NOTHING*))))
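The interplay of slots, facets, fillers, and blocked inheritance can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mikrokosmos implementation: the `lookup` function is hypothetical, the frames for INANIMATE and PHYSICAL-OBJECT and their fillers are invented for the example, and only a few of ARTIFACT's slots are copied from the frame above. It also assumes that a *NOTHING* filler simply hides whatever a superordinate concept would otherwise supply.

```python
NOTHING = "*NOTHING*"  # filler used in the ARTIFACT frame to block inheritance

# Each frame maps slot -> facet -> list of fillers.
# ARTIFACT's fillers follow the example frame; the two parent frames and
# their fillers are hypothetical and exist only to show inheritance.
frames = {
    "PHYSICAL-OBJECT": {
        "MATERIAL-OF": {"SEM": ["PHYSICAL-OBJECT"]},
        "LOCATION": {"SEM": ["PLACE"]},
    },
    "INANIMATE": {
        "IS-A": {"VALUE": ["PHYSICAL-OBJECT"]},
    },
    "ARTIFACT": {
        "IS-A": {"VALUE": ["INANIMATE"]},
        "MADE-OF": {"SEM": ["MATERIAL"]},
        "MATERIAL-OF": {"SEM": [NOTHING]},
    },
}


def lookup(frames, concept, slot, facet):
    """Find the fillers for slot/facet, climbing IS-A links when absent locally."""
    frame = frames[concept]
    if slot in frame and facet in frame[slot]:
        fillers = frame[slot][facet]
        # A local *NOTHING* filler blocks inheritance: report no fillers.
        return [] if NOTHING in fillers else fillers
    for parent in frame.get("IS-A", {}).get("VALUE", []):
        if parent in frames:
            inherited = lookup(frames, parent, slot, facet)
            if inherited:
                return inherited
    return []


print(lookup(frames, "ARTIFACT", "MADE-OF", "SEM"))      # local filler
print(lookup(frames, "ARTIFACT", "LOCATION", "SEM"))     # inherited filler
print(lookup(frames, "ARTIFACT", "MATERIAL-OF", "SEM"))  # blocked by *NOTHING*
```

Local fillers take precedence over inherited ones here, which is what allows a single *NOTHING* entry on ARTIFACT to switch off a slot for the concept and everything beneath it.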
The Mikrokosmos web site puts the current size of the ontology at about 4500 concepts. The ontology is being acquired manually in conjunction with a lexicon acquisition team, and a set of guidelines has evolved for acquiring and placing concepts into the ontology.
The PENMAN upper model originated in work done in natural language generation at ISI in the 1980s [Bat90]. It emerged as a general and reusable resource, supporting semantic classification at an abstract level that was task- and domain-independent. One of its key features was the methodology underlying its construction, according to which ontologies should be created by careful analysis of semantic distinctions as revealed through grammatical alternations in and across languages. The PENMAN upper model was written in LOOM, a knowledge representation language developed at ISI.
The original PENMAN upper model was then merged with the KOMET German upper model [Hen93] to create a single unified upper model. This in turn has been further generalised through consideration of Italian and is now referred to as the Generalized Upper Model [Gum97,Bat94].
The Sensus ontology (formerly known as the Pangloss ontology) is a freely available `merged' ontology produced by the Information Sciences Institute (ISI), California [Sen97,Kni94,HovFC]. It is the result of merging:
The topmost levels of the ontology (called the Ontology Base (OB)) consist of about 400 terms representing generalised distinctions necessary for linguistic processing modules (parser, analyser, generator). The OB is the result of manually merging the PENMAN upper model with ONTOS. The middle region of the ontology consists of about 50,000 concepts from WordNet. An automated merging of WordNet and LDOCE with manual verification was carried out and the result of this merging, given the earlier merging of OB and WordNet, is an ontology linked to a rich English lexicon. A final merge with the Harper-Collins Spanish-English Bilingual Dictionary links Spanish words into the ontology (one of the aims of the work is to support Spanish-English machine translation).
Little detail of the structure of the ontology or of individual entries is available in published form. The electronic source for the ontology consists of various word and concept definition files. The OB files contain entries of the form:
(DEFCONCEPT ARTIFACT
  :DEFINITION "physical objects intentionally made by humans"
  :DIRECT-SUPERCLASS (INANIMATE-OBJECT))
and entries in WordNet derived files are of the form:
(DEFCONCEPT |tolerate|
  :DEFINITION " put up with something or somebody unpleasant "
  :DIRECT-SUPERCLASS (|countenance,let|)
  :FRAMES ((8 0) (9 0))
  :WN-TYPE VERB.COGNITION )
It is not clear how the WordNet derived entries link into the OB.
Comments in the interface/KB access code suggest that much richer information is available including part-whole relations, instance and member relations, constraints on verbal arguments, etc. But none of this data appears to be in the public release data files.
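Since the DEFCONCEPT entries shown above are plain s-expressions, they can be read mechanically. The following is a minimal sketch of one way to do so, not the Sensus access code: the tokeniser, the `parse` and `to_concept` functions, and the resulting dictionary layout are all assumptions, and the sketch handles only the constructs that occur in the two entries quoted in this section.

```python
import re

# Tokens: parentheses, "double-quoted strings", |bar-quoted symbols|, bare atoms.
TOKEN = re.compile(r'\(|\)|"[^"]*"|\|[^|]*\||[^\s()]+')


def parse(text):
    """Parse one s-expression into nested Python lists and strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if token[0] == '"':
            return token[1:-1].strip()  # drop quotes and padding spaces
        if token[0] == "|":
            return token[1:-1]          # drop the enclosing bars
        return token

    return read()


def to_concept(text):
    """Turn a DEFCONCEPT form into a dict keyed by its :KEYWORD fields."""
    form = parse(text)
    if form[0] != "DEFCONCEPT":
        raise ValueError("not a DEFCONCEPT form")
    concept = {"name": form[1]}
    rest = form[2:]
    for key, value in zip(rest[0::2], rest[1::2]):
        concept[key.lstrip(":")] = value
    return concept


entry = ('(DEFCONCEPT |tolerate| :DEFINITION " put up with something or '
         'somebody unpleasant " :DIRECT-SUPERCLASS (|countenance,let|) '
         ':FRAMES ((8 0) (9 0)) :WN-TYPE VERB.COGNITION )')
concept = to_concept(entry)
print(concept["name"], concept["DIRECT-SUPERCLASS"], concept["WN-TYPE"])
```

Numbers such as the :FRAMES values are left as strings, because their intended interpretation is not documented in the released files.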
Summarizing, we can characterize the discussed ontologies as follows:
Cyc:
MikroKosmos:
Generalised Upper Model:
Sensus:
The ontologies described here differ from the usual lexical resources in that they focus on knowledge from a non-lexical perspective. An exception is MikroKosmos, which has rich lexical resources linked to the ontology and so comes closer to the lexical resources discussed here. However, the distinction is not that clear-cut. We have seen that EDR (§3.6) and WordNet 1.5 (§3.4.2) contain both lexicalized and non-lexicalized concepts, and can thus partly be seen as language-neutral structures as well. Furthermore, in EuroWordNet (§3.4.3), wordnets of lexicalized concepts are interlinked with a separate ontology of formalized semantic distinctions.
The kind of semantics described in the higher-level ontologies comes closest to the taxonomic models described in §2.7. It is also closely related to the work of [Sch73,Sch75], which formed the basis for a non-linguistic approach to conceptual semantics. As non-lexical approaches (with the exception of MikroKosmos), the resources clearly do not relate to §2.4 and §2.5.
One of the prime uses to which the Cyc ontology is to be put is natural language understanding: in particular the Cycorp Web pages refer to several enhanced information retrieval applications (see §4.2), including ``knowledge-enhanced'' searching of captioned information for image retrieval and information retrieval from the WWW, parts of which could be converted into Cyc's internal format and used to supplement Cyc itself. Another application is thesaurus management, whereby the extensive Cyc ontology is used to support ``conceptual merging'' of multiple (perhaps industry-specific) thesauri.
Mikrokosmos is primarily designed to support knowledge-based machine translation (KBMT - see §4.1) and is being used in Spanish-English-Japanese translation applications.
The Penman upper model and the Generalised Upper Model were originally designed to assist in natural language generation applications (see §4.5), though their authors believe these models have broader potential utility for NL systems.
Pangloss and Penman, from which Sensus was derived, were applications in machine translation and text generation respectively, and Sensus is intended to support applications in these areas (see §4.1 and §4.5 for discussions of machine translation and text generation).