
Higher Level Ontologies

Introduction

In the last several years a number of higher or upper level ontologies have become generally available to the knowledge representation and natural language research communities. Representations of the sorts of things that exist in the world and of the relations between them are necessary for a variety of natural language understanding and generation tasks, including syntactic disambiguation (e.g. prepositional phrase attachment), coreference resolution (only compatible types of things can corefer), inference based on world knowledge for interpretation in context, and language-independent meaning representation for text generation and machine translation.

The literature on ontologies mentions a number of requirements and principles, some of which go back to Aristotle ([Mah96], [Zar95]). Some important examples are the following:

1. Ontologies should be language independent, that is, independently motivated and not dictated by the lexicalisation patterns of a particular language;
2. Ontologies need to be well-formed according to an axiomatic specification;
3. Similarity Principle: a child must share the meaning of its parent;
4. Specificity Principle: a child must differ from its parent in a distinctive way which is the necessary and sufficient condition for being the child concept;
5. Opposition Principle: a concept must be distinguishable from its siblings and the distinction between each pair of siblings must be represented.
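
The Similarity, Specificity and Opposition principles can be read as checks over a taxonomy. The following is a minimal sketch in Python, assuming (purely for illustration) that the meaning of a concept is modelled as a set of features; the concept names, the feature sets and the function check_principles are invented here and do not come from any of the ontologies reviewed.

```python
from itertools import combinations

# Toy taxonomy: concept -> (parent, set of features).
# All names and features are hypothetical, for illustration only.
TAXONOMY = {
    "animal": (None, {"living"}),
    "bird": ("animal", {"living", "feathered"}),
    "fish": ("animal", {"living", "gilled"}),
}


def check_principles(taxonomy):
    """Return a list of (principle, concepts) pairs for every violation found."""
    violations = []
    children = {}
    for name, (parent, feats) in taxonomy.items():
        if parent is None:
            continue
        parent_feats = taxonomy[parent][1]
        # Similarity: the child carries every feature of its parent.
        if not parent_feats <= feats:
            violations.append(("similarity", (name, parent)))
        # Specificity: the child adds at least one distinguishing feature.
        if not feats - parent_feats:
            violations.append(("specificity", (name, parent)))
        children.setdefault(parent, []).append(name)
    # Opposition: every pair of siblings differs in at least one feature.
    for siblings in children.values():
        for a, b in combinations(sorted(siblings), 2):
            if taxonomy[a][1] == taxonomy[b][1]:
                violations.append(("opposition", (a, b)))
    return violations
```

Calling check_principles(TAXONOMY) on the toy data returns an empty list. A feature-set model is only one possible reading of the principles; an axiomatic specification would state the distinctions as logical conditions rather than as feature sets.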

Existing ontologies vary in a number of respects, such as the format they are encoded in, the level of granularity of the semantic information they provide, the extension of their conceptual coverage, and the size of the lexicons which are mapped onto them. The methodology of ontology acquisition and the representational structure are driven by the following motivations [Mah96]:

Encyclopedia driven: detailed world knowledge;
Domain-analysis driven: coverage of the application domain as far as possible;
Task driven: complete coverage with respect to a specific NLP task;
Lexicon driven: coverage of the meaning of all words in a lexicon;
Formal: strict adherence to a formalism.

While NL applications in different domains appear to require domain-specific conceptualisations, there is some hope that a common upper level of domain-independent concepts and relations can be agreed upon: such a shared resource would greatly reduce the load on individual NL application developers, who would otherwise have to reinvent a model of the most general and abstract concepts underlying language and reasoning. A collectively refined resource should also benefit from increased comprehensiveness and accuracy, especially if a standard for the representation of portable ontologies emerges, such as the Knowledge Interchange Format [Gen92], supported by ontology encoding and maintenance tools such as Ontolingua [Gru92]. The development of such a generic top-ontology is the aim of the ANSI committee on Ontology Standards. Their Reference Ontology includes about 3,000 general concepts taken from a variety of existing resources. A description of the work done is given in [HovFC]. The final result should form the standard for the development of more specific ontologies.

This section reviews four current candidates for upper level ontologies: Cyc, Mikrokosmos, the Generalised Upper Model, and Sensus. These are by no means the only candidates, but they give an indication of the work going on in this area, especially work of relevance to NLP. Ontologies are also being explored both purely theoretically and with a view to application in areas other than NLP, such as simulation and modelling (e.g. in molecular biology) and knowledge sharing and reuse, so not all work on ontologies is of relevance here. Recent general review articles on ontologies are [Vic97] and [Noy97].

Ontologies are not lexical resources per se. They are generally regarded as conceptualisations underlying language, so that mappings from lexicons into ontologies need to be provided. One of the advantages of this is that ontologies can serve an interlingual role, providing the semantics for words from multiple languages, as is shown in EuroWordNet (§3.4.3) where different language-wordnets share the same Top Ontology. But there are murky philosophical waters here, and there are practical problems for any attempt to evolve standards for lexical semantics: should such semantics be anchored in an underlying ontological framework? If so, which? And would this presuppose arriving first at a standard for ontologies?

Cycorp

Cycorp (the inheritor of Cyc from MCC, which ran the Cyc project for 10 years) has made public an upper ontology of approximately 3,000 terms, a small part of the full Cyc knowledge base (`many tens of thousands' more concepts), but one which they believe contains `the most general concepts of human consensus reality' [Cyc97]. They have not made available most of the (`hundreds of thousands of') axioms relating concepts, nor any of the domain-specific microtheories implemented in the Cyc KB. Nor have they made available the Cyc lexicon, which contains over 14,000 English word roots with word class and subcategorization information plus their mappings into the KB, or the other components of their NL system - a parser and semantic interpreter.

Each concept in the KB is represented as a Cyc constant, also called a term or unit. Each term has isa links to the classes of which it is an instance, plus genls links to the superclasses of which it is a subclass. Two of the most important Cyc classes are collections and relations (predicates and functions). In addition to isa and genls links, collections frequently also have links to subsets (usually just illustrative examples in the published version). Associated with predicate terms in the hierarchy is information about the predicate's arity and the types of its arguments. There may also be links to more general and more specific predicates. Functions also have information about their argument and result types.

Here is the textual representation of two sample Cyc constants - a collection and a relation. Each has a heading, an English gloss, then one or more relational attributes indicating links to other constants in the KB.

#$Head-AnimalBodyPart

The collection of all heads of #$Animals.

isa: #$AnimalBodyPartType #$UniqueAnatomicalPartType
genls: #$AnimalBodyPart #$BiologicalLivingObject
some subsets: #$Head-Vertebrate

#$hairColor

<#$Animal> <#$ExistingObjectType> <#$Color>
(#$hairColor ANIMAL BODYPARTTYPE COLOR) means that the hair which the #$Animal ANIMAL has on its BODYPARTTYPE has the #$Color COLOR. E.g., (#$hairColor #$SantaClaus #$Chin #$WhiteColor). This is normally #$Mammal hair, but certain #$Invertebrates also have hair.

isa: #$TernaryPredicate #$TangibleObjectPredicate
arg2Genl: #$AnimalBodyPart

The Cyc KB is organised as a collection of lattices in which the nodes are Cyc constants and the edges are various sorts of relation (isa, genls, genlpred).
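
The difference between isa (instance-of) and genls (subclass-of) links, and the use of an argument constraint such as arg2Genl, can be illustrated with a small sketch. The Python below is not Cyc code and does not use any Cyc API: the dictionaries and the functions all_genls, is_instance and arg2_ok are hypothetical, the data is limited to the two sample constants, the subsets link is assumed to be the inverse of genls, and arg2Genl is assumed to require the second argument to be a subcollection of the named collection.

```python
# genls: collection -> direct supercollections (subclass links).
# The Head-Vertebrate entry assumes "some subsets" is the inverse of genls.
GENLS = {
    "Head-Vertebrate": {"Head-AnimalBodyPart"},
    "Head-AnimalBodyPart": {"AnimalBodyPart", "BiologicalLivingObject"},
}

# isa: term -> collections of which it is a direct instance.
ISA = {
    "Head-AnimalBodyPart": {"AnimalBodyPartType", "UniqueAnatomicalPartType"},
    "hairColor": {"TernaryPredicate", "TangibleObjectPredicate"},
}

# arg2Genl: predicate -> collection that must subsume the second argument.
ARG2_GENL = {"hairColor": "AnimalBodyPart"}


def all_genls(collection):
    """Reflexive, transitive closure of the genls links from a collection."""
    seen = set()
    stack = [collection]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(GENLS.get(current, ()))
    return seen


def is_instance(term, collection):
    """True if term is an instance of collection, directly or via genls."""
    return any(collection in all_genls(c) for c in ISA.get(term, ()))


def arg2_ok(predicate, arg2):
    """Check the assumed arg2Genl constraint for a predicate's second argument."""
    required = ARG2_GENL.get(predicate)
    return required is None or required in all_genls(arg2)
```

With this toy data, arg2_ok("hairColor", "Head-Vertebrate") holds because Head-Vertebrate reaches AnimalBodyPart through genls, whereas a collection with no genls path to AnimalBodyPart is rejected.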

Mikrokosmos

The Mikrokosmos ontology [Mik97,Mah95a,Mah95b] is part of the Mikrokosmos knowledge-based machine translation system currently under development at the Computing Research Laboratory, New Mexico State University. It is meant to provide a language-neutral repository of concepts in the world to assist in the process of deriving an interlingual text meaning representation for texts in a variety of input languages. It is derived from earlier work on the ONTOS ontology [Car90].

The ontology divides at the top level into object, event, and property. Nodes occurring beneath these divisions in the hierarchy constitute the concepts in the ontology and are represented as frames consisting of slots with facets and fillers. Concepts have slots for an NL definition, a time-stamp, links to superordinate and subordinate concepts, and an arbitrary number of other properties (local or inherited). These slots have facets, each of which in turn has a filler. Facets capture such things as the permissible semantic types or ranges of values for the slot (sem), the actual value (value) if known, and default values (default). Where necessary, inheritance has been blocked by using the value *NOTHING* (see example).

An example of a Mikrokosmos concept:


(MAKE-FRAME ARTIFACT (IS-A (VALUE (COMMON INANIMATE)))
 (SUBCLASSES
  (VALUE
   (COMMON ANIMAL-RELATED-ARTIFACT AIRPORT-ARTIFACT
    ARTIFACT-PART BUILDING-ARTIFACT DECORATIVE-ARTIFACT
    DEVICE EARTH-RESOURCE-ARTIFACT ENGINEERED-ARTIFACT
    EVERYDAY-ARTIFACT MEASURING-ARTIFACT MEDIA-ARTIFACT
    MUSICAL-INSTRUMENT PACKAGING-MATERIAL
    PROTECTION-OBJECT RESTAURANT-ARTIFACT
    RESTRAINING-ARTIFACT SMOKING-DEVICE VEHICLE)))
 (PART-OF (SEM (COMMON ARTIFACT)))
 (PRODUCTION-MODE
  (SEM (COMMON MANUAL MECHANICAL AUTOMATED)))
 (DEFINITION
  (VALUE
   (COMMON
    "physical objects intentionally made by humans")))
 (AGE (SEM (COMMON (>= 0) (<> 0 20))))
 (TIME-STAMP
  (VALUE
   (COMMON "lcarlson at Monday, 6/4/90 12:49:46 pm"
    "updated by lori at 14:02:32 on 03/15/95"
    .... etc....
    "lori at 09:47:09 on 07/20/95")))
 (COLOR
  (SEM
   (COMMON RED BLUE YELLOW ORANGE PURPLE GREEN GRAY TAN
    CYAN MAGENTA)))
 (OWNED-BY (SEM (COMMON HUMAN)))
 (MADE-OF (SEM (COMMON MATERIAL)))
 (PRODUCT-TYPE-OF (SEM (COMMON ORGANIZATION)))
 (PRODUCED-BY (SEM (COMMON HUMAN)))
 (THEME-OF (SEM (COMMON EVENT)))
 (MATERIAL-OF (SEM (COMMON *NOTHING*))))
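
How slots, facets and fillers might interact with inheritance, and how a *NOTHING* filler blocks it, can be sketched as follows. This is an illustrative Python model only, not the Mikrokosmos implementation: the FRAMES layout and the lookup function are assumptions, and the MATERIAL-OF filler given for INANIMATE is invented so that there is something for ARTIFACT to block.

```python
NOTHING = "*NOTHING*"

# concept -> {"IS-A": parent or None, "slots": {slot: {facet: fillers}}}
FRAMES = {
    "INANIMATE": {
        "IS-A": None,
        # Hypothetical filler, invented so that ARTIFACT has something to block.
        "slots": {"MATERIAL-OF": {"SEM": ["OBJECT"]}},
    },
    "ARTIFACT": {
        "IS-A": "INANIMATE",
        "slots": {
            "MADE-OF": {"SEM": ["MATERIAL"]},
            "OWNED-BY": {"SEM": ["HUMAN"]},
            "MATERIAL-OF": {"SEM": [NOTHING]},  # blocks the inherited filler
        },
    },
    "VEHICLE": {"IS-A": "ARTIFACT", "slots": {}},
}


def lookup(concept, slot, facet="SEM"):
    """Return the nearest filler list for slot/facet, walking up IS-A links.

    Returns None if the slot is never filled or if the nearest filler
    is *NOTHING*, i.e. inheritance has been blocked.
    """
    while concept is not None:
        frame = FRAMES[concept]
        fillers = frame["slots"].get(slot, {}).get(facet)
        if fillers is not None:
            return None if fillers == [NOTHING] else fillers
        concept = frame["IS-A"]
    return None
```

In this sketch VEHICLE inherits the MADE-OF constraint from ARTIFACT, while a lookup of MATERIAL-OF on VEHICLE stops at the *NOTHING* filler on ARTIFACT and never reaches INANIMATE.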

The Mikrokosmos web site puts the current size of the ontology at about 4,500 concepts. The ontology is being acquired manually in conjunction with a lexicon acquisition team, and a set of guidelines has evolved for acquiring and placing concepts into the ontology.

The PENMAN Upper Model

The PENMAN upper model originated in work done in natural language generation at ISI in the 1980s [Bat90]. It emerged as a general and reusable resource, supporting semantic classification at an abstract level that was task- and domain-independent. One of its key features was the methodology underlying its construction, according to which ontologies should be created by careful analysis of semantic distinctions as revealed through grammatical alternations in and across languages. The PENMAN upper model was written in LOOM, a knowledge representation language developed at ISI.

The original PENMAN upper model was then merged with the KOMET German upper model [Hen93] to create a single unified upper model. This in turn has been further generalised through consideration of Italian and is now referred to as the Generalized Upper Model [Gum97,Bat94].

The Sensus ontology

The Sensus ontology (formerly known as the Pangloss ontology) is a freely available `merged' ontology produced by the Information Sciences Institute (ISI), California [Sen97,Kni94,HovFC]. It is the result of merging:

the PENMAN Upper Model
the ONTOS ontology
the LDOCE semantic categories for nouns
WordNet
the Harper-Collins Spanish-English Bilingual Dictionary

The topmost levels of the ontology (called the Ontology Base (OB)) consist of about 400 terms representing generalised distinctions necessary for linguistic processing modules (parser, analyser, generator). The OB is the result of manually merging the PENMAN upper model with ONTOS. The middle region of the ontology consists of about 50,000 concepts from WordNet. An automated merging of WordNet and LDOCE with manual verification was carried out and the result of this merging, given the earlier merging of OB and WordNet, is an ontology linked to a rich English lexicon. A final merge with the Harper-Collins Spanish-English Bilingual Dictionary links Spanish words into the ontology (one of the aims of the work is to support Spanish-English machine translation).

Little detail of the structure of the ontology or of individual entries is available in published form. The electronic source for the ontology consists of various word and concept definition files. The OB files contain entries of the form:


(DEFCONCEPT ARTIFACT
  :DEFINITION "physical objects intentionally made by humans"
  :DIRECT-SUPERCLASS (INANIMATE-OBJECT))

and entries in WordNet derived files are of the form:


(DEFCONCEPT |tolerate|
  :DEFINITION " put up with something or somebody unpleasant  "
  :DIRECT-SUPERCLASS (|countenance,let|)
  :FRAMES ((8 0) (9 0))
  :WN-TYPE VERB.COGNITION
  )
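
Entries of this form are regular enough to be read mechanically. The following Python sketch, which is not part of the Sensus distribution, parses a single DEFCONCEPT form into a dictionary. The tokeniser, the function names parse and defconcept_to_dict, and the choice to lower-case the keyword names are assumptions made for illustration; the sketch handles only the syntax visible in the two sample entries.

```python
import re

# Tokens: double-quoted strings, |bar-quoted| symbols, parentheses, bare atoms.
TOKEN = re.compile(r'"[^"]*"|\|[^|]*\||[()]|[^\s()"|]+')

SAMPLE = '''
(DEFCONCEPT |tolerate|
  :DEFINITION " put up with something or somebody unpleasant  "
  :DIRECT-SUPERCLASS (|countenance,let|)
  :FRAMES ((8 0) (9 0))
  :WN-TYPE VERB.COGNITION
  )
'''


def parse(text):
    """Parse one s-expression into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        if tok[0] == '"':
            return tok[1:-1].strip()  # drop quotes and padding
        if tok[0] == "|":
            return tok[1:-1]  # drop the bar quoting
        return tok

    return read()


def defconcept_to_dict(text):
    """Turn a DEFCONCEPT form into {'name': ..., keyword: value, ...}."""
    form = parse(text)
    if form[0] != "DEFCONCEPT":
        raise ValueError("not a DEFCONCEPT form")
    entry = {"name": form[1]}
    rest = form[2:]
    # Keywords and values alternate after the concept name.
    for key, value in zip(rest[0::2], rest[1::2]):
        entry[key.lstrip(":").lower()] = value
    return entry
```

For example, defconcept_to_dict(SAMPLE)["direct-superclass"] yields a one-element list holding the symbol countenance,let. Numbers such as those in the :FRAMES value are left as strings, since their interpretation is not documented here.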

It is not clear how the WordNet derived entries link into the OB.

Comments in the interface/KB access code suggest that much richer information is available including part-whole relations, instance and member relations, constraints on verbal arguments, etc. But none of this data appears to be in the public release data files.

Summarizing, we can characterize the discussed ontologies as follows:

Cyc:

contains encyclopaedic knowledge
has a high level grain size
rich connectivity between conceptual nodes
many resources spent on development
domain independent
language independent

Mikrokosmos:

intermediate level grain size
not many resources spent on development
rich connectivity between conceptual nodes
domain independent
language independent

Generalised Upper Model:

unknown amount of resources spent
domain independent
language dependent
interconnectivity between conceptual nodes: mostly hierarchical

Sensus:

unknown amount of resources spent
domain independent
language dependent
interconnectivity between conceptual nodes: hierarchical enriched with WordNet relation types

Comparison with Other Lexical Databases

The ontologies described here differ from the usual lexical resources in that they approach knowledge from a non-lexical perspective. An exception is Mikrokosmos, which has rich lexical resources linked to the ontology and so comes closer to the lexical resources discussed here. However, the distinction is not clear-cut. We have seen that EDR (§3.6) and WordNet1.5 (§3.4.2) contain both lexicalized and non-lexicalized concepts, and can thus partly be seen as language-neutral structures as well. Furthermore, in EuroWordNet (§3.4.3), wordnets of lexicalized concepts are interlinked with a separate ontology of formalized semantic distinctions.

Relation to Notions of Lexical Semantics

The kind of semantics described in the higher-level ontologies comes closest to the taxonomic models described in §2.7. It is also closely related to the work of [Sch73,Sch75], which formed the basis for a non-linguistic approach to conceptual semantics. As a non-lexical approach (with the exception of Mikrokosmos), the resources clearly do not relate to §2.5 and §2.4.

LE Users

One of the prime uses to which the Cyc ontology is to be put is natural language understanding: in particular the Cycorp Web pages refer to several enhanced information retrieval applications (see §4.2), including ``knowledge-enhanced'' searching of captioned information for image retrieval and information retrieval from the WWW, parts of which could be converted into Cyc's internal format and used to supplement Cyc itself. Another application is thesaurus management, whereby the extensive Cyc ontology is used to support ``conceptual merging'' of multiple (perhaps industry-specific) thesauri.

Mikrokosmos is primarily designed to support knowledge-based machine translation (KBMT - see §4.1) and is being used in Spanish-English-Japanese translation applications.

The PENMAN upper model and the Generalised Upper Model were originally designed to assist in natural language generation applications (see §4.5), though their authors believe these models have broader potential utility for NL systems.

Pangloss and PENMAN, from which Sensus was derived, were applications in machine translation and text generation respectively, and Sensus is intended to support applications in these areas (see §4.1 and §4.5 for discussions of machine translation and text generation).


EAGLES Central Secretariat eagles@ilc.cnr.it