TOSCA (Tools for Syntactic Corpus Analysis) is an annotation project developed at the Katholieke Universiteit at Nijmegen, the Netherlands. The main aim of the project is the production of resources for linguistic research in the areas of syntax and language use.
The TOSCA annotation scheme has been used in the analysis of the Nijmegen corpus and of the TOSCA corpus, both of which consist mainly of written language. It is also being used for some parts of the ICE corpus, which also includes spoken language.
The TOSCA annotation scheme is applied through interaction between linguist and computer: the computer produces all possible analyses, and the linguist selects the correct one. Part-of-speech tagging and the addition of syntactic labels form a single process, in which the chosen sequence of word class tags is used as input to the parser. The parser is generated automatically from a formal grammar written in the AGFL formalism (Affix Grammar over Finite Lattices).
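The select-then-parse workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the ambiguous tag choices, and the function names are hypothetical, and the real TOSCA tagger and AGFL-generated parser are not reproduced. It only shows the idea of enumerating every candidate tag sequence and passing on the one a linguist picks.

```python
from itertools import product

# Hypothetical lexicon: each word maps to its candidate word class tags.
# The ambiguities listed here are invented for illustration only.
LEXICON = {
    "he": ["PN"],
    "walked": ["LV"],
    "in": ["PREP", "ADV"],
    "the": ["ART"],
    "garden": ["N", "LV"],
}


def candidate_tag_sequences(words):
    """Return every combination of candidate tags for the given words."""
    options = [LEXICON[w.lower()] for w in words]
    return [list(seq) for seq in product(*options)]


def select(candidates, choice):
    """Stand-in for the linguist's decision: return the chosen candidate."""
    return candidates[choice]


words = ["He", "walked", "in", "the", "garden"]
candidates = candidate_tag_sequences(words)

# The linguist inspects the candidates and picks one; here, the first.
chosen = select(candidates, 0)

# In the workflow described above, `chosen` would be the input to the parser.
print(len(candidates), chosen)
```

In an actual interactive system the choice would come from the linguist rather than from a fixed index, and the parser would in turn produce candidate syntactic analyses for a further selection step.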
In the TOSCA annotation scheme, constituents are labelled for their function and category, while additional syntacticosemantic information is contained in various attributes.
The three major units of description are the word, the phrase and the clause/sentence. The non-lexical category labels are shown in table 3.8:
In addition to these labels, the annotation uses over 90 function labels and over 100 attribute labels. The function labels identify such phenomena as:
The attribute labels form the field in which information such as the following is recorded:
In addition to the above-mentioned syntacticosemantic labels, the annotation scheme contains a number of labels for extra-textual material such as speaker changes, headings and pauses.
The following example shows how the simple sentence "He walked in the garden" is analysed with the TOSCA scheme. In each node, the first label in capitals is the function label and the second is the category label; the lower-case tags in brackets are the attribute labels, and the word itself appears in braces:
NOFU,TXTU()
  UTT,S(act,decl,indic,intr,past,unm)
    SU,NP()
      NPHD,PN(pers) {He}
    V,VP(act,indic,intr,past)
      MVB,LV(indic,intr,past) {walked}
    A,PP()
      P,PREP() {in}
      PC,NP()
        DT,DTP()
          DTCE,ART(def) {the}
        NPHD,N(com,sing) {garden}
  PUNC,PM(per) {.}
A more complex example is shown below:
NOFU,TXTU() UTT,S(decl,indic,intr,pass,pres,unm) SU,NP() DT,DTP() DTCE,ART(indef) {An} NPPR,AJP(attru) AJHD,ADJ(attru) {alternative} NPHD,N(com,sing) {pathway} NOFU,COORD(decl,indic,intr,pass,pres) CJ,CONJ(decl,indic,intr,pass,pres) V,VP(indic,intr,mod,pass,pres) OP,AUX(indic,mod,pres) {may} AVB,AUX(indic,infin,pass) {be} MVB,LV(indic,motr,pastp) {deprived} A,PP() P,PREP() {of} PC,NP() NPHD,PN(pers,sing) {it} COOR,CONJN(coord) {and} CJ,CONJ(decl,indic,intr,pass,pres) V,VP(indic,intr,mod,pass,pres) A,CON() {hence} AVB,AUX(indic,infin,pass) {be} MVB,LV(indic,motr,pastp) {controlled} A,AVP(excl) AVHD,ADV(excl) {simply} A,PP() P,PREP() {by} PC,NP() NPHD,N(com,sing) {limitation} NPPO,PP() P,PREP() {of} PC,NP() NPHD,N(com,sing) {substrate} A,CL(act,indic,intens,pres,sub,unm) SUB,SUBP() SBHD,CONJN(subord) {when} SU,NP() NPHD,N(com,plu) {demands} NPPO,PP() P,PREP() {on} PC,NP() DT,DTP() DTCE,ART(def) {the} NPPR,AJP(attru) AJHD,ADJ(attru) {main} NPHD,N(com,sing) {pathway} V,VP(act,indic,intens,pres) MVB,LV(indic,intens,pres) {are} CS,AJP(prd) AJHD,ADJ(prd) {heavy} PUNC,PM(per) {.}
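The node notation used in the examples above (a function label, a category label, bracketed attributes, and an optional word in braces) is regular enough to be read mechanically. The following is a minimal sketch, not part of the TOSCA tools: the `Node` type, the `parse_nodes` function and the regular expression are hypothetical names introduced here, and the sketch recovers only the sequence of nodes, not the tree structure.

```python
import re
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    function: str            # e.g. SU, NPHD, MVB
    category: str            # e.g. NP, PN, LV
    attributes: List[str]    # e.g. ["indic", "intr", "past"]
    word: Optional[str]      # the token in braces, if the node is a leaf


# FUNCTION , CATEGORY ( attributes ), optionally followed by {word}.
# Whitespace around the comma and brackets is tolerated.
NODE_RE = re.compile(
    r"([A-Z]+)\s*,\s*([A-Z]+)\s*\(([^)]*)\)(?:\s*\{([^}]*)\})?"
)


def parse_nodes(text: str) -> List[Node]:
    """Return the nodes found in an annotated string, in textual order."""
    nodes = []
    for match in NODE_RE.finditer(text):
        function, category, attrs, word = match.groups()
        attributes = [a.strip() for a in attrs.split(",") if a.strip()]
        nodes.append(Node(function, category, attributes, word))
    return nodes


sample = (
    "SU, NP () NPHD, PN (pers) {He} "
    "V,VP(act,indic,intr,past) MVB,LV(indic,intr,past) {walked}"
)
nodes = parse_nodes(sample)

# Collect the words carried by the leaf nodes.
words = [n.word for n in nodes if n.word is not None]
print(words)
```

Because the hierarchy is carried by the layout rather than by the labels themselves, rebuilding the full tree would need the original indentation in addition to this node-level parse.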