AWN Data Spec
WordNet
XML Interchange Specification
Adam Pease, Bill Black, Piek Vossen
January 23, 2006
This document provides a specification for an interchange file format that will be used by the participants in the “Arabic WordNet with Ontology” project, and circulated to other WordNet builders.
Many fields in WordNets for languages that have non-Latin character sets will be in Unicode.
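As a small illustration of the Unicode point, the following sketch writes a word element with an Arabic value as UTF-8 and reads it back. It uses Python's standard xml.etree.ElementTree module; the wordid "w1" and the choice of UTF-8 as the file encoding are assumptions made for the example, not requirements stated in this specification.

```python
import xml.etree.ElementTree as ET

# Arabic for "book"; any non-Latin string would serve the same purpose.
arabic_word = "كتاب"

# "w1" is a hypothetical wordid used only for this illustration.
word = ET.Element("word", {"value": arabic_word, "wordid": "w1"})

# With encoding="utf-8", tostring() returns bytes with the characters encoded
# directly rather than written as numeric character references.
data = ET.tostring(word, encoding="utf-8")

# Parsing the bytes gives back the original string.
round_tripped = ET.fromstring(data).get("value")
```

Serialising with the default us-ascii encoding would also be valid XML, but the Arabic characters would then appear as numeric character references, which is harder to read in a text editor.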
DTD
<!DOCTYPE wordnet [
<!ELEMENT item EMPTY>
<!ATTLIST item id ID #REQUIRED>
<!ATTLIST item offset CDATA #IMPLIED>
<!ATTLIST item lexfile CDATA #IMPLIED>
<!ATTLIST item name CDATA #REQUIRED>
<!ATTLIST item type (synset|term) #REQUIRED>
<!ATTLIST item headword (yes|no) #IMPLIED>
<!ATTLIST item POS (noun|verb|adjective|adverb) #IMPLIED>
<!ATTLIST item source CDATA #REQUIRED>
<!ATTLIST item gloss CDATA #REQUIRED>
<!ATTLIST item authorshipid IDREF #REQUIRED>
<!ELEMENT link EMPTY>
<!ATTLIST link type (antonym|hyponym|instance hyponym|meronym|
    entailment|cause|also see|derived from|
    attribute|relational adj|similar to|verb group|
    participle|member holonym|substance holonym|
    part holonym|member meronym|substance meronym|
    part meronym|attribute|derivationally related|
    domain topic|member topic|domain region|
    member region|domain usage|member usage|pertainym|same|
    equivalent|subsuming|instance|antiequivalent|antisubsuming|
    antiinstance) #REQUIRED>
<!ATTLIST link id1 IDREF #REQUIRED>
<!ATTLIST link id2 IDREF #REQUIRED>
<!ATTLIST link authorshipid IDREF #REQUIRED>
<!ELEMENT word EMPTY>
<!ATTLIST word value CDATA #REQUIRED>
<!ATTLIST word synsetid IDREF #REQUIRED>
<!ATTLIST word wordid ID #REQUIRED>
<!ATTLIST word frequency CDATA #REQUIRED>
<!ATTLIST word corpus CDATA #REQUIRED>
<!ATTLIST word authorshipid IDREF #REQUIRED>
<!ELEMENT form EMPTY>
<!ATTLIST form value CDATA #REQUIRED>
<!ATTLIST form root (yes|no) #REQUIRED>
<!ATTLIST form tense (past|present|future) #IMPLIED>
<!ATTLIST form number (singular|dual|plural) #IMPLIED>
<!ATTLIST form person (1|2|3) #IMPLIED>
<!ATTLIST form gender (masculine|feminine|neuter) #IMPLIED>
<!ATTLIST form case (nominative|genitive|partitive) #IMPLIED>
<!ATTLIST form wordid IDREF #REQUIRED>
<!ATTLIST form authorshipid IDREF #REQUIRED>
<!ELEMENT verbFrame EMPTY>
<!ATTLIST verbFrame frame CDATA #REQUIRED>
<!ATTLIST verbFrame synsetid IDREF #REQUIRED>
<!ATTLIST verbFrame authorshipid IDREF #REQUIRED>
<!ELEMENT author EMPTY>
<!ATTLIST author authorshipid ID #REQUIRED>
<!ATTLIST author author CDATA #REQUIRED>
<!ATTLIST author date CDATA #REQUIRED>
<!ATTLIST author score CDATA #IMPLIED>
<!ATTLIST author comment CDATA #IMPLIED>
<!ATTLIST author covering (yes|no) #IMPLIED>
]>
DTD Reference
<!ELEMENT item EMPTY>
The item element is the central element in the schema. It holds the synset or term and its basic information.
<!ATTLIST item id ID #REQUIRED>
The id attribute is a unique identifier for the synset or term. Ideally, it should persist across versions, although a rule is still needed for cases where an item changes enough to warrant a new id: changing a gloss should not prompt a new id, but subdividing a synset should. The id should be of the form word_POS_sensenum_languageID, where languageID is the standard two-letter ISO 639-1 language code.
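The id format described above can be sketched as a pair of helper functions. This is only an illustration under stated assumptions: the function names make_item_id and parse_item_id are hypothetical, the POS component is assumed to use the same spellings as the POS attribute, and the sense number is assumed to be a plain integer. The specification does not say how ids for formal ontology terms (which have no POS) are formed, so they are not covered.

```python
from typing import NamedTuple

# Assumption: the POS part of the id reuses the POS attribute values.
POS_VALUES = {"noun", "verb", "adjective", "adverb"}


class ItemId(NamedTuple):
    word: str
    pos: str
    sense: int
    lang: str


def make_item_id(word: str, pos: str, sense: int, lang: str) -> str:
    """Build an id of the form word_POS_sensenum_languageID."""
    if pos not in POS_VALUES:
        raise ValueError(f"unknown part of speech: {pos!r}")
    if len(lang) != 2 or not lang.isalpha():
        raise ValueError(f"expected a two-letter language code, got {lang!r}")
    return f"{word}_{pos}_{sense}_{lang.lower()}"


def parse_item_id(item_id: str) -> ItemId:
    """Split from the right, so underscores inside the word itself survive."""
    word, pos, sense, lang = item_id.rsplit("_", 3)
    return ItemId(word, pos, int(sense), lang)


example_id = make_item_id("put", "verb", 1, "en")
parsed = parse_item_id("put_down_verb_2_en")
```

Splitting from the right matters because multi-word entries are commonly written with underscores, so the word component may itself contain the separator.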
<!ATTLIST item offset CDATA #IMPLIED>
The offset is a byte offset in a WordNet .DAT file. Since WordNet uses the byte offset as a unique id within versions, this is needed for compatibility reasons.
<!ATTLIST item lexfile CDATA #IMPLIED>
The lexfile attribute of the item element should be a two-digit number from 00 to 40, as described at <http://wordnet.princeton.edu/man/lexnames.5WN.html#sect4>. New files might be added in the future, however, so a number greater than 40 is not necessarily an error.
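A checker for this attribute might therefore distinguish between values that are malformed and values that are merely outside the currently known range. The sketch below assumes a three-way result ("ok", "warning", "error"); the function name check_lexfile and the result labels are hypothetical, and the upper bound of 40 is taken from the text above.

```python
def check_lexfile(value: str, highest_known: int = 40) -> str:
    """Classify a lexfile attribute value as 'ok', 'warning', or 'error'."""
    # Require exactly two ASCII digits; isascii() rejects digits from other scripts.
    if len(value) != 2 or not (value.isascii() and value.isdigit()):
        return "error"
    # Larger numbers may correspond to lexicographer files added later.
    if int(value) > highest_known:
        return "warning"
    return "ok"


results = {v: check_lexfile(v) for v in ("03", "40", "41", "7", "ab")}
```

Reporting a warning rather than an error for larger numbers keeps files that use newer lexicographer files from being rejected outright.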
<!ATTLIST item name CDATA #REQUIRED>
This is a human-readable name for the item. Where the item is a synset containing multiple words, this should be the first word in the synset.
<!ATTLIST item type (synset|term) #REQUIRED>
Whether the item is a WordNet synset or a formal ontology term.
<!ATTLIST item headword (yes|no) #IMPLIED>
Whether the item is an adjective “headword”. See <http://wordnet.princeton.edu/man/wngloss.7WN> for a further description.
<!ATTLIST item POS (noun|verb|adjective|adverb) #IMPLIED>
The part of speech of the item. This is omitted when the item is a formal ontology term.
<!ATTLIST item source CDATA #REQUIRED>
The product from which this item comes, typically a particular WordNet or ontology. The set of values is enumerated but growing rapidly, so it is not listed in the schema; values should follow the naming of WordNets given in the “resource name” column of <http://globalwordnet.org/wordnets-in-the-world/>.
<!ATTLIST item verbFrame IDREF #IMPLIED>
The verbFrame attribute of item is one of those specified at <http://wordnet.princeton.edu/man/wninput.5WN.html#sect4>.
<!ATTLIST item gloss CDATA #REQUIRED>
The glossary text for the item. The gloss is written in the language of the particular synset, so English WordNet has English glosses, ItalWordNet has Italian glosses, and so on.
<!ATTLIST item authorshipid IDREF #REQUIRED>
A pointer to information about who created this item.
<!ELEMENT link EMPTY>
This element covers links between items.
<!ATTLIST link type (antonym|hyponym|instance hyponym|meronym|
    entailment|cause|also see|derived from|
    attribute|relational adj|similar to|verb group|
    participle|member holonym|substance holonym|
    part holonym|member meronym|substance meronym|
    part meronym|attribute|derivationally related|
    domain topic|member topic|domain region|
    member region|domain usage|member usage|pertainym|same|
    equivalent|subsuming|instance|antiequivalent|antisubsuming|
    antiinstance) #REQUIRED>
The type attribute of the link element takes the values “antonym”, “hyponym”, “instance hyponym”, “meronym”, “entailment”, “troponym”, “cause”, “also see”, “derived from”, “attribute”, “relational adj”, “similar to”, “verb group”, “participle”, “member holonym”, “substance holonym”, “part holonym”, “member meronym”, “substance meronym”, “part meronym”, “attribute”, “derivationally related”, “domain topic”, “member topic”, “domain region”, “member region”, “domain usage”, “member usage” and “pertainym” for links between WordNet senses in the same version of WordNet.
The value “same” should be used to link equivalent senses across different WordNet versions. The types “equivalent”, “subsuming”, “instance”, “antiequivalent”, “antisubsuming”, and “antiinstance” should be used to link SUMO terms and WordNet senses. Note that the part of speech of the first argument restricts the allowable link types, as follows:
link type                                       noun  verb  adjective  adverb
antonym                                         x     x     x          x
derived from (adjective)                                               x
similar to                                                  x
participle (of verb)                                        x
pertainym (pertains to noun)                                x
attribute                                       x           x
hyponym                                         x     x
instance hyponym                                x
entailment                                            x
cause                                                 x
also see                                              x     x
verb group                                            x
member holonym                                  x
substance holonym                               x
part holonym                                    x
member meronym                                  x
substance meronym                               x
part meronym                                    x
derivationally related (form)                   x     x
domain topic (Domain of synset – TOPIC)         x     x     x          x
member topic (Member of this domain – TOPIC)    x
domain region (Domain of synset – REGION)       x     x     x          x
member region (Member of this domain – REGION)  x
domain usage (Domain of synset – USAGE)         x     x     x          x
member usage (Member of this domain – USAGE)    x
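The part-of-speech restrictions in the table above lend themselves to a simple lookup. The following is a minimal sketch, not part of the specification: the names ALLOWED_FIRST_POS and link_allowed are hypothetical, the keys use the spellings “holonym” and “meronym”, and the assignment of each mark to a column is assumed to follow the standard WordNet pointer inventory for nouns, verbs, adjectives and adverbs. Link types that have no row in the table (for example “same” or “equivalent”) are reported as having no recorded rule rather than being rejected.

```python
NOUN, VERB, ADJ, ADV = "noun", "verb", "adjective", "adverb"
ALL = {NOUN, VERB, ADJ, ADV}

# One entry per row of the table; values are the allowed POS of the first argument.
ALLOWED_FIRST_POS = {
    "antonym": ALL,
    "derived from": {ADV},
    "similar to": {ADJ},
    "participle": {ADJ},
    "pertainym": {ADJ},
    "attribute": {NOUN, ADJ},
    "hyponym": {NOUN, VERB},
    "instance hyponym": {NOUN},
    "entailment": {VERB},
    "cause": {VERB},
    "also see": {VERB, ADJ},
    "verb group": {VERB},
    "member holonym": {NOUN},
    "substance holonym": {NOUN},
    "part holonym": {NOUN},
    "member meronym": {NOUN},
    "substance meronym": {NOUN},
    "part meronym": {NOUN},
    "derivationally related": {NOUN, VERB},
    "domain topic": ALL,
    "member topic": {NOUN},
    "domain region": ALL,
    "member region": {NOUN},
    "domain usage": ALL,
    "member usage": {NOUN},
}


def link_allowed(link_type, first_pos):
    """Return True or False per the table, or None if the table has no row."""
    allowed = ALLOWED_FIRST_POS.get(link_type)
    if allowed is None:
        return None
    return first_pos in allowed
```

For instance, link_allowed("entailment", "noun") is False because the table marks entailment for verbs only.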
<!ATTLIST link id1 IDREF #REQUIRED>
The id of the first argument to the link.
<!ATTLIST link id2 IDREF #REQUIRED>
The id of the second argument to the link.
<!ATTLIST link authorshipid IDREF #REQUIRED>
A pointer to information about who created this item.
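Because id1 and id2 are IDREF attributes, every link is expected to point at an existing id. A validating parser would enforce this, but the check can also be sketched by hand with Python's standard xml.etree.ElementTree module. The function name dangling_links and the ids "i1", "i2" and "i9" are hypothetical, and the sample omits the other required attributes to stay short.

```python
import xml.etree.ElementTree as ET


def dangling_links(root):
    """List (attribute, value) pairs on link elements that match no item id."""
    item_ids = {item.get("id") for item in root.iter("item")}
    missing = []
    for link in root.iter("link"):
        for attr in ("id1", "id2"):
            ref = link.get(attr)
            if ref not in item_ids:
                missing.append((attr, ref))
    return missing


# Hypothetical ids; the second link deliberately points at a missing item.
SAMPLE = """
<wordnet>
  <item id="i1" name="put" type="synset"/>
  <item id="i2" name="Putting" type="term"/>
  <link type="equivalent" id1="i1" id2="i2"/>
  <link type="hyponym" id1="i1" id2="i9"/>
</wordnet>
"""

problems = dangling_links(ET.fromstring(SAMPLE.strip()))
```

The sketch only looks at item ids, on the assumption that links connect items; a missing attribute shows up as a pair whose value is None.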
<!ELEMENT word EMPTY>
Information about a particular word that is part of a (possibly single-word) synset.
<!ATTLIST word value CDATA #REQUIRED>
The word, which should be a root form. Root forms in English would be singular forms for nouns and infinitive forms (without the “to”) for verbs.
<!ATTLIST word synsetid IDREF #REQUIRED>
The synset that the word belongs to.
<!ATTLIST word wordid ID #REQUIRED>
The id of the word, so that the form table can provide additional forms of the word.
<!ATTLIST word frequency CDATA #REQUIRED>
The frequency of appearance of the word in a primary corpus.
<!ATTLIST word corpus CDATA #REQUIRED>
The name of the corpus from which frequency information for this word is derived.
<!ATTLIST word authorshipid IDREF #REQUIRED>
A pointer to information about who created this item.
<!ELEMENT form EMPTY>
The word forms included in the form table should generally be those that are exceptions to the rules provided in the MORPHY system <http://wordnet.princeton.edu/man/morphy.7WN.html> for English, or exceptions to any regularly derived form for other languages.
<!ATTLIST form value CDATA #REQUIRED>
A particular word form.
<!ATTLIST form root (yes|no) #REQUIRED>
Whether the given form of the word is considered its root form.
<!ATTLIST form tense (past|present|future) #IMPLIED>
The tense of the particular word form. This is optional, as not all languages have tense inflections on word forms; Chinese, in particular, does not.
<!ATTLIST form number (singular|dual|plural) #IMPLIED>
The grammatical number of the particular word form.
<!ATTLIST form person (1|2|3) #IMPLIED>
The grammatical person of the particular word form.
<!ATTLIST form gender (masculine|feminine|neuter) #IMPLIED>
The grammatical gender of the word form. This is optional, as many languages do not have inflections for gender on word forms. English has no grammatical gender, Romance languages have masculine and feminine but not neuter, and so on.
<!ATTLIST form case (nominative|genitive|partitive) #IMPLIED>
The grammatical case of the word form. The partitive case appears in few languages, Finnish among them.
<!ATTLIST form wordid IDREF #REQUIRED>
A reference to the root word that this form is derived from.
<!ATTLIST form authorshipid IDREF #REQUIRED>
A pointer to information about who created this item.
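To show how the word and form elements work together, the sketch below groups the irregular forms under the wordid of their root word. It is an illustration only: the function name forms_by_wordid and the ids "w1" and "w2" are hypothetical, and the sample omits several required attributes for brevity. The two forms shown are irregular English forms of the kind the form table is meant to hold.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def forms_by_wordid(root):
    """Map each wordid to the list of form values recorded for it."""
    index = defaultdict(list)
    for form in root.iter("form"):
        index[form.get("wordid")].append(form.get("value"))
    return dict(index)


# Hypothetical ids; attributes not needed for the lookup are left out.
SAMPLE = """<wordnet>
  <word value="go" wordid="w1"/>
  <word value="child" wordid="w2"/>
  <form value="went" root="no" tense="past" wordid="w1"/>
  <form value="children" root="no" number="plural" wordid="w2"/>
</wordnet>"""

root = ET.fromstring(SAMPLE)
forms = forms_by_wordid(root)

# Root forms come from the word elements themselves.
roots = {w.get("wordid"): w.get("value") for w in root.iter("word")}
```

Regular forms such as "walked" would not need an entry, since they follow from the general morphological rules.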
<!ELEMENT verbFrame EMPTY>
A relation between a verb synset and a verb “frame”, which is a minimal pattern of linguistic usage.
<!ATTLIST verbFrame frame IDREF #REQUIRED>
The frame attribute of verbFrame is one of those specified at <http://wordnet.princeton.edu/man/wninput.5WN.html#sect4>.
<!ATTLIST verbFrame synsetid IDREF #REQUIRED>
The synset that the frame is valid for.
<!ELEMENT author EMPTY>
Information about the authorship of a synset, link, or other record.
<!ATTLIST author authorshipid ID #REQUIRED>
The unique identifier for this authorship record. Authorship is kept in its own table because many items are likely to have been authored by the same person on a given day, at least for initial versions of any particular WordNet.
<!ATTLIST author author CDATA #REQUIRED>
The name of the author of the information.
<!ATTLIST author date CDATA #REQUIRED>
The date on which the information was created.
<!ATTLIST author score CDATA #IMPLIED>
If the item or link was created automatically, a score indicating the confidence of the creation. This attribute should not normally be present for information authored by a human. Note that scores from different automatic processes are not comparable, even though they appear to form a single total order.
<!ATTLIST author comment CDATA #IMPLIED>
A comment string relevant to this data item.
<!ATTLIST author covering (yes|no) #IMPLIED>
If the item is a synset, whether it covers all synonyms. That is, for a WordNet item rather than a formal term, a value of “yes” means the human author of the item has determined that all possible synonyms of this word sense have been entered and linked.
Example
<item id="1" offset="474859483" name="put" type="synset" POS="verb"
      source="Princeton WN" gloss="To place an object at a location."
      author="Christiane Fellbaum" date="19990101"/>
<item id="2" name="Putting" type="term" source="SUMO"
      gloss="To place an object at a location."
      author="Adam Pease" date="20050101"/>
<link type="hyponym" id1="1" id2="2" author="Adam Pease" date="20050101"/>
<link type="equivalent" id1="1" id2="2" author="Adam Pease" date="20050101"/>
<word value="put" synsetid="1" wordid="1" frequency="200"
      corpus="Brown Corpus" author="Christiane Fellbaum" date="19990101"/>
<form value="putting" tense="present" wordid="1"
      author="Christiane Fellbaum" date="19990101"/>
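The example above gives author and date inline on each element, whereas the DTD declares a separate author element that the other elements reference through authorshipid. The following sketch builds the same records in that style using Python's standard xml.etree.ElementTree module. All identifiers ("a1", "a2", "i1", "i2", "w1") are hypothetical, the root element name is taken from the DOCTYPE line, the root="no" value on the form is an assumption (the example does not state it), and only one of the two links is included to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical ids: a1/a2 for authors, i1/i2 for items, w1 for the word.
records = [
    ("author", {"authorshipid": "a1", "author": "Christiane Fellbaum",
                "date": "19990101"}),
    ("author", {"authorshipid": "a2", "author": "Adam Pease",
                "date": "20050101"}),
    ("item", {"id": "i1", "offset": "474859483", "name": "put",
              "type": "synset", "POS": "verb", "source": "Princeton WN",
              "gloss": "To place an object at a location.",
              "authorshipid": "a1"}),
    ("item", {"id": "i2", "name": "Putting", "type": "term",
              "source": "SUMO",
              "gloss": "To place an object at a location.",
              "authorshipid": "a2"}),
    ("link", {"type": "equivalent", "id1": "i1", "id2": "i2",
              "authorshipid": "a2"}),
    ("word", {"value": "put", "synsetid": "i1", "wordid": "w1",
              "frequency": "200", "corpus": "Brown Corpus",
              "authorshipid": "a1"}),
    ("form", {"value": "putting", "root": "no", "tense": "present",
              "wordid": "w1", "authorshipid": "a1"}),
]

root = ET.Element("wordnet")  # root name as given in the DOCTYPE line
for tag, attrs in records:
    ET.SubElement(root, tag, attrs)

# Every authorshipid on a non-author element should name a declared author.
declared = {a.get("authorshipid") for a in root.iter("author")}
unresolved = [el.tag for el in root
              if el.tag != "author" and el.get("authorshipid") not in declared]

xml_text = ET.tostring(root, encoding="unicode")
```

Keeping authorship in its own elements means the name and date are written once per author and day rather than repeated on every item, link, word and form.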