Exploiting the notions of VERBNET and CORELEX and introducing them into our system proved to be a non-trivial task. In particular, an important problem was building, in a compositional way, the semantics of the sentence from that of the verb and its arguments. For this purpose we used AUTOSEM, a tool compatible with LCFLEX that is able to build semantics compositionally during parsing. To work, this tool needs an ontology, which we built drawing on the semantic notions of VERBNET and CORELEX.
We focused on sentences belonging to task-oriented domains, in which actions must be executed to achieve a goal; since it is verbs that express the action, most of the knowledge is contained in them. However, this knowledge often imposes constraints on the arguments of verbs (which are generally nouns), so it also proved necessary to adopt a reference semantics for nouns.
We thus defined an ontology and a lexicon that handle seven classes covering the meaning of 109 verbs, and 47 classes covering the meaning of 289 nouns.
The system was then evaluated on a set of 751 sentences taken from the home-repair section of a nine-megabyte corpus originally collected at the University of Brighton.
The results obtained show that the system produced a correct semantic representation in 96% of the cases, and generated a partial representation for 2.6%.
CHAPTER 1
INTRODUCTION
Language is the fundamental means of communication for human beings. Simple and
comprehensible as it may appear to humans, it is in fact of utmost complexity when
it has to be understood by a computer. Natural language processing (NLP), also
called computational linguistics, attempts to use automated means to process text
and to deduce its syntactic and semantic structure. This is done for many purposes,
such as extracting specific information from text, performing machine translation,
producing automated summaries, etc. Since most of human knowledge is recorded
in linguistic form, enabling computers to understand natural language would allow
them to access all this knowledge. However, this is still an active field of research,
partly because NLP is a relatively young discipline. The modern field of linguistics
has a hundred-year history as a scientific discipline, and computational linguistics
has a forty-year history as a part of computer science, but it is only in the last ten
years that language understanding has emerged as an industry reaching millions of
people, with information retrieval and machine translation available on the internet,
and speech recognition becoming popular on desktop computers. This industry has
been enabled by theoretical advances in the representation and processing of
language information.
The commercial availability of speech recognition and the need for web-based
language techniques have provided an important impetus for the development of
real systems. The availability of very large on-line corpora (collections of annotated
sentences) has enabled statistical models of language at every level, from phonetics
to discourse.
Before the actual introduction to this thesis, it is necessary to define some terms
from the NLP terminology that will be used throughout this dissertation (such
definitions can be found in [Jurafsky et al., 2000]).
Two important aspects, among others, on which NLP focuses are syntax and the
lexicon. Syntax, or the patterns of language, defines structures such as the sentence,
made up of noun phrases and verb phrases. These structures include a variety of
modifiers such as adjectives, adverbs and prepositional phrases. The determination
of the syntactic structure of a sentence is done by a parser. At the basis of all this
are the words, and information about these is kept in a lexicon, which is usually a
machine-readable dictionary that may contain a good amount of additional
information about the properties of each word, encoded in a form that parsers can
utilize.
The parser then, given a sentence, tries to build all or some of the possible syntactic
trees for that sentence; every internal node in the syntactic tree is a so-called
non-terminal symbol and the leaves are the words of the sentence.
A non-terminal is a symbol that represents a constituent of the sentence, where a
constituent is a group of words that behaves as a single unit or phrase.
The most commonly used mathematical system for modeling constituent structure
in English and other natural languages is the Context-Free Grammar or CFG.
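As a concrete illustration, the following sketch uses the NLTK toolkit (not part of the system described in this thesis) to define a toy CFG and print the parse tree of the sentence used as a running example later in this chapter:

    import nltk

    # A toy CFG: S, NP, VP, etc. are non-terminals; quoted items are words.
    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> 'Mary' | Det N
        VP  -> V NP
        Det -> 'the'
        N   -> 'mirror'
        V   -> 'breaks'
    """)

    # The chart parser builds all trees the grammar allows for the sentence.
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(['Mary', 'breaks', 'the', 'mirror']):
        print(tree)
    # (S (NP Mary) (VP (V breaks) (NP (Det the) (N mirror))))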
Once a parse tree is built for a sentence, the next step is attaching some meaning to
it in order to obtain its semantic representation. If the semantic representation that
is built is appropriate, we can then perform inferences on the sentence.
The focus of this thesis is: first, building a lexicon that is able to directly link
syntactic frames of action verbs to their lexical semantic meanings; second,
coupling this lexicon with a robust parser to automatically derive semantic
attachments for parse trees.
The most widely used approach to achieve semantic attachment uses first-order
logic: the advantages of such an approach are the easy mechanism of deriving the
semantic representation from language, and the ability to make inferences in such
representations. There are some drawbacks though: a sentence such as “Mary
breaks the mirror” can lead to a representation such as the following (we are
adopting here a reified representation as in [Jurafsky et al., 2000], i.e. explicitly
encoding the event):
∃e(Break(e)∧Breaker(e, Mary)∧Broken(e, mirror))
This implies that each verb should have a different representation, so in cases in
which the number of verbs to be covered is high, most of the time is spent in
building the representations for such verbs in the lexicon.
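To make the drawback concrete, a hypothetical first-order-logic lexicon might look like the sketch below: every verb introduces its own predicates (Break/Breaker/Broken, Open/Opener/Opened, ...), so each new verb needs a hand-written entry.

    # Hypothetical per-verb templates in a first-order-logic lexicon:
    # each verb carries its own predicates, so the effort of writing the
    # lexicon grows linearly with the number of verbs covered.
    FOL_LEXICON = {
        "break": lambda e, subj, obj: f"Break({e}) & Breaker({e},{subj}) & Broken({e},{obj})",
        "open":  lambda e, subj, obj: f"Open({e}) & Opener({e},{subj}) & Opened({e},{obj})",
        # ... one entry per verb, for hundreds of verbs
    }

    print(FOL_LEXICON["break"]("e", "Mary", "mirror"))
    # Break(e) & Breaker(e,Mary) & Broken(e,mirror)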
The approach we propose here uses lexical semantics and is inspired by both
Levin’s theory and VERBNET as concerns verbs, and by CORELEX for nouns.
Lexical semantics, as defined in [Jurafsky et al., 2000], is the linguistic study of
the lexicon. The lexicon has a highly systematic structure that governs what words can
mean and how they can be used. This structure consists of relations among words
and their meanings as well as the internal structure of individual words.
The lexical semantic approach we use here follows Levin's theory as far as verbs
are concerned. In this theory individual verbs are mapped onto classes, which
contain a single semantic representation that is associated with every verb in that
class. This was the inspiring principle for the development of VERBNET as well.
VERBNET is in fact a lexicon that takes advantage of Levin’s theories and modifies
them slightly. This was necessary to ensure that each verb class is coherent enough
so that all its members have a common semantics and share basically the same
parameters and syntactic frames.
The syntactic frames of a verb are the different syntactic constructions that such a
verb allows for its arguments. In the example below (Fig. 1-1) there are two
different possible constructions, each of which represents a syntactic frame.
The lexical semantic approach we used, following Levin's theory, builds a
representation for sentences that links the syntactic frame to the semantic
representation of the sentence. This linkage is achieved primarily via the verb
semantics. In our lexicon verbs are grouped into classes of meaning. Each class
embodies a description in logic of the meaning of the verb class; such meaning
usually changes as the verb construction changes: in this way, for instance, if the
verb is transitive some propositions hold, and if the verb is used intransitively some
other propositions hold. The way in which this can be done is shown in Fig. 1-1.
    % Basic Transitive (causative)
    $$ Agent V Patient
    cause(Agent,E) contact(during(E),?Instrument,Patient)
    degradation_material_integrity(result(E),Patient) physical_form(result(E),Form,Patient)

    % Intransitive (inchoative)
    $$ Patient V
    degradation_material_integrity(result(E),Patient) physical_form(result(E),Form,Patient)

Figure 1-1: the semantic representation for “break”
Fig. 1-1 shows two different syntactic frames for the verb break. The first one is
called Basic Transitive and the second one Intransitive. In VERBNET the description
in logic representing the meaning of the sentence is centered on events, here denoted E.
In the first frame the Agent is the cause of the event, the Patient is in contact with
an instrument (unspecified in the sentence) for the duration of the event, and the
consequence of the event is that the integrity of the patient has been degraded;
the physical form of the object is now determined by the form that this action
typically implies (Form is a kind of built-in property that each verb in this class has
to specify, determining the form of the patient at the end of the action).
In this syntactic frame the verb supports a transitive construction, which means the
verb has a direct object. The proposed syntactic construction for this transitive
frame (“Agent V Patient” in the figure) has the subject preceding the verb followed
by an object.
The second frame is called Intransitive: here the verb does not have an object, but
only a subject. The preferred construction for such a syntactic frame (“Patient V” in
the example) requires the subject to precede the verb.
As you can see, this example says much more about this verb. It also says that in
the transitive case the subject corresponds to the deep¹ semantic role (also called
thematic role) of the agent and the object is mapped to the patient, whereas in the
intransitive case the subject is mapped to the patient role.

¹ Usually such roles are called “deep” to make a distinction between the weak syntactic roles (like
subject and object) that every verb has and the semantic roles (like agent and patient) that only some
verbs (like action verbs) have.
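The mapping just described can be pictured as a small table from surface positions to deep roles. The following sketch (with invented names, not the system's actual data structures) shows how the two frames of break assign roles differently:

    # Hypothetical frame table for "break": each syntactic frame maps
    # surface positions (subject/object) to deep thematic roles.
    BREAK_FRAMES = {
        "Agent V Patient": {"subject": "Agent", "object": "Patient"},  # transitive
        "Patient V":       {"subject": "Patient"},                     # intransitive
    }

    def assign_roles(frame, subject, obj=None):
        """Bind the words of a sentence to the deep roles of a frame."""
        mapping = BREAK_FRAMES[frame]
        roles = {mapping["subject"]: subject}
        if obj is not None:
            roles[mapping["object"]] = obj
        return roles

    print(assign_roles("Agent V Patient", "Mary", "mirror"))
    # {'Agent': 'Mary', 'Patient': 'mirror'}
    print(assign_roles("Patient V", "mirror"))
    # {'Patient': 'mirror'}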
Note also that in the transitive case some propositions hold, whereas in the
intransitive one fewer propositions are true; it could also happen (though this
example is not such a case) that the propositions that hold differ from one frame to
the other.
Since the sentence “Mary breaks the mirror” presents a transitive frame, we obtain
the following semantic representation, which corresponds to the transitive usage of
break (see Fig. 1-2).
    cause(Mary,E) contact(during(E),?Instrument,mirror)
    degradation_material_integrity(result(E),mirror)
    physical_form(result(E),?Form,mirror)

Figure 1-2: lexical semantics representation for the sentence “Mary breaks the mirror”
We therefore based our lexicon (as regards verbs) on VERBNET, which implements
a lexical semantics approach built on Levin's theory. In this way we also avoided
writing a different semantic representation for each verb in the lexicon. In fact,
Levin's theory groups verbs into classes on the basis of their meaning and syntactic
constructions, so the number of semantic representations to be built decreases
noticeably.
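The saving can be sketched as follows: one semantic template is stored per class, and member verbs merely point to it (the member list below is illustrative, not the full break-45.1 class):

    # One semantic template per class, shared by all member verbs
    # (a few illustrative members of the break-45.1 class).
    VERB_CLASSES = {
        "break-45.1": {
            "members": ["break", "crack", "shatter", "smash", "snap"],
            "transitive_semantics": [
                "cause(Agent,E)",
                "contact(during(E),?Instrument,Patient)",
                "degradation_material_integrity(result(E),Patient)",
                "physical_form(result(E),Form,Patient)",
            ],
        },
    }

    def class_of(verb):
        """Find the class of a verb; one template now covers all members."""
        for name, cls in VERB_CLASSES.items():
            if verb in cls["members"]:
                return name

    print(class_of("shatter"))  # break-45.1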
VERBNET also introduces some restrictions on thematic roles. Since thematic roles
are always mapped to nouns, the restrictions introduced by VERBNET can be seen
as restrictions on the meaning of nouns.
It is clear, then, that we also needed a semantic resource for nouns, and one whose
meaning representation was compatible with that of VERBNET. This is why we
turned to CORELEX.
Let’s now consider an example: in the case of the verb class break-45.1 VERBNET
introduces the selectional restrictions shown in Figure 1-3.
    Agent[+int_control] Patient[+solid] Instrument[+solid]

Figure 1-3: restrictions for the class break-45.1
This is to say that (the lexical item bound to) the role of the agent should have
intentional control (int_control) over the event, and that both the patient and the
instrument must be solid entities (solid).
In our opinion the best way to enforce such restrictions was to use a classification
for nouns. In this way nouns with similar meanings are represented by a class. If the
classification is a good one, the restrictions can be easily and satisfactorily checked
once on a class covering a number of nouns.
The only lexical resource for noun semantics that we found suited to our purposes
was CORELEX; initially we intended to use WORDNET, but we then noticed that it
did not meet our needs.
CORELEX is an ontology for lexical semantics processing based on WORDNET, and
it classifies nouns in a way that suits our problem.
To have a clear idea of what we intended to do, let’s have a look again at the
example above. In the sentence “Mary breaks the mirror” the roles to be checked
are the agent and the patient since the instrument is not specified.
The agent role is mapped to the noun Mary. CORELEX does not include proper
nouns, but we can assume that proper nouns referring to people belong to a class
that satisfies the intentional control restriction.
The role of the patient is then bound to mirror. The problem here is that there is no
straightforward way to check whether an object is solid, unless we look at every
single noun in our corpus and specify for each of them whether it denotes a solid,
liquid or gaseous object. Moreover, solidity is a function of temperature, so nothing
can be said to be always solid. The feature that we believed came closest in meaning
to being solid is that of being a concrete entity. By concrete entity we mean
something physical and tangible, i.e. something that can be touched.
Therefore we have in our ontology a concrete-entity class. CORELEX classes like
artifact, physical_object, animal, human_being, etc. can be said to be concrete
entities, and therefore they all inherit from this class.
When we have to check whether a lexical item is solid, we only check that it is at
least a concrete entity, and add solid as a preference.
Going back to our example, mirror belongs to the artifact class in CORELEX, and
therefore it also satisfies the selectional restriction for the patient role.
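A minimal sketch of this check, assuming a simple parent-pointer encoding of the ontology (the structure is illustrative, not the actual implementation), could look like this:

    # A fragment of the noun ontology: each class points to its parent.
    # CORELEX-style classes such as artifact inherit from concrete_entity.
    PARENT = {
        "artifact":        "concrete_entity",
        "physical_object": "concrete_entity",
        "animal":          "concrete_entity",
        "human_being":     "concrete_entity",
    }
    NOUN_CLASS = {"mirror": "artifact"}

    def is_a(cls, ancestor):
        """Walk the inheritance chain from cls up to the root."""
        while cls is not None:
            if cls == ancestor:
                return True
            cls = PARENT.get(cls)
        return False

    def satisfies_solid(noun):
        # "solid" is relaxed to "concrete entity", keeping solid as a preference.
        return is_a(NOUN_CLASS.get(noun), "concrete_entity")

    print(satisfies_solid("mirror"))  # True: mirror -> artifact -> concrete_entity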
Large practical NLP applications require robust analysis components so that they
can effectively handle disfluent or extra-grammatical inputs regardless of the
grammar they employ. Such NLP systems must be prepared to frequently encounter
input that deviates from the coverage of their grammar. This is a requirement in our
case since, when corpora taken from real dialogues are used, extra-grammaticalities
occur very often in the data. We do not want to have to correct by hand each
sentence we deal with; instead we need an automatic way of handling such
ungrammatical input, so the parser should recognize the ungrammaticalities and
return all the possible parses for all the interpretations it can find for such errors.
Although large-scale parsing is available now, such parsers often don't produce
semantic representations, but only syntactic ones (as in the statistical approaches of
[Collins, 1999]).
The goal of this work was to achieve large coverage parsing but also to produce a
semantic representation based on lexical semantics.
We did so by building a natural language system that couples the notions coming
from VERBNET and CORELEX with the LCFLEX parser.
LCFLEX is a robust parser that is able to return parsed portions of the input when it
cannot find a complete parse. This is bound to happen when, as in the proposed
work, the corpus is large and contains informal writing and dialogues. Due to
corpus size, the parser is unlikely to have complete coverage. Moreover, the testing
data used in our work is prone to ungrammaticalities and typos.
Coupling LCFLEX and VERBNET proved to be a complex task. One important
issue was how to build the compositional semantics of the sentence from the
semantics of the verb and of its arguments and adjuncts. We took advantage of
AUTOSEM, an LCFLEX built-in tool that is able to build semantics
compositionally at parse time, given an ontology that we defined on the basis of
VERBNET's semantic knowledge.
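In spirit, the composition amounts to substituting the role fillers found by the parser into the class template; the following simplified sketch (not AUTOSEM's actual mechanism) reproduces the representation of Fig. 1-2:

    # Substituting role fillers into a class template, in the spirit of
    # building semantics compositionally at parse time (a hypothetical sketch).
    TEMPLATE = [
        "cause(Agent,E)",
        "contact(during(E),?Instrument,Patient)",
        "degradation_material_integrity(result(E),Patient)",
        "physical_form(result(E),?Form,Patient)",
    ]

    def compose(template, bindings):
        """Replace each role name with the word the parser bound to it."""
        out = []
        for prop in template:
            for role, word in bindings.items():
                prop = prop.replace(role, word)
            out.append(prop)
        return out

    for p in compose(TEMPLATE, {"Agent": "Mary", "Patient": "mirror"}):
        print(p)
    # cause(Mary,E)
    # contact(during(E),?Instrument,mirror)
    # degradation_material_integrity(result(E),mirror)
    # physical_form(result(E),?Form,mirror)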
The part of our work involving CORELEX was not easy either. A crucial step was
adding its knowledge to our lexicon and ontology in a way that made it possible to
check the restrictions introduced by VERBNET.
We focused on sentences that belong to task-oriented domains, in which actions
must be executed to achieve a goal. Since it is the verb that expresses the action, in
many systems much of the domain knowledge is built into verbs.
We built a lexicon and an ontology for a set of 7 verb classes covering a total of
109 English verbs, and 47 noun classes covering the meaning of 289 English
nouns.
We tested our ontology on 751 sentences taken from the home-repair portion of a 9
Megabyte written corpus originally collected at the University of Brighton, and
available for research purposes. The full corpus also includes other instructional
texts such as cooking recipes.
The results show that the system was successful on 96% of the tested sentences,
and produced good, although not optimal, semantic representations for a further
2.6%.
In the following chapters we will introduce the main components of our project:
WORDNET and VERBNET, LCFLEX, and CORELEX; we will then explain how we
integrated them, and finally we will present and discuss the results.