Introduction
This thesis provides an analysis of the meaning and use of a number of words that seem to
represent key concepts in the discourse of the European Union. The data collected, the
EUSC (European Union Speeches Corpus), consist of 45 public speeches delivered by 23
different speakers from the European Union setting. The EUSC contains a total of
97,708 words. The tool I used to analyse the corpus is WordSmith Tools (Scott 1999).
The idea of focusing on the future of the European Union in particular was
suggested by the fact that my experience as an undergraduate student in European
studies gave me several occasions to deal with political and economic
problems concerning the future of the European Union, from different perspectives. As
a language student, however, my interest lay in understanding the conventional
structures of meaning used in texts produced inside and for the European
area. I decided to focus on speeches first because they are easily available from the
Internet (and in large quantity too), and second because they seem to be one of the major
means through which issues related to the future of the European Union are
discussed. Books and research papers seem to be more specialised, economics-oriented types of
texts. A third reason was that I was interested in spoken rather than written language
and in the way these speakers gain confidence and trust from their audience. A further
aim of this thesis is, more simply, to show that a corpus of this type can help a language
student to develop language knowledge connected to a specific subject area,
whose nature is established by personal interests, e.g. on the basis of future working
opportunities.
The thesis is organised as follows. In the first chapter, I deal with some
theoretical background concerning corpus analysis. I briefly report on the debate
between rationalists and empiricists, which developed in the 1950s and 1960s
following strong criticism by the linguist Noam Chomsky, one of the main supporters of
rationalism, against empiricist theories. I discuss the role of Corpus Linguistics inside
linguistic theories and its expansion after the 1980s. In the second part of chapter 1 I
provide the basic information about corpora and I illustrate the main advantages of
corpora in investigating a language. Again in chapter 1 I deal with the differences
between large and small corpora, with a particular focus on the latter. I conclude
the chapter with a brief description of some studies concerning possible uses of small
corpora.
In the second chapter, I deal with issues in the methodology of corpus analysis
and criteria to select the words from my corpus, which I then analyse in the third
chapter. In the first part I describe the criteria of corpus design. In the second and third
part I illustrate respectively the compilation of the EUSC, following the criteria of
representativeness suggested by Biber (1993), and a description of the tool used to
analyse the corpus, WordSmith Tools, Version 3.0 (Scott 1999). Part four deals with a
couple of theoretical issues which I found of particular interest in my work: these are
mainly connected to a re-discussion of “units of meaning”, particularly following
Sinclair (1991, 1996). I conclude the chapter by identifying a procedure to select the 9
words from my corpus, which I analyse in detail in the following chapter.
In chapter 3, I deal with the analysis of 9 words I selected from a KeyWord
List calculated through a comparison of my corpus and the BNC Sampler. The words
are: UNION, ENLARGEMENT, STATES, INTEGRATION, FUTURE, APPLICANT,
OPENNESS, PILLAR and TRANSPARENCY.
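A KeyWord List of this kind rests on comparing word frequencies in the study corpus against a reference corpus. As a rough illustration of the idea (not of WordSmith's exact implementation, which offers its own keyness statistics), the sketch below scores candidate keywords with the log-likelihood (G2) measure commonly used for keyness; the two mini-corpora are invented placeholders, not the EUSC or BNC Sampler data.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) keyness score.
    a: word frequency in the study corpus, b: frequency in the reference corpus,
    c: study corpus size, d: reference corpus size."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(study_tokens, ref_tokens, top=5):
    """Rank the study-corpus words by keyness against the reference corpus."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = sum(study.values()), sum(ref.values())
    scored = {w: log_likelihood(study[w], ref.get(w, 0), c, d) for w in study}
    return sorted(scored, key=scored.get, reverse=True)[:top]

# Invented mini-corpora for illustration only
study = "the union and the enlargement of the union".split()
ref = "the cat sat on the mat and the dog slept".split()
print(keywords(study, ref, top=2))  # → ['union', 'enlargement']
```

Words that are unusually frequent in the study corpus relative to the reference corpus receive high scores, while common function words such as *the* score near zero.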
In chapter 4, I draw the conclusions of this research, going through the different
steps taken. I therefore restate the starting point of the analysis, the
methodology followed and, in the end, I sum up the main results obtained by
investigating the 9 words in the EUSC. In the last paragraph I give a personal
concluding opinion on the thesis and the subject area in which it is situated.
1 Introducing the area of research: Corpus
Linguistics, corpus-based analysis and small
corpora
The aim of this chapter is to introduce the subject area in which my study is
situated. In the first part I therefore provide a brief introduction to Corpus Linguistics
and a short excursus on its history, distinguishing between rationalist theories and
empiricist ones. After providing the basic notions about corpora and their advantages as
instruments for language investigation (part 2), I focus on the differences between large
and small corpora, their different purposes and design criteria (part 3). This last issue
will then be discussed again in detail in chapter 2, as regards small corpora in particular.
The last part illustrates some possible applications of small corpora to language studies.
1.1 What is Corpus Linguistics
Corpus Linguistics can simply be defined as the study of language based on naturally
occurring examples of language use. It can be considered a research methodology
applicable in various areas of linguistics, but it cannot be regarded as a “branch” of
linguistics in the same sense as syntax, sociolinguistics and so on.
The history of Corpus Linguistics is characterised, on the one hand, by studies
conducted before the advent of Chomsky, known as “early Corpus Linguistics”, and, on
the other, by the more modern phenomenon increasingly prevalent since the 1960s and
1970s, which is properly known as “Corpus Linguistics”.
Before the advent of Chomsky, namely before the 1950s, Corpus Linguistics
was characterised by studies of observed language use. In particular,
the focus of attention was on language acquisition by children, and the first primitive
corpora were mainly parental diaries recording children’s utterances. But children’s
language study was not the only concern of early corpus linguists. Between the 19th
century and the first half of the 20th century, other studies in the area of foreign
language teaching and comparative linguistics focused on the study of spelling
conventions. Even if the basic methods of analysis were well known in linguistics, there
was a certain discontinuity in the development of this discipline. In particular, in the
1950s, corpora as sources of data underwent a period of almost total unpopularity and
neglect. This was mainly due to the influence of Noam Chomsky, who strongly
criticised corpora as a source of linguistic information.
1.1.1 The debate between rationalists and empiricists
Around the 1960s a debate developed between rationalists and empiricists. The most
important supporter of the rationalistic theories was the linguist Noam Chomsky who
was also very influential in the development of grammar descriptions in the following
years. Rationalist theories were based on a so-called “theory of mind”. They started
from a theory based on artificial behavioural data and conscious introspective
judgements, which means, a native speaker of a language reflects on that language and
makes theoretical claims based on those reflections. Rationalists aimed at developing a
theory of language that only could emulate the external effects of human language
processing, but that also represented how the processing is actually undertaken.
The empiricists’ approach to language, instead, was dominated by the observation of
naturally occurring data, typically through the medium of the corpus. This meant that
empiricists were concerned not so much with the cognitive representation of language,
but rather with whether particular linguistic constructions actually occurred in a
language. Looking at corpora of texts, empiricists observed whether particular forms
were actually produced in the language in question.
As mentioned above, Chomsky deeply criticised the empiricist approach. What
Chomsky did was to change the object of linguistic enquiry from descriptions of
language to theories based on cognitively plausible models of language. He argued that
it is impossible to determine from any given language patterns what the relevant
performance phenomena in a language are. According to his perspective, linguists must
seek to model language competence rather than language performance. Chomsky, in
fact, believed that it is language competence, namely our tacit, internalised knowledge
of a language, which both explains and characterises a speaker’s knowledge of that
language. Performance, on the contrary, is the external evidence of language
competence and a poor mirror of it: as it can be influenced by factors other than our
competence, it cannot accurately display the typical characteristics of a language.
Corpus studies were obviously based on empiricist models and their aim was that of
studying language performance rather than language competence.
The crucial point of Chomsky’s criticism of Corpus Linguistics was
that it is impossible to determine from any given utterance what the linguistically
relevant performance phenomena are, as the corpus cannot guarantee that the
information contained in it is in any way generalisable. A somewhat extreme example
illustrating this concept concerns aphasics: if we analyse a corpus of transcribed
speech produced by aphasics without being told that the speakers are aphasic, we could
easily end up modelling features of aphasia as grammatical competence.
Another of Chomsky’s criticisms of Corpus Linguistics was that it is not
possible to analyse a language only by looking at skewed, partial collections of texts
like corpora. This had to do with a very important point raised by Chomsky, i.e. the
underlying assumption in Corpus Linguistics that language can be reduced to a finite
sample. Chomsky observed that no corpus could ever be comprehensive enough to
enumerate all possible occurrences of language, and that using a finite sample as a
source of analysis would inevitably be misleading.
1.1.2 The rise of Corpus Linguistics
Due to Chomsky’s strong criticism, then, Corpus Linguistics was certainly not a
mainstream discipline in the 1950s. However, some pioneers continued working with
corpus data throughout the 1950s, 1960s and 1970s, and thanks to their efforts corpus
work revived, with increasing success, in the 1980s. Many studies were undertaken
during this period in different areas, such as humanities computing, mechanolinguistics,
the study of English grammar and the work of the neo-Firthians
(refer to McEnery & Wilson 1996:20-24).
The work of Firth in particular deserves some explanation here. His work,
which developed in the late 1950s, is in fact of great importance because it revived
linguists’ interest in language performance. Firth outlined an approach to language
in which social context and the social purpose of communication are of basic
importance. In his collection of texts (1957, in McEnery & Wilson 1996:23) he states:
The central concept … is the context of situation. In that context are the human
participant or participants, what they say, what is going on. The phonetician can find
his phonetic context, and the grammarian and the lexicographer theirs.
Firth’s ideas dominated much of British linguistics for the best part of a generation. In
particular, the assumption that earned him much popularity among corpus linguists was
that “attested language… duly recorded is in the focus of attention for the linguist”
(1957, in McEnery & Wilson 1996:23). On the data side, his exhortation to study
“attested language” inspired the studies of the so-called neo-Firthian linguists, such as
Halliday and Sinclair, who worked in the tradition he established. On the
terminological side, Sinclair popularised Firth’s term collocation
(see Chapter 2, par. 2.4.1). The neo-Firthian corpus linguists have also inspired one
of the largest programmes of research, namely the COBUILD project, carried out by
Sinclair and his team from around the 1980s onwards. The COBUILD project and its
associated corpus, the Bank of English, are among the large corpus initiatives created in
response to Chomsky’s criticism that corpora were too small for most kinds of
language description.
A final point regarding the work of the neo-Firthians is that it is based on the
examination of complete texts and the construction of fairly open-ended corpora. The
other tradition of corpus building relies instead upon sampling and representativeness to
construct a corpus of a set size, and eschews the inclusion of complete texts within a
corpus.
This is, in general terms, how Corpus Linguistics developed over the years.
Even if a series of criticisms were levelled at corpus-based approaches
to language study over time, none of them proved fatal. Of course Chomsky’s criticism,
among others, has helped to foster a more realistic attitude towards corpora today. In
particular, his objection to the assumption, implicit in Corpus Linguistics, that language
can be reduced to a finite sample has contributed to shaping the new approach of
corpus linguists.
Corpus Linguistics today has become a viable methodology and is no longer the
so-called “pseudo-procedure” many critics claimed it to be. This is also thanks to the
advent of digital computers, which have made it possible to work with huge quantities
of data. Other techniques, such as parsing and tagging, have also contributed to
improving searches and developing more reliable methods of corpus analysis.
1.1.3 Qualitative versus quantitative analysis: searching for a
compromise
Chomsky’s argument against corpora was based upon the observation that when one
derives a sample of a language variety, it will be skewed; chance will operate so that
rare constructions may occur more frequently than in the variety as a whole and some
common constructions may occur less frequently than in the variety as a whole. This
point on rarity and commonness appears to assume that Corpus Linguistics is a
quantitative approach. This is at least partly true, because a corpus, considered to be a
maximally representative finite sample, enables results to be quantified and compared to
other results in the same way as in any other similar scientific investigation. But it is not
essential that corpus data be used solely for quantitative research and, in fact, many
researchers have used them as a source of qualitative data. For this reason I think it is useful
to look at the relationship between quantitative and qualitative approaches to corpus
analysis.
Qualitative research aims at interpreting data and language phenomena, while
quantitative research classifies features, counts them and even constructs more complex
statistical models in an attempt to explain what is observed. Schmied (1993, in McEnery
& Wilson 1996) said that a stage of qualitative research is often a precursor for
quantitative analysis, since, before linguistic phenomena are classified and counted, the
categories for classification must first be identified. It is more useful, however, to
consider these two as forming two different, but totally complementary, perspectives on
corpus data. Qualitative analysis offers, on the one hand, richness and precision,
because it provides the interpretation of the data. Quantitative analysis, on the other, is
statistically reliable and gives “objective” results. Both qualitative and quantitative
analysis, however, have some disadvantages.
In qualitative analysis the findings cannot be extended to larger populations
with the same degree of certainty with which quantitative analysis can, because,
although the corpus may be statistically representative, the specific findings of the
research cannot be tested to discover whether they are statistically significant or more
likely to be due to chance. In contrast to qualitative analysis, the quantitative one does
allow for its findings to be generalised to a larger population, and furthermore, it means
that direct comparisons may be made between different corpora. Quantitative analysis
thus enables one to discover which phenomena are likely to be genuine reflections of
the behaviour of a language or variety and which are merely chance occurrences.
The picture of the data that emerges from qualitative research is based merely on
the observation of a phenomenon, which may occur even just once. Quantitative analysis,
instead, gives a precise idea of the frequency and rarity of particular phenomena and,
hence, arguably, of their relative normality or abnormality. Quantitative analysis,
nevertheless, forces the researcher to carry out some classification of the data. That means that
he has to decide for himself whether an item belongs to a given class or not. For
example, looking at the word red, we would have to decide whether to put the word in
the category “colour” or in the category “politics”. Some phenomena may clearly
belong to more than one class, and this inevitably creates doubts for the analyst.
Quantitative analysis may therefore sometimes entail a certain
idealisation of the data. At the same time it also tends to sideline rare occurrences,
because it bases its frequency lists on statistical significance tests. That is to say, the
computer requires the specification of minimum frequency values, so that it can
exclude those occurrences regarded as “non-relevant”. This inevitably results in a loss of
data richness.
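The minimum-frequency cut-off just described can be sketched in a few lines of Python. The threshold value and the toy token list below are arbitrary illustrations, not defaults of any particular corpus tool:

```python
from collections import Counter

def frequency_list(tokens, min_freq=2):
    """Build a frequency list, discarding items below the cut-off.
    The discarded rare items illustrate the 'loss of data richness'."""
    counts = Counter(tokens)
    kept = {w: n for w, n in counts.items() if n >= min_freq}
    discarded = [w for w, n in counts.items() if n < min_freq]
    return kept, discarded

# Toy token list for illustration only
tokens = "union union enlargement future future future pillar".split()
kept, discarded = frequency_list(tokens, min_freq=2)
print(kept)       # {'union': 2, 'future': 3}
print(discarded)  # ['enlargement', 'pillar'] – the rare items excluded
```

Raising `min_freq` makes the resulting list more robust statistically, but each increase throws away more of the rare occurrences that a qualitative reading might have found interesting.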
The conclusion that should be drawn from these considerations is that both
qualitative and quantitative analysis have something to offer to corpus studies. In fact,
as McEnery & Wilson (1996:77) say, while qualitative analysis “can provide greater
richness and precision”, “quantitative analysis can provide statistically reliable and
generalisable results”. Therefore, they propose a so-called “multi-method approach”, which
has recently been shown to have many benefits for linguistic research.
1.2 Corpus-based analysis
In this section I am going to focus more deeply on the characteristics of corpora as tools
of analysis. I start by defining the notion of corpus, then give a general description of the
different types of corpora and illustrate the concept of representativeness. Finally, I
present the 10 major advantages of corpora in studying a language.
1.2.1 The basic information about corpora
a) Corpus definition
A corpus is simply a collection of texts. Within Corpus Linguistics, however, corpora
are intended to have three specific characteristics, namely:
1) the texts they contain are in electronic format and can therefore be read by a
computer,
2) the texts are put together because they share a common purpose, such as studying a
particular genre or a single aspect of a language, etc.,
3) the criteria chosen to select the texts have to be explicit or at least explicable.
(Bernardini & Gavioli 1999).
b) Types of corpora
Early corpora, those compiled without computer aids, centred on five major fields of
scholarship: biblical and literary studies, lexicography, dialect studies, language
education studies and grammatical studies (Kennedy 1998). Modern corpora are similar
in purpose to the early ones. What makes the two generations of corpora different is
size. As far as electronic corpora are concerned, in modern linguistics we distinguish
between so-called general corpora and specialised corpora. While the latter are
generally smaller and more similar to the early corpora, general corpora nowadays
contain hundreds of millions of words. The two types of corpora also differ in their
purpose: general corpora are compiled for general descriptive purposes and then used
for linguistic research, e.g. on lexis and grammar. Specialised corpora, on the other
hand, are created for very specific research projects, such as working out the features of
specialised language (e.g. learners’ language, English for Special Purposes, etc.).
Corpora can contain a total population of texts that we want to analyse, e.g. all the
editions of a newspaper or all publications of an author. These are called full-text
corpora and differ from the sample-text corpora, which instead only represent a sample
of the population, containing for example a definite number of reports on economics.
Texts are sampled following specific criteria, with the researcher bearing specific
objectives in mind, such as investigating particular characteristics of a genre (see
Kennedy 1998:19-23 for a more detailed description of the different types of corpora).
c) Representativeness
Representativeness in a corpus refers to the degree to which the language sample
contained in that corpus represents the entire language population to which it refers. In
other words, the higher the quantity of material representing a language population in
the corpus, the higher the probability that the whole population is properly represented
in that corpus. A corpus therefore needs to be representative in order to be
appropriately used as a basis for generalisations concerning a language.
The concept of representativeness changes depending on whether we refer to large
corpora or to small ones. As regards the former, since the aim of large corpora is to
represent a language as a whole, they have to contain the greatest possible quantity and
variety of texts referable to that language. As far as the latter are concerned, instead,
since small corpora usually aim at representing only a particular aspect of a language,
they must contain texts of the same type and with the same characteristics.
Going back to Chomsky’s criticism, achieving an appropriate
representativeness of a language in a corpus would be a very difficult, if not impossible,
task. This is because quantitative analysis entails risks connected with generalising a
single phenomenon found in a sample to some larger population. It is, on the other
hand, also true, as mentioned again in McEnery & Wilson (1996:78), that if such
criticism were accepted, it would have to be applied not only to language corpora, but to
any form of scientific investigation based on sampling rather than on the
exhaustive analysis of an entire and finite population. On the basis of these
considerations I agree with McEnery & Wilson (1996) when they say that Chomsky’s
criticism should not be taken so drastically, since corpus linguists have developed many
safeguards and methods which may be applied in sampling for the maximal possible
representativeness. The discussion on representativeness will continue, however, in par.
1.3.2 of this chapter and in par. 2.1 of chapter 2, where it will be explored further by
looking at additional features for achieving representativeness in corpus design.
1.2.2 The main advantages of corpora in investigating a language
Bowker & Pearson (2002), in the first chapter of their book, give a very exhaustive
description of the ten major advantages of corpora in studying a language, compared
to traditional resources such as dictionaries and other printed tools. Here I report them
in a very brief “top-ten” list.
1) Corpora have none of the physical constraints typical of printed media; they are more
extensive than other resources. In fact, thousands of words of running text can be
stored on a diskette and millions can fit easily on to a hard drive or optical disk.
2) Corpora are easier to update. Their electronic form allows them to be updated
regularly and very easily, compared, e.g., to dictionaries. As the work of
lexicographers designing a dictionary is intensive and long, it can even be the case
that by the time a dictionary has been published, new words have already appeared in
that language that could not be included in that edition. This does not
happen with corpora, which can be collected very easily and at any moment.
3) Corpora are easier and faster to consult. Searching for a word or phrase in a printed
text is a labour-intensive and time-consuming task. It is again the electronic form of
corpora, and the numerous applications created to analyse them, that allows
the linguist to spend relatively little time on the analysis procedure.
4) They contain a wealth of authentic usage information. Since corpora comprise texts
written by subject field experts, LSP (Language for Special Purposes) learners have
before them a body of evidence pertaining to the function and usage of words and
expressions in the LSP of the field.
5) With the aid of corpus analysis tools, it is possible to sort the contents so that
meaningful patterns are revealed.
6) Frequency information is easily obtainable to discover interesting lexical patterns.
Knowledge about frequency makes it possible to analyse the lexical patterns associated
with words in a more consistent and objective way. Such observations are difficult to
make when working with printed documents, since the human eye may simply not
notice a pattern when its occurrences are spread over several pages of documents.
7) Corpora can be consulted at any moment. Once they have been collected and
stored, they are always available for consultation. This is not the case with actual
subject field experts. The unrestricted availability of a corpus is important because
language learning goes on all the time.
8) Corpora display different opinions of expert users simultaneously. In fact, they
contain various articles written by different subject field experts, which means that
LSP learners have access to more than one expert opinion and are better able to judge
whether terms or expressions are generally accepted in the subject field, or whether
they are simply the preference of one particular expert.
9) Corpora are an objective frame of reference. Unlike judgements based on intuition,
naturally occurring data has the principal benefit of being observable and verifiable
by all who examine it.
10) ONE-STOP SHOP: you can retrieve different types of information from the same
corpus. Because a corpus consists of naturally occurring running text, it is
possible to retrieve information about both lexical and non-lexical (e.g. style,
punctuation, register, semantics, etc.) elements of language.
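Advantages 5 and 6 both rest on the concordance, or KWIC (Key Word In Context) display, which lines up every occurrence of a node word with its surrounding context so that recurrent patterns become visible at a glance. A minimal sketch of the idea follows; the sentence used is an invented example, not a line from the EUSC:

```python
def kwic(tokens, node, span=3):
    """Return KWIC lines: `span` words of context on either side of the node word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            # Right-align the left context so every node word lines up vertically
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

# Invented example text for illustration only
text = ("the future of the union depends on the enlargement of the union "
        "and on reform of the union institutions").split()
for line in kwic(text, "union"):
    print(line)
```

Dedicated concordancers such as WordSmith Tools add much more on top of this, notably sorting the lines by left or right context, which is precisely what makes repeated patterns around a word stand out.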