3. THE EXPERIMENT
3.1 Aims and hypothesis
The aims of this study are to find out whether articulatory training with
ultrasound (US) is more effective than perceptual training, and whether a short
training session can help Italian learners reproduce the English contrast
/ɑ-ʌ/. These aims are motivated by the fact that Italian learners find it
extremely difficult to perceive and produce the contrast since, following the
phonetic-phonological system of their mother tongue, they produce /ɑ/-L2 as
/ɔ/-L1 and /ʌ/-L2, often, as /a/-L1 (S. D’Apolito, 2017). English orthography
represents a further difficulty. Italian orthography is relatively transparent,
so native Italian speakers tend to follow the orthographic form and, as a
consequence, mispronounce written English words, whose spelling is
phonologically opaque. Although both types of training may help learners move
from the most similar L1 sound toward a new vowel target position, the US
articulatory training is expected to be more effective than the perceptual one,
since direct feedback can help learners understand how to position the tongue
to produce the L2 vowels, or at least to differentiate them adequately from
similar L1 vowels. The US training is also expected to give subjects better
control of tongue position, leading to a more stable tongue position over time.
3.2 Method
3.2.1 Subjects
Nine female subjects from Southern Apulia (Salento) participated in the
experiment. All speakers are monolingual and have never spent more than a
month in a foreign country. The subjects started studying English
as a foreign language at a mean age of 8.5 years and were mainly exposed to
L1-accented English teachers. Participants were divided as follows: i) three
experimental subjects (SPK1, SPK2, SPK3) performed the articulatory US
training (ES-US); ii) three experimental subjects (SPK4, SPK5, SPK6) performed
the perceptual training only (ES-P); and iii) three control subjects (SPK7,
SPK8, SPK9) did not receive any training (CS). One female native speaker of
American English was recorded to collect native-speaker US data, and 5 native
speakers of American English were recorded to collect the acoustic data used
as perceptual stimuli during the training.
3.2.2 Training procedures
A one-hour session was planned for each type of training. Both experimental
groups received information about orthography, that is, the grapheme-to-sound
correspondences of the American English contrast, and about the phonetic
differences between the non-native vowels /ɑ-ʌ/ and between these and the
closest native vowels /a-ɔ/. Graphic representations of the F1 (tongue height)
and F2 (tongue backness) vowel space of the American English contrast and of
the Italian vowels were provided, followed by verbal instructions about their
articulation. The two experimental groups then underwent different training.
The ES-US group received only audio-articulatory training by means of the US
machine (Toshiba Aplio XV), which offers real-time biofeedback of the learner’s
tongue position together with a real-time movie of the native speaker’s tongue,
used as a visual model. The US training thus started by showing the native
speaker’s movies (with audio) of the tongue contour during the production of
/ɑ-ʌ/ in isolation, then in CVC words, and finally in entire real sentences.
Once subjects reported having detected the acoustic and articulatory
differences between the target vowels, and between them and the native vowels
/a-ɔ/, they started practicing the production of the L2 sounds with the US
probe under their own chin. This allowed the subjects to see their own
tongue profile on the screen and to adjust their tongue position to match the
native speaker’s movements.
The ES-P group received only perceptual training, that is, an identification
test following the HVPT procedure. The perceptual training was delivered
through a web application created expressly by the CRIL laboratory staff.
During the training, participants were presented via headphones with one
auditory stimulus at a time. The stimuli were English CVC words (V = /ɑ-ʌ/),
produced by 5 English speakers and presented 4 times each. Participants were
asked to associate the vowel sound with one of two non-orthographic symbols
displayed on the computer screen. Non-orthographic symbols were
used to avoid inaccurate associations between sounds and English
orthography. A total of 150 trials was used in the training session (2 target
vowels x 5 talkers x 3 contexts x 4 repetitions + 2 control vowels x 3 talkers
x 3 contexts).
3.2.3 Speech recordings
Each recording session consisted of: i) pre-test data collection of L1 and L2
production, ii) one-hour training (for ES-US and ES-P) and iii) post-test data
collection of L1 and L2 production. The L1 corpus consisted of /pV1pV2/
words and pseudowords, where V1 was one of the five Italian native vowels
/i, ɛ, a, ɔ, u/ and V2 was /i/ or /a/, embedded in a carrier sentence (e.g.
“Dicevi pV1pi in su” or “Diceva pV1pa a Ken”). The L2 corpus consisted of
/pV1p/ American English words, where V1 was /ɑ/ or /ʌ/, inserted in a carrier
sentence (e.g. “I see pV1p inside”). All subjects read both corpora, displayed
on a PC screen, 12 times. L1 and L2 productions of the three groups were
collected in pre- and post-test sessions. Acoustic and US data were collected
simultaneously at the CRIL laboratory in a soundproof room. US data were
collected by means of a convex probe positioned under the subject’s chin
on the midsagittal plane, with a special stabilization set to fix the probe.
Acoustic data were analyzed using PRAAT, and the first three formants were
calculated over the central 40% of the entire vowel duration.
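The central-40% measurement can be illustrated with a minimal Python sketch. It is not the PRAAT procedure used in the study: the function names, the time-stamped formant track, and the sample values are all hypothetical, and the sketch only assumes that a formant track (time, value) and the labelled vowel onset and offset are available.

```python
from statistics import mean


def central_window(onset, offset, fraction=0.40):
    """Return (start, end) of the window covering the central `fraction` of a vowel."""
    if offset <= onset:
        raise ValueError("offset must be later than onset")
    margin = (offset - onset) * (1.0 - fraction) / 2.0
    return onset + margin, offset - margin


def mean_formant(times, values, onset, offset, fraction=0.40):
    """Average the formant samples whose time stamps fall inside the central window."""
    start, end = central_window(onset, offset, fraction)
    inside = [v for t, v in zip(times, values) if start <= t <= end]
    if not inside:
        raise ValueError("no formant samples fall inside the central window")
    return mean(inside)


# Hypothetical F1 track (Hz), one sample every 10 ms, for a vowel
# labelled from 0.100 s to 0.200 s. The values are made up.
times = [0.105 + 0.010 * i for i in range(10)]
f1_track = [600, 650, 700, 740, 750, 760, 750, 700, 640, 590]

print(mean_formant(times, f1_track, onset=0.100, offset=0.200))
```

Discarding the first and last 30% of the vowel keeps the measurement away from the consonant transitions, which is the usual motivation for averaging over a central window.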
3.3 Acoustic data
Acoustic data (22,050 Hz, 16 bit) were labelled and analyzed using PRAAT, a
freeware program for the analysis and manipulation of acoustic speech signals.
The program can generate spectrograms, intensity contours and pitch tracks,
edit a recorded sound, and label and extract individual sounds for further
analysis, as shown in Figure 1.
Figure 1: PRAAT’s graphical user interface
The top pane is the waveform, which shows time from left to right and the local
value of the sound signal on the vertical axis. The middle pane is the
spectrogram, and the bottom pane is where the transcriptions are entered.
These options were helpful in labelling and calculating the first three formants
(F1, F2, and F3), which were measured as the average of the values in the
central 40% of the entire vowel duration. Subsequently, F1 and F2 mean and
standard deviation values of the pre- and post-test target vowels were plotted
on a Cartesian F1-F2 plane for each subject. Data were statistically analyzed
by means of a series of independent tests (p<0.05) in order to compare:
1) /ɑ-ʌ/ in pre- vs post-test respectively; 2) /ɑ-ʌ/ in pre- and post-test vs
L1 /a-ɔ/ respectively.
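A pre- vs post-test comparison of this kind can be sketched as follows. The text does not specify which independent test was run, so the sketch assumes an independent-samples t-test in Welch's form; that choice, the function names, and the F1 values are illustrative assumptions rather than the study's actual analysis or data. The p-value is obtained by numerically integrating the Student's t density, so only the Python standard library is needed.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; `steps` must be even."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)


# Made-up F1 values (Hz) for one vowel, twelve repetitions per session.
pre_f1 = [712, 698, 705, 720, 690, 701, 715, 708, 695, 703, 710, 699]
post_f1 = [742, 730, 751, 738, 745, 729, 748, 735, 740, 733, 747, 739]

t, df = welch_t(pre_f1, post_f1)
p = two_sided_p(t, df)
print(f"t = {t:.2f}, df = {df:.1f}, significant at 0.05: {p < 0.05}")
```

The same function would be called once per formant and per comparison (pre vs post, and L2 vs the closest L1 vowel), which is what makes the analysis a series of tests.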
3.4 Articulatory data
3.4.1 EdgeTrak
The articulatory data were labelled using EdgeTrak, a computer program that
automates the tracking of tongue contours by extracting (x, y) coordinates
from the lower edge of the white curve in the ultrasound image. First, a few
points on the first tongue frame of the sequence are chosen manually (fig. 2);
EdgeTrak then uses an active contour model to determine the location of the
tongue edge in the image (fig. 3). If the automatic tracking of the tongue
edge does not produce satisfactory results, points can be manually added or
removed to obtain the best fit.
Figure 2: EdgeTrak manual contouring input (ES-SPK4).
Each following frame is determined by the same optimization process on the
basis of the contour of the first one (fig. 4). Automatic tracking accuracy is
quite high: the error between automatic and human expert tracking ranges
from 1.83 to 3.59 pixels (1 pixel = 0.295 mm), that is, roughly 0.54 to 1.06 mm.
3.4.2 Smoothing spline ANOVA and Bayesian intervals
Subsequently, the frames with tongue surface contours were compared by means
of smoothing spline analysis of variance (SS ANOVA).
Smoothing splines are a type of natural cubic spline, which is a piecewise
polynomial function that connects discrete data points called knots. The
smoothing spline contains two terms, one that attempts to fit the data and one
that penalizes a fit which does not have the appropriate amount of smoothness.
Figures 3 (left) and 4 (right): EdgeTrak automatic tracking of the tongue edge
for the first and subsequent frames (ES-SPK4).
Although the penalty term does not allow the function to fit the data precisely,
it ensures that the resulting spline has a suitable amount of smoothness (fig. 5).
The SS ANOVA is usually implemented in a statistical computing environment
such as R or S-PLUS. Because of its benefits, the SS ANOVA has been used, as
in this study, in applications that require a statistical technique to
determine whether the shapes of multiple curves are significantly different
from one another, taking into account their shape, rotation and translation and
comparing them as a whole.
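The two terms of the smoothing spline described above are commonly written as a single penalized least-squares criterion. The notation below is an assumption added for illustration and is not taken from the study: (x_i, y_i) are the n tracked contour points, f is a candidate curve, and λ ≥ 0 is the smoothing parameter that weights the roughness penalty against the fit to the data.

```latex
\[
\hat{f} \;=\; \arg\min_{f} \;
\underbrace{\sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^{2}}_{\text{fit to the data}}
\;+\;
\lambda \underbrace{\int \bigl( f''(x) \bigr)^{2} \, dx}_{\text{roughness penalty}}
\]
```

With λ = 0 the curve interpolates the points exactly, while larger values of λ shrink the second derivative and push the estimate toward a straight line; the "suitable amount of smoothness" corresponds to an intermediate value.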
Figure 5: Smoothing spline estimate from twelve post-test repetitions of L2 /ɑ/
(ES-SPK4). The x axis is the length of the tongue, and the y axis is the height
of the tongue. As in the ultrasound images, the tongue tip is on the right and
the tongue root is on the left.
Since the interaction may be significant even when only a small section of the
curves differs, Bayesian confidence intervals are used to determine whether
the curves are significantly different at any point in the comparison. Given
two groups of tongue contours, the SS for each contour set is termed the main
group effect, and a 95% Bayesian confidence interval is constructed around
each SS. The comparison of the two groups of tongue curves is performed
with the interaction diagram, which represents a plot of the difference of the