D2.1: Level Markup

Prosody

 

1. Coding Purpose

 

In this chapter we describe a framework for the annotation of speech corpora at the level of prosodic analysis. The scope of the Prosody Level includes phonetic transcription, intonation annotation and prosodic phrasing. The intended phenomena pertain to aspects of speech that are not explicitly represented in its orthographic transcription, which may be considered the starting point for the other linguistic annotation levels considered in MATE. The Prosody Level thus integrates the linguistic description of dialogues with information closer to their actual acoustic realization. The common reference to the speech signal makes it possible to align prosodic annotations with the orthographic transcription and with higher linguistic levels, enabling cross-level analyses.

 

1.1 Scope

 

Prosodic phenomena are specific to spoken language. They concern the way in which speech sounds are acoustically realized: how long, how high and how loud they are. Such acoustic modulations are used by human speakers to express a variety of linguistic or paralinguistic features, from stress and syntactic boundaries to focus and emphasis or pragmatic and emotional attitudes. Linguistics and speech technology have approached prosody from a variety of points of view, so that a precise definition of the scope of prosodic research is not easy. A main distinction can be drawn between acoustic-phonetic analyses of prosody and more abstract, linguistic, phonological approaches.

Linguistically relevant prosodic events contribute to expressing sentence structure: they highlight linguistic units by marking their boundaries and suggesting their function. Linguistic-phonological descriptions of prosody usually identify a set of prosodic units (phonological units with a scope wider than a segment) and a set of prosodic phenomena which are ‘superimposed’ on these units. Prosodic units are the natural scope of prosodic events. Several types of prosodic units (differing mainly in their scope) have been proposed: paragraphs, sentences, intonation groups, intermediate groups, stress groups, feet, syllables, morae, and so on. Although prosody is by definition suprasegmental, prosodic analyses often take the phoneme as their minimal unit, on which rhythmical variations are measured and intonation events are located. The family of prosodic phenomena includes the suprasegmental features of intonation, stress, rhythm and speech rate, whose variations express the function of the different prosodic units: for example, the prominent syllable in a word is marked by stress, a falling intonation contour marks the conclusion of a sentence, and a faster speech rate and lower intonation characterize a parenthetical phrase.

Such prosodic features are physically realized in the speech chain in terms of variations of a set of acoustic parameters. Acoustic-phonetic analyses identify the following ‘phonetic correlates of prosody’: fundamental frequency (f0), length changes in segmental duration, pauses, loudness, voice quality.

 

Depending on the research purpose and point of view, prosodic phenomena can be marked in a speech corpus by simple diacritics in its orthographic transcription, or by labels classifying intonation contours and unit boundaries according to some phonological theory, or by detailed measures of the acoustic-phonetic parameters.

We refer to D1.1 for a more detailed discussion concerning prosodic phenomena and their possible codings. What we state here are our minimal assumptions on the scope of prosodic coding:

• coding should take into account at least segmental duration, pauses and intonation

• it should consider the structuring role of prosody and provide means to delimit prosodic units by marking phrase boundaries

• finally, it should allow both detailed phenomenological descriptions and more abstract functional ones, providing distinct levels for phonetic and phonological annotation.

 

 

2. Existing Schemes

 

Coding prosody appears as a complex task, which has to deal with the intrinsic complexity of prosodic phenomena and with the variety of purposes, theories and points of view from which prosody can be approached. Such complexity is reflected in the wide variety of existing schemes that can be found in the literature. Examples of coding schemes more or less explicitly inspired by the different intonation theories and approaches are reviewed in D1.1. The review, by no means exhaustive, gives brief descriptions of the following schemes:

 

PROSPA

IPA

TEI

ToBI

SAMPA

SAMPROSA

INTSINT

SAMSINT

IPO

TSM

TILT

VERBMOBIL

KIM

PROZODIAG (Lund)

Goeteborg

 

These schemes differ both in the phenomena they cover and in their underlying theoretical assumptions. They reflect the different purposes of prosodic analysis, which range from the phenomenological description of prosody in itself, to the study of its relations with discourse structure, to its applications in speech technology (synthesis, recognition and dialogue systems). As stated in D1.1:

 

"Each experimental study has adopted some kind of prosodic representation suited to its purposes, from abstract labels to acoustic measures".

 

The schemes range from simple diacritic symbols integrating the orthographic transcription of corpora intended for linguistic analyses (e.g. TEI, Göteborg...), to theory-dependent phonological labels for intonation contours and phrasing (e.g. ToBI), to phonetic-acoustic representation of the f0 curve (e.g. INTSINT, IPO, TILT, ...).

 

The conclusion in D1.1 is that defining a unique ‘standard’ coding scheme by choosing a single prosody annotation scheme seems a difficult task at this moment. Although, in the era of large speech corpora, there is a definite need for a common notation allowing for easy data exchange and comparison, a single scheme would certainly dissatisfy some of the many points of view in the field, would be unsuitable to some of the intended purposes, would be too detailed or too poor, too theoretically committed or lost in insignificant details.

 

 

3. The MATE 'meta-scheme' for prosody annotation

 

3.1 The 'meta-scheme'

 

Due to the variety of points of view in prosodic studies and the difficulty in selecting the most representative coding schemes, the MATE proposal for the Prosody Level offers a 'meta-scheme', a framework where different existing notation conventions can be integrated and possibly new ones can be developed. The framework is detailed enough to suit the richer phonetic and phonological schemes and flexible enough to admit partial filling of its structure and to allow for different schemes to cooperate.

Its definition reflects the multi-level nature of prosodic research - the fact that prosody can be studied both with a phonetic and a phonological approach - and the useful distinction between prosodic units and prosodic phenomena.

 

The MATE ‘meta-scheme’ is a four-layer annotation structure, in which the different elements discussed in D1.1 can be accommodated. The sublevels are the following:

 

1) phonetic transcription

conceived for the representation of phonetic segments (the ‘phones’), but also of other phenomena related to the segmental aspects of prosody, such as pauses, and other sub-word units such as syllables

2) phonetic representation of intonation

intended for the phonetic annotation of intonation phenomena, where the shape of fundamental frequency curves (and possibly of other acoustic correlates of intonation, such as energy, which at present are not included in the meta-scheme) is described in detail, by means of stylization and/or explicit labelling

3) phonological representation of intonation

reserved for those schemes which annotate intonation from a phonological point of view, in terms of functional or underlying representations, and mark the role of relevant intonation events with respect to prosodic units

4) prosodic phrasing

intended for the segmentation of utterances in terms of high-level prosodic units (tone units, intonation groups, etc.)

 

The four levels do not represent a fixed hierarchy. The two phonetic levels, intended for phoneme segmentation and f0 description, are directly aligned with the speech signal and in this sense may be considered as base levels. The two phonological levels, describing the linguistically relevant intonation events and the prosodic structure of the utterance, keep a natural relationship both with the base prosodic levels and with other linguistic units. So, different links can be established between levels. It is conceivable to associate an intonation event such as "pitch accent" or a "boundary contour" to the word or phrase (orthographic level) on which it occurs as well as to the syllable or vowel on which it reaches its f0 target (phonetic transcription level) or to the corresponding configuration of pitch movements (phonetic description of f0). The following picture sketches the possible links between levels:

 

In the actual use of the scheme, the levels and their links can be fully or partially specified. In a linguistic text-oriented analysis, prosody could be considered in its function, leaving out the details of its realization. In this case, the sole phonological levels may be filled and linked to the orthographic level of words. Complex schemes like ToBI ([Silverman et al., 1992], D1.1A) could be used in this way, or simpler schemes providing labels to distinguish types of accents, associated with words, and types of intonation boundaries.

In a speech technology context, a more signal-oriented approach could be adopted. In order to recognize or synthesize prosodic patterns, detailed phonetic descriptions are necessary, requiring both phonetic segmentation and phonetic representation of intonation, in terms of pitch movements or target f0 levels. The annotator would in this case look at the signal to segment it, possibly stylize its f0 profile, and accurately label the stylized curve. For a complete analysis, the annotator would link the detected units and events (phonemes and f0 variations) to the phonological descriptions of intonation contours and phrase structure.

 

 

3.2 Instances of the 'meta-scheme'

 

The first goal of the MATE ‘meta-scheme’ is thus to provide an empty framework in which existing (or future) prosody annotation schemes can be represented in a common format, and are therefore compatible and easy to compare. It has also been conceived to allow the annotation of corpora using some of the most widely used existing annotation schemes. For each layer, (at least) one existing coding scheme has been adapted to XML, in order to be integrated within the MATE workbench and to provide both a ready-to-use instance of the meta-scheme and an example and guideline for the future adaptation of other schemes.

The chosen schemes for each level are the following:

 

1) phonetic transcription: SAMPA ([Wells et al. 1992]; D1.1A )

2) phonetic representation of intonation: INTSINT ([Hirst, 1991, 1994; Hirst & Di Cristo, 1998]; D1.1A), IPO ([’t Hart et al., 1990]; D1.1A)

3) phonological representation of intonation: ToBI (‘Tones’ layer) ([Silverman et al., 1992]; D1.1A)

4) prosodic phrasing: ToBI (‘Break-Indices’ layer)

 

Widespread schemes have been preferred as examples. In the case of phonetic description of intonation, two schemes have been selected in order to represent both the 'pitch movement' approach and the 'target level' one. It should be noted that for some schemes a reference definition is available, although not so strictly respected in actual applications (ToBI has a number of 'variants' and is subject to language-adaptation). For IPO, the reference is the classical text in which the methodology of perceptual analysis of intonation has been proposed, which was not explicitly intended to define a notation system. In any case, some simplifications or additions to the original schemes have been performed, in order to obtain a coherent adaptation.

 

As suggested above, each scheme can be used alone or can be integrated with the others. One could, for example, keep to the IPO methodology and use SAMPA for phoneme segmentation and IPO for f0 description (and possibly a newly defined IPO-like "pitch configuration" scheme for the phonological level), or integrate the four layers using SAMPA, INTSINT and ToBI. To allow such a modular approach, separate DTDs have been defined for each pair layer:scheme. These DTDs are included in the Annex.

 

The elements and attributes identified in the selected schemes are described in detail in the following. It should be noted that level 2), both in its IPO and INTSINT instances, has an inner structure corresponding to a typical three-step procedure in the phonetic annotation of intonation: obtain the raw f0 curve (element <f0>), stylize it (elements <closecopy> and <momel>) and label it (elements <pitmove> and <intone>). At the phonetic segmentation level, a useful extension is the <syllable> element, to which the element <phone> can be subordinated and to which the phonological intonation labels could profitably be linked. For the other levels a single main element is defined: <tobitone> (<target>, <f0range>, and <repair> are accessory information), <breakindex>.

 

The list of elements adapted to XML, which is accordingly available for use in the MATE workbench, is the following:

 

1) Phonetic transcription

 

<syllable>

<phone>

 

2) Phonetic representation of intonation

 

<f0>

<closecopy> (IPO)

<pitmove> (IPO)

<momel> (INTSINT)

<intone> (INTSINT)

 

3) Phonological representation of intonation

 

<tobitone>

<target>

<f0range>

<repair>

 

4) Prosodic phrasing

 

<breakindex>

 

In the following, each pair layer:scheme will be described in a separate paragraph. For layer 2, to avoid duplication of descriptions, a single description will be given of the element <f0> for the raw f0 curve, that is present in both schemes IPO and INTSINT. Moreover, it should be noted that there is apparently no formal difference between the respective elements for the stylized curve <closecopy> and <momel>, both consisting in target points on the f0 curve. The substantial difference is in the intended interpolation function between the target points, which is linear for <closecopy> and parabolic for <momel>, and in the intended stylization procedure (manual vs. automatic).
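The difference between the two interpolation laws can be sketched as follows. This is a minimal illustration and not part of the MATE DTDs: the function names are hypothetical, and the parabolic variant assumes one common reading of "parabolic" interpolation, namely two parabolas joined at the midpoint between two targets, each with zero slope at its own target.

```python
def interpolate_linear(p0, p1, t):
    """Straight-line f0 between two target points (t0, f0) and (t1, f1)."""
    (t0, f0), (t1, f1) = p0, p1
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


def interpolate_parabolic(p0, p1, t):
    """Two parabolas joined at the midpoint, each flat at its own target.

    Assumption: this is only one possible parabolic law, chosen for
    illustration; it is not taken from the scheme definitions.
    """
    (t0, f0), (t1, f1) = p0, p1
    x = (t - t0) / (t1 - t0)  # normalized position between the targets, 0..1
    if x <= 0.5:
        return f0 + 2.0 * (f1 - f0) * x * x
    return f1 - 2.0 * (f1 - f0) * (1.0 - x) ** 2


# Two target points: 100 Hz at 0 ms and 200 Hz at 100 ms.
low, high = (0.0, 100.0), (100.0, 200.0)
for t in (0.0, 25.0, 50.0, 75.0, 100.0):
    print(t, interpolate_linear(low, high, t), interpolate_parabolic(low, high, t))
```

Both functions pass through the same target points, so the two stylizations can share one data format; only the curve drawn between the points differs.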

 

 

4. Layer 1: Phonetic Transcription - SAMPA scheme

 

4.1 Markup Declaration

 

The layer of phonetic transcription is a base level intended for the representation of the minimal units for phonetic and prosodic analysis: phones and syllables. The level defines a base element, the <phone> element, corresponding to a segment in the speech signal, labeled according to its phonetic features. A <syllable> element may be added, consisting of a sequence of <phone> elements. The annotation at this level is both a transcription and a segmentation, in the sense that it refers directly to the speech signal, recognizes the uttered sounds and splits the speech continuum into phonetic chunks. Each <phone> is then classified with a phonetic label and associated with time information specifying its start and end instants. Higher linguistic levels, like the phonological prosodic levels or the orthographic word level, may inherit time information from the phonetic level by linking their elements with <phone> or <syllable> elements.

The scheme adopted here for phonetic transcription is SAMPA [Wells et al., 1992], which is intended for multi-lingual phonetic transcription. In the original SAMPA notation, a transcription is a stream of phonetic labels and diacritics, where labels classify phones and diacritics give further specifications about phones, with the exception of stress marks which implicitly refer to the following syllable. In our adaptation, the <syllable> element is made explicit as a second layer built on top of the <phone> layer.

 

4.2 The <phone> element

 

4.2.1 Description

 

For the annotation of phones, SAMPA (SAM Phonetic Alphabet) has been chosen, providing a multilingual and computer-readable inventory of phonetic symbols.

The transcription task using SAMPA involves the use of a set of symbols and diacritics, which can be combined to represent the phonetic realisation of phones.

 

The SAMPA symbols considered here provide labels for vowels and consonants. A further symbol (taken from the SAMPROSA extension of the SAMPA scheme) is used for pauses, which are marked as a special kind of sound. Symbols can be combined in some cases, e.g. two vowel symbols may be combined to represent a diphthong. The set of allowable combinations may be language-dependent. A few diacritics are also available to mark additional features of phones: e.g. the length mark ":" may follow a phonetic label. The base symbols are listed below:

 

a) consonants

 

IPA symbol   SAMPA symbol   phonetic description
b            b              voiced bilabial plosive
c            c              voiceless palatal plosive
ç            C              voiceless palatal fricative
d            d              voiced dental/alveolar plosive
ð            D              voiced dental fricative
f            f              voiceless labiodental fricative
ɡ            g              voiced velar plosive
ɣ            G              voiced velar fricative
h            h              voiceless glottal fricative
j            j              palatal approximant
k            k              voiceless velar plosive
l            l              dental/alveolar lateral approximant
ʎ            L              palatal lateral approximant
m            m              bilabial nasal
n            n              dental/alveolar nasal
ɲ            J              palatal nasal
ŋ            N              velar nasal
p            p              voiceless bilabial plosive
r            r              alveolar trill
ʀ/ʁ          R              uvular trill/fricative
s            s              voiceless alveolar fricative
ʃ            S              voiceless postalveolar fricative
t            t              voiceless dental/alveolar plosive
θ            T              voiceless dental fricative
v            v              voiced labiodental fricative
w            w              labial-velar approximant
x            x              voiceless velar fricative
ɥ            H              labial-palatal approximant
z            z              voiced alveolar fricative
ʒ            Z              voiced postalveolar fricative
ʔ            ?              stød, glottal stop

 

 

b) vowels

 

IPA symbol   SAMPA symbol   phonetic description
a            a              open front unrounded
ɑ            A              open back unrounded
æ            {              near-open front unrounded
ɐ            6              near-open central unrounded
ɒ            Q              open back rounded
ɔ            O              open-mid back rounded
e            e              close-mid front unrounded
ɛ            E              open-mid front unrounded
ə            @              mid central unrounded (schwa)
ɜ            3              mid central unrounded
i            i              close front unrounded
ɪ            I              near-close front unrounded lax
o            o              close-mid back rounded
ø            2              close-mid front rounded
œ            9              open-mid front rounded
ɶ            &              open front rounded
u            u              close back rounded
ʊ            U              near-close back rounded lax
ʉ            }              close central rounded
ʌ            V              open-mid back unrounded
y            y              close front rounded
ʏ            Y              near-close front rounded lax

 

 

c) pause

 

SAMPA (SAMPROSA) symbol   phonetic description
...                       silent pause

 

 

The following SAMPA diacritics may be combined with the phonetic label (preceding or following it, according to the syntax suggested by the example):

 

SAMPA symbol   phonetic description   example of use
~              nasalization           O~
=              syllabic consonant     =n
:              length mark            a:

 

The user is referred to Wells et al. (1992) for a detailed description of the SAMPA symbols and their usage. More information is also available at ‘http://www.phon.ucl.ac.uk/home/sampa/home.htm’, including guidelines for the use of SAMPA for transcription in the following languages: Bulgarian, Croatian, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish and Swedish. A description of the SAMPROSA scheme can be found at ‘http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm’.

 

4.2.2 Data Source

 

Phonetic transcription is usually carried out from speech files, where the speech sound is sampled. Speech files can be listened to and graphically displayed on a time axis, so that phones can easily be time-aligned to sound.

 

4.2.3 Segmentation/selection

 

Phonetic transcription is a segmentation task: the speech sound is segmented into a sequence of adjacent chunks, each corresponding to a <phone>. While in principle one could listen to the recorded speech and write down the perceived phones and the corresponding time (measured by a clock...), a reasonable segmentation procedure should rely on sampled speech, graphically shown as a waveform on a time axis and possibly also displayed in its spectrographic representation. The annotator would select a signal portion on the screen, listen to it and inspect its shape. Each phone will be characterized by its peculiar shape and show two transition zones where the boundaries with the adjacent phones should be placed. On this basis the annotator would recognize the uttered phone and segment it, possibly by clicking on its start and end point on the screen.

 

4.2.4 Assignment

 

The attributes considered here for the <phone> element are the following:

 

• type: label specifying the type of phone according to its phonetic classification. The phone is recognized by listening to the corresponding sound and looking (if necessary) at the available signal representations; it is then classified according to the phonological system of the language and represented by the corresponding SAMPA label. The list of SAMPA base symbols and diacritics is given above. Symbols may be combined with each other or with diacritics into complex labels.

• start: start time of the phone, expressed in milliseconds from the beginning of the sound file. The start time of a phone will generally coincide with the end time of the preceding one, as phonetic transcription is a segmentation of the speech signal. The exact determination of the boundary between adjacent phones is somewhat arbitrary, as the articulatory and acoustic transition between sounds is smooth. For each class of phones, a set of conventions can be established to prescribe where to place the boundary, depending on the available signal representations (waveform, spectrogram, etc.). A consistent application of explicitly stated segmentation criteria is recommended.

• end: end time of the phone, in milliseconds from the beginning of the sound file.
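The expectation that each phone starts where the preceding one ends can be checked mechanically. The following is a minimal sketch using the time values of the 'casa' example in section 4.4; the <phones> wrapper element and the function name are assumptions made only so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# The <phones> root is a hypothetical wrapper, not defined by the scheme.
PHONES = """<phones>
  <phone id="phn_01" type="k" start="345" end="390"/>
  <phone id="phn_02" type="a" start="390" end="450"/>
  <phone id="phn_03" type="s" start="450" end="490"/>
  <phone id="phn_04" type="a" start="490" end="540"/>
</phones>"""


def find_gaps(xml_text):
    """Return (previous_id, next_id) pairs whose boundaries do not coincide."""
    phones = ET.fromstring(xml_text).findall("phone")
    problems = []
    for prev, nxt in zip(phones, phones[1:]):
        if float(prev.get("end")) != float(nxt.get("start")):
            problems.append((prev.get("id"), nxt.get("id")))
    return problems


print(find_gaps(PHONES))  # an empty list means the segmentation is contiguous
```

A non-empty result is not necessarily an error, since the text only says that boundaries will "generally" coincide; it simply points the annotator to places worth reviewing.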

 

 

4.3 The <syllable> element

 

4.3.1 Description

 

In many prosodic descriptions the syllable is taken as the minimal prosodic unit, the building block of the rhythmical structure and the scope of intonation events. Formally, it is a sequence of one or more phonemes centered on a vocalic nucleus. Its precise definition is language- and theory-dependent. In SAMPA, the diacritics for primary and secondary stress are inserted at the beginning of the stressed syllable: e.g. ["meZ@] (measure), [@"nVD@] (another). So, even if the prosodic extension of SAMPA (SAMPROSA [Gibbon, 1989]) is not taken into account, the notion of syllable is implicit in SAMPA phonetic notation. Here a <syllable> element is defined explicitly, linked to its component <phone> elements and possibly carrying the stress mark, according to the following definition:

 

SAMPA symbol   phonetic description
"              primary stress
%              secondary stress

 

4.3.2 Data Source

 

Syllables are defined starting from <phone> elements.

 

4.3.3 Segmentation/selection

 

After the phonetic transcription has been obtained, syllables are defined by selecting their component phones, from syllable boundary to syllable boundary, according to the phonetic syllabification rules of the language. Each syllable is then judged as to its degree of stress. Language-dependent automatic procedures could be implemented for syllabification.

 

4.3.4 Assignment

 

The attributes considered here for the <syllable> element are the following:

 

• stress: optional label specifying whether the syllable is stressed, with primary (") or secondary (%) stress; if not specified, the syllable is unstressed

• href: a sequence of <phone> elements

• start: start of the first phone of the syllable, inherited from <phone>

• end: end of the last phone of the syllable, inherited from <phone>

 

4.4 Examples

 

The following example shows the phonetic transcription of the Spanish word 'casa' ('house') and its corresponding syllabic segmentation, using the <phone> and <syllable> elements:

 

phone.xml

 

<phone id="phn_01" type="k" start="345" end="390"/>

<phone id="phn_02" type="a" start="390" end="450"/>

<phone id="phn_03" type="s" start="450" end="490"/>

<phone id="phn_04" type="a" start="490" end="540"/>

 

syllable.xml

 

<syllable id="sllbl_01" stress="&quot;" href="phone.xml#id(phn_01)..id(phn_02)"/>

<syllable id="sllbl_02" href="phone.xml#id(phn_03)..id(phn_04)"/>
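The way a <syllable> inherits its time information from the <phone> elements it points to can be sketched as follows. This is an illustration only: it assumes that href values are quoted and that the ids in the range match those in phone.xml, it wraps each file's content in a hypothetical root element, and the range syntax is parsed with an ad hoc regular expression rather than a real pointer resolver.

```python
import re
import xml.etree.ElementTree as ET

PHONE_XML = """<phones>
  <phone id="phn_01" type="k" start="345" end="390"/>
  <phone id="phn_02" type="a" start="390" end="450"/>
  <phone id="phn_03" type="s" start="450" end="490"/>
  <phone id="phn_04" type="a" start="490" end="540"/>
</phones>"""

SYLLABLE_XML = """<syllables>
  <syllable id="sllbl_01" stress="&quot;" href="phone.xml#id(phn_01)..id(phn_02)"/>
  <syllable id="sllbl_02" href="phone.xml#id(phn_03)..id(phn_04)"/>
</syllables>"""

# Matches the range part of an href such as "phone.xml#id(a)..id(b)".
HREF_RANGE = re.compile(r"#id\((\w+)\)\.\.id\((\w+)\)")


def syllable_times(phone_xml, syllable_xml):
    """Resolve each syllable's phone range and inherit start/end from it."""
    phones = ET.fromstring(phone_xml).findall("phone")
    order = [p.get("id") for p in phones]
    by_id = {p.get("id"): p for p in phones}
    result = {}
    for syl in ET.fromstring(syllable_xml).findall("syllable"):
        first, last = HREF_RANGE.search(syl.get("href")).groups()
        span = order[order.index(first): order.index(last) + 1]
        result[syl.get("id")] = {
            "phones": "".join(by_id[i].get("type") for i in span),
            "start": float(by_id[first].get("start")),
            "end": float(by_id[last].get("end")),
            "stress": syl.get("stress"),  # None when the syllable is unstressed
        }
    return result


for syl_id, info in syllable_times(PHONE_XML, SYLLABLE_XML).items():
    print(syl_id, info)
```

Because start and end are derived rather than stored, a later correction of a phone boundary propagates to the syllable without any further editing.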

 

 

4.5 Coding Procedure

 

Manual phonetic segmentation would be helped by a software tool displaying the speech signal in its waveform and spectrographic representations, allowing listening, selecting signal portions, zooming, selecting pre-defined phonetic labels, choosing segmentation points on the time axis. The set of allowable phonetic labels (for the given language) should be defined in the DTD (the DTD included in the Annex does not define language-dependent symbol sets), while a specific coding guideline document will explicitly state the adopted set of segmentation criteria. The coding procedure would then be:

1. select the speech file and open the synchronized windows for phonetic segmentation and waveform and spectrum display

2. zoom until a detailed inspection of the signal is possible

3. inspect and listen to the signal portion until the uttered phonemes are recognized

4. select a phonetic label for the first phone

5. identify its boundaries according to the segmentation criteria and mark them by placing the cursor on the proper point on the time-axis (this should automatically set the time attribute)

6. after phonetic segmentation is concluded, define syllables by selecting their component <phone>'s and, if stressed, by assigning the proper stress mark

 

Tools for automatic segmentation are available, though they are often language-dependent. Phonetic aligners, which align a speech signal to a predefined phonetic transcription, perform well. The procedure in this case would be:

1. listen to the speech sound and transcribe it as a sequence of phones

2. apply the phonetic aligner to the speech signal with its phonetic transcription and obtain its phonetic segmentation

3. import the phonetic segmentation in the MATE environment

4. define syllables as in step 6 above.
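Step 3 of the aligner-based procedure amounts to converting the aligner output into <phone> elements. A minimal sketch is given below; the input format (a list of label, start and end values in milliseconds), the <phones> wrapper element, the id pattern and the function name are all assumptions, since aligner output formats vary from tool to tool.

```python
import xml.etree.ElementTree as ET


def phones_from_alignment(segments):
    """Build <phone> elements from (sampa_label, start_ms, end_ms) tuples."""
    root = ET.Element("phones")  # hypothetical wrapper element
    for n, (label, start, end) in enumerate(segments, start=1):
        ET.SubElement(root, "phone", {
            "id": f"phn_{n:02d}",
            "type": label,
            "start": str(start),
            "end": str(end),
        })
    return root


# Hypothetical aligner output for the first two phones of 'casa'.
root = phones_from_alignment([("k", 345, 390), ("a", 390, 450)])
print(ET.tostring(root, encoding="unicode"))
```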

 

 

4.6 Markup Table

 

<phone>

Attribute   Value
id          [ASCII]
type        [ASCII] *
start       [FLOAT]
end         [FLOAT]

 

* The attribute ‘type’, although defined as ASCII data, can only contain an allowable combination of SAMPA symbols and diacritics, as defined above.
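Since the DTD cannot express this restriction, a check on the ‘type’ value has to be done separately. The sketch below shows one possible check. The label grammar it encodes (an optional "=" prefix, then one or more base symbols, each optionally followed by "~" and/or ":") is an assumption inferred from the examples in section 4.2.1; a real application would further restrict combinations to those allowed for the language.

```python
import re

# Base symbols listed in section 4.2.1 (consonants, then vowels).
BASE = set("bcCdDfgGhjklLmnJNprRsStTvwxHzZ?" "aA{6QOeE@3iIo29&uU}VyY")
PAUSE = "..."

# Assumed grammar: optional "=" (syllabic), then one or more base symbols,
# each optionally followed by "~" (nasalization) and/or ":" (length).
_SYMBOLS = "".join(re.escape(c) for c in sorted(BASE))
LABEL = re.compile(rf"=?(?:[{_SYMBOLS}]~?:?)+")


def is_valid_type(label):
    """Return True if the label is a pause or fits the assumed grammar."""
    return label == PAUSE or LABEL.fullmatch(label) is not None


for label in ("k", "O~", "a:", "=n", "aI", "...", "q", ""):
    print(repr(label), is_valid_type(label))
```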

 

<syllable>

Attribute   Value
id          [ASCII]
stress      ", %
start       [FLOAT]
end         [FLOAT]

 

 

5. Layer 2: Phonetic Representation of Intonation

 

The phonetic representation of intonation should provide a detailed description of the utterance intonation profile, which is one of the main acoustic correlates of prosodic structure. The object of the description is fundamental frequency - an acoustic parameter which is calculated from the voiced portions of the speech signal by means of signal processing algorithms. Once the gaps of unvoiced phones have been interpolated, f0 is a continuous curve showing perceptually irrelevant variations, micro-prosodic variations due to phoneme quality and macro-prosodic variations which may have a linguistic function. A phonetic representation of this curve will ignore minor details but will describe the shape of the curve by classifying all its relevant features, that a functional phonological analysis could later interpret.
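Filling the gaps left by unvoiced phones can be sketched as follows. The text does not say which interpolation is used for this step; the example assumes simple linear interpolation between the neighbouring voiced frames, with unvoiced frames represented as None, and the function name is hypothetical.

```python
def fill_unvoiced(f0_values):
    """Linearly interpolate frames marked None between voiced neighbours."""
    filled = list(f0_values)
    voiced = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(voiced, voiced[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    # Gaps before the first or after the last voiced frame stay as None.
    return filled


print(fill_unvoiced([100.0, None, None, 130.0, 128.0, None, 120.0]))
```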

Two steps are necessary to obtain a phonetic representation of intonation:

• a stylization of the f0 curve, where irrelevant details (and possible errors of the pitch tracking algorithm) are removed and the curve is represented by a sequence of discrete elements: inflection points, interpolated with a linear or parabolic function

• a classification of the elements of the stylized curve

The elements of the stylized curve are the 'relevant variations' of f0. Depending on the point of view and the underlying intonation theory, such variations may be seen in their movement between two f0 values or in their target value. So, you may see the curve as a chain of rises and falls or as a sequence of high and low values. The two approaches are represented by the two schemes that we have chosen as examples for layer 2. Both schemes start from the raw f0 curve (automatically obtained from the signal), represented as a sequence of frame by frame f0 values (<f0>). Both schemes rely on a stylization of the f0 curve, represented as a sequence of inflection points on the curve (named <closecopy> for IPO and <momel> for INTSINT, just to keep track of the different interpolation laws suggested by the two schemes). But the INTSINT description of the curve will directly label the inflection points as target tones, while IPO will label pitch movements from one inflection point to the following one. The difference will be reflected in the different use of the href attribute for the elements <intone> and <pitmove>, pointing to a single stylized element in one case and to two consecutive elements in the other.

 

In the following, the two schemes will be described separately. The description of the base <f0> element will be given once, as the element and its use are common to the two schemes. It should be noted that when the stylized curve is imported as such (obtained outside the MATE workbench), in its <closecopy> or <momel> version, it could be the base reference for prosodic annotation and the <f0> element may be unnecessary.

 

5.1 Layer 2: Phonetic Representation of Intonation - f0 contours

 

5.1.1 Markup declaration

 

Fundamental frequency (pitch) is a parameter estimated from the acoustic signal, in its voiced (quasi-periodic) portions. It is defined as the inverse of period length and generally measured in Hz (number of periods per second). Period length could in principle be manually measured on the waveform, but it is usually estimated by pitch detection algorithms, whose output can be the series of points in time corresponding to period boundaries or, more often, a sequence of pairs [time interval : f0 value], where f0 is the average fundamental frequency measured on the time interval or frame (typically a few milliseconds).
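The relation between period length and fundamental frequency can be made concrete with a short sketch. It assumes that the pitch detector outputs period boundaries in milliseconds; the function name and the example values are hypothetical.

```python
def f0_from_period_marks(marks_ms):
    """Return (start_ms, end_ms, f0_hz) for each pair of consecutive marks.

    f0 is the inverse of the period length: a period of T milliseconds
    corresponds to 1000 / T Hz.
    """
    return [(a, b, 1000.0 / (b - a)) for a, b in zip(marks_ms, marks_ms[1:])]


# Periods of 10 ms and 8 ms correspond to 100 Hz and 125 Hz.
print(f0_from_period_marks([0.0, 10.0, 18.0]))
```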

 

Here we define an element to represent such raw f0 values, whose sequence provides the so-called f0 contour of the utterance. It should be noted that pitch estimation algorithms are not fully reliable, so that raw f0 values should be considered just a starting point of intonation analysis rather than its unquestionable objective reference.

 

 

5.1.2 The <f0> element

 

5.1.2.1 Description

 

This element has been included to allow each f0 value of an f0 contour to be considered as an XML element (and accordingly handled and displayed). Each <f0> element is intended to represent a pair [time interval : f0 value] of an f0 contour. The most useful representation of the <f0> element is a graphical display of the sequence of its values as a function of time (the f0 curve, contour or profile).

 

5.1.2.2 Data Source

 

The f0 contour is computed directly from the speech signal file (although some pitch detection algorithms rely on phonetic segmentation to obtain better estimates of fundamental frequency).

 

5.1.2.3 Segmentation/selection

 

<f0> elements will be generated automatically, from f0 values calculated by an f0 estimation algorithm. If possible, such an algorithm will be available in the workbench. Otherwise, the f0 values will be imported from external files.

 

 

5.1.2.4 Assignment

 

The attributes considered here for the <f0> element are the following:

 

• value: the f0 value (in Hz)

• start: start time of the calculation frame

• end: end time of the calculation frame

 

 

5.1.3 Markup Table

 

<f0>

Attribute   Value
id          [ASCII]
value       [FLOAT]
start       [FLOAT]
end         [FLOAT]

 

 

5.2 Layer 2: Phonetic Representation of Intonation - IPO scheme

 

5.2.1 Markup Declaration

 

The IPO methodology for the analysis of intonation relies on two main assumptions: the first is that what is not perceived is irrelevant for a linguistic description of intonation, the second is that we perceive tone variations (rise/fall movements) rather than tone levels (high/low). The steps in the perceptual analysis of intonation are:

1. obtain a stylized close copy of the original f0 curve, by approximating the original values with a sequence of straight segments: the re-synthesized signal should be perceptually equivalent to the original one

2. classify the f0 segments as pitch movements, according to their shape and position in the phone chain (the proper reference is the syllable)

3. build up a grammar of admissible configurations of pitch movements and link intonation patterns to linguistic functions

Here we consider only the first two steps, which pertain to the phonetic representation of intonation. In order to represent them, we need three hierarchically ordered elements:

• <f0>, representing the points of the raw f0 curve

• <closecopy>, representing the inflection points in the stylized curve

• <pitmove>, representing the classified movements from one inflection point to the next one.

In principle, <f0> should be linked to the signal, <closecopy> to one <f0> element, and <pitmove> to two consecutive <closecopy> elements.

In actual annotation, it is not required that <closecopy> points coincide with <f0> points (a good stylization removes irrelevant excursions and possible pitch detection errors). Moreover, as suggested in the paragraph on coding procedures, if stylization is performed outside the MATE workbench, it could be directly imported, without reference to <f0>. In this case, the <closecopy> element will be aligned directly with the soundfile by means of its time attributes. Conversely, a very simplified stylization (without the feedback of resynthesis) could be performed by directly linking <pitmove> to a sequence of <f0> elements, which could be thought of as approximated by a straight line. A further option would be to link the <pitmove> to the corresponding <syllable>: in this case, some of the precise acoustic content of the <pitmove> will be lost.

 

 

5.2.2 The <closecopy> element

 

5.2.2.1 Description

 

This element has been included to allow each f0 inflection point of a ‘close copy’ stylised f0 contour (used in the IPO annotation system as the phonetic base representation of f0 contours) to be considered – and accordingly handled – as an XML element. The close copy is intended as a clean version of the f0 curve, where errors and irrelevant details have been removed, gaps corresponding to unvoiced phonemes have been filled and only the relevant movements are apparent. A more detailed description of the concept of ‘close copy’ can be found in ‘t Hart et al. (1990), among others. Such a stylized description of f0 as a function of time can be displayed as a sequence of straight segments connecting the relevant f0 values (inflection points), which may coincide with selected points (frames) in the raw f0 curve or simply approximate them.

 

5.2.2.2 Data Source

 

The starting point for the creation of a close copy should be the raw f0 contour, together with the whole speech signal to allow for resynthesis. Phonetic segmentation would be useful as accessory information. In the MATE workbench, the close copy will most probably be imported from external files.

 

5.2.2.3 Segmentation/selection

 

In the IPO methodology, the ‘close copy’ stylisations are defined by a resynthesis method which allows the perceptual definition of the relevant inflection points. The raw f0 curve is displayed, if possible aligned with phonetic segmentation. On this basis, the annotator draws a simplified curve which approximates the original one. He then listens to the speech resynthesized with the stylized artificial f0 values. He repeats these steps until he reaches the simplest stylization perceptually equivalent to the original contour.

The MATE workbench will not provide this complex environment. As a consequence, close copies will be imported from external files or will be obtained by a simplified (non-theory-conformant) procedure, where inflection points are directly selected on the raw f0 with no resynthesis feedback.

 

5.2.2.4 Assignment

 

The attributes considered here for the <closecopy> element are the following:

• value: the stylized f0 value (in Hz) at the inflection point

• href: optional, points to an <f0> element

• start: time start of the stylised point

• end: time end of the stylised point

If the close copy is imported from an external file, the (link with the) <f0> element may not be necessary. In this case, as each inflection point is indeed a point, the two time attributes will have the same value. Alternatively, in case the close copy is obtained by selection of <f0> elements, href will point to the selected <f0>, from which the time attributes (and possibly the value) might be inherited, and consequently the two time values will be different (the first one corresponding to the beginning of the f0 calculation frame and the second one corresponding to the end of the frame).

5.2.3 The <pitmove> element

 

5.2.3.1 Description

 

The element <pitmove> is intended for phonetic transcription of intonation contours according to IPO methodology. Within the IPO framework, a ‘pitch movement’ is a portion of ‘close-copy stylization’ between two inflection points. A complete phonetic description of the stylized f0 curve should capture its shape and its relation with the phone sequence. So it will classify its segments according to their size, direction (rise/fall) and position in the syllable. The principles for this classification are presented in ‘t Hart et al. (1990), which also provides a set of labels explicitly intended for Dutch. It should be noted that this work proposes a methodology rather than a notational standard. There have been several applications of the IPO approach to different languages (such as English [Willems et al., 1988], French [Beaugendre et al., 1992], Italian [Quazza, 1991] or German [Brindopke et al., 1997]) and different symbols have been used for the same concepts. Here, to keep to a concrete and classical example, we refer to the original proposal for Dutch.

In the IPO approach, pitch movements are intended to be superimposed on an ideal declination grid, which determines the height of the flat movements: two main declination lines (at least for Dutch) are identified as trends in the sequence of peaks and valleys. Pitch movements can follow the baseline or the topline, or depart from them. Every pitch movement departing from the declination lines can be characterized in terms of the following parameters:

 

a) direction (rise/fall)

b) timing (early in the syllable/late/very late)

c) rate of change (fast/slow)

d) size (full/half)

 

The combination of these features provides a set of possible pitch movements, which are labelled with a figure (if the movement is rising) or a letter (if the movement is falling):

 

transcription symbol         1  2  3  4  5  A  B  C  D  E
Direction       rise         x  x  x  x  x
                fall                        x  x  x  x  x
Timing          early        x           x     x        x
                late               x        x
                very late       x                 x
Rate of change  fast         x  x  x     x  x  x  x     x
                slow                  x              x
Size            full         x  x  x  x     x  x  x  x
                half                     x              x

 

'Flat' pitch movements following the baseline or the topline are labelled with 0 or Ø respectively. A special diacritic '&' is used in the IPO transcription to join pitch movements occurring on the same syllable. For example a rise-fall with the peak in the middle of the syllable (pointed hat) could be labeled "1&A". In our formalization, where labels are assigned to <pitmove> elements, the diacritic '&' before a label will have the meaning "pitch movement realized on the same syllable as the preceding one". So, for example, a complex configuration rise-fall-rise occurring on a single syllable could be represented by three <pitmove>'s respectively labeled "1" "&A" "&2". Of course, two 'early' movements can't occur on the same syllable, so labels 1, 5, B, E can't be preceded by '&'.
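
The labelling rules just described can be sketched in Python as follows. This is only an illustrative reading of the feature table above, not a normative definition: the dictionary `IPO_MOVES` and the function `is_valid_label` are names introduced here, and the feature values are transcribed from the table for Dutch.

```python
# Features per symbol: (direction, timing, rate of change, size).
# None marks a feature that the table leaves unspecified.
IPO_MOVES = {
    "1": ("rise", "early", "fast", "full"),
    "2": ("rise", "very late", "fast", "full"),
    "3": ("rise", "late", "fast", "full"),
    "4": ("rise", None, "slow", "full"),
    "5": ("rise", "early", "fast", "half"),
    "A": ("fall", "late", "fast", "full"),
    "B": ("fall", "early", "fast", "full"),
    "C": ("fall", "very late", "fast", "full"),
    "D": ("fall", None, "slow", "full"),
    "E": ("fall", "early", "fast", "half"),
}

# Flat movements following the baseline (0) or the topline (Ø).
FLAT = {"0", "Ø"}


def is_valid_label(label):
    """Return True if label is an admissible <pitmove> type."""
    if label in FLAT:
        return True
    joined = label.startswith("&")
    symbol = label[1:] if joined else label
    if symbol not in IPO_MOVES:
        return False
    timing = IPO_MOVES[symbol][1]
    # An 'early' movement cannot follow another movement
    # on the same syllable, so it cannot carry '&'.
    return not (joined and timing == "early")


# The rise-fall-rise configuration on a single syllable.
print([is_valid_label(label) for label in ["1", "&A", "&2"]])
```

Under this reading, the labels that accept the '&' diacritic are exactly those whose timing is not 'early'.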

 

5.2.3.2 Data Source

 

The IPO notation scheme annotates pitch movements taking the ‘close copy’ stylisation as a starting point. Phonetic segmentation is also a necessary reference for selecting the proper labels, since it makes it possible to identify syllables.

 

5.2.3.3 Segmentation/selection

 

The IPO phonetic representation of intonation is a segmentation of the speech flow into consecutive pitch movements. Each <pitmove> covers a segment of the close copy stylized curve, the stretch between a <closecopy> inflection point and the following one. The annotator will label pitch movements by looking at the <closecopy>'s sequence, graphically displayed and aligned with the phonetic transcription of the utterance. He will define a <pitmove> by selecting two consecutive <closecopy> elements, from which the <pitmove> will inherit its time attributes start, end. Then, on the basis of the shape of the segment as displayed in the stylized curve and of its alignment with phones (syllable), he will assign the <pitmove> a proper label.

 

 

5.2.3.4 Assignment

 

The attributes considered here for the <pitmove> element are the following:

 

• type: IPO symbol representing the movement.

• href: two consecutive <closecopy> elements (the initial and final inflection points of the movement); alternatively, the <pitmove> might be linked with a <syllable>

• start: time start of the movement (inherited from the first linked <closecopy>)

• end: time end of the movement (inherited from the second linked <closecopy>)

 

5.2.4 Example

 

The following example shows the Italian sentence "quell'artificio contabile sara` scoperto facilmente" read by a female speaker. In the picture, the vertical bars correspond to phoneme boundaries (phoneme symbols are not SAMPA...), the blue line to the original f0 curve and the red line to the stylized one (closecopy).

 

 

This example can be represented using the <closecopy> and <pitmove> elements as below. It is assumed that the closecopy is directly imported as a sequence of inflection points (in this case f0.xml is not needed).

 

closecopy.xml

 

<closecopy id="clscpy_001" value="207" start="130" end="130"/>

<closecopy id="clscpy_002" value="243" start="540" end="540"/>

<closecopy id="clscpy_003" value="285" start="690" end="690"/>

<closecopy id="clscpy_004" value="212" start="860" end="860"/>

<closecopy id="clscpy_005" value="189" start="1110" end="1110"/>

<closecopy id="clscpy_006" value="159" start="1290" end="1290"/>

<closecopy id="clscpy_007" value="209" start="1500" end="1500"/>

<closecopy id="clscpy_008" value="206" start="1750" end="1750"/>

<closecopy id="clscpy_009" value="246" start="2070" end="2070"/>

<closecopy id="clscpy_010" value="226" start="2600" end="2600"/>

<closecopy id="clscpy_011" value="148" start="2780" end="2780"/>

<closecopy id="clscpy_012" value="144" start="3070" end="3070"/>

 

pitmove.xml

 

<pitmove id="pitm_001" type="4" href="closecopy.xml# id(clscpy_001).. id(clscpy_002)"/>

<pitmove id="pitm_002" type="1" href="closecopy.xml# id(clscpy_002).. id(clscpy_003)"/>

<pitmove id="pitm_003" type="B" href="closecopy.xml# id(clscpy_003).. id(clscpy_004)"/>

<pitmove id="pitm_004" type="Ø" href="closecopy.xml# id(clscpy_004).. id(clscpy_005)"/>

<pitmove id="pitm_005" type="B" href="closecopy.xml# id(clscpy_005).. id(clscpy_006)"/>

<pitmove id="pitm_006" type="4" href="closecopy.xml# id(clscpy_006).. id(clscpy_007)"/>

<pitmove id="pitm_007" type="Ø" href="closecopy.xml# id(clscpy_007).. id(clscpy_008)"/>

<pitmove id="pitm_008" type="4" href="closecopy.xml# id(clscpy_008).. id(clscpy_009)"/>

<pitmove id="pitm_009" type="0" href="closecopy.xml# id(clscpy_009).. id(clscpy_010)"/>

<pitmove id="pitm_010" type="B" href="closecopy.xml# id(clscpy_010).. id(clscpy_011)"/>

<pitmove id="pitm_011" type="Ø" href="closecopy.xml# id(clscpy_011).. id(clscpy_012)"/>
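
The inheritance of time attributes from the linked inflection points could be sketched as below. The href syntax is taken from the examples in this chapter; the regular expression, the function name `resolve_span` and the in-memory dictionary are assumptions made for illustration and do not describe how the MATE workbench resolves links.

```python
import re

# First inflection points of the closecopy.xml example: id -> (start, end).
CLOSECOPY = {
    "clscpy_001": (130, 130),
    "clscpy_002": (540, 540),
    "clscpy_003": (690, 690),
}

# Matches "file#id(a)" and "file#id(a)..id(b)", tolerating the
# spaces that appear in the examples.
HREF = re.compile(
    r"^(?P<file>[^#]+)#\s*id\((?P<first>\w+)\)"
    r"(?:\s*\.\.\s*id\((?P<last>\w+)\))?$"
)


def resolve_span(href, points):
    """Return (start, end) inherited from the linked elements."""
    match = HREF.match(href.strip())
    if match is None:
        raise ValueError(f"unrecognised href: {href!r}")
    first = match.group("first")
    last = match.group("last") or first
    # start comes from the first linked point, end from the last one.
    return points[first][0], points[last][1]


print(resolve_span("closecopy.xml# id(clscpy_001).. id(clscpy_002)", CLOSECOPY))
```

A link to a single element falls out as the special case where the first and last ids coincide.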

 

 

5.2.5 Coding Procedure

 

The objective of phonetic transcription of intonation according to the IPO methodology is to obtain a stylized curve where the sequence of pitch movements is properly labeled. The MATE workbench will not provide a true stylization/resynthesis environment. It might provide a pitch tracking function to obtain the raw f0 curve, or alternatively a means to import it from external files. The most IPO-conformant coding procedure will directly import the stylized f0 curve, obtained with the help of a proper external environment for perceptual stylization (e.g. Winpitch, see http://www.winpitch.com), using the <closecopy> element with no need of the <f0> element, and will consist in the following steps:

 

• open the speech file in order to listen to its intonation

• open the corresponding phonetic segmentation (<phone> and <syllable>)

• import the close copy and display it as a curve, aligned with phonetic segmentation

• define <pitmove> elements by selecting the segments of the stylized curve (delimited by two consecutive <closecopy> elements) and labeling each of them according to the following criteria:

    - if it can be considered to coincide with the ideal baseline or topline, by a global look at the curve, label it 0 or Ø respectively

    - otherwise choose the proper label on the basis of movement direction and size and of its position in the syllable, judged by looking at its phonetic alignment

 

If the close copy is not available, the third step may be replaced by the following steps (a very simplified approximation of the correct stylization procedure):

• import or generate automatically the raw f0 curve and display it

• obtain a closecopy by selecting the 'relevant' <f0> points on the raw curve; base such stylization on the shape of the curve, the perceived intonation of the sound file and the alignment with syllables (accents, boundaries...)

 

 

5.2.6 Markup Table

 

Element       Attribute   Value
<closecopy>   id          [ASCII]
              value       [FLOAT]
              href        <f0>
              start       [FLOAT]
              end         [FLOAT]

<pitmove>     id          [ASCII]
              type        0, Ø, 1, 2, 3, 4, 5, A, B, C, D, E, &2, &3, &4, &A, &C, &D
              href        <closecopy>..<closecopy> or <syllable>
              start       [FLOAT]
              end         [FLOAT]

 

 

5.3 Layer 2: Phonetic Representation of Intonation - INTSINT scheme

 

5.3.1 Markup Declaration

 

INTSINT is a coding system of intonation developed by Daniel Hirst and his colleagues at the CNRS centre of the Aix-en-Provence University. It is conceived "to provide a purely formal encoding of the macroprosodic curve. Each target point of the stylised curve is coded by a symbol either as an absolute tone, defined globally with respect to the speaker's pitch-range or as a relative tone, defined locally with respect to the immediately neighbouring target-points" (Campione et al., 1997, p. 72). Descriptions of this method can be found in Hirst (1991, 1994) and Hirst & Di Cristo (1998), among other references.

The starting point is again the raw f0 curve, which is (automatically) stylized to remove irrelevant and micro-prosodic details. The stylized representation, called MOMEL (Hirst & Espesser, 1993), consists of a sequence of inflection points [time : f0 value], which should be interpolated by a parabolic function. As a second step, each target point in the MOMEL stylized curve is considered in its absolute or relative height and accordingly labeled as a high or low tone. The elements necessary to represent the INTSINT notation system are the following:

• <f0>, for the frames of the raw f0 curve

• <momel>, for the inflection points in the stylized curve

• <intone>, for the labeled tones

The three elements are hierarchically ordered, with a one-to-one mapping between <intone>'s and <momel>'s. The alignment with the soundfile is kept through the base element <f0>, although in case the <momel> stylized curve is directly imported, the link with <f0> can be skipped and <momel> can be directly aligned with the soundfile.

 

 

5.3.2 The <momel> element

 

5.3.2.1 Description

 

This element has been included to allow each f0 inflection point of a MOMEL stylised f0 contour (used in the INTSINT annotation system as the phonetic base representation of f0 contour) to be considered –and accordingly handled– as an XML element.

For a detailed description of the MOMEL stylization procedure, the reader is referred to Hirst & Espesser (1993), and Hirst (1994), among other references.

 

 

5.3.2.2 Data Source

 

The MOMEL stylised f0 contour is obtained automatically from the raw f0 curve.

 

5.3.2.3 Segmentation/selection

 

The MOMEL stylised f0 values, whether calculated or imported from the ‘mes’ tool, will be automatically converted to <momel> elements. ‘Mes’ is described at (and can be downloaded from) the following site: ‘http://www.lpl.univ-aix.fr/ext/projects/mes_signaix.htm/’.

 

 

5.3.2.4 Assignment

 

The attributes considered here for the <momel> elements are the following:

 

• value: f0 value (in Hz) of the stylised point

• href: <f0>, optional

• start: time start of the stylised point

• end: time end of the stylised point

 

If the MOMEL curve is imported from the 'mes' tool, the reference to <f0> can be avoided. In this case, as each inflection point is indeed a point, the two time attributes will have the same value. Otherwise, they will be inherited from start, end of the <f0> frame.

 

5.3.3 The <intone> element

 

5.3.3.1 Description

 

The target points in the MOMEL stylized curve can be phonetically labeled as tones, here represented by the <intone> element.

 

INTSINT includes two types of symbols to transcribe f0 tones:

 

1) Absolute Tones

 

INTSINT includes three symbols to label the Absolute Tones, which are defined according to the speaker’s pitch range.

 

T    top of the speaker’s pitch range
M    initial, mid value
B    bottom of the speaker’s pitch range

 

2) Relative tones

 

Relative tones are coded in INTSINT considering the height of the preceding and following target points. Five different symbols exist to transcribe these Relative Tones:

 

H    target higher than both immediate neighbours
L    target lower than both immediate neighbours
S    target not different to preceding target
U    target in a rising sequence
D    target in a falling sequence
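
To make the relative tone definitions concrete, the following Python sketch labels the interior target points of a stylised curve using only the five definitions listed above. It is a deliberately simplified illustration and does not reproduce the ‘mes’ tool or the published INTSINT conventions: absolute tones (T, M, B) are ignored, the first and last targets are left unlabelled, and the function name `relative_tones` and its `tolerance` parameter (in Hz) are assumptions introduced here.

```python
def relative_tones(values, tolerance=0.0):
    """Label interior targets with H, L, S, U or D.

    values: f0 values (Hz) of consecutive target points.
    tolerance: largest difference (Hz) still counted as 'same'.
    The first and last targets have a single neighbour and are
    returned as None in this sketch.
    """
    labels = [None] * len(values)
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if abs(cur - prev) <= tolerance:
            labels[i] = "S"  # not different from the preceding target
        elif cur > prev and cur > nxt:
            labels[i] = "H"  # higher than both neighbours
        elif cur < prev and cur < nxt:
            labels[i] = "L"  # lower than both neighbours
        elif cur > prev:
            labels[i] = "U"  # step in a rising sequence
        else:
            labels[i] = "D"  # step in a falling sequence
    return labels


# Hypothetical target values in Hz.
print(relative_tones([100, 150, 120, 110, 110, 130, 160, 140]))
```

A real annotation would combine such local comparisons with the speaker's pitch range in order to assign the absolute tones as well.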

 

 

 

5.3.3.2 Data Source

 

The INTSINT representation is usually obtained from the MOMEL stylised f0 contour. So the <intone> element will be directly linked to <momel>. Phonetic segmentation is also useful to assign labels, although it is not strictly necessary.

 

 

5.3.3.3 Segmentation/selection

 

The INTSINT symbols are assigned to each inflection point of the MOMEL stylised contour, following a set of conventions which are described in Hirst (1991, 1994), Hirst et al. (1993) and Hirst & Di Cristo (1998), among other references.

In order to label <intone>'s, the <momel> elements should be displayed as a stylized curve (parabolic interpolation) aligned with phonetic segmentation.

The INTSINT symbols can also be automatically assigned to the MOMEL inflection points by means of the ‘mes’ tool.

 

 

5.3.3.4 Assignment

 

The attributes considered here for the <intone> element are the following:

 

• type: the INTSINT symbol corresponding to the tone.

• href: points to a single <momel> element

• start: time start of the stylised point, inherited from <momel>

• end: time end of the stylised point, inherited from <momel>

 

 

5.3.4 Example

 

The example presented here shows the MOMEL and INTSINT annotation of the French utterance 'Il faut que je sois a Grenoble Samedi vers quinze heures', using the <momel> and <intone> elements.

 

momel.xml

 

<momel id="mml_001" value="163" start="106" end="106"/>

<momel id="mml_002" value="217" start="265" end="265"/>

<momel id="mml_003" value="148" start="521" end="521"/>

<momel id="mml_004" value="190" start="617" end="617"/>

<momel id="mml_005" value="130" start="827" end="827"/>

<momel id="mml_006" value="223" start="1249" end="1249"/>

<momel id="mml_007" value="139" start="1614" end="1614"/>

<momel id="mml_008" value="172" start="1822" end="1822"/>

<momel id="mml_009" value="144" start="1983" end="1983"/>

<momel id="mml_010" value="185" start="2078" end="2078"/>

<momel id="mml_011" value="152" start="2248" end="2248"/>

<momel id="mml_012" value="99" start="2505" end="2505"/>

<momel id="mml_013" value="152" start="2730" end="2730"/>

 

intone.xml

 

<intone id="intn_001" type="L" href="momel.xml# id(mml_001)"/>

<intone id="intn_002" type="T" href="momel.xml# id(mml_002)"/>

<intone id="intn_003" type="M" href="momel.xml# id(mml_003)"/>

<intone id="intn_004" type="H" href="momel.xml# id(mml_004)"/>

<intone id="intn_005" type="L" href="momel.xml# id(mml_005)"/>

<intone id="intn_006" type="T" href="momel.xml# id(mml_006)"/>

<intone id="intn_007" type="M" href="momel.xml# id(mml_007)"/>

<intone id="intn_008" type="H" href="momel.xml# id(mml_008)"/>

<intone id="intn_009" type="L" href="momel.xml# id(mml_009)"/>

<intone id="intn_010" type="H" href="momel.xml# id(mml_010)"/>

<intone id="intn_011" type="D" href="momel.xml# id(mml_011)"/>

<intone id="intn_012" type="B" href="momel.xml# id(mml_012)"/>

<intone id="intn_013" type="M" href="momel.xml# id(mml_013)"/>

 

 

5.3.5 Coding Procedure

A specific tool, 'mes' (available at ‘http://www.lpl.univ-aix.fr/ext/projects/mes_signaix.htm/’), has been developed to perform automatic intonation transcription according to the INTSINT system. Both stylization and annotation can be performed automatically by 'mes'. So, the simplest way to get to INTSINT annotation in the MATE environment would be the following:

• import the raw f0 curve in <f0>

• import the MOMEL stylized curve in <momel>

• import the INTSINT annotation in <intone>

• link <momel> to <f0> and <intone> to <momel> (an automatic function should be provided for that by the workbench)

The <f0> element may not be necessary (unless it is used as a reference by other layers...). In case only <momel> is imported, <intone>'s may be created manually by the following procedure:

• open the speech file in order to listen to its intonation

• open the corresponding phonetic segmentation

• import <momel> elements and display them as a stylized curve

• define <intone> elements by selecting every <momel> element (inflection point in the stylized curve) and mark it with the proper label

 

 

5.3.6 Markup Table

 

Element    Attribute   Value
<momel>    id          [ASCII]
           value       [FLOAT]
           href        <f0> (optional)
           start       [FLOAT]
           end         [FLOAT]

<intone>   id          [ASCII]
           type        T, M, B, H, S, L, U, D
           href        <momel>
           start       [FLOAT]
           end         [FLOAT]

 

 

6. Layer 3: Phonological Representation of Intonation - ToBI scheme

 

6.1 Markup Declaration

 

Within the ToBI system [Silverman et al., 1992], the Tone Tier is the level used to transcribe intonation phenomena. The types of phenomena covered by ToBI are tones and downstepping, in their definition by Pierrehumbert (1980). F0 range and peak delay are also considered. The system is mainly phonological, labeling intonation events according to their function as described in language-dependent intonation models, with explicit reference to prosodic units such as prominent syllables, words and prosodic phrases. Nevertheless, some direct reference to acoustic values is admitted: while intonation events are supposed to be associated with linguistic units (syllables, words) they may also be aligned to specific points in the raw f0 curve, possibly corresponding to the tone 'peak'. Such alignment may be given for each tone or, alternatively, only for those whose peak occurs too 'early' or too 'late' with respect to the stressed syllable. Moreover, a special marker may indicate the highest point in the f0 curve, to give an idea of the pitch range. These acoustic references look spurious in phonological annotation and are required only when a true acoustic-phonetic representation of the curve is lacking. In the MATE meta-scheme, such an intermediate level is present, so that it could be profitably used instead of the direct references to f0 points.

In our XML adaptation of ToBI, four elements have been defined:

• <tobitone>, for the tones, distinguished according to their function as pitch accents, phrase accents or boundary tones and labeled according to a classification of their linguistically admissible types

• <target>, to mark peak location when it occurs outside the scope of the accented syllable

• <f0range>, to mark the highest f0 value in the curve

• <repair>, to mark the restart of the intonation contour after a disfluency

The four elements are not hierarchically ordered. All may refer to the f0 curve, while only the two accessory elements <target> and <f0range> are necessarily linked to <f0>. The <tobitone> and <repair> elements can be linked to prosodic units and/or to phonetic descriptions of intonation, rather than raw f0. This kind of reference is recommended. The other two elements are provided for completeness with respect to ToBI, but may be avoided.

 

6.2 The <tobitone> element

 

6.2.1 Description

 

The <tobitone> element has been defined to adapt to XML format the ToBI labels defined in the Tone Tier for the description of tones. In this framework, a tone is a functionally simple prosodic event which may be phonetically complex, e.g. it may consist of an accent realized by reaching a low target f0 and immediately rising to a high f0 value. In fact, while the base descriptive elements are pitch levels H (high) and L (low), a tone can in some cases be described by a combination of levels, which amounts to describing it as a movement. ToBI notation presupposes a classification of the different types of tones admissible in a given language, so it is model and language dependent. In the following, the reference for the inventory of tones is the original ToBI model proposed for American English (Beckman & Ayers, 1994).

 

Within the ToBI framework, two types of tones are considered:

 

a) phrasal tones: pitch events associated with intonational boundaries.

b) pitch accents: pitch events associated with accented syllables.

 

Phrasal tones could be further distinguished into:

 

a.a) phrase accents, events at intermediate phrase boundaries

a.b) boundary tones, events at full intonation phrase boundaries

 

Note that, in the prosodic structure, an intonation phrase is a sequence of intermediate phrases, so it will be marked both by the phrase accent of its last intermediate phrase and by the boundary tone.

 

In our ToBI adaptation, we will use the <tobitone> element for all these classes of tones and distinguish them with the attribute class, which may assume the values pitch accent ("pitaccent"), phrase accent ("phraccent") or boundary tone ("boundtone"). For each class, a set of different tone types is defined, represented as values of the type attribute.

 

The different types of phrase accents considered in ToBI are:

 

L-     Low phrase accent, which occurs at an intermediate phrase boundary
H-     High phrase accent, which occurs at an intermediate phrase boundary
!H-    Downstepped high phrase accent

 

 

The different types of boundary tones are:

 

L%    Low (final) boundary tone, which occurs at every full intonation phrase boundary
H%    High (final) boundary tone, which occurs at every full intonation phrase boundary
%H    Initial boundary tone; marks a phrase that begins relatively high in the speaker’s pitch range; the default initial boundary is in the middle of the range or lower, and will be left unmarked in the transcription

 

As full intonation phrase boundaries will always have two final tones, a phrase accent tone plus a boundary tone, the possible set of allowable combinations at the end of an intonation unit is the following:

 

L-L%    a full intonation phrase with a L phrase accent ending its final intermediate phrase and a L% boundary tone falling to a point low in the speaker’s range (standard ‘declarative’ contour of American English)
L-H%    a full intonation phrase with a L phrase accent closing the last intermediate phrase, followed by a H boundary tone (‘continuation rise’)
H-H%    an intonation phrase with a final intermediate phrase ending in a H phrase accent and a subsequent H boundary tone (‘yes-no questions’)
H-L%    an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker’s range (final level ‘plateau’)

 

 

The inventory of pitch accents considered in ToBI is the following:

 

H*       ‘peak accent’, an apparent tone target on the accented syllable which is in the upper part of the speaker’s pitch range for the phrase
L*       ‘low accent’, an apparent tone target on the accented syllable which is in the lowest part of the speaker’s pitch range
L*+H     ‘scooped accent’, a low tone target on the accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker’s pitch range
L+H*     ‘rising peak accent’, a high peak target on the accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker’s pitch range
H+!H*    a clear step down onto the accented syllable from a high pitch which itself cannot be accounted for by a H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase

 

Finally, ToBI has a way of dealing with uncertainty, by using one or several of the following symbols, that we will consider as special cases of <tobitone> elements:

 

*      The pitch accent has not been transcribed yet
*?     Uncertainty about the presence of a pitch accent
X*?    Uncertainty about the type of pitch accent
-      The phrase accent has not been transcribed yet
-?     Uncertainty about the presence of a phrase accent
X-?    Uncertainty about the type of phrase accent
%      The boundary tone has not been transcribed yet
%?     Uncertainty about the presence of a boundary tone
X%?    Uncertainty about the type of boundary tone

 

For a more detailed description of these symbols, the user is referred to Price (1992), Silverman et al. (1992), Beckman & Ayers (1994) and Pitrelli et al. (1994).

 

 

6.2.2 Data Source

 

A ToBI transcription is usually carried out taking the raw f0 as basic representation. Alternatively, it can rely on a phonetic representation of the f0 curve. In any case it should be aligned with linguistic units: phones or syllables or words or phrases or all of them.

 

 

6.2.3 Segmentation/selection

 

Tones are identified by inspecting the intonation curve (raw or stylized) and the aligned syllables and prosodic phrases. Depending on the annotation purposes, tones may be linked to points in the f0 curve or to linguistic units. ToBI annotation is not intended as a segmentation of the intonation curve; rather, it is a selection of its relevant events, driven by the underlying linguistic structure of the utterance. ToBI tones are originally intended as associated with syllables, stressed syllables or phrase-final syllables, with some loose suggestion as to their precise alignment with the f0 curve: the 'peak' of the tone is intended to occur in the scope of the associated syllable, unless otherwise specified by the >, < diacritics (see the <target> element). In a current implementation of the system in the ESPS-Waves+ environment (‘http://www.entropic.com/products&services/esps/esps.html’), the link with the f0 curve is made explicit, with tones aligned with their peak in the f0 curve.

Different options are available in MATE for <tobitone>'s alignment. A very simple one could be to have a display of the f0 raw curve aligned with <word>'s and define each <tobitone> by selecting the word on which it occurs. One could proceed in a similar way with <syllable>'s instead of words and select the syllable (or the <phone> sequence) to be associated with the tone.

In a more sophisticated approach, one may build up <tobitone>'s starting from the stylized curve, so giving a precise phonetic correlate to the phonological ToBI label. To this end one may rely on <pitmove>'s or <intone>'s. The latter are perhaps more suitable to be considered as components of a <tobitone>, because of their underlying pitch level or target point interpretation. Based on the synchronized display of the <intone> stylized curve, the <phone> transcription and the prosodic phrasing (<breakindex>), a <tobitone> will be defined by selecting its target points on the stylized curve: e.g. a simple <tobitone> like H* will be associated with a single <intone>, a complex one with a sequence of <intone>'s. Time attributes will be inherited from the selected <intone>'s. Time alignment will always be available in order to find out (via query or window synchronization) the corresponding <syllable>.

6.2.4 Assignment

 

The attributes considered here for the <tobitone> element are the following:

 

• type: one of the labels defined in ToBI for tonal transcription.

• class: one of the classes of tones defined in the ToBI system: "pitaccent" (pitch accent), "phraccent" (phrase accent), "boundtone" (boundary tone).

• href: <f0> or <closecopy> or <momel> or <intone> or <syllable> or <word>

• start: time start of the tone (inherited)

• end: time end of the tone (inherited)

 

 

6.3 The <target> element

 

6.3.1 Description

 

This element is used to indicate the location of a peak in the f0 contour, when it does not coincide with the stressed syllable.

 

6.3.2 Data Source

 

The raw or stylized f0 contour (in one of the following representations: <f0>, <closecopy>, <momel>, <intone>).

 

6.3.3 Segmentation/selection

 

The target position is located by visual inspection of the f0 contour. The corresponding <f0> (or stylized) element is selected and marked as ‘EarlyF0’, if it precedes the stressed syllable, or ‘LateF0’, if it follows it.
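The decision between the two labels can be written down as a small rule. The sketch below assumes that the peak time and the boundaries of the stressed syllable are already known (for instance from the selected f0 element and the <syllable> element); the function name target_type is hypothetical.

```python
from typing import Optional


def target_type(peak_time: float, syl_start: float, syl_end: float) -> Optional[str]:
    """Return the <target> type for an f0 peak, or None if no <target> is needed.

    The peak needs a <target> only when it lies outside the stressed syllable.
    """
    if peak_time < syl_start:
        return "EarlyF0"
    if peak_time > syl_end:
        return "LateF0"
    return None  # peak falls within the stressed syllable
```

The location of the peak itself is still found by visual inspection; the rule only names the resulting label.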

 

6.3.4 Assignment

 

The attributes considered here for the <target> element are the following:

 

• type: "EarlyF0" or "LateF0"

• href: <f0> or <closecopy> or <momel> or <intone>

• start: time start of the f0 peak (inherited)

• end: time end of the f0 peak (inherited)

 

 

6.4 The <f0range> element

 

6.4.1 Description

 

This element has been included to represent the ‘f0 range’ annotation symbol, which is used to indicate the f0 maximum in the speaker’s range for a given phrase.

 

6.4.2 Data Source

 

The raw or stylized f0 contour (in one of the following representations: <f0>, <closecopy>, <momel>, <intone>).

 

6.4.3 Segmentation/selection

 

The location of the f0 maximum is determined by visual inspection of the f0 contour, and the corresponding element (<f0>, <momel>, <closecopy> or <intone>) is selected to be associated with the <f0range> element.
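Although the selection is made by eye, a tool could propose a candidate for the annotator to confirm. The following sketch makes two assumptions that are not stated in this document: the f0 values of the phrase are available as (element id, value) pairs, and unvoiced frames are coded with a value of 0. The name hif0_candidate is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def hif0_candidate(f0_points: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Return the id of the element with the highest f0 value in one phrase.

    Frames with a value of 0 or less are treated as unvoiced and ignored.
    """
    voiced = [(elem_id, value) for elem_id, value in f0_points if value > 0]
    if not voiced:
        return None
    return max(voiced, key=lambda point: point[1])[0]


# Invented f0 values (Hz) for a short phrase.
phrase = [("f0_001", 0.0), ("f0_002", 182.5), ("f0_003", 214.0), ("f0_004", 160.2)]
candidate = hif0_candidate(phrase)
```

Octave errors in the pitch tracker would mislead such a helper, which is one reason the final choice remains with the annotator.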

 

6.4.4 Assignment

 

The attributes considered here for the <f0range> element are the following:

 

• type: ToBI symbol for f0 maximum: "HiF0"

• href: <f0> or <closecopy> or <momel> or <intone>

• start: time start of the f0 peak (inherited)

• end: time end of the f0 peak (inherited)

 

6.5 The <repair> element

 

6.5.1 Description

 

This element has been included to represent the ‘repair’ annotation symbol "%r", defined in ToBI to mark the restart of an intonation contour when the preceding contour was interrupted by a disfluency before being completed. Such a restart can be considered an intonation event aligned with a specific point in the f0 curve or with the corresponding prosodic unit.

 

6.5.2 Data Source

 

The raw or stylized f0 contour (in one of the following representations: <f0>, <closecopy>, <momel>, <intone>) and the phonetic transcription, with <phone> and <syllable> elements.

 

6.5.3 Segmentation/selection

 

Both listening to the speech signal and inspecting the f0 curve, aligned with the phonetic transcription, are necessary.

As in the case of <tobitone>, two main options are available: 1) select the <syllable> element on which the intonation restart occurs, 2) select an element in a phonetic representation of intonation: <f0>, <intone>, etc.

 

6.5.4 Assignment

 

The attributes considered here for the <repair> element are the following:

 

• type: ToBI symbol for repair ("%r")

• href: <f0> or <closecopy> or <momel> or <intone> or <syllable>

• start: time start of the intonation restart (inherited)

• end: time end of the intonation restart (inherited)

 

 

6.5 Example

 

The following example shows the ToBI annotation of the English utterance "Show me the cheapest fare from Philadelphia to Dallas excluding restriction VU slash one" (obtained from the TOBI-TRAINING material), using the elements <tobitone> and <repair>. Note that in this case tones are linked (by means of the 'href' attribute) to <word> elements, in addition to time values.

 

tobitone.xml

 

<tobitone id="tbtn_001" type="H*" class="pitaccent" href="word.xml#id(wrd_001)" start="2052" end="2052"/>

<tobitone id="tbtn_002" type="L+H*" class="pitaccent" href="word.xml#id(wrd_004)" start="2579" end="2579"/>

<tobitone id="tbtn_003" type="!H*" class="pitaccent" href="word.xml#id(wrd_005)" start="3065" end="3065"/>

<tobitone id="tbtn_004" type="L-" class="phraccent" href="word.xml#id(wrd_005)" start="3315" end="3315"/>

<tobitone id="tbtn_005" type="L%" class="boundtone" href="word.xml#id(wrd_005)" start="3315" end="3315"/>

<tobitone id="tbtn_006" type="L+H*" class="pitaccent" href="word.xml#id(wrd_009)" start="4470" end="4470"/>

<tobitone id="tbtn_007" type="!H*" class="pitaccent" href="word.xml#id(wrd_009)" start="4771" end="4771"/>

<tobitone id="tbtn_008" type="L-" class="phraccent" href="word.xml#id(wrd_009)" start="5015" end="5015"/>

<tobitone id="tbtn_009" type="H*" class="pitaccent" href="word.xml#id(wrd_011)" start="5388" end="5388"/>

<tobitone id="tbtn_010" type="L-" class="phraccent" href="word.xml#id(wrd_011)" start="5855" end="5855"/>

<tobitone id="tbtn_011" type="L%" class="boundtone" href="word.xml#id(wrd_011)" start="5855" end="5855"/>

<tobitone id="tbtn_012" type="L+H*" class="pitaccent" href="word.xml#id(wrd_012)" start="6984" end="6984"/>

<tobitone id="tbtn_013" type="L-" class="phraccent" href="word.xml#id(wrd_012)" start="7399" end="7399"/>

<tobitone id="tbtn_014" type="L%" class="boundtone" href="word.xml#id(wrd_012)" start="7399" end="7399"/>

<tobitone id="tbtn_015" type="H*" class="pitaccent" href="word.xml#id(wrd_013)" start="8154" end="8154"/>

<tobitone id="tbtn_016" type="L-" class="phraccent" href="word.xml#id(wrd_013)" start="8585" end="8585"/>

<tobitone id="tbtn_017" type="L%" class="boundtone" href="word.xml#id(wrd_013)" start="8585" end="8585"/>

<tobitone id="tbtn_018" type="H*" class="pitaccent" href="word.xml#id(wrd_014)" start="8711" end="8711"/>

<tobitone id="tbtn_019" type="!H*" class="pitaccent" href="word.xml#id(wrd_015)" start="8928" end="8928"/>

<tobitone id="tbtn_020" type="L-" class="phraccent" href="word.xml#id(wrd_015)" start="9114" end="9114"/>

<tobitone id="tbtn_021" type="H*" class="pitaccent" href="word.xml#id(wrd_016)" start="9353" end="9353"/>

<tobitone id="tbtn_022" type="H*" class="pitaccent" href="word.xml#id(wrd_017)" start="9694" end="9694"/>

<tobitone id="tbtn_023" type="L-" class="phraccent" href="word.xml#id(wrd_017)" start="9880" end="9880"/>

<tobitone id="tbtn_024" type="L%" class="boundtone" href="word.xml#id(wrd_017)" start="9880" end="9880"/>

 

 

repair.xml

 

<repair id="rpr_001" type="%r" start="4149" end="4149"/>

 

6.6 Coding Procedure

 

Different procedures may be followed to obtain a ToBI annotation of intonation. As mentioned above, a simple procedure may look at the shape of the raw f0 curve and align <tobitone>'s to <word>'s. Alternatively, tones may be linked to stylized curves or to linguistic units.

A recommended procedure, in line with the multilevel integrated MATE approach, is the following (where <intone>'s could be replaced with <closecopy>'s):

 

• open the following synchronized windows: <intone> (with the <momel> graphical display of the stylized f0 curve), <phone>, <syllable>, <breakindex>

• look for stressed syllables in the <syllable> sequence and inspect the corresponding f0 contour to judge whether it carries a pitch accent

• if it does, select its components (<intone> elements) and create a corresponding <tobitone> (which will inherit its time attributes from the <intone>'s)

• label it with class="pitaccent" and type set to the appropriate ToBI label

• (alternatively, link the tone to the corresponding stressed <syllable>, or to the corresponding <word>)

• look for phrase boundaries in the <breakindex> stream and inspect the corresponding f0 contour to recognize the type of phrase accent

• if the <breakindex> value is 4 (i.e. it corresponds to a full intonation phrase boundary), decompose the contour into phrase accent and boundary tone

• select the <intone> elements composing the phrase accent and create a corresponding <tobitone> element (which will inherit its time attributes from the <intone>'s) with class="phraccent" and type set to the appropriate ToBI label

• if there is a boundary tone (break index 4), select its <intone> element and create the corresponding <tobitone> with class="boundtone" and the appropriate label

• (alternatively, link the tone to the corresponding <syllable>)

 

With the explicit link to the phonetic representation of f0, the <target> and <f0range> elements may be unnecessary. If desired, they may be introduced by selecting the corresponding <intone> element.
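Because the procedure ties phrase accents and boundary tones to break index values, the two annotations can be cross-checked once both exist. The sketch below is one possible check, not a MATE tool: it assumes that tones and break indices have already been grouped by the id of the <word> they link to, and the function name and the last entry of the sample data (wrd_099) are invented.

```python
def check_phrase_tones(break_type_by_word, tone_classes_by_word):
    """Return (word_id, message) pairs where tones and break indices disagree.

    Only the leading digit of the break index label is used, so '3p' or '4-'
    are treated like '3' and '4'; labels such as 'X' are skipped.
    """
    problems = []
    for word_id, label in break_type_by_word.items():
        level = label[0]
        classes = tone_classes_by_word.get(word_id, set())
        if level in ("3", "4") and "phraccent" not in classes:
            problems.append((word_id, "break index %s but no phrase accent" % level))
        if level == "4" and "boundtone" not in classes:
            problems.append((word_id, "break index 4 but no boundary tone"))
        if level in ("0", "1", "2", "3") and "boundtone" in classes:
            problems.append((word_id, "boundary tone without break index 4"))
    return problems


# First two entries follow the example above; wrd_099 is a deliberately faulty case.
breaks = {"wrd_005": "4", "wrd_009": "3", "wrd_099": "4"}
tones = {
    "wrd_005": {"pitaccent", "phraccent", "boundtone"},
    "wrd_009": {"pitaccent", "phraccent"},
    "wrd_099": {"phraccent"},
}
problems = check_phrase_tones(breaks, tones)
```

Matching on the linked word rather than on time values avoids false alarms when the two annotations differ by a small rounding offset.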

 

 

6.7 Markup Table

 

<tobitone>

Attribute   Value
id          [ASCII]
type        [ASCII]*
class       pitaccent, phraccent, boundtone
href        <f0> or <closecopy> or <momel> or <intone> or <syllable> or <word>
start       [FLOAT]
end         [FLOAT]

 

*The attribute ‘type’, although defined as ASCII data, can only contain an allowable combination of ToBI symbols. The following table provides the list of possible values for the attribute ‘type’, and a short description of its use:

 

Label     Description
L-        Low phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).
H-        High phrase accent, which occurs at an intermediate phrase boundary (level 3 and above).
!H-       Downstepped high phrase accent.
-         The phrase accent has not been transcribed yet.
-?        Uncertainty about the presence of a phrase accent.
X-?       Uncertainty about the type of phrase accent.
L-L%      A full intonation phrase with a L phrase accent ending its final intermediate phrase and a L% boundary tone falling to a point low in the speaker’s range; the standard ‘declarative’ contour of American English.
L-H%      A full intonation phrase with a L phrase accent closing the last intermediate phrase, followed by a H boundary tone; ‘continuation rise’.
H-H%      An intonation phrase with a final intermediate phrase ending in a H phrase accent and a subsequent H boundary tone; ‘yes-no questions’.
H-L%      An intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker’s range; final level ‘plateau’.
%         The boundary tone has not been transcribed yet.
%?        Uncertainty about the presence of a boundary tone.
X%?       Uncertainty about the type of boundary tone.
H*        ‘Peak accent’, an apparent tone target on the accented syllable which is in the upper part of the speaker’s pitch range for the phrase.
!H*       Downstepped ‘peak accent’.
L*        ‘Low accent’, an apparent tone target on the accented syllable which is in the lowest part of the speaker’s pitch range.
L*+H      ‘Scooped accent’, a low tone target on the accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker’s pitch range.
L*+!H     Downstepped ‘scooped accent’.
L+H*      ‘Rising peak accent’, a high peak target on the accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker’s pitch range.
L+!H*     Downstepped ‘rising peak accent’.
H+!H*     A clear step down onto the accented syllable from a high pitch which itself cannot be accounted for by a H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase.
*         The pitch accent has not been transcribed yet.
*?        Uncertainty about the presence of a pitch accent.
X*?       Uncertainty about the type of pitch accent.
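The relation between the 'type' labels and the 'class' attribute can be made explicit with a small helper. The sketch below relies on the ToBI diacritics only ('*' for pitch accents, '-' for phrase accents, '%' for boundary tones) and follows the convention of the tobitone.xml example, where a phrase-final label such as L-L% is encoded as two <tobitone> elements. The function names are hypothetical.

```python
from typing import List


def split_label(label: str) -> List[str]:
    """Split a combined phrase-final label such as 'L-L%' into ['L-', 'L%']."""
    if "-" in label and "%" in label:
        cut = label.index("-") + 1
        return [label[:cut], label[cut:]]
    return [label]


def tone_class(label: str) -> str:
    """Map a single ToBI tone label to the value of the 'class' attribute."""
    if "*" in label:
        return "pitaccent"
    if "-" in label and "%" in label:
        raise ValueError("combined label %r: split it first" % label)
    if "%" in label:
        return "boundtone"
    if "-" in label:
        return "phraccent"
    raise ValueError("not a ToBI tone label: %r" % label)
```

For instance, split_label("H-H%") gives the two labels of a yes-no question contour, which tone_class then assigns to "phraccent" and "boundtone" respectively. The helper does not check that a label belongs to the table above; it only reads the diacritics.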

 

 

<target>

Attribute   Value
id          [ASCII]
type        EarlyF0, LateF0
href        <f0> or <closecopy> or <momel> or <intone>
start       [FLOAT]
end         [FLOAT]

 

 

<f0range>

Attribute   Value
id          [ASCII]
type        HiF0
href        <f0> or <closecopy> or <momel> or <intone>
start       [FLOAT]
end         [FLOAT]

 

 

<repair>

Attribute   Value
id          [ASCII]
type        %r
href        <f0> or <closecopy> or <momel> or <intone> or <syllable>
start       [FLOAT]
end         [FLOAT]

 

 

7. Layer 4: Prosodic Phrasing - ToBI scheme

 

7.1 Markup Declaration

The Prosodic Phrasing Layer is intended to represent the prosodic structure of the utterance, at the levels above the word. In this sense it is complementary to the Phonetic Transcription Layer, where sub-word units such as phones and syllables are represented. The ToBI annotation system provides an effective way of marking prosodic structure, in its Break-Index Tier. The underlying theory builds up prosodic units in a hierarchy where clitics are joined to the following content word, word sequences form intermediate phrases and intermediate phrases form full intonation phrases. The end of each constituent is marked by a prosodic event whose importance is proportional to the boundary depth. ToBI provides a series of break indexes to rate the depth of the boundary. A single element <breakindex> is here defined to represent word boundaries and rate them with the proper degree of disjuncture.

7.2 The <breakindex> element

 

7.2.1 Description

 

ToBI symbols for prosodic boundaries can be adapted to XML by using a single element called <breakindex>. The ToBI notation conventions for Break Index transcription include the following symbols:

 

Symbol   Description
0        for cases of clear phonetic marks of clitic groups; e.g. the medial affricate in contractions of ‘did you’ or a flap as in ‘got it’
1        most phrase-medial word boundaries
2        a strong disjuncture marked by a pause or virtual pause, but with no tonal marks; i.e. a well-formed tune continues across the juncture, OR a disjuncture that is weaker than expected at what is tonally a clear intermediate or full intonation phrase boundary
3        intermediate intonation phrase boundary; i.e. marked by a single phrase tone affecting the region from the last pitch accent to the boundary
4        full intonation phrase boundary; i.e. marked by a final boundary tone after the last phrase tone

 

For the representation of disfluencies, ToBI also includes the 'p' diacritic, which can be attached to a break index symbol, as in the following example:

 

3p       a hesitation pause or a pause-like prolongation where there is a phrase accent in the tone tier

 

Uncertainty and underspecification can also be expressed by means of the ‘-’ and ‘?’ diacritics, and the ‘X’ symbol:

 

1-       uncertainty between ‘0’ and ‘1’ values
2-       uncertainty between ‘1’ and ‘2’ values
3-       uncertainty between ‘2’ and ‘3’ values

1p?      uncertainty about ‘1p’
2p?      uncertainty about ‘2p’
3p?      uncertainty about ‘3p’

X        underspecification of a break index value

 

Combining these symbols and diacritics yields the set of label strings that can be used to annotate prosodic boundaries in the ToBI scheme.

 

For a more detailed description of these symbols, the user is referred to Price (1992), Silverman et al. (1992), Beckman & Ayers (1994) and Pitrelli et al. (1994).

 

 

7.2.2 Data Source

 

Break index symbols are usually located at the end of words (not at the beginning of the unit); for this reason, it is useful to have the orthographic transcription available during the transcription task, as well as the speech file (and possibly the phonetic representation of intonation).

 

 

7.2.3. Segmentation/selection

 

Break indexes are assigned to a word in order to classify the degree of juncture perceived between it and the following word. For this classification, the speech file, aligned with its orthographic transcription, should be available and listened to. A display of the corresponding stylized curve may be helpful. The <breakindex> element is defined by selecting a <word> element, listening to the corresponding portion of the signal, looking at the f0 curve and, on this basis, classifying the type of right boundary of the word.

 

 

7.2.4. Assignment

 

The attributes considered here for the <breakindex> element are the following:

 

• type: the ToBI break-index label symbol classifying the prosodic boundary

• href: <word>

• start: time location of the boundary (inherited from <word> end)

• end: time location of the boundary (inherited from <word> end)
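As an illustration of how these attributes are derived, the sketch below builds a <breakindex> from a selected <word>. It assumes that the <word> element carries 'id' and 'end' attributes and is stored in a file named word.xml, as the 'href' values in the examples suggest; the helper name and the sample word (including its text and start time) are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_breakindex(bi_id, bi_type, word, word_file="word.xml"):
    """Build a <breakindex> located at the right edge of the given <word>."""
    end = word.get("end")
    if end is None:
        raise ValueError("the <word> element has no 'end' attribute")
    return ET.Element("breakindex", {
        "id": bi_id,
        "type": bi_type,
        "href": "%s#id(%s)" % (word_file, word.get("id")),
        "start": end,  # both time attributes are inherited from the word end
        "end": end,
    })


# Hypothetical <word> element; only its 'id' and 'end' are used.
word = ET.fromstring('<word id="wrd_005" start="2935" end="3315">fare</word>')
boundary = make_breakindex("brkndx_005", "4", word)
```

The 'type' value is still chosen by the annotator after listening; the helper only fills in the attributes that follow mechanically from the selected word.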

 


7.3 Example

 

The following example shows the 'break index' annotation of the same utterance of section 6.5 ("Show me the cheapest fare from Philadelphia to Dallas excluding restriction VU slash one") using the <breakindex> element:

 

breakindex.xml

 

<breakindex id="brkndx_001" type="1" href="word.xml#id(wrd_001)" start="2105" end="2105"/>

<breakindex id="brkndx_002" type="1" href="word.xml#id(wrd_002)" start="2245" end="2245"/>

<breakindex id="brkndx_003" type="1" href="word.xml#id(wrd_003)" start="2355" end="2355"/>

<breakindex id="brkndx_004" type="1" href="word.xml#id(wrd_004)" start="2935" end="2935"/>

<breakindex id="brkndx_005" type="4" href="word.xml#id(wrd_005)" start="3315" end="3315"/>

<breakindex id="brkndx_006" type="1" href="word.xml#id(wrd_006)" start="3565" end="3565"/>

<breakindex id="brkndx_007" type="1p" href="word.xml#id(wrd_007)" start="3836" end="3836"/>

<breakindex id="brkndx_008" type="1" href="word.xml#id(wrd_008)" start="4325" end="4325"/>

<breakindex id="brkndx_009" type="3" href="word.xml#id(wrd_009)" start="5015" end="5015"/>

<breakindex id="brkndx_010" type="1" href="word.xml#id(wrd_010)" start="5225" end="5225"/>

<breakindex id="brkndx_011" type="4" href="word.xml#id(wrd_011)" start="5855" end="5855"/>

<breakindex id="brkndx_012" type="4" href="word.xml#id(wrd_012)" start="7399" end="7399"/>

<breakindex id="brkndx_013" type="4" href="word.xml#id(wrd_013)" start="8585" end="8585"/>

<breakindex id="brkndx_014" type="1" href="word.xml#id(wrd_014)" start="8825" end="8825"/>

<breakindex id="brkndx_015" type="3" href="word.xml#id(wrd_015)" start="9115" end="9115"/>

<breakindex id="brkndx_016" type="1" href="word.xml#id(wrd_016)" start="9595" end="9595"/>

<breakindex id="brkndx_017" type="4" href="word.xml#id(wrd_017)" start="9880" end="9880"/>
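A break index annotation of this kind is enough to recover the prosodic phrasing of the utterance. The sketch below groups the words of the example into phrases by cutting after every break index at or above a chosen level; the function name is hypothetical, the labels are copied from the listing above, and uncertain or disfluent labels ('4-', '3p') are read by their leading digit, which is a simplifying assumption.

```python
def group_phrases(break_indices, min_level=4):
    """Group word ids into phrases, closing a phrase after each boundary
    whose break index level is at least min_level."""
    phrases, current = [], []
    for word_id, label in break_indices:
        current.append(word_id)
        if label[0].isdigit() and int(label[0]) >= min_level:
            phrases.append(current)
            current = []
    if current:  # words after the last strong boundary
        phrases.append(current)
    return phrases


# Break index labels of wrd_001 ... wrd_017, in order, from breakindex.xml.
labels = ["1", "1", "1", "1", "4", "1", "1p", "1", "3",
          "1", "4", "4", "4", "1", "3", "1", "4"]
example = [("wrd_%03d" % (n + 1), label) for n, label in enumerate(labels)]

intonation_phrases = group_phrases(example, min_level=4)
intermediate_phrases = group_phrases(example, min_level=3)
```

With min_level=4 the words are grouped into full intonation phrases; with min_level=3 the same call yields the intermediate phrases nested inside them.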

 

7.4 Coding Procedure

A procedure for break index annotation may be:

• open the following synchronized windows: speech file, f0 curve (<f0> or <closecopy> or <momel>...), <word>

• select a word to define a corresponding <breakindex> element

• listen to a surrounding portion of the speech signal and look at the f0 curve, in order to classify the boundary depth and choose the proper label

 

7.5 Markup Table

 

<breakindex>

Attribute   Value
id          [ASCII]
type        [ASCII]*
href        <word>
start       [FLOAT]
end         [FLOAT]

 

* The attribute ‘type’, although defined above as ASCII data, can only contain an allowable combination of ToBI symbols. The possible values of the ‘type’ attribute and their use are summarized in the following table:

 

Label    Description
0        for cases of clear phonetic marks of clitic groups; e.g. the medial affricate in contractions of ‘did you’ or a flap as in ‘got it’
1-       uncertainty between ‘0’ and ‘1’ values
1        most phrase-medial word boundaries
1p       an abrupt cutoff before an actual repair, or as if stopping to permit a repair or restart of some kind
1p?      uncertainty about ‘1p’
2-       uncertainty between ‘1’ and ‘2’ values
2        a strong disjuncture marked by a pause or virtual pause, but with no tonal marks; i.e. a well-formed tune continues across the juncture, OR a disjuncture that is weaker than expected at what is tonally a clear intermediate or full intonation phrase boundary
2p       a hesitation pause or prolongation of segmental material where there is no phrase accent perceived in the intonation contour
2p?      uncertainty about ‘2p’
3-       uncertainty between ‘2’ and ‘3’ values
3        intermediate intonation phrase boundary; i.e. marked by a single phrase tone affecting the region from the last pitch accent to the boundary
3p       a hesitation pause or a pause-like prolongation where there is a phrase accent in the tone tier
3p?      uncertainty about ‘3p’
4-       uncertainty between ‘3’ and ‘4’ values
4        full intonation phrase boundary; i.e. marked by a final boundary tone after the last phrase tone
X        underspecification of a break index value
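Since ‘type’ is declared as free ASCII data, the restriction to the values listed above has to be checked outside the DTD. One way to do so is a regular expression that accepts exactly the sixteen labels of the table; the sketch below is an illustration and the names BREAK_INDEX and is_valid_break_index are hypothetical.

```python
import re

# 0 and X stand alone; 1-4 may carry the '-' uncertainty diacritic;
# 1-3 may carry the 'p' disfluency diacritic, optionally followed by '?'.
BREAK_INDEX = re.compile(r"0|X|[1-4]-?|[1-3]p\??")


def is_valid_break_index(label: str) -> bool:
    """True if label is one of the break index values listed in the table."""
    return BREAK_INDEX.fullmatch(label) is not None


TABLE_LABELS = ["0", "1-", "1", "1p", "1p?", "2-", "2", "2p", "2p?",
                "3-", "3", "3p", "3p?", "4-", "4", "X"]
```

Labels such as "4p" or "0-" are rejected because they do not appear in the table, even though they combine symbols that are individually allowed.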

 

 

References

 

BEAUGENDRE, F. et al. (1992).- "A perceptual study of French intonation", ICSLP 92, Banff.

 

BECKMAN, M.E. - AYERS, G.M. (1994).- Guidelines for ToBI Labelling. Version 2.0, February 1994.

 

BECKMAN, M.E. - HIRSCHBERG, J. (1994).- The ToBI Annotation Conventions. Appendix A of BECKMAN, M.E. - AYERS, G.M. (1994).- Guidelines for ToBI Labelling. Version 2.0, February 1994.

 

BRINDOPKE, C. - PHADE, A. - KUMMERT, F. - SAGER, G. (1997).- "An environment for the labeling and testing of melodic aspects of speech", Eurospeech 97.

 

CAMPIONE, E. - FLACHAIRE, E. - HIRST, D. - VERONIS, J. (1997).- "Stylisation and symbolic coding of F0: A quantitative model", in BOTINIS, A. - KOUROUPETROGLOU, G. - CARAYIANNIS, G. (eds.) (1997).- Intonation: Theory, Models and Applications. Proceedings of an ESCA Workshop, September 18-20, Athens, Greece, ESCA/University of Athens, pp. 71-74.

 

GIBBON, D. (1989).- Survey of Prosodic Labelling for EC Languages. SAM-UBI-1/90, 12 February 1989; Report e.6, in ESPRIT 2589 (SAM) Interim Report, Year 1. Ref. SAM-UCL G002. University College London, February 1990.

 

HIRST, D.J. (1991).- "Intonation models: Towards a third generation" in Actes du XIIème Congrès International des Sciences Phonétiques. 19-24 août 1991, Aix-en-Provence, France. Aix-en-Provence: Université de Provence, Service des Publications. Vol. 1 pp. 305-310.

 

HIRST, D.J. (1994).- "The symbolic coding of fundamental frequency curves: from acoustics to phonology", in FUJISAKI, H. (Ed.) Proceedings of International Symposium on Prosody, Satellite Workshop of ICSLP 94, Yokohama, September 1994.

 

HIRST, D.J. - DI CRISTO, A. - LE BESNERAIS, M. - NAJIM, Z. - NICOLAS, P. - ROMÉAS, P. (1993).- "Multilingual modelling of intonation patterns", in HOUSE, D. - TOUATI, P. (Eds.) Proceedings of an ESCA Workshop on Prosody. September 27-29, 1993, Lund, Sweden. Lund University Department of Linguistics and Phonetics, Working Papers 41. pp. 204-207.

 

HIRST, D.J. - DI CRISTO, A. (Eds.) (1998).- Intonation Systems. A Survey of Twenty Languages. Cambridge: Cambridge University Press.

 

HIRST, D.J. - ESPESSER, R. (1993).- "Automatic modelling of fundamental frequency using a quadratic spline function", Travaux de l’Institut de Phonétique d’Aix, 15, pp. 71-85.

 

PIERREHUMBERT, J. B. (1980).- The Phonology and Phonetics of English Intonation, Bloomington, Indiana University Linguistics Club.

 

PITRELLI, J. - BECKMAN, M. - HIRSCHBERG, J. (1994).- "Evaluation of prosodic transcription labelling reliability in the ToBI framework", in Proceedings of the Third International Conference on Spoken Language Processing, Yokohama, ICSLP, Vol. 2, pp. 123-126.

 

PRICE, P. (1992).- Summary of the Second Prosodic Transcription Workshop: the TOBI (TOnes and Break Indices) Labeling System. Nynex Science and Technology, Inc. 5-6 April, 1992. Linguist List vol. 3-761, 9 October 1992.

 

QUAZZA, S. (1991).- "Modelling Italian Intonation on a Text-to-Speech System", Eurospeech 91, Genova.

 

SILVERMAN, K. - BECKMAN, M. - PITRELLI, J. - OSTENDORF, M. - WIGHTMAN, C. - PRICE, P. - PIERREHUMBERT, J. - HIRSCHBERG, J. (1992).- "TOBI: A standard for labeling English prosody", in OHALA, J.J. et al. (eds.).- Proceedings ICSLP 92, pp. 867-870.

 

’T HART, J. - COLLIER, R. - COHEN, A. (1990).- A perceptual study of intonation. An experimental-phonetic approach to speech melody, Cambridge, Cambridge University Press.

 

WELLS, J. - BARRY, W. - GRICE, M. - FOURCIN, A. - GIBBON, D. (1992).- Standard Computer-Compatible Transcription, SAM Stage Report Sen.3 SAM UCL-037, 28 February 1992. In SAM (1992) ESPRIT PROJECT 2589 (SAM) Multilingual Speech Input/Output Assessment, Methodology and Standardisation. Final Report. Year Three: 1.III.91-28.II.1992. Ref. SAM-UCL-G004. London: University College London.

 

WILLEMS, N. J. - COLLIER, R. - ’T HART, J. (1988).- "A synthesis scheme for British English intonation", Journal of the Acoustical Society of America, 84.

 

 

Annex 1: DTD for Layer 1 (Phonetic transcription)

 

<!-- DTD for the MATE project Prosody Level based on D2.1 (june 99)

XML 1.0, XLink 1.0 DTD

18/6/99 -->

 

<!-- prlayer1 DOCTYPE contains all the element and attribute declarations for the layer 1 of the prosody level -->

 

<!DOCTYPE prlayer1 [

 

<!-- Layer 1: phonetic transcription; SAMPA scheme -->

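<!ELEMENT syllable (phone)* >

<!-- stress: SAMPA mark " (primary) or % (secondary); neither is a valid name token, so the attribute is declared as CDATA rather than as an enumeration -->

<!ATTLIST syllable

id ID #REQUIRED

stress CDATA #IMPLIED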

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED>

<!ELEMENT phone (#PCDATA) >

<!ATTLIST phone

id ID #REQUIRED

type CDATA #IMPLIED

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED>

 

<!-- End of prlayer1 DOCTYPE -->

]>

 

 

Annex 2: DTD for Layer 2a (Phonetic representation of intonation - IPO scheme)

 

<!-- DTD for the MATE project Prosody Level based on D2.1 (june 99)

XML 1.0, XLink 1.0 DTD

18/6/99 -->

<!-- layer2a DOCTYPE contains all the element and attribute declarations for the layer 2a of the prosody level -->

 

<!DOCTYPE layer2a [

 

<!-- Layer 2: phonetic representation of intonation -->

<!-- Layer 2a: IPO scheme -->

<!ELEMENT f0 (#PCDATA) >

<!ATTLIST f0

id ID #REQUIRED

value NMTOKEN #IMPLIED

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED>

 

<!ELEMENT closecopy (#PCDATA) >

<!ATTLIST closecopy

id ID #REQUIRED

value NMTOKEN #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

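<!ELEMENT pitmove (#PCDATA) >

<!-- type: one of 0, Ø, 1, 2, 3, 4, 5, A, B, C, D, E, &2, &3, &4, &A, &C, &D; the labels starting with & are not valid name tokens, so the attribute is declared as CDATA rather than as an enumeration -->

<!ATTLIST pitmove

id ID #REQUIRED

type CDATA #IMPLIED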

href CDATA #IMPLIED

xml:link CDATA #FIXED "extended"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!-- End of layer2a DOCTYPE -->

]>

 

 

 

Annex 3: DTD for Layer 2b (Phonetic representation of intonation - INTSINT scheme)

 

<!-- DTD for the MATE project Prosody Level based on D2.1 (june 99)

XML 1.0, XLink 1.0 DTD

18/6/99 -->

 

<!-- layer2b DOCTYPE contains all the element and attribute declarations for the layer 2b of the prosody level -->

 

<!DOCTYPE layer2b [

 

<!-- Layer 2: phonetic representation of intonation -->

 

<!-- Layer 2b: INTSINT scheme -->

 

<!ELEMENT f0 (#PCDATA) >

<!ATTLIST f0

id ID #REQUIRED

value NMTOKEN #IMPLIED

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED>

 

<!ELEMENT momel (#PCDATA) >

<!ATTLIST momel

id ID #REQUIRED

value NMTOKEN #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!ELEMENT intone (#PCDATA) >

<!ATTLIST intone

id ID #REQUIRED

type (T | M | B | H | S | L | U | D) #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!-- End of layer2b DOCTYPE -->

]>

 

 

 

Annex 4: DTD for Layer 3 (Phonological representation of intonation - ToBI scheme)

 

<!-- DTD for the MATE project Prosody Level based on D2.1 (june 99)

XML 1.0, XLink 1.0 DTD

18/6/99 -->

 

<!-- prlayer3 DOCTYPE contains all the element and attribute declarations for the layer 3 of the prosody level -->

 

<!DOCTYPE prlayer3 [

 

<!-- Layer 3: phonological representation of intonation; ToBI scheme -->

 

<!ELEMENT tobitone (#PCDATA) >

<!ATTLIST tobitone

id ID #REQUIRED

type CDATA #IMPLIED

class (pitaccent | phraccent | boundtone) #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!ELEMENT target (#PCDATA) >

<!ATTLIST target

id ID #REQUIRED

type (EarlyF0 | LateF0) #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!ELEMENT f0range (#PCDATA) >

<!ATTLIST f0range

id ID #REQUIRED

type CDATA #FIXED "HiF0"

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!ELEMENT repair (#PCDATA) >

<!ATTLIST repair

id ID #REQUIRED

type CDATA #FIXED "%r"

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!-- End of prlayer3 DOCTYPE -->

]>

 

 

 

Annex 5: DTD for Layer 4 (Prosodic Phrasing - ToBI scheme)

 

<!-- DTD for the MATE project Prosody Level based on D2.1 (june 99)

XML 1.0, XLink 1.0 DTD

18/6/99 -->

 

<!-- prlayer4 DOCTYPE contains all the element and attribute declarations for the layer 4 of the prosody level -->

 

<!DOCTYPE prlayer4 [

 

<!-- Layer 4: prosodic phrasing; ToBI scheme -->

 

<!ELEMENT breakindex (#PCDATA) >

<!ATTLIST breakindex

id ID #REQUIRED

type CDATA #IMPLIED

href CDATA #IMPLIED

xml:link CDATA #FIXED "simple"

start NMTOKEN #IMPLIED

end NMTOKEN #IMPLIED >

 

<!-- End of prlayer4 DOCTYPE -->

]>