Peeter Lind, Ülle Viks
The first major task MALL has been applied to is type recognition: the phonological properties of the initial form (lemma) of a word are used as a clue to the place of the word in a morphological classification. The classification is taken from Viks 1992: A Concise Morphological Dictionary of Estonian (MDE).
As the relationship between the phonological and morphological properties is quite close, the phonological shape of an Estonian word usually suffices to identify its inflection type (see Viks 1990). If, for example, the lemma stem of a verb consists of two syllables and ends in the sequence 'consonant + LE', it is very likely to be conjugated like r`iidle[ma (Type 30 in MDE), e.g. v`aatle[ma : vaadel[da, h`üple[ma : hüpel[da, n`õudle[ma : nõuel[da. MDE contains 175 such words, but there are three words of the same structure that are conjugated like ela[ma (Type 27): t`aotle[ma : t`aotle[da, l`oetle[ma : l`oetle[da, n`õutle[ma : n`õutle[da. Thus MALL helps the linguist to find out which phonological properties correlate with the inflection type of a word, and to what extent.
More generally, MALL enables one to ascertain correlations between two groups of characteristic features: the phonological properties of a word and some other properties that are taken as distinctive features for a classification. Although the program has been written with an eye to the requirements of morphology, the morphological classification can be replaced with another one, so that the researcher can study the correlation between the phonological properties and, e.g., the derivational type of a word, or between different phonological characteristics.
The input lexicon must be stored in one directory; the number of files per inflection type corresponds to the number of different stem variants in the type. The file name has the form tp??*.txt, in which characters 3-4 contain the number of the inflection type and the following 2 or 3 characters represent the code of the stem variant (A-/B-stem, strong/weak stem, etc.).
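The naming convention can be illustrated with a short Python sketch. The function name `parse_lexicon_filename` and the sample name `tp28AT.txt` are illustrative assumptions (the code 28AT is borrowed from the type column of Figure 2); only the layout of the name is taken from the description above.

```python
import re

# "tp" + two-digit type number + 2- or 3-character stem code + ".txt"
FILENAME_RE = re.compile(r"^tp(\d{2})(.{2,3})\.txt$", re.IGNORECASE)

def parse_lexicon_filename(name):
    """Return (inflection type number, stem code), or None if the name does not fit."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).upper()

# Hypothetical file name, used only for illustration.
print(parse_lexicon_filename("tp28AT.txt"))
```

Names that do not follow the layout yield `None`, so a directory listing can be filtered before any file is read.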
The program consists of a working module and an initialization file. The ini-file (see Suppl. 1) describes the alphabet and defines a few sound classes that can be adapted to the task at hand. In essence, MALL is a database system with certain linguistic functions.
a) number of syllables from the beginning of the word (S1),
b) number of syllables from the main stress syllable (S2),
c) degree of phonetic quantity (Vl),
d) final sounds (Lh),
e) medial sounds (Sh),
f) syllable structure of the word (Sss).
The medial and the final sounds can be considered either in the original form or in various sound classes specified by the user in the ini-file.
The selection and alternation of the features proceeds in dialog mode, because the relevance of a given feature for type recognition may differ from type to type. In some cases it suffices to fix just a couple of general features, whereas other cases require several features in greater detail.
For the generation of a pattern, the user is offered a menu provided with the default values of the features (see Fig. 1).
1 Silpe sõna algusest (J/E) : J
2 Silpe rõhust alates (J/E) : J
3 Välde (J/E) : J
4 Lõpuhäälikute arv (0..5) : 1 ei asenda
5 Sisehäälikud (J/E) : E
6 Tüübid (Nt.00,02-07,12-34) : 00-38
7 Tüvekoodid [ABCD?][TN0G?][RV?] : TP*.TXT
8 Lemmatüved (J/E) : E
9 Sõna silbistruktuur (J/E) : E
T TEGUTSE
A Asenduste ridade vaatamine
0 või Esc Tagasi algmenüüsse
Figure 1. Dialog screen for the generation of a pattern
Lines 1-5 and 9 specify the phonological features to be included in the pattern. Lines 6-8 serve to limit the number of words to be analyzed. Selection can be based on the type number (6) and on the stem code (7). To simplify the selection of initial forms, line 8 presents the lemma stems. After the final and the medial sounds have been selected, there follows a question on sound class replacements. Once the initial requirements have been fixed in the form of a pattern, the program can proceed to analysis.
One of the basic modules of the program deals with syllabification. Several features cannot be identified unless the word has already been syllabified and the syllable carrying the main stress has been identified. For the rules underlying the syllabification algorithm see Suppl. 2.
The number of syllables is counted in two ways: from the beginning of the word (S1) and from the syllable carrying the main stress (S2). For genuine Estonian words, S1 and S2 usually coincide, as in Estonian the main stress falls, as a rule, on the first syllable. The difference appears in the case of foreign words, e.g. šokol`aad (S1=3, S2=1), gig`ant (S1=2, S2=1), and of such genuine Estonian words that contain a suffix with grade alternation, e.g. k`oolk`ond (S1=2, S2=1), sõbral`ik (S1=3, S2=1). Although from the phonetic point of view the main stress is often moved from the non-initial syllable to the first syllable (š'okol`aad -- not šokol'`aad, s'õbral`ik -- not sõbral'`ik), MALL follows the, so to speak, morphological main stress rather than the pronunciation.
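The two counts can be sketched in Python as follows. This is only an illustration: the function name `syllable_counts`, the hyphenated input form and the 0-based stress index are assumptions, and both the syllabification and the position of the morphological main stress are taken as given rather than computed.

```python
def syllable_counts(syllables, stress_index):
    """Return (S1, S2) for a syllabified word.

    syllables    -- list of syllables, e.g. ["šo", "ko", "laad"]
    stress_index -- 0-based position of the syllable carrying the
                    morphological main stress (assumed to be known)
    """
    if not 0 <= stress_index < len(syllables):
        raise ValueError("stress_index must point at one of the syllables")
    s1 = len(syllables)                  # counted from the beginning of the word
    s2 = len(syllables) - stress_index   # counted from the main-stress syllable
    return s1, s2

# Syllable divisions below are assumed for the sake of the example.
for word, stress in [("šo-ko-laad", 2), ("gi-gant", 1), ("sen-joo-ra", 1)]:
    print(word, syllable_counts(word.split("-"), stress))
```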
Final sounds (Lh) are those denoted by the final letters of the word. If a word happens to be shorter than the number of positions asked for the final sounds, blanks remain to the left of them.
Medial sounds (Sh) begin from the first vowel of the main-stress syllable and end either with the sound preceding the first vowel of the next syllable, or with the end of the word (if the main stress falls on the last syllable).
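A minimal Python sketch of the two definitions above, under stated assumptions: the function names are hypothetical, the word is given in capital letters without stress or quantity marks, the syllabification and the main-stress position are supplied by the caller, and the only class replacement shown is vowel/consonant.

```python
VOWELS = set("AEIOUÕÄÖÜ")

def final_sounds(word, n, as_classes=False):
    """Lh: the last n letters, padded with blanks on the left for short words."""
    tail = word[-n:] if n > 0 else ""
    if as_classes:
        tail = "".join("v" if ch in VOWELS else "c" for ch in tail)
    return tail.rjust(n)

def medial_sounds(syllables, stress_index):
    """Sh: from the first vowel of the main-stress syllable up to the sound
    preceding the first vowel of the next syllable, or to the end of the word."""
    stressed = syllables[stress_index]
    start = next(i for i, ch in enumerate(stressed) if ch in VOWELS)
    result = stressed[start:]
    if stress_index + 1 < len(syllables):
        for ch in syllables[stress_index + 1]:
            if ch in VOWELS:
                break
            result += ch
    return result

print(final_sounds("SENJOORA", 3, as_classes=True))
print(medial_sounds(["SEN", "JOO", "RA"], 1))
```

Because the marks are left out of the input, the sketch returns plain letters only; a quantity mark such as the one MALL prints in front of medial sounds is not reproduced.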
The quantity degree (Vl) of a word is ascertained as follows:
The syllable structure (Sss) of a word is a representation of the word as a sequence of syllable types. The underlying classification is, in principle, the one suggested by P. Päll in 1986, but with slight modifications. Relevant are the sounds from the syllable nucleus (consonants preceding the first vowel are not considered) up to the syllable boundary (that may coincide with the end of the word). The syllables are classified as follows ('v' = vowel, 'c' = consonant, 's' = s word-final, '-' = syllable boundary):
open syllable (ends in a vowel):
  -- short (contains one vowel)          v-   = Y
  -- long (contains two vowels)          vv-  = E
closed syllable (ends in a consonant):
  -- contains one vowel:
     -- ends in a word-final s           vs-  = S
     -- ends in one consonant            vc-  = G
     -- ends in several consonants       vcc  = K
  -- contains two vowels:
     -- ends in one consonant            vvc- = D
     -- ends in several consonants       vvcc = T
In order to identify the syllable structure of a word the system does the following:
1) doubles the fortis stops (k, p, t) and the foreign fortis consonants (f, š) in a voiced environment, the end of a word included:
latern → lattern, laatsaret → laatsarett
2) syllabizes the word according to the rules of syllabification:
lat-tern, laat-sa-rett
3) identifies the type of each syllable according to the type descriptions of syllables:
lat-tern → GK, laat-sa-rett → DYK
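Steps 1 and 3 can be sketched in Python as follows; step 2 is not implemented, so the syllabified form is supplied by hand. All names are hypothetical. Two readings are assumptions of this sketch rather than statements of the source: a "voiced environment" is taken to mean a preceding vowel or voiced consonant together with a following voiced sound or the end of the word, and a nucleus of more than two vowels is treated like a two-vowel nucleus.

```python
VOWELS = set("aeiouõäöü")
VOICED_CONSONANTS = set("jlmnrzžv")
FORTIS = set("kptfš")   # fortis stops plus the foreign fortis consonants

def double_fortis(word):
    """Step 1: double a fortis consonant in a voiced environment (assumed reading)."""
    voiced = VOWELS | VOICED_CONSONANTS
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if ch in FORTIS and i > 0 and word[i - 1] in voiced:
            at_end = i + 1 == len(word)
            if at_end or word[i + 1] in voiced:
                out.append(ch)
    return "".join(out)

def syllable_type(syllable, word_final):
    """Step 3: classify one syllable; consonants before the nucleus are ignored."""
    start = next(i for i, ch in enumerate(syllable) if ch in VOWELS)
    rest = syllable[start:]
    n_vowels = sum(ch in VOWELS for ch in rest)
    coda = rest[n_vowels:]             # consonants following the vowels
    long_nucleus = n_vowels >= 2       # assumption: 3+ vowels count as long
    if not coda:
        return "E" if long_nucleus else "Y"
    if long_nucleus:
        return "D" if len(coda) == 1 else "T"
    if word_final and coda == "s":
        return "S"
    return "G" if len(coda) == 1 else "K"

def syllable_structure(syllabified):
    """Map a hyphen-separated word to its sequence of syllable types (Sss)."""
    syllables = syllabified.split("-")
    last = len(syllables) - 1
    return "".join(syllable_type(s, i == last) for i, s in enumerate(syllables))

print(double_fortis("latern"), syllable_structure("lat-tern"))
print(double_fortis("laatsaret"), syllable_structure("laat-sa-rett"))
```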
In the following example of analysis all properties belonging to the pattern have been found for three words: SENJOORA, LATERN and TSISTERN:
       SEN-JOO-RA   LA-TERN   TSIS-TERN
S1     3            2         2
S2     2            2         1
Vl     2            2         3
Lh     vcv          vcc       vcc
Sh     OOR          AT        `ERN
Sss    GEY          GK        GK
(There are three final sounds, divided into vowels and consonants; the medial sounds are presented in their original form.)
First, all analyzed words are recorded in a text file in which each word is supplemented by a sequence of phonological features with separators in the following order: /S1 (S2 )Sh !Vl =Lh ?Sss
This is what the above examples look like in the text file:
SENJOORA / 3( 2) OOR!2=vcv?GEY
LATERN / 2( 2) AT!2=vcc?GK
TSIST`ERN / 2( 1) `ERN!3=vcc?GK
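Besides 'grep', such records can be read back by a small script. The following Python sketch parses one record into its fields; the regular expression is an assumption derived from the sample records shown here, and the exact spacing in a real MALL output file may differ.

```python
import re

# word / S1( S2) Sh!Vl=Lh?Sss  (separators as in the sample records)
RECORD_RE = re.compile(
    r"^(?P<word>\S+)\s*/\s*(?P<s1>\d+)\(\s*(?P<s2>\d+)\)\s*"
    r"(?P<sh>[^!]*)!(?P<vl>\d+)=(?P<lh>[^?]*)\?(?P<sss>\S*)$"
)

def parse_record(line):
    """Return the fields of one record as a dict, or None if the line does not match."""
    m = RECORD_RE.match(line.strip())
    if m is None:
        return None
    record = m.groupdict()
    for key in ("s1", "s2", "vl"):
        record[key] = int(record[key])
    return record

print(parse_record("TSIST`ERN / 2( 1) `ERN!3=vcc?GK"))
```

With the records held as dictionaries, inquiries such as "all words with Vl=3 and Sss=GK" reduce to ordinary filtering.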
The text file can be addressed by additional specific inquiries by means of 'grep' or 'agrep'. Also, the text file can be handled by some other programs for sorting, restructuring, etc.
Second, the screen format represents a table of patterns displaying the correlations between the patterns and the inflection types.
Figure 2 displays the results of an analysis where the inquiry concerned the lemma stems of Types 28-30 of MDE and the pattern was to include: the number of syllables from the beginning of the word (S1), the number of syllables from the stressed syllable (S2), and the quantity degree (Vl) of the word.
Nr  Kordi     M/T     T/M  Tüüp   Mall: S1 S2 Vl Lh Sh Sss
 1     20    1.33  100.00  28AT          2  2  1
 2    831   55.44   54.85  28AT          2  2  3
 3    508  100.00   33.53  29AT          2  2  3
 4    176   99.44   11.62  30AT          2  2  3
 5    175   11.67  100.00  28AT          3  2  3
 6    295   19.68   99.66  28AT          4  2  3
 7      1    0.56    0.34  30AT          4  2  3
 8    130    8.67  100.00  28AT          5  2  3
 9     39    2.60  100.00  28AT          6  2  3
10      8    0.53  100.00  28AT          7  2  3
11      1    0.07  100.00  28AT          8  2  3
Lõpetab ESC, ENTER näitab sõnu, P Print, PgUp, PgDn, F1 Abi, F2, F3
Figure 2. Table of patterns
Every line represents a combination of a type and a pattern. Column 1 contains the line number; Column 2 shows how many words of the given type correspond to the given pattern. Columns 3 and 4 display the pattern/type (M/T) and type/pattern (T/M) ratios as percentages: M/T is the share of the words of the given type that match the pattern, and T/M is the share of the words matching the pattern that belong to the given type. There follows the number of the inflection type together with the stem code, and the phonological pattern in the composition asked for.
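The two ratios can be recomputed from the word counts alone. The Python sketch below does so for the rows of Figure 2; the function name `ratios` and the row layout are illustrative assumptions, and the pattern is kept as the string of values printed in the table.

```python
from collections import defaultdict

# (number of words, type with stem code, pattern as printed) from Figure 2
ROWS = [
    (20, "28AT", "2 2 1"), (831, "28AT", "2 2 3"), (508, "29AT", "2 2 3"),
    (176, "30AT", "2 2 3"), (175, "28AT", "3 2 3"), (295, "28AT", "4 2 3"),
    (1, "30AT", "4 2 3"), (130, "28AT", "5 2 3"), (39, "28AT", "6 2 3"),
    (8, "28AT", "7 2 3"), (1, "28AT", "8 2 3"),
]

def ratios(rows):
    """Return (type, pattern, M/T, T/M) per row, both ratios in percent."""
    per_type = defaultdict(int)
    per_pattern = defaultdict(int)
    for count, typ, pattern in rows:
        per_type[typ] += count
        per_pattern[pattern] += count
    return [
        (typ, pattern,
         round(100 * count / per_type[typ], 2),        # share of the type's words
         round(100 * count / per_pattern[pattern], 2))  # share of the pattern's words
        for count, typ, pattern in rows
    ]

for row in ratios(ROWS):
    print(row)
```

For line 2 of the table, for instance, 831 of the 1499 words of type 28AT give M/T = 55.44, and 831 of the 1515 words with that pattern give T/M = 54.85.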
The user has the following options:
a) sort the table by columns (number of words, ratios, types, patterns);
b) print the whole table;
c) see what words correspond to a line;
d) store the words of a line in a separate file;
e) divide the input files into two parts depending on correlations (full or partial correlation).
Let us take, for example, declinable words of three or more syllables ending in a consonant (S1=3... & Lh=c). According to MDE they are divided among the following six types of inflection: 02 (õpik), 09 (katus), 11 (harjutus), 19 (seminar), 22 (s`epp), 25 (õnnel`ik). Three final sounds divide the words into three groups (vvc, cvc and cc) within which the patterns can be further approximated. For a sample of the recognition rules see Suppl. 3. The following is a closer study of a subgroup:
Lh=vvc
  v1v2n → 19~02 (141) * 02 (2), 19 (2), 22 (3)
  v1v1n → 22 (1007) * 19 (2)
  vv+^n → 22 (605) * 02 (4), 09 (1), 11 (15)
The distinctive feature in the 'vvc' group is the sequence v1v2n (two different vowels + a voiced consonant), which indicates that words with such final sounds belong to two types (19 and 02) in parallel. E.g. st`aadion can be declined either like seminar (19): st`aadion : st`aadioni : st`aadioni : st`aadioni[de or like õpik (02): st`aadion : st`aadioni : st`aadioni[t : st`aadioni[te. The number of such words in MDE is 141, apart from 7 exceptions: konv`eier, biidermeier (02); karaul, liineal (19); linol`eum, mausol`eum, karbolin`eum (22).
The rest of the 'vvc' words, i.e. a) 'v1v1n' (two identical vowels + a voiced consonant) and b) 'vv+^n' (two vowels + a nonvoiced consonant), belong to Type 22 (s`epp), e.g. illusi`oon, kartot`eek, sinus`oid. In MDE there are 1612 such words, with 22 exceptions: kont`iinuum, v`aakuum (19); küren'aik, m`essias, paran'oik, tobias (02); skarab`eus (09); avaus, bakal`aureus, f`aatsies, g`eenius, `iileus, `ishias, k`aaries, n`oonius, n`untsius, ordin`aarius, paleus, p`ankreas, r`aadius, stradiv`aarius, teenuis (11).
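The rules for the 'vvc' group can be written down as a small Python function. This is a sketch under stated assumptions: the function name is hypothetical, the voiced consonants are taken from class 'n' of the ini-file (Suppl. 1), and the precondition that the word is declinable and has three or more syllables is left to the caller. The listed exceptions are not handled, so for a word such as linol`eum the function returns the regular prediction rather than the type recorded in MDE.

```python
VOWELS = set("aeiouõäöü")
VOICED_CONSONANTS = set("jlmnrzžv")   # class 'n' in the ini-file

def predict_vvc_type(word):
    """Predicted inflection type(s) for a word ending in vowel + vowel + consonant.

    Returns None if the three final sounds are not of the form 'vvc'.
    """
    w = word.replace("`", "").replace("'", "").lower()   # drop the marks
    if len(w) < 3:
        return None
    v1, v2, c = w[-3], w[-2], w[-1]
    if v1 not in VOWELS or v2 not in VOWELS or c in VOWELS:
        return None
    if c in VOICED_CONSONANTS and v1 != v2:
        return "19~02"     # v1v2n: both types in parallel
    return "22"            # v1v1n and vv+^n

for w in ["st`aadion", "illusi`oon", "kartot`eek", "sinus`oid"]:
    print(w, predict_vvc_type(w))
```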
The following are comments to be ignored by the program.
Line 1 : letters accepted
Line 2 : voiced (l) and voiceless (k/c) sounds
Line 3 : vowels
Line 4 : short consonants
Line 2 is replaced by an alphabet according to the user-selected code. The maximum number of possible choices is 9.
The code (name) of the user's alphabet is on a separate line. The code can be at most 8 characters long and is enclosed in square brackets '['...']'. The next line contains a user's alphabet in terms of sound classes.
Vowels/consonants
[V-C-] vccvcccvcccccvcccccccvcvvvv
Long/short consonants
[V-Cpl] vllvpllvlplllvpllpllpvlvvvv
Voiced/voiceless consonants
[V-Cht] vttvtttvhthhhvthtthhtvhvvvv
Full classification of consonants
[V-C+] vggvfgsvnknnnvknsfnnkvnvvvv
Classification of vowels
[V+C-] xccqcccycccccqcccccccycqxqy
Stem-final sounds
[Lh] AggEfgsIjknnnOknsfnnkUjvvvv
Medial sounds
[Sh] vggvkgsvnknnnvknsknnkvnvvvv
Anything
[oma] vllvpllvlplMlvpllpllpvlvvvv
Sound classes:
v = vowels: AEIOUÕÄÖÜ
y = high: IUÜ
q = medium-high: EOÕÖ
x = low: AÄ
c = consonants: BDFGHJKLMNPRSŠZŽTV
p = long: KPTFŠ
l = short: BDGHJLMNRSZŽV
h = voiced: JLMNRZŽV
t = voiceless: BDFGHKPSŠT
g = lenis: GBD
k = fortis: KPT
f = foreign fortis: FŠ
s = sibilants, spirants: SH
n = voiced: JLMNRZŽV
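How a user's alphabet replaces letters by sound classes can be sketched in Python. The 27-letter order in `ALPHABET` is an assumption: the line of accepted letters is not reproduced in this supplement, and the order was inferred by matching the class strings against the class definitions. The names `USER_ALPHABETS` and `to_classes` are hypothetical, and only four of the user's alphabets are included.

```python
# Assumed letter order (27 letters); the actual ini-file line is not shown.
ALPHABET = "ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ"

# Class strings copied from the user's alphabets listed in this supplement.
USER_ALPHABETS = {
    "V-C-":  "vccvcccvcccccvcccccccvcvvvv",
    "V-Cpl": "vllvpllvlplllvpllpllpvlvvvv",
    "V-Cht": "vttvtttvhthhhvthtthhtvhvvvv",
    "V+C-":  "xccqcccycccccqcccccccycqxqy",
}

def to_classes(text, code):
    """Replace every letter of text by its sound class in the chosen alphabet."""
    classes = USER_ALPHABETS[code]
    if len(classes) != len(ALPHABET):
        raise ValueError("class string and alphabet differ in length")
    table = dict(zip(ALPHABET, classes))
    # Characters outside the alphabet (e.g. marks) are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text.upper())

print(to_classes("LATERN", "V-C-"))
print(to_classes("LATERN", "V-Cht"))
```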
In a word it is important to differentiate between:
A non-initial syllable carries the morphological main stress if it:
If there are two or more syllables fulfilling the above conditions for a main-stress syllable, the stress is regarded as falling on the last of them (k`onst`ant).
* (3.2) exceptions: there are two foreign words with two identical vowels in a non-main-stress syllable that produce two syllable nuclei (v`aaku-um, kont`iinu-um); variants with one vowel are in parallel use (v`aakum, kont`iinum).
** (3.3.b) a sequence ending in 'i' is also created by the suffixes istika, ist and ism if they get linked to a vowel. But as those suffixes carry the main stress they create a syllable boundary in front of them (kasu-'istika, ate-`ist, ego-`ism).
*** (4.b) an ambiguous situation may arise:
1) on the word boundary in compound words if the second member begins with a vowel (t`äis-`arv) or a consonant cluster (`öö-klubi);
2) before a final foreign component of a compound-like word if the component begins with a consonant cluster (tele-gr`amm);
3) in foreign names (Neu-stadt, Dobro-ljubov, Gorba-tšov).
Lh=vvc
  v1v2n → 19~02 (141) * 02 (2), 19 (2), 22 (3)
  v1v1n → 22 (1007) * 19 (2)
  vv+^n → 22 (605) * 02 (4), 09 (1), 11 (15)
Lh=cvc:
  Lh=cvn
    c+(EL/ER/OR) → 02 (277) * 19 (27)
    c+^(EL/ER/OR) → 19 (190) * 02 (37)
  Lh=cvk
    (n/D/ST)+IK → 25 (1064) * 02 (50)
    ^(n/D/ST)+IK → 02 (47) * 25 (3)
    c+^(IK) → 22 (15)
  Lh=cvs
    c+IS → 11 (148) * 09 (12)
    c+US & S1=3 → 11 (2041) * 11~09 (46)
         & S1=4... → 11~09 (313) * 11 (46)
    c+^(IS/US) → 02 (696) * 09 (1), 11~09 (1)
  Lh=cv+^(n/k/s) → 02 (208) * 19 (1)
Lh=cc
  NG → 02 (48) * 22 (8)
  ^(NG) → 22 (949) * 02 (4)
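To compare the rules, one can compute what share of the words matching a rule actually follow it. The Python sketch below does this for a few of the rules above, taking the first figure in parentheses as the number of regular words and the figures after the asterisk as exceptions. The "coverage" measure and all names are assumptions introduced for illustration; they are not part of MALL.

```python
# rule -> (regular words, exception counts), copied from the rules above
RULES = {
    "Lh=vvc, v1v2n": (141, [2, 2, 3]),
    "Lh=vvc, v1v1n": (1007, [2]),
    "Lh=vvc, vv+^n": (605, [4, 1, 15]),
    "Lh=cc, NG":     (48, [8]),
    "Lh=cc, ^(NG)":  (949, [4]),
}

def coverage(regular, exceptions):
    """Percentage of the words matching a rule that follow its prediction."""
    total = regular + sum(exceptions)
    return 100.0 * regular / total

for rule, (regular, exceptions) in RULES.items():
    print(f"{rule:16s} {coverage(regular, exceptions):6.2f} %")
```

On these figures the v1v1n rule holds for 1007 of 1009 words, whereas the NG rule holds for 48 of 56, which suggests where further features would be worth adding to the pattern.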