The aim of the paper is to find out whether, and if so to what extent, component boundaries in Estonian compound words can be found by rules that take into account only feasible and unfeasible letter sequences. A set of such simple rules could speed up the morphological analysis of these word forms: the rather time-consuming search for a boundary in a possible compound (checking the acceptability of a possible component and then searching for it in the dictionary of stems) could be replaced by a finite state automaton that marks certain letter sequences as indicating a boundary, either without exceptions or with high probability. Such rules would also be useful for applications that do not rely on morphological or semantic analysis, such as database queries and hyphenation algorithms. For example, in the word form 'päästeüksusi' the boundary can be fixed at once, because in Estonian the sequence 'eü' can occur only at compound boundaries. The sequence 'dt', by contrast, may denote either a boundary ('raud+tee', 'med+töötaja', 'üld+tunnustatud') or a proper name of Germanic origin ('Schmidt', 'Markvardt', 'Rembrandt' etc.). As the examples show, the notion of a compound has been interpreted rather freely.
The combinatorial analysis of the letter sequences is based on the text corpus of the Institute of the Estonian Language. More emphasis has been laid on newspaper texts («Eesti Ekspress», «Hommikuleht» and «Eesti Sõnumid»), as they display more linguistic variation. All word forms occurring in the texts were sorted alphabetically; the number after the slash gives the frequency. All compound boundaries were then marked by '+'. The resulting dictionary:
The corpus used in the analysis was 14 MB in size and consisted of 1,733,000 word forms. As the ordered dictionary has approx. 200,000 entries, the average frequency of a word form in the selection is 8.5. The respective numbers for compound words were 204,000 in the texts and 78,000 in the dictionary. Consequently, compound words make up 12% of Estonian (newspaper) texts. One tenth of the compounds consist of more than two components; the respective figures were:
5 components 8
4 components 290
3 components 7426
2 components 70117
The comparison between the average frequency of compounds in the text (2.6) and simple word forms (12.5) indicates that compound words are mostly used to denote more specified notions.
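The arithmetic behind these figures can be retraced with a short sketch. The variable names are illustrative, and the inputs are the rounded values quoted in the text, so the results agree with the paper only up to rounding (the overall average comes out slightly above the quoted 8.5 because the dictionary size is given only approximately).

```python
# Figures quoted in the text (rounded by the author).
text_word_forms = 1_733_000   # running word forms in the corpus
dict_entries = 200_000        # distinct word forms (approximate)
text_compounds = 204_000      # compound word forms in the texts
dict_compounds = 78_000       # distinct compound word forms

# Distinct compounds by number of components, from the table above.
by_components = {2: 70_117, 3: 7_426, 4: 290, 5: 8}

share_in_text = text_compounds / text_word_forms
avg_freq_all = text_word_forms / dict_entries
avg_freq_compound = text_compounds / dict_compounds
avg_freq_simple = (text_word_forms - text_compounds) / (dict_entries - dict_compounds)

total_compounds = sum(by_components.values())
multi_share = sum(n for k, n in by_components.items() if k > 2) / total_compounds

# A compound with k components contains k - 1 boundaries.
boundaries = sum((k - 1) * n for k, n in by_components.items())

print(f"compound share in text: {share_in_text:.1%}")
print(f"average frequency, all / compound / simple: "
      f"{avg_freq_all:.1f} / {avg_freq_compound:.1f} / {avg_freq_simple:.1f}")
print(f"compounds with more than two components: {multi_share:.1%}")
print(f"distinct compounds: {total_compounds}, boundaries: {boundaries}")
```

The boundary total computed from the component table is 85,871, which is the number of boundaries quoted later together with the rule tables.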
It also appears that compounds spontaneously created in conversation are rare; most of the compounds used are established (i.e. not unique) in the language. For automatic analysis this means that a relatively large part of compound words can be included in the dictionary. This would leave the texts with only 2% of word forms that need the lengthy boundary-finding procedure.
By way of illustration, the following list gives the compound words whose first component is 'öö' ('night'), which is relatively neutral. The non-initial components are lemmatised. Words that can be found in the Orthological Dictionary (OD) are underlined; the dotted underline marks derivatives of OD entries and newer compounds that seem established enough to be included in the dictionary as well.
öö+aja+kiri/8, öö+baar/9,
öö+ekspress/1, öö+elu/4, öö+hakuks/1,
öö+jook/1, öö+kapp/2, öö+klubi/18,
öö+klubi+hääl/1, öö+klubi+omanik/1,
öö+kreem/1, öö+kull/6, öö+kulli+topis/1,
öö+kuninganna/2, öö+külm/2,
öö+lokaal/7, öö+lõhn/1, öö+maja/9,
öö+must/1, öö+muusik/1, öö+pikaks/1,
öö+pikk+silm/6, öö+pimedus/6,
öö+pood/1, öö+pott/1, öö+printsess/1,
öö+päev/69, öö+päeva+ringne/14,
öö+rahu/2, öö+show/1, öö+särk/7,
öö+söök/1, öö+tund/3,
öö+töö/2, öö+vaikus/1,
öö+valve/1, öö+valvur/2,
öö+video/7, öö+visiit/1, öö+õde/1.
Summarising the frequencies of the two groups we can see that the text ratio of established compounds and occasional ones is 175 to 30.
To find the letter sequences occurring exclusively on the boundaries of compound words a program was used that for every possible letter sequence found its frequency of occurrence on a compound boundary (positive result) versus its frequency elsewhere (negative result).
One-letter sequences yield only prohibitive rules: as 'j', 'h' and 'õ' never occur at the end of a word form, they never mark the end of a compound word component either.
A two-letter sequence requires three hypotheses to
be checked: the boundary can be found before the sequence (+xx),
after the sequence (xx+) or between the letters (x+x). As the
number of possibilities grows exponentially (for two letters ~3000,
for three letters ~130,000, the number of different four-letter
sequences being ~5,000,000), sequences of more than four letters
were not analysed. For two-letter sequences the results are as
follows:
+iü 0 1 i+ü 130 1 iü+ 0 1
+ja 234 11854 j+a 0 8288 ja+ 3840 13172
+jb 0 0 j+b 0 0 jb+ 0 0
The first number after each hypothesis gives the positive results and the second the negative results. For example, 'iü' occurred 130 times on a compound boundary, with the boundary running between the two letters; the only exception was a spelling error. The 'ja' sequence was found 234 times at the beginning of a component and 3840 times at the end of a component; the boundary never ran between the two letters, and in most cases 'ja' occurred elsewhere in the word form. The sequence 'jb' was not found at all.
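The counting step can be sketched as follows. This is a minimal illustration, not the program used in the study: the function name, the hypothesis encoding and the decision to treat every occurrence that does not confirm a hypothesis as a negative result are assumptions, and the source does not specify exactly how negative results were counted. Word edges are not treated as compound boundaries here. The sample entries are taken from the 'öö' list above.

```python
from collections import Counter

def count_hypotheses(entries, n=2):
    """Count positive and negative results for every n-letter hypothesis.

    entries: iterable of (marked_word, frequency), e.g. ("öö+klubi", 18).
    A hypothesis is a pair (sequence, k): a boundary is expected after the
    first k letters of the sequence (k = 0 is '+xx', k = n is 'xx+').
    """
    positive, negative = Counter(), Counter()
    for marked, freq in entries:
        parts = marked.split("+")
        word = "".join(parts)
        # Offsets in the unmarked word at which a component boundary falls.
        cuts, pos = set(), 0
        for part in parts[:-1]:
            pos += len(part)
            cuts.add(pos)
        for start in range(len(word) - n + 1):
            seq = word[start:start + n]
            for k in range(n + 1):
                target = positive if (start + k) in cuts else negative
                target[(seq, k)] += freq
    return positive, negative

entries = [("öö+klubi", 18), ("öö+klubi+omanik", 1), ("öö+kull", 6)]
positive, negative = count_hypotheses(entries)

# Hypothesis 'ö+k': boundary between the two letters.
print(positive[("ök", 1)], negative[("ök", 1)])
```

On a full dictionary the two counters give, for each hypothesis, the pair of numbers shown in the table above, subject to the counting assumptions just stated.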
There is no formal criterion for selecting the best set of rules. One possible approach is to take into account the ratio of positive and negative results, but this method has several drawbacks.
A closer look at the letter sequences that qualify as rules enables us to differentiate between several groups. The first group of rules is phonotactically based. Those rules have practically no exceptions and should be included in the final set of rules even if the letter sequence is very rare. Assigned to the same group are rules that have exceptions among loanwords and foreign words, or in paradigms of unproductive declensions, but otherwise seem to be phonotactic. The distinction between these two sets is somewhat arbitrary. For example, although at first sight the sequence 'tg' seems quite impossible within a word, the word 'röntgen' ('x-rays') is much more frequent than 't+g' on a compound boundary. At the same time, anyone can come up with an exception to the rule 't+p': 'tpruu' ('whoa!'). Still, the rule 't+p' has no exceptions in the corpus, while 't+g' can hardly be considered useful.
The rules are presented in tables. The third column contains the number of boundaries covered by the rule (the corpus number of the boundaries was 85,871). The numbers of exceptions, if any, are separated by a slash. I have also tried to arrange the rules in the order of 'goodness', but there are no formal criteria.
Rule | Exceptions | Frequency |
{A,E,I,O,U}+{Õ,Ä,Ö,Ü} (any vowel of the first set combined with any of the second set) | puänt | 3007/4 |
I+J{O,U,Õ,Ä,Ü} (any vowel except A) | | 550 |
T+P | | 533 |
K+P | | 495 |
G+P | | 341 |
D+T | Brandt, Landtag, | 334 |
V+P | | 233 |
G+H | | 196 |
ÖÖ+A | | 191 |
M+T | | 154 |
D+P | | 127 |
M+V | | 126 |
K+F | | 99 |
V+M | | 41 |
G+T | | 38 |
Ö+Ü | | 29 |
M+H | | 28 |
Ö+K | | 28 |
G+Õ | | 16 |
Ö+Õ | | 7 |
Ä+A | | 4 |
Ö+F | | 4 |
US+J | soomusjad | 534/1 |
S+H | show, ekshibitsionism, isheemia | 603/358 |
P+K | knopka, papka, the -kond-suffix: piiskopkond, the ki-enclitic: kappki, | 171/12 |
G+J | õlgjad | 74/1 |
L+R | taalri, maalri, | 141/16 |
M+J | liimjas, piimjas, | 58/5 |
K+H | khaan, khmeerid | 81/10 |
G+K | the ki-enclitic: lepingki, poegki, the -kond-suffix: ringkond, aegkond, | 230/228 |
T+G | röntgen | 6/23 |
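A few of the first-group rules can be applied with a single compiled pattern, which plays roughly the role of the finite state automaton mentioned in the introduction. The sketch below is an illustration under assumptions: the rule encoding as (left context, right context) pairs and the function name are not from the source, only four rules are included, and exception lists are not handled.

```python
import re

# A few first-group rules as (left context, right context) character classes;
# the boundary falls between the two contexts.
RULES = [
    ("[aeiou]", "[õäöü]"),  # {A,E,I,O,U}+{Õ,Ä,Ö,Ü}
    ("t", "p"),             # T+P
    ("k", "p"),             # K+P
    ("d", "t"),             # D+T (exceptions such as 'Brandt' are not handled)
]

# Zero-width pattern that matches every position licensed by some rule.
BOUNDARY = re.compile(
    "|".join(f"(?<={left})(?={right})" for left, right in RULES),
    re.IGNORECASE,
)

def mark_boundaries(word):
    """Insert '+' at every position where one of the rules applies."""
    return BOUNDARY.sub("+", word)

print(mark_boundaries("päästeüksusi"))
print(mark_boundaries("raudtee"))
print(mark_boundaries("Rembrandt"))
```

The first two words are split as 'pääste+üksusi' and 'raud+tee'. The proper name is wrongly split as 'Rembrand+t', which is why rules with known exceptions need either an exception list or a wider context.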
The second group contains rules that are the result of a combination
of several circumstances such as the typical structure of a word
stem in Estonian, the occurrence of certain letters in case and
person suffixes, restrictions on the occurrence of 'Õ',
'Ä', 'Ö' and 'Ü' in non-initial syllables, frequency
of word forms etc. E.g. the rule 'EE+J' covers only two extremely
frequent word forms 'seejuures' (202) and 'seejärel' (317).
Rule | Exceptions | Frequency |
+KÕ | kaksikõde | 1044/5 |
+PÕ | tippõiguskaitsja | 638/1 |
+TÕ | importõlu, importõunad | 1129/5 |
+VÕ | administratiivõigus, sugestiivõpe, korvõieline | 5386/5 |
+PÄ | trumpäss, tippärimees | 5366/8 |
+VÄ | reväär, skväär | 3026/8 |
+MÄ | paremäärmus, proper names ending in -mäe | 2926/4 |
+NÄ{D, G, H, I, L, O} (all except the suffix -när/-näär: aktsionär, etc.) | | ~1000 |
+JÕ (partial overlap with rule I+JÕ) | | 876 |
EE+J | | 571 |
+HÄ | | 496 |
OO+A | | 355 |
EA+O | | 145 |
I+UU | | 142 |
ÄE+O | | 109 |
+HÕ | | 67 |
+VÖ | | 50 |
+NÕ | vaseliinõli | 916/1 |
+KÄ | vasakäärmus | 997/5 |
+PÜ | kupüür, tippüritus | 345/14 |
+PÖ | epopöa, pompöösne | 126/17 |
+HÜ | formaldehüüd | 86/3 |
The third group is mainly the result of an analysis of the word form frequencies. Many of the words carry political or journalistic connotations ('OND+ER': koonderakond and 'ISA+MAA', both names of political parties; 'NE+MAA': Venemaa 'Russia'; 'AJA+KIR': ajakirjandus/ajakirjanik 'journal, journalist, journalistic'). Some of the rules added here could, in principle, belong to one of the previous groups, but their context has been extended to reduce the number of exceptions ('+SÜS' and '+SÜN').
Rule | Exceptions | Frequency |
+SÜ{S, N} | | 584 |
I+RÄ | | 439 |
I+TÖ | | 427 |
{A, E, I, O}+TÜ | atatürk, trotüül | 317/9 |
LE+A | asalea | 314/2 |
L+PO | kolposkoopia | 305/5 |
O+AJ | | 286 |
GE+P | Taagepera | 266 |
E+AAST | | 204 |
TS+P | | 182 |
A+EE | | 127/1 |
ÄE+{A, P, R} | | 147 |
IU+P | | 146 |
UP+M | | 133 |
A+UU | | 31/2 |
The following rules of the third group are presented in a list:
{A,E,I,U}+PEA
+AMETI
+BÜRO
+FILM
+FIR
+GRUP
+KIRJ
+MINISTE
+POOL
+PROG
+RÄÄ
+RÄH
+SUGU
+TÖÖ
A+LEHE
A+LEHT
AA+IL
AJA+KIR
AJA+LOO
AJA+LU
AKTSIA+SE
ARU+SA
AS+AEG
ASJA+O
AU+HI
BI+EL
BI+KAA
DU+MAA
EEL+ARV
EES+KUJ
EESTI+M
GI+KO
HE+KO
I+MEES
ISA+MAA
ISE+EN
JA+MAA
JA+PI
KÄES+O
KOOS+S
KSA+MAA
LE+OL
LGE+O
LIS+MAA
LU+KO
MAA+VA
MITTE+
NE+MAA
OMA+E
OMA+KA
OND+ER
ÖÖ+KO
ÕU+KO
PEA+A
PEA+DI
S+TÖÖ
SE+LO
SE+PR
SI+ALG
ST+KI
ST+KU
TE+VAH
TE+VAA
TI+MAA
TO+JU
US+MAA
US+VA
The following rules could be used if the rules are ordered or when the logical operator 'not' can be used:
S+ÕIGU > +SÕ
NIM+Õ > +MÕ
AAL+Õ > +LÕ
EEL+Õ > +LÕ
OOL+Õ > +LÕ
M+ÜH > +MÜ
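One way to realise such ordering is sketched below. It reads 'X > Y' as "rule X takes precedence over rule Y", which is an interpretation rather than something the notation states explicitly; the function name and the span-claiming mechanism are assumptions, and the two test words are illustrative examples that do not come from the corpus lists above.

```python
# Ordered rules as (left context, right context); earlier rules win.
ORDERED_RULES = [
    ("s", "õigu"),  # S+ÕIGU
    ("", "sõ"),     # +SÕ, applied only where S+ÕIGU has not fired
]

def mark_ordered(word, rules=ORDERED_RULES):
    """Apply rules in order; a rule is blocked where its letters overlap
    the letters already covered by an earlier rule."""
    lower = word.lower()
    claimed = [False] * len(lower)  # letters covered by a rule that fired
    cuts = set()
    for left, right in rules:
        pattern = left + right
        start = lower.find(pattern)
        while start != -1:
            end = start + len(pattern)
            if not any(claimed[start:end]):
                cut = start + len(left)
                if cut > 0:  # the start of the word is not a compound boundary
                    cuts.add(cut)
                for i in range(start, end):
                    claimed[i] = True
            start = lower.find(pattern, start + 1)
    return "".join(("+" if i in cuts else "") + ch for i, ch in enumerate(word))

print(mark_ordered("haldusõigus"))
print(mark_ordered("kodusõda"))
```

In 'haldusõigus' the specific rule fires first and blocks '+SÕ', giving 'haldus+õigus'; in 'kodusõda' only the general rule applies, giving 'kodu+sõda'. The same effect could be obtained with a negated context ("+SÕ unless followed by IGU").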
A similar analysis has been carried out on different material (programs and implementation by Indrek Kiissel). The analysis was applied to the compound words contained in the Orthological Dictionary. There are two main reasons for the differences in the resulting rules: first, OD is poor in proper names, neologisms, foreign words and terminology, and, second, OD gives no information on word frequencies.
According to OD, there are 55 two-letter combinations
(x+x) that without exceptions mark the compound boundary:
I+Õ 234
D+T 228
A+Õ 183
E+Õ 162
D+P 150
U+Õ 110
E+Ü 96
I+Ü 96
A+Ü 83
V+P 81
V+V 52
U+Ü 50
A+Ä 33
E+Ä 32
A+Ö 28
D+B 28
I+Ä 22
P+H 20
T+D 17
K+B 16
E+Ö 13
O+Õ 10
Ä+A 10
Ö+Õ 9
D+D 7
P+F 6
K+G 5
V+G 5
V+F 5
F+K 4
H+P 4
U+Ö 4
Ö+Ü 4
S+Ž 3
D+Ž 2
G+F 2
O+Ö 2
P+B 2
P+G 2
Š+T 2
V+B 2
G+G 1
H+S 1
K+Š 1
M+Z 1
O+Ä 1
P+Š 1
S+Š 1
Š+K 1
Š+P 1
Š+R 1
Ž+F 1
Ž+J 1
Ž+S 1
Ä+Ž 1
52 combinations had more positive results than negative ones. For every rule, the number of positive results, the number of negative results and the ratio of negative to positive results are given.
K+P 229 2 0.009
T+P 274 3 0.011
G+P 115 2 0.017
V+M 55 1 0.018
S+D 50 1 0.02
S+H 406 10 0.025
M+H 40 1 0.025
T+B 22 1 0.045
N+P 88 4 0.045
S+R 542 25 0.046
T+H 85 4 0.047
G+Õ 20 1 0.05
M+T 83 5 0.06
G+T 81 5 0.062
M+V 92 6 0.065
L+R 123 8 0.065
D+K 250 18 0.072
K+F 12 1 0.083
P+K 83 7 0.084
G+H 42 4 0.095
D+H 61 6 0.098
K+H 96 11 0.115
U+Ä 17 2 0.118
M+K 176 23 0.131
N+M 52 7 0.135
V+K 128 18 0.141
D+F 7 1 0.143
G+K 144 21 0.146
M+R 31 5 0.161
O+Ü 16 3 0.188
N+B 5 1 0.200
B+P 5 1 0.200
T+F 18 4 0.222
D+Õ 17 4 0.235
S+G 41 10 0.244
P+D 4 1 0.25
U+O 131 33 0.252
B+V 7 2 0.286
V+T 47 14 0.298
B+K 19 6 0.316
O+Z 3 1 0.333
D+G 3 1 0.333
T+V 185 71 0.384
S+V 1097 478 0.436
S+B 57 25 0.439
I+P 1705 785 0.460
V+H 36 17 0.472
L+H 109 52 0.477
W+P 4 2 0.5
M+G 2 1 0.5
K+D 10 5 0.5
G+Ä 2 1 0.5
The worst rules were:
R+I 21 11035 525.4
V+E 7 3689 527.0
P+Õ 3 1588 529.3
N+D 12 6925 577.0
M+E 10 6028 602.8
M+Ä 2 1233 616.5
V+U 2 1253 626.5
P+Ä 2 1328 664.0
G+U 4 3037 759.2
N+I 8 6083 760.3
B+I 2 1804 902.0
P+I 5 4528 905.6
K+U 7 6524 932.0
P+O 3 2800 933.3
Ü+H 1 1082 1082.0
H+K 1 1247 1247.0
T+U 9 11907 1323.0
N+E 14 20132 1438.0
M+I 8 12024 1503.0
P+U 2 3392 1696.0
B+E 1 1963 1963.0
Ü+L 1 1986 1986.0
Ä+Ä 1 1999 1999.0
N+U 1 2054 2054.0
V+Ä 1 2123 2123.0
V+I 2 5333 2666.5
M+U 1 3104 3104.0
H+A 1 4187 4187.0
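The ordering used in the two lists above can be reproduced by sorting on the ratio of negative to positive results. The sketch below is illustrative only: the function name is an assumption, and the counts are a handful of rows copied from the tables. As noted earlier, this ratio is only one possible criterion and has its drawbacks.

```python
def rank_rules(counts):
    """Sort rules by the negative/positive ratio (smaller is better).

    counts: mapping rule -> (positive, negative). Rules with no positive
    results are skipped because the ratio is undefined for them.
    """
    scored = [
        (rule, pos, neg, neg / pos)
        for rule, (pos, neg) in counts.items()
        if pos > 0
    ]
    return sorted(scored, key=lambda row: row[3])

# A few rows copied from the tables above.
counts = {
    "S+V": (1097, 478),
    "R+I": (21, 11035),
    "K+P": (229, 2),
    "H+A": (1, 4187),
    "T+P": (274, 3),
}

ranked = rank_rules(counts)
# Rules with more positive than negative results, as in the list of 52.
good = [rule for rule, pos, neg, ratio in ranked if pos > neg]

for rule, pos, neg, ratio in ranked:
    print(f"{rule} {pos} {neg} {ratio:.3f}")
```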
For the purpose of comparison we will also present
two lists of four-letter sequences conforming to the pattern xx+xx.
The first of them presents such letter combinations that occurred
exclusively on compound boundaries 50 times or more:
IS+VÄ 152
US+MA 133
IS+VI 101
TE+VA 91
LE+KA 88
US+VI 82
US+VÄ 82
SE+VA 75
US+KI 73
US+TÖ 71
UU+VI 68
NA+KO 66
IS+PU 60
ME+KA 59
LA+KO 56
IS+AE 55
SE+KI 55
US+PI 55
US+RA 54
US+RI 53
US+PA 52
JA+VA 51
UD+TE 51
US+AS 51
US+PU 51
US+SÄ 51
LI+VA 50
SE+KO 50
The second list contains sequences with exceptions,
but no more than 1 exception to 10 recognised boundaries:
US+VA 170 1 0.006
SE+KA 100 1 0.010
IS+VA 93 1 0.011
JA+KO 58 1 0.017
TE+KA 56 1 0.018
SI+VA 54 1 0.019
SI+PU 54 1 0.019
DE+VA 54 1 0.019
VA+KA 43 1 0.023
US+ME 43 1 0.023
JA+TE 43 1 0.023
IS+PI 42 1 0.024
KU+VA 40 1 0.025
SI+TE 38 1 0.026
NI+TE 39 1 0.026
SA+VA 37 1 0.027
RI+KI 37 1 0.027
US+KE 36 1 0.028
LI+PA 36 1 0.028
AL+MA 34 1 0.029
US+TO 35 1 0.029
NI+LA 35 1 0.029
ER+KA 35 1 0.029
TI+KO 31 1 0.032
DI+MA 31 1 0.032
RI+LE 30 1 0.033
US+SA 29 1 0.034
SE+VI 29 1 0.034
SE+RA 28 1 0.036
SI+PI 27 1 0.037
NA+SA 27 1 0.037
MA+KI 27 1 0.037
LA+PA 27 1 0.037
AS+PU 27 1 0.037
IS+KI 76 3 0.039
TE+SA 25 1 0.040
KU+KA 25 1 0.040
LE+HA 24 1 0.042
HA+VA 24 1 0.042
NI+PU 48 2 0.042
HA+KU 23 1 0.043
US+PE 22 1 0.045
TE+TE 22 1 0.045
SI+ME 21 1 0.048
SE+TO 21 1 0.048
HE+LA 21 1 0.048
BA+MA 21 1 0.048
NI+SI 20 1 0.050
ME+KE 20 1 0.050
ER+KO 20 1 0.050
EL+AR 20 1 0.050
BI+VA 20 1 0.050
US+IN 19 1 0.053
NA+LE 19 1 0.053
NA+KE 18 1 0.056
AS+LI 18 1 0.056
SI+PA 36 2 0.056
US+SE 90 5 0.056
DI+VA 35 2 0.057
LI+RI 17 1 0.059
KI+KA 17 1 0.059
GU+PI 17 1 0.059
DI+TU 17 1 0.059
RA+MA 33 2 0.061
SI+SI 16 1 0.063
LU+TE 16 1 0.063
KA+ME 16 1 0.063
DE+PA 16 1 0.063
DU+KA 32 2 0.063
SI+KL 15 1 0.067
ON+KA 15 1 0.067
NI+TO 15 1 0.067
NA+VE 15 1 0.067
KU+TU 15 1 0.067
DU+VA 15 1 0.067
LU+PI 14 1 0.071
KE+LA 14 1 0.071
IN+HA 14 1 0.071
HI+LA 14 1 0.071
AE+KA 14 1 0.071
AL+KA 28 2 0.071
LU+MA 56 4 0.071
TU+KU 13 1 0.077
TU+VA 13 1 0.077
RI+AR 13 1 0.077
OR+SO 13 1 0.077
JA+AR 13 1 0.077
GI+ME 13 1 0.077
ER+PI 13 1 0.077
ES+PI 13 1 0.077
IS+PA 39 3 0.077
LI+PI 38 3 0.079
IS+MA 114 9 0.079
US+KA 139 11 0.079
SE+AS 12 1 0.083
RE+VY 12 1 0.083
GA+TA 12 1 0.083
GA+VI 12 1 0.083
EL+SE 12 1 0.083
BE+KA 12 1 0.083
BI+KA 12 1 0.083
DA+VA 23 2 0.087
AS+PI 23 2 0.087
TI+PE 11 1 0.091
TI+PI 11 1 0.091
OO+TA 11 1 0.091
MA+VO 11 1 0.091
KS+PA 11 1 0.091
IK+VA 11 1 0.091
IS+SO 11 1 0.091
HU+LA 11 1 0.091
EO+TE 11 1 0.091
NU+KA 22 2 0.091
RA+PU 22 2 0.091
NA+PA 22 2 0.091
NA+PO 22 2 0.091
NA+MA 43 4 0.093
VA+TE 32 3 0.094
US+LU 21 2 0.095
GI+LA 21 2 0.095
RI+PA 42 4 0.095
RI+LA 52 5 0.096
RI+PU 41 4 0.098
The rules proposed here cover approx. 45% of the compound boundaries. Adding some further rules, without a noticeable slowdown of the process, would probably raise the coverage to 60% of all compound boundaries.