Ch. 8. Lemmatisation of manuscript text

Version 1.0 (20 May 2003)

 

8.1 Introduction
8.2 The attribute lemma
8.3 The attribute pos
8.4 General problems
8.5 Word classes

 

 

8.1 Introduction

In ch. 2.3 we suggested that the unit word, <w>, should be marked in the transcription of manuscript text, so that abbreviations and their expansions can be treated consistently. The element <w> can also carry information on the lemma and a grammatical analysis for every word in the manuscript text. This information is best given as the values of the two attributes lemma and pos. This chapter treats the basic principles for the lemmatization of manuscript text. It is important to note, however, that this presentation should be seen as a sketch rather than as definitive guidelines. The elements and attributes that will be discussed are:

Element   Contents
<w>       delimits a grammatical word.
lemma     gives the lexical form of the grammatical word.
pos       gives the morphosyntactic analysis of the grammatical word.

It is essential that the lemmatization of Medieval Nordic manuscript text adheres to the principles developed for handling large corpora in linguistic research. We therefore recommend that the guidelines provided by EAGLES (1996) are used as a starting point. In some respects, however, the principles presented in the following diverge from those suggested by EAGLES, as the Medieval Nordic languages present particular problems for the encoder.

The model provided here is adjusted to Old Norse-Icelandic grammar. For Old Swedish and Old Danish texts we can expect a radical levelling in the grammatical system, e.g. in the nominal and verbal inflections. The model provided here will therefore overgenerate when applied to Swedish and Danish texts from the period.

 

 

8.2 The attribute lemma

Within the element <w> it is possible to provide a variety of information for every graphic word. With the attribute lemma we can, for example, record the lexical form of the graphic word, which makes it possible to search for all graphic and grammatical forms of that word. The lemma should preferably be identical to the headword found in the lexicon. For Old Norse-Icelandic texts we suggest that the word-list produced by the Arnamagnæan Commission's Ordbog over det norrøne prosasprog (ONP) at the University of Copenhagen is used to create the lemma base. The attribute would then be marked up as follows:

<w lemma="hafa">ha&fins;i</w>

In ch. 2.3 the use of <w> for the markup of graphic words and information concerning their description is treated. In the example used here the graphic word contains an Old Norse-Icelandic character which is not used in modern script. In a transcription of manuscript text this character is given with an entity name "&fins;" as described in ch. 5, but in the lemmatized form the character is normalized according to the principles of the ONP. The resulting structure will look as follows:

<w lemma="hafa">
<facs>ha&fins;i</facs>
<dipl>ha&fins;i</dipl>
<norm>hafi</norm>
</w>

In the following example we can see how a more complicated form with abbreviations and expansions can be presented in the elements <facs> and <dipl> respectively, included in the element <w>, and thereby all be related to the attribute lemma. 

<w lemma="koma">
<facs>co&bar;</facs>
<dipl>co<expan>m</expan></dipl>
<norm>kom</norm>
</w>

In cases where a graphic word is included partially or completely in the element <unclear> this can be marked within the element <w> and be related to the attribute lemma.

<w lemma="sv&aacute;">
<facs><unclear reason="faded">s&ra;</unclear></facs>
<dipl><unclear>s<expan>ua</expan></unclear></dipl>
<norm>sv&aacute;</norm>
</w>

Text included within the element <supplied> is not lemmatized. The following example shows how a character, word or phrase that has been supplied is marked with the element <w>, but without markup of the lemma as the text is not transcribed from the manuscript text.

<w>
<facs><supplied reason="illegible" resp="KGJ">lei</supplied>kti</facs>
<dipl><supplied reason="illegible" resp="KGJ">lei</supplied>kti</dipl>
<norm>leikti</norm>
</w>

This means that the forms that are not marked will not be included in the searchable database under the category lemma. We thereby avoid contamination between forms that come from the manuscript text and forms that have been supplied by a transcriber or encoder of the text. A basic principle is that the lemmatized text should come from the manuscript text.
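As a minimal sketch of how the attribute lemma might support searching, the following Python fragment builds an index from lemma to normalised forms and skips every <w> that carries no lemma, as is the case for supplied text. The wrapper element <text>, the function name build_lemma_index and the escaping of entity references are assumptions made for this illustration only; they are not part of the encoding scheme.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample built from the examples above; the wrapper element
# <text> is an assumption made only so that the fragment has a single root.
SAMPLE = """<text>
<w lemma="hafa"><facs>ha&fins;i</facs><dipl>ha&fins;i</dipl><norm>hafi</norm></w>
<w lemma="koma"><facs>co&bar;</facs><dipl>co<expan>m</expan></dipl><norm>kom</norm></w>
<w><facs><supplied reason="illegible" resp="KGJ">lei</supplied>kti</facs><norm>leikti</norm></w>
</text>"""


def build_lemma_index(xml_text):
    """Map each lemma to the normalised forms of the words that carry it."""
    # The character entities (&fins;, &bar;, ...) are declared elsewhere,
    # so they are escaped here to let the parser treat them as plain text.
    escaped = re.sub(r"&(\w+);", r"&amp;\1;", xml_text)
    root = ET.fromstring(escaped)
    index = defaultdict(list)
    for word in root.iter("w"):
        lemma = word.get("lemma")
        if lemma is None:
            continue  # supplied text carries no lemma and is not indexed
        norm = word.find("norm")
        if norm is None:
            continue  # nothing to index without a normalised form
        index[lemma].append("".join(norm.itertext()).strip())
    return dict(index)


print(build_lemma_index(SAMPLE))
```

The supplied word leikti does not appear in the resulting index, which mirrors the principle that only text transcribed from the manuscript is lemmatized.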

 

 

8.3 The attribute pos

With the attribute pos we can add information about the morphosyntactic form of the individual representation of a lemma, i.e. the form provided in the element <facs> is described morphosyntactically. To be able to make this analysis it is necessary to create a model for the encoding that describes all the possible morphological forms of each lemma. In the following this description is tentatively built from the basic categories with sub-categories to provide a full description of the Old Norse-Icelandic grammar.

For the noun hestr 'horse' in dative plural this can be described as follows:

<w lemma="hestr" pos="NCMPDI">
<facs>hestu&bar;</facs>
<dipl>hestu<expan>m</expan></dipl>
<norm>hestum</norm>
</w>

The lemmas can primarily be divided into word classes. The first character in the value given for the attribute pos above represents the word class Nouns (N). The following character defines the noun as a nomen appellativum or Common Noun (C). The next characters give information about gender (masculine, M), number (plural, P), case (dative, D) and species (indefinite, I). For every word class the categories are given in a fixed order. In cases where a category is not in use, the position is marked with a # (cf. ch. 8.4.2 below).
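The positional reading of a noun tag can be sketched in Python as follows. The sixth character, I, is read here as indefinite species, following the noun scheme in ch. 8.5.1. The names NOUN_FIELDS and decode_noun_tag and the English value labels are assumptions of this illustration, and the single code Obl for oblique case is deliberately left out to keep the example to one character per field.

```python
# Field order for nouns as described above; the value labels are
# illustrative names chosen for this sketch, not part of the scheme.
NOUN_FIELDS = [
    ("word class", {"N": "noun"}),
    ("subcategory", {"C": "common", "P": "proper"}),
    ("gender", {"M": "masculine", "F": "feminine", "N": "neuter"}),
    ("number", {"S": "singular", "P": "plural"}),
    ("case", {"N": "nominative", "G": "genitive",
              "D": "dative", "A": "accusative"}),
    ("species", {"I": "indefinite", "D": "definite"}),
]


def decode_noun_tag(tag):
    """Decode a six-character noun tag position by position."""
    if len(tag) != len(NOUN_FIELDS):
        raise ValueError(f"expected {len(NOUN_FIELDS)} characters, got {tag!r}")
    decoded = {}
    for char, (name, values) in zip(tag, NOUN_FIELDS):
        if char == "#":
            decoded[name] = None  # category not in use
        elif char in values:
            decoded[name] = values[char]
        else:
            raise ValueError(f"unexpected {char!r} in field {name!r}")
    return decoded


print(decode_noun_tag("NCMPDI"))
```

Because the same letter means different things in different positions (N is both noun and nominative, D both dative and definite), the tag can only be read by position, which is why the order of the fields has to be fixed for every word class.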

 

 

8.4 General problems

8.4.1 Form variation: internal and external homography

The manuscript texts of Medieval Nordic display a wide range of graphematic and orthographic variation, the Old Norse-Icelandic texts to a higher degree than the East Nordic texts. Further, the Medieval languages of the Nordic countries are highly inflected, which causes problems as soon as we try to relate the graphic forms, which we call graphic words, to the lemmatic forms, i.e. the grammatical forms available for the analysed lemma. In the following we propose a model for the lemmatization and analysis into lemmatic forms.

The variation we find in the manuscript text, what we can call form variation, is an initial problem in the first phase of the lemmatization. We need to be able to identify all possible graphic forms that can represent a lemma in the manuscript text. A good example is some of the graphic variation for the pronoun hann 'he' in different cases.

Form            Lemma
hann            hann
han&bar;        hann
h&bar;          hann
h&bar;n         hann
ha&scap;        hann
hans            hann
han&stall;      hann
h&bar;s         hann
h&bar;&stall;   hann
honum           hann
honom           hann
h&bar;m         hann
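One way to handle form variation in the first phase of the lemmatization is a lookup from graphic form to candidate lemmas. The following Python sketch uses the forms of hann listed above, with abbreviation marks kept as entity names. The dictionary layout and the function name candidate_lemmas are assumptions of this illustration.

```python
# Graphic forms of the pronoun hann as listed in the table above.
FORMS_OF_HANN = [
    "hann", "han&bar;", "h&bar;", "h&bar;n", "ha&scap;", "hans",
    "han&stall;", "h&bar;s", "h&bar;&stall;", "honum", "honom", "h&bar;m",
]

# A form maps to a set of lemmas, since one graphic form may in principle
# represent more than one lemma.
FORM_TO_LEMMAS = {}
for form in FORMS_OF_HANN:
    FORM_TO_LEMMAS.setdefault(form, set()).add("hann")


def candidate_lemmas(graphic_form):
    """Return the known lemmas for a graphic form, or an empty list."""
    return sorted(FORM_TO_LEMMAS.get(graphic_form, ()))


print(candidate_lemmas("h&bar;m"))
print(candidate_lemmas("hestr"))
```

A form that is missing from the lookup returns an empty list, which signals that it has to be lemmatized manually and added to the list.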

The inflectional diversity of the Medieval Nordic languages produces many cases of homography between lemmatic forms of the same lemma, which we call internal homography. In the following example, the nominative singular of the feminine noun hetja has the same form as the genitive plural (NCFSNI | NCFPGI), the oblique cases of the singular share one form (NCFSGI | NCFSDI | NCFSAI), and the nominative and accusative plural share another (NCFPNI | NCFPAI). The homography is marked with | between the tags for each lemmatic form:

Form     Lemma   Tag
hetja    hetja   NCFSNI | NCFPGI
hetju    hetja   NCFSGI | NCFSDI | NCFSAI
hetjur   hetja   NCFPNI | NCFPAI

In the initial markup of lemmatic forms it is suggested that all possible tags are given in the attribute pos. This is, however, not satisfactory if we wish to have a consistent markup of the morphosyntactic analysis. In cases where the morphosyntactic analysis can be made consistently this should of course be done.
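The | notation lends itself to simple processing. The following Python sketch, in which the function names are our own assumptions, splits a pos value into its alternative tags and reports whether a manual disambiguation is still needed.

```python
def split_alternatives(pos_value):
    """Return the individual tags of a pos value that uses | as separator."""
    return [tag.strip() for tag in pos_value.split("|") if tag.strip()]


def is_ambiguous(pos_value):
    """True if more than one lemmatic form is still possible."""
    return len(split_alternatives(pos_value)) > 1


# Values taken from the table for hetja above.
for form, pos in [("hetja", "NCFSNI | NCFPGI"), ("hetjur", "NCFPNI | NCFPAI")]:
    print(form, split_alternatives(pos), is_ambiguous(pos))
```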

Further, we must take into account the possibility that graphic forms of different lemmas are homographic on the level of lemmatic form, which we call external homography. An example of this is the neuter noun vár 'spring' (NCNSNI) and the possessive determinative várr 'our' in feminine singular nominative and in neuter plural nominative and accusative (DPosFSN | DPosNPN | DPosNPA).

Form   Lemma   Tag
vár    vár     NCNSNI
vár    várr    DPosFSN | DPosNPN | DPosNPA

The graphic forms can also be homographic for different lemmas, as in the feminine noun þýða 'friendship' in nominative singular and the verb þýða 'interpret' in the infinitive.

Form   Lemma   Tag
þýða   þýða    NCFSNWI | VPresInfA####Wk

In these cases the morphosyntactic analysis has to be made manually. An alternative is to give all possible lemmatic forms in the attribute pos as in the above example.

 

8.4.2 Zero values: # and @

In the last example above, several positions are marked with the "#" sign. This is to indicate that some of the possible morphological categories are not relevant for this particular word. Thus, for an infinitive like þýða the positions Gender, Number, Case and Species are not relevant, since no infinitives are inflected for these categories. Cf. ch. 8.5.8.2 below. For simplicity, we suggest that "#" is read as "irrelevant".

In other cases, a word is inflected for a certain category, but the encoder is not able to specify a value. This may be the case with some proper nouns, for which no gender can be specified. This is a different type of "zero" value, and we therefore suggest indicating these positions with the "@" sign. For simplicity, we suggest that "@" is read as "unknown".
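The difference between the two zero values can be made explicit when a tag is processed. The following Python sketch counts how many positions in a tag are irrelevant and how many are unknown. It relies on the fact that the two signs are not used for anything else in the scheme. The function name and the second example tag, NP@SNI, are assumptions made for illustration.

```python
# Reading of the two zero values as suggested above.
ZERO_VALUES = {"#": "irrelevant", "@": "unknown"}


def zero_value_summary(tag):
    """Count the positions of a tag that carry a zero value."""
    return {label: tag.count(sign) for sign, label in ZERO_VALUES.items()}


# The infinitive from the example above: four irrelevant positions.
print(zero_value_summary("VPresInfA####Wk"))

# Hypothetical proper noun whose gender the encoder cannot specify.
print(zero_value_summary("NP@SNI"))
```

A search for words that still need attention from the encoder would then look for tags with a non-zero count of unknown positions, while irrelevant positions need no further work.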

 

 

8.5 Word classes


8.5.1 Nouns (N)

Nouns can be divided into two categories, appellatives and propria. They are all marked with an N for noun. The second field distinguishes the two categories: C for appellatives (Common Nouns) and P for propria (Proper Nouns). In marginal cases, it may be difficult to decide whether a noun is a common or a proper noun; in that case this field may be marked with an @.

Nouns should also be marked for gender. In the Medieval Nordic languages we define three gender categories, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. Some proper nouns are indeterminate with respect to gender and should be marked with a #.

There are two categories for number. Singular and plural should be marked in the fourth field as S and P respectively. Most personal names are not inflected for number and should be marked with a #.

There are four categories for case in the Medieval Nordic languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively. Due to the high degree of internal homography (syncretism) in the declension of weak nouns, we suggest that it should be possible to refer to all oblique cases (i.e. genitive, dative and accusative) with a single code, Obl.

A noun occurs either in an indefinite or a definite form, e.g. "hestr" or "hestrinn". This is marked in the sixth field as I and D respectively. Of personal names and place-names, only the latter can occur in definite form.

Example: Encoding of the noun "ymr" in the phrase "þá heyrðu þeir ym mikinn ok gný":

<w lemma="ymr" pos="NCMSAI">ym</w>

Noun   Subcategory   Gender       Number    Case              Species
N      C, P, @       M, F, N, #   S, P, #   N, G, D, A, Obl   I, D

 

 


8.5.2 Adjectives (AJ)

Adjectives (AJ) are inflected for grade in three levels, positive, comparative and superlative, which are marked in the second field as P, C and S respectively.

Adjectives should also be marked for gender. In the Medieval Nordic languages there are three gender categories, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

There are two categories for number. Singular and plural should be marked in the fourth field as S and P respectively.

There are four categories for case in the Medieval Nordic languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively. Due to the high degree of internal homography (syncretism) in the declension of weak adjectives, we suggest that it should be possible to refer to all oblique cases (i.e. genitive, dative and accusative) with a single code, Obl.

Finally, an adjective occurs either in an indefinite (strong) form, e.g. "hvítr hestr", or a definite (weak) form, "inn hvíti hestr". This is shown in the sixth field as I and D respectively.

Example: Encoding of the adjective "langr" in the phrase "seint er um langan veg at spyrja tíðenda":

<w lemma="langr" pos="AJPMSAI">langan</w>

Adjective   Grade     Gender       Number   Case              Species
AJ          P, C, S   M, F, N, #   S, P     N, G, D, A, Obl   I, D

Note that in the comparative, adjectives only have weak (definite) inflection.

 


8.5.3 Pronouns (P)

In recent grammars the traditional category pronoun is usually divided into pronouns in a strict sense (words replacing a noun) and determinatives (adjunct words), and that is our recommendation as well, cf. ch. 8.5.3 and 8.5.4. However, in some projects (e.g. the Old Norwegian lemmatised corpus) there is only a single category pronoun, and we have therefore added in ch. 8.5.5 a combined category, pronouns and determiners (cf. EAGLES, major categories).

Although pronouns in the strict sense of "words replacing a noun" form a smaller category than the traditional one, there are nonetheless three distinct sub-categories. In the following these are treated separately to provide an overview. All pronouns are marked with P, followed by a field for subcategory: Per for personal pronouns, Int for interrogative pronouns and Ind for indefinite pronouns.

 

8.5.3.1 Personal pronouns (PPer)

The personal pronouns (PPer) are declined in first, second and third person. This is marked in the third field as 1, 2 and 3 respectively.

The inflection for gender varies among the personal pronouns, but we can generally account for three categories, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively. In some categories there is no grammatical marking of gender (see the list of tags below). In these cases the fourth field has a #.

Personal pronouns in the first and second person have three categories for number, singular, plural and dual, which are marked in the fifth field as S, P and D respectively. Personal pronouns in the third person have no inflection for number. The fifth field in this case has a #.

The personal pronouns are inflected in four cases, nominative, genitive, dative and accusative, which are marked in the sixth and final field as N, G, D and A respectively.

Example: Encoding of the personal pronoun "vit" in the phrase "vit erum fegnir":

<w lemma="vit" pos="PPer1#DN">vit</w>

Pronoun   Subcategory   Person    Gender       Number       Case
P         Per           1, 2, 3   M, F, N, #   S, D, P, #   N, G, D, A

 

8.5.3.2 Interrogative pronouns (PInt)

The interrogative pronouns (PInt) have no inflection for person. This field should therefore be marked with a #. They are declined in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Interrogative pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Finally, the interrogative pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth and final field as N, G, D and A respectively.

Example: Encoding of the interrogative pronoun "hverr" in the phrase "Frigg spurði hverr sá væri með ásum":

<w lemma="hverr" pos="PInt#MSN">hverr</w>

Pronoun   Subcategory   Person   Gender       Number   Case
P         Int           #        M, F, N, #   S, P     N, G, D, A

 

8.5.3.3 Indefinite pronouns (PInd)

The indefinite pronouns (PInd) have no inflection for person. This field should therefore be marked with the # sign. They are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Indefinite pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Finally, the indefinite pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth field as N, G, D and A respectively.

Example: Encoding of the indefinite pronoun "einnhverr" in the phrase "vill hann taka til at þreyta drykkju við einhvern mann":

<w lemma="einnhverr" pos="PInd#MSA">einhvern</w>

Pronoun   Subcategory   Person   Gender       Number       Case
P         Ind           #        M, F, N, #   S, D, P, #   N, G, D, A
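Since the subcategory of a pronoun is given with three letters, a pronoun tag cannot be read one character per field. The following Python sketch uses a regular expression to split the tags of all three sub-categories into their fields. The names PRONOUN_TAG and parse_pronoun_tag are assumptions of this illustration, and the pattern accepts the union of the values listed in the three tables above rather than checking which values are possible for each sub-category.

```python
import re

# One named group per field, in the order given in the tables above.
PRONOUN_TAG = re.compile(
    r"P(?P<subcategory>Per|Int|Ind)"
    r"(?P<person>[123#])"
    r"(?P<gender>[MFN#])"
    r"(?P<number>[SDP#])"
    r"(?P<case>[NGDA])"
)


def parse_pronoun_tag(tag):
    """Split a pronoun tag into its fields or raise ValueError."""
    match = PRONOUN_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a well-formed pronoun tag: {tag!r}")
    return match.groupdict()


# Tags taken from the three examples in this section.
for tag in ("PPer1#DN", "PInt#MSN", "PInd#MSA"):
    print(tag, parse_pronoun_tag(tag))
```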

 


8.5.4 Determinatives (D)

There are two sub-categories for the determinatives. In the following these are treated separately to provide an overview. All determinatives are marked with D in the first field. In the second field the sub-category is given as described below.

 

8.5.4.1 Possessive determinatives (DPos)

The possessive determinatives (DPos) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Possessive determinatives are inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Finally, the possessive determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the possessive "sinn" in the phrase "hann hugðisk þá at reyna afl sitt":

<w lemma="sinn" pos="DPosNSA">sitt</w>

Determinative   Subcategory   Gender    Number   Case
D               Pos           M, F, N   S, P     N, G, D, A

 

8.5.4.2 Demonstrative determinatives (DDet)

The demonstrative determinatives (DDet) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Demonstrative determinatives are furthermore inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Finally, the demonstrative determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the demonstrative "hinn" in the phrase "hitt fjall er hátt":

<w lemma="hinn" pos="DDetNSN">hitt</w>

Determinative   Subcategory   Gender    Number   Case
D               Det           M, F, N   S, P     N, G, D, A

 


8.5.5 Pronouns/determiners (PD)

This is the traditional category of pronoun, as defined in the grammars of e.g. Noreen 1923 and Iversen 1973. From an inflectional point of view this is a heterogeneous category.

The personal pronouns are inflected in first, second and third person. This is marked in the second field as 1, 2 and 3 respectively. Other pronouns are not inflected for person and are therefore marked with a #.

Many pronouns and determiners are inflected for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. Others are not, and are marked with a #.

Most pronouns and determiners are inflected for number, singular and plural, marked in the fourth field as S and P respectively; some also have a dual, D. Those which are not inflected for number are marked with a #.

Finally, most pronouns and determiners are inflected for case, nominative, genitive, dative and accusative, marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the pronoun "engi" in the phrase "ormrinn er slœgari en ekki annat kvikendi":

<w lemma="engi" pos="PD#NSN">ekki</w>  

The categories for the combined category of pronouns and determiners can be given as follows:

Pronoun/determiners   Person       Gender       Number       Case
PD                    1, 2, 3, #   M, F, N, #   S, D, P, #   N, G, D, A, #

 

 


8.5.6 Numerals (NU)

The numerals are divided into two sub-categories, cardinals (NUC) and ordinals (NUO).

Ordinals and the cardinals 1-4 are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. The rest of the cardinals are not inflected for gender and are therefore marked with a #.

Ordinals can be inflected for number, and are therefore marked for singular or plural in the fourth field. Cardinals are not inflected for number, and are therefore marked with a # in this field.

Ordinals and the cardinals 1-4 are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

The ordinal fyrstr occurs either in an indefinite (strong) form, "fyrstr", or a definite (weak) form, "fyrsti". This is shown in the sixth field as I and D respectively. Other numerals are marked with a # in this field.

The numerals hundrað 'one hundred (and twenty)' and þúsund 'one thousand (two hundred)' are marked as nouns.

Example: Encoding of the numeral "sjaundi" in the phrase "in sjaunda borg":

<w lemma="sjaundi" pos="NUOFSN#">sjaunda</w>

Numerals   Subcategory   Gender       Number    Case            Species
NU         C, O, #       M, F, N, #   S, P, #   N, G, D, A, #   I, D, #

 


8.5.7 Articles (AT)

In recent grammars the traditional word class articles is usually classified as part of the word class determinatives. However, in some projects (e.g. the Old Norwegian lemmatised corpus) articles are treated as a separate class, and we suggest that as an alternative they may be classified as such. Cf. also the EAGLES guidelines, which recognise articles as a major category.

Articles are inflected as adjectives, except for grade. There are thus four categories - gender, number, case and species.

Example: Encoding of the article "einn" in the phrase "ein kona":

<w lemma="einn" pos="ATFSND">ein</w>

Article   Gender       Number   Case         Species
AT        M, F, N, #   S, P     N, G, D, A   I, D

 

 


8.5.8 Verbs (V)

Verbs are either finite or infinite. In the former category, they are inflected for tense, mood, person, number and voice. In the latter category, participles are basically inflected as adjectives, while infinitives have a very restricted inflection. For practical reasons, we recommend treating finite and infinite forms separately.

 

8.5.8.1 Finite forms

Finite verbs are inflected in two categories for tense, present and preterite, which are marked in the second field as Pres and Pret respectively.

Next, verbs are inflected in three categories for mood, indicative, subjunctive and imperative, which are marked in the third field as Ind, Sub and Imp respectively.

In the personal inflection there are three categories, first, second and third person, which are marked in the fourth field as 1, 2 and 3 respectively.

The verbs are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Verbs are also inflected for voice, active and reflexive. This is marked in the sixth field as A and R respectively.

Optionally, verbs may be marked for morphological class. This is particularly useful for distinguishing verbs which appear in both weak and strong forms, such as brenna, svelg(j)a etc. We suggest that four main classes may be recognised, strong verbs (St), weak verbs (Wk), reduplicating verbs (Rd) and preterito-presentic verbs (Pp).

Finally, verbs with enclitic pronouns may be marked with the value Enc.

Example: Encoding of the verb "taldi" in the phrase "hon taldi":

<w lemma="telja" pos="VPretInd3SA">taldi</w>

Verb   Tense        Mood            Person    Number   Voice   Class (optional)   Enclitics
V      Pres, Pret   Ind, Sub, Imp   1, 2, 3   S, P     A, R    St, Wk, Rd, Pp     Enc
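The tag for a finite verb mixes multi-letter codes (Pret, Ind) with single characters and ends with two optional fields. The following Python sketch shows one way of splitting such a tag. The names FINITE_VERB_TAG and parse_finite_verb_tag are assumptions of this illustration, as is the choice to report a missing optional field as None.

```python
import re

# Fields in the order of the table above; class and enclitic are optional.
FINITE_VERB_TAG = re.compile(
    r"V(?P<tense>Pres|Pret)"
    r"(?P<mood>Ind|Sub|Imp)"
    r"(?P<person>[123])"
    r"(?P<number>[SP])"
    r"(?P<voice>[AR])"
    r"(?P<verb_class>St|Wk|Rd|Pp)?"
    r"(?P<enclitic>Enc)?"
)


def parse_finite_verb_tag(tag):
    """Split a finite verb tag into its fields or raise ValueError."""
    match = FINITE_VERB_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a well-formed finite verb tag: {tag!r}")
    return match.groupdict()


# The tag from the example above, without and with the optional class.
print(parse_finite_verb_tag("VPretInd3SA"))
print(parse_finite_verb_tag("VPretInd3SAWk"))
```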

 

8.5.8.2 Infinite forms

Infinite forms are either participles or infinitives. Both categories are inflected for tense, so we recommend that the first two fields are identical with the markup for finite forms.

As a next field, we recommend form, with the values participle (Part) and infinitive (Inf).

Since infinitives can have active as well as reflexive forms, the next field should indicate voice.

The next fields should be identical with the scheme for adjectives, i.e. gender, number, case and species. Infinitives are not inflected for these categories, and are therefore marked with the # sign.

Example: Encoding of the verb "koma" in the phrase "hann er kominn":

<w lemma="koma" pos="VPretPartMSND">kominn</w>

Example: Encoding of the verb "fara" in the phrase "hann mun fara":

<w lemma="fara" pos="VPresInfA####St">fara</w>

Verb   Tense        Form        Voice     Gender       Number    Case            Species   Class (optional)
V      Pres, Pret   Part, Inf   A, R, #   M, F, N, #   S, P, #   N, G, D, A, #   I, D, #   St, Wk, Rd, Pp
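The rule that infinitives carry a # in the four nominal positions can be checked mechanically. The following Python sketch splits a tag for an infinite form according to the table above and rejects infinitives that specify gender, number, case or species. The names used are assumptions of this illustration, and the pattern requires every field of the table, including voice, to be present.

```python
import re

# Fields in the order of the table above; the class is optional.
INFINITE_VERB_TAG = re.compile(
    r"V(?P<tense>Pres|Pret)"
    r"(?P<form>Part|Inf)"
    r"(?P<voice>[AR#])"
    r"(?P<gender>[MFN#])"
    r"(?P<number>[SP#])"
    r"(?P<case>[NGDA#])"
    r"(?P<species>[ID#])"
    r"(?P<verb_class>St|Wk|Rd|Pp)?"
)
NOMINAL_FIELDS = ("gender", "number", "case", "species")


def parse_infinite_verb_tag(tag):
    """Split a tag for a participle or infinitive into its fields."""
    match = INFINITE_VERB_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a well-formed infinite verb tag: {tag!r}")
    fields = match.groupdict()
    # Infinitives are not inflected for the nominal categories.
    if fields["form"] == "Inf" and any(fields[n] != "#" for n in NOMINAL_FIELDS):
        raise ValueError(f"infinitive with nominal inflection: {tag!r}")
    return fields


# The tag from the example with fara above.
print(parse_infinite_verb_tag("VPresInfA####St"))
```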

 

 


8.5.9 Adverbs (AV)

Adverbs (AV) are only inflected for grade, i.e. positive (P), comparative (C) and superlative (S).

Example: Encoding of the adverb "sterkliga" in the phrase "hann svaf ok hraut sterkliga":

<w lemma="sterkliga" pos="AVP">sterkliga</w>

Adverbs   Grade
AV        P, C, S

 

 


8.5.10 Prepositions (AP)

Prepositions are not inflected and are only marked for word class, AP. This is an abbreviation for "adposition", the superordinate term for "preposition" and "postposition" (the latter found in e.g. Japanese, but not in the Nordic languages).

Example: Encoding of the preposition "at" in the phrase "koma þeir at kveldi til eins búanda":

<w lemma="at" pos="AP">at</w>  

Prepositions

AP

 

 


8.5.11 Conjunctions and subjunctions (CC and CS)

In recent grammars, the traditional word class conjunctions is usually divided into two separate classes, conjunctions (e.g. "ok", "en") and subjunctions (e.g. "at", "ef"). The former category connects phrases on the same syntactical level, while the latter category typically introduces clauses. In traditional terminology, this is reflected in the subdivision of conjunctions into coordinating and subordinating. We recommend making a distinction between conjunctions proper = coordinating conjunctions (CC) and subjunctions = subordinating conjunctions (CS).

However, in some schemes (e.g. the Old Norwegian lemmatised corpus) only a single word class conjunctions is recognised. In that case, the second field may be given with a #.

Example: Encoding of the conjunction "ok" in the phrase "Logi hafði etit slátr allt ok beinin með":

<w lemma="ok" pos="CC">ok</w>

Example: Encoding of the subjunction "at" in the phrase "hon sagði at Baldr hafði þar riðit":

<w lemma="at" pos="CS">at</w>  

Conjunctions   Subcategory
C              C, S, #

 

 


8.5.12 Interjections (IT)

Interjections are not inflected and only marked for word class, IT.

Interjections

IT

 

 


8.5.13 Infinitive marker (IM)

The infinitive marker is undeclined and only marked as IM. In Old Norse it usually has the form at.

Infinitive marker

IM

 

 


8.5.14 Relative particle (RP)

The relative particle is undeclined and only marked as RP. In Old Norse it usually has the form er or sem. Some grammarians would classify the relative particle as a subjunction, while others tend to look upon it as a pronoun.

Relative particle

RP

 

 


8.5.15 Unassigned (U)

Some words are corrupt, difficult to analyse, belong to another language or are indeterminate for other reasons. These words are marked as unassigned, U.
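Since the word-class codes have different lengths, a tag has to be matched against the longer codes first, so that e.g. NU is not read as a noun and PD is not read as a pronoun. The following Python sketch identifies the word class of a tag in this way. The dictionary, the English labels and the function name word_class are assumptions of this illustration.

```python
# Word-class codes from ch. 8.5.1 to 8.5.15. "C#" stands for the undivided
# class of conjunctions mentioned in ch. 8.5.11.
WORD_CLASSES = {
    "N": "noun", "AJ": "adjective", "P": "pronoun", "D": "determinative",
    "PD": "pronoun/determiner", "NU": "numeral", "AT": "article",
    "V": "verb", "AV": "adverb", "AP": "preposition", "CC": "conjunction",
    "CS": "subjunction", "C#": "conjunction (undivided)",
    "IT": "interjection", "IM": "infinitive marker",
    "RP": "relative particle", "U": "unassigned",
}


def word_class(tag):
    """Return the word class of a tag, trying the longest codes first."""
    for prefix in sorted(WORD_CLASSES, key=len, reverse=True):
        if tag.startswith(prefix):
            return WORD_CLASSES[prefix]
    raise ValueError(f"unknown word class in tag {tag!r}")


# Tags taken from the examples in this chapter.
for tag in ("NCMSAI", "NUOFSN#", "PD#NSN", "PPer1#DN", "DDetNSN", "CS"):
    print(tag, word_class(tag))
```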

 


 

 

Preliminary version created 8 April 2002. Version 1.0 published 20 May 2003.