Chapter 5. Characters and words
Version 3.0 beta
This is a preliminary version which can be changed or updated at any time.
The revision and updating of this chapter has been done by Odd Einar Haugen and Beeke Stegmann.
When transcribing a text, the transcriber will usually make a distinction between the individual characters, the white space between some of the characters, the words made up by sequences of characters, and the punctuation marks which are inserted between some of the words. The actual encoding can be as straightforward as the example in ch. 2.3 above, in which characters, punctuation marks and spaces have been typed directly from the keyboard:
Reiðr var þá Vingþórr er hann vaknaði ok síns hamars um saknaði, skegg nam at hrista, skör nam at dýja, réð Jarðar burr um at þreifask.
In a more complex encoding, the transcriber might like to identify the basic units as
such, so that a distinction can easily be drawn between single characters, words,
punctuation marks and the white space surrounding them. This chapter will discuss these
basic units and how they can be encoded specifically, if needed, using elements like
<c> for individual characters and
<w> for individual words.
The basic unit in any transcription of an alphabetic script is the individual letter. In a linguistic context a distinction is often drawn between the abstract entity of a grapheme and the representation of graphs in a written document. Variant forms are referred to as allographs, e.g. the Roman type of s and the Fraktur (black letter) type. The terminology is analogous to the distinction between phonemes, phones and allophones. For a general introduction to this terminology, see e.g. Sture Allén 1971, Manfred Kohrt 1985 or Christa Dürscheid 2016.
In this handbook we shall adopt the terminology of the Unicode Standard. The fundamental distinction drawn is between characters and glyphs. Characters are, as Unicode defines it, “the smallest components of written language that have semantic value”, while glyphs are “the shapes that characters can have when they are rendered or displayed” (cf. Unicode 9.0, ch. 2.2 Unicode Design Principles). What the transcriber sees in the source document is a series of individual glyphs, and the act of transcribing essentially involves linking these glyphs to the characters at the transcriber’s disposal.
The concept of a character is similar to, but not identical with the linguistic concept of a grapheme. These concepts are notoriously difficult, but for the purposes of this handbook we believe that the Unicode usage is robust and sufficiently well-defined.
The Unicode Standard puts great emphasis on the fact that individual characters may be represented by a number of glyphs, and is therefore reluctant to accept as new characters what it perceives to be variant glyphs. It will be obvious to most people that the various shapes of letters in printed type faces, such as Courier, Times, Lucida etc., should not be seen as different characters, as shown in fig. 5.1.
Unicode draws a distinction between small (minuscule) characters such as “a” and large (majuscule) characters such as “A”, since there is a possible semantic value attached to each set of characters. Thus, “the white house” can refer to any house which is white in colour, while “the White House” refers (normally) to one specific building. It can be argued that the same applies to the distinction between regular characters, “a”, and italics, “a”. For example, while “Metope” refers to a poem by the Norwegian author Olaf Bull, Metope (according to a widespread bibliographical practice) refers to the book in which this poem is published (a book which, coincidentally, bears the same name as one of the poems contained in it). However, Unicode does not regard italics (or bold type) as individual characters. There are good reasons for this, but the example serves to illustrate the fact that the definition of a character is not always clear-cut.
Medieval Nordic manuscripts were written in the Latin alphabet from the very beginning. The basic inventory is thus the characters a-z / A-Z. They were supplemented with a number of new (or borrowed) characters, several ligatures and a variety of diacritical marks. There was also a large number of abbreviation marks in use, especially in Old Icelandic and Old Norwegian manuscripts. In fact, some abbreviation marks behave as ordinary characters in the sense that they occupy a separate position on the base line. On the other hand, many components of ordinary characters are diacritical, i.e. placed above (or through or below) another character, and thus akin to typical abbreviation marks. This means that the rules for transcribing ordinary characters and abbreviation marks should be identical.
We believe that it is possible to identify a base line in all texts, as shown in fig. 5.2. We recommend that the transcriber identifies each separate character on the base line and records it in the same sequence as in the manuscript. Thus, the characters in fig. 5.2 would be transcribed as “abpþ” or “abp&thorn;”. The last character may be encoded with its Unicode code point, “þ” at 00FE, or with an entity, “&thorn;”. Both encodings are strictly equivalent.
If there are marks of any sort placed above, through or below any base line character, we recommend that these marks (if they are to be interpreted as characters) are transcribed immediately after the base line character. In general, we refer to these marks as diacritics. As mentioned above, abbreviation marks are also frequently written above (and in some cases through or below) a base-line character. Assuming that the sign above “h” should be referred to with the entity “&er;”, the transcription of the very first word in fig. 5.3 would be “h&er;”.
Diacritical marks are often seen as forming an integral part of a base line character, and the whole is then encoded as a single character. This applies to accent marks, such as the one above “e” in fig. 5.3. This combination of a base line character and a combining mark can be encoded as a single character, in Unicode referred to as LATIN SMALL LETTER E WITH ACUTE with the hexadecimal code value 00E9. Alternatively, this letter can be decomposed and encoded as a combination of LATIN SMALL LETTER E and COMBINING ACUTE ACCENT. We would like to emphasize that both encodings are strictly equivalent.
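The equivalence of the composed and the decomposed encoding can be checked with any Unicode-aware tool. The following is a minimal sketch in Python, using only the standard library; the variable names are our own and purely illustrative. It shows that the two encodings differ code point by code point, but become identical once both are normalised to the same form.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A naive comparison sees two different strings (one vs. two code points).
same_code_points = precomposed == decomposed

# After normalisation to the composed form (NFC) the strings are identical,
# which is what canonical equivalence in Unicode means in practice.
equivalent = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed))
print(same_code_points, equivalent)
```

In practice this means that software comparing or searching transcriptions should normalise the text first, rather than rely on the transcriber having chosen one encoding consistently.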
Abbreviation marks, on the other hand, are usually treated as separate characters and encoded as characters in their own right. From a purely graphical point of view, the distinction between the acute accent in “é” and abbreviation marks such as the “zigzag” mark and the bar, both exemplified in fig. 5.3, is far from obvious, but the semantics are different. The acute accent may in some manuscripts be used to signify length, but it is often used quite freely, sometimes only to distinguish one minim character from another. Abbreviation marks have a definite (if sometimes ambiguous) meaning and can be expanded into one or more characters; the zigzag mark above “h” in fig. 5.3 signifies “er”, and the bar above “n” signifies another “n”.
5.2.1 Rules for encoding characters
We suggest the following basic rules for encoding characters, irrespective of whether they are ordinary (alphabetic) characters or abbreviation marks.
1. Each character is encoded according to its position in the direction of writing.
2. Alphabetical characters on the base line are encoded first:
2.1 If the character belongs to the ordinary Latin character set a-z / A-Z (commonly
known as ISO 646 or Basic Latin) it is always encoded as such.
2.2 Characters outside Basic Latin should either be encoded by Unicode codepoints or by entities, e.g. either as “abpþ” (recommended) or as “abp&thorn;”.
2.3 Characters which are not part of the Unicode Standard must always be encoded by entities. See Appendix A for more details.
3. Abbreviation marks occupying a separate position on the base line are encoded in the same manner as alphabetical characters. This applies to e.g. LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER (for “per” or “par”), as explained in ch. 6 below.
4. Alphabetical characters with diacritical marks, e.g. “é”, are encoded in one of two equivalent ways:
4.1 As a base line character + one or more combining marks. Thus the character “é”
would be encoded as “e” + “&combacute;” (the latter entity meaning
COMBINING ACUTE ACCENT).
4.2 As a composite base line character and encoded with a single Unicode code point or an entity. Thus, the character “é” would be encoded either as “é” or as “&eacute;”.
5. Characters with abbreviation marks are encoded in the same manner as alphabetical characters, i.e. in one of two equivalent ways:
5.1 As a base line character + one or more combining marks. Thus the first character in
fig. 5.3 above would be encoded as “h” + “&er;” (the latter entity meaning
COMBINING ABBREVIATION MARK “ER”).
5.2 As a composite base line character and encoded with a single entity. Thus the above character might be encoded with a single entity, e.g. as “&her;”.
As a rule, we would recommend the first solution, since the number of combinations of
base line characters and combining abbreviation marks is very high. Furthermore, we
recommend that the abbreviation mark is identified by the
<am> element (if it
is encoded as such, typically on the facsimile level) or by the
<ex> element (if it has been expanded, in this case as “er”, typically on the diplomatic level).
See ch. 4 above for an explanation of levels.
6. If there is more than one combining character, they are encoded in this order:
(a) Combinations with the base line character within the x height of the base line
(b) Combinations with the base line character outside its x height, but still in contact with it.
(c) Combinations with the base line character outside its x height and without any contact with it.
7. If there is more than one combining character in any of the three positions defined in (6) above, they are encoded in a clockwise direction, beginning at 6 o’clock and moving through 9 o’clock, 12 o’clock etc.
5.2.2 Entities and Unicode values
By using entities it is possible to define as many characters as one believes are necessary for the transcription of a certain corpus of texts. However, since most applications now fully support Unicode, we recommend that characters in the Unicode Standard are encoded by their Unicode code points.
Note that the type of encoding is specified at the very beginning of an XML file. If the specification is
<?xml version="1.0" encoding="ISO-8859-1"?>
entities must be used for all characters outside Basic Latin and Latin-1 Supplement. Thus, “a”, “é” and “þ” can be entered directly, but characters like “ǫ” (LATIN SMALL LETTER O WITH OGONEK) must be encoded with an entity, “&oogon;”.
If, however, the encoding is specified as
<?xml version="1.0" encoding="UTF-8"?>
all characters in the Unicode Standard can be encoded with their Unicode code points, without resorting to entities.
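The practical consequence of the two declarations can be tested mechanically. The sketch below is a hedged illustration in Python; the helper name fits_encoding is hypothetical and not part of any Menota tool. It simply asks whether a character can be stored in a given encoding, which is the condition for entering it directly rather than as an entity.

```python
def fits_encoding(text: str, encoding: str) -> bool:
    """Return True if every character in text can be stored in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = {
    "a": "a",            # Basic Latin
    "eacute": "\u00e9",  # Latin-1 Supplement
    "thorn": "\u00fe",   # Latin-1 Supplement
    "oogon": "\u01eb",   # LATIN SMALL LETTER O WITH OGONEK, outside Latin-1
}

for name, char in samples.items():
    print(name,
          fits_encoding(char, "iso-8859-1"),
          fits_encoding(char, "utf-8"))
```

Only the last character fails the ISO-8859-1 test, which is why it would need an entity such as “&oogon;” under that declaration.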
In TEI P5, all entities must be declared in a separate list. A complete list of entities for Medieval Nordic texts is part of the Menota schema, and can be consulted in Appendix D.1.1. An encoding using these entities will always be valid with respect to character encoding (but may, of course, be invalid for other reasons). In the Menota schema, entities are linked to code points defined in the MUFI character recommendation, so that if a Menota text is displayed with a fully compliant MUFI font, all entities will be displayed correctly.
The Basic Multilingual Plane of the Unicode Standard has 65,536 different code points. This includes a large Private Use Area (PUA), comprising some 6,000 code points. This area can be used for characters not defined in the Standard (so far). Our present recommendation is to use this area for characters not included in the Unicode Standard and to coordinate the allocation of codepoints with the recommendations by the Medieval Unicode Font Initiative. It should be noted that the use of PUA is an interim solution. A long-term solution is to apply to Unicode for the inclusion of additional characters and/or use other rendering techniques (such as OpenType).
Code points in Unicode are usually given in hexadecimal format, in which each digit spans a sequence of 16 positions, 0-1-2-3-4-5-6-7-8-9-A-B-C-D-E-F. Thus, 0001 equals 1 in the decimal system, 000F equals 15, 0010 equals 16 etc. The Basic Multilingual Plane thus goes from 0000 to FFFF (65,535 in decimal), which gives 65,536 code points. The PUA is located at E000-F8FF.
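The hexadecimal arithmetic can be verified with a few lines of Python. This is only a sketch; the function names to_decimal and in_bmp_pua are our own. It converts the values mentioned above, counts the code points in the PUA range E000-F8FF, and tests whether a given code point falls inside that range.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF   # Private Use Area in the BMP

def to_decimal(code_point: str) -> int:
    """Convert a hexadecimal code point such as '000F' to a decimal number."""
    return int(code_point, 16)

def in_bmp_pua(code_point: str) -> bool:
    """Return True if the code point lies in the range E000-F8FF."""
    return PUA_START <= to_decimal(code_point) <= PUA_END

print(to_decimal("0001"), to_decimal("000F"), to_decimal("0010"))
print(to_decimal("FFFF"))             # highest code point in the BMP
print(PUA_END - PUA_START + 1)        # number of code points in the PUA
print(in_bmp_pua("EBAE"), in_bmp_pua("00FE"))
```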
The Latin alphabet is the first to be described in the Unicode Standard. As was mentioned, many characters in Unicode can be defined in several ways, either as a single base line character (including any diacritical marks) or as a combination of a base line character and one or more combining marks.
(a) Commonly used characters have a single description in Unicode. This applies to all base line characters in the Latin alphabet.
|Glyph||Encoding||Code point||Unicode descriptive name|
(b) Composite characters may be described in more than one way. Thus, an “a with acute accent” can be encoded as a combination of an “a” and a combining acute accent or as a single character, “a with acute accent”. Both descriptions are equivalent:
|Glyph||Entity||Code point||Unicode descriptive name|
(c) Some characters are not found in Unicode and must therefore be assigned to the Private Use Area (PUA), either as a character with its own code point or as a combination of an existing character and a combining diacritical mark in the PUA. The ligature of “k” and “ſ” is not included in the Unicode Standard (as of v. 9.0), and since we would rather not encode it as a sequence of “k” + “zero width joiner” + “ſ”, we have assigned it to a code point in the PUA, EBAE.
|Glyph||Entity||Code point||Descriptive name|
Encoding with entities referring to the PUA may look unnecessarily complicated. It should be borne in mind, however, that the great majority of characters are defined in Unicode, and in many transcriptions the need for special characters in the PUA will not arise. With appropriate fonts, the transcriber does not need to spend much time on technicalities of this type.
Finally, it should be noted that a text may be encoded with a mixture of Unicode code points and entities even for characters within the Unicode Standard. For the sake of clarity, some encoders might like to insert combining marks as entities. Thus, the example above might be encoded as:
h&er; sér han&bar;
Or, with the element
<am> for the abbreviation characters:
h<am>&er;</am> sér han<am>&bar;</am>
The two abbreviation characters COMBINING ZIGZAG ABOVE and COMBINING OVERLINE are part of the Unicode Standard, at 035B and 0305 respectively, so entities are not really needed. However, some XML editors may not show combining characters in correct positions, and it is thus more legible to use entities, “&er;” for the combining zigzag above and “&bar;” for the combining bar above.
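To make the relation between the two entities and their code points concrete, here is a small Python sketch. It assumes nothing more than the mapping stated above (“&er;” to 035B, “&bar;” to 0305); the function expand_entities is a hypothetical helper that performs a plain string replacement and is not a substitute for proper entity resolution by an XML parser.

```python
import unicodedata

# Mapping taken from the paragraph above; illustrative only.
ENTITY_TO_CHAR = {
    "er": "\u035b",    # COMBINING ZIGZAG ABOVE
    "bar": "\u0305",   # COMBINING OVERLINE
}

def expand_entities(text: str) -> str:
    """Replace the entity references in text with their code points."""
    for name, char in ENTITY_TO_CHAR.items():
        text = text.replace(f"&{name};", char)
    return text

encoded = "h&er; s\u00e9r han&bar;"
expanded = expand_entities(encoded)

# List the combining characters that ended up in the expanded string.
for char in expanded:
    if unicodedata.combining(char):
        print(f"{ord(char):04X}", unicodedata.name(char))
```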
If an encoder, for some reason, would like to encode a character which is not in the Menota list of entities, this character has to be declared in the header of the file.
An ordinary Menota XML file will typically refer to the whole list of Menota entities in the third line of the file like this:
<!ENTITY % Menota_entities SYSTEM 'http://www.menota.org/menota-entities.txt'> %Menota_entities;]>
If, however, the transcriber would like to add a couple of entities not included in the Menota list, they must be specified as a sequence of the entity and its rendering:
<!ENTITY % Menota_entities SYSTEM 'http://www.menota.org/menota-entities.txt'> %Menota_entities; <!ENTITY trotdot "&#x0024;"> <!ENTITY eacutesup "&#x00A3;">]>
In this example, it is specified that the first entity, “&trotdot;”, is going to be displayed as the hexadecimal character 0024, the dollar sign, and the second, “&eacutesup;”, as 00A3, the pound sign. These are stop-gap measures, and the transcriber decides the actual rendering. A long-term solution would be to work with Menota in order to add these entities to the Menota entity list.
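How such locally declared entities are resolved can be tried out with a standard XML parser. The following Python sketch makes several assumptions of our own: the renderings are written as the numeric character references &#x0024; and &#x00A3;, the wrapper element <text> is hypothetical, and the reference to the external Menota entity list is left out so that the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the two additional entities only.
DOCUMENT = (
    '<!DOCTYPE text [\n'
    '  <!ENTITY trotdot "&#x0024;">\n'
    '  <!ENTITY eacutesup "&#x00A3;">\n'
    ']>\n'
    '<text>a&trotdot;b&eacutesup;c</text>'
)

root = ET.fromstring(DOCUMENT)

# The parser has replaced each entity reference with its declared rendering.
resolved = root.text
print(resolved)
```

A reference to an entity that has not been declared would make the parser reject the file, which is why every additional entity has to appear in the header.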
5.2.3 Encoding characters as such
In some cases, a character should be encoded as such. That kind of separate mark-up
allows for association with additional meta-data as well as easier processing.
The TEI P5 Guidelines recommend the element
<c> for this type of encoding, and we suggest also using the attribute
@type (and potentially others) for further specification.
|Element / attribute / value||Contents|
|<c>||(character) contains an individual character|
|@type||type of character. Suggested values:|
|'word'||the character should be regarded as a full word|
|'initial'||the character is an initial|
|'hyphen'||the character is a hyphen|
A character should, for instance, be encoded as such when it forms a word in itself
instead of merely being part of a larger word. This can be the case if a character is
the object of a grammatical discussion.
A sentence like the following from the First Grammatical Treatise
X, hann er samsettr í látinu af c ok s.
would thus be encoded as
<w><c type="word">X</c></w>, hann er samsettr í látinu af <w><c type="word">c</c></w> ok <w><c type="word">s</c></w>.
The usage of the attribute
@type with the value 'word' distinguishes
it from other kinds of characters one might want to mark up.
Note that the
<c> element is placed within the element
<w> . This might seem somewhat
redundant in this case, since that information is also provided by the attribute. However, if
a character behaves like a word, such as in a sentence like “The left descender of the
x’es in this script go below the base line”, it has inflection and could
easily be lemmatised as the noun x. (See ch. 11.2 below.)
When displaying the text from the
First Grammatical Treatise on the normalized
level, one might also choose to display the
contents of the
<c> element in italics, which would be possible with the suggested mark-up:
X, hann er samsettr í látinu af c ok s.
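A display of this kind can be sketched in a few lines of Python. The example below is an illustration under our own assumptions, not a Menota stylesheet: the wrapper element <s> is added only to give the fragment a single root, and asterisks stand in for italics. The hypothetical function render flattens the mark-up to text and highlights every <c> element with the type 'word'.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<s><w><c type="word">X</c></w>, hann er samsettr í látinu af '
    '<w><c type="word">c</c></w> ok <w><c type="word">s</c></w>.</s>'
)

def render(elem) -> str:
    """Flatten elem to text, wrapping <c type="word"> contents in asterisks."""
    out = elem.text or ""
    for child in elem:
        out += render(child)
        out += child.tail or ""   # text that follows the child element
    if elem.tag == "c" and elem.get("type") == "word":
        out = f"*{out}*"
    return out

rendered = render(ET.fromstring(SOURCE))
print(rendered)
```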
Individual characters are moreover marked up as such when the character in question
is an initial or sentence initial. In that case, the character
is part of a larger word, meaning that the entire word is enclosed by the <w>
element, while only the visually highlighted initial is enclosed by the <c> element.
A detailed description of how to mark up initials in the transcription is provided
in ch. 7.3. Note, however, that
the visual rendering of initials in a manuscript is only encoded on the facsimile level, not on the
diplomatic or normalized levels.
Finally, hyphens – where they occur in manuscripts – are encoded with
<c> . For the mark-up of hyphens see ch. 5.5 below.
5.3.1 Basic mark-up
This subchapter will introduce some important elements and attributes for the encoding of words or word parts, mostly based on ch. 17.1 “Linguistic Segment Categories” in the TEI P5 Guidelines.
|Element / attribute / value||Contents|
|<w>||(word) contains an individual word|
|@lemma||states the lexical citation form of a word|
|<m>||(morpheme) contains a part of a word|
|@baseForm||states the base form of a morpheme|
|<seg>||(segment) groups one or more strings of text, e.g. words|
|@type||states the type of segmentation. Suggested values:|
As a rule, medieval Nordic manuscripts in the Latin alphabet are written with a clearly
identifiable space between each word. This obviously facilitates the work for the
transcriber, since the word is a basic linguistic unit in grammars and dictionaries. In
a simple transcription, word division can simply be entered by the space bar on the
keyboard. Thus, a piece of text (from
Barlaams ok Josaphats saga ch. 48)
might be transcribed as
En ef ver fallum i hinar fornno syndir oc huerfum aptr til hinna fyrrv misverka sem hundr til spyu sinnar þa kann lettlega at vera at oss kunni til hannda at berazt sem i guðspialleno segir.
Here, each word is delimited by a space (or a punctuation mark). However, for a more
detailed analysis it can be convenient to identify each word with a separate
<w> element (for “word”). The
<w> element functions as a
container for information on levels of text representation (ch. 4 above) and morphological analysis (ch.
11). In this example, each word has been identified by the <w> element
and the lemma (dictionary entry) specified in the @lemma attribute of this element:
<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="vér">ver</w>
<w lemma="falla">fallum</w>
<w lemma="í">i</w>
<w lemma="hinn">hinar</w>
<w lemma="forn">fornno</w>
<w lemma="synd">syndir</w>
etc.
For practical reasons, each word has a separate line in this encoding. Unless otherwise
specified, it is assumed that there is white space between each pair of <w> elements.
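One benefit of word-level mark-up is that forms and lemmata can be extracted automatically. The following Python sketch assumes a hypothetical wrapper element <s> around the first four words of the example; it collects each word form together with its lemma and rebuilds the running text by inserting a space between the words.

```python
import xml.etree.ElementTree as ET

SOURCE = """<s>
<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="vér">ver</w>
<w lemma="falla">fallum</w>
</s>"""

root = ET.fromstring(SOURCE)

# One (form, lemma) pair per <w> element, in document order.
pairs = [(w.text, w.get("lemma")) for w in root.iter("w")]

# White space is assumed between the words, as stated above.
running_text = " ".join(form for form, _ in pairs)

for form, lemma in pairs:
    print(form, lemma)
print(running_text)
```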
5.3.2 One word or two? Graphical and lexical words
Although words as a rule are separated by spaces in medieval Nordic manuscripts, there
are many exceptions to this rule. For this reason, a distinction should be drawn between
graphical words and lexical words. A
graphical word is a sequence set out by space on either side, while a lexical word is a
member of the set of word forms defined by grammars and dictionaries for the language in
question. In the great majority of cases, graphical and lexical words are identical.
However, we sometimes see that a preposition and its object may be written as a single
word (“aveiðiskap” = “á veiðiskap”), or that compounds are written as two
separate words (“veiði kona” = “veiðikona”), as in this example from
Niðrstigningar saga in Holm perg 6 fol, f. 138:
veiði kona mykyl hevir hon veret ok miok agiarn aveiðiskap
If the transcriber wishes to analyse two (or more) graphical words as a single lexical
word, we suggest that this is done by putting the whole sequence within the <w> element.
Information on e.g. lemma can be given as an attribute to the <w> element:
<w lemma="veiðikona">veiði kona</w>
The sequence “veiði kona” thus appears within a single element. In other words, the transcriber interprets it as one lexical word, “veiðikona”. The space is left untouched, so that in a display of the transcription, the sequence will still show up as two graphical words, “veiði” and “kona”. However, since both graphical words are placed within a single element the lemma will refer to both parts.
The converse case is a single graphical word which the transcriber would like to
analyse as two (or more) lexical words, e.g. “aveiðiskap” = “á veiðiskap”.
Each lexical word should be placed within a
<w> element, and information on
lemma, morphological form etc. can be given within each
<w> element. However,
to generate a correct display of the text, i.e. a display with no space between each
part, we suggest that the
<seg> element is used with a type attribute. The
value “nb” would indicate that there is no break between the parts in the
<w> element. If the lemma is given by way of an attribute, the encoding would
look like this:
<seg type="nb">
<w lemma="á">a</w>
<w lemma="veiðiskap">veiðiskap</w>
</seg>
In some rather marginal cases, a sequence may be encoded as both types. A simplified example from Codex Regius is “aravk stola” which should be read as “a ravkstola”. This sequence might be encoded in this way:
<seg type="nb">
<w lemma="á">a</w>
<w lemma="rǫkstóll">ravk stola</w>
</seg>
This encoding shows that “a” in “aravk stola” is a lexical word, sc. the preposition “á”, and that “ravk stola” is another lexical word, sc. the noun “rǫkstóll”. It will also allow a correct display of the sequence, since it specifies that there should be no space between “a” and “ravk stola”, and the space between “ravk” and “stola” is also encoded (analogous to the encoding of “veiði kona” above).
Enclitic words may be encoded in a similar way, e.g. “emk” which should be read as “em” + “[e]k” ‘am I’:
<seg type="enc">
<w lemma="vera">em</w>
<w lemma="ek">k</w>
</seg>
In some cases, it can be difficult to draw the line between the main word and the enclitic, for example when there is an assimilation between the two, “ert” + “þú” > “ertu” ‘you are’. We recommend giving priority to the main word and leaving the enclitic in a reduced form:
<seg type="enc">
<w lemma="vera">ert</w>
<w lemma="þú">u</w>
</seg>
Multi-level encodings follow the same rules, e.g. “scalltu” ‘you shall’:
<seg type="enc">
<w lemma="skulu">
<choice>
<me:facs>ſcallꞇ</me:facs>
<me:dipl>scallt</me:dipl>
<me:norm>skalt</me:norm>
</choice>
</w>
<w lemma="þú">
<choice>
<me:facs>u</me:facs>
<me:dipl>u</me:dipl>
<me:norm>þú</me:norm>
</choice>
</w>
</seg>
Stylesheets should display the readings on the <me:facs> and
<me:dipl> levels with no space, “ſcallꞇu” and “scalltu” respectively, but with a space on the
<me:norm> level, “skalt þú”.
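The display rule for enclitics can be illustrated by a short Python sketch. It is not a Menota stylesheet; the namespace URI bound to the “me:” prefix, the wrapper element <s> and the function names reading and render_seg are all assumptions made for the sake of the example. The sketch joins the words of a <seg type="enc"> without a space on the facsimile and diplomatic levels, and with a space on the normalised level.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"   # assumed URI for the "me:" prefix

SOURCE = f"""<s xmlns:me="{ME}">
<seg type="enc">
<w lemma="skulu"><choice><me:facs>ſcallꞇ</me:facs><me:dipl>scallt</me:dipl><me:norm>skalt</me:norm></choice></w>
<w lemma="þú"><choice><me:facs>u</me:facs><me:dipl>u</me:dipl><me:norm>þú</me:norm></choice></w>
</seg>
</s>"""

def reading(word, level: str) -> str:
    """Return the reading of a <w> element on the given level."""
    node = word.find(f"choice/{{{ME}}}{level}")
    return "".join(node.itertext()) if node is not None else ""

def render_seg(seg, level: str) -> str:
    """Join the words of an enclitic segment according to the level."""
    words = [reading(w, level) for w in seg.findall("w")]
    joiner = " " if level == "norm" else ""
    return joiner.join(words)

seg = ET.fromstring(SOURCE).find("seg")
for level in ("facs", "dipl", "norm"):
    print(level, render_seg(seg, level))
```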
The morphological encoding of enclitic words is further discussed in ch. 11 below.
5.3.3 Encoding of word constituents
The encoder might want to encode constituent parts of a word, e.g. prefixes, roots,
derivational forms, etc. We recommend using the
<m> element (for
“morpheme”) in such cases (cf. ch. 17.1 in the TEI P5 Guidelines). This element may also be used for
constituent parts such as “veiði” and “kona” in the examples above. The
<m> element may contain information on level of text representation, lemma
etc. We shall repeat the encoding of “veiði kona” above:
<w lemma="veiðikona">veiði kona</w>
Now, if the encoder wishes to add lexicographical (or other) information to the two
constituent parts, that can easily be done by inserting
<m> elements in the <w> element:
<w lemma="veiðikona">
<m baseForm="veiði">veiði</m>
<m baseForm="kona">kona</m>
</w>
This encoding would make a clear distinction between lemmata on the first level of
encoding, in this case “veiðikona”, and the base form,
@baseForm, of each
constituent part, in this case “veiði” and “kona”.
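The two levels of information can be read out separately. The Python sketch below is a hedged illustration that assumes the word contains only the two <m> elements, separated by a space; the variable names are our own. It retrieves the lemma of the whole word, the base form of each constituent, and the surface form as it would be displayed.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<w lemma="veiðikona">'
    '<m baseForm="veiði">veiði</m> '
    '<m baseForm="kona">kona</m>'
    '</w>'
)

word = ET.fromstring(SOURCE)

lemma = word.get("lemma")                                   # word level
parts = [(m.text, m.get("baseForm")) for m in word.findall("m")]
surface = "".join(word.itertext())                          # displayed text

print(lemma)
print(parts)
print(surface)
```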
Lemmatisation is further discussed in ch. 11.2 below and
is here only given as an example of a word-based type of mark-up. Grammatical
information can also be conveniently attached to the word through an attribute for
morphosyntactical analysis. This is also discussed in ch. 11.
Having introduced elements for the encoding of individual characters and words, it can
also be useful to tag punctuation marks specifically. For punctuation characters in
general, we recommend using the <pc> element.
|<pc>||contains a punctuation mark|
|<me:facs>||contains a reading on a facsimile level|
|<me:dipl>||contains a reading on a diplomatic level|
|<me:norm>||contains a reading on a normalised level|
|<choice>||groups alternative readings, such as those in <me:facs>, <me:dipl> and <me:norm>|
The three levels of text representation, facs, dipl and norm, were explained in ch. 4 above. Note the prefix “me:” which indicates that these elements belong to the Menota namespace and are not part of the elements defined in TEI P5. See ch. 2.9 above on the use of namespaces in TEI schemas.
5.4.1 Punctuation in a single-level transcription
In ch. 5.3.1 above, we said that a text can be encoded
character by character. Punctuation marks are simply inserted where they occur in the
manuscript, even if the position is wrong according to modern rules. If the actual punctuation of
Barlaams ok Josaphats saga is added, the example above
looks like this:
En ef ver fallum i hinar fornno syndir. oc huerfum aptr. til hinna fyrrv misverka sem hundr til spyu sinnar. þa kann lettlega at vera. at oss kunni til hannda at berazt. sem i guðspialleno segir.
If a text is encoded using the
<w> element, we recommend using a <pc>
element for punctuation marks. This is what an encoding looks like on a single level:
<w>En</w> <w>ef</w> <w>ver</w> <w>fallum</w> <w>i</w> <w>hinar</w> <w>fornno</w> <w>syndir</w> <pc>.</pc> <w>oc</w> <w>huerfum</w> <w>aptr</w> <pc>.</pc> etc.
The main reason for doing so follows from the encoding of more than one level of transcription. At a diplomatic level, the transcriber should encode the punctuation marks exactly where they are in the source, but at a normalised level, some punctuation marks should be suppressed, some should be retained and some should be added.
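Converting a plain transcription into this kind of mark-up is easy to automate as a first pass. The following Python sketch is only a rough starting point under a strong assumption: every dot is treated as a punctuation mark, which does not hold for dots used as abbreviation marks. The function name tag_tokens is hypothetical.

```python
import re
from xml.sax.saxutils import escape

def tag_tokens(text: str) -> str:
    """Wrap words in <w> and dots in <pc>, separated by single spaces."""
    # A token is either a run of non-space, non-dot characters or one dot.
    tokens = re.findall(r"[^\s.]+|\.", text)
    tagged = []
    for token in tokens:
        name = "pc" if token == "." else "w"
        tagged.append(f"<{name}>{escape(token)}</{name}>")
    return " ".join(tagged)

print(tag_tokens("fornno syndir. oc huerfum aptr."))
```

The result would still have to be checked by the transcriber, since only a reading of the text can tell a punctuation mark from an abbreviation mark.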
In addition to punctuation marks like FULL STOP, COMMA, COLON, SEMICOLON and HYPHEN, there are a number of specific medieval punctuation marks, including an early form of the QUESTION MARK and a PUNCTUS ELEVATUS. A full list of additional punctuation marks can be found in the MUFI character recommendation with appropriate character entities. For example, the PUNCTUS ELEVATUS, which sometimes appears in Medieval Nordic texts, should be encoded with the entity “&punctelev;”.
5.4.2 Punctuation in a multi-level transcription
While punctuation on the <me:facs> and
<me:dipl> levels in most cases
will be identical, it is often radically different on the <me:norm> level.
Here, many dots in the manuscript will simply be suppressed, while other punctuation
marks will be added, including modern punctuation marks like quotation marks and
exclamation marks. Suppressing a punctuation mark is simply done by leaving the element
empty, while any supplied marks are encoded by adding a new <pc> element
in which the
<me:facs> and possibly also the
<me:dipl> element will be empty.
A text transcribed as
ok nu sagdi hann. þat er eigi sva. sem þu segir
on the <me:dipl> level would probably be rendered as
“Ok nú,” sagði hann, “Þat er eigi svá sem þú segir.”
on the <me:norm> level, allowing for some variation in the type of quotation
marks and the order of comma or full stop and quotation mark. In a fully marked-up text,
the dot after “sva” would probably be suppressed on the <me:norm> level,
while a comma after “nu” would be added and the dot
after “hann” would be changed into a comma. Finally, quotation marks would be added.
However, unlike punctuation characters (e.g. commas and full stops),
quotation marks do not need to be written out by the transcriber. Instead, the <q> element is
simply placed around any part in direct speech, and the
stylesheet will then render the enclosed text, together with any punctuation characters, inside quotation marks:
<q>
<w><choice><me:dipl>ok</me:dipl><me:norm>Ok</me:norm></choice></w>
<w><choice><me:dipl>nu</me:dipl><me:norm>nú</me:norm></choice></w>
<pc><choice><me:dipl></me:dipl><me:norm>,</me:norm></choice></pc>
</q>
<w><choice><me:dipl>sagdi</me:dipl><me:norm>sagði</me:norm></choice></w>
<w><choice><me:dipl>hann</me:dipl><me:norm>hann</me:norm></choice></w>
<pc><choice><me:dipl>.</me:dipl><me:norm>,</me:norm></choice></pc>
<q>
<w><choice><me:dipl>þat</me:dipl><me:norm>þat</me:norm></choice></w>
<w><choice><me:dipl>er</me:dipl><me:norm>er</me:norm></choice></w>
<w><choice><me:dipl>eigi</me:dipl><me:norm>eigi</me:norm></choice></w>
<w><choice><me:dipl>sva</me:dipl><me:norm>svá</me:norm></choice></w>
<pc><choice><me:dipl>.</me:dipl><me:norm></me:norm></choice></pc>
<w><choice><me:dipl>sem</me:dipl><me:norm>sem</me:norm></choice></w>
<w><choice><me:dipl>þu</me:dipl><me:norm>þú</me:norm></choice></w>
<w><choice><me:dipl>segir</me:dipl><me:norm>segir</me:norm></choice></w>
</q>
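What a stylesheet has to do with such an encoding can be sketched in Python. The example is a simplified illustration, not the Menota stylesheet: the namespace URI, the wrapper element <s>, the choice of quotation marks and the function name render are our own assumptions, and only the first half of the sentence is included. Empty readings are suppressed, punctuation is attached to the preceding word, and the contents of <q> are wrapped in quotation marks on the normalised level only.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"   # assumed URI for the "me:" prefix

SOURCE = f"""<s xmlns:me="{ME}">
<q>
<w><choice><me:dipl>ok</me:dipl><me:norm>Ok</me:norm></choice></w>
<w><choice><me:dipl>nu</me:dipl><me:norm>nú</me:norm></choice></w>
<pc><choice><me:dipl></me:dipl><me:norm>,</me:norm></choice></pc>
</q>
<w><choice><me:dipl>sagdi</me:dipl><me:norm>sagði</me:norm></choice></w>
<w><choice><me:dipl>hann</me:dipl><me:norm>hann</me:norm></choice></w>
<pc><choice><me:dipl>.</me:dipl><me:norm>,</me:norm></choice></pc>
</s>"""

def render(elem, level: str) -> str:
    """Render the children of elem on one level of text representation."""
    pieces = []                       # (kind, text) pairs in document order
    for child in elem:
        if child.tag == "q":
            inner = render(child, level)
            if level == "norm":       # quotation marks are supplied here
                inner = f"\u201c{inner}\u201d"
            pieces.append(("w", inner))
        elif child.tag in ("w", "pc"):
            node = child.find(f"choice/{{{ME}}}{level}")
            text = (node.text or "") if node is not None else ""
            if text:                  # an empty reading is suppressed
                pieces.append((child.tag, text))
    out = ""
    for kind, text in pieces:
        if out and kind == "w":       # space before words, not before marks
            out += " "
        out += text
    return out

root = ET.fromstring(SOURCE)
print(render(root, "dipl"))
print(render(root, "norm"))
```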
In many cases, a dot should be interpreted as an abbreviation mark rather than a
punctuation mark. In such cases, we recommend that the dot is encoded using the ordinary
full stop in Basic Latin, but that it is placed within the
<am> element. A text transcribed as
nu fann kgr. engan mann þar
on the <me:facs> level would probably be rendered as
nu fann konongr engan mann þar
on the <me:dipl> level. In a fully marked-up text, the abbreviated word
“kgr.” would be encoded with an
<am> element around the dot (the abbreviation mark)
on the <me:facs> level, while it would be
expanded into “onon” (or “onun”) on the <me:dipl> level:
<w><choice><me:facs>nu</me:facs><me:dipl>nu</me:dipl></choice></w>
<w><choice><me:facs>fann</me:facs><me:dipl>fann</me:dipl></choice></w>
<w><choice><me:facs>kgr<am>.</am></me:facs><me:dipl>k<ex>onon</ex>gr</me:dipl></choice></w>
<w><choice><me:facs>engan</me:facs><me:dipl>engan</me:dipl></choice></w>
<w><choice><me:facs>mann</me:facs><me:dipl>mann</me:dipl></choice></w>
<w><choice><me:facs>þar</me:facs><me:dipl>þar</me:dipl></choice></w>
<pc><choice><me:facs></me:facs><me:dipl>.</me:dipl></choice></pc>
In some cases, a word abbreviated with a dot may occur at the end of a sentence, e.g.
nu fann hann eigi kgr.
This dot would be interpreted as an abbreviation mark and possibly also as a punctuation mark. On the <me:facs> level it would be encoded as no more than a dot (inside an <am> element), while on the <me:dipl> level it would be suppressed when “kgr.” had been expanded to “konongr”. The encoder might, however, add a dot as a punctuation mark within a <pc> element. That would certainly be the case on the <me:norm> level, possibly also on the <me:dipl> level, but not on the <me:facs> level:
<w> <choice> <me:facs>nu</me:facs> <me:dipl>nu</me:dipl> <me:norm>Nú</me:norm> </choice> </w> <w> <choice> <me:facs>fann</me:facs> <me:dipl>fann</me:dipl> <me:norm>fann</me:norm> </choice> </w> <w> <choice> <me:facs>hann</me:facs> <me:dipl>hann</me:dipl> <me:norm>hann</me:norm> </choice> </w> <w> <choice> <me:facs>eigi</me:facs> <me:dipl>eigi</me:dipl> <me:norm>eigi</me:norm> </choice> </w> <w> <choice> <me:facs>kgr<am>.</am></me:facs> <me:dipl>k<ex>onon</ex>gr</me:dipl> <me:norm>konungr</me:norm> </choice> </w> <pc> <choice> <me:facs></me:facs> <me:dipl>.</me:dipl> <me:norm>.</me:norm> </choice> </pc>
With this markup, a dot will be displayed after the word “konungr” on all three levels, but the dot on the <me:facs> level is classified as an abbreviation mark (since it occurs within an <am> element), while the dot on the <me:dipl> and the <me:norm> levels is classified as a punctuation mark (since it occurs within a <pc> element).
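The classification described above can be read directly off the markup. The following is a minimal sketch in Python, not part of the Menota tools: the wrapping `<s>` element, the function name `classify_dots` and the namespace URI bound to the `me:` prefix are assumptions made for the illustration, and the TEI elements are left without a namespace to keep the example short.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"  # assumed namespace URI for the me: prefix

# The last word and the final <pc> of the example, wrapped in a hypothetical <s>.
SAMPLE = f"""<s xmlns:me="{ME}">
  <w><choice><me:facs>kgr<am>.</am></me:facs><me:dipl>k<ex>onon</ex>gr</me:dipl><me:norm>konungr</me:norm></choice></w>
  <pc><choice><me:facs></me:facs><me:dipl>.</me:dipl><me:norm>.</me:norm></choice></pc>
</s>"""


def classify_dots(root, level):
    """Return one label per dot found on the given level (facs, dipl or norm)."""
    labels = []
    for unit in root:  # each <w> or <pc>
        reading = unit.find(f"choice/{{{ME}}}{level}")
        if reading is None:
            continue
        # A dot inside <am> is an abbreviation mark.
        for am in reading.iter("am"):
            labels += ["abbreviation mark"] * (am.text or "").count(".")
        # A dot that is the content of a <pc> reading is a punctuation mark.
        if unit.tag == "pc":
            labels += ["punctuation mark"] * (reading.text or "").count(".")
    return labels


root = ET.fromstring(SAMPLE)
for level in ("facs", "dipl", "norm"):
    print(level, classify_dots(root, level))
```

The same dot-shaped character thus receives a different classification depending only on the element that contains it.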
The dot is by far the most common punctuation mark in Medieval Nordic sources. A question mark was sometimes used, while quotation marks and exclamation marks are post-medieval and only seen in normalised editions. There are a few additional punctuation marks, e.g. the punctus elevatus and the virgula. These marks can be encoded using entities, but should otherwise be kept within the <pc> element.
Like modern texts, medieval manuscripts were by and large justified, i.e. each line had approximately equal length. As a consequence, words often continued on the next line. However, hyphens were used much less than in modern texts, where they are more or less obligatory.
We recommend that hyphens are encoded whenever they occur in the manuscript, using the <c> element. This element should have a @type attribute with the value “hyphen” and, optionally, a @resp attribute specifying the person responsible for the hyphenation, e.g. a later hand. If the scribe is responsible for the hyphenation, we suggest that the @resp attribute be left out.
When there is no hyphen, we do not think it is necessary to supply a hyphen, as long as the word is placed within the <w> element, which also contains an <lb> element. A hyphen can then be supplied by the stylesheet at one or more levels, as indicated by the <lb> element. We suggest that a missing hyphen should not be displayed on the facsimile level, but on the diplomatic and normalised levels.
|Element / attribute / value||Contents|
|@type||states the type of character. Recommended value:|
|'hyphen'||meaning that there is a hyphen in the margin, only to be rendered when the word appears at the end of the line (soft hyphen)|
|@resp||states who is responsible for the hyphenation. Suggested value:|
|'#h2'||the hyphenation has been supplied by a hand specified and numbered in the header, typically a second or later hand|
If there is a hyphen in the margin of the line, we suggest this encoding:
This is how <w>hyphen<c type="hyphen">-</c><lb ed="ms" n="2"/>ation</w> can be encoded when there actually is a hyphen in the manuscript.
If the hyphen is missing in the manuscript, and that happens quite frequently, we suggest that no hyphen is encoded:
This is how <w>hyphen<lb ed="ms" n="2"/>ation</w> can be encoded when there is no hyphen in the manuscript.
In the latter case, a suitable stylesheet can add a hyphen to the display of the word so as to simplify the reading for the users. The stylesheet should render the hyphenation in such a way (e.g. by using a different-looking hyphen character) that the users will understand the difference between hyphenation in the manuscript and supplied hyphenation.
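The display logic can be sketched in a few lines of Python. This is only an illustration of the principle, not the Menota stylesheet: the function name `render_word` and the choice of a middot (U+00B7) as the mark for a supplied hyphen are assumptions, and any other visually distinct character would serve equally well.

```python
import xml.etree.ElementTree as ET

SUPPLIED = "\u00b7"  # assumed mark for a hyphen supplied by the display


def render_word(w_xml, supply_missing=True):
    """Render a <w> element as text, with a newline at each <lb>."""
    w = ET.fromstring(w_xml)
    out = w.text or ""
    hyphen_pending = False  # True when a manuscript hyphen directly precedes
    for child in w:
        if child.tag == "c" and child.get("type") == "hyphen":
            out += child.text or "-"  # hyphen present in the manuscript
            hyphen_pending = True
        elif child.tag == "lb":
            if not hyphen_pending and supply_missing:
                out += SUPPLIED  # hyphen missing in the manuscript
            out += "\n"
            hyphen_pending = False
        else:
            out += "".join(child.itertext())
            hyphen_pending = False
        out += child.tail or ""
    return out.strip()


with_hyphen = '<w>hyphen<c type="hyphen">-</c><lb ed="ms" n="2"/>ation</w>'
without_hyphen = '<w>hyphen<lb ed="ms" n="2"/>ation</w>'
print(render_word(with_hyphen))
print(render_word(without_hyphen))
```

Calling `render_word(without_hyphen, supply_missing=False)` reproduces a facsimile-like display in which nothing is added at the line break.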
At the normalised level, there will sometimes be hard hyphens such as in the name “Egill Skalla-Grímssonar” (also spelt “Egill Skallagrímssonar”). This type of hyphen should be encoded with the ordinary hyphen character:
This is how the name <w>Egill</w> <w>Skalla-Grímssonar</w> can be encoded.
The ordinary hyphen will be displayed in any position of the word, whether in the line or at the end of a line.
5.5.1 Hyphenation in a single-level transcription
The encoding of hyphenation in single-level transcription (cf. ch. 4) is essentially the same as in a multi-level transcription. We give examples of both.
This is how the hyphenated word “hæ-góma” (normalised “hégóma”) in lines 3–4 in fig. 5.5 should be encoded on the diplomatic level:
<w> hæ<c type="hyphen">-</c><lb ed="ms" n="4"/>góma </w>
The non-hyphenated word “hwilk-kæ” in fig. 5.6, lines 2–3 should receive this encoding, not recording any hyphenation:
<w> hwilk<lb ed="ms" n="3"/>kæ </w>
5.5.2 Hyphenation in a multi-level transcription
In a multi-level transcription, the rules for hyphenation will be identical to the ones above. This would be the encoding of the hyphen in fig. 5.5 above:
<w> <choice> <me:facs>hæ<c type="hyphen">-</c><lb ed="ms" n="4"/>góma</me:facs> <me:dipl>hæ<c type="hyphen">-</c><lb ed="ms" n="4"/>góma</me:dipl> <me:norm>hé<c type="hyphen">-</c><lb ed="ms" n="4"/>góma</me:norm> </choice> </w>
When the hyphen is missing, as in fig. 5.6 above, we recommend that the encoder simply encodes the word as it is, leaving it to the style sheet to display a hyphen:
<w> <choice> <me:facs>hwilk<lb ed="ms" n="3"/>kæ</me:facs> <me:dipl>hwilk<lb ed="ms" n="3"/>kæ</me:dipl> <me:norm>hwil<lb ed="ms" n="3"/>kæ</me:norm> </choice> </w>
In this example, the encoder might decide to render the word as “hwilkæ” on the normalised level, assuming that the line break in the manuscript had led to the dittography “hwilkkæ”.
Note that a line break will appear several times in a multi-level transcription if it occurs within a word. Great caution must therefore be taken with automatic numbering of <lb> elements.
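To make the risk concrete, the sketch below counts line breaks in two ways: naively, and by visiting only one level inside each <choice>. It is a hedged illustration in Python; the wrapping `<p>` element, the function name `count_line_breaks` and the namespace URI are assumptions, and the word is the non-hyphenated example from fig. 5.6.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"  # assumed namespace URI for the me: prefix
LEVELS = {f"{{{ME}}}{name}" for name in ("facs", "dipl", "norm")}

SAMPLE = f"""<p xmlns:me="{ME}"><lb ed="ms" n="2"/>
  <w><choice>
    <me:facs>hwilk<lb ed="ms" n="3"/>kæ</me:facs>
    <me:dipl>hwilk<lb ed="ms" n="3"/>kæ</me:dipl>
    <me:norm>hwil<lb ed="ms" n="3"/>kæ</me:norm>
  </choice></w>
</p>"""


def count_line_breaks(elem, level="facs"):
    """Count <lb> elements, descending into only one level of each <choice>."""
    if elem.tag in LEVELS and elem.tag != f"{{{ME}}}{level}":
        return 0  # skip the parallel readings on the other levels
    own = 1 if elem.tag == "lb" else 0
    return own + sum(count_line_breaks(child, level) for child in elem)


root = ET.fromstring(SAMPLE)
print(len(root.findall(".//lb")))  # naive count: every <lb> in the file
print(count_line_breaks(root))     # count along the facs level only
```

The naive count sees the word-internal line break three times, once per level, whereas the level-aware count sees it once.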
Display of hyphenation
The display of hyphenation varies with the stylesheets being used.
In the Menota archive, hyphenation in the manuscript will be displayed with a hyphen on all levels, while missing hyphenation will not be displayed with any character on the <me:facs> level, but (as a help for the users) with a middot (U+00B7) on the <me:dipl> and <me:norm> levels. This is an approximate display:
|10 oc æro þæír þa ȷmıſkun konongs
11 oc ſuare þæír ꝼírí þat er aꞇ lo-
12 gum æıgu aꞇ ſuara
13 NU eꝼ hærs er
14 uon ȷlanð varꞇ. þa ſkulu
15 mænn víꞇa voꝛð ræíða. þa
16 ſkall lænðꝛ maðꝛ. æða vmboðs
17 maðꝛ gs laꞇa ſkera boð. en ſa
|10 oc æro þæir þa imiskun konongs
11 oc suare þæir firi þat er at lo-
12 gum æigu at suara
13 NU ef hærs er
14 uon jlanð vart. þa skulu
15 mænn vita vorð ræiða. þa
16 skall lænðr maðr. æða vmboðs·
17 maðr konongs lata skera boð. en sa
|10 oc eru þeir þá í miskunn konungs.
11 Ok svari þeir fyrir þat er at lǫ-
12 gum eigu at svara.
13 Nú ef hers er
14 ván í land várt, þá skulu
15 menn vitavǫrð reiða. Þá
16 skal lendr maðr eða umboðs·
17 maðr konungs láta skera boð. En sá
At the end of line 11 in the table above, there is a hyphen in the manuscript, and this is displayed as such on all three levels. In line 16, there is no hyphen, and consequently no display on the <me:facs> level. However, since “umboðs maðr” has been analysed as one word, a middot is displayed on the <me:dipl> and <me:norm> levels.
In the three Menota stylesheets offered in Appendix F.3 of this handbook, hyphens in the manuscript are displayed only on the <me:facs> level, since at this level, line breaks are displayed according to the manuscript. On the <me:dipl> and <me:norm> levels, hyphens in the manuscript are not displayed, since the text is rendered in continuous lines on these levels. As for missing hyphenation, no hyphen is displayed on the <me:facs> level, nor is any hyphen displayed on the <me:dipl> and <me:norm> levels. However, since the text is rendered in continuous lines on the latter two levels, words will be displayed without any breaks. See Fig. 5.8–5.10 for examples, using the same text as in Fig. 5.7.
|Facs : approximate display according to the Menota stylesheet in Appendix F|
|10 oc æro þæír þa ȷmıſkun konongs
11 oc ſuare þæír ꝼírí þat er aꞇ lo-
12 gum æıgu aꞇ ſuara
13 NU eꝼ hærs er
14 uon ȷlanð varꞇ. þa ſkulu
15 mænn víꞇa voꝛð ræíða. þa
16 ſkall lænðꝛ maðꝛ. æða vmboðs
17 maðꝛ gs laꞇa ſkera boð. en ſa
|Dipl : approximate display according to the Menota stylesheet in Appendix F|
|oc æro þæir þa imiskun konongs oc suare þæir firi þat er at logum æigu at suara
NU ef hærs er uon jlanð vart. þa skulu mænn vita vorð ræiða. þa skall lænðr maðr.
æða vmboðsmaðr konongs lata skera boð. en sa
|Norm : approximate display according to the Menota stylesheet in Appendix F|
|oc eru þeir þá í miskunn konungs. Ok svari þeir fyrir þat er at lǫgum eigu at svara.
Nú ef hers er ván í land várt, þá skulu menn vitavǫrð reiða. Þá skal lendr maðr
eða umboðsmaðr konungs láta skera boð. En sá
According to the Menota stylesheets in Appendix F.3, the hyphen in “lo-gum” (end of line 11) is displayed only on the <me:facs> level, otherwise not. Neither is the missing hyphen in “vmboðs maðr” (end of line 16) indicated on any level, but since “umboðs maðr” has been encoded in a single <w> element, it is displayed as a single word on the <me:dipl> and <me:norm> levels.
5.6 Dialogue and quotations
Many medieval texts contain ample dialogue. We recommend that dialogue is encoded with the <q> element for each turn in the dialogue (e.g. for each question and answer). In the multi-level model recommended by Menota, quotation marks will typically be displayed on the <me:norm> level, sometimes also on the <me:dipl> level, but never on the <me:facs> level. There were no quotation marks in the manuscripts, so a display on the <me:facs> level would be rather anachronistic.
Texts may also contain quotations from other sources. These are usually not indicated by quotation marks, but might be displayed by italics or the like, or perhaps by a note. Stylesheets will vary with respect to the display of quotations.
|<q>||contains a part of a dialogue|
|<quote>||contains a quotation from another source|
Since the graphical form of quotation marks varies widely, we recommend encoding dialogue with the <q> element and leaving it to the style sheet to decide which type of quotation mark is displayed. If the encoder wishes to be very specific about the type of mark, this can be added in a
The <q> element is placed outside the word(s) in the dialogue, irrespective of whether the encoding is on one or more levels. This is a simplified single-level example from Niðrstigningar saga in the fragment AM 233 a fol:
<w>Þeir</w> <w>spurðu</w> <w>þá</w> <w>hverir</w> <w>vǽri</w> <pc>,</pc> <q> <w>er</w> <w>þit</w> <w>hafið</w> <w>eigi</w> <w>dauðir</w> <w>verit</w> <w>með</w> <w>oss</w> <w>í</w> <w>helvíti</w> <pc>.</pc> </q> <w>Þá</w> <w>svaraði</w> <w>annarr</w> <w>þeira</w> <w>ok</w> <w>mǽlti</w> <pc>:</pc> <q> <w>Enoch</w> <w>heiti</w> <w>ek</w> <pc>.</pc> </q> etc.
Note that on the normalised level, a comma or a colon will often be added in a <pc> element before a new turn in the dialogue, irrespective of whether there is a punctuation mark in the manuscript or not. Also note that the <q> element closes after the final <pc> element.
A successful stylesheet will display this encoding so that the quotation marks are of the intended opening and closing type, that there is a space between an introductory comma or colon and the opening quotation mark, and that there is another space after the closing quotation mark. This would be a correct display, in which Anglo-American quotation marks have been used:
Þeir spurðu þá hverir vǽri, “er þit hafið eigi dauðir verit með oss í helvíti.” Þá svaraði annarr þeira: “Enoch heiti ek, ok var ek við Guðs orði hingat fǿrðr.”
Depending on the specifications in the style sheet, French-style quotation marks (also frequently used in Scandinavia) may be selected for the display:
Þeir spurðu þá hverir vǽri, «er þit hafið eigi dauðir verit með oss í helvíti.» Þá svaraði annarr þeira: «Enoch heiti ek, ok var ek við Guðs orði hingat fǿrðr.»
German-style quotation marks may also be selected for the display:
Þeir spurðu þá hverir vǽri, „er þit hafið eigi dauðir verit með oss í helvíti.“ Þá svaraði annarr þeira: „Enoch heiti ek, ok var ek við Guðs orði hingat fǿrðr.“
Sometimes, quotation marks appear within quotation marks. The <q> element allows for nesting, e.g.
<q> <w>This</w> <w>city</w> <w>is</w> <w>called</w> <q> <w>Jorvík</w> </q> <pc>,</pc> </q> <w>he</w> <w>said</w> <pc>.</pc>
Ideally, the style sheet should display nested quotations with different marks, as in this example (using the Anglo-American style):
“This city is called ‘Jorvík’,” he said.
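How a display routine might choose marks by nesting depth can be sketched as follows. This is an assumption-laden illustration in Python rather than the Menota stylesheet: the wrapping `<s>` element, the `MARKS` table and the function name `render` are invented for the example, and only single-level <w>, <pc> and <q> elements are handled.

```python
import xml.etree.ElementTree as ET

# Opening and closing marks by nesting depth (Anglo-American style).
MARKS = [("\u201c", "\u201d"), ("\u2018", "\u2019")]

SAMPLE = """<s>
  <q> <w>This</w> <w>city</w> <w>is</w> <w>called</w>
    <q> <w>Jorvík</w> </q> <pc>,</pc> </q>
  <w>he</w> <w>said</w> <pc>.</pc>
</s>"""


def render(elem, depth=0):
    """Render <w>, <pc> and (possibly nested) <q> elements as a string."""
    if elem.tag == "w":
        return " " + (elem.text or "")  # a space before each word
    if elem.tag == "pc":
        return elem.text or ""          # no space before punctuation
    inner_depth = depth + 1 if elem.tag == "q" else depth
    inner = "".join(render(child, inner_depth) for child in elem)
    if elem.tag == "q":
        opening, closing = MARKS[depth % len(MARKS)]
        return " " + opening + inner.lstrip() + closing
    return inner


print(render(ET.fromstring(SAMPLE)).strip())
```

Replacing the two pairs in `MARKS` is enough to switch to French-style or German-style marks; the encoding itself stays untouched.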
As stated above, some texts contain quotations. If the encoder wants to identify these, we recommend the <quote> element. This is a new example from Niðrstigningar saga in AM 233 a fol:
... <w>svá</w> <w>sem</w> <w>ritat</w> <w>er</w> <pc>:</pc> <quote> <w>Et</w> <w>multa</w> <w>corpora</w> <w>sanctorum</w> <w>qui</w> <w>dormierant</w> <w>surrexerunt</w> <pc>.</pc> </quote> <note>Matt 27:52</note>
The source of the quotation may be given in a
<note> element, as shown above.
5.7 White space
This subchapter discusses the encoding of what is not (in a sense) in the manuscript: white space between words and around punctuation. In addition to the elements already introduced in this chapter, one more element will be used:
|<num>||contains a number, including any delimiters|
In a single-level transcription, spaces may simply be inserted with the space bar. Note that in XML as well as in HTML, any sequence of consecutive white space characters (spaces, tabs and line breaks) is interpreted as a single space. It is not possible to encode a long space in the manuscript simply by hitting the space bar several times. Any distinctions in space length must be encoded specifically. In our experience, there is no significant variation in word spacing in Medieval Nordic manuscripts. If, however, a transcriber believes there is more than one length of space, the simplest way of encoding this is probably to define the standard space, code point 0020, as the default space and to define deviating spaces with reference to the list of various space lengths in the Unicode chart General Punctuation, 2000–200B. For recommended entities, see the MUFI character recommendation.
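The effect can be imitated outside a browser or stylesheet. The sketch below is a simplified model in Python, assuming that the display collapses runs of the four XML white space characters (space, tab, carriage return, line feed); the function name `collapse` is hypothetical, and EM SPACE (U+2003) is used only as one example of a deviating space from the General Punctuation chart.

```python
import re

# Space, tab, carriage return and line feed: the XML white space characters.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Collapse each run of XML white space into a single space."""
    return XML_WHITESPACE.sub(" ", text)


typed = "Hann   er"         # the space bar pressed three times
encoded = "Hann\u2003er"    # a long space encoded as EM SPACE (U+2003)

print(repr(collapse(typed)))
print(repr(collapse(encoded)))
```

The three typed spaces are reduced to one, while the explicitly encoded EM SPACE is not white space in the XML sense and is therefore left as it is.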
As for the interpretation and display of spaces in a multi-level transcription, we suggest the following three rules:
1. A transcription using the <w> and the <pc> elements should be displayed with a space immediately after each element.
The example in ch. 5.4.1 above would then be interpreted (e.g. by an XSLT stylesheet) as
En ef ver fallum i hinar fornno syndir . oc huerfum aptr .
This is correct in so far as there should be a space after each punctuation mark, but wrong in so far as there should not be a space before the punctuation mark. The following addition to the general rule must be made with respect to the <pc> element:
2. When displaying the text, there should not be any white space before a <pc> element.
The example above will then be correctly displayed as
En ef ver fallum i hinar fornno syndir. oc huerfum aptr.
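The interplay of the two rules (a space after each <w> and <pc> element, but none before a <pc> element) can be sketched in Python. The wrapping `<s>` element, the function name `display` and its `apply_rule_2` switch are assumptions for this illustration, and the words are given on a single level for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<s><w>En</w><w>ef</w><w>ver</w><w>fallum</w><w>i</w><w>hinar</w>"
    "<w>fornno</w><w>syndir</w><pc>.</pc><w>oc</w><w>huerfum</w>"
    "<w>aptr</w><pc>.</pc></s>"
)


def display(xml_fragment, apply_rule_2=True):
    """Join <w> and <pc> elements into a display string."""
    out = ""
    for elem in ET.fromstring(xml_fragment):
        if elem.tag == "pc" and apply_rule_2:
            out = out.rstrip(" ")        # rule 2: no space before <pc>
        out += (elem.text or "") + " "   # rule 1: a space after each element
    return out.rstrip(" ")


print(display(SAMPLE, apply_rule_2=False))
print(display(SAMPLE))
```

With rule 1 alone the dots float between spaces; adding rule 2 attaches each dot to the preceding word.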
A transcriber might also want to indicate direct speech by means of quotation marks, e.g.
Hann segir, “Ek veit ekki.”
We recommend that this is done with the <q> element (see also ch. 5.4.2). Any number of elements can be placed within the <q> element and thus be marked as direct speech. A suitable stylesheet can insert quotation marks in the display with the correct amount of white space before or after the quotation (usually on the normalised level of the transcription only).
<w>Hann</w> <w>segir</w> <pc>,</pc> <q> <w>Ek</w> <w>veit</w> <w>ekki</w> <pc>.</pc> </q>
Roman numerals, which are typically delimited by a dot immediately before and after the number, are an exception to rule 2:
Hann er .xij. vetra gamall.
We recommend that the delimiters are encoded as part of the number, and thus contained in the <num> element:
<w>Hann</w> <w>er</w> <num>.xij.</num> <w>vetra</w> <w>gamall</w> <pc>.</pc>
If this text is going to be annotated for morphology, we recommend that the lemma for the Roman numeral is given as a number, in this case lemma="12".
When a number is spelt out in the text, we recommend using the <w> element inside the <num> element:
<w>Hann</w> <w>er</w> <num><w>tolf</w></num> <w>vetra</w> <w>gamall</w> <pc>.</pc>
In the case of annotation, the lemma should be given as a word, e.g. lemma="tolf". See ch. 11.2 for more details on lemmatisation.
If an ordinary punctuation mark is positioned immediately before a word rather than after the preceding word, we recommend that a @rend attribute is used with the value “rightlocation”. Thus,
Hann kemr .opt.
should be encoded as
<w>Hann</w> <w>kemr</w> <pc rend="rightlocation">.</pc> <w>opt</w> <pc>.</pc>
The stylesheet can then be instructed to position the first punctuation mark accordingly, i.e. immediately in front of the following word.
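A display routine that honours both the <num> element and the “rightlocation” value might look like the following Python sketch. It is an illustration under stated assumptions, not the Menota stylesheet: the wrapping `<s>` element and the function name `display` are hypothetical, and only single-level markup is handled.

```python
import xml.etree.ElementTree as ET


def display(xml_fragment):
    """Join <w>, <num> and <pc> elements into a display string."""
    out = ""
    for elem in ET.fromstring(xml_fragment):
        text = "".join(elem.itertext()).strip()
        if elem.tag == "pc" and elem.get("rend") == "rightlocation":
            out += text                          # attaches to the next word
        elif elem.tag == "pc":
            out = out.rstrip(" ") + text + " "   # attaches to the previous word
        else:                                    # <w> or <num>
            out += text + " "
    return out.rstrip(" ")


numeral = ("<s><w>Hann</w><w>er</w><num>.xij.</num><w>vetra</w>"
           "<w>gamall</w><pc>.</pc></s>")
located = ('<s><w>Hann</w><w>kemr</w><pc rend="rightlocation">.</pc>'
           "<w>opt</w><pc>.</pc></s>")
print(display(numeral))
print(display(located))
```

Because the delimiting dots are part of the content of <num>, they are never treated as <pc> elements and keep the space in front of them.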
Finally, the following addition to the general rule must be made with respect to the <seg> element:
3. If two or more <w> elements are contained in a <seg> element (type="nb"), in the display on the <me:facs> and <me:dipl> levels there should not be any space after the <w> elements except for the last one contained in the <seg> element.
Thus, the following sequence
<seg type="nb"> <w> <choice> <me:facs>a</me:facs> <me:dipl>a</me:dipl> <me:norm>á</me:norm> </choice> </w> <w> <choice> <me:facs>lande</me:facs> <me:dipl>lande</me:dipl> <me:norm>landi</me:norm> </choice> </w> </seg>
should be displayed as “alande” on the <me:facs> and the <me:dipl> level, with no word division, but as “á landi” on the <me:norm> level, with word division. In the latter case, rule (1) applies, which states that a space should be displayed after each <w> element. In the former case, rule (3) entails that no space should be displayed after the first of the two words in the <seg> element. Also see ch. 5.3.2 above.
The above-mentioned rules 1–3 are part of the Menota XSLT stylesheet. When the stylesheet is applied to texts encoded according to these guidelines, white space should be displayed correctly. See also Appendix F.2.
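As a rough illustration of rule 3 (this is a Python sketch, not the XSLT stylesheet itself), the following code displays words in a <seg type="nb"> element without internal spaces on the facs and dipl levels and with spaces on the norm level. The wrapping `<s>` element, the function names `reading` and `display`, and the namespace URI bound to the `me:` prefix are assumptions.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"  # assumed namespace URI for the me: prefix

SAMPLE = f"""<s xmlns:me="{ME}"><seg type="nb">
  <w><choice><me:facs>a</me:facs><me:dipl>a</me:dipl><me:norm>á</me:norm></choice></w>
  <w><choice><me:facs>lande</me:facs><me:dipl>lande</me:dipl><me:norm>landi</me:norm></choice></w>
</seg></s>"""


def reading(w, level):
    """Return the text of one level (facs, dipl or norm) of a <w> element."""
    node = w.find(f".//{{{ME}}}{level}")
    return "".join(node.itertext()).strip() if node is not None else ""


def display(root, level):
    """Apply rule 1 and rule 3 when joining words on the given level."""
    out = ""
    for elem in root:
        if elem.tag == "seg" and elem.get("type") == "nb":
            # Rule 3: no space between the words on the facs and dipl levels.
            joiner = " " if level == "norm" else ""
            out += joiner.join(reading(w, level) for w in elem) + " "
        elif elem.tag == "w":
            out += reading(elem, level) + " "  # rule 1
    return out.rstrip(" ")


root = ET.fromstring(SAMPLE)
for level in ("facs", "dipl", "norm"):
    print(level, display(root, level))
```

The space after the last word in the segment is kept on all levels, so that the segment as a whole is still separated from the following word.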