
Chapter 5. Characters and words

5.1 Introduction
5.2 Characters
5.3 Words
5.4 Punctuation
5.5 Hyphenation
5.6 White space

Version 3.0 beta

This is a preliminary version which can be changed or updated at any time.
The revision and updating of this chapter has been done by Odd Einar Haugen and Beeke Stegmann.

 

5.1 Introduction

When transcribing a text, the transcriber will usually make a distinction between the individual characters, the white space between some of the characters, the words made up by sequences of characters, and the punctuation marks which are inserted between some of the words. The actual encoding can be as straightforward as the example in ch. 2.3 above, in which characters, punctuation marks and spaces have been typed directly from the keyboard:

Reiðr var þá Vingþórr
er hann vaknaði
ok síns hamars
um saknaði,
skegg nam at hrista,
skör nam at dýja,
réð Jarðar burr
um at þreifask.

In a more complex encoding, the transcriber might like to identify the basic units as such, so that a distinction can easily be drawn between single characters, words, punctuation marks and the white space surrounding them. This chapter will discuss these basic units and how they can be encoded specifically, if needed, using elements like <c> for individual characters and <w> for individual words.


5.2 Characters

The basic unit in any transcription of an alphabetic script is the individual letter. In a linguistic context a distinction is often drawn between the abstract entity of a grapheme and the representation of graphs in a written document. Variant forms are referred to as allographs, e.g. the Roman type of s and the Fraktur (black letter) type. The terminology is analogous to the distinction between phonemes, phones and allophones. For a general introduction to this terminology, see e.g. Sture Allén 1971, Manfred Kohrt 1985 or Christa Dürscheid 2016.

In this handbook we shall adopt the terminology of the Unicode Standard. The fundamental distinction drawn is between characters and glyphs. Characters are, as Unicode defines it, “the smallest components of written language that have semantic value”, while glyphs are “the shapes that characters can have when they are rendered or displayed” (cf. Unicode 9.0, ch. 2.2 Unicode Design Principles). What the transcriber sees in the source document is a series of individual glyphs, and the act of transcribing essentially involves linking these glyphs to the characters at the transcriber’s disposal.

The concept of a character is similar to, but not identical with the linguistic concept of a grapheme. These concepts are notoriously difficult, but for the purposes of this handbook we believe that the Unicode usage is robust and sufficiently well-defined.

The Unicode Standard puts great emphasis on the fact that individual characters may be represented by a number of glyphs, and is therefore reluctant to accept as new characters what it perceives to be variant glyphs. It will be obvious to most people that the various shapes of letters in printed typefaces, such as Courier, Times, Lucida etc., should not be seen as different characters, as shown in fig. 5.1.

Fig. 5.1. Various shapes (glyphs) of the characters “A” and “a” in Courier, Times and Lucida typefaces

Unicode draws a distinction between small (minuscule) characters such as “a” and large (majuscule) characters such as “A”, since there is a possible semantic value attached to each set of characters. Thus, “the white house” can refer to any house which is white in colour, while “the White House” refers (normally) to one specific building. It can be argued that the same applies to the distinction between regular characters, “a”, and italics, “a”. For example, while “Metope” refers to a poem by the Norwegian author Olaf Bull, Metope in italics (according to a widespread bibliographical practice) refers to the book in which this poem is published (a book which, coincidentally, bears the same name as one of the poems contained in it). However, Unicode does not regard italics (or bold type) as individual characters. There are good reasons for this, but the example serves to illustrate the fact that the definition of a character is not always clear-cut.

Medieval Nordic manuscripts were written in the Latin alphabet from the very beginning. The basic inventory is thus the characters a-z / A-Z. They were supplemented with a number of new (or borrowed) characters, several ligatures and a variety of diacritical marks. There was also a large number of abbreviation marks in use, especially in Old Icelandic and Old Norwegian manuscripts. In fact, some abbreviation marks behave as ordinary characters in the sense that they occupy a separate position on the base line. On the other hand, many components of ordinary characters are diacritical, i.e. placed above (or through or below) another character, and thus akin to typical abbreviation marks. This means that the rules for transcribing ordinary characters and abbreviation marks should be identical.

We believe that it is possible to identify a base line in all texts, as shown in fig. 5.2. We recommend that the transcriber identifies each separate character on the base line and records it in the same sequence as in the manuscript. Thus, the characters in fig. 5.2 would be transcribed as “abpþ” or “abp&thorn;”. The last character may be encoded with its Unicode code point, “þ” at 00FE, or with an entity, “&thorn;” (as explained in ch. 2.5 above). The two encodings are strictly equivalent.

Fig. 5.2. Position of characters on the base line

If there are marks of any sort placed above, through or below any base line character, we recommend that these marks (if they are to be interpreted as characters) are transcribed immediately after the base line character. In general, we refer to these marks as diacritics. As mentioned above, abbreviation marks are also frequently written above (and in some cases through or below) a base-line character. Assuming that the sign above “h” should be referred to with the entity “&er;”, the transcription of the very first word in fig. 5.3 would be “h&er;”.

Fig. 5.3. Diacritical marks and abbreviation marks

Diacritical marks are often seen as forming an integral part of a base line character, and the whole is then encoded as a single character. This applies to accent marks, such as the one above “e” in fig. 5.3. This combination of a base line character and a combining mark can be encoded as a single character, in Unicode referred to as LATIN SMALL LETTER E WITH ACUTE, with the hexadecimal code point 00E9. Alternatively, this letter can be decomposed and encoded as a combination of LATIN SMALL LETTER E and COMBINING ACUTE ACCENT. We would like to emphasize that both encodings are strictly equivalent.
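
The equivalence of the two encodings of “é” can be checked with a few lines of Python. This is a minimal sketch using the standard `unicodedata` module; the helper name `same_character` is our own and not part of any Menota tool. It shows that the two encodings differ as raw strings, but become identical once both are brought to the same Unicode normalisation form (here NFC).

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def same_character(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ (one code point versus two) ...
raw_equal = precomposed == decomposed
# ... but they are canonically equivalent.
canonical_equal = same_character(precomposed, decomposed)

# Decomposing the single character yields the two names used in the text.
decomposed_names = [
    unicodedata.name(ch) for ch in unicodedata.normalize("NFD", precomposed)
]
```

In practice this means that software comparing or searching transcriptions should normalise the text first, since a plain string comparison will treat the two encodings as different.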

Abbreviation marks, on the other hand, are usually treated as separate characters and encoded as characters in their own right. From a purely graphical point of view, the distinction between the acute accent in “é” and abbreviation marks such as the “zigzag” mark and the bar, both exemplified in fig. 5.3, is far from obvious, but the semantics are different. The acute accent may in some manuscripts be used to signify length, but it is often used quite freely, sometimes only to distinguish one minim character from another. Abbreviation marks have a definite (if sometimes ambiguous) meaning and can be expanded into one or more characters; the zigzag mark above “h” in fig. 5.3 signifies “er”, and the bar above “n” signifies another “n”.

5.2.1 Rules for encoding characters

We suggest the following basic rules for encoding characters, irrespective of whether they are ordinary (alphabetic) characters or abbreviation marks.

1. Each character is encoded according to its position in the direction of writing.

2. Alphabetical characters on the base line are encoded first:

2.1 If the character belongs to the ordinary Latin character set a-z / A-Z (commonly known as ISO 646 or Basic Latin) it is always encoded as such.
2.2 Characters outside Basic Latin should either be encoded by Unicode codepoints or by entities, e.g. either as “abpþ” (recommended) or as “abp&thorn;”.
2.3 Characters which are not part of the Unicode Standard must always be encoded by entities. See Appendix A for more details.

3. Abbreviation marks occupying a separate position on the base line are encoded in the same manner as alphabetical characters. This applies to e.g. LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER (for “per” or “par”), as explained in ch. 6 below.

4. Alphabetical characters with diacritical marks, e.g. “é”, are encoded in one of two equivalent ways:

4.1 As a base line character + one or more combining marks. Thus the character “é” would be encoded as “e” + “&combacute;” (the latter entity meaning COMBINING ACUTE ACCENT).
4.2 As a composite base line character and encoded with a single Unicode code point or an entity. Thus, the character “é” would be encoded as either “é” or as “&eacute;”.

5. Characters with abbreviation marks are encoded in the same manner as alphabetical characters, i.e. in one of two equivalent ways:

5.1 As a base line character + one or more combining marks. Thus the first character in fig. 5.3 above would be encoded as “h” + “&er;” (the latter entity meaning COMBINING ABBREVIATION MARK “ER”).
5.2 As a composite base line character and encoded with a single entity. Thus the above character might be encoded with a single entity, e.g. as “&her;”.

As a rule, we would recommend the first solution, since the number of combinations of base line characters and combining abbreviation marks is very high. Furthermore, we recommend that the abbreviation mark is identified by the <am> element (if it is encoded as such, typically on the facsimile level) or by the <ex> element (if it has been expanded, in this case as “er”, typically on the diplomatic level). See ch. 4 above for an explanation of levels.

6. If there is more than one combining character, they are encoded in this order:

(a) Combinations with the base line character within the x height of the base line character.
(b) Combinations with the base line character outside its x height, but still in contact with it.
(c) Combinations with the base line character outside its x height and without any contact with it.

7. If there is more than one combining character in any of the three positions defined in (6) above, they are encoded in a clockwise direction, beginning at 6 o’clock and moving through 9 o’clock, 12 o’clock etc.

5.2.2 Entities and Unicode values

By using entities it is possible to define as many characters as one believes are necessary for the transcription of a certain corpus of texts. However, since most applications now fully support Unicode, we recommend that characters in the Unicode Standard are encoded by their Unicode code points.

Note that the type of encoding is specified at the very beginning of an XML file. If the specification is

<?xml version="1.0" encoding="ISO-8859-1"?>

entities must be used for all characters outside Basic Latin and Latin-1 Supplement. Thus, “a”, “é” and “þ” can be entered directly, but characters like “ǫ” (LATIN SMALL LETTER O WITH OGONEK) must be encoded with an entity, “&oogon;”.

If, however, the encoding is specified as

<?xml version="1.0" encoding="UTF-8"?>

all characters in the Unicode Standard can be encoded with their Unicode code points, without resorting to entities.
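
The difference between the two declarations can be illustrated by asking Python whether a character can be stored in a given encoding. This is only a sketch: the function name `needs_entity` is hypothetical, and the test simply relies on Python's built-in codecs raising `UnicodeEncodeError` for characters outside the repertoire of an encoding.

```python
def needs_entity(char: str, encoding: str) -> bool:
    """Return True if `char` cannot be written directly in `encoding`."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# "a", "é", "þ" and "ǫ" (LATIN SMALL LETTER O WITH OGONEK, 01EB)
samples = ["a", "\u00e9", "\u00fe", "\u01eb"]

latin1 = {char: needs_entity(char, "iso-8859-1") for char in samples}
utf8 = {char: needs_entity(char, "utf-8") for char in samples}
```

With ISO-8859-1 only “ǫ” requires an entity, while with UTF-8 none of the four characters does.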

In TEI P5, all entities must be declared in a separate list. A complete list of entities for Medieval Nordic texts is part of the Menota schema, and can be consulted in Appendix D.1.1. An encoding using these entities will always be valid with respect to character encoding (but may, of course, be invalid for other reasons). In the Menota schema, entities are linked to code points defined in the MUFI character recommendation, so that if a Menota text is displayed with a fully compliant MUFI font, all entities will be displayed correctly.

The Basic Multilingual Plane of the Unicode Standard has 65,536 different code points. This includes a large Private Use Area (PUA), comprising 6,400 code points. This area can be used for characters not defined in the Standard (so far). Our present recommendation is to use this area for characters not included in the Unicode Standard and to coordinate the allocation of code points with the recommendations of the Medieval Unicode Font Initiative. It should be noted that the use of the PUA is an interim solution. A long-term solution is to apply to Unicode for the inclusion of additional characters and/or use other rendering techniques (such as OpenType).

Code points in Unicode are usually given in hexadecimal format, in which each digit spans a sequence of 16 positions, 0-1-2-3-4-5-6-7-8-9-A-B-C-D-E-F. Thus, 0001 equals 1 in the decimal system, 000F equals 15, 0010 equals 16 etc. The whole range thus goes from 0000 to FFFF (65,535 in decimal), giving 65,536 code points. The PUA is located at E000-F8FF.
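
The arithmetic above can be verified with a short Python sketch. The constant and function names are our own; the only facts assumed are the PUA boundaries E000 and F8FF given in the text.

```python
PUA_START = 0xE000  # first code point of the Private Use Area in the BMP
PUA_END = 0xF8FF    # last code point of the Private Use Area in the BMP


def parse_code_point(hex_value: str) -> int:
    """Convert a hexadecimal code point such as '00FE' to an integer."""
    return int(hex_value, 16)


def in_bmp_pua(code_point: int) -> bool:
    """Return True if the code point lies in the PUA of the BMP."""
    return PUA_START <= code_point <= PUA_END


bmp_size = parse_code_point("FFFF") + 1   # 0000 is a code point too
pua_size = PUA_END - PUA_START + 1
```

A check like `in_bmp_pua(parse_code_point("EBAE"))` can be used to flag characters in a transcription that depend on a MUFI-compliant font for correct display.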

The Latin alphabet is the first to be described in the Unicode Standard. As was mentioned, many characters in Unicode can be defined in several ways, either as a single base line character (including any diacritical marks) or as a combination of a base line character and one or more combining marks.

(a) Commonly used characters have a single description in Unicode. This applies to all base line characters in the Latin alphabet.

Glyph Encoding Code point Unicode descriptive name
a 0061 LATIN SMALL LETTER A

(b) Composite characters may be described in more than one way. Thus, an “a with acute accent” can be encoded as a combination of an “a” and a combining acute accent or as a single character, “a with acute accent”. Both descriptions are equivalent:

Glyph Entity Code point Unicode descriptive name
a + &combacute; 0061 + 0301 LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
&aacute; 00E1 LATIN SMALL LETTER A WITH ACUTE

(c) Some characters are not found in Unicode and must therefore be assigned to the Private Use Area (PUA), either as a character with its own code point or as a combination of an existing character and a combining diacritical mark in the PUA. The ligature of “k” and “ſ” is not included in the Unicode Standard (as of v. 9.0), and since we would rather not encode it as a sequence of “k” + “zero width joiner” + “ſ”, we have assigned it to a code point in the PUA, EBAE.

Glyph Entity Code point Descriptive name
&kslonglig; EBAE LATIN SMALL LIGATURE K AND LONG S

Encoding with entities referring to the PUA may look unnecessarily complicated. It should be borne in mind, however, that the great majority of characters are defined in Unicode, and in many transcriptions the need for special characters in the PUA will not arise. With appropriate fonts, the transcriber does not need to spend much time on technicalities of this type.

Finally, it should be noted that a text may be encoded with a mixture of Unicode code points and entities even for characters within the Unicode Standard. For the sake of clarity, some encoders might like to insert combining marks as entities. Thus, the example above might be encoded as:

h&er; sér han&bar;

Or, with the element <am> for the abbreviation characters:

h<am>&er;</am> sér han<am>&bar;</am>

The two abbreviation characters COMBINING ZIGZAG ABOVE and COMBINING OVERLINE are part of the Unicode Standard, at 035B and 0305 respectively, so entities are not really needed. However, some XML editors may not show combining characters in correct positions, and it is thus more legible to use entities, “&er;” for the combining zigzag above and “&bar;” for the combining bar above.
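
The relation between the two entities and their Unicode code points can be sketched in Python. The mapping table and the function `expand_entities` below are illustrative only; a real Menota file would have its entities resolved by the XML parser from the Menota entity list, not by a regular expression.

```python
import re
import unicodedata

# Illustrative mapping for the two entities discussed above.
ENTITY_TO_CHAR = {
    "er": "\u035b",   # COMBINING ZIGZAG ABOVE
    "bar": "\u0305",  # COMBINING OVERLINE
}


def expand_entities(text: str) -> str:
    """Replace known &name; references; leave unknown ones untouched."""
    return re.sub(
        r"&(\w+);",
        lambda match: ENTITY_TO_CHAR.get(match.group(1), match.group(0)),
        text,
    )


expanded = expand_entities("h&er; sér han&bar;")
names = {name: unicodedata.name(char) for name, char in ENTITY_TO_CHAR.items()}
```

The expanded string contains exactly the same characters as a transcription typed directly with the code points 035B and 0305, which is why the two ways of encoding can be mixed freely.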

If an encoder, for some reason, would like to encode a character which is not in the Menota list of entities, this character has to be declared in the header of the file.

An ordinary Menota XML file will typically refer to the whole list of Menota entities in the third line of the file like this:

<!DOCTYPE TEI [
<!ENTITY % Menota_entities SYSTEM 
'http://www.menota.org/menota-entities.txt'>
%Menota_entities;
]>

If, however, the transcriber would like to add a couple of entities not included in the Menota list, they must be specified as a sequence of the entity and its rendering:

<!DOCTYPE TEI [
<!ENTITY % Menota_entities SYSTEM 
'http://www.menota.org/menota-entities.txt'>
%Menota_entities;
<!ENTITY trotdot "&#x0024;">
<!ENTITY eacutesup "&#x00A3;">
]>

In this example, it is specified that the first entity, “&trotdot;”, is going to be displayed as the hexadecimal character 0024, the dollar sign, and the second, “&eacutesup;”, as 00A3, the pound sign. These are stop-gap measures, and the transcriber decides the actual rendering. A long-term solution would be to work with Menota in order to add these entities to the Menota entity list.
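
The effect of such local declarations can be tried out with Python's standard XML parser. The sketch below is deliberately simplified: it declares only the two stop-gap entities in an internal subset and leaves out the external Menota entity list, and the <TEI> and <w> content is invented for the test.

```python
import xml.etree.ElementTree as ET

# A tiny document declaring the two stop-gap entities locally.
DOCUMENT = """<!DOCTYPE TEI [
<!ENTITY trotdot "&#x0024;">
<!ENTITY eacutesup "&#x00A3;">
]>
<TEI><w>a&trotdot;</w><w>b&eacutesup;</w></TEI>"""

root = ET.fromstring(DOCUMENT)

# The parser replaces each entity reference with its declared rendering.
rendered = [w.text for w in root.iter("w")]
```

After parsing, “&trotdot;” has become the dollar sign and “&eacutesup;” the pound sign, as described above.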

5.2.3 Encoding characters as such

In some cases, a character should be encoded as such. That kind of separate mark-up allows for association with additional metadata as well as easier processing. The TEI P5 Guidelines recommend the element <c> for this type of encoding, and we suggest also using the attribute @type (and potentially others) for further specification.

Element / attribute / value Contents
<c> (character) contains an individual character
@type type of character. Suggested values:
    'word' the character should be regarded as a full word
    'initial' the character is an initial
    'hyphen' the character is a hyphen

A character should, for instance, be encoded as such when it forms a word in itself instead of merely being part of a larger word. This can be the case if a character is the object of a grammatical discussion. A sentence like the following from the First Grammatical Treatise

X, hann er samsettr í látinu af c ok s.

would thus be encoded as

<w><c type="word">X</c></w>, hann er samsettr í 
  látinu af <w><c type="word">c</c></w> ok <w><c type="word">s</c></w>.

The usage of the attribute @type with the value 'word' distinguishes it from other kinds of characters one might want to mark up. Note that the <c> element is placed within the element <w>. This might seem somewhat redundant in this case, since that information is also provided by the attribute. However, if a character behaves like a word, such as in a sentence like “The left descender of the x’es in this script goes below the base line”, it has inflection and could easily be lemmatised as the noun x. (See ch. 11 on lemmatisation).
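
One practical benefit of this mark-up is that the characters under discussion can be retrieved mechanically. The following Python sketch assumes a wrapping <s> element, which we have added only to obtain a well-formed fragment, and ignores namespaces for the sake of brevity.

```python
import xml.etree.ElementTree as ET

# The sentence from the First Grammatical Treatise, wrapped in an
# assumed <s> element so that it can be parsed on its own.
SENTENCE = """<s><w><c type="word">X</c></w>, hann er samsettr í
  látinu af <w><c type="word">c</c></w> ok <w><c type="word">s</c></w>.</s>"""

root = ET.fromstring(SENTENCE)

# All characters that are marked as functioning as full words.
letters = [c.text for c in root.iter("c") if c.get("type") == "word"]

# The running text, with line breaks and indentation collapsed.
plain = " ".join("".join(root.itertext()).split())
```

A stylesheet could use the same selection to display these characters in italics.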

When displaying the text from the First Grammatical Treatise on the normalized level, one might also choose to display the contents of the <c> element in italics, which would be possible with the suggested mark-up:

X, hann er samsettr í látinu af c ok s.

Individual characters are moreover marked up as such when the character in question is an initial or sentence initial. In that case, the character is part of a larger word, meaning that the entire word is enclosed by the <w> element, while only the visually highlighted initial is enclosed by the <c> element. A detailed description of how to mark up initials in the transcription is provided in ch. 7.3. Note, however, that the visual rendering of initials in a manuscript is only encoded on the facsimile level, not on the diplomatic or normalized levels.

Finally, hyphens – where they occur in manuscripts – are encoded with the element <c>. For the mark-up of hyphens see ch. 5.5 below.


5.3 Words

5.3.1 Basic mark-up

This subchapter will introduce some important elements and attributes for the encoding of words or word parts, mostly based on ch. 17.1 “Linguistic Segment Categories” in the TEI P5 Guidelines.

Element / attribute / value Contents
<w> (word) contains an individual word
   @lemma states the lexical citation form of a word
<m> (morpheme) contains a part of a word
   @baseForm states the base form of a morpheme
<seg> (segment) groups one or more strings of text, e.g. words
   @type states the type of segmentation. Suggested values:
    'nb' no break
    'enc' enclitic

As a rule, medieval Nordic manuscripts in the Latin alphabet are written with a clearly identifiable space between each word. This obviously facilitates the work for the transcriber, since the word is a basic linguistic unit in grammars and dictionaries. In a simple transcription, word division can simply be entered by the space bar on the keyboard. Thus, a piece of text (from Barlaams ok Josaphats saga ch. 48) might be transcribed as

En ef ver fallum i hinar fornno syndir oc huerfum aptr til hinna fyrrv misverka sem hundr til spyu sinnar þa kann lettlega at vera at oss kunni til hannda at berazt sem i guðspialleno segir.

Here, each word is delimited by a space (or a punctuation mark). However, for a more detailed analysis it can be convenient to identify each word with a separate <w> element (for “word”). The <w> element functions as a container for information on levels of text representation (ch. 3 above) and morphological analysis (ch. 9). In this example, each word has been identified by the <w> element, and the lemma (dictionary entry) specified as an attribute to the <w> element:

<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="vér">ver</w>
<w lemma="falla">fallum</w>
<w lemma="í">i</w>
<w lemma="hinn">hinar</w>
<w lemma="forn">fornno</w>
<w lemma="synd">syndir</w>
etc.	
          

For practical reasons, each word has a separate line in this encoding. Unless otherwise specified, it is assumed that there is white space between each <w> element.
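
Once each word is contained in a <w> element, word forms and lemmata can be extracted with very little code. The Python sketch below wraps the eight words above in an assumed <p> element and ignores namespaces; it is meant as an illustration of the principle, not as a Menota tool.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>
<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="vér">ver</w>
<w lemma="falla">fallum</w>
<w lemma="í">i</w>
<w lemma="hinn">hinar</w>
<w lemma="forn">fornno</w>
<w lemma="synd">syndir</w>
</p>"""

root = ET.fromstring(SAMPLE)

# (word form, lemma) pairs in manuscript order.
pairs = [(w.text, w.get("lemma")) for w in root.iter("w")]

# White space is assumed between each <w> element when displaying.
running_text = " ".join(form for form, _ in pairs)

# A simple look-up table from word form to lemma.
lemma_of = dict(pairs)
```

The list of pairs could serve as the starting point for a word index or a concordance.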

Ch. 3 above introduced levels of transcription (facsimile, diplomatic and normalised), and ch. 9 below discusses how words can be marked for morphological categories.

5.3.2 One word or two? Graphical and lexical words

Although words as a rule are separated by spaces in medieval Nordic manuscripts, there are many exceptions to this rule. For this reason, a distinction should be drawn between graphical words and lexical words. A graphical word is a sequence set off by a space on either side, while a lexical word is a member of the set of word forms defined by grammars and dictionaries for the language in question. In the great majority of cases, graphical and lexical words are identical. However, we sometimes see that a preposition and its object may be written as a single word (“aveiðiskap” = “á veiðiskap”), or that compounds are written as two separate words (“veiði kona” = “veiðikona”).

Fig. 5.4. Text adapted from Barlaams ok Josaphats saga, Holm perg. fol. nr. 6, f. 138

If the transcriber wishes to analyse two (or more) graphical words as a single lexical word, we suggest that this is done by putting the whole sequence within the <w> element:

<w>veiði kona</w>   
          

Information on e.g. lemma can be given as an attribute to the <w> element:

<w lemma="veiðikona">veiði kona</w>   
          

The sequence “veiði kona” thus appears within a single element. In other words, the transcriber interprets it as one lexical word, “veiðikona”. The space is left untouched, so that in a display of the transcription, the sequence will still show up as two graphical words, “veiði” and “kona”. However, since both graphical words are placed within a single element the lemma will refer to both parts.

The converse case is a single graphical word which the transcriber would like to analyse as two (or more) lexical words, e.g. “aveiðiskap” = “á veiðiskap”. Each lexical word should be placed within a <w> element, and information on lemma, morphological form etc. can be given within each <w> element. However, to generate a correct display of the text, i.e. a display with no space between each part, we suggest that the <seg> element is used with a type attribute. The value “nb” would indicate that there is no break between the parts in the <w> element. If the lemma is given by way of an attribute, the encoding would look like this:

<seg type="nb">
  <w lemma="á">a</w>
  <w lemma="veiðiskapr">veiðiskap</w>
</seg>	
          

In some rather marginal cases, a sequence may be encoded as both types. A simplified example from Codex Regius is “aravk stola” which should be read as “a ravkstola”. This sequence might be encoded in this way:

<seg type="nb">
  <w lemma="á">a</w>
  <w lemma="rǫkstóll">ravk stola</w>
</seg>	
          

This encoding shows that “a” in “aravk stola” is a lexical word, sc. the preposition “á”, and that “ravk stola” is another lexical word, sc. the noun “rǫkstóll”. It will also allow a correct display of the sequence, since it specifies that there should be no space between “a” and “ravk stola”, and the space between “ravk” and “stola” is also encoded (analogous to the encoding of “veiði kona” above).

Enclitic words may be encoded in a similar way, e.g. “emk” which should be read as “em” + “(e)k”, “am I”:

<seg type="enc">
  <w lemma="vera">em</w>
  <w lemma="ek">k</w>
</seg>	
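
The display rule implied by the <seg> encodings in this subchapter can be sketched in Python as follows. The function `render` and the wrapping <p> element are our own, and we assume that the type 'enc', like 'nb', means that the parts are displayed without an intervening space.

```python
import xml.etree.ElementTree as ET

# Segment types whose <w> children are displayed without a space
# (treating 'enc' like 'nb' is an assumption of this sketch).
NO_SPACE_TYPES = {"nb", "enc"}


def word_text(w: ET.Element) -> str:
    """Text of a <w> element, keeping any space inside the word."""
    return "".join(w.itertext()).strip()


def render(parent: ET.Element) -> str:
    """Join words with spaces, except inside no-break segments."""
    parts = []
    for child in parent:
        if child.tag == "w":
            parts.append(word_text(child))
        elif child.tag == "seg":
            sep = "" if child.get("type") in NO_SPACE_TYPES else " "
            parts.append(sep.join(word_text(w) for w in child.iter("w")))
    return " ".join(parts)


SAMPLE = """<p>
<seg type="nb"><w lemma="á">a</w><w lemma="rǫkstóll">ravk stola</w></seg>
<seg type="enc"><w lemma="vera">em</w><w lemma="ek">k</w></seg>
<w lemma="veiðikona">veiði kona</w>
</p>"""

displayed = render(ET.fromstring(SAMPLE))
```

The graphical words of the manuscript are thus reproduced in the display, while the lexical words remain available through the <w> elements and their lemmata.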
          

5.3.3 Encoding of word constituents

The encoder might want to encode constituent parts of a word, e.g. prefixes, roots, derivational forms, etc. We recommend using the <m> element (for “morpheme”) in such cases (cf. ch. 17.1 in the TEI P5 Guidelines). This element may also be used for constituent parts such as “veiði” and “kona” in the examples above. The <m> element may contain information on level of text representation, lemma etc. We shall repeat the encoding of “veiði kona” above:

<w lemma="veiðikona">veiði kona</w>	
          

Now, if the encoder wishes to add lexicographical (or other) information to the two constituent parts, that can easily be done by inserting <m> elements in the <w> element:

<w lemma="veiðikona">
  <m baseForm="veiði">veiði</m>
  <m baseForm="kona">kona</m>
</w>
          

This encoding would make a clear distinction between lemmata on the first level of encoding, in this case “veiðikona”, and the base form, @baseForm, of each constituent part, in this case “veiði” and “kona”.

Lemmatisation is further discussed in ch. 9 below and is here only given as an example of a word-based type of mark-up. Grammatical information can also be conveniently attached to the word through the @msa (morphosyntactical analysis) attribute. This is also discussed in ch. 9.


5.4 Punctuation

Having introduced elements for the encoding of individual characters and words, we now turn to punctuation marks, which it can also be useful to tag specifically. For punctuation characters in general, we recommend using the <pc> element. For quotation marks, however, we recommend using the <q> element.

Element Contents
<pc> contains a punctuation mark
<q> contains a quotation
<me:facs> contains a reading on a facsimile level
<me:dipl> contains a reading on a diplomatic level
<me:norm> contains a reading on a normalised level
<choice> groups alternative readings, such as <me:facs>, <me:dipl> and <me:norm>

The three levels of text representation, facs, dipl and norm, were explained in ch. 4 above. Note the prefix “me:” which indicates that these elements belong to the Menota namespace and are not part of the elements defined in TEI P5. See ch. 2.9 above on the use of namespaces in TEI schemas.

5.4.1 Punctuation in a single-level transcription

In ch. 5.3.1 above, we said that a text can be encoded character by character. Punctuation marks are simply inserted where they occur in the manuscript, even if the position is wrong according to modern rules. If the actual punctuation in Barlaams ok Josaphats saga is added, the example above looks like this:

En ef ver fallum i hinar fornno syndir. oc huerfum aptr. til hinna fyrrv misverka sem hundr til spyu sinnar. þa kann lettlega at vera. at oss kunni til hannda at berazt. sem i guðspialleno segir.

If a text is encoded using the <w> element, we recommend using a <pc> element for punctuation marks. This is what an encoding looks like on a single, diplomatic level:

<w>En</w>
<w>ef</w>
<w>ver</w>
<w>fallum</w>
<w>i</w>
<w>hinar</w>
<w>fornno</w>
<w>syndir</w>
<pc>.</pc>
<w>oc</w>
<w>huerfum</w>
<w>aptr</w>
<pc>.</pc>
etc.	
          

The main reason for doing so follows from the encoding of more than one level of transcription. At a diplomatic level, the transcriber should encode the punctuation marks exactly where they are in the source, but at a normalised level, some punctuation marks should be suppressed, some should be retained and some should be added.

In addition to punctuation marks like FULL STOP, COMMA, COLON, SEMICOLON and HYPHEN, there are a number of specific medieval punctuation marks, including an early form of the QUESTION MARK and a PUNCTUS ELEVATUS. A full list of additional punctuation marks can be found in the MUFI character recommendation with appropriate character entities. For example, the PUNCTUS ELEVATUS, which sometimes appears in Medieval Nordic texts, should be encoded with the entity “&punctelev;”.

5.4.2 Punctuation in a multi-level transcription

While punctuation on the <me:facs> and <me:dipl> levels in most cases will be identical, it is often radically different on the <me:norm> level. Here, many dots in the manuscript will simply be suppressed, while other punctuation marks will be added, including modern punctuation marks like quotation marks and exclamation marks. Suppressing a punctuation mark is simply done by leaving the element empty, while any supplied marks are encoded by adding a new <pc> element in which the <me:facs> and possibly also the <me:dipl> element will be empty.

A text transcribed as

ok nu sagdi hann. þat er eigi sva. sem þu segir

on the <me:dipl> level would probably be rendered as

“Ok nú,” sagði hann, “Þat er eigi svá sem þú segir.”

on the <me:norm> level, allowing for some variation in the type of quotation marks and the order of comma or full stop and quotation mark. In a fully marked-up text, the dot after “sva” would probably be suppressed on the <me:norm> level, while a comma after “nu” would be added and the dot after “hann” would be changed into a comma. Finally, quotation marks would be added. However, unlike punctuation characters (e.g. commas and full stops), quotation marks do not need to be written out by the transcriber. Instead, the element <q> is simply placed around any part in direct speech, and the stylesheet will then render the displayed text and any punctuation characters inside quotation marks:

<q>

<w>
  <choice>
    <me:dipl>ok</me:dipl>
    <me:norm>Ok</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>nu</me:dipl>
    <me:norm>nú</me:norm>
  </choice>
</w>

<pc>
  <choice>
    <me:dipl></me:dipl>
    <me:norm>,</me:norm>
  </choice>
</pc>

</q>

<w>
  <choice>
    <me:dipl>sagdi</me:dipl>
    <me:norm>sagði</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>hann</me:dipl>
    <me:norm>hann</me:norm>
  </choice>
</w>

<pc>
  <choice>
    <me:dipl>.</me:dipl>
    <me:norm>,</me:norm>
  </choice>
</pc>

<q>

<w>
  <choice>
    <me:dipl>þat</me:dipl>
    <me:norm>Þat</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>er</me:dipl>
    <me:norm>er</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>eigi</me:dipl>
    <me:norm>eigi</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>sva</me:dipl>
    <me:norm>svá</me:norm>
  </choice>
</w>

<pc>
  <choice>
    <me:dipl>.</me:dipl>
    <me:norm></me:norm>
  </choice>
</pc>

<w>
  <choice>
    <me:dipl>sem</me:dipl>
    <me:norm>sem</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>þu</me:dipl>
    <me:norm>þú</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:dipl>segir</me:dipl>
    <me:norm>segir</me:norm>
  </choice>
</w>

</q>
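
How a stylesheet might turn this encoding into a display on either level can be sketched in Python. The sketch makes several assumptions which are not part of the encoding itself: the namespace URI bound to the prefix “me:”, the wrapping <p> element, the choice of curly quotation marks, and the rule that a punctuation mark is attached to the preceding word without a space. Only the first half of the example is included, to keep the sketch short.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"  # assumed URI for the "me:" prefix

SAMPLE = f"""<p xmlns:me="{ME}">
<q>
<w><choice><me:dipl>ok</me:dipl><me:norm>Ok</me:norm></choice></w>
<w><choice><me:dipl>nu</me:dipl><me:norm>nú</me:norm></choice></w>
<pc><choice><me:dipl></me:dipl><me:norm>,</me:norm></choice></pc>
</q>
<w><choice><me:dipl>sagdi</me:dipl><me:norm>sagði</me:norm></choice></w>
<w><choice><me:dipl>hann</me:dipl><me:norm>hann</me:norm></choice></w>
<pc><choice><me:dipl>.</me:dipl><me:norm>,</me:norm></choice></pc>
</p>"""


def reading(element: ET.Element, level: str) -> str:
    """Return the reading of a <w> or <pc> on the given level."""
    node = element.find(f"choice/{{{ME}}}{level}")
    return (node.text or "") if node is not None else ""


def render(parent: ET.Element, level: str) -> str:
    """Display the children of `parent` on one level ('dipl' or 'norm')."""
    out = ""
    for child in parent:
        if child.tag == "q":
            token, is_pc = render(child, level), False
            if level == "norm":
                token = "“" + token + "”"  # quotation marks only on norm
        elif child.tag in ("w", "pc"):
            token, is_pc = reading(child, level), child.tag == "pc"
        else:
            continue
        if not token:
            continue  # an empty reading means the mark is suppressed
        if out and not is_pc:
            out += " "  # punctuation attaches to the preceding word
        out += token
    return out


root = ET.fromstring(SAMPLE)
diplomatic = render(root, "dipl")
normalised = render(root, "norm")
```

The same source thus yields “ok nu sagdi hann.” on the diplomatic level and the first part of the normalised sentence, with comma and quotation marks, on the normalised level.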

In many cases, a dot should be interpreted as an abbreviation mark rather than a punctuation mark. In such cases, we recommend that the dot is encoded using the ordinary full stop in Basic Latin, but that it is placed within the <am> element. A text transcribed as

nu fann kgr. engan mann þar

on the <me:facs> level would probably be rendered as

nu fann konongr engan mann þar

on the <me:dipl> level. In a fully marked-up text, the abbreviated word “kgr.” would be encoded with an <am> element around the dot (the abbreviation mark) on the <me:facs> level, while the abbreviation would be expanded into “onon” (or “onun”) on the <me:dipl> level:

<w>
  <choice>
    <me:facs>nu</me:facs>
    <me:dipl>nu</me:dipl>
  </choice>
</w>

<w>
  <choice>
    <me:facs>fann</me:facs>
    <me:dipl>fann</me:dipl>
  </choice>
</w>

<w>
  <choice>
    <me:facs>kgr<am>.</am></me:facs>
    <me:dipl>k<ex>onon</ex>gr</me:dipl>
  </choice>
</w>

<w>
  <choice>
    <me:facs>engan</me:facs>
    <me:dipl>engan</me:dipl>
  </choice>
</w>

<w>
  <choice>
    <me:facs>mann</me:facs>
    <me:dipl>mann</me:dipl>
  </choice>
</w>

<w>
  <choice>
    <me:facs>þar</me:facs>
    <me:dipl>þar</me:dipl>
  </choice>
</w>

<pc>
  <choice>
    <me:facs></me:facs>
    <me:dipl>.</me:dipl>
  </choice>
</pc>

In some cases, a word abbreviated with a dot may occur at the end of a sentence, e.g.

nu fann hann eigi kgr.

This dot would be interpreted as an abbreviation mark and possibly also as a punctuation mark. On the <me:facs> level it would be encoded as no more than a dot (inside an <am> element), while on the <me:dipl> level it would be suppressed when “kgr.” had been expanded to “konongr”. The encoder might, however, add a dot as a punctuation mark within a <pc> element. That would certainly be the case on the <me:norm> level, possibly also on the <me:dipl> level, but not on the <me:facs> level:

<w>
  <choice>
    <me:facs>nu</me:facs>
    <me:dipl>nu</me:dipl>
    <me:norm>Nú</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:facs>fann</me:facs>
    <me:dipl>fann</me:dipl>
    <me:norm>fann</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:facs>hann</me:facs>
    <me:dipl>hann</me:dipl>
    <me:norm>hann</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:facs>eigi</me:facs>
    <me:dipl>eigi</me:dipl>
    <me:norm>eigi</me:norm>
  </choice>
</w>

<w>
  <choice>
    <me:facs>kgr<am>.</am></me:facs>
    <me:dipl>k<ex>onon</ex>gr</me:dipl>
    <me:norm>konungr</me:norm>
  </choice>
</w>

<pc>
  <choice>
    <me:facs></me:facs>
    <me:dipl>.</me:dipl>
    <me:norm>.</me:norm>
  </choice>
</pc>

With this markup, a dot will be displayed after the word on all three levels, but the dot on the <me:facs> level is classified as an abbreviation mark (since it occurs within the <am> element), while the dot on the <me:dipl> and the <me:norm> levels is classified as a punctuation mark (since it occurs within the <pc> element).

The dot is by far the most common punctuation mark in Medieval Nordic sources. A question mark was sometimes used, while quotation marks and exclamation marks are post-medieval and only seen in normalised editions. There are a few additional punctuation marks, e.g. the punctus elevatus and the virgula. These marks can be encoded using entities, but should otherwise be kept within the <pc> element. See also ch. 5 below.


5.5 Hyphenation

Like modern texts, medieval manuscripts were by and large justified, i.e. each line had approximately equal length. As a consequence, words often continued on the next line. However, hyphens were used far less than in modern texts, where they are more or less obligatory.

We recommend that hyphens are encoded whenever they occur in the manuscript, using the <c> element. This element should have a @type attribute with the value 'hyphen' and, optionally, a @resp attribute specifying who is responsible for the hyphenation, e.g. a later hand. If the scribe is responsible for the hyphenation, we suggest that the @resp attribute be left out.

When there is no hyphen in the manuscript, we do not think it is necessary to supply one, as long as the word is placed within a <w> element which also contains the <lb> element. A hyphen can then be supplied by the stylesheet at one or more levels, at the position indicated by the <lb> element. We suggest that a missing hyphen should be displayed on the diplomatic and normalised levels, but not on the facsimile level.

Element / attribute / value Contents
<c> character
   @type states the type of character. Recommended value:
       'hyphen' meaning that there is a hyphen in the margin, only to be rendered when the word appears at the end of the line (soft hyphen)
   @resp states who is responsible for the hyphenation. Suggested value:
       '#h2' the hyphenation has been supplied by a hand specified and numbered in the header, typically a second or later hand

If there is a hyphen in the margin of the line, we suggest this encoding:

This is how <w>hyphen<c type="hyphen">-</c><lb n="2"/>ation</w>
can be encoded when there actually is a hyphen in the manuscript.
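
If the hyphen has been added by someone other than the scribe, the @resp attribute described in the table above can be added to the same encoding. The following is a minimal sketch, assuming that a hand with the identifier "h2" has been declared in the header; the identifier is only an illustration:

```xml
<!-- Assumption: a second hand has been declared in the header with xml:id="h2" -->
This is how <w>hyphen<c type="hyphen" resp="#h2">-</c><lb n="2"/>ation</w>
can be encoded when the hyphen has been supplied by a later hand.
```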
        

If the hyphen is missing in the manuscript, and that happens quite frequently, we suggest that no hyphen is encoded:

This is how <w>hyphen<lb n="2"/>ation</w> can be encoded
when there is no hyphen in the manuscript.
        

In the latter case, a suitable stylesheet can add a hyphen to the display of the word so as to simplify the reading for the users. The stylesheet should render the hyphenation in such a way (e.g. by using a different-looking hyphen character) that the users will understand the difference between hyphenation in the manuscript and supplied hyphenation.
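
How a stylesheet might supply such a hyphen can be sketched in XSLT 1.0. This is only an illustration, not the Menota stylesheet itself: the HTML output, the class name "supplied-hyphen" and the Menota namespace URI are assumptions. The template adds a marked hyphen before a line break inside a word unless the manuscript hyphen (<c type="hyphen">) immediately precedes it, and it adds nothing on the facsimile level:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:me="http://www.menota.org/ns/1.0"
    exclude-result-prefixes="tei me">

  <xsl:output method="html" encoding="UTF-8"/>

  <!-- A line break occurring inside a word -->
  <xsl:template match="tei:w//tei:lb">
    <!-- Supply a hyphen only if (a) we are not on the facsimile level and
         (b) the node just before the break is not a manuscript hyphen -->
    <xsl:if test="not(ancestor::me:facs)
                  and not(preceding-sibling::node()[1][self::tei:c][@type = 'hyphen'])">
      <!-- Hypothetical class name; CSS can give it a distinct appearance -->
      <span class="supplied-hyphen">-</span>
    </xsl:if>
    <br/>
  </xsl:template>

</xsl:stylesheet>
```

A manuscript hyphen is passed through unchanged by the built-in template rules, so the two kinds of hyphen can be styled differently in the display.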

At the normalised level, there will sometimes be hard hyphens such as in the name “Egill Skalla-Grímssonar” (also spelt “Egill Skallagrímssonar”). This type of hyphen should be encoded with the ordinary hyphen character:

This is how the name <w>Egill</w> 
<w>Skalla-Grímssonar</w> can be encoded.
        

The ordinary hyphen will be displayed in any position of the word, whether in the line or at the end of a line.

5.5.1 Hyphenation in a single-level transcription

The encoding of hyphenation in a single-level transcription (cf. ch. 4) is essentially the same as in a multi-level transcription. We give examples of both.

Fig. 5.5. Hyphenation in the manuscript. From the Old Norwegian homily book in AM 619 4to, fol. 47r, l. 1–4.

This is how the hyphenated word “hæ-góma” (normalised “hégóma”) in lines 3–4 in fig. 5.5 should be encoded on the diplomatic level:

<w>
   hæ<c type="hyphen">-</c><lb n="4"/>góma
</w>
          

Fig. 5.6. Missing hyphenation in the manuscript. From Henrik Harpestreng in NKS 66 8vo, fol. 116r, l. 1–3.

The non-hyphenated word “hwilk-kæ” in fig. 5.6, lines 2–3 should receive this encoding, not recording any hyphenation:

<w>
   hwilk<lb n="3"/>kæ
</w>
          

5.5.2 Hyphenation in a multi-level transcription

In a multi-level transcription, the rules for hyphenation will be identical to the ones above. This would be the encoding of the hyphen in fig. 5.5 above:

<w>
   <choice>
      <me:facs>hæ<c type="hyphen">-</c><lb n="4"/>góma</me:facs>
      <me:dipl>hæ<c type="hyphen">-</c><lb n="4"/>góma</me:dipl>
      <me:norm>hé<c type="hyphen">-</c><lb n="4"/>góma</me:norm>
   </choice>
</w>
          

When the hyphen is missing, as in fig. 5.6 above, we recommend that the encoder simply encodes the word as it is, leaving it to the style sheet to display a hyphen:

<w>
   <choice>
      <me:facs>hwilk<lb n="3"/>kæ</me:facs>
      <me:dipl>hwilk<lb n="3"/>kæ</me:dipl>
      <me:norm>hwil<lb n="3"/>kæ</me:norm>
   </choice>
</w>
          

In this example, the encoder might decide to render the word as “hwilkæ” on the normalised level, assuming that the line break in the manuscript had led to the dittography “hwilkkæ”.

Note that a line break will appear several times in a multi-level transcription if it occurs within a word. Great caution must therefore be taken with automatic numbering of <lb/> elements.
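
One way of avoiding double counting is to number only one <lb/> element for each physical line break. The XSLT 1.0 sketch below is an illustration under stated assumptions: every word containing a line break has a <me:facs> level, numbering runs continuously through the document, and the Menota namespace URI and the class name "line-number" are assumed. It ignores the copies of the line break on the <me:dipl> and <me:norm> levels when counting:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:me="http://www.menota.org/ns/1.0"
    exclude-result-prefixes="tei me">

  <xsl:output method="html" encoding="UTF-8"/>

  <!-- Matches line breaks between words and those on the facsimile level,
       i.e. exactly one <lb/> per physical line break -->
  <xsl:template match="tei:lb[not(ancestor::me:dipl) and not(ancestor::me:norm)]">
    <!-- Count earlier line breaks of the same kind, then add the current one -->
    <xsl:variable name="n"
        select="count(preceding::tei:lb[not(ancestor::me:dipl)
                                        and not(ancestor::me:norm)]) + 1"/>
    <br/>
    <span class="line-number"><xsl:value-of select="$n"/></span>
  </xsl:template>

</xsl:stylesheet>
```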

 


5.6 White space

This subchapter discusses the encoding of what is not (in a sense) in the manuscript: white space between words and around punctuation. In addition to the elements already introduced in this chapter, one more element will be used:

Element Contents
<num> contains a number, including any delimiters

In a single-level transcription, spaces may simply be inserted with the space bar. Note that in XML as well as in HTML, any sequence of consecutive white-space characters (spaces, tabs and line breaks) is normally treated as a single space. It is therefore not possible to encode a long space in the manuscript simply by hitting the space bar several times. Any distinctions in space length must be encoded explicitly. In our experience, there is no significant variation in word spacing in Medieval Nordic manuscripts. If, however, a transcriber believes that there is more than one length of space, the simplest way of encoding this is probably to define the standard space, code point 0020, as the default space and to define deviating spaces with reference to the list of various space lengths in the Unicode chart General Punctuation, 2000-200B. For recommended entities, see the MUFI character recommendation.
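
As a small illustration, a wider space can be entered as a numeric character reference. The choice of U+2003 EM SPACE for a wide gap, and its position in the line, are assumptions made for the sake of the example. Since this character is not one of the white-space characters of XML, it is not collapsed together with the ordinary spaces around it:

```xml
<!-- Ordinary space (U+0020) after "Reiðr"; a wider gap, here assumed to be
     represented by U+2003 EM SPACE, after "var" -->
<p>Reiðr var&#x2003;þá Vingþórr</p>
```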

As for the interpretation and display of spaces in a multi-level transcription, we suggest the following three rules:

1. A transcription using the <w> and the <pc> element should be displayed with a space immediately after each element.

The example in ch. 5.4.1 above would then be interpreted (e.g. by an XSLT stylesheet) as

En ef ver fallum i hinar fornno syndir . oc huerfum aptr .

This is correct in so far as there should be a space after each punctuation mark, but wrong in so far as there should not be a space before the punctuation mark. The following additions to the general rule must be made with respect to the <pc> element:

2. When displaying the text, there should not be any white space before a <pc> element.

The example above will then be correctly displayed as

En ef ver fallum i hinar fornno syndir. oc huerfum aptr.
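
Rules 1 and 2 can be sketched as a single XSLT 1.0 template. This is a minimal illustration for a single-level transcription with plain-text output, not the Menota stylesheet itself. It writes out each <w> and <pc> element followed by a space, unless the next element is a <pc>:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="text" encoding="UTF-8"/>
  <!-- Ignore line breaks and indentation between the elements in the file -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="tei:w | tei:pc">
    <xsl:value-of select="normalize-space(.)"/>
    <!-- Rule 1: a space after each element,
         Rule 2: but no space in front of a following <pc> -->
    <xsl:if test="not(following-sibling::*[1][self::tei:pc])">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Note that the sketch also leaves a space after the very last element, which a full stylesheet would probably want to trim.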

A transcriber might also want to indicate direct speech by means of quotation marks, e.g.

Hann segir, “Ek veit ekki.”

We recommend that this is done using the <q> element (see also ch. 5.4.2). Any number of <w> and <pc> elements can be placed within the element <q> and thus be marked as direct speech. A suitable stylesheet can insert quotation marks in the display with the correct amount of white space before or after the quotation (usually on the normalised level of the transcription only).

<w>Hann</w>
<w>segir</w>
<pc>,</pc>
<q>
<w>Ek</w>
<w>veit</w>
<w>ekki</w>
<pc>.</pc>
</q>
        

Roman numerals are an exception to rule 2, since they are typically delimited by a dot immediately before and after the number:

Hann er .xij. vetra gamall.

We recommend that the delimiters are encoded as part of the number, and thus contained in the <num> element:

<w>Hann</w>
<w>er</w>
<num>.xij.</num>
<w>vetra</w>
<w>gamall</w>
<pc>.</pc>
        

If this text is going to be annotated for morphology, we recommend that the lemma for the Roman numeral is given as a number, in this case lemma="12".

When a number is spelt out in the text, we recommend using the <w> element inside the <num> element:

<w>Hann</w>
<w>er</w>
<num><w>tolf</w></num>
<w>vetra</w>
<w>gamall</w>
<pc>.</pc>
        

In the case of annotation, the lemma should be given as a word, e.g. lemma="tolf". See ch. 11 for more details on lemmatisation.

If an ordinary punctuation mark is positioned immediately before a word rather than after the preceding word, we recommend that a @rend attribute is used with the value “rightlocation”. Thus,

Hann kemr .opt.

should be encoded as

<w>Hann</w>
<w>kemr</w>
<pc rend="rightlocation">.</pc>
<w>opt</w>
<pc>.</pc>
        

The stylesheet can then be instructed to position the first punctuation mark accordingly, i.e. immediately in front of the following word.

Finally, the following addition to the general rule must be made with respect to the <w> element:

3. If two or more <w> elements are contained in a <seg> element (type="nb"), in the display on the <me:facs> and <me:dipl> levels there should not be any space after the <w> elements except for the last <w> element contained in the <seg> element.

Thus, the following sequence

<seg type="nb">
   <w>
      <choice>
         <me:facs>a</me:facs>
         <me:dipl>a</me:dipl>
         <me:norm>á</me:norm>
      </choice>
   </w>
   <w>
      <choice>
         <me:facs>lande</me:facs>
         <me:dipl>lande</me:dipl>
         <me:norm>landi</me:norm>
      </choice>
   </w>
</seg>
        

should be displayed as “alande” on the <me:facs> and the <me:dipl> level, with no word division, but as “á landi” on the <me:norm> level, with word division. In the latter case, rule (1) applies, which states that a space should be displayed after each <w> element. In the former case, rule (3) entails that no space should be displayed after the first of the two words in the <seg> element. See also ch. 5.3.2 above.
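
Rule 3 can be sketched in XSLT 1.0 along the following lines. This is an illustration only: the parameter name "level", the plain-text output and the Menota namespace URI are assumptions. The template prints the chosen level of each word and leaves out the following space when the word is a non-final member of a <seg type="nb"> element, except on the normalised level:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:me="http://www.menota.org/ns/1.0">

  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical parameter: 'facs', 'dipl' or 'norm' -->
  <xsl:param name="level" select="'dipl'"/>

  <xsl:template match="tei:w">
    <!-- Print only the selected level of the word -->
    <xsl:value-of select="normalize-space(.//me:*[local-name() = $level])"/>
    <!-- Rule 1 applies on the normalised level, outside <seg type="nb">,
         and after the last word of the segment; otherwise rule 3 applies -->
    <xsl:if test="$level = 'norm'
                  or not(parent::tei:seg[@type = 'nb'])
                  or not(following-sibling::tei:w)">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

With the level set to 'dipl' or 'facs', the two words of the segment are written together, while the value 'norm' separates them by a space.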

The above-mentioned rules 1-3 are part of the Menota XSLT stylesheet. When it is applied to texts encoded according to these guidelines, white space should be displayed correctly. See also Appendix F.2.


First published 28 August 2016. Last updated 7 August 2017.