Chapter 3. Document structure

Version 3.0 beta

This is a preliminary version which can be changed or updated at any time.
The chapter has been revised by Odd Einar Haugen.

3.1 Introduction: The structure of the manuscript vs. the structure of the work

Viewed as physical objects, rather than as vehicles for texts, manuscripts have a certain structural hierarchy. What is regarded as a single manuscript may in fact comprise more than one volume; Flateyjarbók, for example, is bound in two volumes, and the large rímur codex Acc. 22 in three. A manuscript book is made up of quires or gatherings, each of which contains a number of leaves, normally eight. Each leaf has a recto side and a verso side, and each side may be further divided into columns. The text is then written in lines across the page or column. In order to be able to locate a word quickly and easily, all, or at least most, of these structural divisions must be registered. We need to know that a given word appears in the fifth line of the right-hand or b column on the recto side of folio 34. As it is customary to foliate manuscripts without regard to their quire division, the quires will not normally need to be included in the hierarchical structure, but since the quiring can have implications for the text itself this division should be indicated, and will also generally form part of the <msDesc> element, found in the document header.

At the same time, of course, manuscripts obviously do contain texts, which is the reason why most of us are interested in them in the first place. A single manuscript will often contain more than one work, each of which may, in the case of lengthy prose works such as sagas, be divided into chapters or sections. In the case of Medieval Icelandic poetry, rímur for example, a single work (rímnaflokkur) will usually consist of several cantos or fits, each containing a number of stanzas, made up of a number of lines. It may be necessary to group these lines in some other ways as well. The stanzas comprising the mansöngur should be distinguished from the main body of the fit, for example, while to facilitate certain types of metrical analysis it might be desirable to divide the individual stanzas into couplets. Some types of poetry, such as the vikivakakvæði, will have a refrain or burden, which should ideally also be distinguished from the narrative section(s) of the stanza.

XML has at its foundation the notion of a text as a single hierarchical structure, which means that it does not work well where there are several concurrent hierarchies, as is obviously the case when one wishes for example to indicate the line divisions both in a poem and in the manuscript in which the poem is contained. The TEI Guidelines offer various solutions to this problem, enabling both the structure of the document and the structure of the text to be encoded.

3.1.1 Hierarchical divisions

The principal means of representing hierarchy is the <div> (i.e. “division”) element. <div> elements may freely nest within each other. The <div> element has, in addition to the universally available @id and @n attributes, a @type attribute, which specifies the name conventionally given to the level of division, e.g. ‘chapter’, ‘stanza’, ‘couplet’, if attempting to represent the structure of the text, ‘page’, ‘column’, ‘line’ if the physical structure of the manuscript is to be preferred. It will be convenient to specify a value for the @type attribute in the <div> element at least each time a change of level occurs. The software, however, will keep count of the levels of nesting even if the type attribute is not used.

The complex structure of a work such as a set of rímur could be represented by using four levels of <div> elements, <div type="canto"> for the cantos or fits, <div type="part"> for the parts (for example the mansöngvar), <div type="stanza"> for the stanzas, and <div type="line"> for the lines. If the manuscript being encoded contains more than one set of rímur, as is frequently the case, it might be sensible to use <div type="canto"> for each set. A simpler form of mark-up is possible, however. Instead of <div> elements, the tags <l> (for “line”) and <lg> (for “line-group”, i.e. a group of lines functioning as a formal unit) can be used, reserving the <div> element for larger structural units. The @type attribute is then used to identify the type of unit, e.g. ‘stanza’, ‘couplet’, like in <lg type="stanza">. Here again the type need only be defined once. Lines and line-groups can also be numbered and identified using the @n and @id attributes.

This type of markup focusses on the hierarchical structure of the text. The actual physical realisation of the text is considered of secondary importance – if of importance at all – when dealing with modern printed literary works: little significance is attached to the page and line breaks in the various editions of, say, Orwell's Nineteen Eighty-Four. In some cases, however, the early editions of Joyce's works, for example, supervised by the author himself, the physical make-up of the text can be of great consequence. It may also be necessary to maintain the pagination and lineation of standard editions of major works, as these are frequently used in citations in scholarly works. In the case of chirographically transmitted material, the physical organisation of the text is more likely to be recognised as being of importance and in need of encoding. This can be done hierarchically, as above, using <div> elements, which are then given the appropriate @type attributes, e.g. ‘page’, ‘column’ or ‘line’, but it seems more appropriate to reserve these elements for structural divisions in the text, while indicating the physical structure of the document through the use of so-called “milestone” tags, i.e. <pb/>, <cb/> and <lb/>. These tags make up a separate hierarchy in the file and help to overcome the problem of overlapping structures in the mark-up; see also the discussion in ch. 16 below.

The rest of this chapter presents how the text may be encoded at higher structural levels than characters and words. Important elements here are the larger divisions of the text, like chapters, paragraphs (with headings), and stanzas. This chapter also presents how pagination and foliation, together with column-breaks and line-breaks, may be encoded. The following TEI elements are presented:

Elements Contents
<text>, <body> Main divisions of the text,
<div> division into chapters (multiple levels are encoded by nesting elements),
<p> prose paragraphs,
<w> words,
<pc> punctuation marks,
<lg>, <l> line groups and lines (for stanzas),
<head> headings,
<pb/>, <cb/>, <lb/> page-, column- and line-breaks.

We will also discuss a few other elements which are relevant for the document structure:

Elements Contents
<c> Any character of note,
<q> quotes, typically containing dialogue,
<quote> quotations from external sources,
<choice> grouping of alternative encoding,
<handShift> change of scribal hand.

3.2 Main divisions of a TEI document

The following presentation is based on ch. 4: Default Text Structure of the TEI P5 Guidelines.

A TEI document is always at its highest level enclosed by the start tag <TEI> and the end tag </TEI>. Within the <TEI> element, two other elements appear in a fixed order, namely the <teiHeader> and the <text> elements. Within the <text> element, the body text may appear, enclosed in the element <body>. If the text has front matter, there will be an element <front>, placed before <body> containing it. Similarly, there may be an element <back>, placed after <body> and containing back matter. The elements <teiHeader>, <text> and <body> are required in any TEI-conformant document, while <front> and <back> are optional. This, then, is the basic structure of a TEI document:

Elements Contents
<TEI> The TEI document begins here,
<teiHeader> ... </teiHeader> the header goes here,
<text> the text itself begins here,
<front> ... </front> any front matter goes here,
<body> ... </body> the main body of the text goes here,
<back> ... </back> any back matter goes here,
</text> the text ends here,
</TEI> the TEI document ends here.

3.2.1 Another possible first division of the text: More than one <text> element

The transcriber may want to divide a document into more than one text. This can be done with the <group> element, which should be contained in the top level <text> element taking the place of <body> in the simpler scheme illustrated above. The following structure appears:


<text>
  <front> ... </front>
  <group>
    <text>
      <front> ... </front>
      <body> ... </body>
      <back> ... </back>
    </text>
    <text>
      <front> ... </front>
      <body> ... </body>
      <back> ... </back>
    </text>
  </group>
  <back> ... </back>
</text>

The main structure of the text, at the levels of work, first main division, second main division, first chapter of first main division, second chapter of first main division and so on, can be encoded in different ways. If the electronic document consists of more than one work, the <group> structure illustrated above is the natural choice. In that case, one would get multiple sets of further structural divisions, one set within each of the <body> elements. If the electronic document is considered as a single work, and placed in one <text> element, there will only be a single <body> element that needs further divisions.

3.3 Chapters: <div>

Further division of the <body> block is achieved through <div> elements, with one level nesting inside the other as the transcriber moves down through the hierarchical structure of the text.

3.3.1 Type- and level-specified <div> elements

In a complex document, <div> elements may be specified by @type and @n attributes. In this example, the three first chapters of a work have been contained in <div> elements at the same hierarchical level (siblings):

Elements Contents
<div type="chapter" n="1"> ... </div> Chapter one goes here,
<div type="chapter" n="2"> ... </div> chapter two goes here,
<div type="chapter" n="3"> ... </div> chapter three goes here (and so on).

3.3.2 Unspecified <div> elements

It is also possible to use <div> elements without specifying their type:

Elements Contents
<div> ... </div> Chapter one goes here,
<div> ... </div> chapter two goes here,
<div> ... </div> chapter three goes here (and so on).

3.3.3 Nesting <div> elements

Note that <div> elements may nest inside each other. For example, the levels of work, chapter and then paragraph can be encoded in the following manner:

Elements Contents
<div type="work"> The whole work starts here,
<div type="chapter"> the first subdivision starts here (nested),
<p> ... </p> one paragraph of the subdivision goes here,
</div> end of the subdivision,
</div> end of the work.

While <div> elements may nest as shown here, <p> elements may not. They must be encoded sequentially, i.e. as siblings.

3.4 Paragraph text: <p>

The basic-level element for prose text is the paragraph, <p>. Typically, the deepest level <div> element will contain one or more <p> elements:

Elements Contents
<div> A new chapter starts here,
<head> ... </head> this contains the heading,
<p> ... </p> first paragraph,
<p> ... </p> second paragraph,
<p> ... </p> third paragraph,
</div> the chapter ends here.

The <p> element may appear in other contexts, such as in the <teiHeader> element. It may also contain a number of other elements, but – as underlined above – it may not contain other <p> elements, i.e. it is not allowed to nest.

3.5 Sentences: <s>

The text within a paragraph may be divided into sentences, <s>:

Elements Contents
<p> A new paragraph starts here,
<s> ... </s> first sentence,
<s> ... </s> second sentence,
<s> ... </s> third sentence,
</p> the paragraph ends here.

Paragraphs are usually not divided into sentences, but will simply list the words in sequential order. However, texts that are going to be annotated for syntax should preferably be divided into sentences according to a suitable definition of what constitutes a sentence. In the case of sentences linked by conjunctions (such as ok, en and eða) one might decide to declare a new sentence whenever there is a subject as well as a predicate following the conjunction. See Haugen and Øverland (2014: 65–67) for a discussion of sentence boundaries in Old Norwegian.

3.6 Words: <w>

Although words need not be delimited by anything else than white space, we recommend placing each word in a <w> element.

Elements Contents
<p> A new paragraph starts here,
<w> ... </s> first word,
<w> ... </s> second word,
<w> ... </s> third word,
</p> the paragraph ends here.

If the text is going to be annotated for morphology it is necessary to establish the word boundaries. The <w> element may be given several morphological attributes, in particular @lemma for the dictionary entry and <me:msa> for the grammatical form. These attributes are discussed in ch. 11.2 and ch. 11.3 below.

Another motivation for delimiting words by the <w> element is the fact that a number of graphical words (i.e. sequences of characters with space on either side) should be split or merged according to standard dictionaries, e.g. “aveiðiskap” for “á veiðiskap” and “veiði kona” for “veiðikona”. Splitting and merging of graphical words will be discussed in ch. 5.3 below. The encoding of word constituents will also be discussed here.

If the text is transcribed on a single level, typically the diplomatic level, no specification within the <w> element needs to be made. If, however, the transcription comprises more than one level of transcription, each level should be clearly marked. It is recommended, but not strictly necessary, to place these levels within a <choice> element (cf. ch. 3.13 below). In a multilevel encoding, each <w> element thus includes several other elements, <choice> for stating the fact that the contents of this element are alternatives, <me:facs> for rendering on a facsimile level, <me:dipl> for rendering on a diplomatic level, and <me:norm> for rendering on a normalised level. See ch. 4.5 for examples of multilevel encoding. (The “me:” part of the latter element name means that they are part of the name space established for Menota, as explained in ch. 2.8 above.)

Hyphenation of words is discussed in ch. 5.5 above.

3.7 Punctuation: <pc>

We recommend that punctuation characters are encoded in the same manner as words.

Elements Contents
<w> one word,
<w> another word,
<pc> a punctuation mark,
<w> a new word appears here,
<w> and one more.

If the text is transcribed on a single level, typically the diplomatic level, no specification within the <pc> element needs to be made. However, if the transcription comprises more than one level of transcription, each level should be clearly marked. The procedure is identical to the one specified for words above. See ch. 5.4 below for more details.

3.8 Metrical text: <lg> and <l>

The elements discussed here are defined and explained in ch. 6: Verse of the TEI P5 Guidelines.

Texts in verse should be encoded using <lg> (line group), which in turn contains one or more <l> elements (lines). As with <div>, <lg> elements can nest. According to the TEI Guidelines <lg> is a sibling of, i.e. at at the same level as, <p>, and cannot be contained within it (unless it appears within a <q> element). Example:

Elements Contents
... </p> A paragraph ends here,
<lg> a line group starts here,
<l> ... </l> first line,
<l> ... </l> second line,
<l> ... </l> third line,
</lg> the line group ends here,
<p> ... and a new paragraph starts here.

Nesting of <lg> elements is useful for marking up longer poems. When a poem consists of two levels of line groups one may encode its structure as shown here:

Elements Contents
<lg type="stanza"> Here a line group on level one begins, a stanza,
<lg type="couplet"> here a subgroup starts, a couplet,
<l> ... </l> the first line,
<l> ... </l> second line,
</lg> and here the subgroup ends, the first of the couplets.
<lg> Here a new subgroup starts,
<l> ... </l> line,
<l> ... </l> line,
</lg> here the second subgroup ends,
</lg> and here the level one line group ends.

The <lg> and <l> elements may have several attributes, among other things for encoding information about rhyme or other metrical phenomena. See ch. 13 of this handbook for a more detailed presentation of metrical encoding.

Having <p> and <lg> as siblings can create problems for the encoding of prosimetrum texts, where lines or verse or even whole poems can appear within prose text, often as part of direct speech. However, rather than including <lg> directly within the <p> element, we recommend inserting the <p> and <lg> elements within <div> elements, using one <div> for each of them:

Elements Contents
<div type="chapter" n="1"> A chapter opens here,
<div type="text"> beginning with some prose text, indicated by a <div> element.
<p> ... </p> The text goes here,
</div> and ends here, indicated by the <div> element.
<div type="stanza"> Then a poem begins, indicated by a new <div> element
<lg> with a linegroup (a stanza)
<l> ... </l> containing some lines.
</lg> The linegroup ends here,
</div> ... and the poem (i.e. the <div> element) also ends here.
<div type="text"> A new piece of prose text begins, indicated by a new <div> element.
<p> ... </p> The text goes here,
</div> and ends here, indicated by the <div> element.
</div> The chapter ends here.

3.9 Headings: <head>

The element <head> is used for containing headings on all levels of the document.

Elements Contents
<div> Here a chapter begins,
<head> ... </head> its heading,
<p> ... </p> the first paragraph of the chapter,
<p> ... </p> the second paragraph,
</div> and here the chapter ends.

It is not required that a <div> contains a head, but if the <head> element is being used within a <div>, it should be the first element, followed by one or more <p> (or <lg>) elements. There can only be one <head> element within a <div>:


<div>
  <head>...</head>
  <p>...</p>
  <p>...<p>
</div>

From a logical point of view, a heading is located at the beginning of a new chapter. If one accepts that the heading may be located on the same line as the end of the previous chapter, this is not uncommon in manuscripts. One should not expect headings in manuscripts to be set out with blank lines above and below as in modern print.

Fig. 3.1. A chapter heading. Landslǫg Magnúss Hákonarsonar. Holm perg 34 4to, f. 10v, l. 1–10.

In fig. 3.1, the heading, which is written in red, is located after the end of one chapter and at the beginning of a new chapter. The sequence of elements is thus straight-forward:


<p>. . .
raðe litizt annat loglegare              
</div>
<head>Um grið a frosta þingi Capitulus</head>
<div type="chapter">
<p>Aller þeir menn sem i frosto þings logum oc for ero skulu i griðum
. . . </p>
</div>

Unfortunately (from the point of view of the transcriber) many manuscripts locate headings so that they are in conflict with this simple sequencing, for example by positioning them at the end of more than one line, or even in the middle of a line containing the beginning of one chapter and the end of the previous chapter. Examples of these complications are discussed in some detail in ch. 16.3 below.

3.10 Initials and highlighted characters: <c>

As stated in ch. 3.3 above, the structuring of a medieval manuscript into chapters is marked up by <div> elements. Each chapter is usually indicated by an initial, often decorated and in colour, and a heading, a rubric, typically in red ink, as illustrated in fig. 3.2.

Fig. 3.2. Initials and highlighted characters. The Old Norwegian Homily Book. AM 619 4to, f. 47r, l. 15–24.

The initial “S” marks the opening of a new homily, while the emphasised characters mark divisions at a lower level, such as the “Д in the 4th line of the image, the “H” in the 5th line, etc. An initial as well as an emphasised character (littera notabilior) should be encoded with the <c> element.

The <c> element is discussed in ch. 5.2.3 below, while ch. 7 explains the encoding of graphical elements of initials.

See also Kirsten Berg (2010: 60–66) for a discussion of initials and highlighted characters in the structuring of a manuscript text like The Old Norwegian Homily Book.

3.11 Quotations and quotes: <q> and <quote>

Dialogue are usually set out by quotation marks in edited copy, and in the encoding of a text, we recommend that the <q> element should be used for identifying dialogue. Medeival manuscripts did not use quotation marks, so adding quotation marks is an editorial intervention which basically belongs to the normalised level of text encoding.

Quotation from other sources should be tagged with the <quote> element. This is obviously another type of editorial intervention.

Element Contents
<q> contains a part of a dialogue
<quote> contains a quotation from another source

A quotation mark can be regarded as a type of punctuation character, <pc>, but since quotation marks are displayed in so many different ways, we suggest that it is convenient to use a tag of its own for quotation marks. See the discussion and examples in ch. 5.6 below.

In some cases, there may be a structural conflict between the opening and closing of a <q> element, such as in this example if this were to be encoded for <s> as well as <q>:

  • Eyvindr sagðisk eigi mundu brott undan ríða, “því at ek veit eigi hverir þessir eru. Mundi þat mǫrgum manni hlǿgiligt þykkja ef ek renn at ǫllu úreyndu.” (Hrafnkels saga Freysgoða, ch. 18)

A possibe work-around is to use the general empty element <milestone/> with the obligatory @unit attribute and a @rendition attribute specifying the opening and closing quotation mark:


<s>Eyvindr sagðisk eigi mundu brott undan ríða, 
<milestone unit="turn" rendition="opening-quote"/>því at ek veit eigi hverir þessir eru.</s>
<s>Mundi þat mǫrgum manni hlǿgiligt þykkja ef ek renn at ǫllu úreyndu.<milestone 
unit="turn" rendition="closing-quote"/>

There are several other ways of solving this type of conflict, as we discuss at some length in ch. 16 below. See also the TEI P5 Guidelines, ch. 20: Non-hierarchical Structures.

3.12 Page, column and line breaks: <pb/>, <cb/>, <lb/>

3.12.1 Page breaks and column breaks

TEI uses the empty element <pb/> to indicate page breaks. This element has an attribute @n which can be used for the page numbers. As it is customary to refer to the manuscript leaves, rather than pages, the value of the @n attribute should indicate front or back pages (recto, verso). Column breaks, <cb/>, should also be indicated in manuscripts with two or more columns. Recommended values for the @n attribute of the <cb/> element are “A”, “B” and so on. For texts that are going to be displayed in the Menota archive, it is necessary to add an ed="ms" attribute to specify that the page and column breaks refer to the manuscript (and not to e.g. an edition of the text). Example:

Elements Contents
<pb ed="ms" n="1r"/> Folio one, recto page, begins here,
<cb ed="ms" n="A"/> the first column begins here,
<cb ed="ms" n="B"/> and the second column begins here.
---
<pb ed="ms" n="1v"/> Folio one, verso page, begins here,
<cb ed="ms" n="A"/> the first column of the verso page begins here,
<cb ed="ms" n="B"/> and the second column begins here.

Page break information from, for example, a printed standard edition, can be encoded in addition to the <pb ed="ms"/> tagging that refers to the manuscript itself. If one for example would like to add page break information from a standard edition, one should use the @ed attribute with a suitable value:


<pb ed="Standard Edition" n="1"/>

Some of the texts in the Menota archive have been encoded with more than one document structure, such as Konungs skuggsjá in AM 243 bα fol. In addition to the document structure of the manuscript, ed="ms", it has the Arnamagnæan pagination, ed="AM", the one from Finnur Jónsson’s 1920–1921 edition, ed="FJ_utg", and the one from Holm-Olsen’s 1945 edition, ed="H-O".

The relevant editions should be declared in the header. As a rule, we suggest that the edition is identified by the initials of the editor + the year of publication, e.g. FJ1920-1921, LHO1945, etc.

3.12.2 Line breaks

Line breaks are also indicated with an empty element, the <lb/>, which is placed at the beginning of a new line and may be numbered by using the @n attribute. For texts that are going to be displayed in the Menota archive, it is necessary to add an ed="ms" attribute to specify that the line breaks refer to the manuscript (and not to e.g. an edition of the text):


<lb ed="ms" n="1"/>Line number one begins here.

We recommend that each page, column and line be identified with an element at the very beginning. So for a manuscript with two columns, the three first lines in the first column on the back of the third leaf (folio) would be encoded in this manner:


<pb ed="ms" n="3v"/><cb ed="ms" n="A"/><lb ed="ms" n="1"/>This is the first line.
<lb ed="ms" n="2"/>This is the second line.
<lb ed="ms" n="3"/>This is the third line.
etc.

In other words, there should be as many <pb/> elements as there are pages, as many <cb/> elements as there are columns, and as many <lb/> elements as there are lines. We strongly discourage the use of the <lb/> element in the same way as the <br> element in HTML, in which there typically is one <br> element less than the number of lines (as the <br> element is inserted between the lines).

We recommend that <lb/> is used consistently for indicating the line breaks of the manuscript itself and that this is specified by the ed="ms" attribute. As discussed above, one may include more than one layer of line break encoding, distinguishing them from each another with the @ed attribute.

It is not uncommon for a word to be continued on the next line, with or without a hyphenation mark in the margin. In a single-level encoding, the <lb/> element will be located within the relevant <w> element, and in a multi-level encoding it will be repeated for each level. See examples of this in ch. 5.5 below.

3.13 Alternative encoding: <choice>

The <choice> element should be used whenever the encoded text string offers parallel readings. In the context of Menota encoding, there are two major instances of this, (1) words transcribed on more than one level, and (2) the combination of a <sic> and a <corr> element.

The first type is what we refer to as multilevel encoding, and it is discussed in a number of chapters, in particular ch. 4.5 below. The second type is discussed and exemplified in ch. 9.3.3 below.

3.14 Scribal hands: <handShift/>

While some manuscripts are written in a single hand throughout, it is not uncommon with more than one hand. It is not a requirement that the transcriber notes a change of hand, since it can be demanding to decide the number of hands and the location of the breaks between them. However, a transcriber may want to add information on hands in the document, either based on existing research or the transcriber’s own observations.

Since a shift of hands in most cases will conflict with other divisions of the text, the recommended element is an empty one, <handShift/>. This is an element of the same type as the ones discussed in ch. 3.12 above. If there are two hands in a document there will thus be a single <handShift> element, located at the break between hand 1 and hand 2.

Information on the hands of a document can be given as part of the header in the <handDesc> element. See ch. 14.3.3 below.