Chapter 4. Document structure

Version 1.0 (20 May 2003)

 

4.1 Introduction: The structure of the manuscript vs. the structure of the work
4.2 Main divisions of a TEI document
4.3 Chapters: <div>
4.4 Paragraph text: <p>
4.5 Metrical text: <lg> and <l>
4.6 Headings: <head>
4.7 Page-, column-, and linebreaks: <pb/>, <cb/>, <lb/>
4.8 Punctuation

 

4.1 Introduction: The structure of the manuscript vs. the structure of the work

Viewed as physical objects, rather than as vehicles for texts, manuscripts have a certain structural hierarchy. What is regarded as a single manuscript may in fact comprise more than one volume; Flateyjarbók, for example, is bound in two volumes, and the large rímur codex Acc. 22 in three. A manuscript book is made up of quires or gatherings, each of which contains a number of leaves, normally eight. Each leaf has a recto side and a verso side, and each side may be further divided into columns. The text is then written in lines across the page or column. In order to be able to locate a word quickly and easily, all, or at least most, of these structural divisions must be registered. We need to know that a given word appears in the fifth line of the right-hand or b column on the recto side of folio 34. As it is customary to foliate manuscripts without regard to their quire division, the quires will not normally need to be included in the hierarchical structure, but since the quiring can have implications for the text itself this division should be indicated, and will also generally form part of the <msDescription> element, found in the document header.

At the same time, manuscripts do of course contain texts, which is the reason most of us are interested in them in the first place. A single manuscript will often contain more than one work, each of which may, in the case of lengthy prose works such as sagas, be divided into chapters or sections. In the case of poetry, rímur for example, a single work (rímnaflokkur) will usually consist of several cantos or fits, each containing a number of stanzas, each made up of a number of lines. It may be necessary to group these lines in other ways as well. The stanzas comprising the mansöngur should be distinguished from the main body of the fit, for example, while to facilitate certain types of metrical analysis it might be desirable to divide the individual stanzas into couplets. Some types of poetry, such as the vikivakakvæði, will have a refrain or burden, which should ideally also be distinguished from the narrative section(s) of the stanza.

XML has at its foundation the notion of a text as a single hierarchical structure, which means that it does not work well where there are several competing hierarchies, as is obviously the case when one wishes for example to indicate the line divisions both in a poem and in the manuscript in which the poem is contained. The TEI Guidelines offer various solutions to this problem, enabling both the structure of the document and the structure of the text to be encoded.

 

4.1.1 Hierarchical divisions

The principal means of representing hierarchy is the <div> (i.e. 'division') element. <div> elements may freely nest within each other. The <div> element has, in addition to the universally available id and n attributes, a type attribute, which specifies the name conventionally given to the level of division: e.g. 'chapter', 'stanza', 'couplet' if attempting to represent the structure of the text, or 'page', 'column', 'line' if the physical structure of the manuscript is to be preferred. It will be convenient to specify a value for the type attribute in the <div> element at least each time a change of level occurs. The software, however, will keep count of the levels of nesting even if the type attribute is not used.

The complex structure of a work such as a set of rímur could be represented by using four levels of <div> elements: <div type="canto"> for the cantos or fits, <div type="part"> for the parts (for example the mansöngvar), <div type="stanza"> for the stanzas, and <div type="line"> for the lines. If the manuscript being encoded contains more than one set of rímur, as is frequently the case, it might be sensible to add a further, enclosing level of <div> for each set. A simpler form of mark-up is possible, however. Instead of <div> elements, the tags <l> (for 'line') and <lg> (for 'line-group', i.e. a group of lines functioning as a formal unit) can be used, reserving the <div> element for larger structural units. The type attribute is then used to identify the type of unit, e.g. 'stanza', 'couplet', as in <lg type="stanza">. Here again the type need only be defined once. Lines and line-groups can also be numbered and identified using the n and id attributes.

This type of markup focusses on the hierarchical structure of the text. The actual physical realisation of the text is considered of secondary importance - if of importance at all - when dealing with modern printed literary works: little significance is attached to the page and line breaks in the various editions of, say, Orwell's Nineteen Eighty-Four. In some cases, however, such as the early editions of Joyce's works that were supervised by the author himself, the physical make-up of the text can be of great consequence. It may also be necessary to maintain the pagination and lineation of standard editions of major works, as these are frequently used in citations in scholarly works. In the case of chirographically transmitted material, the physical organisation of the text is more likely to be recognised as being of importance and in need of encoding. This can be done hierarchically, as above, using <div> elements, which are then given the appropriate type attributes, e.g. 'page', 'column' or 'line', but it seems more appropriate to reserve these elements for structural divisions in the text, while indicating the physical structure of the document through the use of so-called 'milestone' tags.

The rest of this chapter presents how the text may be encoded at higher structural levels than characters and words. Important elements here are the larger divisions of the text, like chapters, paragraphs (with headings), and stanzas. This chapter also presents how pagination and foliation, together with column-breaks and line-breaks, may be encoded. The following TEI elements are presented:

Elements               Contents
<text>, <body>         Main divisions of the text,
<div>                  division into chapters (multiple levels are encoded by nesting elements),
<p>                    prose paragraphs,
<lg>, <l>              line groups and lines,
<head>                 headings,
<pb/>, <cb/>, <lb/>    page-, column- and line-breaks.

 

4.2 Main divisions of a TEI document

The following presentation is based on Chapter 7 of the TEI Guidelines.

A TEI document is always enclosed at its highest level by the start tag <TEI.2> and the end tag </TEI.2>. Within the <TEI.2> element, two other elements appear in a fixed order, namely the <teiHeader> and the <text> elements. Within the <text> element, the body text appears, enclosed in the element <body>. If the text has front matter, it is contained in an element <front>, placed before <body>. Similarly, there may be an element <back>, placed after <body> and containing back matter. The elements <teiHeader>, <text> and <body> are required in any TEI-conformant document, while <front> and <back> are optional. This, then, is the basic structure of a TEI document:

Elements                          Contents
<TEI.2>                           the TEI document begins here,
  <teiHeader> ... </teiHeader>    the header goes here,
  <text>                          the text itself begins here,
    <front> ... </front>          any front matter goes here,
    <body> ... </body>            the main body of the text goes here,
    <back> ... </back>            any back matter goes here,
  </text>                         the text ends here,
</TEI.2>                          the TEI document ends here.
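
The outline above can be checked mechanically. The following is a minimal sketch in Python, using only the standard library; the function name check_basic_structure and the sample string are illustrative assumptions, not part of TEI or of this handbook. It covers only the simple scheme with a single <body> and does not look inside the header.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the outline above.
SAMPLE = """<TEI.2>
  <teiHeader> ... </teiHeader>
  <text>
    <front> ... </front>
    <body> ... </body>
    <back> ... </back>
  </text>
</TEI.2>"""


def check_basic_structure(xml_string):
    """Return a list of problems; an empty list means the outline is as described."""
    root = ET.fromstring(xml_string)
    problems = []
    if root.tag != "TEI.2":
        problems.append("root element is not <TEI.2>")
    # <teiHeader> and <text> must appear in exactly this order.
    if [child.tag for child in root] != ["teiHeader", "text"]:
        problems.append("<TEI.2> should contain <teiHeader> followed by <text>")
    text = root.find("text")
    if text is not None:
        parts = [child.tag for child in text]
        # <body> is required; <front> and <back> are optional but fixed in position.
        allowed = (["body"], ["front", "body"], ["body", "back"],
                   ["front", "body", "back"])
        if parts not in allowed:
            problems.append("<text> should contain <body>, optionally "
                            "between <front> and <back>")
    return problems
```

Calling check_basic_structure(SAMPLE) returns an empty list; a <text> element without a <body> yields one reported problem.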

 

4.2.1 Another possible first division of the text: More than one <text> element

The transcriber may want to divide a document into more than one text. This can be done with the <group> element, which is contained in the top-level <text> element and takes the place of <body> in the simpler scheme illustrated above. The following structure appears:

<text>
  <front> ... </front>
  <group>
    <text>
      <front> ... </front>
      <body> ... </body>
      <back> ... </back>
    </text>
    <text>
      <front> ... </front>
      <body> ... </body>
      <back> ... </back>
    </text>
  </group>
  <back> ... </back>
</text>

The main structure of the text, at the levels of work, first main division, second main division, first chapter of first main division, second chapter of first main division and so on, could be encoded in different ways. If the electronic document consists of more than one work, the <group> structure illustrated above is the natural choice. In that case one would get multiple sets of further structural divisions, one set within each of the <body> elements. If the electronic document is considered as only one work, and placed in one <text> element, we only have one single <body> element that needs further divisions.
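
To illustrate the difference between the two schemes, the sketch below collects the <body> elements that need further division, whether the top-level <text> holds a single <body> or a <group> of texts. It is a sketch in Python under stated assumptions: the function name bodies and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one electronic document holding two works.
SAMPLE = """<text>
  <front> ... </front>
  <group>
    <text><front> ... </front><body>first work</body><back> ... </back></text>
    <text><front> ... </front><body>second work</body><back> ... </back></text>
  </group>
  <back> ... </back>
</text>"""


def bodies(text_element):
    """Return the <body> elements below a <text>, in document order."""
    group = text_element.find("group")
    if group is None:
        # Simple scheme: at most one <body> directly inside <text>.
        body = text_element.find("body")
        return [] if body is None else [body]
    found = []
    for inner in group.findall("text"):
        # Each grouped <text> is handled like a top-level one.
        found.extend(bodies(inner))
    return found
```

For the sample, bodies(ET.fromstring(SAMPLE)) returns two elements, one per work; for a document with a single <body> it returns one.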

 

4.3 Chapters: <div>

Further division of the <body> block is achieved through <div> elements, with one level nesting inside the other as we move down through the hierarchical structure of the text.

 

4.3.1 Type- and level-specified <div> elements

<div> elements may or may not be type specified and/or numbered, as said above. With type and n attributes, the first three chapters of a work may be contained in <div> elements at the same hierarchical level (siblings) like this:

 

Elements                                 Contents
<div type="chapter" n="1"> ... </div>    Chapter one goes here,
<div type="chapter" n="2"> ... </div>    chapter two goes here,
<div type="chapter" n="3"> ... </div>    chapter three goes here (and so on).

 

4.3.2 Unspecified <div> elements

One alternative is to use <div> elements without specifying their type, like this:

Elements           Contents
<div> ... </div>   Chapter one goes here,
<div> ... </div>   chapter two goes here,
<div> ... </div>   chapter three goes here (and so on).

4.3.3 Nesting <div> elements

<div> elements may nest inside each other, as was said. For example, the levels of work, chapter and then paragraph, can be encoded in the following manner:

Elements                  Contents
<div type="work">         The whole work starts here,
  <div type="chapter">    the first subdivision starts here (nested),
    <p> ... </p>          one paragraph of the subdivision goes here,
  </div>                  end of the subdivision,
</div>                    end of the work.

Note that while <div> elements may nest like this, <p> elements may not.
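
Both points can be verified on a transcription. The following Python sketch, using only the standard library, reports how deeply <div> elements nest and lists any <p> that contains another <p>. The function names div_depth and nested_paragraphs are hypothetical, and a validating parser working from the DTD would catch the second problem as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the nesting shown above.
SAMPLE = """<body>
  <div type="work">
    <div type="chapter">
      <p> ... </p>
    </div>
  </div>
</body>"""


def div_depth(element, depth=0):
    """Return the deepest level of <div> nesting at or below element."""
    if element.tag == "div":
        depth += 1
    # The deepest level is the maximum over this element and all its children.
    return max([depth] + [div_depth(child, depth) for child in element])


def nested_paragraphs(element):
    """Return every <p> that has another <p> somewhere inside it."""
    # p.iter("p") also yields p itself, so that case is excluded explicitly.
    return [p for p in element.iter("p")
            if any(inner is not p for inner in p.iter("p"))]
```

For the sample, div_depth(ET.fromstring(SAMPLE)) is 2 and nested_paragraphs finds nothing.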

 

4.4 Paragraph text: <p>

The basic-level element for prose text is the paragraph, <p>. Typically, the deepest level <div> element will contain one or more <p> elements, like this:

Elements                Contents
<div>                   A new chapter starts here,
  <head> ... </head>    this contains the heading,
  <p> ... </p>          first paragraph,
  <p> ... </p>          second paragraph,
  <p> ... </p>          third paragraph,
</div>                  the chapter ends here.

The <p> element may also appear in other contexts, such as the <teiHeader> element (see TEI P3, printed edition, p. 1090 f). It may also contain a number of other elements, but - as mentioned above - it is not permitted to contain other <p> elements (to nest, that is).

 

4.5 Metrical text: <lg> and <l>

Below we present elements that are defined and explained in chapter 9 of the TEI Guidelines.

Texts in verse should be encoded using <lg> (linegroup), which in turn contains one or more <l> elements (lines). As with <div>, <lg> elements can nest. According to the TEI Guidelines <lg> is a sibling of, i.e. at the same level as, <p>, and cannot be contained within it (unless it appears within a <q> element). Example:

 

Elements          Contents
... </p>          Here ends a paragraph,
<lg>              here starts a linegroup,
  <l> ... </l>    first line,
  <l> ... </l>    second line,
  <l> ... </l>    third line,
</lg>             here ends the linegroup,
<p> ...           and here starts a new paragraph.

This can create problems for the encoding of prosimetrum texts, where lines of verse or even whole poems can appear within prose text, often as part of direct speech; in such cases it is necessary to include <lg> directly within <p>, which requires a very slight modification to the DTD.

Nesting of <lg> elements is useful for marking up longer poems. When the poem consists of two levels of linegroups we may encode its structure as in the following example.

 

Elements                 Contents
<lg type="stanza">       Here starts a linegroup on level one, a stanza
  <lg type="couplet">    here starts a subgroup, a couplet
    <l> ... </l>         the first line,
    <l> ... </l>         second line,
  </lg>                  and here ends the subgroup, the first of the couplets.
  <lg>                   Here starts a new subgroup,
    <l> ... </l>         line,
    <l> ... </l>         line,
  </lg>                  here ends the second subgroup,
</lg>                    and here ends the level one linegroup.

These elements are defined with several attributes, among other things for encoding information about rhyme or other metrical phenomena. See ch. 9 of this handbook for a more detailed presentation of metrical encoding.
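
As an illustration of how the nested structure can be read back, the Python sketch below lists each linegroup with its level, its type, and the number of lines it contains directly. The function name outline and the sample are hypothetical; as in the example above, the second couplet carries no type attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one stanza made up of two couplets.
SAMPLE = """<lg type="stanza">
  <lg type="couplet">
    <l> ... </l>
    <l> ... </l>
  </lg>
  <lg>
    <l> ... </l>
    <l> ... </l>
  </lg>
</lg>"""


def outline(lg, level=1):
    """Return (level, type, lines directly contained) for lg and its subgroups."""
    rows = [(level, lg.get("type"), len(lg.findall("l")))]
    for sub in lg.findall("lg"):
        # Nested linegroups sit one level further down.
        rows.extend(outline(sub, level + 1))
    return rows
```

For the sample, the stanza itself holds no lines directly, and each of the two subgroups holds two.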

 

4.6 Headings: <head>

The element <head> is used for containing headings on all levels of the document. If <head> is placed at the start of a <div> element, it typically contains a chapter heading:

 

Elements                Contents
<div>                   Here starts a chapter,
  <head> ... </head>    its heading,
  <p> ... </p>          first paragraph of the chapter,
  <p> ... </p>          second paragraph,
</div>                  and here ends the chapter.

The level for a heading follows from the enclosing element. A <head> element within a level three <div> element is a heading for a level three partition of the text.

An overlap problem may occur when, as is common in Old Norse manuscripts, headings for chapters are placed on the same text line as the last words of the preceding chapter. Graphically, the heading of a following chapter is then in fact placed inside the text block of the preceding chapter. As we, in our encoded transcription, want to place headings at the start of the textual divisions where they logically belong, we must override the structure of the layout. One way to do that is to ignore the heading of the following chapter when transcribing the last lines of the preceding chapter. When that chapter is closed with an end tag </div>, we open the next chapter with its start tag <div>, go back one or two lines in the manuscript to where the heading starts, and transcribe from there.

It is generally recommended (ch. 4.7) that line break elements <lb/> are put in as we transcribe, every time we move down to the next line of text in the manuscript. Following that rule, it is obvious that we cannot keep one single series of linebreak elements through the intersection between the chapters in the case of heading overlap. It is however not invalid according to TEI that <lb/> elements carrying the same number occur twice. Our recommendation is to use that possibility. When moving up again to encode the heading of the following chapter, then assign the actual number of that graphic line to its <lb/> element.

Consider the following column (line numbers in left margin):

05 ...............................
06 .... these are the last
07 Header for     words of
08 chapter two    chapter 1.
09 Here begins the text
10 of chapter two .........
11 ..............................

The example would be encoded this way (word tags omitted):

... <lb n="6"/> these are the last <lb n="7"/> words of <lb n="8"/>chapter 1. </p></div>
<div><head><lb n="7"/>Header for <lb n="8"/> chapter two </head><p><lb n="9"/>Here begins the text <lb n="10"/>of chapter two...

This encoding will not record whether the header is placed at the left or right side of the column. If headers are not always placed on the same side, that information can be included in an attribute. If they always are, one can simply state in the TEI header of the electronic transcription how headers are placed in this particular manuscript.

When double numbering of linebreaks is used in a transcription, one should make sure that any automatic numbering program that is run on the <lb/> elements is set up not to override manually given numbers.
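
One way such a numbering program could behave is sketched below in Python. This is an assumption about a possible tool, not a description of any existing one: the function name number_linebreaks is hypothetical, all n values are assumed to be plain integers, and the sample covers a single column. An <lb/> that already carries an n attribute is left untouched, and counting resumes from its value, so only the first <lb/> of the overlapping heading needs a manual number.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample based on the heading overlap above; only the first
# <lb/> of the column and the first <lb/> of the heading are numbered by hand.
SAMPLE = """<body>
<div><p><lb n="6"/>these are the last <lb/>words of <lb/>chapter 1.</p></div>
<div><head><lb n="7"/>Header for <lb/>chapter two</head>
<p><lb/>Here begins the text <lb/>of chapter two</p></div>
</body>"""


def number_linebreaks(root):
    """Number every <lb/> lacking n; keep manual numbers and resume from them."""
    current = 0
    for lb in root.iter("lb"):  # iter() visits elements in document order
        manual = lb.get("n")
        if manual is not None:
            # Manually given number: never overridden, counting resumes here.
            current = int(manual)
        else:
            current += 1
            lb.set("n", str(current))
    return [lb.get("n") for lb in root.iter("lb")]
```

Run on the sample, the line numbers come out as 6, 7, 8 for the end of chapter one and 7, 8, 9, 10 from the heading onwards, matching the encoding shown above.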

 

4.7 Page-, column- and linebreaks: <pb/>, <cb/>, <lb/>

4.7.1 Page-breaks

TEI uses the empty element <pb/> to mark page-breaks. This element has an attribute n, which can be used for noting the page number. As it is customary to refer to manuscript leaves rather than pages, the value of the n attribute should indicate whether the page is the front or the back of the leaf (recto, verso). Column-breaks will also need to be marked up. To do that, we use the TEI element <cb/>. If you want explicit information about column numbers, we suggest using "A", "B" and so on, in the n attribute of <cb/>. Example:

 

Elements        Contents
<pb n="1r"/>    Here starts folio one, page recto,
<cb n="A"/>     here starts first column,
<cb n="B"/>     and here starts second column.
<pb n="1v"/>    Here starts folio one, page verso,
<cb n="A"/>     here starts the first column of the verso page,
<cb n="B"/>     and here starts second column.

Page-break information from, for example, a printed standard edition, can be encoded in addition to the <pb/> tagging that refers to the manuscript itself. This may be done in the same manner as described for linebreaks, below.

 

4.7.2 Line-breaks

These are also marked with an empty element, the <lb/>, which is placed at the start of a new line and may be numbered by using the n attribute:

<lb n="1"/> Here starts line number one.

When transcribing Old Norse sources, we recommend that <lb/> is used for indicating the linebreaks of the manuscript itself. One might also include more than one layer of linebreak encoding, distinguishing them from one another with the type attribute. If one for example wanted to add linebreak information from a standard edition, it could be done by adding tags like this:

<lb type="Standard Edition" n="1"/>
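
Because <pb/>, <cb/> and <lb/> are empty milestone elements, the position of a word is not given by an enclosing element but by the milestones that precede it. The Python sketch below shows one way of recovering folio, column and line for each word. It rests on several assumptions: the function name locate_words is hypothetical, words are simply split on whitespace, <w> elements are omitted, and an <lb/> carrying a type attribute is taken to belong to another layer (such as a standard edition) and is skipped.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with placeholder words.
SAMPLE = ('<p><pb n="34r"/><cb n="A"/><lb n="1"/>alpha beta '
          '<lb n="2"/>gamma <cb n="B"/><lb n="1"/>delta '
          '<lb type="Standard Edition" n="9"/>epsilon</p>')


def locate_words(root):
    """Return (page, column, line, word) for every word, in document order."""
    page = column = line = None
    located = []

    def add(text):
        for word in (text or "").split():
            located.append((page, column, line, word))

    def visit(element):
        nonlocal page, column, line
        if element.tag == "pb":
            page, column, line = element.get("n"), None, None
        elif element.tag == "cb":
            column, line = element.get("n"), None
        elif element.tag == "lb" and element.get("type") is None:
            # Only untyped <lb/> elements count as manuscript lines here.
            line = element.get("n")
        add(element.text)
        for child in element:
            visit(child)
            add(child.tail)  # text following a milestone is its tail

    visit(root)
    return located
```

In the sample, the word "gamma" is located on folio 34r, column A, line 2, while "epsilon" stays on line 1 of column B because the preceding typed <lb/> is ignored.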

 

4.8 Punctuation

It is generally recommended that punctuation marks are placed outside the word elements <w>. This means that the dot "." is to be placed outside of the <w> element when it has a pause function. The dot may however also function as an abbreviation sign, or even as both an abbreviation sign and a pause mark at the same time. When the dot has an abbreviating function, it is recommended that it, like other abbreviation signs, be transcribed inside the <w> element using an entity (&dot;). If that same dot is also interpreted as a pause mark, it is recommended that the transcriber place an additional dot immediately following the <w> element and enclosed in a <supplied> element. That way it is made explicit which of the signs has been added by the transcriber.

In the first of the following examples, the dot has only the abbreviating function. In the second example it can be interpreted as being also a pause mark:

Manuscript: "kgr. sagdi"
Facsimile level: <w>kgr&dot;</w> <w>sagdi</w>

Manuscript: "nu sagdi kgr."
Facsimile level: <w>nu</w> <w>sagdi</w> <w>kgr&dot;</w><supplied resp="transcriber" reason="implicit">.</supplied>

In the latter case priority is given to the abbreviation mark, which is encoded as being present in the manuscript. The pause function is considered secondary and only implicitly present, wherefore the <supplied> tag is used. One may indicate with the "reason=" attribute that the sign is indeed implicitly present in the manuscript.

 

 


 

 

Preliminary version created 4 March 2002. Version 1.0 published 20 May 2003.