Ch. 16. Dealing with overlapping structures

Version 3.0 beta

This is a preliminary version which can be changed or updated at any time.
This chapter has been written by Andreas Witt and Odd Einar Haugen.

 

16.1 Introduction

There is no simple way of encoding overlapping structures in XML, since an XML document is a strict tree in which every element other than the root must be contained in a single “parent” element. For example, a word or sentence may run across two manuscript pages. If we represent each manuscript page as an element, such a word cannot belong to a single page, and any attempt to encode it within both pages yields a document that is not well-formed and that a parser will reject.

This problem is dealt with in ch. 3 by using empty elements to represent page breaks in the manuscript, rather than elements containing a whole page of text. The same applies to columns and lines, where words, sentences and paragraphs routinely overlap with the physical features of the manuscript. These elements, <pb/> , <cb/> and <lb/> , are empty in the sense that they are inserted at a specific point in the structure without any extension. For this reason, they are often referred to as milestones. Note the position of the slash in these elements.

In many cases, the combination of ordinary, hierarchical elements such as <div> , <s> , <w> , <pc> with the milestones exemplified here gives the encoder sufficient flexibility. It is thus possible to represent in XML a text which from the point of view of its contents is divided into chapters, sentences, words and punctuation marks, and from the point of view of its embodiment in a physical document is divided into pages, columns and lines. This way, the same text can easily be compared between two or more manuscripts of different length and size, as long as the division into chapters etc. is the same.
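As an illustration of how the two views can be read from one tree, the following Python sketch reports the manuscript line on which each word begins. The fragment is a simplified, hypothetical sample (no namespaces, no textual levels), and the function name words_with_lines is our own.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment: the <lb/> milestone falls inside the third word.
sample = (
    '<text><lb n="2"/>'
    '<w>scalldu</w> <w>manraun</w> <w>sli<lb n="3"/>ca</w> <w>morlins</w>'
    '</text>'
)

def words_with_lines(root):
    """Return (word, line number) pairs; the line is the one on which the word begins."""
    current_line = None
    result = []
    # iter() visits elements in document order, so every <lb/> is seen
    # before the words that follow it.
    for el in root.iter():
        if el.tag == "lb":
            current_line = el.get("n")
        elif el.tag == "w":
            result.append(("".join(el.itertext()), current_line))
    return result

for word, line in words_with_lines(ET.fromstring(sample)):
    print(line, word)
```

The word that crosses the line break remains a single <w> element and is reported on the line where it starts.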

There are, however, cases where this solution does not work. Solutions will be discussed in some detail below.

16.2 Spanning and linking encoding

In ch. 11 “Representation of Primary Sources” in the TEI P5 Guidelines the elements <addSpan/> , <delSpan/> and <damageSpan/> are defined. These elements are counterparts to the elements <add> , <del> and <damage> , but are all empty, and should be used when the feature to be encoded crosses structural divisions. There are in fact many more elements which can cross structural divisions, e.g. <sic> , <corr> , <unclear> and <supplied> , but there are no corresponding <sicSpan> , <corrSpan> , <unclearSpan> and <suppliedSpan> . Rather than adding these and several other elements, we recommend using one generic empty element to cover all cases of overlapping structures. We have called this new element <me:textSpan/> and given it attributes from the classes “att.spanning”, “att.transcriptional”, “att.typed” and “att.global”, and the attribute @me:category:

Elements and attributes Contents
<me:textSpan/> A generic element to handle overlapping text structures
    @category Specifies the type of span, restricted to this list of values:
    'gap' for contents that would otherwise be contained by the <gap/> element, cf. ch. 8
    'damage' for contents that would otherwise be contained by the <damage> element, cf. ch. 8
    'unclear' for contents that would otherwise be contained by the <unclear> element, cf. ch. 8
    'add' for contents that would otherwise be contained by the <add> element, cf. ch. 9
    'del' for contents that would otherwise be contained by the <del> element, cf. ch. 9
    'sic' for contents that would otherwise be contained by the <sic> element, cf. ch. 9
    'corr' for contents that would otherwise be contained by the <corr> element, cf. ch. 9
    'surplus' for contents that would otherwise be contained by the <surplus> element, cf. ch. 9
    'supplied' for contents that would otherwise be contained by the <supplied> element, cf. ch. 9
    'other' for any other contents
    @spanTo Specifies the end point of the text span, using values like:
    'an1' anchor 1
    'an2' anchor 2, etc.
<anchor/> An empty element (milestone) which attaches an identifier to a point within a text
    @xml:id Specifies the identifier corresponding to the one used in the @spanTo attribute of the preceding <me:textSpan> element, using values like:
    'an1' anchor 1
    'an2' anchor 2, etc.

We will discuss an example of an overlapping structure in AM 673 b 4to (Plácitusdrápa 1):

Fig. 16.1. Plácitusdrápa. AM 673 b 4to, f. 1r, l. 1–4.

The first three lines read approximately:

genget fiornes ualdr [quaþ........fr]egr nu | mun er lægiasc miuks scalldu manra[un sli] | ca morlins boþe finna uestu i frægre f[rest]

The letters in brackets were read by earlier editors, especially Finnur Jónsson in 1889. For this section, we will discuss the text at the end of the second line and at the start of the third. It is clear that part of each word is missing, but the damaged manuscript forms a single feature. Text can be supplied from Finnur Jónsson’s transcription, but we want to represent both the damage and the supplied text as a single feature, which overlaps with the middle of the two words. The simple encoding, without the unclear text marked or the supplied text, would be:


<w>manra<gap/></w>
<w><gap/><lb n="3"/>ca</w>

With the supplied text encoded in the conventional way, the following would produce an error:


<!-- WRONG: -->
<w>manra<supplied resp="FJ">un</w>
<!-- the processor stops here because this is not well-formed XML --> 
<w>sli</supplied><lb n="3"/>ca</w>

The <unclear> and <supplied> elements, if used in their conventional way, would overlap with the <w> elements, meaning that the word tag would close before an element inside it had closed. That would stop an XML processor from proceeding any further with the document.
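The difference can be checked with any XML parser. The following Python sketch uses the standard library parser on markup modelled on the two encodings above; the <text> wrapper element and the function name is_well_formed are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Overlapping markup: <supplied> opens in one word and closes in the next.
overlapping = (
    '<text>'
    '<w>manra<supplied resp="FJ">un</w>'
    '<w>sli</supplied><lb n="3"/>ca</w>'
    '</text>'
)

# The simple encoding with empty elements only.
milestones_only = '<text><w>manra<gap/></w><w><gap/><lb n="3"/>ca</w></text>'

def is_well_formed(xml_string):
    """Return True if the string parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(overlapping))      # False
print(is_well_formed(milestones_only))  # True
```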

In these guidelines, we offer two solutions to the problem of overlapping structures. The first is more complex, but more robust. The second is simpler, but is less machine-readable and may affect the validation of the document structure in other respects. Even so, we recommend the latter solution.

16.2.1 Linked segments

The following approach is more sound from the point of view of an XML document, but creates extra tagging. The feature is encoded in a series of separate elements, linked together.

In order to encode linked segments, the encoder should break the overlapping feature into parts which fit within the XML structure (usually within the word or dipl/facs/norm elements). Each part is identified using the @xml:id attribute, and they are linked together using the following attributes:

Attributes Contents
@xml:id provides a unique identifier for the element bearing the attribute
@next used at the start and in the middle: an IDREF pointing to the element which marks the next tag of the same feature
@prev used in the middle and at the end: an IDREF pointing to the element which marks the previous tag of the same feature

The two-word example above is encoded thus:


<w>man<supplied source="FJ" xml:id="sup1.1" next="sup1.2">raun
     </supplied></w>
<w><supplied xml:id="sup1.2" prev="sup1.1">ſli</supplied>
     <lb n="3"/>ca</w>

Adding all three textual levels, including the unclear text encoded at the facs level, we would have:


<w>
  <choice>
    <me:facs>man<unclear xml:id="unc1.1" next="unc1.2">
      <gap extent="8"/></unclear></me:facs>
    <me:dipl>man<supplied source="FJ" xml:id="sup1.1" next="sup1.2">raun
      </supplied></me:dipl>
    <me:norm>manraun</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs><unclear xml:id="unc1.2" prev="unc1.1">ſli</unclear>
      <lb n="3"/>ca</me:facs>
    <me:dipl><supplied xml:id="sup1.2" prev="sup1.1">ſli</supplied>
      <lb n="3"/>ca</me:dipl>
    <me:norm>slíka</me:norm>
  </choice>
</w>

It is recommended that the additional information for the feature (such as the editor responsible, type, etc.) be included only in the first element, but editors may wish to repeat the attributes in all elements.

For the purposes of display, the start of a feature can be marked by selecting the element which has the @next attribute but not the @prev attribute, and the end can be marked by selecting the element which has the @prev attribute but not the @next attribute. An element with both attributes lies in the middle of the feature.
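This selection rule can be sketched in Python as follows. The sample is the two-word example without textual levels, wrapped in a hypothetical <text> element; the function name classify_segments is our own, and a real stylesheet would more likely express the same tests in XPath.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

sample = """<text>
<w>man<supplied source="FJ" xml:id="sup1.1" next="sup1.2">raun</supplied></w>
<w><supplied xml:id="sup1.2" prev="sup1.1">ſli</supplied><lb n="3"/>ca</w>
</text>"""

def classify_segments(root):
    """Map the xml:id of each linked element to 'start', 'middle' or 'end'."""
    roles = {}
    for el in root.iter():
        has_next = el.get("next") is not None
        has_prev = el.get("prev") is not None
        if has_next and not has_prev:
            roles[el.get(XML_ID)] = "start"
        elif has_prev and not has_next:
            roles[el.get(XML_ID)] = "end"
        elif has_next and has_prev:
            roles[el.get(XML_ID)] = "middle"
    return roles

print(classify_segments(ET.fromstring(sample)))
# {'sup1.1': 'start', 'sup1.2': 'end'}
```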

16.2.2 Boundary marking with empty elements

Another solution is to encode the beginning and end of a text span with empty elements. This method has been described in ch. 20 “Non-hierarchical Structures” of the TEI P5 Guidelines and will be applied here in a slightly modified version. As outlined above, we have introduced a generic element <me:textSpan/> which is specified by way of a @category attribute. If, for example, the overlapping structure to be encoded is a piece of supplied text, this fact is expressed through the value of the @category attribute:


<me:textSpan category="supplied"/> 

Thus, all instances of supplied text in the file will either be contained in <supplied> elements (in non-overlapping contexts) or in <me:textSpan category="supplied"> elements (in overlapping contexts).

In addition to inserting the empty <me:textSpan/> element at the beginning of the textual span, an attribute @spanTo is added with a suitable index, e.g.


<me:textSpan category="supplied" spanTo="an1"/> 

It now remains to mark the end of the span, i.e. the extent of the supplied text, with another empty element, the TEI <anchor/> element. This must be given an @xml:id attribute whose value is the same as that of the @spanTo attribute at the beginning of the span:


<anchor xml:id="an1"/> 

The full encoding will be like this:


<w>man<me:textSpan category="supplied" spanTo="an1"/>raun</w>
<w>ſli<anchor xml:id="an1"/><lb n="3"/>ca</w> 

Note that the value of the @xml:id attribute must be unique within the whole document.
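To show how a processor might pair each boundary marker with its anchor, here is a minimal Python sketch which collects the text between a <me:textSpan/> and the <anchor/> it points to. The namespace URI bound to the me: prefix is an assumption for the illustration, as are the function names; elements are matched by local name so that the sketch works whether or not a default namespace is declared.

```python
import xml.etree.ElementTree as ET

ME_NS = "http://www.menota.org/ns/1.0"  # assumed URI, for illustration only
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

sample = f"""<text xmlns:me="{ME_NS}">
<w>man<me:textSpan category="supplied" spanTo="an1"/>raun</w>
<w>ſli<anchor xml:id="an1"/><lb n="3"/>ca</w>
</text>"""

def local_name(el):
    """Strip the namespace part from an ElementTree tag."""
    return el.tag.rsplit("}", 1)[-1]

def span_texts(root):
    """Return (spans, unmatched): text per closed span, and spanTo values with no anchor."""
    open_spans = {}  # anchor id -> text pieces collected so far
    closed = {}

    def collect(text):
        if text:
            for parts in open_spans.values():
                parts.append(text)

    def walk(el):
        name = local_name(el)
        if name == "textSpan":
            # Tolerate both 'an1' and '#an1' as pointer values.
            open_spans[el.get("spanTo", "").lstrip("#")] = []
        elif name == "anchor" and el.get(XML_ID) in open_spans:
            parts = open_spans.pop(el.get(XML_ID))
            # Collapse line breaks and indentation between words to single spaces.
            closed[el.get(XML_ID)] = " ".join("".join(parts).split())
        collect(el.text)
        for child in el:
            walk(child)
            collect(child.tail)

    walk(root)
    return closed, sorted(open_spans)

spans, unmatched = span_texts(ET.fromstring(sample))
print(spans)      # {'an1': 'raun ſli'}
print(unmatched)  # []
```

A non-empty second value would indicate a @spanTo attribute for which no matching <anchor/> follows in the document.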

There is no simple answer to the problem of non-hierarchical structures in XML encoding. However, we believe that using empty elements as boundary markers may prove to be the simplest and most general encoding, and it is therefore the solution we recommend. Whichever technique is chosen, it should be used consistently throughout a document and not mixed with the other.

16.3 Rendition encoding

The following solution is a simplified version of 16.2.1 above, in the sense that the linking is not expressed by way of IDREF pointing but by way of the @rendition attribute. It is a robust solution and it has been successfully implemented in the display of the Menota archive. More advanced stylesheets may be developed that can replace this solution, but for now this is our offer for any texts that should be displayed in the Menota archive, using the Corpuscle application.

The EpiDoc project has developed a template for its XSLT stylesheet which ensures that brackets or similar signs are not displayed more than once in such cases, see Editorial restoration: Segmented or adjacent lacunae.

As an example, we shall use a missing passage of text which begins in one word and ends in another, where the brackets mark the beginning and end of the passage:

   This is an exa[mple of a miss]ing passage

In an encoding using the <w> element for each word, it is not possible to open an element where the first bracket is located and close it at the second bracket, since this would create a conflict of overlapping structures. The simplest, although most verbose, solution is to encode each word separately, i.e. either

   This is an exa[mple] [of] [a] [miss]ing passage

or slightly less verbose

   This is an exa[mple] [of a] [miss]ing passage

The solution is dependent on the intended display of the text, of which there are two types, single-character rendering and opening & closing signs.

16.3.1 Display by single-character rendering

This is a type of display where every character is rendered in a specific way, and there are no opening or closing signs. This applies to e.g. the display of <unclear> text, which typically is displayed by subpunction of each character or, as suggested here, by grey colouring. See 8.4.1 above. If this kind of display is chosen, no additional encoding is needed.
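The reason no additional encoding is needed can be seen in a small Python sketch of subpunction: each character is marked on its own, so the result is the same wherever the element boundaries fall. The use of the combining dot below (U+0323) and the function name subpunct are assumptions for the illustration, not a prescribed rendering.

```python
COMBINING_DOT_BELOW = "\u0323"

def subpunct(text):
    """Place a dot below every character except whitespace."""
    return "".join(
        ch if ch.isspace() else ch + COMBINING_DOT_BELOW
        for ch in text
    )

# Marking two fragments separately gives the same result as marking them together,
# so a passage split across several elements needs no first/last bookkeeping.
print(subpunct("dim"))
print(subpunct("di") + subpunct("m") == subpunct("dim"))  # True
```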

According to the display recommended in this handbook, this solution is necessary for the following elements:

Elements Display
<gap/> If there is an estimate of the size of the gap, by small zeros or dotted circles, cf. ch. 8.2.
<space/> If there is an estimate of the size of the space, by empty space, encoded by one or more &nbsp; entities, cf. ch. 8.3.
<unclear> By grey characters or by subpunction, cf. ch. 8.4.

Fig. 16.2. Konungs skuggsjá. NRA 58C, f. 1rA, 24–27.

The following example contains a passage of unclear text:


. . .
<w>
  <choice>
    <me:facs><c type="initial">Ð</c>o a&trot;</me:facs>
    <me:dipl><c type="initial">Ð</c>o at</me:dipl>
    <me:norm>Þóat</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>ꝩıꞇ</me:facs>
    <me:dipl>vit</me:dipl>
    <me:norm>vit</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>&slongklig;ıl<unclear rend="semilegible">ꝺım</unclear></me:facs>
    <me:dipl>skil<unclear rend="semilegible">dim</unclear></me:dipl>
    <me:norm>skyldim</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs><unclear rend="illegible">◌◌◌◌◌</unclear></me:facs>
    <me:dipl><unclear rend="illegible">◌◌◌◌◌</unclear></me:dipl>
    <me:norm><supplied reason="damage" resp="Holm-Olsen, 1983">fleira</supplied></me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>um</me:facs>
    <me:dipl>um</me:dipl>
    <me:norm>um</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>þeſ<lb ed="ms" n="27"/>se</me:facs>
    <me:dipl>þes<lb ed="ms" n="27"/>se</me:dipl>
    <me:norm>þes<lb ed="ms" n="27"/>si</me:norm>
  </choice>
</w>
. . .

See the Menota archive, NRA 58C, f. 1r, col. A, l. 27, for a display of this encoding.

16.3.2 Display by opening and closing signs

This is a type of display where the beginning of the passage is displayed by an opening sign and the end by a closing sign. As long as the passage does not cross any word boundary, no additional encoding is used. If it does, we recommend using the @rendition attribute with the 'first' , 'middle' and 'last' values.

According to the display recommended in this handbook, this solution is necessary for the following elements:

Elements Display
<add> By insertion signs, cf. ch. 9.2.1.2.
<del> By vertical bars with quill, cf. ch. 9.2.2.2.
<supplied> By square brackets or by open angle brackets, ch. 9.3.1.2.
<surplus> By curly brackets, ch. 9.3.2.2.

In the next passage in fig. 16.2, not all the words are legible. Beginning at the end of the last line but one, we can read

   um þes|se lond .................. þau þo

In the following encoding, the three words “rǿða þá eru” have been supplied on the normalised level, but not on the facsimile and diplomatic levels, where the sequence is marked as illegible. Based on the supplied text on the normalised level, the encoder has nevertheless divided the sequence into three words in <w> elements. The <supplied> elements in these words have been given the @rendition attribute.


. . .
<w>
  <choice>
    <me:facs>um</me:facs>
    <me:dipl>um</me:dipl>
    <me:norm>um</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>þeſ<lb ed="ms" n="27"/>se</me:facs>
    <me:dipl>þes<lb ed="ms" n="27"/>se</me:dipl>
    <me:norm>þes<lb ed="ms" n="27"/>si</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>lonꝺ</me:facs>
    <me:dipl>lond</me:dipl>
    <me:norm>lǫnd</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs><unclear rend="illegible">◌◌◌◌</unclear></me:facs>
    <me:dipl><unclear rend="illegible">◌◌◌◌</unclear></me:dipl>
    <me:norm><supplied reason="damage" resp="Holm-Olsen, 1983" rendition="first">rǿða</supplied></me:norm>
  </choice>
</w>
<pc>
  <choice>
    <me:facs></me:facs>
    <me:dipl></me:dipl>
    <me:norm>,</me:norm>
  </choice>
</pc>
<w>
  <choice>
    <me:facs><unclear rend="illegible">◌◌</unclear></me:facs>
    <me:dipl><unclear rend="illegible">◌◌</unclear></me:dipl>
    <me:norm><supplied reason="damage" resp="Holm-Olsen, 1983" rendition="middle">þá</supplied></me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs><unclear rend="illegible">◌◌◌</unclear></me:facs>
    <me:dipl><unclear rend="illegible">◌◌◌</unclear></me:dipl>
    <me:norm><supplied reason="damage" resp="Holm-Olsen, 1983" rendition="last">eru</supplied></me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>þau</me:facs>
    <me:dipl>þau</me:dipl>
    <me:norm>þau</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs>þo</me:facs>
    <me:dipl>þo</me:dipl>
    <me:norm>þó</me:norm>
  </choice>
</w>
. . .
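How a display routine might interpret the @rendition values can be sketched in Python as follows. The sample keeps only the <supplied> elements of the normalised level, wrapped in a hypothetical <s> element, and square brackets are assumed as the opening and closing signs; the function name bracket is our own.

```python
import xml.etree.ElementTree as ET

sample = (
    '<s>'
    '<supplied rendition="first">rǿða</supplied> '
    '<supplied rendition="middle">þá</supplied> '
    '<supplied rendition="last">eru</supplied>'
    '</s>'
)

def bracket(text, rendition=None, open_sign="[", close_sign="]"):
    """Add opening and closing signs according to the rendition value."""
    if rendition is None:
        # A passage within a single word gets both signs.
        return open_sign + text + close_sign
    if rendition == "first":
        return open_sign + text
    if rendition == "middle":
        return text
    if rendition == "last":
        return text + close_sign
    raise ValueError(f"unexpected rendition value: {rendition!r}")

root = ET.fromstring(sample)
print(" ".join(bracket(el.text, el.get("rendition")) for el in root.iter("supplied")))
# [rǿða þá eru]
```

Each sign is thus displayed only once for the whole passage, however many words it covers.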