Converting a file into Menotic encoding

Make a text Menotic - questions and answers

1. Analyze the document
2. Preserve existing encoding
3. Character encoding
4. Style as encoding
5. What is Oxygen?
6. Line breaks
7. Page breaks
8. References to other editions
9. Encoding each word
10. Making the document a valid XML document
11. Add a prologue
12. Add a root element
13. Add a teiHeader
14. Add basic encoding in the text part of the document
15. Correcting errors
16. Using the attributes rend and rendition - Describing the visual appearance of the source text

By Tone Merete Bruvik (Aksis, Unifob) and Odd Einar Haugen (University of Bergen)

This document contains recommendations on how texts can be made into Menotic XML documents.

The following text will be used as an example: Elis saga i DG 4–7, fol. 9v.

1. Analyze the document

Question: I have a MS Word file, what should I do?

Answer: You need to analyze the document to see what is in it, and what needs to be done:

What kind of character encoding is in use?

In manuscripts in general, and in Old Norse manuscripts in particular, there are many characters that are not in ASCII. You need to know how these have been encoded. If you do not know what kind of encoding has been used (for instance you have a text transcribed 10 years ago, and there is no documentation) you will need to compare the text with a facsimile of the original document to figure out the character encoding.

What kind of encoding is available in the text already?

Look at the text; is there any markup there already? In many texts, line breaks, page breaks and column breaks will already be marked. Are there notes that should be transferred into the new version? Make a list of all these kinds of markup, and how they are expressed in the document.

Have changes in the text been marked?

For instance, are abbreviations expanded? In Old Norse text editions, these are typically marked with underline or italic. Add this to the list of encodings already in the text that you would like to keep.

Have errors in the text been corrected and marked in some way? If that is the case, add it to your list.

Is this text only a starting point?

In many cases, there exists an electronic text that is a transcription of a manuscript, but with changes or errors that you would like to correct. In that case, you need to proofread it against the original, for instance a facsimile. Whether you do that before or after you transform the text into Menota is up to you. Bear in mind that the Menota handbook provides a terminology for expressing information about the text, which might be helpful when you update it.

2. Preserve existing encoding

Question: I have an overview of the document, what encoding should I start with?

Answer: You should start by preserving the existing encoding.

You start this task by using the word processor the document was made with, for instance MS Word. You should switch to an XML editor (for instance Oxygen) after the character encoding has been converted to Unicode and the encoding expressed by formatting (e.g. italic for expanded abbreviations) has been handled.

3. Character encoding

Question: How should I handle the character encoding?

Answer: Convert the characters into Unicode, and use a Unicode font. If there are characters in use which are not in Unicode, look at chapter 5 of the Menota Handbook and the MUFI recommendation.

Example: The first line below has partly wrong character encoding. The second line shows the correct characters:

/-/ Nu ly∂it go∂gæfliga. betra er fogr frπ∂e en kui/∂ar fylli. ›o scal vi∂ saugu súpa. 
									
/-/ Nu lyðit goðgæfliga. betra er fogr frǫðe en kui/ðar fylli. þo scal við saugu súpa.
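
Once the legacy characters have been identified, the substitution can also be scripted. The following is a minimal Python sketch and not part of the Menota tools. The mapping table holds only the three substitutions visible in the example above, and the name to_unicode is made up for illustration. A real transcription will need its own table, worked out against the facsimile.

```python
# Legacy characters seen in the example line, mapped to the intended Unicode
# characters. This table is an assumption based on that one example only.
LEGACY_TO_UNICODE = str.maketrans({
    "∂": "ð",  # U+2202 partial differential -> U+00F0 eth
    "π": "ǫ",  # U+03C0 Greek pi -> U+01EB o with ogonek
    "›": "þ",  # U+203A single angle quotation mark -> U+00FE thorn
})


def to_unicode(text: str) -> str:
    """Replace each legacy character with its Unicode counterpart."""
    return text.translate(LEGACY_TO_UNICODE)


if __name__ == "__main__":
    print(to_unicode("Nu ly∂it go∂gæfliga. betra er fogr frπ∂e"))
```

Characters that are not in the table are left untouched, so the function can be run repeatedly while the table is being extended.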

4. Style as encoding

Question: How should I handle the use of bold, italic or other styles in the document, if these styles have been used as encoding?

Answer: If a style, for instance italic, has been used to mark up something in the text, you need to transform that into more robust encoding before you leave the word processor. Use the search and replace function to do this. In the sample text, underline or italic means that something has been expanded. Encode this with <ex> (editorial expansion). This can be done using the replace function in MS Word, where you combine a formatting condition with wildcards in the search. To see how this is done in MS Word (or any other word processor), look up for instance 'wildcard' or 'search and replace' in the help system of the software.

To find underlined text, search for “(*)”, replace with “<ex>\1</ex>”, tick “Use wildcards”, and state that the formatting of the font is underline. There is, however, a problem with this: MS Word will find the shortest match possible, so only single letters in the document will be found. In order to find strings of underlined text, you will need to specify the number of characters to search for. Start with for instance 10 letters (most likely, there will be no expansions that are that long) by searching for “(*{10})”, replace with “<ex>\1</ex>”, then search for “(*{9})”, etc.

You can find more information on the use of search and replace in MS Word here.

haf∂i hann nalega

Changed to:

hafði ha<ex>nn</ex> nalega

After this has been done, you should stop using MS Word and switch to an XML editor, for instance Oxygen.

5. What is Oxygen?

Oxygen is an XML editor, that is, an application that has been made to handle XML documents. Read more about it here and on Oxygen's home page.

Oxygen is a relatively complex application, and it takes time to master it. You will, however, need Oxygen or another XML editor with some of the same functionality to do the following steps.

6. Line breaks

Question: How should I transform line breaks?

Answer: In the Elis saga i DG 4–7, fol. 9v text, the line breaks of the manuscript have been encoded with “/”. Replace these with <lb/>:

Nu ly∂it go∂gæfliga. betra er fogr frǫ∂e en kui/
∂ar fylli. þo scal vi∂ saugu súpa. en æi ofmikit drec/
ka. sœm∂ er saugu at segia ef hæyrendr. til ly∂a. / 
en tapat starfi at hafna at hæyra. /

Changed to:

Nu lyðit goðgæfliga. betra er fogr frǫðe en kui<lb/>
ðar fylli. þo scal við saugu súpa. en æi ofmikit drec<lb/>
ka. sœmð er saugu at segia ef hæyrendr. til lyða. <lb/>
en tapat starfi at hafna at hæyra. <lb/>
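
If the expansions have already been encoded with <ex>, a plain replacement of every “/” would also hit the slash in </ex>. The sketch below shows one way to avoid that in Python. It assumes that “/” is used for nothing else in the transcription, and the name mark_line_breaks is made up for illustration.

```python
import re

# A slash that is not part of a tag: not preceded by "<" (as in "</ex>")
# and not followed by ">" (as in "<lb/>").
LINE_BREAK_MARK = re.compile(r"(?<!<)/(?!>)")


def mark_line_breaks(text: str) -> str:
    """Replace each manuscript line-break slash with an empty <lb/> element."""
    return LINE_BREAK_MARK.sub("<lb/>", text)


if __name__ == "__main__":
    print(mark_line_breaks("hafði ha<ex>nn</ex> nalega. kui/"))
```

Because existing <lb/> elements are skipped, running the function twice on the same text does no harm.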

In many cases, you would like to add line numbers, for instance <lb n='5'/>. That can be done automatically when you know where the page breaks are (so you know where to reset the line counting), but it is not vital that it is done at this point. You will in any case need a little script or a style sheet to do it. A simple Perl script is available to do this.

You will need to have Perl installed on the computer you are running the script on. If you are running UNIX (including Mac OS X) or Linux, it will be included; otherwise you may download Perl from perl.com or perl.org. Choose ActivePerl if given a choice.

Save the Perl script on your local computer or server, and put the XML file you would like to add line numbers to in the same directory as the Perl script.

To run the Perl script, open a terminal application (for instance "Command Prompt" in Windows or "Terminal" in Mac OS X) on your local computer or on your server, navigate by using the command "cd" (change directory) followed by the path to the directory where you installed the script, for example:

cd Documents/menota/Testing

Then you type in a command line like this:

perl lb_nummering_0_1.pl innfil=[your xml file] utfil=[your new xml file]

For example:

perl lb_nummering_0_1.pl innfil='Elis utdrag_5.xml' utfil='Elis utdrag_6.xml'

This script has the limitation that there can only be one <lb/> on each line of the input document. If that is not the case in your document, search for "<lb/>" and replace each one with a <lb/> followed by a carriage return.

The line numbers produced by this script are of the form "[page number]-[line number]", e.g. "12-5", that is page 12, line 5.
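
For readers who prefer not to install Perl, the numbering logic can be sketched in Python. This is not the Perl script mentioned above, only an illustration of the same idea under stated assumptions: every unnumbered line break is written exactly as <lb/>, the line counter is reset at each <pb> element that carries an n attribute, and the name number_line_breaks is made up.

```python
import re

# Either a <pb .../> with an n attribute (captured as pb_n) or a bare <lb/>.
PB_OR_LB = re.compile(r'<pb\b[^>]*\bn="(?P<pb_n>[^"]*)"[^>]*/>|<lb/>')


def number_line_breaks(xml_text: str) -> str:
    """Number each <lb/> as "page-line", restarting the count at every <pb>."""
    page = None
    line = 0

    def replace(match: re.Match) -> str:
        nonlocal page, line
        if match.group("pb_n") is not None:
            # A page break: remember the page number and restart the count.
            page = match.group("pb_n")
            line = 0
            return match.group(0)
        line += 1
        # Before the first page break there is no page number to prefix.
        n = f"{page}-{line}" if page is not None else str(line)
        return f'<lb n="{n}"/>'

    return PB_OR_LB.sub(replace, xml_text)


if __name__ == "__main__":
    print(number_line_breaks('<pb n="12"/>Nu lyðit<lb/>goðgæfliga<lb/>'))
```

Unlike the one-<lb/>-per-line restriction described above, this sketch does not depend on how the input is broken into lines, but it will not recognise line breaks that already carry attributes.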

7. Page breaks

Question: How should I transform page breaks?

Answer: If the page breaks have been encoded, transform them into Menotic code. To make this kind of change you might use search and replace functions with regular expressions, which are available in Oxygen and most other editors.

In this case, where the pages are from the Kölbing 1881 edition, it is simple to search for “side=(\d+)”, and replace it with “ <pb ed="Kölbing1881" n="$1"/> ”. This is an example of how a regular expression is used.

side=35

Changed to:

<pb ed="Kölbing1881" n="35"/>
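
The same replacement can be run outside the editor. The Python sketch below is an illustration only; note that Python writes the captured group as \1 where Oxygen uses $1, and that the function name mark_page_breaks is made up.

```python
import re


def mark_page_breaks(text: str) -> str:
    """Turn "side=35" into a <pb/> element referring to the Kölbing 1881 edition."""
    # (\d+) captures the page number; \1 inserts it into the replacement.
    return re.sub(r"side=(\d+)", r'<pb ed="Kölbing1881" n="\1"/>', text)


if __name__ == "__main__":
    print(mark_page_breaks("side=35"))
```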

8. References to other editions

Question: How should I encode references to other editions?

Answer: In this kind of text, references to earlier printed editions are common. A typical case is page breaks from a printed edition. In this case, every fifth line break from the text edition by Kölbing from 1881 has been marked. This can be kept by search and replace (using a regular expression) on “\((\d+)\)” with “<lb ed="Kölbing1881" n="$1"/>”. The parentheses around the number are escaped with a backslash so that they are matched literally:

(5)

Changed to:

 <lb ed="Kölbing1881" n="5"/>
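
The point that is easy to get wrong here is that parentheses have a special meaning in regular expressions, so the literal ones in the text must be escaped. A short Python sketch of the replacement, with the made-up name mark_edition_lines, might look like this:

```python
import re

# \( and \) match the literal parentheses; the inner (\d+) captures the number.
EDITION_LINE = re.compile(r"\((\d+)\)")


def mark_edition_lines(text: str) -> str:
    """Turn "(5)" into an <lb/> element referring to the Kölbing 1881 edition."""
    return EDITION_LINE.sub(r'<lb ed="Kölbing1881" n="\1"/>', text)


if __name__ == "__main__":
    print(mark_edition_lines("(5)"))
```

Numbers that are not enclosed in parentheses are left alone, which is the reason for matching the parentheses at all.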

9. Encoding each word

Question: How should I automatically mark up each word in the text?

Answer: In Menota, each word should normally be contained in a <w> element. Here, too, you can use search and replace functions with regular expressions, for instance in Oxygen. The regular expressions in these cases may become rather complex, for instance: find “([a-zA-Zæœǫþð|#]+)”, replace with “<w><me:dipl>$1</me:dipl></w>”:

Nu ly∂it go∂gæfliga.

Changed to:

<w><me:dipl>Nu</me:dipl></w> 
<w><me:dipl>lyðit</me:dipl></w> 
<w><me:dipl>goðgæfliga</me:dipl></w>
.

To encode the “.” at the end, search for “(\.)”, replace with “<me:punct>$1</me:punct>”:

<w><me:dipl>Nu</me:dipl></w> 
<w><me:dipl>lyðit</me:dipl></w> 
<w><me:dipl>goðgæfliga</me:dipl></w>
<me:punct>.</me:punct>
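
Both replacements can be combined in one pass that also leaves existing tags such as <lb/> untouched. The Python sketch below is one possible way to do this, not the procedure used for the sample file. It reuses the character class from the search expression above, and the name wrap_words is made up.

```python
import re

# Characters counted as part of a word, taken from the search expression above.
WORD_CHARS = "a-zA-Zæœǫþð|#"

# Alternatives are tried in order: an existing tag, a word, or a full stop.
TOKEN = re.compile(rf"(<[^>]+>)|([{WORD_CHARS}]+)|(\.)")


def wrap_words(text: str) -> str:
    """Wrap words in <w><me:dipl> and full stops in <me:punct>; keep tags as-is."""

    def replace(match: re.Match) -> str:
        tag, word, punct = match.groups()
        if tag is not None:
            return tag  # existing markup is copied through unchanged
        if word is not None:
            return f"<w><me:dipl>{word}</me:dipl></w>"
        return f"<me:punct>{punct}</me:punct>"

    return TOKEN.sub(replace, text)


if __name__ == "__main__":
    print(wrap_words("Nu lyðit goðgæfliga."))
```

The sketch has clear limits. A character missing from WORD_CHARS, such as the ú in súpa, splits the word in two, and a word interrupted by <lb/> or <ex> ends up in several <w> elements, so the result still has to be checked by hand.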

When you have done this, you might have this document. Do not open this link in your web browser, but ask your browser to download it to your disk. Then open it in Oxygen or a similar application. This document is neither valid nor well formed, but you may download it if you would like to see what the encoding looks like at this stage of the process. In Oxygen, you will get this message:

This document contains long lines which may affect performance when opened in the text editor. The longest line contains 6852 characters. This warning is displayed for lines which contain more than 5000 characters (see the Open/Save page from Preferences). Do you want to format and indent it before open?

Answer yes to this question.

It is a good idea to keep copies of the various stages of the document in case you do something really stupid to it.

A note on the search expression “[a-zA-Zæœǫþð|#]” used above: I prefer listing the characters which are actually used in the document, rather than using Unicode character ranges, which is also an option.

10. Making the document a valid XML document

Question: How should I make this a valid XML document?

Answer: You need to embed the text into a structure, including an XML prologue, a root element, a teiHeader etc.

11. Add a prologue

Question: What is a prologue?

Answer: A prologue is the first few lines of an XML document, containing the XML declaration and a reference to the DTD or schema in use. See the Menota handbook appendix D, D.2 Referring to the Menota schema, for more details on prologues.

I have chosen a Relax NG version of the Menota schema in this case, but you may also use a DTD.

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://www.menota.org/guidelines-2/schemes/menotaP5.rng" type="xml" ?>
<!DOCTYPE TEI 
[
<!ENTITY % Menota_entities 
	SYSTEM 'http://www.menota.uio.no/menota-entities.txt'>
%Menota_entities; ]>

12. Add a root element

Question: What is a root element, and what should it look like?

Answer: An XML document needs to have one element that contains all the content of the file except the prologue. Put the content of the document in a root <TEI> element, like this:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:me="http://www.menota.org/ns/1.0">
... 
  <w><me:dipl>Nu</me:dipl></w> 
  <w><me:dipl>lyðit</me:dipl></w> 
  <w><me:dipl>goðgæfliga</me:dipl></w>
  <me:punct>.</me:punct> 
  <w><me:dipl>betra</me:dipl></w> 
...
</TEI>

In the root element, two namespaces have been declared. A namespace declaration can bind a prefix, which is then used on the tag names belonging to that namespace. All tags starting with the Menota namespace prefix “me:” are specific tags defined in Menota; all other tags (having no prefix) are part of TEI, which is the default namespace.
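
To see what the two namespaces mean to a program that reads the file, the fragment can be parsed with Python's standard library. This is only an illustration: the prefixes in the NS table ("tei" and "me") are chosen by the script and need not match those in the document, since what counts is the namespace URI.

```python
import xml.etree.ElementTree as ET

# Prefixes used in the search paths below, bound to the two namespace URIs.
NS = {
    "tei": "http://www.tei-c.org/ns/1.0",
    "me": "http://www.menota.org/ns/1.0",
}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:me="http://www.menota.org/ns/1.0">
  <w><me:dipl>Nu</me:dipl></w>
  <w><me:dipl>lyðit</me:dipl></w>
  <me:punct>.</me:punct>
</TEI>"""

root = ET.fromstring(SAMPLE)

# The unprefixed <w> is in the TEI namespace; <me:dipl> is in the Menota one.
words = [dipl.text for dipl in root.findall("tei:w/me:dipl", NS)]
punct = root.find("me:punct", NS).text

if __name__ == "__main__":
    print(root.tag)
    print(words, punct)
```

The parser reports the root element as the TEI namespace URI in curly brackets followed by TEI, which shows that elements without a prefix belong to the default namespace.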

13. Add a teiHeader

Question: What is a teiHeader, and what should it look like?

Answer: See appendix E. Menota header of the Menota handbook. A simple teiHeader might look like this:

<teiHeader xml:lang="eng">
	<fileDesc>
		<titleStmt>
			<title>Elis saga in DG 4–7, fol. 9v: an electronic edition</title>
			<respStmt>
				<resp>Edition by </resp>
				<name>...</name>
			</respStmt>
		</titleStmt>
		<editionStmt>
			<p>First draft, <date when="2009-04-02">2 April 2009</date>
			</p>
		</editionStmt>
		<publicationStmt>
			<distributor>Medieval Nordic Text Archive</distributor>
			<idno type="Menota">Ms. XX</idno>
			<date when="2009-04-02"> 2 April 2009</date>
			<availability status="restricted">
				<p>This text is available for purposes of academic research 
				and teaching only. Re-distribution in any form without prior 
				permission is prohibited. Short extracts may be cited with 
				full acknowledgment of the source.</p>
			</availability>
		</publicationStmt>
		...
	</fileDesc>
	...
</teiHeader>

14. Add basic encoding in the text part of the document

Question: What is basic encoding, and where should it be placed?

Answer: Enclose the text part of the document in the elements <text>, <body>, <div> (text division) and <p> (paragraph). In most cases, a document will contain multiple <div> and <p> elements:

<text xml:lang="en">
	<body>
		<div>
			<p>
				<lb/>
				-<lb/>
				<w><me:dipl>Nu</me:dipl></w>
				<w><me:dipl>lyðit</me:dipl></w>
				<w><me:dipl>goðgæfliga</me:dipl></w>
				<me:punct>.</me:punct>
				...
				<w><me:dipl>þessa</me:dipl></w>
				<lb/>
				-<lb/>
			</p>
		</div>
	</body>
</text>

When you have done this, you might have a valid XML document, see this. Do not open this link in your web browser, but ask your browser to download it to your disk. Then, open it in Oxygen or a similar application.

15. Correcting errors

Question: What should I do if my file is not valid?

Answer: It is easy to make mistakes, and it takes time to learn to understand what the error messages mean. Here is some advice:

Check along the way whether the document is well formed. While you are in the middle of the conversion process it might not be, but check when you are done with a major change, and correct the errors. Oxygen uses red wavy underlines to indicate that something is wrong, and shows an error message below the text, for instance when a </p> is missing.

When the document is well formed, check if it is valid according to the schema. Read the error message to understand what is causing the error.

If there are many errors in a text (and it is not uncommon that there are thousands), many of these will be caused by the same error, for instance that an entity is used in many places without being declared, or that an element is used which is not declared in the schema. You may sort the errors by clicking on the labels above the error messages in Oxygen, giving you an overview of the most common errors. You can correct a group of errors in one operation.

Some errors are hard to find, especially when the parser (that is, the program that is looking for errors) does not point to the place actually causing the error. A typical case is that there are end tags missing somewhere, and the parser points to the end of the surrounding element. In some cases you need to make a copy of your document, and then remove large parts of the content to trace the error.
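
A quick well-formedness check can also be run outside Oxygen, which is handy when tracing an error in a cut-down copy of the document. The Python sketch below uses only the standard library, and the name check_well_formed is made up. It checks well-formedness only, not validity against the Menota schema, and since it does not read external entity files, entity references may be reported as errors.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def check_well_formed(xml_text: str) -> Optional[str]:
    """Return None if the text parses as XML, otherwise a short error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is the (line, column) where the parser gave up, which
        # is often after the place that actually caused the problem.
        line, column = err.position
        return f"not well formed at line {line}, column {column}: {err}"
    return None


if __name__ == "__main__":
    # The </p> end tag is missing, so the parser complains at </div>.
    print(check_well_formed("<div><p><w>Nu</w></div>"))
```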

16. Using the attributes @rend and @rendition - Describing the visual appearance of the source text

Among the global attributes which can be used in any element in the TEI are @rend and @rendition, which both are used to describe the visual appearance of something in the text, for instance its colour or size. Two points are easy to misunderstand concerning these attributes:

1. They should describe what these elements look like in the source, and even if this information might be used to give a special look on screen or in print, it should NOT be used to define how these elements are to be displayed in an edition. That is done by the display system, in most cases a stylesheet. Remember that something that is big and blue in a manuscript, might be shown without any particular layout in an edition of the text.

2. The difference between @rend and @rendition is that @rend might contain any string of text, while @rendition points to a description of the rendering or presentation used for this element in the source text. If a systematic description of the rendition of the text is needed and has been provided, use @rendition; otherwise use @rend. The TEI guidelines say further:

Where both rendition and rend are supplied, the latter is understood to override or complement the former.
Each URI provided should indicate a rendition element defining the intended rendition in terms of some appropriate style language, as indicated by the scheme attribute.

The descriptions of the styles are given in a series of <rendition> elements in the teiHeader of the document. See <rendition> in the TEI Guidelines for further details.

First published 07.04.2009. Last updated 28.07.2009.