Version 1.1 (5 May 2004)
5.1 Introduction
5.2 Base line characters
5.3 Ligatures
5.4 Modified characters
5.5 Complex characters
5.6 Punctuation marks
5.7 List of characters
The basic characters a-z / A-Z in the Latin alphabet can be encoded in virtually any electronic system and transferred from one system to another without loss of information. Any other characters may cause problems, even well-established ones such as Modern Scandinavian "æ", "ø" and "å". As explained in ch. 2.2, all characters outside a-z / A-Z should be encoded as entities, i.e. given an appropriate description and placed between the delimiters "&" and ";".
Entities are needed at the bottom level, as it were, in an XML transcription of a text. This is parallel to the source code of a typical HTML file, which can be inspected in most HTML editors and browsers, but is usually not shown. Although a number of characters will have to be referred to with entities, it is important to note that the transcriber does not have to type in entities when s/he is transcribing a manuscript or proofreading. With appropriate software and fonts the transcription can be displayed on screen and printed out with all (or at least most) entities shown as readable and recognizable characters.
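As a minimal sketch of how software might display entities as readable characters, the following Python function expands entity references using a lookup table. The table `ENTITY_CODE_POINTS` and the function `expand_entities` are hypothetical names introduced for illustration; the sample code points are taken from the tables in this chapter, and a real application would load the complete character list.

```python
import re

# Hypothetical lookup table: entity name -> Unicode code point.
# Only a handful of sample entries; not the full character list.
ENTITY_CODE_POINTS = {
    "eth": 0x00F0,
    "thorn": 0x00FE,
    "slong": 0x017F,
    "oogon": 0x01EB,
    "fins": 0xF10D,  # Private Use Area: display requires a suitable font
}

# An entity reference is a name between the delimiters "&" and ";".
ENTITY_PATTERN = re.compile(r"&([A-Za-z0-9]+);")


def expand_entities(text):
    """Replace known &name; references with characters; leave unknown ones untouched."""
    def replace(match):
        code_point = ENTITY_CODE_POINTS.get(match.group(1))
        return chr(code_point) if code_point is not None else match.group(0)
    return ENTITY_PATTERN.sub(replace, text)
```

Under these assumptions, `expand_entities("ma&eth;r")` yields "maðr", while a reference that is not in the table is passed through unchanged so that nothing is lost silently.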
The characters a-z / A-Z are seen as base line characters, i.e. characters occupying a separate position on the base line of a primary source (typically a manuscript) and transcribed one by one in the order they stand. In addition to the characters a-z / A-Z there are a number of ligatures, i.e. combinations of two (or in principle more) characters making up a new base line character, such as "æ". There are also a number of variant base characters, e.g. a round form of "r" (r rotunda), or a tall form of "s", and there is even a whole set of small capitals to be reckoned with, especially in Old Icelandic script. Furthermore, the base line characters can be modified by a number of diacritics (accents, dots, hooks, strokes etc.), so that the theoretical number of combinations for any character is very high.
The rules for describing the individual elements of characters and their attributes are simple, and were described in ch. 2.2 above. To keep the number of entities down we have tried to specify how many of the possible character combinations in fact have been used in medieval Nordic manuscripts. The discussion below is therefore of some length, and the accumulated character list includes several hundred characters. Although we have tried to make the list as exhaustive as possible, we do not believe that it is definitive. It should be noted that many characters are open to interpretation, e.g. whether a character has double dots or double accents. Thus, some characters in our list may prove to be theoretical rather than factual combinations of base character and diacritics. This especially applies to capital variants of lesser used characters, such as some of the ligatures.
In general, we have tried to minimize the number of variants, whether of base characters or of diacritics. There is, for example, only one base line character "a", although this letter may have various forms in the manuscripts, e.g. "single-storeyed" (closed, without a neck) or "double-storeyed" (with a neck). We regard this type of variation as paleographical, and suggest that it is not encoded, but that it is described elsewhere, e.g. in the TEI header or in the front matter of the electronic edition.
We should like to stress that the list of characters in this chapter should not be taken as a list of minimal and necessary distinctions to be made by the transcriber. We have defined two types of "s", a low (or round) one and a tall one. This does not mean that the transcriber should use both entities in the encoding of any manuscript that exhibits them, only that if s/he wishes to make the distinction, we suggest how that can be done.
Glyphs are the typical shape of a character. In this chapter, they are shown in a font based on the typeface Courier. Although this is a modern font the distinctive traits of each character are basically the same as in medieval script.
5.1.2 Entity names
All characters outside the range a-z / A-Z are referred to with entity names placed within the delimiters "&" and ";". As explained in ch. 2.2.2 above we recommend that entities as far as possible conform to the standard ISO entity sets. An updated list of ISO conformant entities can be found at the Oasis web site:
The ISO set only covers a minor selection of the entities we believe are necessary for the full transcription of medieval Nordic manuscripts. This chapter thus discusses a number of additional characters with accompanying entities. We have tried to adhere to the inventory and syntax of ISO entities. For a summary of the entity naming scheme, please refer to ch. 5.5 below.
We have supplied code points from Unicode 4.0 for all characters (or parts of characters) defined in this standard. For the remaining characters we have defined code points in the Private Use Area. These are shown in bold type (and dark blue). The character list contains Unicode values for all characters.
Each character is described according to the naming scheme in Unicode, as explained in ch. 2.2. We also suggest descriptive names for those characters not included in the Unicode standard.
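The pairing of entity names with code points can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed tool: the function names `is_private_use` and `entity_declaration` are hypothetical, the range U+E000 to U+F8FF is the Private Use Area of the Basic Multilingual Plane, and the output uses the ordinary XML DTD syntax for internal entity declarations.

```python
def is_private_use(code_point):
    """True if the code point lies in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= code_point <= 0xF8FF


def entity_declaration(name, code_point):
    """Format an XML DTD declaration that maps an entity name to a code point."""
    return '<!ENTITY {} "&#x{:04X};">'.format(name, code_point)
```

For example, `entity_declaration("gscap", 0x0262)` gives `<!ENTITY gscap "&#x0262;">`, and `is_private_use(0xF10D)` is true for the Insular "f" listed later in this chapter.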
Base line characters are unmodified characters occupying a separate position on the base line, i.e. characters which are not clearly modified by diacritical marks and are not part of a ligature.
These characters are described in ISO 646 and are found on the keyboard of virtually any Western computer. They are identical to US ASCII positions 32-126 and are often referred to as Basic Latin. Characters in Basic Latin are encoded without use of entity references.
Unicode 4.0 defines these characters as belonging to the range Basic Latin (positions 0020-007E).
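The rule that Basic Latin is written directly while everything else is written as an entity can be sketched as below. The mapping `ENTITY_NAMES` and both function names are hypothetical, the four sample entries are standard ISO entity names, and the sketch handles a single line of text only: it does not escape "&" itself and it rejects line breaks.

```python
# Hypothetical reverse lookup: character -> entity name (sample entries only).
ENTITY_NAMES = {
    "\u00f0": "eth",
    "\u00fe": "thorn",
    "\u00e6": "aelig",
    "\u00f8": "oslash",
}


def is_basic_latin(char):
    """True for the printable Basic Latin range, positions 32-126 (0020-007E)."""
    return 0x20 <= ord(char) <= 0x7E


def encode_entities(text):
    """Keep Basic Latin as it is; write every other known character as &name;."""
    parts = []
    for char in text:
        if is_basic_latin(char):
            parts.append(char)
        elif char in ENTITY_NAMES:
            parts.append("&" + ENTITY_NAMES[char] + ";")
        else:
            # Refuse to guess: an unknown character needs an entry in the table.
            raise ValueError("no entity name known for U+%04X" % ord(char))
    return "".join(parts)
```

Raising an error for unknown characters is a deliberate choice in this sketch, since silently passing them through would defeat the purpose of a lossless encoding.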
| Glyph | Letter | Unicode | Descriptive name |
| a | a | 0061 | LATIN SMALL LETTER A |
| A | A | 0041 | LATIN CAPITAL LETTER A |
etc.
Note that the distinction between minuscule (lowercase) and majuscule (uppercase) characters is an inherent trait of the coding scheme; it is not shown by entity names such as "&amin;" for "a" and "&amaj;" for "A". However, when it comes to the question of small capitals and enlarged minuscules it will be necessary to introduce entity names, as discussed in ch. 5.2.3 and 5.2.4 below.
Modern Icelandic has two characters for dental fricatives, "þ" (thorn) and "ð" (eth). In ISO 8859-1 they are referred to with the entity names "&thorn;" and "&eth;", also adopted here.
Unicode 4.0 defines "þ" (thorn) and "ð" (eth) in the range Latin-1 Supplement.
| Glyph | Entity | Unicode | Descriptive name |
| ð | &eth; | 00F0 | LATIN SMALL LETTER ETH |
| Ð | &ETH; | 00D0 | LATIN CAPITAL LETTER ETH |
| þ | &thorn; | 00FE | LATIN SMALL LETTER THORN |
| Þ | &THORN; | 00DE | LATIN CAPITAL LETTER THORN |
In addition to "þ" and "ð", Modern Icelandic has seven vowels with diacritical marks, "á", "é", "í", "ó", "ú", "ý" and "ö", and one ligature, "æ". These will be treated as modified characters and discussed below.
Small capitals have the same form as majuscules (capital letters), but are usually drawn with the same height as a minuscule (small letter) such as "x". Small capitals were used in Old Icelandic to denote geminates, i.e. long consonants, or they were used ornamentally (often so in Old Norwegian). The letters "B", "D", "G", "M", "N", "R", "S" and "T" were often used as geminates, while these and other letters might also be used as ornaments in the whole or in parts of highlighted words. Some of the small capitals, e.g. "O" and "C", are difficult to distinguish from minuscule letters. We suggest that small capitals receive the suffix "scap" (for "small capital") in the entity name.
Unicode 4.0 has defined nine small capitals in the IPA Extensions range, sc. "B", "G", "H", "I", "L", "N", "Œ", "R" and "Y", and sixteen in the Phonetic Extensions range, sc. "A", "Æ", "C", "D", "ETH", "E", "J", "K", "M", "O", "P", "T", "U", "V", "W" and "Z". For the remaining small capitals we will have to resort to the Private Use Area, i.e. "F", "Q", "S", "THORN" and "X". Cf. the character list for an extensive overview.
Cf. "Uralic Phonetic Alphabet characters for the UCS" (20.03.2002).
| Glyph | Entity | Unicode | Descriptive name |
| ɢ | &gscap; | 0262 | LATIN LETTER SMALL CAPITAL G |
| ᴍ | &mscap; | 1D0D | LATIN LETTER SMALL CAPITAL M |
etc.
We recommend that small capitals are transcribed as such, irrespective of whether they are being used for geminates or for ornamental purposes. Cf. ch. 6.2.10.
Some scholars believe that enlarged minuscules should be transcribed as separate characters. The traditional view is to interpret these characters as variants of capitals (majuscules) and encode them as such. There are comparatively few characters which appear as enlarged minuscules, and it is sometimes difficult to decide whether a minuscule character is enlarged or not. We recommend that enlarged minuscules are transcribed as capitals in cases where it seems obvious that they function as a capital and as ordinary minuscules elsewhere. If, however, the transcriber wishes to make a distinction between capitals and enlarged minuscules, we recommend the suffix "enl" (for "enlarged") in the entity name.
Unicode 4.0 does not recognise enlarged minuscules as separate characters. A small selection of enlarged minuscules has been included in the Private Use Area, e.g. "a" and "e". Cf. the character list for an extensive overview.
| Glyph | Entity | Unicode | Descriptive name |
| | &aenl; | EEE0 | LATIN ENLARGED LETTER SMALL A |
| | &eenl; | EEE6 | LATIN ENLARGED LETTER SMALL E |
etc.
A few characters have distinct Insular forms, e.g. "r", "f" and "v" (wynn). These characters are sometimes transcribed as separate characters, as opposed to their Carolingian counterparts. We suggest using the suffix "ins" (for "Insular").
Unicode 4.0 does not recognise Insular characters as separate characters, with the exceptions of "g" (yogh) and "w" (wynn) in Latin Extended-B. A few Insular characters have been included in the Private Use Area, e.g. "f" and "v".
| Glyph | Entity | Unicode | Descriptive name |
| | &fins; | F10D | LATIN SMALL LETTER INSULAR F |
| | &vins; | F211 | LATIN SMALL LETTER INSULAR V |
etc.
Insular "g" (yogh) is to our knowledge not found in medieval Nordic manuscripts.
As a rule, characters should be given identical names across various scripts (Carolingian, Insular, Gothic etc.). However, when clearly identifiable letter forms from one script appear within the context of another, as is the case with some Insular letter forms in Nordic Carolingian script, they may be singled out by the transcriber, if s/he wishes to do so.
A few characters may appear with a typical Uncial form, especially "d", "e", "m" and "t". These characters are sometimes transcribed as separate characters, as is the case with Insular letter forms. We suggest using the suffix "unc" in the entity name.
Unicode 4.0 does not recognise Uncial characters as separate characters. A small selection of Uncial characters has been included in the Private Use Area, e.g. "d" and "t". Cf. the character list for an extensive overview.
| Glyph | Entity | Unicode | Descriptive name |
| | &dunc; | F109 | LATIN LETTER UNCIAL D |
| | &tunc; | F129 | LATIN LETTER UNCIAL T |
etc.
Runes are normally not used in conjunction with the Latin alphabet, but when they appear in isolated instances - e.g. in The third grammatical treatise - they should be transcribed with appropriate entity names. We suggest using the suffix "run" (for "runes").
Unicode 4.0 has defined a selection of 81 runes from the Older and Younger Futhark in the Runic range. Note that the descriptive names given below are those chosen by Unicode.
| Glyph | Entity | Unicode | Descriptive name |
| ᚠ | &frun; | 16A0 | RUNIC LETTER FEHU FEOH FE F |
| ᛘ | &mrun; | 16D8 | RUNIC LETTER LONG-BRANCH-MADR M |
etc.
Note that the runes "m" and "f" may also be used as abbreviation signs, cf. ch. 6.2.6-7.
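Since the descriptive names for the runes are those chosen by Unicode, they can be looked up rather than typed by hand. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is a hypothetical choice for illustration.

```python
import unicodedata


def describe(code_point):
    """Return 'XXXX NAME' for a code point, using Python's Unicode database."""
    return "{:04X} {}".format(code_point, unicodedata.name(chr(code_point)))
```

Here `describe(0x16A0)` returns "16A0 RUNIC LETTER FEHU FEOH FE F". Note that `unicodedata.name` raises a `ValueError` for code points in the Private Use Area, because Unicode assigns no names there; descriptive names for such characters have to come from the character list.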
Some base line characters have commonly recognised variants. In general, we recommend that variants, e.g. "single-storeyed a" and "two-storeyed a", are not transcribed as separate entities. In many cases it is difficult to decide which of the variants to choose. However, there are a few variants which are very distinctive and often recognised in transcriptions. This applies to "tall s" and "round r", for which we suggest the suffixes "long" and "rot" (for "rotunda") respectively.
Unicode 4.0 recognises "long s" as part of the Latin Extended-A range, but "round r" is not recognised. This has been allocated to code point F20E in the Private Use Area.
| Glyph | Entity | Unicode | Descriptive name |
| ſ | &slong; | 017F | LATIN SMALL LETTER LONG S |
| | &rrot; | F20E | LATIN SMALL LETTER R ROTUNDA |
etc.
Ligatures are two base line characters which are joined so that they form a new, composite base line character. Some consist of two identical characters, e.g. "a+a", others of different characters, e.g. "a+v". Ligatures may be used to denote length, "a+a", diphthong, "a+v", or a distinct vowel quality, often mutation (Umlaut), "a+v". A well known example is the ligature "æ", formed of "a" and "e", encoded as "&aelig;" in ISO 8879. In analogy with this usage we suggest that ligatures receive the suffix "lig" following those base line characters which make up the ligature.
Unicode 4.0 does not recognise ligatures in the Latin alphabet as base characters. The only exceptions are "æ", "œ" and "ij" (not used in Nordic). For "æ" see the Unicode range Latin-1 Supplement, and for "œ" Latin Extended-A. Other ligatures must be defined in the Private Use Area. Cf. the character list for an extensive overview.
| Glyph | Entity | Unicode | Descriptive name |
| | &aalig; | EF91 | LATIN SMALL LIGATURE AA |
| | &avlig; | EF97 | LATIN SMALL LIGATURE AV |
etc.
We recommend that only ligatures with a distinctive value should be given an entity name of their own, i.e. only those ligatures which possibly reflect a phonological opposition. We regard ligatures which are motivated by graphic economy as sporadic ligatures and recommend that they should be transcribed as separate characters. To this group belong ligatures such as "b+b", "p+p" etc. Especially in late Gothic script there are many examples of junctures (fusion of bows) which can be interpreted as ligatures, but which in our opinion should be encoded as individual characters.
If a transcriber wishes to transcribe sporadic ligatures as ligatures, we suggest using the element <seg> with the attribute type="ligature", e.g.
| Glyph | Encoding |
| | <seg type="ligature">pp</seg> |
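The <seg> encoding shown above could be produced programmatically, for instance as in the following sketch. It uses Python's standard `xml.etree.ElementTree` module; the function name `ligature_seg` is hypothetical and introduced only for this illustration.

```python
import xml.etree.ElementTree as ET


def ligature_seg(characters):
    """Wrap the characters of a sporadic ligature in <seg type="ligature">."""
    seg = ET.Element("seg", {"type": "ligature"})
    seg.text = characters
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(seg, encoding="unicode")
```

Calling `ligature_seg("pp")` returns `<seg type="ligature">pp</seg>`, so the individual characters stay in the transcription and the ligature is recorded as markup only.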
Modified characters are base line characters with diacritical marks. They are described according to rule (4) in ch. 2.2.1. If there is more than one modification, they are listed in the sequence specified in rule (6).
The character "ø" is still being used in Modern Danish and Norwegian, and is encoded as "&oslash;" in ISO 8879. In some manuscripts the stroke may be horizontal and in others diagonal, but in general we do not believe it is relevant to distinguish between variant strokes.
Unicode 4.0 has defined "ø" as part of the Latin-1 Supplement range.
| Glyph | Entity | Unicode | Descriptive name |
| ø | &oslash; | 00F8 | LATIN SMALL LETTER O WITH STROKE |
| Ø | &Oslash; | 00D8 | LATIN CAPITAL LETTER O WITH STROKE |
etc.
A few vowels, especially "o" and "e", may have a hook. The latter combination, "e caudata", is common in Latin manuscripts, in which the letter form alternates with the ligature "æ". The hook may be placed below or above the base line character, facing either to the right or to the left. Of these combinations, the distinction between left- and right-turning hooks may simply be accidental. The two "canonical" forms are the hook below to the right and the hook above to the left. We recommend using "ogon" for the hook below and "curl" for the hook above (since "hook" possibly is more ambiguous).
Unicode 4.0 recognises "a" and "e" with hooks in the range Latin Extended-A, and "o" with hook in Latin Extended-B. In Unicode, the hook is referred to as "ogonek", a Polish word for "little tail". The ogonek is also defined as a combining character, 0328 in the range Combining Diacritical Marks. The hook above may be identified with the tone mark in Vietnamese, 0309 in the range Combining Diacritical Marks. This mark, however, has a slightly different form (comparable to the recognised distinction between the cedilla and the ogonek). For this reason, we suggest using a separate code point in the Private Use Area, F1C4.
| Glyph | Entity | Unicode | Descriptive name |
| ǫ | &oogon; | 01EB | LATIN SMALL LETTER O WITH OGONEK |
| | &ocurl; | E7D3 | LATIN SMALL LETTER O WITH CURL |
| | &ucurl; | E731 | LATIN SMALL LETTER U WITH CURL |
Loops are in most cases reduced forms of "a" or "o" and can thus be interpreted as ligatures. We suggest using the suffix "red" in the entity name, thus "oeligred" for the reduced version of the "oe" ligature, and "aoligred" for the reduced version of the "ao" ligature.
Unicode 4.0 does not recognise loops, either as separate characters or as combining diacritical marks.
| Glyph | Entity | Unicode | Descriptive name |
| | &oeligred; | F20D | LATIN SMALL LIGATURE OE WITH MISSING BOTTOM STROKE |
| | &aoligred; | F206 | LATIN SMALL LIGATURE AO NECKLESS |
Single and double acute accents are quite common in Nordic script. A single acute accent is encoded with the suffix "acute" in ISO 8879, e.g. "&aacute;", while double acute is encoded with the suffix "dblac". This usage is adopted here.
Unicode 4.0 defines "a", "e", "i", "o", "u" and "y" with acute accents in the Latin-1 Supplement range, and "æ" and "ø" in the Latin Extended-B range. The vowels "o" and "u" are defined with double acute accents in the Latin Extended-A range. Other accented characters must be encoded as a combination of a base line character and 0301 COMBINING ACUTE ACCENT or 030B COMBINING DOUBLE ACUTE ACCENT from the range Combining Diacritical Marks. As explained in ch. 2.2 this "decomposed" encoding can also be used for the precomposed vowels mentioned above.
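The relation between precomposed and "decomposed" encoding can be illustrated with Unicode canonical decomposition (normalization form NFD), available in Python's standard `unicodedata` module. The helper name `decompose` is a hypothetical choice for this sketch.

```python
import unicodedata


def decompose(text):
    """Return the code points of the canonically decomposed (NFD) form of text."""
    return ["{:04X}".format(ord(ch)) for ch in unicodedata.normalize("NFD", text)]
```

With this helper, `decompose("\u00e1")` gives `["0061", "0301"]` (a plus COMBINING ACUTE ACCENT) and `decompose("\u0151")` gives `["006F", "030B"]` (o plus COMBINING DOUBLE ACUTE ACCENT). A vowel without a precomposed form, such as "a" with double acute, exists only as the two-code-point sequence 0061 030B.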
| Glyph | Entity | Unicode | Descriptive name |
| á | &aacute; | 00E1 | LATIN SMALL LETTER A WITH ACUTE |
| | &adblac; | E425 | LATIN SMALL LETTER A WITH DOUBLE ACUTE |
| | &aaligacute; | EFE1 | LATIN SMALL LIGATURE AA WITH ACUTE |
| | &aaligdblac; | EFEB | LATIN SMALL LIGATURE AA WITH DOUBLE ACUTE |
Double acute accent sometimes resembles a circumflex, "^", cf. Seip 1954, p. 145.
Grave accent sporadically appears in comparatively young Icelandic manuscripts, especially "è", while double grave accent to our knowledge is not found in medieval Nordic script at all. If necessary, we suggest using the suffix "grave", e.g. "&egrave;", for the single grave accent.
Single and double dots are quite common in Old Norse script. Single dots appear over vowels as well as consonants, double dots usually only above vowels. In ISO 8879 the suffixes "dot" and "uml" (for "Umlaut") refer to single and double dots respectively. This usage is adopted here (although double dots are in no way restricted to the original mutated vowels).
Unicode 4.0 defines a number of consonants with a single dot above, sc. "b", "d", "f", "h", "m", "n", "p", "r", "s", "t", "w", "x" and "long s", and also the vowel "y", all in the Latin Extended Additional range. Other dotted characters must be encoded as a combination of a base line character and 0307 COMBINING DOT ABOVE or 0308 COMBINING DIAERESIS from the range Combining Diacritical Marks. As is the case with accents, "decomposed" encoding can also be used for the precomposed characters mentioned here.
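Whether a dotted character has a precomposed code point can be checked by attaching 0307 COMBINING DOT ABOVE and applying canonical composition (normalization form NFC). The following is a sketch using Python's standard `unicodedata` module; the function name `with_dot_above` is hypothetical.

```python
import unicodedata


def with_dot_above(base):
    """Attach 0307 COMBINING DOT ABOVE and compose (NFC) where Unicode allows it."""
    return unicodedata.normalize("NFC", base + "\u0307")
```

For "y" the result is the single precomposed character 1E8F, whereas for "k" the result remains the two-code-point sequence 006B 0307, since Unicode defines no precomposed "k" with dot above.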
| Glyph | Entity | Unicode | Descriptive name |
| ẏ | &ydot; | 1E8F | LATIN SMALL LETTER Y WITH DOT ABOVE |
| ö | &ouml; | 00F6 | LATIN SMALL LETTER O WITH DOUBLE DOT ABOVE |
Single dots also appear over a number of consonants:
| Glyph | Entity | Unicode | Descriptive name |
| | &kdot; | E568 | LATIN SMALL LETTER K WITH DOT ABOVE |
| | &gscapdot; | EF20 | LATIN LETTER SMALL CAPITAL G WITH DOT ABOVE |
Single dots above can be seen as a type of abbreviation, since the dot usually signifies gemination of the characters it is placed above. Cf. ch. 6.3.8.
The discussion in ch. 5.2-5.4 has shown that entity names are built up in a strict sequence with a limited number of possible values. The syntax and inventory are shown in the table below. Note that not all slots need to be filled in; in most cases only one or two slots are used.
| Base line character | Main type | Variant | Ligature | Fixed modification | Loose modification |
| a | comb | long | lig | ogon | acute |
Please note that if there is a conflict between the standard ISO entities and the syntax suggested here, ISO entities should be preferred.
On the basis of this table we can name and describe a number of complex characters (not necessarily occurring in medieval Nordic script). Some examples:
| Glyph | Entity name | Descriptive name |
| | &aeligogon; | LATIN SMALL LIGATURE AE WITH OGONEK |
| | &oslashogonacute; | LATIN SMALL LETTER O WITH STROKE AND OGONEK AND ACUTE |
| | &aeligogonuml; | LATIN SMALL LIGATURE AE WITH OGONEK AND DIAERESIS |
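The strict slot sequence of the naming scheme lends itself to a small helper. The sketch below is an illustration only: the function `entity_name` and its parameter names are hypothetical, and it assumes that "dot" belongs in the loose modification slot alongside "acute". It performs no check against the ISO entity sets, which take precedence in case of conflict.

```python
def entity_name(base, main_type="", variant="", ligature=False, fixed="", loose=""):
    """Join the naming slots in their fixed order and add the delimiters & and ;."""
    # Slot order: base line character(s), main type, variant,
    # ligature, fixed modification, loose modification.
    parts = [base, main_type, variant, "lig" if ligature else "", fixed, loose]
    return "&" + "".join(parts) + ";"
```

Under these assumptions, `entity_name("aa", ligature=True, loose="acute")` gives "&aaligacute;", `entity_name("g", main_type="scap", loose="dot")` gives "&gscapdot;", and `entity_name("s", variant="long")` gives "&slong;", matching entity names listed earlier in this chapter.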
The punctuation marks in medieval Nordic script are basically the same as in the Modern European languages, but their use was less consistent, and many manuscripts only used a single mark, the dot. There were also some special types of punctuation marks.
Unicode 4.0 has the marks in the table below in the ranges Basic Latin and Latin-1 Supplement, with the exception of the inverted semicolon, the pause mark and the triangular dots.
| Glyph | Character | Unicode | Descriptive name |
| . | . | 002E | FULL STOP |
| · | · | 00B7 | MIDDLE DOT |
| , | , | 002C | COMMA |
| : | : | 003A | COLON |
| ; | ; | 003B | SEMICOLON |
| | &punctelev; | F161 | PUNCTUATION MARK PUNCTUS ELEVATUS |
| ? | ? | 003F | QUESTION MARK |
| | &quest8; | E501 | QUESTION MARK HORIZONTAL 8 FORM |
| - | - | 002D | HYPHEN |
| / | / | 002F | SOLIDUS |
| | &diacom; | F1F2 | PUNCTUATION MARK DIAERESIS ABOVE COMMA (cf. Hreinn Benediktsson 1965, p. 95) |
| | &brevdot; | F1F3 | PUNCTUATION MARK BREVE ABOVE DOT (cf. Seip 1954, p. 34) |
| ∴ | ∴ | 2234 | PUNCTUATION MARK UPWARDS-POINTING TRIANGULAR DOTS (cf. Seip 1954, p. 34) |
An extensive list of characters (including punctuation and abbreviation marks) is found in the character list.
Version 1.0 published 20 May 2003. Version 1.1 published 5 May 2004.