Diacritical Marks & Special Characters

Much of this 1995 Acorn User article (actually titled ‘In foreign parts’) was really only relevant to Acorn users. However, a lot of the material in the article is relevant to anyone handling material in foreign languages, so the following is a suitably abridged copy.

Even though most European languages, and many others, use the same basic Roman alphabet as English, most of them use various little marks attached to the letters in addition, and in some cases special characters. For really professional DTP, if you are printing international names and addresses, or text in foreign languages, you need to be able to produce these properly. The general term diacritical marks, or diacritics, covers all the little marks.

Most of the diacritics are accents, and the term is often broadened to refer to them all. In the days of metal type, the letter and its diacritic were usually made as a single piece, and printers may even apply the term accent to the whole combination.

French, for example, uses à, â, ç, è, é, ê, ë, î, ï, ô, ù, û, ü and (very rarely) ÿ.

Diacritics’ names

à – a grave
é – e acute
â – a circumflex
ñ – n tilde
ā – a macron
å – a ring
ă – a breve
ç – c cedilla
ą – a ogonek
ċ – c dot
ü – u diaeresis
č – c caron
ő – o double acute
Ť – T caron
ť – t háček

Most of these diacritics can appear in combination with other letters, not only those shown. Diaeresis is called umlaut in German, and the German word is often used in English, too. (N.B. These are the names generally used by British typographers, referring to the typographical shapes and positions – linguists use the terms somewhat differently, according to their linguistic significance. Typographers of other nationalities are likely to use them differently too, and there can be some confusion, even within a group of British typographers! Get the marks right, and don’t worry too much about the names.)

Special characters

æ Æ – ae ligature (Scandinavian languages)
ð Ð – eth (Faroese, Icelandic)
đ Đ – eth (Croat, Sami)
ħ Ħ – h stroke (Maltese)
ĸ – kra (Greenlandic, obsolete)
ı – dotless i (Turkish)
İ – dotted I (Turkish)
ij – ij ligature (Dutch)
ł Ł – l stroke (Polish)
ŋ Ŋ – eng (Sami)
ø Ø – o stroke (Danish)
œ Œ – oe ligature (French)
ß – eszett (German)
þ Þ – thorn (Icelandic)
ŧ Ŧ – t stroke (Sami)

Traps for the unwary

If you are copying from handwritten manuscript (there’s a lovely tautology!), or poor or small printing, it can sometimes be hard to make out the diacritics. It’s then very useful to know which combinations occur in which languages. Sometimes you may be able to guess which language it is from the diacritics, and then go on to work out the illegible ones. (With a bit of knowledge of which pairs (or even trios) of letters tend to go together in which languages, you can do even better, but that's over the top for this article.)

The table shows which combinations and special characters are used in which languages.

Afrikaans: ê ë ô
Albanian: ë ç
Catalan: à ç è é í ŀ (Ŀ) ò ó ú ï ü
Croatian: ć č đ (Đ) š ž
Czech: á č ď (Ď) é ě í ň ó ř š ť (Ť) ú ů ý ž
Danish: æ å ø
Dutch: ë
Esperanto: ĉ ĝ ĥ ĵ ŝ ŭ
Estonian: ä õ
Faroese: æ á ð (Ð) í ó ö ø ú ý
Finnish: ä ü [å š ž]
French: à â ç è é ê ë î ï ô œ ù û ü ÿ
German: ä ö ü ß
Hungarian: á é í ó ö ő ú ü ű
Icelandic: æ á ð Ð é í ó ö ú þ Þ
Irish: á é í ó ú
Italian: à è é ì ò ù
Latvian: ā č ē ģ Ģ ī ķ ļ š ū ž
Lithuanian: ą č ę ė į š ų ž
Maltese: ċ ġ ħ Ħ ż
Norwegian: æ å ø
Polish: ą ć ę ł (Ł) ń ó ś ź ż
Portuguese: à á ã ç é í ó
Latin American: à á â ã ç è é ê ì í ò ó ô ú ü
Romanian: â ă î ș ț
Samoan: ā ē ī ō ū
Scots Gaelic: à è é ì ò ó ù
Slovene: č š ž
Spanish: á é í ñ ó ú ü
Swedish: ä å ö
Turkish: ç ğ ı İ ö ș ü [â î û]
Welsh: â á ê ë î ï ô ö û ŵ ŷ

A character in parentheses is the upper case version of the preceding character. Characters in square brackets are only used in words borrowed from other languages.
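
The guessing game described above (work out the language from the diacritics you can read) can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the name LANGUAGE_MARKS and the function candidate_languages are invented for illustration, only a handful of rows from the table are included, and the approach is no more reliable than the table itself.

```python
import unicodedata

def _chars(s):
    # Normalise to precomposed form so each accented letter is one character.
    return set(unicodedata.normalize("NFC", s))

# Hypothetical lookup built from a few rows of the table (not the full table).
LANGUAGE_MARKS = {
    "Danish": _chars("æåø"),
    "German": _chars("äöüß"),
    "Hungarian": _chars("áéíóöőúüű"),
    "Polish": _chars("ąćęłńóśźż"),
    "Spanish": _chars("áéíñóúü"),
    "Swedish": _chars("äåö"),
}

def candidate_languages(text):
    """Return the languages whose listed characters cover every
    non-ASCII letter found in text."""
    text = unicodedata.normalize("NFC", text).lower()
    special = {ch for ch in text if ch.isalpha() and not ch.isascii()}
    if not special:
        return []  # no diacritics or special characters to go on
    return sorted(
        lang for lang, marks in LANGUAGE_MARKS.items() if special <= marks
    )

print(candidate_languages("Łódź"))      # ['Polish']
print(candidate_languages("Göteborg"))  # ['German', 'Hungarian', 'Swedish']
```

A single ö narrows things down very little, which is why the legible marks are best considered together rather than one at a time.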

The accuracy and completeness of these lists is not guaranteed. I'd be grateful for any corrections!

Note especially that, in Czech and Slovak, the lower-case forms of T-caron (Ť) and D-caron (Ď) are t-háček (ť) and d-háček (ď) respectively (a lot of ‘authorities’ don’t know this). Other languages don’t use carons on t or d, and don’t use háčeks at all. The háček is positioned closer to its letter than a following apostrophe would be in English; depending on the font it is likely to be at a different height too.

Don’t confuse carons (č) with breves (ă). Breves can occur on g in Turkish, on a in Romanian, and on u in Esperanto. Carons occur on c, s and z in a number of languages, and on lots of letters in Czech, but happily never on g, a or u in any language.
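
If the text is already on a computer, the machine can tell you which mark a letter carries, which helps when the screen font is too small to show the difference. The following is a minimal sketch using Python's standard unicodedata module; the function name mark_names is made up for this example. Note that Unicode's names follow its own conventions, so the apostrophe-like háček on ť is reported as a caron.

```python
import unicodedata

def mark_names(letter):
    """List the Unicode names of the combining marks carried by a letter."""
    # NFD splits a precomposed letter into its base letter plus marks.
    decomposed = unicodedata.normalize("NFD", letter)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]

# Normalise the sample first so that each letter is a single character.
for letter in unicodedata.normalize("NFC", "čăğŭť"):
    print(letter, mark_names(letter))
```

For č and ť this reports COMBINING CARON, and for ă, ğ and ŭ it reports COMBINING BREVE.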

Ogoneks (ą) are Polish or Lithuanian, on vowels. This makes them easy to distinguish from cedillas (ç), which are used on consonants in lots of languages.

Romanian, Turkish and Latvian use comma-shaped cedillas (ș). The cedilla is inverted above a lower case g (ģ) in Latvian! These comma-shaped cedillas are usually smaller than commas. Unicode (and consequently almost all computer systems) only provides the French style of c-cedilla, which is wrong for Turkish. Of course, a Turkish font could have the right c-cedilla for Turkish, but then you couldn't type French properly, and it rather goes against the principles of Unicode.
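
For s and t, unlike c, Unicode does encode both shapes as separate characters: one set "with cedilla" and one set "with comma below". The sketch below shows how text keyed with the cedilla versions might be converted to the comma-below versions in Python. The names CEDILLA_TO_COMMA and use_comma_below are invented for illustration, and whether the conversion is appropriate depends on the language and the fonts in use, so treat it as an assumption rather than a rule.

```python
import unicodedata

# Cedilla code points mapped to their comma-below counterparts (s and t only).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})

def use_comma_below(text):
    """Replace s/t-with-cedilla by s/t-with-comma-below; leave all else alone."""
    return text.translate(CEDILLA_TO_COMMA)

before = "Timi\u015foara"          # keyed with the cedilla form
after = use_comma_below(before)
print(unicodedata.name(before[4]))  # LATIN SMALL LETTER S WITH CEDILLA
print(unicodedata.name(after[4]))   # LATIN SMALL LETTER S WITH COMMA BELOW
```

The c-cedilla is deliberately left untouched, since there is only the one encoded form of it.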

In Turkish the upper case of a dotted i is a dotted I (İ); I is the upper case of the dotless i (ı).
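
Software that changes case without knowing the language usually gets this wrong. Python's built-in upper() and lower(), for instance, apply the non-Turkish pairing, turning i into I. A minimal workaround, assuming the text is known to be Turkish, is to swap the four letters by hand first; the function names turkish_upper and turkish_lower are made up for this sketch.

```python
def turkish_upper(text):
    """Upper-case text using the Turkish pairs i/İ and ı/I."""
    # Handle the two i's explicitly, then let upper() do the rest.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

def turkish_lower(text):
    """Lower-case text using the Turkish pairs İ/i and I/ı."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

print("istanbul".upper())         # ISTANBUL (dot lost: wrong for Turkish)
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ILIK"))      # ılık
```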

Double acutes (ő) occur only in Hungarian. Ordinary diaereses (ö) in Hungarian are not mistakes, though: it has both.

In Icelandic you may sometimes find what look like commas over vowels. This is merely a typographical variation; the significance is the same as an acute. The same word will be written with an acute in other typefaces. It is perfectly correct simply to use acutes. Icelandic and Faroese use a different lower case eth (ð) from Croat and Sami (đ).

Don’t assume that the original must necessarily be right, especially if the originator is not a native user of the language concerned. Even if they are, they may have been restricted by their equipment. These problems afflict the comma-shaped cedilla and the breve especially, these often being rendered as ordinary cedillas and carons respectively, even in otherwise well-produced printed matter. Not everyone will care, but you’re always safe if you get it right.

Floating accents

Manual typewriters used to use floating accents. This meant that each diacritic was treated as a character in its own right, and the character set did not need to contain the combined letter-plus-accent characters at all. On a typewriter that could do diacritics, the diacritic was typed first. The carriage didn’t advance, and the letter was then typed in the same location. The diacritic was positioned such that it appeared in the correct location relative to the letter.
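
Unicode's combining characters work on a similar principle, with one difference: the mark is stored after its letter rather than before it. The sketch below imitates the typewriter's accent-then-letter order in Python and lets the standard unicodedata module merge the pair into a single character where one exists. The table COMBINING and the function compose are assumptions made for this example, not a description of any particular keyboard driver.

```python
import unicodedata

# Hypothetical table: accent key as typed -> Unicode combining mark.
COMBINING = {
    "`": "\u0300",  # grave
    "'": "\u0301",  # acute
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    '"': "\u0308",  # diaeresis
}

def compose(accent, letter):
    """Combine an accent key and the letter typed after it."""
    # Unicode puts the mark after the base letter, the reverse of typing order.
    # NFC then merges the pair into one precomposed character if one exists.
    return unicodedata.normalize("NFC", letter + COMBINING[accent])

print(compose("'", "e"))  # é
print(compose('"', "u"))  # ü
print(compose("~", "n"))  # ñ
```

Where no precomposed character exists for a pair, the result stays as a letter followed by a combining mark, and it is then up to the font to position the mark.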

This was all very well for lower case letters and monospacing typewriters. In general, diacritics were omitted on upper case letters in typewritten work; some typewriters had an additional set of upper case diacritics, positioned differently. Proportional spacing typewriters had a problem in the horizontal direction as well, and adopted a compromise positioning of the diacritics, but it wasn’t always very satisfactory.

The Acorn software I wrote used the same system, except that it automatically positioned every combination of letter and diacritic correctly, and provided the diacritics for every language that used the Roman, Greek, Deva Nagari (Hindi) or Cyrillic alphabets – except Vietnamese! It needed fewer keystrokes than the usual method on modern computers, and was much easier to use than needing to know (or look up) code numbers for every letter/diacritic combination.