Unicode for the web

(An expanded version of an old article from the Coffee with ILRT blog, which I’ve given a new home.)

The bad old days

Anyone who remembers the Web in the 1990s may also remember how limited a range of characters was available for displaying text in languages other than English. Some characters with accents (and other diacritical marks such as cedillas, tildes and umlauts) for French, German, Italian, Spanish and the Scandinavian languages were available, but there was no simple way of displaying any others.

For the benefit of the computer, each character had a numeric equivalent. Plain unadorned letters for writing English, together with numbers and punctuation, were a universal standard, covered by the ASCII codes 0-127. Codes in the range 128-255 are not strictly part of ASCII; on the Web they were by default read as ISO 8859-1 (Latin-1), an extension of ASCII whose upper range holds the simple accented characters mentioned in the previous paragraph. So the HTML &#225;, using the code 225, produces the character á; the more memorable &aacute; is an alternative way to do the same thing.
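The link between the number 225 and the accented character can be checked with a few lines of Python. This is only an illustrative sketch (Python is my choice here, and the variable names are made up); it uses the standard library's html.unescape to decode the two ways of writing the character.

```python
import html

# "\u00e1" is the letter a with an acute accent.
code = ord("\u00e1")

# The numeric reference and the named entity decode to the same character.
from_number = html.unescape("&#225;")
from_name = html.unescape("&aacute;")

print(code, from_number, from_name)
```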

However, you didn’t have to go far to find characters in use which could not be easily rendered this way on a Web page. There was no standard way of producing Welsh ŵ and ŷ, for example. Nor was there a simple way of making text available in non-Roman scripts such as those required to write Greek, Russian, Hebrew, Arabic, Hindi, Mandarin and many other languages.

It was possible to have a local variant of the codes 128-255 installed in order to display a particular language, at the cost of making the commoner accented characters described above illegible. It was also still impossible to display characters from different non-Roman scripts on the same page. (For example, biblical scholars could not display Hebrew and Greek side by side.) It was not unusual for text in non-Roman scripts to be put on Web pages as a scanned image of a printed text.
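To see why local variants could not coexist on one page, here is a small Python sketch that reads the same single byte, 225, under four different 8-bit encodings. The ISO 8859 family is my own choice of example; the article does not say which variants were in use.

```python
# One byte value, read under four different 8-bit character sets.
raw = bytes([225])

readings = {
    "latin-1": raw.decode("latin-1"),      # Western European
    "iso8859_7": raw.decode("iso8859_7"),  # Greek
    "iso8859_5": raw.decode("iso8859_5"),  # Cyrillic
    "iso8859_8": raw.decode("iso8859_8"),  # Hebrew
}

for name, char in readings.items():
    # Show each reading with the Unicode number it corresponds to.
    print(name, char, f"U+{ord(char):04X}")
```

The byte is identical in each case; only the table used to interpret it changes, so a page could carry one interpretation at a time.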

Unicode arrives

This situation was clearly untenable, and the Unicode standard was developed as a universally agreed way of representing all scripts in use. Unicode includes the ASCII codes mentioned above, but expands the range of codes far beyond ASCII, more than enough to accommodate all characters in all scripts in use. Each character in any script (including those with accents etc.) is given a number (there are now over 100,000 of these). Strictly speaking the number belongs to the abstract character, not to the ‘glyph’, which is the shape a particular font draws for it. A character's position in Unicode is known as its ‘code point’. Consecutive code points are grouped into ‘ranges’ (officially called ‘blocks’), each holding characters needed for a particular script.

A look at the list of Unicode character ranges (e.g. http://www.alanwood.net/unicode/#links) illustrates the diversity of human writing systems, ancient and modern: everything from Babylonian cuneiform to the many scripts in current use in India can be found there. Some ranges are devoted to symbols, such as musical and mathematical notation, and even representations of I Ching hexagrams, dominoes and mah-jongg tiles! (You may notice that the code is given in both decimal (base 10) and its hex (base 16) equivalent. HTML accepts either form, for example &#373; or &#x175;, though the examples in this article use decimal.)
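The decimal and hex forms of a code are easy to compare in Python. The sketch below uses a helper I have called describe (a made-up name) together with the standard unicodedata module, which reports the official name of a character.

```python
import unicodedata


def describe(char):
    """Return the decimal number, hex code point and official name of one character."""
    number = ord(char)
    return number, f"U+{number:04X}", unicodedata.name(char)


# Welsh w with circumflex, then Greek small alpha.
print(describe("\u0175"))
print(describe("\u03b1"))
```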

But I still can’t read it!

If you look at some of the Unicode resource pages, chances are that some characters won’t display correctly on your screen (Mongolian script, for example, if you don’t have a Mongolian font installed). They will probably appear as boxes (in Mozilla Firefox, the box will contain letters and numbers which are the hex code for that code point). This means that your system doesn’t support those particular Unicode ranges. Windows offers support for some additional languages; go to the Control Panel, choose ‘Regional and Language Options’ and you can install support for ‘complex script and right-to-left’ languages (Arabic, Armenian, Georgian, Hebrew, the Indic languages, Thai and Vietnamese) and East Asian languages (Chinese/Japanese/Korean).

You will also of course need to have an appropriate font available. The languages just mentioned are included in the Unicode versions of standard fonts such as Arial, but if you still can’t read the script you may need to install a font that includes it. A list of some from non-commercial sources can be found at http://www.alanwood.net/unicode/fonts.html.

Writing for the Web with Unicode

Odd words or single characters can be inserted into a Web page without any special software. For individual characters, the HTML is “&#” followed by the decimal version of the character's number in Unicode, followed by “;”. Several examples of this have already appeared in this article. So the Welsh characters mentioned above can be written as &#373; and &#375;. For more extended typing, Unicode keyboards are available for some of the commoner scripts. For example, Microsoft Global allows you to choose a language in an Office application and then type away in Chinese, Korean or Japanese.
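The “&#”, decimal number, “;” recipe is mechanical enough to automate. Here is a minimal Python sketch; the function name to_char_refs and the sample text are my own inventions, and the second line shows Python's built-in xmlcharrefreplace error handler doing the same job.

```python
def to_char_refs(text):
    """Replace each non-ASCII character with a decimal character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


sample = "\u0175 and \u0177"  # Welsh w and y with circumflex

by_hand = to_char_refs(sample)
built_in = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(by_hand)
print(built_in)
```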

Remember though that a word in non-Roman characters may not be legible to your readers if they don’t have an appropriate font. Consider whether it might be better to (for example) give a transliteration of the word into Roman characters in addition to or instead of it, or to omit it altogether. It is also important to specify in the head part of the page that UTF-8 (the now universally understood encoding of Unicode) is what you are using, with the following tag: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. At ILRT we work to this standard, essential in a world where the Internet crosses all linguistic borders.
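Why the declaration matters can be shown with a short Python sketch. It encodes the two Welsh characters as UTF-8 bytes and then decodes those bytes as Latin-1, which is roughly what a browser does if it assumes the wrong character set. The variable names are only for illustration.

```python
welsh = "\u0175\u0177"  # w and y with circumflex

# In UTF-8 each of these characters is stored as two bytes.
utf8_bytes = welsh.encode("utf-8")

# Reading the same bytes as Latin-1 produces four wrong characters.
misread = utf8_bytes.decode("latin-1")

# Reading them as UTF-8 recovers the original text.
recovered = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "), misread, recovered)
```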
