6.9. ASCII Text Encoding

In plain text files, each character is represented as a certain numeric code. Characters and their matching codes are defined in code tables. Depending on the code tables used by an application and by the print filter, the same code may be represented as one character on the screen and as another one when printed.

The standard 8-bit character sets described here only comprise the range from code 0 to code 255. Of these, codes 0 through 127 represent the pure ASCII set, which is identical in all of these encodings. It comprises all unaccented letters as well as digits and some special characters, but none of the country-specific special characters. Codes 128 through 255 are not part of ASCII. They are used for country-specific special characters, such as umlauts.

However, the number of special characters in different languages is much larger than 128. Therefore, codes 128 to 255 are not assigned to the same characters in every encoding. Rather, the same code may represent different country-specific characters, depending on the code table in use.

The codes for Western European languages are defined by ISO-8859-1 (also called Latin 1). The ISO-8859-2 encoding (Latin 2) defines the character sets for Central and Eastern European languages. Code 241 (octal), for example, is defined as the (Spanish) inverted exclamation mark in ISO-8859-1, but the same code 241 is defined as an uppercase A with an ogonek in ISO-8859-2. The ISO-8859-15 encoding is basically the same as ISO-8859-1, but, among other things, it includes the Euro currency sign, defined as code 244 (octal).
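The differences described above can be checked from a shell. The following is a minimal sketch, not part of the original procedure: it assumes that the iconv utility is installed and that the terminal displays UTF-8, and the helper name show_code is made up for illustration. It prints code 101 (octal, the ASCII letter A) and codes 241 and 244 (octal) as interpreted by each of the three code tables.

```shell
# Hypothetical helper: print the byte given as an octal escape ($1),
# interpreted in encoding $2 and converted to UTF-8.
# Assumes iconv is installed and the terminal displays UTF-8.
show_code() {
    printf "$1" | iconv -f "$2" -t UTF-8
}

for enc in ISO-8859-1 ISO-8859-15 ISO-8859-2; do
    printf '%-12s 101: ' "$enc"; show_code '\101' "$enc"
    printf '  241: ';            show_code '\241' "$enc"
    printf '  244: ';            show_code '\244' "$enc"
    printf '\n'
done
```

Code 101 should appear as the same letter in every row, because it lies in the ASCII range, while the characters shown for 241 and 244 should vary with the encoding.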

6.9.1. A Sample Text

Enter each of the commands below as a single line. A backslash (\) at the end of a displayed line only marks a line break in this document and should not be typed.

Create a sample text file with:

echo -en "\rCode 241(octal): \
\241\r\nCode 244(octal): \244\r\f" >example
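Some shells' built-in echo only interprets octal escapes written as \0nnn, so the command above may write a literal backslash instead of the intended byte. As a possible alternative (an assumption, not the method given in this section), the sketch below writes the same bytes with printf, whose format string accepts \nnn, and then dumps the file with od to confirm that the bytes 241 and 244 (octal) are present.

```shell
# Alternative sketch: create the sample file with printf instead of echo.
# The format string contains no % characters, so it is written as-is
# after the backslash escapes have been expanded.
printf '\rCode 241(octal): \241\r\nCode 244(octal): \244\r\f' >example

# Show every byte of the file; od -c prints bytes outside the printable
# ASCII range as three-digit octal numbers.
od -An -c example
```

In the od output, the two special characters should appear as 241 and 244, each directly before a \r.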

6.9.1.1. Visualizing the Sample with Different Encodings

Under X, enter these commands to open three terminals:

xterm -fn '-*-*-*-*-*-*-14-*-*-*-*-*-iso8859-1' -title iso8859-1 &
xterm -fn '-*-*-*-*-*-*-14-*-*-*-*-*-iso8859-15' -title iso8859-15 &
xterm -fn '-*-*-*-*-*-*-14-*-*-*-*-*-iso8859-2' -title iso8859-2 &

In each of the three terminals, display the sample file with cat example.

The iso8859-1 terminal should display code 241 as the inverted (Spanish) exclamation mark and code 244 as the general currency symbol.

The iso8859-15 terminal should display code 241 as the inverted (Spanish) exclamation mark and code 244 as the Euro symbol.

The iso8859-2 terminal should display code 241 as an uppercase A with an ogonek and code 244 as the general currency symbol.

Because character encodings are defined as fixed sets, it is not possible to combine all the different country-specific characters with each other in an arbitrary way. For example, the A with an ogonek cannot be used together with the Euro symbol in the same text file.
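This limitation can be illustrated with a small sketch. It assumes that iconv is installed and that it exits with a non-zero status when a character cannot be represented in the target encoding (as the glibc version does); the helper name fits_in is hypothetical. The test text consists of the A with an ogonek followed by the Euro symbol, written as UTF-8 bytes in octal.

```shell
# Hypothetical helper: succeed only if the UTF-8 text on stdin can be
# represented in encoding $1. Assumes iconv fails on unconvertible input.
fits_in() {
    iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

# A with ogonek (304 204) followed by the Euro sign (342 202 254).
both='\304\204\342\202\254'

for enc in ISO-8859-1 ISO-8859-15 ISO-8859-2; do
    if printf "$both" | fits_in "$enc"; then
        echo "$enc: both characters fit"
    else
        echo "$enc: at least one character is missing"
    fi
done
```

None of the three encodings used in this section should accept both characters: ISO-8859-2 has the A with an ogonek but no Euro symbol, ISO-8859-15 has the Euro symbol but no A with an ogonek, and ISO-8859-1 has neither.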

To obtain more information (including a correct representation of each character), consult the corresponding man page in each terminal: iso_8859-1 in the iso8859-1 terminal, iso_8859-15 in the iso8859-15 terminal, and iso_8859-2 in the iso8859-2 terminal.

6.9.1.2. Printing the Sample with Different Encodings

When printed, ASCII text files, such as the example file, are interpreted in the same way, according to the encoding set for the print queue used. Word processor documents should not be affected by this, because their print output is in PostScript format (not ASCII).

Consequently, when printing the above example file, characters are represented according to the encoding set for ASCII files in your printing system. You can also convert the text file into PostScript beforehand to change the character encoding as needed. The following a2ps commands achieve this for the example file:

a2ps -1 -X ISO-8859-1 -o example-ISO-8859-1.ps example 
a2ps -1 -X ISO-8859-15 -o example-ISO-8859-15.ps example 
a2ps -1 -X ISO-8859-2 -o example-ISO-8859-2.ps example 

When the files example-ISO-8859-1.ps, example-ISO-8859-15.ps, and example-ISO-8859-2.ps are printed, each appears with the encoding selected in the corresponding a2ps command.