Software Localization

What is a code page and why is it needed?

Code pages are necessary because ANSI files only have 8 bits to display a character (char). This means there are only 256 possible characters–not nearly enough for all languages of the world.

The American charset needs only 128 different chars = 7-bit. Because 7-bit was a bit inefficient for computers, this led to the need for another bit; thus, currently, another 128 possibilities are available to display chars.

On MS-DOS systems, some of these bits have been used for drawing boxes and lines. With Windows, these boxes and lines have been removed from the charsets and more foreign chars have been added. For the most Western languages like English, French, German, and others, these additional chars work efficiently. For example, the German charset needs only seven extra chars to the US charset – leaving enough space for special chars from Spain, Norway, and so forth.

This is a dialog shown with wrong system code page
Typical display of a dialog shown with the wrong system codepage.

However, for certain charsets, such as Cyrillic charsets, the space was not big enough. Codepages fill that gap. A code page in Windows is nothing more than information, so that the upper 128 chars use some other characters. For example, instead of the German umlaut Ü, a Cyrillic Ш appears. both of these items have the ANSI value 205. Thus, if the Windows codepage 1252 is selected, a Ü appears, while with the Russian Windows codepage 1251 Ш (sha) is displayed.

If code pages are used, the system cannot possibly show Ü and Ш on the same display. This is only possible if UNICODE is used. For example, this page uses UNICODE (UTF-8) to display both chars.

While this solves the problem for most of the languages, the code page technique does not help languages with more than 128 special characters, such as Japanese, Korean and Chinese. For these languages, MBCS is available. While the lower 128 characters are still the same as in US code pages, the upper 128 are specially encoded. In this system, one character of the upper 128 chars starts a multi-byte sequence. This means that one character is stored in one or many chars. For example, in Japanese shift-jis, one character can use up to five bytes.

Thus, if a person writes a text file on her or his computer and does not use UNICODE to save it, the current code page is used. If this file is given to someone with some other current codepage, the file is not displayed correctly. So, if you are in Western Europe or the USA, and you get a text file from someone in Greece, Turkey, China, or Japan, the chances are high that the file is useless to you. Kaboom can fix these problems. Simply convert the file into UNICODE and print, edit, or use the file in any way–without losing information. If you edit the file and you want to return it with your changes, simply convert the file back into the code page that the receiver needs. Kaboom makes the entire process easy and quick.

Leave a Reply