Thursday, August 30, 2012

Non-English characters: How are they displayed

We have come a long way from the days when reading any non-English text meant installing special fonts in Windows. After I started using Linux I always wondered how all these kinds of characters displayed so perfectly. I knew that it was Unicode. But what I didn't grasp immediately is how the OS can display such an enormous variety of characters. Unicode is not just a character set like the ones we learnt in school; in some sense it is also a composition scheme. It assigns a number (a code point) to each basic character, including accents and other combining marks, and more complex characters can then be constructed out of these elementary pieces.

What happens is that two characters like a and ' (a combining acute accent) are merged to get á. People who know LaTeX will instantly recognize this: it is exactly how \'a produces á there. But this composing is done directly by the OS and its text-rendering stack, not by separate software such as the browser.

This is called "pre-composition" or "decomposition", depending on how the merging of the characters is achieved: the character is either stored as a single pre-composed code point, or kept as a base character followed by combining marks.
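As a small sketch of the two forms, Python's standard `unicodedata` module can convert between them (assuming a UTF-8 terminal; the strings here are just illustrative):

```python
import unicodedata

# Pre-composed form (NFC): 'á' becomes the single code point U+00E1.
nfc = unicodedata.normalize("NFC", "a\u0301")   # input: 'a' + combining acute

# Decomposed form (NFD): 'á' becomes 'a' (U+0061) + combining acute (U+0301).
nfd = unicodedata.normalize("NFD", "\u00e1")    # input: pre-composed 'á'

print([hex(ord(c)) for c in nfc])  # ['0xe1']
print([hex(ord(c)) for c in nfd])  # ['0x61', '0x301']
```

Both strings render identically as á on screen; only their underlying code-point sequences differ.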

A very good explanation of the difference between these two procedures can be found at

It explains why characters saved in a file on one system may not display properly on another. For example, a file with special characters saved on Linux (which uses pre-composition) may not display correctly on a Mac (which uses decomposition), and vice versa. The industry-recommended form is the pre-composition method.

This "double standards" (actually there are 4 of them) also affects coders who collaborate on a single project but use different systems.

