Not Just Western, Asian and Complex: The World Has More Than Three Languages

If you belong to the minority of people who only use their word processor to write documents in English, then you will hardly ever care about fonts for other languages. At most, you’ll want a different font for an emphasized word.

However, if you, like most people, write documents in other languages and scripts, you’ll usually need to choose different fonts for different languages. Some fonts include more than one script, but very few fonts include all the scripts.

Now, to specify non-Latin fonts you first need to enable support for this in your word processor, because developers of word processors assume that most people write in only one language:

LibreOffice language settings dialog with checkboxes to enable support for "Asian" and "CTL" languages

LibreOffice language settings dialog. Without getting into details, the corresponding box in Microsoft Word is similar.

After you’ve done this you’ll see a slightly different font selection dialog – now you can select the font for “Western text”, “Asian text” and “CTL text”:

LibreOffice character formatting dialog with font selection for "Western", "CTL" and "Asian" scripts.

LibreOffice character formatting dialog. Again, the corresponding dialog in Microsoft Word is similar.

This is wrong in every possible regard.

The simplest problem with this is that most people have no idea what “CTL” is. Microsoft Word calls this “Complex scripts”, and the C in CTL indeed stands for “Complex”, but most people are not supposed to know what “complex scripts” are either.

Furthermore, according to this weird division of the world’s languages, Hindi and Arabic are “complex”, but Japanese is “Asian”, even though Hindi and Arabic are also spoken in Asia. This is most probably a result of the ways Americans describe immigrants: The Chinese and the Japanese are “Asian Americans”, but Indians and Arabs are “Indian” and “Middle Eastern”.

This is preposterous. It has pestered me really badly ever since i first used Microsoft Word in 1997, but somehow i never bothered to complain. So here i am, finally complaining about this atrocity.


“Complex scripts” is a very old-fashioned term that survived from the time when more or less anything that wasn’t Latin was considered “complex”. More precisely, it was used for scripts that were not just rows of letters like Latin, Cyrillic and Greek, but required connected letters like Arabic, ligatures like most scripts of India and its neighbors, or right-to-left text, like Hebrew and, again, Arabic. According to this logic, Latin and Greek should be quite complex, too, since most languages written in these scripts require combinations of diacritics, like in the Lithuanian word “rūgščių̃”… but this never bothered the programmers of word processors.
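The point about diacritics can be checked with a few lines of Python. This is only a sketch using the standard `unicodedata` module; the helper name `combining_marks` is made up for this illustration. It counts how many combining marks the Lithuanian word needs in its composed (NFC) and decomposed (NFD) forms.

```python
import unicodedata

# The Lithuanian word from the paragraph above, spelled with escapes so the
# code points are unambiguous: r, u-macron, g, s-caron, c-caron, i,
# u-ogonek, combining tilde.
word = "r\u016bg\u0161\u010di\u0173\u0303"


def combining_marks(text: str) -> int:
    """Count code points that have a non-zero canonical combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)


composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

print(len(composed), combining_marks(composed))      # 8 1
print(len(decomposed), combining_marks(decomposed))  # 12 5
```

Even the fully composed form keeps one combining mark, because Unicode has no single precomposed character for "u with ogonek and tilde". So a renderer for this "Western" word still has to position a mark over a base letter.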

So this term, “complex”, was used by programmers, and even that was hardly justified. It was never meant to be used by ordinary people. A person who writes Arabic is not supposed to know that his script is “complex”, because as far as he’s concerned it’s the simplest script there is. In fact, it’s quite insulting. And most of all, it’s hard to understand: When a person wants to select a font for Arabic text, the most logical thing to ask him is to specify an “Arabic font” – not a “complex font”.

But beyond the strange terminology there’s an even worse practical problem. Let’s say that i got used to the fact that Microsoft and LibreOffice call my script “complex”; but what if i have more than one “complex” language in my document? It’s not an edge case at all. Lately i’ve been reading–and making little edits to–a Word document, which is a grammar textbook of the Malayalam language for Hebrew-speaking students. Hebrew and Malayalam are both “complex”, but they are complex for entirely different reasons, and they need different fonts. The author of that document told me that it drove her nuts. I completely understand what she was talking about–she’s just one among millions of people who suffer from this… but for some reason not one of them complains.

The relatively convenient way to solve this problem with the current software is to use separate character styles for different “complex” languages, but most people don’t know at all what “character styles” are, and even for those who do, this solution would be very inefficient.

So how should font selection dialogs really be done? They should treat each combination of language and script separately. This is a bit tricky, but only a bit.

The best place to start solving this would be to look at existing standards: ISO 15924, ISO 639 and the IANA Language subtag registry. ISO 15924 lists more than a hundred scripts; ISO 639 lists several thousand languages; the IANA Language subtag registry defines the rules for specifying combinations of languages, scripts and their varieties. Combinations are important, because it’s not enough to specify a “Latin” font or a “Serbian language”: Serbian can be written in Latin and Cyrillic, and Azeri can be written in Latin, Cyrillic and Arabic–in which case its direction changes, too.
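To make the idea of combinations concrete, here is a minimal sketch of splitting tags such as "sr-Latn" or "az-Arab" into a language part and a script part. It is deliberately simplified and is not a full parser for the tag syntax (the one described in BCP 47). The names `LangScript`, `parse_tag`, `direction` and `RTL_SCRIPTS` are invented for this example, and the right-to-left set is only an illustrative subset.

```python
from typing import NamedTuple, Optional


class LangScript(NamedTuple):
    language: str
    script: Optional[str]


def parse_tag(tag: str) -> LangScript:
    """Split a tag such as 'sr-Cyrl' into language and script subtags.

    Simplified on purpose: accepts a 2-3 letter language subtag, optionally
    followed by a 4-letter script subtag, and ignores anything after that.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    if not (language.isascii() and language.isalpha() and 2 <= len(language) <= 3):
        raise ValueError(f"unsupported language subtag in {tag!r}")
    script = None
    if len(parts) > 1:
        candidate = parts[1]
        if candidate.isascii() and candidate.isalpha() and len(candidate) == 4:
            script = candidate.title()  # script codes are written like 'Latn'
    return LangScript(language, script)


# Scripts written right to left; an illustrative subset, not a complete list.
RTL_SCRIPTS = {"Arab", "Hebr"}


def direction(tag: str) -> str:
    """Return 'rtl' or 'ltr' based on the explicit script subtag only."""
    return "rtl" if parse_tag(tag).script in RTL_SCRIPTS else "ltr"


for tag in ("sr-Latn", "sr-Cyrl", "az-Latn", "az-Cyrl", "az-Arab"):
    print(tag, parse_tag(tag), direction(tag))
```

Note that this sketch only looks at an explicit script subtag. A bare "he" would come out as left-to-right, because inferring a language's default script needs data from the registry that is not included here.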

This doesn’t mean at all that the font selection dialogs have to list thousands of combinations of languages and scripts. By default they should list a few languages that a user is expected to use, for example based on which keyboard layouts the user has enabled in his operating system. And the user must be able to add more languages, by using some kind of an “Add” or “+” button: “I want to write Malayalam in this document; sometimes i want to do this in the Malayalam script in the Meera font, and sometimes i want to write it in IPA, which is a kind of a Latin script, and then i want to do it in the Charis font.” In this scenario two lines would have to be added to the dialog using that add button.
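The data behind such a dialog could be as simple as a table from language-and-script tags to font names. The sketch below is one possible shape for it, not a description of how any word processor works. The class `FontChoices` and its methods are hypothetical, and the default and Hebrew font names are placeholders. Only Meera and Charis come from the scenario above. The tag "ml-Latn-fonipa" uses the registered "fonipa" variant subtag for IPA.

```python
from typing import Dict


class FontChoices:
    """Per-document font table keyed by language-script tags (hypothetical)."""

    def __init__(self, default_font: str) -> None:
        self.default_font = default_font
        self._fonts: Dict[str, str] = {}

    def add(self, tag: str, font: str) -> None:
        """What an "Add" button would do: store one more row."""
        self._fonts[tag.lower()] = font

    def font_for(self, tag: str) -> str:
        """Return the most specific match, dropping trailing subtags."""
        parts = tag.lower().split("-")
        while parts:
            font = self._fonts.get("-".join(parts))
            if font is not None:
                return font
            parts.pop()  # e.g. 'he-hebr-il' falls back to 'he-hebr'
        return self.default_font


# Placeholder font names, except Meera and Charis from the text above.
choices = FontChoices(default_font="Example Default Font")
choices.add("he-Hebr", "Example Hebrew Font")
choices.add("ml-Mlym", "Meera")
choices.add("ml-Latn-fonipa", "Charis")

print(choices.font_for("ml-Mlym"))
print(choices.font_for("ml-Latn-fonipa"))
print(choices.font_for("he-Hebr-IL"))
print(choices.font_for("ru-Cyrl"))
```

Here Hebrew and Malayalam each get their own row, and the two ways of writing Malayalam get separate fonts, which is exactly what a single “complex” font slot cannot express.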

There may be more clever ways to solve this problem, but at this stage my proposal is certainly better than grouping the world’s languages into three arbitrary and outdated groups.


Now where does Wikimedia come in? Wikimedia projects, the most popular of which is Wikipedia, are massively multilingual. That’s why the Wikimedia Foundation has always taken internationalization seriously and recently created a whole team dedicated to it–a team of which i am proud to be a member. One of the most important and urgent things that this team does is adding web fonts support to our websites, so that people won’t see squares or question marks when they see a word in a language for which they don’t have a font on their computer.

The intention is to do it with orientation to languages and scripts, as described above. Even though a lot of people edit Wikipedia, it is still a website that is mostly read and not written by its visitors, so the fonts that will be used will mostly be decided by the programmers, that is, by our team. Word processors, however, are mostly used by people for writing, so they should combine language and script selection with manual font selection. Of course, providing good defaults would be a good idea.

Now all that’s left is for some LibreOffice developer to pick up the bug i opened about it and fix it, thus making LibreOffice far more friendly to the world than Microsoft Word is. After all, there are many more people who don’t speak English than those who do.


Three things made me write this post: the work of my team in Wikimedia on WebFonts and especially the work of Santhosh Thottingal; my Malayalam classes with Ophira Gamliel; and Lior Kaplan’s and Caolán McNamara’s questions about the font selection dialog in LibreOffice. Thank you, Santhosh, Ophira, Lior and Caolán for making me finally write this post, which i have wanted to write for about fourteen years.

1 Response to “Not Just Western, Asian and Complex: The World Has More Than Three Languages”


  1. Caolán McNamara, 2011-10-24 at 11:05

    FWIW, in LibreOffice the CTL settings are enabled by default when it is launched from a CTL locale, and similarly the CJK settings are enabled by default when launched from a CJK locale, so in the general case the hunting for the hard-to-find dialog isn’t necessary to enable them. This all falls down of course if launched from e.g. an en-US locale with a desire to write in e.g. Hebrew.

    It is indeed very clearly tuned to a division between the classic western scripts, then extended to CJK, and then a big fat catch-all CTL category rammed in to pick up the rest.

    There’s no real defence of course, but those are the three categories in ODF. And OpenXML IIRC has three, or four, equally odd divisions. There is merit in perhaps trying to hide this nonsense from the end-user and attempt to build a meaningful intermediate language+script category structure on top of it to attempt to make it sane.

