
Not Just Western, Asian and Complex: The World Has More Than Three Languages

If you belong to the minority of people who only use their word processor to write documents in English, then you will hardly ever care about fonts for other languages. At most, you’ll want a different font for an emphasized word.

However, if you, like most people, write documents in other languages and scripts, you’ll usually need to choose different fonts for different languages. Some fonts include more than one script, but very few fonts include all the scripts.

Now, to specify non-Latin fonts you first need to enable support for this in your word processor, because developers of word processors assume that most people write in only one language:

LibreOffice language settings dialog with checkboxes to enable support for "Asian" and "CTL" languages

LibreOffice language settings dialog. Without getting into details, the corresponding box in Microsoft Word is similar.

After you’ve done this you’ll see a slightly different font selection dialog – now you can select the font for “Western text”, “Asian text” and “CTL text”:

LibreOffice character formatting dialog with font selection for "Western", "CTL" and "Asian" scripts.

LibreOffice character formatting dialog. Again, the corresponding dialog in Microsoft Word is similar.

This is wrong in every possible regard.

The simplest problem with this is that most people have no idea what “CTL” is. Microsoft Word calls this “Complex scripts”, and the C in CTL indeed stands for “Complex”, but most people are not supposed to know what “complex scripts” are either.

Furthermore, according to this weird division of the world’s languages, Hindi and Arabic are “complex”, but Japanese is “Asian”, even though Hindi and Arabic are also spoken in Asia. This is most probably a result of the ways Americans describe immigrants: The Chinese and the Japanese are “Asian Americans”, but Indians and Arabs are “Indian” and “Middle Eastern”.

This is preposterous. It has pestered me badly ever since i first used Microsoft Word in 1997, but somehow i never bothered to complain. So here i am, finally complaining about this atrocity.


“Complex scripts” is a very old-fashioned term that survived from the time when more or less anything that wasn’t Latin was considered “complex”. More precisely, it was used for scripts that were not just rows of letters like Latin, Cyrillic and Greek, but required connected letters like Arabic, ligatures like most scripts of India and its neighbors, or right-to-left text, like Hebrew and, again, Arabic. According to this logic, Latin and Greek should be quite complex, too, since most languages written in these scripts require combinations of diacritics, like in the Lithuanian word “rūgščių̃”… but this never bothered the programmers of word processors.

So this term, “complex”, was used by programmers, and even that was hardly justified. It was never meant to be used by ordinary people. A person who writes Arabic is not supposed to know that his script is “complex”, because as far as he’s concerned it’s the simplest script there is. In fact, it’s quite insulting. And most of all, it’s hard to understand: When a person wants to select a font for Arabic text, the most logical thing to ask him is to specify an “Arabic font” – not a “complex font”.

But beyond the strange terminology there’s an even worse practical problem. Let’s say that i got used to the fact that Microsoft and LibreOffice call my script “complex”; but what if i have more than one “complex” language in my document? It’s not an edge case at all. Lately i’ve been reading, and making little edits to, a Word document that is a grammar textbook of the Malayalam language for Hebrew-speaking students. Hebrew and Malayalam are both “complex”, but they are complex for entirely different reasons, and they need different fonts. The author of that document told me that it drove her nuts. I completely understand what she was talking about: she’s just one among millions of people who suffer from this, but for some reason not one of them complains.

The least inconvenient way to solve this problem with the current software is to use separate character styles for different “complex” languages, but most people don’t know what “character styles” are, and even for those who do, this solution is very inefficient.

So how should font selection dialogs really be done? They should treat each combination of language and script separately. This is a bit tricky, but only a bit.

The best place to start solving this would be to look at existing standards: ISO 15924, ISO 639 and the IANA Language Subtag Registry. ISO 15924 lists more than a hundred scripts; ISO 639 lists several thousand languages; the IANA Language Subtag Registry records the subtags used to specify combinations of languages, scripts and their varieties. Combinations are important, because it’s not enough to specify a “Latin” font or a “Serbian language”: Serbian can be written in Latin and Cyrillic; Azeri can be written in Latin, Cyrillic and Arabic, and in the Arabic script its direction changes, too.
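To make the idea of combinations concrete, here is a minimal Python sketch that splits tags such as sr-Latn or az-Arab into a language and a script and derives the text direction from the script alone. The function names and the tiny right-to-left table are made up for illustration, and the parser deliberately handles only the simple "language-Script" shape, not the full tag syntax.

```python
from typing import NamedTuple, Optional


class LangScript(NamedTuple):
    language: str            # ISO 639 code, e.g. "sr"
    script: Optional[str]    # ISO 15924 code, e.g. "Cyrl", or None if absent


def parse_tag(tag: str) -> LangScript:
    """Split a simple 'language[-Script]' tag. Regions, variants and
    other subtags are ignored in this simplified sketch."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    # A script subtag is four letters, conventionally written in title case.
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        script = parts[1].title()
    return LangScript(language, script)


# Hypothetical, deliberately incomplete table of right-to-left scripts.
RTL_SCRIPTS = {"Arab", "Hebr"}


def direction(tag: str) -> str:
    """Direction follows the script, not the language."""
    return "rtl" if parse_tag(tag).script in RTL_SCRIPTS else "ltr"
```

With this, direction("az-Latn") and direction("az-Cyrl") both give "ltr", while direction("az-Arab") gives "rtl": the same language, three different answers.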

This doesn’t mean at all that the font selection dialogs have to list thousands of combinations of languages and scripts. By default they should list a few languages that a user is expected to use, chosen for example by looking at which keyboard layouts the user has enabled in the operating system. And the user must be able to add more languages, by using some kind of an “Add” or “+” button: “I want to write Malayalam in this document; sometimes i want to do this in the Malayalam script in the Meera font, and sometimes i want to write it in IPA, which is a kind of Latin script, and then i want to use the Charis font.” In this scenario two lines would have to be added to the dialog using that add button.
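The data behind such a dialog could be as small as a table keyed by language and script. The following Python sketch is only an illustration of that idea: the class, its methods and the default font name are hypothetical, and nothing here reflects how LibreOffice or Word actually store font settings.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (ISO 639 language code, ISO 15924 script code)


@dataclass
class FontChoices:
    default_font: str
    rows: Dict[Key, str] = field(default_factory=dict)

    def seed(self, keys: Iterable[Key]) -> None:
        """Pre-fill rows, e.g. from the user's enabled keyboard layouts."""
        for key in keys:
            self.rows.setdefault(key, self.default_font)

    def add(self, language: str, script: str, font: str) -> None:
        """What the 'Add' or '+' button would do: one more dialog row."""
        self.rows[(language, script)] = font

    def font_for(self, language: str, script: str) -> str:
        return self.rows.get((language, script), self.default_font)

    def listing(self) -> List[str]:
        return [f"{lang}-{script}: {font}"
                for (lang, script), font in self.rows.items()]


def example_choices() -> FontChoices:
    # "Some Default Font" is a placeholder name, not a real font.
    choices = FontChoices(default_font="Some Default Font")
    choices.seed([("he", "Hebr"), ("en", "Latn")])  # assumed keyboard layouts
    choices.add("ml", "Mlym", "Meera")   # Malayalam in the Malayalam script
    choices.add("ml", "Latn", "Charis")  # Malayalam transcribed in IPA
    return choices
```

The scenario from the quote ends up as four rows: two seeded from the keyboard layouts and two added by hand, with Hebrew and Malayalam no longer competing for one “complex” slot.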

There may be more clever ways to solve this problem, but at this stage my proposal is certainly better than grouping the world’s languages into three arbitrary and outdated groups.


Now where does Wikimedia come in? Wikimedia projects, the most popular of which is Wikipedia, are massively multilingual. That’s why the Wikimedia Foundation has always taken internationalization seriously and recently created a whole team dedicated to it, a team of which i am proud to be a member. One of the most important and urgent things that this team does is adding web fonts support to our websites, so that people won’t see squares or question marks when they see a word in a language for which they don’t have a font on their computer.

The intention is to do it with orientation to languages and scripts, as described above. Even though a lot of people edit Wikipedia, it is still a website that is mostly read and not written by its visitors, so the fonts that will be used will be mostly decided by the programmers, that is, by our team. Word processors, however, are mostly used by people for writing, so they should combine language and script selection with manual font selection. Of course, providing good defaults would be a good idea.

Now all that’s left is for some LibreOffice developer to pick up the bug i opened about it and fix it, thus making LibreOffice far more friendly to the world than Microsoft Word is. After all, there are many more people who don’t speak English than those who do.


Three things made me write this post: the work of my team in Wikimedia on WebFonts, and especially the work of Santhosh Thottingal; my Malayalam classes with Ophira Gamliel; and Lior Kaplan’s and Caolán McNamara’s questions about the font selection dialog in LibreOffice. Thank you, Santhosh, Ophira, Lior and Caolán for making me finally write this post, which i wanted to write for about fourteen years.

Daily minefield

OK, that’s it. Hadar has to move to Haifa to do her Ph.D. in the Technion.

Which means that i’ll have to leave the beautiful Giv’at Ye’arim and look for a new home and a new job. At least i can be happy that it’s not Tel-Aviv.

In my last round of job hunting everybody happily accepted the CV in the RTF format. This time i tried to use PDF for a change. One workplace already specifically asked me to send it as DOC. Talk about freedom of choice. To hell with PDF, then.

I need to find a job, so i’ll send DOC, but i will only use OpenOffice to edit it.

Zip

I just discovered a curious thing.

If you rename an OpenOffice.org file to a .zip file, you can unzip it and read its innards in plain XML.

It doesn’t work like that in Microsoft Office 2003, but it should be the default in Office 2007 – except the actual XML will look completely different. To make things utterly confusing, Microsoft called their kind of XML “Office Open XML”. Get it? “Office Open”, but it is not compatible with Open Office.
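The same peek inside can be done without renaming anything, since a zip reader doesn’t care about the file extension. Here is a small Python sketch using the standard zipfile module. To stay self-contained it builds a tiny stand-in archive in memory instead of opening a real document; a real OpenOffice.org file has more parts than this, and the helper names are made up.

```python
import io
import zipfile
from typing import List


def list_xml_parts(data: bytes) -> List[str]:
    """Names of the XML files stored inside the zipped document."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".xml"))


def read_part(data: bytes, name: str) -> str:
    """Return one stored file as text, assuming it is UTF-8 encoded."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(name).decode("utf-8")


def make_demo_document() -> bytes:
    """Build a tiny stand-in archive. This is NOT a valid document,
    just enough structure to demonstrate the reading side."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<doc>שלום</doc>")
    return buffer.getvalue()
```

For a real file, reading its bytes with open(path, "rb").read() and passing them to list_xml_parts should show entries such as content.xml.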

Maqaf

There’s a hyphen in Hebrew, which doesn’t look like the regular hyphen. It is called “maqaf” (מקף) and it is aligned with the top of the line like this: ־.

It appears in Torah scrolls and in most printed books and newspapers; however, it doesn’t appear on keyboards, so most Israelis just type a minus instead. So בית־ספר (beit-sefer, school, lit. book-house) becomes בית-ספר or even בית ספר. The rules for using the maqaf are not taught in schools, so many people (me included) use it inconsistently and often omit it altogether.

Apparently common practice conflicts with Unicode here: according to the Unicode standard, maqaf should be used as the hyphen for Hebrew, and a proper implementation of Unicode will process it as a right-to-left character, unlike the minus, which is not a right-to-left character and is meant to be used with numbers. However, most popular implementations of Unicode (read: Microsoft Word and probably most web browsers, including Firefox) are not really correct. They make life easy for Israelis and treat the minus as the right-to-left hyphen, so it is easy to write this:

החנות פתוחה בשעות 09:00 – 16:00

(The shop is open 09:00 – 16:00)

The problem is that this disregards traditional Hebrew typography, and few people seem to care. OpenOffice.org is correct as far as Unicode goes, but most Israelis just think it is stupid that they can’t write the usual way, and so they throw centuries of our printing tradition in the garbage.
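The difference between the two characters can be checked directly in the Unicode character database. The Python sketch below uses the standard unicodedata module to look up each character’s name and bidirectional class, and adds a small helper that replaces a minus squeezed between two Hebrew letters with a maqaf. The helper and its rule are made up for illustration; real maqaf usage follows grammar, not just letter adjacency.

```python
import re
import unicodedata

MAQAF = "\u05be"        # HEBREW PUNCTUATION MAQAF
HYPHEN_MINUS = "-"      # the "minus" found on keyboards


def describe(ch: str) -> tuple:
    """Return (character name, bidirectional class) from the Unicode database."""
    return unicodedata.name(ch), unicodedata.bidirectional(ch)


# A hyphen-minus with a Hebrew letter (alef..tav) on each side.
_BETWEEN_HEBREW = re.compile(r"(?<=[\u05d0-\u05ea])-(?=[\u05d0-\u05ea])")


def with_maqaf(text: str) -> str:
    """Naive replacement: only touches a minus directly between Hebrew letters,
    so ranges of numbers such as opening hours are left alone."""
    return _BETWEEN_HEBREW.sub(MAQAF, text)
```

describe(MAQAF) reports the class "R", a strong right-to-left character, while describe(HYPHEN_MINUS) reports "ES", a weak class whose behavior is tied to the numbers around it.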

On my laptop i made a keyboard mapping that includes the maqaf and i try to use it whenever i can in email and documents. I use it in handwriting too. Some people on the Hebrew Wikipedia use it, although it is controversial. Some free-thinking Hebrew bloggers use it in their blogs (see Digital Words). And that’s about it.

But today i was pleasantly surprised. The maqaf appeared in an article about American junk-food on YNet (i wrote talkback 25). YNet is Israel’s number one online news source. I don’t think that all the articles use it – probably the author of this article was a crazy type like me, or maybe he used some auto-conversion software. I think that i’ll send an email to YNet asking them to use it everywhere.

Please tell me if you want the keyboard mapping with maqaf that i made. It is for Windows. If you use Linux, BSD or Mac, you are probably clever enough to find it on your system by yourself. If you have a server on which i can host it so the public will be able to download it, you’ll make me joyous.

Vegan Spam

Here you go:

X-Gmail-Received: af5da61812e6f0b5e7f7133d607317213a97b783
Delivered-To: amir.aharoni@gmail.com
Received: by 10.65.248.15 with SMTP id a15cs102679qbs;
        Mon, 17 Jul 2006 22:42:20 -0700 (PDT)
Received: by 10.49.41.18 with SMTP id t18mr336896nfj;
        Mon, 17 Jul 2006 22:42:20 -0700 (PDT)
Return-Path: <?WORD??WORD?@?mail_domain?>
Received: from F246A7D4ECFC4A2 ([210.75.200.85])
        by mx.gmail.com with ESMTP id r33si415786nfc.2006.07.17.22.42.18;
        Mon, 17 Jul 2006 22:42:20 -0700 (PDT)
Received-SPF: fail
Message-ID: <36781866608732.A3D0FB1D07@5MVMO>
From: "{WORD)" <{_WORD){WORD)@{MAIL_DOMAIN}>
To: <amir.aharoni@gmail.com>
Subject: {}NEW} {STOCK_2}
Date: {DATE}
MIME-Version: 1.0
X-Mailer: Microsoft Office Outlook, Build 11.0.5510
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Thread-Index: {ALNUM[36-36]}
Content-Type: text/plain;
        charset="Windows-1252"
Content-Transfer-Encoding: 7bit

{BODY}

It gives a peek into the spammers’ inner systems. {}NEW} {STOCK_2}, {BODY} and {_WORD){WORD)@{MAIL_DOMAIN} are probably template placeholders for actual values, and something went wrong in their processing. The actual message that i received was blank.
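Nobody outside knows how the spammers’ software really works, but here is one purely hypothetical way for such a failure to happen, sketched in Python: a filler that only recognizes well-formed {NAME} placeholders silently leaves malformed ones like {}NEW} or {WORD) in the output, and leaves known-looking ones in place when no value was supplied. All names and patterns here are invented for the illustration.

```python
import re
from typing import Dict, List

# Only well-formed placeholders such as {STOCK_2} are recognized.
PLACEHOLDER = re.compile(r"\{([A-Z_0-9]+)\}")

# Looser pattern for spotting leftovers, including malformed ones
# such as {}NEW}, {_WORD) or {ALNUM[36-36]}.
LEFTOVER = re.compile(r"\{\}?_?[A-Z][A-Z_0-9\[\]-]*[})]")


def fill(template: str, values: Dict[str, str]) -> str:
    """Substitute known placeholders; unknown or malformed ones stay as-is."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(substitute, template)


def leftovers(text: str) -> List[str]:
    """List anything that still looks like an unfilled placeholder."""
    return LEFTOVER.findall(text)
```

Running leftovers() over the headers above would flag the Subject, From, Date and Thread-Index lines, which is the kind of check that would have stopped this message before it was sent.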

Also, it was sent by Microsoft Office Outlook. Is Outlook efficient enough to process spam? Or is it fake?


