Archive for the 'localization' Category

The Case for Localizing Names, part 2

My name is written Amir Elisha Aharoni in English. In Hebrew it’s אמיר אלישע אהרוני, in Russian it’s Амир Элиша Аарони, in Hindi it’s अमीर एलिशा अहरोनि. It could be written in hundreds of other languages in many different ways.

More importantly, if I fill out a form in Hebrew, I should write my name in Hebrew and not in English or in any other language.

Based on this simple notion, I wrote a post a year ago in support of localizing people’s names. I basically suggested that it should be possible to have a person’s name written in more than one language in social networks, in the “from” and “to” fields in email, and in any other relevant place. Facebook allows doing this, but in a very rudimentary way; for example, the number of possible languages is very limited.

Today I am participating in the Open Source Language Summit in the Red Hat offices in Pune. Here we have, among many other talented and interesting people, two developers from the Mifos project, which creates Free software for microfinance. Mifos is being translated at translatewiki.net, a software translation site of which I am one of the developers.

Nayan Ambali, one of the Mifos developers, told me that they actually plan to implement a name localization feature in their software. This is not related to software localization, where a pre-defined set of strings is translated. It is something to be translated by the users of Mifos itself. The particular reason why Mifos needs such a feature comes from its nature as microfinance software: financial documents must be filled in the language of each country for legal purposes. Therefore, a Mifos user in the Indian state of Karnataka may need to have her name written in the software in English, Hindi, and Kannada – different languages, which are needed in different documents.

A simple sketch of database structure for storing names in multiple languages

Such a feature is quite simple to implement. In the backend this means that the name must be stored in a separate table that will hold names in different languages; see the sketch I made with Nayan above. On the frontend it will need a widget for adding names in different languages, similar to the one that Wikidata has; see the screenshot below.
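The backend part could be sketched roughly as follows, using Python's built-in sqlite3 module. The table and column names (person, person_name, lang) and the helper functions are hypothetical; they only illustrate the idea of keeping names in a separate table keyed by language and are not the actual Mifos schema.

```python
import sqlite3


def create_schema(conn):
    """Create a person table and a separate table for localized names."""
    conn.executescript(
        """
        CREATE TABLE person (
            id INTEGER PRIMARY KEY
        );
        CREATE TABLE person_name (
            person_id INTEGER NOT NULL REFERENCES person(id),
            lang      TEXT NOT NULL,   -- language code, e.g. 'en', 'hi', 'kn'
            name      TEXT NOT NULL,
            PRIMARY KEY (person_id, lang)
        );
        """
    )


def set_name(conn, person_id, lang, name):
    """Add or replace the person's name in one language."""
    conn.execute(
        "INSERT OR REPLACE INTO person_name (person_id, lang, name) "
        "VALUES (?, ?, ?)",
        (person_id, lang, name),
    )


def get_name(conn, person_id, lang, fallback="en"):
    """Return the name in the requested language, else in the fallback."""
    for code in (lang, fallback):
        row = conn.execute(
            "SELECT name FROM person_name WHERE person_id = ? AND lang = ?",
            (person_id, code),
        ).fetchone()
        if row:
            return row[0]
    return None


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO person (id) VALUES (1)")
set_name(conn, 1, "en", "Amir Elisha Aharoni")
set_name(conn, 1, "he", "אמיר אלישע אהרוני")
set_name(conn, 1, "ru", "Амир Элиша Аарони")

print(get_name(conn, 1, "he"))
print(get_name(conn, 1, "kn"))  # no Kannada name stored, falls back to English
```

The one-row-per-language layout means a new language needs no schema change, which fits the "add more languages" widget on the frontend.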

The name of Steven Spielberg in many languages in Wikidata, with an option to add more languages

Of course, there’s also the famous problem of falsehoods that programmers believe about names, but this would be a good first step that can provide a good example to other programs.

A Relevant Tower of Babel

The Tower of Babel is frequently used as a symbol of foreign languages. For example, several language software packages are named after it, such as the Babylon electronic dictionary, MediaWiki’s Babel extension and the Babelfish translation service (itself named after the Babel fish from The Hitchhiker’s Guide).

In this post I shall use the Tower of Babel in a somewhat more relevant and specific way: It will speak about multilingualism and about Babel itself.

This is how most people saw the Wikipedia article about the Tower of Babel until today:

The Tower of Babel article. Notice the pointless squares in the Akkadian name. They are called “tofu” in the jargon of internationalization programmers.

And this is how most people will see it from today:

And we have the name written in real Akkadian cuneiform!

Notice how the Akkadian name now appears as actual Akkadian cuneiform, and not as meaningless squares. Even if you, like most people, cannot actually read cuneiform, you probably understand that showing it this way is more correct, useful and educational.

This is possible thanks to the webfonts technology, which was enabled on the English Wikipedia today. It was already enabled in Wikipedias in some languages for many months, mostly in languages of India, which have severe problems with font support in the common operating systems, but now it’s available in the English Wikipedia, where it mostly serves to show parts of text that are written in exotic fonts.

The current iteration of the webfonts support in Wikipedia is part of a larger project: the Universal Language Selector (ULS). I am very proud to be one of its developers. My team in Wikimedia developed it over the last year or so, during which it underwent a rigorous process of design, testing with dozens of users from different countries, development, bug fixing and deployment. In addition to webfonts it provides an easy way to pick the user interface language, and to type in non-English languages (the latter feature is disabled by default in the English Wikipedia; to enable it, click the cog icon near “Languages” in the sidebar, then click “Input” and “Enable input tools”). In the future it will provide even more abilities, so stay tuned.

If you edit Wikipedia, or want to try editing it, one way in which you could help with the deployment of webfonts would be to make sure that all foreign strings in Wikipedia are marked with the appropriate HTML lang attribute; for example, that every Vietnamese string is marked as <span lang="vi" dir="ltr">. This will help the software apply the webfonts correctly, and in the future it will also help spelling and hyphenation software, etc.
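To make the markup concrete, here is a minimal sketch of a helper that wraps a string in such a span. The function name lang_span and the small RTL_LANGS set are assumptions made for illustration; the output is plain HTML, not Wikipedia template syntax, and a real tool would need a fuller list of right-to-left languages.

```python
from html import escape

# Hypothetical, deliberately tiny set of right-to-left language codes.
RTL_LANGS = {"ar", "fa", "he", "ur"}


def lang_span(text, lang):
    """Wrap text in a span with explicit lang and dir attributes."""
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return '<span lang="{}" dir="{}">{}</span>'.format(
        escape(lang, quote=True), direction, escape(text)
    )


print(lang_span("Tiếng Việt", "vi"))
# prints: <span lang="vi" dir="ltr">Tiếng Việt</span>
```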

This wouldn’t be possible without the help of many, many people: the developers of Mozilla Firefox, Google Chrome, Safari, Microsoft Internet Explorer and Opera, who developed the support for webfonts in these browsers; the people in Wikimedia who designed and developed the ULS: Alolita Sharma, Arun Ganesh, Brandon Harris, Niklas Laxström, Pau Giner, Santhosh Thottingal and Siebrand Mazeland; the many volunteers who tested ULS and reported useful bugs; the people in Unicode, such as Michael Everson, who work hard to give a number to every letter in every imaginable alphabet and make massive online multilingualism possible; and last but not least, the talented and generous people who developed all those fonts for the different scripts and released them under Free licenses. I send you all my deep appreciation, as a developer and as a reader of Wikipedia.

Always define the language and the direction of your HTML documents, part 02: Backwards English

In part 01 of this series, I showed why it is important to always define the language and the direction of all HTML content and not rely on the defaults: The content may get embedded in a document with a different direction and be displayed incorrectly.

This issue is laughably easy to avoid: If you are writing the content, you are supposed to know in what language it is written, so if it’s English, just write <html lang="en" dir="ltr"> even though these seem to be the defaults. Nineteen or so characters that ensure your content is readable and not displayed backwards. Please do it always and tell all your friends to do it.

The problem is that you don’t only have to explicitly set the language and the direction, but, as silly as it sounds, you have to set them correctly, too. A more subtle, but nevertheless quite frequent and disruptive bug is displaying presumably, but not actually, translated content in a different direction. This happens quite frequently when a website supports the browser language detection feature, known as Accept-Language:

  1. The web server sees that the browser requests content in Hebrew.
  2. The web server sends a response with <html lang="he" dir="rtl">, but because the website is not actually translated, the text is shown in the fallback language, which is usually English.
  3. The user sees the content just like this numbered list, which I intentionally set to dir="rtl": with the numbers and the punctuation on the wrong side, and possibly invisible, because English is not a right-to-left language.

Of course, it can go even worse. Arrows can point the wrong way and buttons and images can overlap and hide each other, rendering the page not just hard to read, but totally unusable.

This bug is also an example of the Software Localization Paradox: It manifests itself when Accept-Language is not English, but most developers install English operating systems and don’t bother to change the preferred language settings in the browser, so they never see how this bug manifests itself. The site developers don’t bother to test for it either.

The solution, of course, is to set a different language and direction only if the site is actually translated, and not to pretend that it’s translated if it’s not.
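A rough sketch of that rule in Python: pick the language from Accept-Language only if a translation really exists, and derive the direction from the language that will actually be served. The function names, the RTL_LANGS set and the translated argument are assumptions for illustration, and the header parsing is deliberately simplified rather than a complete implementation of the HTTP rules.

```python
# Hypothetical, deliberately tiny set of right-to-left language codes.
RTL_LANGS = {"ar", "fa", "he"}


def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        q = 1.0  # a tag without an explicit weight counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        weighted.append((q, tag.strip().lower()))
    # sorted() is stable, so tags with equal weight keep the header order.
    ordered = sorted(weighted, key=lambda item: -item[0])
    return [tag for q, tag in ordered if q > 0]


def pick_lang_and_dir(header, translated, default="en"):
    """Choose lang and dir based on the content that will really be shown."""
    lang = default
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]
        if primary in translated:
            lang = primary
            break
    return lang, ("rtl" if lang in RTL_LANGS else "ltr")


# The browser asks for Hebrew, but only English content exists:
print(pick_lang_and_dir("he,en;q=0.8", translated={"en"}))
# prints: ('en', 'ltr')
```

The point of the design is that lang and dir are computed from the same value, so the page can never claim to be Hebrew while showing English text.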

Here are two examples of such brokenness. Both sites are important and useful, but hard to use for people whose Accept-Language is Hebrew, Persian or Arabic.

Here’s how the Mozilla Developer Network website looks in fake Hebrew:

Mozilla Developer Network website, in English, but right-to-left

Notice how the full stops are on the left end and how the text overlaps the images in the tiles on the right-hand side. This is how it is supposed to look, more or less:

Mozilla Developer Network home page in English, left-to-right

I manually changed dir="rtl" to dir="ltr" using the element inspector from Firefox’s developer tools and I also had to tweak a CSS class to move the “mozilla” tab at the top.

The above troubles are reported as bug 816443 – lang and dir attributes must be used only if the page is actually translated.

After showing an example of a web development bug from a site for, ahem, web developers, here is an even funnier example: The home page of Unicode’s CLDR. That’s right: Unicode’s own website shows text with incorrect direction:

The Unicode CLDR website, in English but right-to-left

The only words translated here are “Contents” (תוכן) and “Search this site” (חיפוש באתר זה), which is not so useful. The rest is shown in English, and the direction is broken: Notice the strange alignment of the content and the schedule table. A few months ago that table was so broken that its content wasn’t visible at all, but that was probably patched.

Here’s how it is supposed to look:

The CLDR home page in English, appropriately left-to-right

I tried reporting the CLDR home page direction bug, but it was closed as “out-of-scope”: The CLDR developers say that the Google Sites infrastructure is to blame. This is frustrating, because as far as I know Google Sites doesn’t have a proper bug reporting system and all I can do is write a question about that direction problem in the Google Sites forum and hope that somebody notices it or poke my Googler friends.

One thing that I will not do is switch my Accept-Language to English. Whenever I can, I don’t just want to see the website correctly, but to try to help my neighbor: to see the possible problems that can affect other users who use a different language. Somebody has to break the Software Localization Paradox.

Web sight

Because of some not-so-interesting technical reasons I ended up on the mailing list for reporting bugs in Wikipedia’s mobile app (please see the disclaimer at the end).

Reading real Wikipedia readers’ reactions is fascinating.

A lot of the emails there are just empty. People just press the button to report a problem and don’t actually write anything at all.

Sometimes they are just slightly less than empty. For example, quite a lot of people write things like “When will you fix your stupid app already???!?!!”. This may seem pointless and unconstructive, but actually these people think that there is context to what they say, because they see complaints from other people at Google’s or Apple’s app store and they assume that the app’s maintainers are aware of them. Some people also threaten to give the app a low rating in the app store; it’s not really wrong, but it’s not very helpful either.

A lot of the emails are about connectivity problems in Android 2.2.2 and about screen rotation problems on iPad. The developers are aware of both issues and are working on them.

And a whole lot of reports suggest fixes in content, rather than technical problems. Some of them are pointless, for example “The facts on this web sight is wrong and i want it changed to the corrected statement”. It never occurred to that person that it would be helpful to say what information is wrong or what should be written there (it could also be a troll). And some people do make useful suggestions. For example, one person reported that Obama didn’t write “How the Grinch Stole Christmas”. The report was correct: somebody had indeed vandalized the article about the children’s book and written that its author is Obama. It was an easy fix, so I just fixed it myself and replied, thanking the person for the report and saying that in the future she can fix it herself by pressing the “edit” button.

If I see that fixing the problem will take more than a minute, I just reply with “you can fix it yourself”. This does make me think that a more robust way of telling people that they can fix the problems themselves is needed.


All these issues aside, there is something truly wonderful about this app: People write these emails in their language without caring at all about who will read them. Reporting a bug in Bugzilla is hard for many reasons, one of which is certainly the language. But the app gives the user a completely localized experience, so the users don’t think twice before sending a bug report in their language.

And this is a good thing. Some People from Some Companies told me explicitly that they give up on processing reports from too many people in too many languages; not Wikimedia. Wikimedia may acknowledge that it’s hard, Wikimedia won’t commit to replying to each email, but Wikimedia wouldn’t just shut it down and ignore it completely, either. We would rather think about more efficient ways to get volunteers to reply to people efficiently or to help people fix the issues themselves – that’s what the whole “wiki” idea is about in the first place.


(Important disclaimer: I am involved with this mailing list as a volunteer. It has nothing to do with the paid work that I do for the Wikimedia Foundation. I do not officially represent the Foundation in any actions that I take with regard to that mailing list.)

The Case for Localizing Names

I often help my friends and family members open email accounts. Sometimes they are starting to use the Internet and sometimes they move from old email services (Yahoo, Walla!, ISP) to something modern (like it or not, GMail).

At some point they have to fill in their name, which will appear in the “from” field. And then I have to suggest that they write it in Latin characters, even though most of them speak languages that aren’t written in Latin characters – mostly Hebrew and Russian. Chances are that some day they will send an email to somebody who cannot read Russian or Hebrew, and Latin is relatively better known.

Only relatively, though. It may seem obvious to you that everybody knows the Latin script, but in fact, a lot of people are not comfortable with it at all. There are also other complications: lossy and inconsistent transliteration rules (is Amir אמיר or עמיר?), potential right-to-left rendering problems, and more. And of course, all people are happy to see their name in their language.

And people are also happy to see their friends’ names in their own language and not in a foreign or a neutral language. I have, for example, a lot of friends in India. Most of them write their names in English, but some write it in Marathi or in Malayalam. It’s certainly good for them, but in practice it’s much harder for me to find them this way, so English would be better – but Hebrew or Russian would be better yet.

Finally, there are a lot of people in the world who have more than one linguistic background. Mine are Russian, Hebrew and English, and I am really not such a special case. There are many millions of immigrants who have mixed backgrounds: Punjabi-Hindi-Urdu-English, Kurdish-Turkish-German, Kazakh-Russian-Norwegian, and others, and others and others. From each of these backgrounds they have friends, co-workers and family members, with whom they would love to communicate in the respective language. In each of these backgrounds they have friends who would want to find them using the name under which they know them there and using the appropriate language and writing system.

And sometimes people change their names, too. I did once, and so have many other people.

All this means that people’s names should be translatable, just like books, articles and software interfaces. Facebook and Google+ allow me to add a very limited number of names in foreign languages. Why wouldn’t they let me write my name in four, five, ten languages? This would make it easier for people who speak these languages to find me and to communicate with me. I would go even further and allow people who speak languages that I don’t know well to write my name as they hear it in their language and to add it to my details. Yet again, this would make me easier to find for even more people.

Some degree of automation is possible. A lot of names are, after all, repetitive, so social networks would be able to suggest to people with common names how their name would be written in other languages.

Wikipedia is actually quite good in this regard: Usually people have the same username across projects, and this username is not necessarily written in Latin letters, but people can customize the appearance of their signature in each project. I did it in a few languages, and people who speak those languages appreciate it.

I can only hope that social networks and email systems will allow as much flexibility as possible with this.

English typing computer

I’m in an Internet cafe in Mumbai. I tried to install Firefox with the Marathi interface, but on the computers here fonts for languages of India are not installed. That’s right – on computers in India fonts for languages of India are not installed. Hence, installing Firefox in Marathi failed at the very first stage, because the fonts are needed for the installation wizard.

Actually, I’m not surprised that these fonts are not installed, because it’s not my first time in India. I know that it happens a lot in this country. I would install them, but I don’t have permission.

I find it incredibly weird – and tragic – that so many people in India don’t even try to use computers in any language except English. The one curious thing that I did find was an “English typing computer” shop. It’s just a place where you can use a computer to write Word documents in Hindi or Marathi, but using an English-based transliteration keyboard rather than the standard Indian Devanagari InScript keyboard, because they find transliteration keyboards easier. Of course, they could just install such a keyboard layout on their computers… but they prefer to go to an “English typing computer” shop.

We, software internationalization people, have so much more work to do.

Always define the language and the direction of your HTML documents, part 01

I received this email from Safari Books Online:

Email in English from Safari Books, oriented like Hebrew

The email is written in English, but notice how the text is aligned unusually to the right. Notice also that the punctuation marks appear at the wrong end of the sentence. I used Firefox developer tools to apply the correct direction, and saw it correctly:

The same email, with corrected left-to-right formatting using Firefox developer tools

This happens because I use GMail with the Hebrew interface. GMail has to guess the direction of the emails that I receive, because in plain text there’s no easy way to specify the direction (I hope to discuss it in a separate post soon). Usually GMail guesses correctly. Ironically, for HTML-formatted emails like this one, GMail often guesses incorrectly, even though in HTML, unlike in plain text, it’s quite easy to specify the direction by simply adding dir="ltr" to the root element of the email.

Unfortunately a lot of HTML authors don’t bother to specify explicit direction. Many are not even aware of this exotic dir attribute. Others think that because “ltr” is the default, they don’t have to specify it. They are wrong: As this email shows, the left-to-right HTML content is embedded in a right-to-left environment, and the “rtl” definition propagates to the embedded content.

You could blame GMail, of course, but it’s much more practical to always define the direction of your HTML content, even if it’s the default. You can never know where your content will end up.
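This habit can also be checked automatically. The following is only a sketch, using Python's standard html.parser module, of a check that reports whether the first element of a document or an email body lacks an explicit lang or dir attribute. The names RootAttrChecker and missing_root_attrs are hypothetical, and the check treats the first start tag it meets as the root element, which is a simplification.

```python
from html.parser import HTMLParser


class RootAttrChecker(HTMLParser):
    """Record the attributes of the first start tag in the document."""

    def __init__(self):
        super().__init__()
        self.root_tag = None
        self.root_attrs = {}

    def handle_starttag(self, tag, attrs):
        # Only the first element matters; later tags are ignored.
        if self.root_tag is None:
            self.root_tag = tag
            self.root_attrs = dict(attrs)


def missing_root_attrs(html_text):
    """Return which of 'lang' and 'dir' are absent or empty on the root."""
    checker = RootAttrChecker()
    checker.feed(html_text)
    return [name for name in ("lang", "dir")
            if not checker.root_attrs.get(name)]


print(missing_root_attrs("<html><body>Hello</body></html>"))
# prints: ['lang', 'dir']
print(missing_root_attrs('<html lang="en" dir="ltr"><body>Hello</body></html>'))
# prints: []
```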

P.S.: I read this post before publishing and suddenly realized that its style is quite similar to “Best Practices” books, such as Damian Conway’s classic “Perl Best Practices” – it tells you to do something that is not obviously needed, and explains why it is needed nevertheless. I like to acknowledge sources of inspiration. Thank you, Damian.


