Archive for the 'Wikipedia' Category



Not Just Western, Asian and Complex: The World Has More Than Three Languages

If you belong to the minority of people who only use their word processor to write documents in English, then you will hardly ever care about fonts for other languages. At most, you’ll want a different font for an emphasized word.

However, if you, like most people, write documents in other languages and scripts, you’ll usually need to choose different fonts for different languages. Some fonts include more than one script, but very few fonts include all the scripts.

Now, to specify non-Latin fonts you first need to enable support for this in your word processor, because developers of word processors assume that most people write in only one language:

LibreOffice language settings dialog with checkboxes to enable support for "Asian" and "CTL" languages

LibreOffice language settings dialog. Without getting into details, the corresponding box in Microsoft Word is similar.

After you’ve done this you’ll see a slightly different font selection dialog – now you can select the font for “Western text”, “Asian text” and “CTL text”:

LibreOffice character formatting dialog with font selection for "Western", "CTL" and "Asian" scripts.

LibreOffice character formatting dialog. Again, the corresponding dialog in Microsoft Word is similar.

This is wrong in every possible regard.

The simplest problem with this is that most people have no idea what “CTL” is. Microsoft Word calls this “Complex scripts”, and the C in CTL indeed stands for “Complex”, but most people are not supposed to know what “complex scripts” are either.

Furthermore, according to this weird division of the world’s languages, Hindi and Arabic are “complex”, but Japanese is “Asian”, even though Hindi and Arabic are also spoken in Asia. This is most probably a result of the ways Americans describe immigrants: The Chinese and the Japanese are “Asian Americans”, but Indians and Arabs are “Indian” and “Middle Eastern”.

This is preposterous. It has annoyed me badly ever since i first used Microsoft Word in 1997, but somehow i never bothered to complain. So here i am, finally complaining about this atrocity.


“Complex scripts” is a very old-fashioned term that survived from the time when more or less anything that wasn’t Latin was considered “complex”. More precisely, it was used for scripts that were not just rows of letters like Latin, Cyrillic and Greek, but required connected letters like Arabic, ligatures like most scripts of India and its neighbors, or right-to-left text, like Hebrew and, again, Arabic. According to this logic, Latin and Greek should be quite complex, too, since most languages written in these scripts require combinations of diacritics, like in the Lithuanian word “rūgščių̃”… but this never bothered the programmers of word processors.
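The point about diacritics can be checked with a few lines of Python. This is only a small sketch using the standard `unicodedata` module; the word is spelled with escape sequences, which is an assumption about its exact code points (precomposed letters plus one separate combining tilde).

```python
import unicodedata

# The Lithuanian word from the paragraph above, written with escapes so the
# code points are unambiguous: r, u-macron, g, s-caron, c-caron, i,
# u-ogonek, followed by a separate COMBINING TILDE.
word = "r\u016bg\u0161\u010di\u0173\u0303"


def combining_marks(text):
    """Return the Unicode names of all combining marks in the decomposed text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]


marks = combining_marks(word)

# Even the most "composed" form (NFC) cannot fold every mark into a single
# precomposed letter: the tilde on top of u-ogonek stays a separate character.
nfc = unicodedata.normalize("NFC", word)
leftover = [unicodedata.name(ch) for ch in nfc if unicodedata.combining(ch)]

print(len(marks), "combining marks after decomposition:", marks)
print("still separate after composition:", leftover)
```

So a seven-letter word in a supposedly simple script needs five diacritics, one of which has to be stacked on a letter at rendering time.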

So this term, “complex”, was used by programmers, and even that was hardly justified. It was never meant to be used by ordinary people. A person who writes Arabic is not supposed to know that his script is “complex”, because as far as he’s concerned it’s the simplest script there is. In fact, it’s quite insulting. And most of all, it’s hard to understand: When a person wants to select a font for Arabic text, the most logical thing to ask him is to specify an “Arabic font” – not a “complex font”.

But beyond the strange terminology there’s an even worse practical problem. Let’s say that i got used to the fact that Microsoft and LibreOffice call my script “complex”; but what if i have more than one “complex” language in my document? It’s not an edge case at all. Lately i’ve been reading, and making little edits to, a Word document that is a grammar textbook of the Malayalam language for Hebrew-speaking students. Hebrew and Malayalam are both “complex”, but they are complex for entirely different reasons, and they need different fonts. The author of that document told me that it drove her nuts. I completely understand what she was talking about: she’s just one among millions of people who suffer from this… but for some reason not one of them complains.

The relatively convenient way to solve this problem with the current software is to use separate character styles for different “complex” languages, but most people have no idea what “character styles” are, and even for those who do, this solution is very inefficient.

So how should font selection dialogs really be done? They should treat each combination of language and script separately. This is a bit tricky, but only a bit.

The best place to start solving this would be to look at existing standards: ISO 15924, ISO 639 and the IANA Language Subtag Registry. ISO 15924 lists well over a hundred scripts; ISO 639 lists several thousand languages; the IANA Language Subtag Registry, together with the IETF’s BCP 47, defines the rules for specifying combinations of languages, scripts and their varieties. Combinations are important, because it’s not enough to specify a “Latin” font or a “Serbian language”: Serbian can be written in Latin and Cyrillic; Azeri can be written in Latin, Cyrillic and Arabic, in which case its direction changes, too; and so on.
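To make the idea of combinations concrete, here is a minimal sketch of reading tags such as `sr-Latn` or `az-Arab`. It is deliberately simplified: it handles only the language, script and region subtags and does not validate them against the real registry. The names `parse_tag`, `LanguageTag` and `RTL_SCRIPTS` are made up for this illustration, and the right-to-left table lists just two scripts.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits). Real tags allow more subtags.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

# Illustrative subset of right-to-left scripts (ISO 15924 codes).
RTL_SCRIPTS = {"Arab", "Hebr"}


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]

    @property
    def direction(self) -> Optional[str]:
        """Return 'rtl' or 'ltr' when the script is given, otherwise None."""
        if self.script is None:
            return None
        return "rtl" if self.script in RTL_SCRIPTS else "ltr"


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag into its parts, using the conventional letter case."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError("unsupported tag: %r" % tag)
    script = match.group("script")
    region = match.group("region")
    return LanguageTag(
        language=match.group("language").lower(),
        script=script.capitalize() if script else None,
        region=region.upper() if region else None,
    )


for example in ("sr-Latn", "sr-Cyrl", "az-Latn", "az-Arab"):
    parsed = parse_tag(example)
    print(example, parsed.language, parsed.script, parsed.direction)
```

The same language code shows up with different scripts, and only the script part tells the software which font and which direction it needs.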

This doesn’t mean at all that the font selection dialogs have to list thousands of combinations of languages and scripts. By default they should list a few languages that a user is expected to use, for example by looking at which keyboard layouts the user has enabled in his operating system. And the user must be able to add more languages, by using some kind of an “Add” or “+” button: “I want to write Malayalam in this document; sometimes i want to do this in the Malayalam script in the Meera font, and sometimes i want to write it in IPA, which is a kind of a Latin script, and then i want to do it in the Charis font.” In this scenario two lines would have to be added to the dialog using that add button.
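The data behind such a dialog could be as simple as a table keyed by language and script. The following is a rough sketch and not a description of how LibreOffice or Word store fonts: the class `FontPreferences` and the two default font names are hypothetical placeholders, while Meera and Charis are the fonts from the scenario above.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (language code, script code), e.g. ("ml", "Mlym")


class FontPreferences:
    """One font per language-and-script combination, as in the proposed dialog."""

    def __init__(self, defaults: Dict[Key, str]) -> None:
        # Defaults might come from the user's enabled keyboard layouts.
        self._fonts = dict(defaults)

    def add(self, language: str, script: str, font: str) -> None:
        """What the 'Add' or '+' button would do: add or replace one row."""
        self._fonts[(language, script)] = font

    def font_for(self, language: str, script: str) -> Optional[str]:
        """Return the chosen font, or None if this combination has no row."""
        return self._fonts.get((language, script))

    def rows(self) -> List[Tuple[Key, str]]:
        """Rows to show in the dialog, in a stable order."""
        return sorted(self._fonts.items())


# Placeholder defaults for a user with English and Hebrew keyboard layouts.
prefs = FontPreferences({
    ("en", "Latn"): "Example Latin Font",
    ("he", "Hebr"): "Example Hebrew Font",
})

# The two rows from the scenario: Malayalam in its own script and in IPA.
prefs.add("ml", "Mlym", "Meera")
prefs.add("ml", "Latn", "Charis")

for (language, script), font in prefs.rows():
    print(language, script, font)
```

Hebrew and Malayalam each get their own row here, so two “complex” scripts in one document no longer have to share a single font setting.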

There may be more clever ways to solve this problem, but at this stage my proposal is certainly better than grouping the world’s languages into three arbitrary and outdated groups.


Now where does Wikimedia come in? Wikimedia projects, the most popular of which is Wikipedia, are massively multilingual. That’s why the Wikimedia Foundation has always taken internationalization seriously and recently created a whole team dedicated to it, a team of which i am proud to be a member. One of the most important and urgent things that this team does is adding web fonts support to our websites, so that people won’t see squares or question marks when they see a word in a language for which they don’t have a font on their computer.

The intention is to do it with orientation to languages and scripts, as described above. Even though a lot of people edit Wikipedia, it is still a website that is mostly read and not written by its visitors, so the fonts that will be used will be mostly decided by the programmers, that is, by our team. Word processors, however, are mostly used by people for writing, so they should combine language and script selection with manual font selection. Of course, providing good defaults would be a good idea.

Now all that’s left is for some LibreOffice developer to pick up the bug i opened about it and fix it, thus making LibreOffice far more friendly to the world than Microsoft Word is. After all, there are many more people who don’t speak English than those who do.


Three things made me write this post: the work of my team in Wikimedia on WebFonts, and especially the work of Santhosh Thottingal; my Malayalam classes with Ophira Gamliel; and Lior Kaplan’s and Caolán McNamara’s questions about the font selection dialog in LibreOffice. Thank you, Santhosh, Ophira, Lior and Caolán for making me finally write this post, which i wanted to write for about fourteen years.

Differences Between Things

The search box in Wikipedia suggests auto-completion when you start typing. For example, if you type “je” in the English Wikipedia search box, you’ll get the suggestions “Jews”, “Jewish”, “Jerusalem”, “Jesus”. (Jews kick ass!)

Jews Kick Ass. Henry Winkler, Albert Einstein, Sammy Davis Jr., Jesus, William Shatner, Bob Dylan

If you search for “differences between”, you’ll get this list:

auto-suggestions at Wikipedia for "differences between"

The top spot belongs to “Differences between editions of Dungeons & Dragons” and that shouldn’t be surprising: the article “List of Advanced Dungeons & Dragons 2nd edition monsters” only recently lost its first place in the list of the longest English Wikipedia articles by number of bytes to “2011 ITF Men’s Circuit” (it’s something in tennis).

Out of ten suggestions, six are related to languages. American and British English are considered one language, but everybody admits that it has many variations by pronunciation, spelling, vocabulary and many other parameters, and lots of people love to bicker about the spelling of “meter” and “aluminum”. Bosnian, Croatian and Serbian are one language that has different names for reasons that are more political than linguistic. Something similar can probably be said about Malaysian and Indonesian, Norwegian Bokmål and Standard Danish and Scottish Gaelic and Irish, but i know very little about these pairs.

Spanish and Portuguese are related, but definitely separate and mostly mutually unintelligible languages. It’s been said that it is easier for Portuguese speakers to understand Spanish speakers than the other way around, which is interesting, but it doesn’t really justify an encyclopedic article, as in the other cases. In fact, i am somewhat surprised that “Differences between Brazilian and European Portuguese dialects” is not in the list, given the huge number of arguments about it in the Portuguese – sorry, Lusophone – Wikipedia.

“Butterflies and moths” is probably the most serious article in this list, but that’s probably because i’m not a biologist.

And the last two articles are about movies (James Bond – movies vs. novels) and religion (Codex Sinaiticus vs. Vaticanus), which is also very Wikipedia, the encyclopedia about which someone said that it has more stamp collectors than good writers. (Citation needed; I can’t find the original quote.)

The Software Localization Paradox

Wikimania in Haifa was great. Plenty of people wrote blog posts about it; the world doesn’t need yet another post about how great it was.

What the world does need is more blog posts about the great ideas that grew in the little hallway conversations there. One of the things that i discussed with many people at Wikimania is what i call The Software Localization Paradox. That’s an idea that has been bothering me for about a year. I tried to look for other people who wrote about it online and couldn’t find anything.

Like any other translation, software localization is best done by people who know well both the original language in which the software interface was written, usually English, and the target language. People who don’t know English strongly prefer to use software in a language they know. If the software is not available in their language, they will either not use it at all or will have to memorize lots of otherwise meaningless English strings and locations of buttons. People who do know English often prefer to use software in English even if it is available in their native language. The two most frequent explanations for that are that the translation is bad and that people who want to use computers should learn English anyway. The problem is that for various reasons lots of people will never learn English, even if it were mandatory in schools and useful for business. They will have to suffer the bad translations and will have no way to fix them.

I’ve been talking to people at Wikimania about this, especially people from India. (I also spoke to people from Thailand, Russia, Greece and other countries, but Indians were the biggest group.) All of them knew English and at least one language of India. The larger group of Indian Wikipedians to whom i spoke preferred English for most communication, especially online, even if they had computers and mobile phones that supported Indian languages; some of them even preferred to speak English at home with their families. They also preferred reading and writing articles in the English Wikipedia. The second, smaller, group preferred the local language. Most of these people also happened to be working on localizing software, such as MediaWiki and Firefox.

So this is the paradox – to fix localization bugs, someone must notice them, and to notice them, more people who know English must use localized software, but people who know English rarely use localized software. That’s why lately i’ve been evangelizing about it. Even people who know English well should use software in their language – not to boost their national pride, but to help the people who speak that language and don’t know English. They should use the software especially if it’s translated badly, because they are the only ones who can report bugs in the translation or fix the bugs themselves.

(A side note: Needless to say, Free Software is much more convenient for localization, because proprietary software companies are usually too hard to even approach about this matter; they only pay translators if they have a reason to believe that it will increase sales. This is another often overlooked advantage of Free Software.)

I am glad to say that i convinced most people to whom i spoke about it at Wikimania to at least try to use Firefox in their native language and taught them where to report bugs about it. I also challenged them to write at least one article in the Wikipedia in their own language, such as Hindi, Telugu or Kannada – as useful as the English Wikipedia is to the world, Telugu Wikipedia is much more useful for people who speak Telugu, but no English. I already saw some results.

I am now looking for ideas and verifiable data to develop this concept further. What are the best strategies to convince people that they should use localized software? For example: How economically viable is software localization? What is cheaper for an education department of a country – to translate software for schools or to teach all the students English? Or: How does the absence of localized software affect different geographical areas in Africa, India, the Middle East?

Any ideas about this are very welcome.

Type O Negative, part 2

Since my previous and very negative post about Google+ i have played with it a little more. Apparently, a lot of my misunderstanding was related to actual bugs in its interface – for example, people that i’m not supposed to follow appear in my stream. I guess that it’s understandable, given that the service is so young.

I do have something very nice to say about it – it has an excellent interface for reporting bugs. You simply click the problematic area on the screen, write a description and submit the report. It is very buggy on Firefox, but i can understand that, too, hoping that they will fix it. It does work well in Google Chrome, but i can’t really use it, because Chrome’s right-to-left editing support is very bad. The sad thing is that after the report is submitted i don’t have a way to know what happens to it. Public bug tracking is one of the most common, most appealing, and most overlooked features of Free Software. However, reporting bugs in Free Software projects is a relatively hard process – the interface of bug tracking software such as Bugzilla is intimidating and lots of people don’t even know that they can use it.

I hope that Free Software web frameworks such as MediaWiki (Wikipedia’s engine), WordPress and Drupal will adopt a similar model for reporting bugs and combine it with the already excellent concept of public bug tracking. If that were Google+’s contribution to the web, it would be enough to say that it doesn’t suck.

Palestinian geeks and RTL bugs

In the last few months i opened a bunch of MediaWiki bugs related to writing from right to left. If you click on the non-stricken-out numbers there, you’ll see my name on a few pages. Unfortunately i’m not yet much of a MediaWiki developer, but i’m quietly learning it at home.

This flood of right-to-left bugs was noticed. Mark Hershberger, Wikimedia’s bugmeister, wrote a blog post inviting developers who know RTL languages to fix the bugs. In the recent MediaWiki Hackathon 2011 in Berlin, which i attended as a member of the MediaWiki Language committee, i had the pleasure to meet Mark and many other MediaWiki developers in person – they taught me MediaWiki hacking tricks and i taught them the basics of RTL language handling in computers.

MediaWiki Hackathon 2011 participants, Berlin. Photo: Tobias Schumann, CC-BY-SA-3.0-DE.

After the hackathon Mark’s blog post was made available for translation in translatewiki.net, the software localization hub for MediaWiki, Wikipedia-related projects and other Free Software. It makes sense to translate it, especially to RTL languages. I translated it to Hebrew. It was also translated to Macedonian and Bulgarian; to Bosnian and two types of Serbian; to French, Danish and German; to Latin, Albanian, Dutch, Chinese and Japanese.

Do you notice any right-to-left languages except Hebrew here? No, me neither. After i poked a few people, parts of it were translated to Persian, Urdu and Khowar, a language of Pakistan. And not a single line of it was translated into Arabic yet.

And i just don’t get it. It is a fact that there are Arab Free Software hackers on both sides of the Jordan, as well as in Egypt, Saudi Arabia, Syria and other countries. Judging by the tweets with the #palgeeks hashtag on Twitter, there are more startups in Ramallah than in Herzliya. There are Arab Wikipedia editors in Israel and the West Bank, not to mention the rest of the Arab world. There are a lot of translations of software messages into Arabic on the same website, translatewiki.net. But not of this blog post, which could bring more fixes to RTL bugs, which would in turn benefit all the people writing and reading in the Arabic alphabet – that’s hundreds of millions of people.

You could say: Why bother translating it from English into Arabic? After all, someone who has the skill to fix bugs in PHP code, probably knows English. But the fact is that translating it into Hebrew was worth the few minutes i put into it, because it caused the Israeli MediaWiki developer Rotem Liss to fix one RTL bug. (Thank you, Rotem.) Just think what it may do if it is translated to Arabic, which is spoken by many, many more people.

So, dear #palgeeks and Arabic-speaking geeks in other countries! If any of you are reading this, please invest a few minutes to do the following:

  1. Go to translatewiki.net.
  2. If you don’t have an account: Create one by clicking “ادخل / أنشئ حسابا” or “Log in / create account” at the top. Then follow the instructions on the screen to request Translator permission.
  3. Go to Mark Hershberger’s post translation page.
  4. Start translating into Arabic.
  5. Copy the result to your own blog, publish it on Twitter, invite other Arab hackers to fix RTL bugs in MediaWiki.

Oh, and you are also cordially invited to Wikimania in Haifa and to the Hackathon that will take place for two days before it, starting on the 2nd of August. It’s not about politics; it’s about improving Wikipedia’s support for your language. And you’ll also get to meet Wikipedians from all around the world, which is even more fun in real life than it sounds. Really. (If you need assistance with getting into Israel, please contact me privately.)

Wikinails

a shop in Barcelona - WiKi NAILS: ungles i estètica (Catalan for "nails and aesthetics")

WiKiNails, Barcelona

People Speaking – Check

A gay Wikipedia editor is filling out a survey:

“‘My sexual orientation is different from the majority of editors who edit Wikipedia’ – I was about to check this, but then I decided that I’m not sure that that’s really the situation.”

Do you edit Wikipedia? Thank the person who welcomed you

The Board of the Wikimedia Foundation published a Resolution on Openness. In short, quantitative studies show that new editors are joining Wikipedia and related projects more slowly than they used to, and the Board decided that this is the most important challenge that the Foundation must deal with in the near future.

One of the things that the Foundation is doing is to appeal to the community and ask it to be more open towards new editors. I agree with this and pass this message on: Please, if you are one of the veteran editors of Wikipedia, remind yourself every once in a while not to bite the newcomers. Don’t just coldly tell them that they’re wrong, delete their contribution or block them. Maybe their contribution should be deleted, because it’s really bad, but please bother to explain it to them and don’t just send them a template message. Read chapter 31 of “Catch-22” to get an idea of the damage that template messages do. Bite a newcomer and he will never come back. This newcomer may be an elementary school kid who has nothing better to do than adding bad jokes to Wikipedia, but it may also be a university professor who has knowledge about topics that nobody else knows. If you scare off that professor, he won’t come back and Wikipedia will not have any information about these topics for a long time, and possibly forever.

Just remember that the Wikimedia community is supposed to be easy to penetrate, not hard. Some other communities are even harder to penetrate, but it’s their loss. That’s one thing we don’t want to be. That’s the meaning of wiki.

And another thing. Remember that “welcome” message you received after you made your first edit in Wikipedia? Send a thank-you note to the user who sent it to you. Even if you already thanked that user in the past and even if that user retired from Wikipedia. Even if instead of a welcome message that user sent you a copyright violation notice – that happened to me and i am nevertheless thankful to that user, simply because he was polite about it. Send that user a thank-you note, now. Tell him about your achievements since then; tell him what was good about that welcome message; if he retired since then, tell him that you hope that he will come back. It will mean a lot to that user and it will mean a lot to you – it will remind you that that welcome message was more than just a template. It was the thing that made you part of the biggest community of people in history – the Wikimedia family.

Mobile Phones Suck

All mobile phones suck.

Mobile phones that need non-standard chargers suck.

Mobile phones with boot-up time of more than 10 seconds suck.

Mobile phones with touch screens that use the numeric keypad to enter text suck.

Mobile phones with touch screens in which it is hard in any way to use the numeric keypad for interactive voice response suck.

Mobile phones in which it is hard to change the volume of the speaker or of the ringer suck.

Mobile phones in which it is impossible to copy and paste text from anywhere to anywhere suck.

Mobile phones the software of which cannot be updated suck.

Mobile phones on which i cannot install my own fonts suck.

Mobile phones that need special software to be installed on a computer in order to get the ability to copy files to and from them suck.

Mobile phones that can be synchronized only with particular contact management software suck.

Mobile phones that don’t completely support reading and writing in any language in which it is possible to write in a modern GNU/Linux desktop computer suck.

Mobile phones that claim to be able to browse the Internet, but can’t be used to view a Wikipedia page without complaining about full memory suck.

Mobile phones that claim to be able to browse the Internet, but can’t be used to edit a Wikipedia page suck.

Mobile phones that claim to be able to play music, but cannot sort numbered album tracks suck.

Mobile phones that claim to be able to play music, but cannot play OGG or FLAC files suck.

Mobile phones that claim to be able to play music, but cannot display track names in any language suck.

Mobile phones that are hard to switch to vibration mode suck.

Finally, mobile phones that have any non-free software on them suck.

Translating Wikipedia Interface Into Amharic

There is a Wikipedia in the Amharic language, but it is developing slowly. One of the reasons for this is that the interface of MediaWiki, the software that runs the Wikipedia website, is translated into Amharic only partially, so people who don’t know English can hardly use the website. Completing the translation of the interface will make the Amharic Wikipedia much more accessible to people who don’t know English. This is relevant not only to people who read and write Wikipedia online, but also to those who don’t have Internet access, because the Wikimedia Foundation and other organizations distribute offline copies of Wikipedia on CD-ROMs, printed books and other media.

Translation of Wikipedia’s interface is done by volunteers at the website translatewiki.net. I know this website well and i am willing to invest my time and teach any Amharic speaker who can translate software messages from English or from Hebrew. Practically no experience is needed – anyone who can use a web browser, can do this, too, and i shall provide all the needed support, anywhere in Israel. Do you know anyone who would be able to do this? This can be a great chance to improve one’s skills in computer use, in Amharic and in English and to help millions of Amharic speakers get access to one of the most important educational websites on the web.

If you know anyone who can help with it, please let me know.


