


Writing, part 0 – Optimus Popularis keyboard

In 2008 the Russian design firm Art.Lebedev Studio released a groundbreaking product: The Optimus Maximus keyboard. It’s a keyboard in which every key is a display that changes according to its function – for example, it shows “QWERTY” if the Shift key is pressed and “qwerty” otherwise. Of course, it also shows completely different letters if a different language is selected, for example “ןוטארק” for Hebrew.

The Optimus was quite hot in the gadget lovers’ circles, which is rather strange, because gadget lovers think that they are too cool for any languages except English, and i can hardly imagine this keyboard being really useful to anybody except linguists. Unfortunately, it costs over $2,000, and that’s the kind of money that linguists usually don’t have.

I thought that since tablet computers with on-screen keyboards are the hottest thing in the world now, there wouldn’t be a new version of the Optimus keyboard. Apparently, i was wrong: Art.Lebedev is taking pre-orders for the “Optimus Popularis”, a cheaper version of “Optimus Maximus”. It is cheaper, because it has fewer display keys.

Relatively cheaper – it still costs over a thousand dollars, for which one could buy two good tablets or three netbooks. But what’s much worse is that it doesn’t have PageUp, PageDown, Home and End keys.

Quite possibly the designers of Optimus Popularis conducted research and found that few people actually use these keys. Quite possibly it’s even true. But i use them all the time, and will be absolutely unable to use a keyboard that doesn’t have them for more than a minute.

That’s because these keys are an essential part of my writing experience. I need quick ways to go back and forth, to the beginning and to the end of the document and of the line. Otherwise i am unable to write.

So no, i don’t want an Optimus Popularis keyboard, even for much less money. I just won’t use it. I can hardly imagine anyone who will use it seriously.


This was the first in a series of posts about writing in computers.

Not Just Western, Asian and Complex: The World Has More Than Three Languages

If you belong to the minority of people who only use their word processor to write documents in English, then you will hardly ever care about fonts for other languages. At most, you’ll want a different font for an emphasized word.

However, if you, like most people, write documents in other languages and scripts, you’ll usually need to choose different fonts for different languages. Some fonts include more than one script, but very few fonts include all the scripts.

Now, to specify non-Latin fonts you first need to enable support for this in your word processor, because developers of word processors assume that most people write in only one language:

LibreOffice language settings dialog with checkboxes to enable support for "Asian" and "CTL" languages

LibreOffice language settings dialog. Without getting into details, the corresponding box in Microsoft Word is similar.

After you’ve done this you’ll see a slightly different font selection dialog – now you can select the font for “Western text”, “Asian text” and “CTL text”:

LibreOffice character formatting dialog with font selection for "Western", "CTL" and "Asian" scripts.

LibreOffice character formatting dialog. Again, the corresponding dialog in Microsoft Word is similar.

This is wrong in every possible regard.

The simplest problem with this is that most people have no idea what “CTL” is. Microsoft Word calls this “Complex scripts”, and the C in CTL indeed stands for “Complex”, but most people are not supposed to know what “complex scripts” are either.

Furthermore, according to this weird division of the world’s languages, Hindi and Arabic are “complex”, but Japanese is “Asian”, even though Hindi and Arabic are also spoken in Asia. This is most probably a result of the ways Americans describe immigrants: The Chinese and the Japanese are “Asian Americans”, but Indians and Arabs are “Indian” and “Middle Eastern”.

This is preposterous. It pestered me really badly ever since i used Microsoft Word for the first time in 1997, but somehow i never bothered to complain. So here i am, finally complaining about this atrocity.


“Complex scripts” is a very old-fashioned term that survived from the time when more or less anything that wasn’t Latin was considered “complex”. More precisely, it was used for scripts that were not just rows of letters like Latin, Cyrillic and Greek, but required connected letters like Arabic, ligatures like most scripts of India and its neighbors, or right-to-left text, like Hebrew and, again, Arabic. According to this logic, Latin and Greek should be quite complex, too, since most languages written in these scripts require combinations of diacritics, like in the Lithuanian word “rūgščių̃”… but this never bothered the programmers of word processors.

So this term, “complex”, was used by programmers, and even that was hardly justified. It was never meant to be used by ordinary people. A person who writes Arabic is not supposed to know that his script is “complex”, because as far as he’s concerned it’s the simplest script there is. In fact, it’s quite insulting. And most of all, it’s hard to understand: When a person wants to select a font for Arabic text, the most logical thing to ask him is to specify an “Arabic font” – not a “complex font”.

But beyond the strange terminology there’s an even worse practical problem. Let’s say that i got used to the fact that Microsoft and LibreOffice call my script “complex”; but what if i have more than one “complex” language in my document? It’s not an edge case at all. Lately i’ve been reading–and making little edits to–a Word document, which is a grammar textbook of the Malayalam language for Hebrew-speaking students. Hebrew and Malayalam are both “complex”, but they are complex for entirely different reasons, and they need different fonts. The author of that document told me that it drove her nuts. I completely understand what she was talking about–she’s just one among millions of people who suffer from this… but for some reason not one of them complains.

The relatively convenient way to solve this problem with the current software is to use separate character styles for different “complex” languages, but most people don’t know at all what “character styles” are, and even for those who do, this solution would be very inefficient.

So how should font selection dialogs really be done? They should treat each combination of language and script separately. This is a bit tricky, but only a bit.

The best place to start solving this would be to look at existing standards: ISO 15924, ISO 639 and the IANA Language Subtag Registry. ISO 15924 lists dozens of scripts; ISO 639 lists a few thousand languages; the IANA Language Subtag Registry defines the rules for specifying combinations of languages, scripts and their varieties. Combinations are important, because it’s not enough to specify a “Latin” font or a “Serbian language”: Serbian can be written in Latin and Cyrillic; Azeri can be written in Latin, Cyrillic and Arabic–in which case its direction changes, too; and so on.
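
To make the idea of combinations concrete, here is a minimal sketch in Python. It assumes tags of the simple form “language-Script”, such as sr-Latn or az-Arab, built from ISO 639 language codes and ISO 15924 script codes. The function names and the short list of right-to-left scripts are illustrative assumptions, not part of any standard or of any word processor.

```python
import re

# Scripts written right to left. A small illustrative subset only,
# not the full list of right-to-left scripts in ISO 15924.
RTL_SCRIPTS = {"Arab", "Hebr"}

# A lowercase 2-3 letter language code, optionally followed by a
# four-letter script code in title case, e.g. "sr-Cyrl" or "ml".
TAG_PATTERN = re.compile(r"^(?P<language>[a-z]{2,3})(?:-(?P<script>[A-Z][a-z]{3}))?$")


def parse_tag(tag):
    """Split a simple 'language-Script' tag into (language, script).

    The script part is None when the tag names only a language.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    return match.group("language"), match.group("script")


def direction(tag):
    """Return 'rtl' or 'ltr' for a tag that names its script explicitly."""
    language, script = parse_tag(tag)
    if script is None:
        # Without a script we cannot tell: Azeri alone could go either way.
        raise ValueError(f"tag {tag!r} does not specify a script")
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With this, sr-Latn and sr-Cyrl come out as the same language in two scripts, and az-Arab is reported as right-to-left while az-Latn and az-Cyrl are not. Real language tags allow more subtags (regions, variants) and are case-insensitive, which this sketch deliberately ignores.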

This doesn’t mean at all that the font selection dialogs have to list thousands of combinations of languages and scripts. By default they should list a few languages that a user is expected to use, for example by looking at which keyboard layouts the user has enabled in his operating system. And the user must be able to add more languages, by using some kind of an “Add” or “+” button: “I want to write Malayalam in this document; sometimes i want to do this in the Malayalam script in the Meera font, and sometimes i want to write it in IPA, which is a kind of a Latin script and then i want to do it in the Charis font.” In this scenario two lines would have to be added to the dialog using that add button.
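
The Malayalam scenario could be modeled roughly like this. It is only a sketch in Python: the class and method names are hypothetical, the fallback font “DejaVu Sans” is an arbitrary choice, and no existing word processor is claimed to work this way.

```python
from dataclasses import dataclass, field


@dataclass
class FontPreferences:
    """Per-document font choices, keyed by (language, script)."""

    default_font: str
    fonts: dict = field(default_factory=dict)

    def add(self, language, script, font):
        # What the "Add" or "+" button would do: one more line in the dialog.
        self.fonts[(language, script)] = font

    def font_for(self, language, script):
        # Fall back to the default when the user has not chosen a font
        # for this particular combination.
        return self.fonts.get((language, script), self.default_font)


# The fallback font name is an assumption made for this example.
prefs = FontPreferences(default_font="DejaVu Sans")
prefs.add("ml", "Mlym", "Meera")   # Malayalam in the Malayalam script
prefs.add("ml", "Latn", "Charis")  # Malayalam transcribed in IPA (Latin script)
```

Here `prefs.font_for("ml", "Mlym")` gives “Meera” and `prefs.font_for("ml", "Latn")` gives “Charis”, while a combination that was never added, such as Hebrew in the Hebrew script, falls back to the default. The point of keying on the pair is that two “complex” scripts in one document no longer compete for a single font slot.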

There may be more clever ways to solve this problem, but at this stage my proposal is certainly better than grouping the world’s languages into three arbitrary and outdated groups.


Now where does Wikimedia come in? Wikimedia projects, the most popular of which is Wikipedia, are massively multilingual. That’s why the Wikimedia Foundation always took internationalization seriously and recently created a whole team dedicated to it–a team of which i am proud to be a member. One of the most important and urgent things that this team does is adding web fonts support to our websites, so that people won’t see squares or question marks when they come across a word in a language for which they don’t have a font on their computer.

The intention is to do it with orientation to languages and scripts, as described above. Even though a lot of people edit Wikipedia, it is still a website that is mostly read and not written by its visitors, so the fonts that will be used will be mostly decided by the programmers–that is, by our team. Word processors, however, are mostly used by people for writing, so they should combine language and script selection with manual font selection. Of course, providing good defaults would be a good idea.

Now all that’s left is for some LibreOffice developer to pick up the bug i opened about it and fix it, thus making LibreOffice far more friendly to the world than Microsoft Word is. After all, there are many more people who don’t speak English than those who do.


Three things made me write this post: The work of my team in Wikimedia on WebFonts and especially the work of Santhosh Thottingal; My Malayalam classes with Ophira Gamliel; and Lior Kaplan‘s and Caolán McNamara‘s questions about the font selection dialog in LibreOffice. Thank you, Santhosh, Ophira, Lior and Caolán for making me finally write this post, which i wanted to write for about fourteen years.

Differences Between Things

The search box in Wikipedia suggests auto-completion when you start typing. For example, if you type “je” in the English Wikipedia search box, you’ll get the suggestions “Jews”, “Jewish”, “Jerusalem”, “Jesus”. (Jews kick ass!)

Jews Kick Ass. Henry Winkler, Albert Einstein, Sammy Davis Jr., Jesus, William Shatner, Bob Dylan

If you search for “differences between”, you’ll get this list:

auto-suggestions at Wikipedia for "differences between"

The top spot belongs to “Differences between editions of Dungeons & Dragons” and that shouldn’t be surprising: the article “List of Advanced Dungeons & Dragons 2nd edition monsters” only recently lost its first place in the list of the longest English Wikipedia articles by number of bytes to “2011 ITF Men’s Circuit” (it’s something in tennis).

Out of ten suggestions, six are related to languages. American and British English are considered one language, but everybody admits that it has many variations by pronunciation, spelling, vocabulary and many other parameters, and lots of people love to bicker about the spelling of “meter” and “aluminum”. Bosnian, Croatian and Serbian are one language that has different names for reasons that are more political than linguistic. Something similar can probably be said about Malaysian and Indonesian, Norwegian Bokmål and Standard Danish, and Scottish Gaelic and Irish, but i know very little about these pairs.

Spanish and Portuguese are related, but definitely separate and mostly mutually unintelligible languages. It’s been said that it is easier for Portuguese speakers to understand Spanish speakers than the other way around, which is interesting, but it doesn’t really justify an encyclopedic article, as in the other cases. In fact, i am somewhat surprised that “Differences between Brazilian and European Portuguese dialects” is not in the list, given the huge number of arguments about it in the Portuguese – sorry, Lusophone – Wikipedia.

“Butterflies and moths” is probably the most serious article in this list, but that’s probably because i’m not a biologist.

And the last two articles are about movies (James Bond – movies vs. novels) and religion (Codex Sinaiticus vs. Vaticanus), which is also very Wikipedia, the encyclopedia about which someone said that it has more stamp collectors than good writers. (Citation needed; I can’t find the original quote.)

Palestinian geeks and RTL bugs

In the last few months i opened a bunch of MediaWiki bugs related to writing from right to left. If you click on the non-stricken-out numbers there, you’ll see my name on a few pages. Unfortunately i’m not yet much of a MediaWiki developer, but i’m quietly learning it at home.

This flood of right-to-left bugs was noticed. Mark Hershberger, Wikimedia’s bugmeister, wrote a blog post inviting developers who know RTL languages to fix the bugs. In the recent MediaWiki Hackathon 2011 in Berlin, which i attended as a member of the MediaWiki Language committee, i had the pleasure to meet Mark and many other MediaWiki developers in person – they taught me MediaWiki hacking tricks and i taught them the basics of RTL language handling in computers.

MediaWiki Hackathon 2011 participants, Berlin. Photo: Tobias Schumann, CC-BY-SA-3.0-DE.

After the hackathon Mark’s blog post was made available for translation in translatewiki.net, the software localization hub for MediaWiki, Wikipedia-related projects and other Free Software. It makes sense to translate it, especially to RTL languages. I translated it to Hebrew. It was also translated to Macedonian and Bulgarian; to Bosnian and two types of Serbian; to French, Danish and German; to Latin, Albanian, Dutch, Chinese and Japanese.

Do you notice any right-to-left languages except Hebrew here? No, me neither. After i poked a few people, parts of it were translated to Persian, Urdu and Khowar, a language of Pakistan. And not a single line of it was translated into Arabic yet.

And i just don’t get it. It is a fact that there are Arab Free Software hackers on both sides of the Jordan, as well as in Egypt, Saudi Arabia, Syria and other countries. Judging by the tweets with the #palgeeks hashtag on Twitter, there are more startups in Ramallah than in Herzliya. There are Arab Wikipedia editors in Israel and the West Bank, not to mention the rest of the Arab world. There are a lot of translations of software messages into Arabic on the same website, translatewiki.net. But not of this blog post, which could bring more fixes to RTL bugs, which would in turn benefit all the people writing and reading in the Arabic alphabet – that’s hundreds of millions of people.

You could say: Why bother translating it from English into Arabic? After all, someone who has the skill to fix bugs in PHP code probably knows English. But the fact is that translating it into Hebrew was worth the few minutes i put into it, because it caused the Israeli MediaWiki developer Rotem Liss to fix one RTL bug. (Thank you, Rotem.) Just think what it may do if it is translated into Arabic, which is spoken by many, many more people.

So, dear #palgeeks and Arabic-speaking geeks in other countries! If any of you are reading this, please invest a few minutes to do the following:

  1. Go to translatewiki.net.
  2. If you don’t have an account: Create one by clicking “ادخل / أنشئ حسابا” or “Log in / create account” at the top. Then follow the instructions on the screen to request Translator permission.
  3. Go to Mark Hershberger’s post translation page.
  4. Start translating into Arabic.
  5. Copy the result to your own blog, publish it on Twitter, invite other Arab hackers to fix RTL bugs in MediaWiki.

Oh, and you are also cordially invited to Wikimania in Haifa and to the Hackathon that will take place for two days before it, starting on the 2nd of August. It’s not about politics; it’s about improving Wikipedia’s support for your language. And you’ll also get to meet Wikipedians from all around the world, which is even more fun in real life than it sounds. Really. (If you need assistance with getting into Israel, please contact me privately.)

Mobile Phones Suck

All mobile phones suck.

Mobile phones that need non-standard chargers suck.

Mobile phones with boot-up time of more than 10 seconds suck.

Mobile phones with touch screens that use the numeric keypad to enter text suck.

Mobile phones with touch screens in which it is hard in any way to use the numeric keypad for interactive voice response suck.

Mobile phones in which it is hard to change the volume of the speaker or of the ringer suck.

Mobile phones in which it is impossible to copy and paste text from anywhere to anywhere suck.

Mobile phones the software of which cannot be updated suck.

Mobile phones on which i cannot install my own fonts suck.

Mobile phones that need special software to be installed on a computer in order to get the ability to copy files to and from them suck.

Mobile phones that can be synchronized only with particular contact management software suck.

Mobile phones that don’t completely support reading and writing in any language in which it is possible to write in a modern GNU/Linux desktop computer suck.

Mobile phones that claim to be able to browse the Internet, but can’t be used to view a Wikipedia page without complaining about full memory suck.

Mobile phones that claim to be able to browse the Internet, but can’t be used to edit a Wikipedia page suck.

Mobile phones that claim to be able to play music, but cannot sort numbered album tracks suck.

Mobile phones that claim to be able to play music, but cannot play OGG or FLAC files suck.

Mobile phones that claim to be able to play music, but cannot display track names in any language suck.

Mobile phones that are hard to switch to vibration mode suck.

Finally, mobile phones that have any non-free software on them suck.

Language teacher

If you search Google for “language teacher” (מורה ללשון) in Hebrew, the autocompletion suggests “language teacher killed herself” (מורה ללשון התאבדה). The word “teacher” is spelled the same for both genders, but the verb is feminine. I don’t know why it happens, because actually searching for it doesn’t yield anything significant.

In Israeli schools where Hebrew is the medium of teaching, “Language” is the class where the grammar of Hebrew is taught… badly.

Immersion

Looking at this Facebook ad makes me think: Was the Orange Revolution in Ukraine a failure or a success?

Kiev is a safe, cheap, foreigner-friendly city with a lot of history and culture. Enrol now - get 10% off on group courses. Learn Russian in Kiev.

Russian Immersion in Kiev

The Orange Revolution is presented in the Western media mostly as an uprising against election fraud and for democracy and freedom. But to Eastern Europeans it was mostly about Ukraine’s relationship with Russia: Will Ukraine develop its own independent identity or will it remain little but Russia’s twin? The questions of nationality, language and identity were far more important than the questions of democracy vs. authoritarianism.

Yushchenko won the Orange Revolution, but lost the last election. Ukrainians, even those who supported his nationalist ideas, were disappointed: he seemed to do little but talk about how important it is to speak and write Ukrainian instead of Russian, proclaimed controversial figures such as Roman Shukhevych national heroes and promoted the Holodomor narrative, also rather controversial.

The Ukrainian language is going rather strong – it is the preferred language for many young people, it has an excellent music scene and it’s flourishing online. But it is not yet the language of an overwhelming majority – millions of people in Ukraine speak Russian for various reasons. As this advertisement testifies, Russian, the “occupier’s language”, is strong enough in Kiev to be used for marketing the city.

So, the nationalistic element of the Orange Revolution may have been somewhat of a failure, which can’t be too bad, but its democratic element is probably doing well. The government can, and probably should, force Ukrainian in documents and education, but it cannot stifle other languages in commerce. Yushchenko may hate it, but that’s the beauty of democracy.

Arab Inventors in Wikipedia

The famous provocative Russian designer and blogger Artemy Lebedev wrote in his blog today (my translation from Russian):

European (Christian) consciousness is built differently than the Eastern (Muslim).

The main unique property of the European culture is the ability to invent and create new things, technologies, items and products. Arab peoples are absolutely unable to invent something. Do we know anything Arabic? A television? A telephone? A car? At least one thing? My main complaint towards Islam is this – as a culture it is so egotistic, that I feel suffocated there.

Though very provocative in his use of language and in his criticism of ugly design, Lebedev is usually very secularist and anti-nationalistic. Sometimes, though, he does make some shocking and scathing remarks about ethnic and religious groups, such as this one.

It did make me think, however. Everybody knows that in the Middle Ages Arabs made many important advances in literature, medicine, astronomy, mathematics and other fields, but i really couldn’t think of an Arab inventor from the recent centuries. So i went to Wikipedia, opened Category:Inventors and descended to Category:Inventors by nationality.

There was only one Arab country listed: United Arab Emirates. Other prominent Muslim countries were Pakistan, Afghanistan, Iran and Turkey. Hmm. So i went to the page List of inventors, hoping that it would be more inclusive and easy to search. It didn’t help much – i found very few Arabs there, and they were mostly medieval characters.

And then i recalled that it’s the English Wikipedia. So i went to Category:Inventors by nationality in the Arabic Wikipedia. There i found several sub-categories for Arab countries: Saudi Arabia, Tunisia, Algeria, Lebanon and Egypt. There was no category for UAE, even though one existed in the English Wikipedia, and none of the categories i found in Arabic had an English counterpart; the one that existed for Algerian inventors was deleted a few months ago, because it was empty.

I went over the articles in these categories in the Arabic Wikipedia. Most of them didn’t have an English counterpart. There was an article in English about Hassan Kamel Al-Sabbah, a Lebanese engineer, so i created Category:Lebanese inventors for him and now there are two Arab countries under Category:Inventors by nationality in English.

There was also an article in English about Ahmed Zewail, an Egyptian chemist, and a couple of other scientists. All of them are probably great people, but reading the articles about them in English it seemed to me that even though it’s correct to call them “scientists” and maybe “discoverers”, they probably aren’t inventors. Of course, it’s possible that i misunderstood something, but it may also mean that for the people who tagged these people as “inventors”, this word had a somewhat different meaning. This may or may not mean that the Arabic word used in the category name, مخترع, covers both inventions and discoveries. The Al-Mawrid Arabic-English dictionary, which i use most of the time, says that this word means “inventor, creator, originator, innovator, maker, author”.


So, there’s a little lesson in cultural divide to be learned here. No, i don’t agree with Artemy Lebedev – i am certain that Arabs can and do invent things and the existence of articles about alleged inventors from Arab countries in the Arabic Wikipedia probably means that this is true. But currently chauvinistic people can take a look at the English Wikipedia, see that it has almost no Arab inventors and keep being sure that Arabs are, indeed, stupid and incapable of invention. Since Wikipedia is so easily available, they probably won’t bother to search for information elsewhere.

Unfortunately, my understanding of Arab culture and the Arabic language is too limited, but surely there must be an Arab who will take up this challenge and improve the coverage of Arab inventors in Wikipedia in English and other languages.

One way to do this would be to run the script that i wrote for finding and categorizing articles without interlanguage links; if you know Arabic and Perl, please contact me and i’ll gladly help you to set it up for the Arabic Wikipedia.

Unbearable Lightness

I was invited to the 10th anniversary celebration of the Catalan Wikipedia in Perpignan. Perpignan is a city in France, but from the Catalan point of view, it’s in Northern Catalonia – a rather large territory, also known as Roussillon, that was a part of Catalonia, but passed under French rule in 1659. Catalan is still spoken by many people there; how many exactly – i’ll have to see. I hope that it’s spoken by many people for a purely practical reason – my Catalan is much better than my French.

The Catalan Wikipedia is one of the first two Wikipedias created after the English one. The English Wikipedia was created on the 15th of January 2001; German and Catalan were created on the 16th of March 2001. Catalans love to tell that although their Wikipedia was created a few minutes after the German, it was the first one to have an actual article.

Since the Catalan Wikipedia is the oldest and the largest version of Wikipedia in a language which isn’t official in any big country (sorry, Andorra), the people behind it want to share their experiences promoting their language with other regional and minorized languages and this will be discussed in the event. More details on that later.


Direct El-Al flight from Tel-Aviv to Barcelona – 582 USD. Alitalia via Rome, 2 hours wait for connection – 460 USD. Czech Airlines (ČSA) via Prague, 11 hours wait for connection – 367 USD. Guess which one i picked. ČSA, of course – i pay less and i get to spend a day in Prague! Sorry, El-Al.

If you call the Czech Airlines office in Tel-Aviv, you can choose one of the following languages, in that order: English, Russian, German, Czech, French, Spanish, Italian. No Hebrew or Arabic. Apart from that, however, the service is excellent. I spoke in Russian with the service people and they were very polite, helpful and efficient. They were Czech; they spoke Russian with a slight accent, but it was completely correct and easy to understand. I’ll have to wait for the flight itself to see how it is, but until now my impression is very good.


P.S. Typing the word “Czech” is surprisingly hard.

Componenta

Israeli programmers use many words of English origin when they speak Hebrew. (Many of them prefer to write only in English instead of Hebrew, which is a separate issue.)

When they use these English words, they tend to adapt them to Hebrew pronunciation. Some adaptations are simple, for example “router” is pronounced with an Israeli, rather than English [r] sound (some people – not necessarily purists! – use the Hebrew word נַתָּב [natav] for that). “SQL” is rarely pronounced as “sequel” – usually it’s “ess cue el”, and the same goes for MySQL.

But some are harder to explain. For example, “component” is often pronounced [kompoˈnenta]. I heard it in several companies and i don’t quite understand why. Note the [a] at the end and the stress, too: in English it’s supposed to be something in the area of [kəmˈpoʊnənt] – on the second syllable, not the third. I have never heard an Israeli programmer pronounce it with correct stress when speaking in English – i always hear it as [ˈkomponənt] – with stress on the first syllable and with [o] in the first two syllables.

The only languages available on Google Translate in which this word is anywhere near [kompoˈnenta] are Serbian (компонента), German (Komponente), Romanian (componentă) and Spanish and Italian (componente). It may have something to do with them, but the solution is probably more complicated. Does anyone have any idea?


