Twitter Must Make it Easy to Mass-Report Spam Bots

I found a network of Russian female bots. Twitter spam bots.

They are not actually female. They just have Russian female names and female photos.

Most of those that I found were created in September 2016, although some were created at other times.

They all have similar taglines:

  • “In my opinion, everything is wonderful. I wonder what else” (“По-моему всё прекрасно. Интересно что ещё”)
  • “Right now absolutely everything is excellent. I wonder how else” (“Сейчас вообще всё отлично. Интересно как там ещё”)
  • “It looks like absolutely everything is wonderful. I’ll see what will happen next” (“Вроде вообще всё прекрасно. Посмотрю что будет дальше”)

… And so forth, with minor variations, which are very easy to detect for a human who knows Russian, although I’m less sure about software. (This reminds me of how I was interviewed for several natural language processing positions around 2011. All of them were about optimizing site text for Google ads, and all of them specifically targeted only English. When you only target English, other languages are used to spam you.)

Their usernames are all almost random and end with two digits: flowoghub90, viotrondo86, chirowsga88 (although “90” seems to be the most frequent ending). As their location, they all give one of the large cities of Russia: Moscow, Krasnoyarsk, Perm, Saint-Petersburg, Rostov-on-Don, etc.
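To make the pattern concrete, here is a rough sketch of how the two signals described so far (the username shape and the near-identical taglines) could be checked in software. It is only an illustration: the regular expression, the function names, and the threshold of three shared words are my assumptions, not anything Twitter is known to use, and a real system would need many more signals to avoid flagging innocent accounts.

```python
import re

# Assumption: lowercase Latin letters followed by exactly two digits,
# like the three usernames quoted above.
USERNAME_PATTERN = re.compile(r"[a-z]+\d{2}")

# The three taglines quoted above.
KNOWN_TAGLINES = [
    "По-моему всё прекрасно. Интересно что ещё",
    "Сейчас вообще всё отлично. Интересно как там ещё",
    "Вроде вообще всё прекрасно. Посмотрю что будет дальше",
]


def words(text):
    """Lowercase the text and return the set of words in it."""
    return set(re.findall(r"\w+", text.lower()))


def username_fits_pattern(username):
    """True if the whole username is letters followed by two digits."""
    return USERNAME_PATTERN.fullmatch(username) is not None


def max_shared_words(tagline):
    """Largest number of words shared with any one known tagline."""
    candidate = words(tagline)
    return max(len(candidate & words(known)) for known in KNOWN_TAGLINES)


def looks_suspicious(username, tagline, min_shared=3):
    """Hypothetical rule: both signals must be present (threshold is a guess)."""
    return username_fits_pattern(username) and max_shared_words(tagline) >= min_shared
```

A human who knows Russian does this at a glance; the point of the sketch is only that the same signals are not hard to express in code.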

All of them post nothing but retweets of other accounts popular in Russia:

Curiously, all their names are only typical to ethnic Russians. Names of real women from Russia would be much more varied—there would be a lot of typical Armenian, Ukrainian, Jewish, Georgian, and Tatar names that reflect Russia’s diversity: Melikyan, Petrenko, Rivkind, Gamkrelidze, Khamitova. But these spam bot accounts only have names such as Kuznetsova, Romanova, Ershova, Medvedeva, Kiseleva. If you aren’t familiar with the Russian culture, let me make a comparison to the U.S.: It’s like having a lot of people named Smith, Harris, Anderson, and Roberts, and nobody named Gonzalez, Khan, O’Connor, Rosenberg, or Kim. Maybe the spammers wanted to be more mainstream than mainstream, and maybe it is just overt racism.

I found them when I noticed that a lot of unfamiliar accounts with Russian female names were retweeting something by Pavel Durov in which I was mentioned. Durov is the founder of VK and Telegram, and I guess that he can be classified under “major internet businesses” in the list above. I noticed the similar taglines of the “women”, and immediately understood they are all spam bots.

These accounts are active. Some of them retweeted stuff while I was writing this post. I also keep getting retweet notifications, more than two weeks after Durov’s original tweet was posted.

When I look at any of these accounts, Twitter suggests similar ones to me, and they are all in the same network: Russian female names, similar “everything is wonderful” taglines, similar content. So Twitter’s software understands that they are similar, but doesn’t understand that they are spam bots that should be banned outright. I also noticed that some of them are still suggested to me after I blocked them, which defeats the whole point of blocking.

I don’t know how many of them there are in this network. Likely thousands. I reported thirty or so, and I wonder whether it has any effect.

I also don’t know what their purpose is. Boost the popularity of other Russian accounts? But those that they retweet are popular already. Waste the time of people who try to use Twitter productively? Maybe; at least that’s the effect in my case. Function as bot followers in “pay to follow” networks? Possibly, but they have existed for a year, and they don’t follow that many people.

I’m probably not discovering anything very new in this post. But if I’m not, that makes me wonder all the more why this problem hasn’t been addressed already. At the very least, it should be possible to report such accounts more efficiently, with one click or tap. And Twitter should also provide a form for mass-reporting; currently, Twitter’s guides about spam only suggest this: “The most effective way to report spam is to go directly to the offending account profile, click the drop-down menu in the upper right corner, and select ‘report account as spam’ from the list.” It’s OK for one account, but it requires five clicks, and it doesn’t scale for something as systematic as what I am describing in this post.

I do hope that somebody from Twitter will read this and do something about it. This is obvious systematic abuse, and I have no better way to report it.

The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

Emmanuel Macron on Twitter.png

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example from the Euroradio account:

Еўрарадыё   euroradio    Twitter double.png

Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian and Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.
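The letter rules above are simple enough to write down as code. The following is a minimal sketch, assuming we only need to tell Belarusian, Ukrainian, and Russian apart; the function name and the returned language codes are my own choices, and real language identification relies on statistical models rather than on a handful of letters. I spell the letters as Unicode escapes so that the Cyrillic ⟨і⟩ is not confused with the Latin ⟨i⟩.

```python
import unicodedata

SHORT_U = "\u045e"       # ў
DOTTED_I = "\u0456"      # і (Cyrillic, not the Latin i)
UKRAINIAN_YE = "\u0454"  # є
HARD_SIGN = "\u044a"     # ъ


def guess_east_slavic_language(text):
    """Guess 'be', 'uk', or 'ru' from distinctive letters; None if unsure."""
    # NFC joins a base letter and a combining breve into a single ў.
    letters = set(unicodedata.normalize("NFC", text).lower())
    if SHORT_U in letters and DOTTED_I in letters:
        return "be"  # ў together with і: the rule described above
    if UKRAINIAN_YE in letters:
        return "uk"  # among the three languages, only Ukrainian uses є
    if HARD_SIGN in letters:
        return "ru"  # among the three languages, only Russian uses ъ
    return None      # no distinctive letter found, so do not guess
```

Returning `None` is deliberate: when the text gives no clear signal, offering no translation is better than offering a wrong one.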

And yet, Twitter’s machine translation offers to translate the top tweet from Ukrainian, and the bottom one from Russian!

An even stranger thing happens when you actually try to translate it:

Еўрарадыё   euroradio    Twitter single Russian.png

Notice two weird things here:

  1. After clicking, “Ukrainian” turned into “Russian”!
  2. Since the text is actually written in Belarusian, trying to translate it as if it was Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. You can notice how the letter ⟨ў⟩ cannot be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

 4  Tweets with replies by Ntụ Agbasa   blossomozurumba    Twitter.png

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with the addition of diacritical marks, one of the most common of which is the dot below, as in the word ibụọla in this Igbo tweet, and the words chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying the language of a text by this feature alone is really not great.

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the Tweet, but this does show a curious thing: Even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo is supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to skip these tweets completely and not offer any translation than to offer this strange and meaningless thing. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but the grammar of the three languages is so similar that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, each for the languages that it supports, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.

Another thing for Twitter to try is to let users specify the languages in which they write. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. An explicit list of the languages a user writes in would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.

Five More Privileges of English Speakers, part 2: Language and Software

For the previous part in the series, see Five Privileges of English Speakers, part 1.

I’m continuing the series of posts in each of which I write about five privileges that English speakers have without giving it a lot of thought. The examples I give mostly come from my experience translating software, Wikipedia articles, blog posts, and some other texts between English, Hebrew, and Russian. Hebrew and Russian are the languages I know best. If you have interesting examples from other languages, I am very interested in hearing them and writing about them.

I’m writing them mostly as they come into my mind, without a particular order, but the five items in this part of the series will focus on usage of the English language in software, and try to show that the dominance of English is not only a consequence of economics and history, but that it’s further reinforced by features of the language itself.

1. Software usually begins its life in English

English is the main language of software development worldwide.

The world’s best-known place for software development is Silicon Valley, an English-speaking place. That’s the place of Facebook, Google, Apple, Oracle and many others. California is also the home of Adobe.

There are several other hubs of software development in the United States: Seattle (Microsoft, Amazon), North Carolina (Red Hat), New York (IBM, CA), Massachusetts (TripAdvisor, Lotus, RSA), and more. The U.S. is also the source for much of computer science research and education, coming from Berkeley, MIT, and plenty of other schools. The U.S. is also the birthplace of the Internet, originally supported by the U.S. Department of Defense and several American universities. The world wide web, which brought the Internet to the masses, was created in Switzerland by an English speaker.

Software is developed in other countries—India, Russia, Israel, France, Germany, Estonia, and many other countries. But the dominance of the U.S. and of the English language is clear. The reason for this is not only that the U.S. is the source for much of computer technologies, but also—and probably more importantly—that the U.S. is the biggest consumer market for software. So developers in all countries tend to optimize the product for the highest-paying consumers, and these only need English.

When engineers write the user interface of their software in English, they often give no thought to other languages at all. Or they make translation possible, but complicate it with English-centric assumptions about number, gender, text direction, text size, personal names, and plenty of other things, which will be explored in further points.

2. Terminology

English is also the source for much of the computer world’s terminology. Other languages have to adapt terms like smartphone, network, token, download, authentication, and thousands of others.

Some language communities work hard to translate them all meticulously into native words; Icelandic, Lithuanian, French, Chinese, and Croatian are famous examples. This is nice, but requires effort on the part of terminology committees, who need to keep up with the fast pace of technological development, and on the part of the software translators, who have to keep up with the committees.

Some just transliterate most of them: keep the term essentially in English, but rewritten in the native alphabet. Hindi and Japanese are examples of that. This seems easy, but it is based on a problematic assumption: that the target language speakers who will use the software know at least some English! This assumption is correct for the translators, who don’t just know the English terms, but are probably also quite accustomed to them, but it’s not necessarily correct for the end users. Thus, the privilege is perpetuated.

Some languages, such as Hebrew, German, and Russian, are mid-way, with language academics and purists pulling to purer native language, engineers pulling to more English-based words, and the general public settling somewhere in between—accepting the neologisms for some terms, and going for English-based words for others.

For the non-English languages it provides fertile ground for arguments between purists and realists, in which the needs of the actual users are frequently forgotten. All the while, English speakers are not even aware of all this.

3. Easy binary logic word formation

One particular area of computer terminology is binary logic. This sounds complicated, but it’s actually simple: in electronics and software opposite notions such as true / false, success / failure, OK / Cancel, and so forth, are very common.

This translates to a great need for words that express opposites: enable / disable, do / undo, log in / log out, delete / undelete, block / unblock, select / deselect, online / offline, connect / disconnect, read / unread, configured / misconfigured.

Notice something? All of the above words are formed with the same root, with the addition of a prefix (un-, dis-, de-, mis-, a-), or with the words “on” and “off”.

A distinct, but closely related need, is words for repetition. Computers are famously good at doing things again and again, and that’s where the prefix re- is handy: reconnect, retry, redo, retransmit.

These features happen to be conveniently built into the English language. While English has extremely simple morphology for declension and conjugation (see the section “Spell-checking” in part 1 of the series), it has a slightly more complex morphology for word formation, but it’s still fairly easy.

It is also productive. That is, a software developer can create new words using it. For example, the MediaWiki software has the concept of “oversight”—hiding a problematic page in such a way that only users with a particular permission can read it. What happens if a page was hidden by mistake? Correct: “unoversight”. This word doesn’t quite exist elsewhere, but it doesn’t sound incorrect, because familiar English word formation rules were used to coin it.

As always, English-speaking software engineers either don’t think about it at all, or think that other languages also have similar word formation rules. If you haven’t guessed it already, this is not true. Some other European languages have similar constructs, but not necessarily as consistent as in English. And for Semitic languages like Hebrew it’s a disaster, because in Semitic languages prefixes are used for entirely different things, and the grammar doesn’t have constructs for repetition and negation. So when translating software user interface strings into Hebrew, we have to use different words as opposites. For example, the English pair connect / disconnect is translated as lehitḥabér / lehitnaték: completely different roots, which Hebrew is just lucky to have. Another option is to use negative words like lo and bilti, or bitul, but they are often unnatural or outright wrong. Having to deal with something like “Mark as unread” is every Hebrew software translator’s nightmare, even though it sounds pretty straightforward in English.

English itself also has pairs of negative words that are not formed using the above prefixes, for example next / previous and open / close, but in many other languages they are much more common.

4. Verbing

“Verbing weirds language”, as one of the famous Calvin and Hobbes panels says.

Despite being a funny joke in the comic, it’s a real feature of the English language: because of how English morphology and syntax work, nouns can easily jump into the roles of adjectives and verbs without changing the way they are written.

For English, this is a useful simplification, and it works in labeling, as well as in advertising. “Enjoy Coca-Cola” is something more than an imperative. The fact that it’s a short single word and that it’s the same in all genders and numbers, makes it more usable as a call to action than it would be in other languages. And, other than advertising, where are calls to action very common? Software, of course. When you’re trying to tell a user to do something, a word that happens to be both the abstract concept and the imperative is quite useful.

Perhaps the most famous example of this these days is Facebook’s “Like”. Grammatically, what is it in English? Imperative? A noun describing an abstract action? Maybe a plain old noun, as in “chasing likes” (this is a plural noun; English verbs don’t have a plural form!)? Answer: it’s all of them and more.

When translated to Hebrew in Facebook’s interface, it’s Ahávti, which literally means “I loved it”. Actually, this translation is mostly good, because it’s understandable, idiomatic, and colloquial enough without compromising correctness. Still, it’s a verb, which is not imperative, and it’s definitely not a noun, so you cannot use it in a sentence as if it was a noun. Indeed, Hebrew speakers are comfortable using this button, but when they speak and write about this feature, they just use its English name: “like” (in plural láykim). It even became a slightly awkward, but commonly used verb: lelaykék. Something similar happens in Russian.

It would be impossible in Hebrew and Russian to use the exact same word for the noun and the verb, especially in different persons and genders. Sometimes the languages are lucky enough to be able to adapt an English verb in a way that is more or less natural, but sometimes it’s weird, and hurts the user experience.

5. Word length

This one is relatively simple and not unique to English, but should be mentioned anyway: English words are neither very long, nor very short. Examples of languages where words are, on average, longer than in English, are Finnish, Tamil, German, and occasionally Russian. Hebrew tends to be shorter, although sometimes a single English word has to be translated with several Hebrew words, so it can also get longer. This is true for pretty much any language, really.

In designing interfaces, especially for smaller screens, the length of the text is often important. If a button label is too long, it may overflow from the button, or be truncated, making the display ugly, or unusable, or both.

If you’re an English speaker, this probably won’t happen to you, because almost all software is designed with the word length of your language in mind. Other languages are almost always an afterthought.

The good practice for software engineers and designers is to make sure that translated strings can be longer. Their being shorter is rarely a problem, although sometimes a string is so short that the button may become too small to click or tap conveniently.
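One cheap way to catch such problems early is to check every translation of a label against the room that the design leaves for it. Here is a small sketch, assuming a purely hypothetical budget of ten characters for a button; the function name is mine, and counting characters is only a rough stand-in for measuring the rendered width in pixels.

```python
def find_overflowing_labels(translations, max_chars):
    """Return the language codes whose label is longer than max_chars."""
    return sorted(
        lang for lang, label in translations.items() if len(label) > max_chars
    )


# The same button label in English, German, and Finnish.
settings_label = {
    "en": "Settings",
    "de": "Einstellungen",
    "fi": "Asetukset",
}

print(find_overflowing_labels(settings_label, max_chars=10))  # ['de']
```

A button sized for the eight letters of “Settings” has no room for the thirteen letters of “Einstellungen”, and a check like this shows it before a German user does.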


Generally, what can you do about these privileges?

Whoever you are, remember it. If you know English, you are privileged: Software is designed more for you than for people who speak other languages.

If you are a software engineer or a designer, at the very least, make your software translatable. Try to stick to good internationalization practices and to standards like Unicode and CLDR. Write explanations for every translatable string in as much detail as possible. Listen to users’ and translators’ complaints patiently: they are not whining, they are trying to improve your software! The more internationalizable it is, the more robust it is for you as a developer, and for your English-speaking users, too, because better design thinking will be going into each of its components, and fewer problematic assumptions will be made.

Five Privileges of English Speakers, part 1

It’s very common today on progressive blogs to urge people to check their privilege.

Being an English speaker, native or non-native, is a privilege.

It’s not discussed as often as other forms of privilege, such as white, male, cis, hetero, or rich privilege. The reason for this is simple: The world’s media is dominated by the English language. English-language movies are more popular in many countries than movies in these countries’ own languages, English-language news networks are quoted by the rest of the world, the world’s most popular social networks are based in the U.S. and are optimized for U.S. audiences, etc.

So, when English speakers discuss privilege among each other, English is not much of an issue, and they dedicate more time to race, gender, wealth, religion, and other factors that differentiate between people in English-speaking countries.

Despite this, I am not the first one to describe English as a privilege. A simple Google search for english language privilege will yield many interesting results.

What I do want to try to do in this series of posts is to list the particular nuances that make English such a privilege in as much detail as possible. I wanted to write this for a long time, but there are many such nuances, so I’ll just do it in batches of five, in no particular order:

1. Keyboard

If you speak English, congratulations: A keyboard on which your language can be written is available on all electronic devices.

All of them.

All desktops, laptops, phones, tablets, watches. The only notable exception I can think of is typewriters, which only makes the point more tragic: technology moved forward and made writing easier in English, but harder in many other languages, where local-language typewriters were replaced with computers with English-only keyboards.

In the very worst case, writing English on a computer will be slightly inconvenient in countries like Germany, France, or Turkey, where the placement of the Latin letters on the keys is slightly different from the U.S. and U.K. QWERTY standard. Oh, poor American tourists.

On a more serious note, though, even though a lot of languages use the Latin alphabet, a lot of them also use a lot of extra diacritics and special characters, and English is one of the very few that doesn’t. Of the top 100 world’s languages by native speakers, only Malay, Kinyarwanda, Somali, and Uzbek have standardized orthographies that can be written in the basic 26-letter Latin alphabet without any extra characters. We can also add Swahili, which has a large number of non-native speakers, but that’s it. With other languages you can get stuck and not be able to write your language at all (Hindi, Chinese, Russian, etc.), or you may have to write in a substandard orthography because you can’t type letters like é or ł (French, Vietnamese, Polish, etc.).

The above is just the teeny-tiny tip of the iceberg; the keyboard problem will be explored in more points later.

2. Spell-checking

English word morphology is laughably simple.

There’s -s for plurals and for third person present tense verbs, there’s -’s for possession, and there are -ed and -ing verb forms. There are also some contractions (’d, ’s, ’ll, ’ve), and a long, but finite list of irregular verb forms, and an even shorter list of irregular plural noun forms. And that’s it.

Most languages aren’t like that. In most languages words change with prefixes, suffixes, infixes, clitics, and so on, according to their role in the sentence.

Beyond the fact that English writing is (arguably) easier for children and foreigners to learn, this means that software tools for processing a language are easy to develop for English and hard to develop for other languages.

The first simple example is spell-checking.

English has had not just spelling, but also grammar and style checkers built into common word processors for decades, and many languages of today don’t even have spelling checkers, not to mention grammar, or style, or convenient searching. (See below.)

So in English, when you type “kinh”, most word processors will suggest correcting it to “king”, but then, some of them may also suggest replacing this word with “monarch” to be more inclusive for women, and this is just one of the hundreds of style improvement suggestions that these tools can make. For a lot of other languages, even simple spell-checking of single words hasn’t been developed yet, and grammar checking is a barely-imaginable dream.

3. Autocompletion

Simpler morphology has many other effects.

Even though Russian is my first native language and I speak it more fluently than I speak English, I am much slower when I’m typing in Russian on my phone. In English, the autocompleting keyboard makes it possible to write just two or three letters of a word and let the software complete the rest. In Russian, the ending of the word must be typed, and autocompletion rarely guesses it correctly. Typing an incorrect ending will make a sentence convey incorrect information, or just make it completely ungrammatical.

4. Searching

Yet another consequence of the previous point: English’s very simple morphology makes searching easier.

For example, word processors have a search and replace function. For English, it will likely find all forms of the word, because there are so few of them anyway. But in Hebrew and Arabic, letters are often inserted or changed in the middle of the word according to its grammatical state, and you need to search for each form, which is quite agonizing. It’s comparable to “man” vs. “men” in English, except that in English such changes are very rare, while in many other languages they happen in almost every word.

With search engines that must find words across thousands of documents it gets even harder. Google can easily figure out that if you’re searching for “drive”, you may also be interested in “driving”, “drove”, and “driven”, but Russian has dozens of other forms for this word. A few languages are lucky: special support was developed for them in search engines, and tasks of this kind are automated, but most languages are just left out in the cold. And English barely needs extra support like this in the first place.
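To show what that “special support” amounts to in the simplest possible terms, here is a toy sketch of form-aware search. The table of word forms is written by hand and contains only the English example from above; the names are my own, and real search engines use stemmers or morphological analyzers rather than a lookup table like this.

```python
import re

# Hand-written table for the single example above. For a Russian verb,
# an entry like this would need dozens of forms.
WORD_FORMS = {
    "drive": {"drive", "drives", "driving", "drove", "driven"},
}


def find_matching_documents(query, documents):
    """Return the indexes of documents that contain any form of the query."""
    forms = WORD_FORMS.get(query, {query})  # unknown word: exact match only
    matches = []
    for index, text in enumerate(documents):
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & forms:
            matches.append(index)
    return matches
```

For a word that is missing from the table, the function falls back to exact matching, which is roughly the situation of a language that nobody wrote special support for.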

5. Very little gender

A lot can be said about gendered language, but as far as basic grammar goes, English has very little in the area of gender. “He” and “She”, and that’s about it. There are also man/woman, actor/actress, boy/girl, etc., but these distinctions are rarely relevant in technology.

In many other languages gender is far more pervasive. In Semitic and Slavic languages, a lot of verb forms have gender. In English, the verb “retweeted” is the same in “Helen retweeted you” and “Michael retweeted you”, but in Hebrew the verb is different. Because Twitter doesn’t know that Helen needs a different verb, it uses the masculine verb there, which sounds silly to Hebrew speakers.

I asked Twitter developers about this many times, and they always replied that there’s no field for gender in the user profile. This has become more and more amusing lately, now that it has become so common (and for good reasons!) to mention one’s preferred pronouns in the Twitter profile bio. So people see it, but computers don’t.

On a more practical note, in the relatively rare cases when third person pronouns must be used in software strings, English will often use the singular “they” instead of “he” or “she”. So English-speaking developers do notice it, but not as often as they should, and when they do, they just use the lazy singular-they solution, which is socially acceptable and doesn’t require any extra coding. If only they’d notice it more often, using their software in other languages would be much more convenient for people of all genders.

The only software packages that I know that have reasonably good support for grammatical gender are MediaWiki and Facebook’s software. I once read that Diaspora had a very progressive solution for that, but I don’t know anybody who actually uses it. There may be other software packages that do, but probably very few.
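To give a rough idea of what such support involves, here is a minimal sketch of choosing a message by grammatical gender. It is not how MediaWiki or Facebook actually do it; the function name and the gender labels are my own, and I use Russian strings as the example because Russian past-tense verbs change by gender in the same way. When the gender is unknown, the sketch falls back to the masculine form, which is an assumption on my part and is exactly the behavior that sounds silly in the Twitter example above.

```python
# Illustrative Russian strings: "<name> retweeted you".
RETWEET_MESSAGES = {
    "masculine": "{name} ретвитнул вас",
    "feminine": "{name} ретвитнула вас",
}


def retweet_notification(name, gender=None):
    """Pick the verb form by gender; unknown gender falls back to masculine."""
    template = RETWEET_MESSAGES.get(gender, RETWEET_MESSAGES["masculine"])
    return template.format(name=name)


print(retweet_notification("Елена", "feminine"))
print(retweet_notification("Михаил", "masculine"))
```

The code itself is trivial. The hard part is that the software has to know the gender in the first place, and that translators have to be given a way to write more than one variant of the same string.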


These are just the first five examples of English-language privilege I can think of. There will be many, many more. Stay tuned, and send me your ideas!

The Original Snakes on a Plane

It may seem tasteless to many of you, but I just had to share it. The lost Malaysian airplane story reminds me of a 1980s Polish-Soviet adventure-sci-fi film, “The Curse of Snake Valley”.

It was, without doubt, inspired by the Indiana Jones films: a Polish linguistics professor deciphers an ancient manuscript that promises “great power” to anybody who recovers a treasure from a South-East Asian temple. He goes to recover it with the help of an aging French tough guy, who turns out to be a villain who wants the power for himself, and a tastelessly sexy female French journalist, who also turns out to be a villain working on behalf of a sinister organization, which – you guessed it – also wants to take over that mysterious power. The treasure turns out to be a biological weapon brought to Earth by aliens who have a thing for snakes (yet another Dr. Jones reference). When the first test of the weapon goes awry and kills the sinister organization’s boss, the new boss sends it for testing in a Pacific atoll, and the airplane that carries it disappears in the sky.

This is the Russian-dubbed version. You can see the airplane scene at 1:31:30.

Don’t have big expectations: The movie was voted in a poll in Poland as one of the worst Polish movies ever. It was, however, a huge hit with Soviet children back in 1988. I went to see it in the neighborhood cinema at least three times, and I had oh so many discussions with my friends about the deep meanings in its plot.

And, well, yes, it reminds me of the odd Malaysian story. Can’t help it. At least it’s an opportunity to tell a strange little story from my Soviet childhood.

Turkic Wikimedia Conference 2012, Almaty: Master Class, Kazakh in China and Developers’ Workshop

The translatewiki.net “master class”

On the morning of the second day of the Turkic Wikimedia Conference 2012 I held a translatewiki.net workshop. The participants called it a “master class” and I didn’t object :)

People sitting on benches. Amir Aharoni operating a notebook and a projector
Doing a "master class" in translatewiki.net

In the master class I demonstrated how to translate Wikimedia software. People opened accounts and started translating MediaWiki and the Wikipedia Mobile app. During the master class several issues were raised. Some of them turned out to be technical issues of translatewiki.net. I intend to find a solution soon.

Language support for Kazakh speakers in China

After the translatewiki.net master class I had a relatively short, but really fantastic meeting with Akytbek, a Kazakh speaker from North-Western China. He told me that two million Chinese Kazakhs are well-connected to the Internet and that they vigorously use the Kazakh language online. (According to official Chinese data, there are 1.25 million Kazakhs in China, but whatever the number is, it’s a lot of people.) That is good, of course, but they do it only in the Arabic alphabet, and not in the Cyrillic one, which is used in Kazakhstan. He said that there is great potential for having many Chinese Kazakh contributors to Wikipedia, and that even though the Kazakh Wikipedia already supports the Arabic script, some improvements are needed to realize this potential.

People sitting together on benches and looking on a laptop computer
Working with Akytbek from China on Arabic script support for the Kazakh Wikipedia

I showed Akytbek our current language tools – the automatic script conversion, WebFonts and the Narayam typing tool, and we decided to work together to adapt them better for the needs of Chinese Kazakhs.

By the way, Akytbek didn’t speak any Russian and he knew little English, so another Kazakh speaker who knew Russian acted as an interpreter. This is yet another proof of the importance of never assuming anything about languages and people.

MediaWiki development workshop

According to the schedule, the same morning I was also supposed to hold a workshop for programmers that would introduce them to MediaWiki development. The workshop did not take place at its scheduled time – network problems spoiled the opportunity. However, as it is so important, we did not give up and held it later at the hotel where we were staying.

It was intense, and intensely good, too: Talented and experienced people from Turkmenistan, Kyrgyzstan, Bashkortostan and Kazakhstan sat and listened to me talking for two hours or so about MediaWiki configuration, special pages, i18n files, installation procedures, extensions, preferences, templates, bots, source control and so on. Because of the quality of the questions, I am sure that my presentation was understood. What made me really happy is that several people asked how they could contribute patches and new features.

To be continued…

Keyboards, Firefox, Chrome and Privacy

I hardly ever used Google Chrome because of a bug that made the Ctrl-arrow keyboard shortcut work incorrectly in right-to-left languages. This shortcut makes the cursor jump a word to the left or to the right. In Hebrew and Arabic it would jump to the left when the right arrow was pressed. The shortcut works well in most other programs, but since Chrome doesn’t use the operating system’s text editing capabilities, it worked incorrectly there.

I write a lot of email, blog posts and Wikipedia articles and this keyboard shortcut is essential for me, so if it doesn’t work correctly in a program, i simply cannot use it and will use the competitor, in my case Firefox. Since i love Firefox anyway, it was not really a problem for me.

It took more than two years to do it, but this bug is more or less solved now and the fix will probably be released soon. I am now trying a preliminary version and the Ctrl-arrow shortcut seems to work correctly. However, as i expected, i quickly found other problems because of which i cannot use Google Chrome. Long story short, i cannot write Russian there. It’s not that it’s impossible – it’s just way too hard for me.

I could enable the Russian keyboard layout in my operating system, but it would be very hard to use for me. Keyboards sold in my country usually come with Latin and Hebrew letters printed on the keys and not Russian. It’s possible to buy a keyboard with Russian letters on it, and i did it once, but it didn’t help me much. You see, i write Russian several times a day, but less often than i write Hebrew or English, and the Russian layout is very different from the Latin layout, so i type in it very slowly even if i have the letters in front of my eyes.

Since 2006 my solution for this issue was the Transliterator add-on for Firefox, created by Alex Benenson (thank you so much, Alex). It was first called “ToCyrillic”, because it only helped with the Cyrillic alphabet, but later it was adapted to many other languages. It allows me to type Russian phonetically, so the Latin ‘b’ is automatically converted to Cyrillic ‘б’, ‘sh’ becomes ‘ш’ etc. It works everywhere in Firefox – websites’ input fields, the address bar, the dialog windows etc.
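To make the idea of phonetic typing concrete, here is a minimal sketch in Python. It is not the Transliterator add-on's code, and the mapping table is a tiny hypothetical subset that i made up for illustration; a real phonetic layout covers the whole alphabet and handles capital letters too. The only trick is to try the longest Latin sequence first, so that ‘sh’ becomes ‘ш’ rather than ‘с’ followed by ‘х’.

```python
# Hypothetical, deliberately incomplete Latin-to-Cyrillic table (lowercase only).
TRANSLIT = {
    "sh": "ш",
    "ch": "ч",
    "zh": "ж",
    "a": "а",
    "b": "б",
    "d": "д",
    "e": "е",
    "h": "х",
    "k": "к",
    "l": "л",
    "o": "о",
    "r": "р",
    "s": "с",
    "t": "т",
    "u": "у",
}


def transliterate(text, table=TRANSLIT):
    """Convert Latin input to Cyrillic, preferring the longest matching key."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try two-letter keys such as "sh" before single letters.
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # Characters without a mapping (spaces, digits, punctuation) pass through.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("babushka"))
print(transliterate("shkola"))
```

Note that nothing here needs a network connection or access to anything except the typed characters themselves.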

I couldn’t find anything like it for Chrome. It’s possible that i didn’t look well enough, but the add-ons i did find that claimed to do transliteration, phonetic typing or keyboard emulation either did something completely different or asked me to allow the add-on to access my data on all websites and my tabs and browsing activity. I don’t understand why such an add-on would need access to my data and browsing activity – it is only supposed to convert the characters i type into other characters and forget them.

It’s possible that the message that tells me about these privacy implications is over-zealous and the add-ons in question don’t actually breach my privacy, but it is still weird to see it, so i didn’t install them.

So there – i still have a strong reason not to move to Google Chrome. It’s not really Google’s fault. In fact, i could develop an extension myself that does what i want – the source and the API are open and it’s probably not a lot of work. But why would i waste even a minute of my time doing such a thing if i already have Firefox and its Transliterator add-on that work perfectly well? You could say that Google Chrome is faster and uses less memory; it is not quite true in the first place, and even if it were true, i wouldn’t care about it, because being able to write the language i want is far more important than minor differences in performance.


As a side note, in some Google websites it’s possible to type in transliteration. However, it works only on these particular sites and needs the machine to be online, because it uses a web service to translate every word. That is weird software design and has rather unacceptable privacy implications.

Wikipedia already has phonetic typing support in Malayalam, Tamil and other languages and soon it is going to be deployed to other languages. It works in-place – it translates the text immediately in the browser letter by letter. Of course, it only works in one website; it would be better to help people to enable their native keyboard layouts rather than do it in only one website, but apparently doing it this way helps people start writing and searching immediately. More details on that soon.

Arab Inventors in Wikipedia

The famous provocative Russian designer and blogger Artemy Lebedev wrote in his blog today (my translation from Russian):

European (Christian) consciousness is built differently than the Eastern (Muslim).

The main unique property of the European culture is the ability to invent and create new things, technologies, items and products. Arab peoples are absolutely unable to invent something. Do we know anything Arabic? A television? A telephone? A car? At least one thing? My main complaint towards Islam is this – as a culture it is so egotistic, that I feel suffocated there.

Though very provocative in his use of language and in his criticism of ugly design, Lebedev is usually very secularist and anti-nationalistic. Sometimes, though, he does make some shocking and scathing remarks about ethnic and religious groups, such as this one.

It did make me think, however. Everybody knows that in the Middle Ages Arabs made many important advances in literature, medicine, astronomy, mathematics and other fields, but i really couldn’t think of an Arab inventor from the recent centuries. So i went to Wikipedia, opened Category:Inventors and descended to Category:Inventors by nationality.

There was only one Arab country listed: United Arab Emirates. Other prominent Muslim countries were Pakistan, Afghanistan, Iran and Turkey. Hmm. So i went to the page List of inventors, hoping that it would be more inclusive and easy to search. It didn’t help much – i found very few Arabs there, and they were mostly medieval characters.

And then i recalled that it’s the English Wikipedia. So i went to Category:Inventors by nationality in the Arabic Wikipedia. There i found several sub-categories for Arab countries: Saudi Arabia, Tunisia, Algeria, Lebanon and Egypt. There was no category for UAE, even though one existed in the English Wikipedia, and none of the categories i found in Arabic had an English counterpart; the one that existed for Algerian inventors was deleted a few months ago, because it was empty.

I went over the articles in these categories in the Arabic Wikipedia. Most of them didn’t have an English counterpart. There was an article in English about Hassan Kamel Al-Sabbah, a Lebanese engineer, so i created Category:Lebanese inventors for him and now there are two Arab countries under Category:Inventors by nationality in English.

There was also an article in English about Ahmed Zewail, an Egyptian chemist, and a couple of other scientists. All of them are probably great people, but reading the articles about them in English it seemed to me that even though it’s correct to call them “scientists” and maybe “discoverers”, they probably aren’t inventors. Of course, it’s possible that i misunderstood something, but it may also mean that for the people who tagged these people as “inventors”, this word had a somewhat different meaning. This may or may not mean that the Arabic word used in the category name, مخترع, covers both inventions and discoveries. The Al-Mawrid Arabic-English dictionary, which i use most of the time, says that this word means “inventor, creator, originator, innovator, maker, author”.


So, there’s a little lesson in cultural divide to be learned here. No, i don’t agree with Artemy Lebedev – i am certain that Arabs can and do invent things and the existence of articles about alleged inventors from Arab countries in the Arabic Wikipedia probably means that this is true. But currently chauvinistic people can take a look in the English Wikipedia, see that it has almost no Arab inventors and keep being sure that Arabs are, indeed, stupid and incapable of invention. Since Wikipedia is so easily available, they probably won’t bother to search for information elsewhere.

Unfortunately, my understanding of the Arab culture and language is too limited, but surely there must be an Arab who will take this challenge and improve the coverage of Arab inventors in the Wikipedia in English and other languages.

One way to do this would be to run the script that i wrote for finding and categorizing articles without interlanguage links; if you know Arabic and Perl, please contact me and i’ll gladly help you to set it up for the Arabic Wikipedia.
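For readers who prefer Python, the core of such a check can be sketched as follows. This is not my Perl script, only an illustration of the idea. It assumes that the pages were fetched from the MediaWiki API with `action=query`, `prop=langlinks` and `formatversion=2`, in which case every page that has interlanguage links carries a `langlinks` list. The function name and the sample response are made up for the example; the sample is hand-written and is not real data.

```python
def titles_without_langlink(response, lang):
    """Return the titles of pages in the response that have no link to `lang`."""
    missing = []
    for page in response.get("query", {}).get("pages", []):
        # A page without any interlanguage links has no "langlinks" key at all.
        linked = {link["lang"] for link in page.get("langlinks", [])}
        if lang not in linked:
            missing.append(page["title"])
    return missing


# Hand-written sample in the assumed response shape.
sample_response = {
    "query": {
        "pages": [
            {"title": "Page A", "langlinks": [{"lang": "en", "title": "Page A"}]},
            {"title": "Page B", "langlinks": [{"lang": "fr", "title": "Page B (fr)"}]},
            {"title": "Page C"},
        ]
    }
}

print(titles_without_langlink(sample_response, "en"))
```

The titles it returns are the candidates for translation or for linking to an existing article in the other language.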

Who is Albert Sánchez Piñol?

Who is Albert Sánchez Piñol? Let’s look at Wikipedias in different languages, translated into English, ordered by the English name of the language:

Basque: Albert Sánchez Piñol is a Catalan writer and anthropologist.

Catalan: Albert Sánchez Piñol is a Catalan anthropologist and writer who wrote the known works “The Cold Skin” (2002) and “Pandora in Congo” (2005).

Dutch: Albert Sánchez Piñol is a Spanish anthropologist and employee of the Center for African Studies of the University of Barcelona. (The rest of the article describes his work in the field of anthropology. The last sentence says that he writes in Catalan.)

English: Albert Sánchez Piñol (Catalan pronunciation: [əɫˈβɛrt ˈsantʃeθ piˈɲɔɫ]) is a Catalan Spanish author and anthropologist writing in the Catalan language.

German: Albert Sánchez Piñol is a Spanish anthropologist and writer. (Catalan is not mentioned in the article, but the article is included in the category “Literature (Catalan)”).

Italian: Albert Sánchez Piñol is a Spanish writer and anthropologist. (The fact that “The Cold Skin” was written in Catalan is mentioned towards the end.)

Norwegian: Albert Sánchez Piñol is a Spanish author and social anthropologist, writing in Catalan.

Polish: Albert Sánchez Piñol, a Spanish writer, a prosaist writing in the Catalan language. By education he is an anthropologist.

Russian: Albert Sánchez Piñol – a Catalan anthropologist and writer.

Spanish: Albert Sánchez Piñol is a Spanish writer and anthropologist. His literary work is written in Catalan.

(All articles say that he was born in Barcelona in 1965. Only English has an IPA transcription of the name, although it’s probably wrong.)

Japanese, Germans and Israelis of the world

Through i-iter i came upon this interesting post: Tamil, Kannada and the middle path. Tamil and Kannada are two important languages spoken in the south of India and their speakers are quite proud of their identity.

The article complains that not enough is being done for the linguistic normalization of non-Hindi languages in India. It was very interesting to read it and, being Israeli, i was surprised to see the compliments to “Japanese, Germans and Israelis of the world who aren’t wasting time tom-toming about antiquity, beauty or originality, but are instead investing their time, money and energy in using their languages for almost all known purposes”.

I was curious – why did they choose these three? Why not Russians and French, who use their languages for everything because many of them openly consider them to be better than all the others? Why not Catalans, whose language is in a political situation which is much more similar to that of Tamil and Kannada?

And why Israelis? Sure, we use Hebrew a lot; Hebrew Wikipedia, for example, is our pride. But i don’t think that we use Hebrew enough. For example, a lot of people (not all) write email in English. They write email in English even if they don’t know English well. They write email in English even though practically all the technical problems with encoding and bi-directionality were solved years ago. And they write email in English even if the email is about a topic for which Hebrew is perfectly suitable: one could argue that English is more convenient for writing about software or physics, but quite a lot of people write email in English just to tell recent family news or to make an appointment.

I used to do that, too, but i made a conscious decision to stop writing email in English unless it is absolutely necessary. I tell all my friends about it. Some of them are indifferent and some of them – especially those in the software industry – say that Israel should have adopted English and not Hebrew as its language. Shame on them. Students think that i know English well, so they often ask me what is the most polite way to make an appointment with their professors in English, and i always tell them: “If your professor can read Hebrew, just write the email in Hebrew!”

Of course, there’s also the matter of university papers. In physics, for example, even though Hebrew is used in the classroom, it is taken for granted that papers at M.A.-level and higher are written only in English. The need for an English version is understandable, because on a world scale very few people would be able to read a paper in Hebrew, but i would imagine that it’s much better to write the paper in Hebrew and translate it. Yes, it would take time and probably money, but it is nevertheless useful and not just for the honor of the Hebrew language: it would actually advance science and education, because this way people would express themselves in their own language and think about physics instead of thinking about English.

Finally, there’s Facebook. For some reason many Israelis still use Facebook with the English interface – again, even though they don’t know English well, and even though they never read or write anything in English there. The translation of Facebook into Hebrew is terrible, and what’s especially frustrating is that i would gladly fix it, but i can’t, because the interface for submitting translation corrections is absolutely unusable. I nevertheless use Facebook in Hebrew, because it solves the bi-directionality problems – for example, the notorious problem with the punctuation marks appearing at the wrong end of the sentence. There was a newspaper report saying that Facebook influences Israeli children so much that they got used to writing the question mark at the beginning of the sentence – and that’s how they submit their homework! Some Israelis develop weird tricks to make the punctuation appear on the correct side of the sentence, for example by adding a letter after the period – compare “אתה בא לכדורגל בערב?י” and “אתה בא לכדורגל בערב?” – notice the placement of the question mark and the redundant letter in the first sentence. But they could simply switch to Hebrew. (And one day i will write an email to Facebook offices and tell them that they really should improve the translation.)
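The reason the extra-letter trick works can be seen in the Unicode character properties. The question mark has no direction of its own, so at the end of a Hebrew sentence inside a left-to-right interface it takes the direction of the interface and lands on the wrong side; a Hebrew letter after it makes it sit between two right-to-left characters. The sketch below, in Python, shows the properties and an invisible alternative to the redundant letter: the right-to-left mark (U+200F). The helper names are made up for this example, and whether a particular website keeps the mark when it stores the text is something i assume rather than know.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK: invisible, strongly right-to-left

# Bidi classes treated here as lacking a strong direction of their own:
# "ON" covers "?" and "!", "CS" covers "." and ",".
WEAK_OR_NEUTRAL = {"ON", "CS"}


def needs_anchor(sentence):
    """True if the sentence ends with a character that has no strong direction."""
    return bool(sentence) and unicodedata.bidirectional(sentence[-1]) in WEAK_OR_NEUTRAL


def anchor_trailing_punctuation(sentence):
    """Append an invisible right-to-left mark after trailing punctuation."""
    return sentence + RLM if needs_anchor(sentence) else sentence


question = "אתה בא לכדורגל בערב?"
print(unicodedata.bidirectional("?"))  # bidi class of the question mark
print(unicodedata.bidirectional("י"))  # bidi class of a Hebrew letter
print(len(question), len(anchor_trailing_punctuation(question)))
```

The anchored sentence is one character longer, and that character does the same job as the redundant letter without being visible.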

It’s quite pleasing to see that speakers of Kannada look up to us, but it doesn’t mean that we already did all we could to normalize Hebrew.

(And why am i writing this in English? Because i started writing it as a comment for that blog and it grew into a post by itself.)