Archive for the 'language' Category

Link Wikipedia Articles in Different Languages

OK THIS IS AWESOME, and “awesome” is not a word that I use lightly.

As a gift for the second birthday of the Wikidata project, nice people at Google created a tool that helps people link articles in different languages that are not linked yet. They prepared a list with thousands of pairs of articles in different languages that are supposed to be about the same subject according to their automatic guesswork. The tool only shows such articles, and a human editor must check whether they actually match, and if they do—make the linking automatically.

There were thirty-six such articles for the Hebrew–English pair. About four of them were unrelated, and I fixed the linking between the rest of them. Some of them required manual intervention, because there were interfering links to unrelated subjects. The simple cases took me just a few seconds each, and the few complicated ones took a few minutes.

I also tried doing the same for Russian–English, but there are over a thousand article pairs there, so I only did a few. I also did a few for Catalan and Greek, and I finished all ten pairs for Bengali, even though I don’t actually know Greek or Bengali. I just used a bit of healthy intuition and Google Translate, and I’m pretty sure that I did it well.

You can help!

Here are my suggested instructions for doing this.


  1. Log in to This account is used also for the tool.
  2. Now go to the tool’s site. Click Login, and allow the tool to use your account.
  3. Go to settings, and choose your pair of languages.
  4. Go to “Check by list” and you’ll see a list of article pairs. If there are no suggested article pairs for the language pair you selected, go back to step 3 and choose some other languages. As I wrote above, from my experience, you don’t need to know a language thoroughly to perform this useful work ;)

Now click a link to a pair of articles that looks reasonable. Articles in both languages will open side by side.

  1. If the articles are definitely not about the exact same subject, click “No” in the list and find another pair.
  2. If the articles are about the same subject and one of them doesn’t have any interlanguage links, click “Add links” in the interlanguage area. In the box that will open, write the language name of the other language in the first field and the title of the article in the other field, and then click the “Link with page” button. A list of articles in other languages will be shown. If it looks reasonable, click “Confirm”, and then “Close dialog and reload page”. That’s it, the pages are linked! Click “Yes” in the list in the linking tool and proceed to another article pair.
  3. If the articles are about the same subject, but both of them appear to have links to other languages, it’s possible that explicit interlanguage links are written in the source code of the articles. To resolve this, do the following:
    1. Open both articles for editing in source mode.
    2. Scroll all the way down and find whether they have explicit interlanguage links.
    3. If these are correct links to articles about the same subjects in other languages, go to those articles, and link them using Wikidata. Note that it often happens in such cases that these are links to redirects, so the actual current title may be different.
    4. If these are links to articles about other subjects, even if they are related, remove those links. For example, if the article in Bengali is about an island, and the article in Dutch is about a city on that island, remove the link: these subjects are distinct enough. Ditto if the article in English is about an American human rights organization and the article in French is about a French human rights organization.
    5. If you were able to remove all the explicit links from the source, go back to point 2 above and link the articles using Wikidata.
    6. If it’s too complicated to remove these links for any reason, feel free to go to another article, but it would be nice to leave a note about this on the articles’ talk pages so that other editors would clean this up some time.
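
For those who prefer to check the interlanguage links from a script before opening the articles, here is a minimal sketch in Python. It assumes the MediaWiki Action API query `action=query&prop=langlinks` with `formatversion=2`, and the response shape in the comments is my assumption about that format. The function names and the three action labels are made up for this illustration and are not part of the linking tool.

```python
from urllib.parse import urlencode


def langlinks_url(lang, title):
    """Build an Action API URL that lists a page's interlanguage links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def linked_languages(response):
    """Collect language codes from a parsed JSON response.

    Assumed shape (formatversion=2):
    {"query": {"pages": [{"langlinks": [{"lang": "en", "title": "..."}]}]}}
    """
    pages = response.get("query", {}).get("pages", [])
    return {link["lang"] for page in pages for link in page.get("langlinks", [])}


def next_step(same_subject, langs_a, langs_b):
    """Mirror the three cases in the instructions above (labels are hypothetical)."""
    if not same_subject:
        return "click-no"
    if not langs_a or not langs_b:
        return "add-links"
    return "check-source-for-explicit-links"
```

Fetching the URL and parsing the JSON is left out on purpose; any HTTP client will do. Whether two articles are about the same subject still has to be decided by a human.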

That’s it. It may get a tad complicated for some cases, but if you ask me, it’s a lot of fun.


Serbian Spam

I always celebrate when I receive spam in a language in which I haven’t yet received spam. I just received spam in Serbian for the first time. It was in the Cyrillic alphabet; Serbian can also be written in Latin, and it is frequently done in Serbia, possibly even more frequently than in Cyrillic, even though the government prefers Cyrillic.

This makes me wonder: Is Serbian in Cyrillic popular and important enough for spamming in it, or did the silly spammer just use Google Translate to translate to Serbian and get the result in Cyrillic, because that’s what Google Translate does?

If you know Serbian, can you please tell me whether it looks real or machine-translated? Words like “5иеарс” and the spaces before the punctuation marks give me a strong suspicion that it’s machine translation, but I might be wrong.

Молим вас за попустљивост за нежељене природи овог писма , али је рођена из очаја и тренутног развоја . Молимо носе са мном . Моје име је сер Алекс Бењамин Хубертревизор Африке развојне банке открио постојећи налог за успавану 5иеарс .

Када сам открио да није било ни наставак ни исплате са овог рачуна на овог дугог периода и наши банкарских закона предвиђа да ће било неупотребљивим чине више од 5иеарс иду на банковни прихода као неостварен фонда .

Ја сам се распитивала за личне депонента и његове најближе , али нажалост ,депонент и његове најближе преминуо на путу до Сенегала за тајкун , а он је оставио иза себе нема тело за ову тврдњу само сам направио ову истрагу само да буде двоструко сигурни у ту чињеницу , а пошто сам био неуспешан у лоцирању родбину .

So, how does it look? And do you receive Serbian spam? Thanks.
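
The two signs I mentioned, spaces before punctuation marks and digits glued to Cyrillic letters as in “5иеарс”, are easy to count mechanically. Here is a rough sketch in Python; the function name and the choice of these two patterns are my own assumptions, and a high count is only a hint, not proof of machine translation.

```python
import re

# Whitespace immediately followed by a punctuation mark, as in "писма ,".
SPACE_BEFORE_PUNCT = re.compile(r"\s+[,.;:!?]")
# A digit directly next to a Cyrillic letter, as in "5иеарс".
DIGIT_GLUED_TO_CYRILLIC = re.compile(r"\d[\u0400-\u04FF]|[\u0400-\u04FF]\d")


def suspicious_signs(text):
    """Count the two oddities in a piece of text."""
    return {
        "space_before_punctuation": len(SPACE_BEFORE_PUNCT.findall(text)),
        "digit_glued_to_letters": len(DIGIT_GLUED_TO_CYRILLIC.findall(text)),
    }
```

This says nothing about whether the grammar is natural, which is the part that only a Serbian speaker can judge.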

A Relevant Tower of Babel

The Tower of Babel is frequently used as a symbol of foreign languages. For example, several language software packages are named after it, such as the Babylon electronic dictionary, MediaWiki’s Babel extension and the Babelfish translation service (itself named after the Babel fish from The Hitchhiker’s Guide).

In this post I shall use the Tower of Babel in a somewhat more relevant and specific way: It will speak about multilingualism and about Babel itself.

This is how most people saw the Wikipedia article about the Tower of Babel until today:

The Tower of Babel article. Notice the pointless squares in the Akkadian name. They are called "tofu" in the jargon of internationalization programmers.

And this is how most people will see it from today:

And we have the name written in real Akkadian cuneiform!

Notice how the Akkadian name now appears as actual Akkadian cuneiform, and not as meaningless squares. Even if you, like most people, cannot actually read cuneiform, you probably understand that showing it this way is more correct, useful and educational.

This is possible thanks to the webfonts technology, which was enabled on the English Wikipedia today. It has already been enabled for many months in Wikipedias in some other languages, mostly languages of India, which have severe problems with font support in the common operating systems, but now it’s available in the English Wikipedia, where it mostly serves to show parts of text that are written in exotic scripts.

The current iteration of the webfonts support in Wikipedia is part of a larger project: the Universal Language Selector (ULS). I am very proud to be one of its developers. My team in Wikimedia developed it over the last year or so, during which it underwent a rigorous process of design, testing with dozens of users from different countries, development, bug fixing and deployment. In addition to webfonts it provides an easy way to pick the user interface language, and to type in non-English languages (the latter feature is disabled by default in the English Wikipedia; to enable it, click the cog icon near “Languages” in the sidebar, then click “Input” and “Enable input tools”). In the future it will provide even more abilities, so stay tuned.

If you edit Wikipedia, or want to try editing it, one way in which you could help with the deployment of webfonts would be to make sure that all foreign strings in Wikipedia are marked with the appropriate HTML lang attribute; for example, that every Vietnamese string is marked as <span lang="vi" dir="ltr">. This will help the software apply the webfonts correctly, and in the future it will also help spelling and hyphenation software, etc.
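
To make the markup concrete, here is a small Python sketch that wraps a foreign string in such a span. The helper name and the short list of right-to-left language codes are assumptions made for this example; a real tool would need a complete list.

```python
from html import escape

# Illustrative, incomplete set of right-to-left language codes (an assumption).
RTL_LANGS = {"ar", "fa", "he", "ur", "yi"}


def mark_lang(text, lang):
    """Wrap text in a span carrying the lang and dir attributes."""
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return (
        f'<span lang="{escape(lang, quote=True)}" dir="{direction}">'
        f"{escape(text)}</span>"
    )


print(mark_lang("Tiếng Việt", "vi"))
```

The text itself is HTML-escaped, so a stray `<` or `&` in the foreign string does not break the page.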

This wouldn’t be possible without the help of many, many people. The developers of Mozilla Firefox, Google Chrome, Safari, Microsoft Internet Explorer and Opera, who developed the support for webfonts in these browsers; The people in Wikimedia who designed and developed the ULS: Alolita Sharma, Arun Ganesh, Brandon Harris, Niklas Laxström, Pau Giner, Santhosh Thottingal and Siebrand Mazeland; The many volunteers who tested ULS and reported useful bugs; The people in Unicode, such as Michael Everson, who work hard to give a number to every letter in every imaginable alphabet and make massive online multilingualism possible; And last but not least, the talented and generous people who developed all those fonts for the different scripts and released them under Free licenses. I send you all my deep appreciation, as a developer and as a reader of Wikipedia.

Hugo Chávez Is Still Not Dead

There are articles about Chávez in Wikipedias in ninety-six languages. He’s still not dead according to thirteen of them:

  1. Cantonese (about the language) – FIXED
  2. Central Bikol (about the language) – FIXED
  3. Ido (about the language) – FIXED
  4. Ladino (about the language) – FIXED
  5. Min Nan (about the language)
  6. Ossetic (about the language) – FIXED
  7. Papiamento (about the language) – FIXED
  8. Samogitian (about the language) – FIXED
  9. Sicilian (about the language) – FIXED
  10. Somali (about the language)
  11. Upper Sorbian (about the language) – FIXED
  12. Võro (about the language) – FIXED
  13. Walloon (about the language) – FIXED

Looking at the different language Wikipedias often brings about other useful things. For example, Chávez’ death date was marked in the Manx Wikipedia, but the name of the month of March was spelled incorrectly, so I corrected it. In the Russian Wikipedia I noticed that the banner that invites people to Wikimania 2013 in Hong Kong is translated incorrectly, and I corrected it.

If you know one of the above languages, consider adding the death date of Hugo Chávez to the articles, and writing some other things there, too. Millions of people will appreciate your contribution.

Yakutsk 2012

When I was about five years old, I saw a map of the world on the wall of my Moscow home. I noticed that the USSR is very, very big. And that it has a lot of rivers, like Ob, Yenisey, and Lena. “Lena”, I thought, “How nice. Like a name of a girl.”

On the Lena river I saw a city called Yakutsk. The name sounded a bit funny to me, but I became curious about it somehow.

And last month I went there.

Yakutsk is the capital of the Sakha Republic, also known as Yakutia – the largest administrative region in the world that is not a country. The largest native ethnic group of Sakha, after which the republic is named, speak a Turkic language of the same name, although it is also frequently called “Yakut”. Even though I spent almost all of my Soviet life in Moscow, I was always very curious about all the other regions and languages of the USSR, so when I discovered Wikipedia, I devoted a lot of time to reading about them and to visiting Wikipedias in these languages, even though I cannot really read them.

A request to start a Wikipedia in Sakha was filed in 2006, and I was quick to support it. After a few months of preparations it was opened. It is now one of the relatively more active Wikipedias in languages of Russia – it has over 8,000 articles, and for a minority language, most speakers of which are bilingual in another major language, this is a good number.

I kept constant and positive contact with Nikolai Pavlov – the founder and the unofficial leader of the Sakha Wikipedia – since the very start of this Wikipedia. It was great to give these people technical and organizational advice: how to write articles effectively, how to choose topics, how to organize meet-ups of Wikipedians. For a long time I dreamt of meeting them in person, but because Yakutsk is so far away from practically any other imaginable place, I didn’t think that it would ever happen. But in April 2012 I met Nikolai at the Turkic Wikimedia Conference in Almaty, Kazakhstan.

A few days after that conference Nikolai suggested that I submit a talk for an IT conference in the North-Eastern Federal University in Yakutsk. At first I thought that I’m not really related to it, but after reading the description, I decided to give it a try and wrote a talk proposal about my favorite topics: MediaWiki and Software Localization. Somewhat surprisingly, the talks were accepted and I received an invitation to present at that conference.

With Nikolai Pavlov, also known as Halan Tul. The unofficial leader of the Sakha Wikipedia and the excellent organizer of my trip to Yakutsk.

I flew from Tel-Aviv to Moscow, and then six more hours from Moscow to Yakutsk. Yakutsk is apparently a modern, bustling and developed city, but with interesting twists. Most notably, because it is in the permafrost area, all the houses are built on piles and all the pipelines are above ground. But actually this is just a small detail, because the general feeling was that it was a whole different country from the European part of Russia, to which I was used, and in a very good way.

I am standing on a new bridge being built

I was most pleasantly surprised by the liveliness of the Sakha language: practically all people there know Russian, but the Sakha speech is frequently heard on the streets, Sakha writing is frequently seen on advertising and store signs, and Sakha songs are played from many passing cars.

Myself standing in front of a classroom, speaking about MediaWiki

Speaking about MediaWiki in Yakutsk

The conference was very varied – with presenters from South Korea, China, Bulgaria, Switzerland and major Russian cities – Moscow, St. Petersburg and others. The topics were very varied, too, but the central topic was using computer technologies for education and human development, so I felt that my talks about Wikipedia and software localization were fitting.

I am standing holding a microphone in front of an audience in a university auditorium. Behind me - a screen with a GNU head, the logo of the Free Software Foundation.

Presenting my main plenary lecture about software localization. One of my main points is that Free Software, represented by the GNU head, is very easy to internationalize.

Besides participating in the conference itself, I also attended many meetings that Nikolai organized for me. It was fascinating to meet all these people.

Meeting the manager of Bichik, the national book publisher. On the wall - portraits of notable Sakha writers.

I spoke to the editor and the manager of the republic’s largest book publishing company – they told me that the local literature has great artistic value, but since fewer than half a million people speak this language, it’s hard to earn a lot of profit from it and to develop it. They also complained that some authors – as well as some deceased authors’ families – are too harsh about copyrights. I suggested that they try talking with authors and releasing some works under a Creative Commons license to see whether it gets them more exposure, and they promised to read Lawrence Lessig’s “Free Culture” book.

I am sitting in a classroom and speaking to a group of about ten people.

Meeting Yakutsk linguists and explaining to them how putting their works on Wikipedia will make them much more accessible to the whole world.

I also met with linguists from the university, who work on researching and documenting the Sakha language and other languages of the region, such as Evenki and Yukagir. I suggested that they use Wikimedia resources for storage and documentation of the works they gather, and they liked the idea; I am definitely going to follow up with them on that.

In the offices of, with the manager of the company - and a Kanban board in the background.

Another great meeting I had was with local tech people – a community of proud local IT geeks, who had lots of ideas for promoting Wikipedias in regional languages, and also the management and the employees of the local Internet portal. Their offices look just like a building of a hi-tech company in Silicon Valley or in Israel – with cozy rooms and lounges, and a Kanban board. The people made an excellent impression on me, too: we had a very professional and engaging conversation about developing web applications and agile management methodologies.

I am sitting on a couch and the TV crew prepare my microphone for the interview

Preparing for an interview at NVK, the national TV station

I also spoke to several journalists and to the local TV and radio stations, inviting people to read Wikipedia in their own language and to contribute to it. I felt a bit like a celebrity, and well, I hope that it made somebody realize how effective the Internet can be in promoting local cultures and how proud people should be of their own languages.

One last comment is about the Sakha literature, which I mentioned earlier. I return from almost all my trips abroad with a lot of books about the local languages and cultures. And I actually read them. It happened on this trip, too, except this time most of the books were given to me as gifts by all those very nice people that I met. Sakha prose and Olonkho poetry in Russian translation are simply wonderful. In all honesty. This is beautiful world-class literature and it deserves more exposure. If this little blog post made you curious about it, then that’s the most important thing that it could achieve.

(All photos were taken by Nikolai Pavlov, except the one in which he appears.)

The Longest Articles

In Wikipedia in every language you can go to a page called “Special:LongPages” and see what the longest articles in that language are.

Some fun facts that I found by random browsing of that page in a few languages:

  • The longest article in the Polish Wikipedia is “Finnish grammar”. It’s 117 pages long in print – basically a book.
  • The longest article in the Telugu Wikipedia is “Adolf Hitler”.
  • The longest article in the Kannada Wikipedia is “History of the SLR camera”. The second longest is “Adolf Hitler”. Kannada is spoken in India near Telugu.
  • The longest article in the Italian Wikipedia is “List of serial killers by number of victims”.
  • The longest article in the Hindi Wikipedia is “History of Australia” – about 50 pages in print. The article “History of India” will take 5 pages in print.
  • The longest articles in Chinese, Japanese and Korean Wikipedias are related to video games.
  • Finally, the longest article in the English Wikipedia is “List of Advanced Dungeons & Dragons 2nd edition monsters”.
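
If you want to repeat this browsing yourself, the page address follows the same pattern in every language edition. Here is a tiny Python sketch that builds the addresses for the Wikipedias mentioned above; the helper name is made up for this example, and the language codes are the standard subdomain codes of those editions.

```python
def long_pages_url(lang_code):
    """Return the Special:LongPages address for one Wikipedia edition."""
    return f"https://{lang_code}.wikipedia.org/wiki/Special:LongPages"


# Polish, Telugu, Kannada, Italian, Hindi, Chinese, Japanese, Korean, English.
for code in ["pl", "te", "kn", "it", "hi", "zh", "ja", "ko", "en"]:
    print(long_pages_url(code))
```

The canonical English name of the special page is used here; each wiki also shows it under a localized title.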

What do the people want? Part 2: Machine translation in their language – Google or Apertium

Another technical issue that bothered many people in the Turkic Wikimedia Conference in Almaty is support for their language in Google Translate. Though this is not directly related to Wikimedia, I was asked about this repeatedly by the participants, as well as by local journalists who interviewed me. Some people even referred to it as a “conspiracy”.


Tilek Mamutov, giving a talk about Google Translate

Luckily, one of the participants was Tilek Mamutov, a Google employee from Kyrgyzstan, and he delivered a whole talk about it. His main message was that there is no conspiracy, and that to support more languages Google mostly needs to process as many texts as possible in that language, preferably with a parallel translation. There are far fewer digital texts in languages like Kyrgyz and Bashkir than there are in German and Spanish, so it is not yet possible.

However, there is hope: a group of volunteers in Kyrgyzstan is working on creating a database of digital translated texts with the specific goal of making it usable in Google Translate. WikiBilim, the Kazakh association that organized the conference, works on a similar initiative, too.

For my part, I suggested a convenient way to gather texts in these languages: to upload literature in them to Wikisource. I also mentioned the existence of Apertium. Apertium is a Free machine translation engine, which can be adapted to any language. It was developed in Valencia, and the first languages that it started to support are languages that are relevant for Spain: Spanish, Catalan, Basque, English and also the closely-related Esperanto, and it translates between them quite well. It supports a few other languages, too.

And it can support even more languages. Like Google Translate, it also needs as many digital texts as possible to actually start working, and it also needs dictionaries and tables of grammar rules, because it tries several methodologies for translation. Work has already begun for Turkish-Azeri and Turkish-Kyrgyz, and there are projects for Turkish-Chuvash and other language pairs. All these projects need people who can test them, contribute words to the dictionaries and check the grammar rules. So if you want to help complete a Free Turkish-Azeri machine translation system or to create an English-Kyrgyz translation system, contact the Apertium project.

To be continued…

Oh (edit): A correction came from Apertium developers: Apertium *doesn’t* need any texts, except for testing purposes. The more texts we have, the more we can test, of course, but above all, we need native speakers of languages who understand the grammar of the languages they’re working on and can work with computational formalisms.