Archive for the 'localization' Category

The Case for Localizing Names, part 2

My name is written Amir Elisha Aharoni in English. In Hebrew it’s אמיר אלישע אהרוני, in Russian it’s Амир Элиша Аарони, in Hindi it’s अमीर एलिशा अहरोनि. It could be written in hundreds of other languages in many different ways.

More importantly, if I fill a form in Hebrew, I should write my name in Hebrew and not in English or in any other language.

Based on this simple notion, I wrote a post a year ago in support of localizing people’s names. I basically suggested that it should be possible to have a person’s name written in more than one language in social networks, in the “from” and “to” fields in email, and in any other relevant place. Facebook allows doing this, but in a very rudimentary way; for example, the number of possible languages is very limited.

Today I am participating in the Open Source Language Summit in the Red Hat offices in Pune. Here we have, among many other talented and interesting people, two developers from the Mifos project, which creates Free software for microfinance. Mifos is being translated in translatewiki.net, a software translation site of which I am one of the developers.

Nayan Ambali, one of the Mifos developers, told me that they actually plan to implement a name localization feature in their software. This is not related to software localization, where a pre-defined set of strings is translated. It is something to be translated by the users of Mifos itself. The particular reason why Mifos needs such a feature comes from its nature as microfinance software: financial documents must be filled in the language of each country for legal purposes. Therefore, a Mifos user in the Indian state of Karnataka may need to have her name written in the software in English, Hindi, and Kannada – different languages, which are needed in different documents.

A simple sketch of database structure for storing names in multiple languages

Such a feature is quite simple to implement. In the backend this means that the name must be stored in a separate table that will hold names in different languages; see the sketch I made with Nayan above. On the frontend it will need a widget for adding names in different languages, similar to the one that Wikidata has; see the screenshot below.
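As a rough illustration of the backend part, here is a minimal sketch using Python and SQLite. The table and column names (person, person_name, lang) and the fallback-to-English rule are my assumptions for the example; they are not taken from the Mifos code or from the sketch above.

```python
import sqlite3

def create_schema(conn):
    # One row per person, and one row per (person, language) pair.
    # All table and column names here are hypothetical.
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY);
        CREATE TABLE person_name (
            person_id INTEGER NOT NULL REFERENCES person(id),
            lang      TEXT    NOT NULL,  -- language code, e.g. 'en', 'hi', 'kn'
            name      TEXT    NOT NULL,
            PRIMARY KEY (person_id, lang)
        );
    """)

def set_name(conn, person_id, lang, name):
    # Add or overwrite the person's name in one language.
    conn.execute(
        "INSERT OR REPLACE INTO person_name (person_id, lang, name) "
        "VALUES (?, ?, ?)",
        (person_id, lang, name),
    )

def get_name(conn, person_id, lang, fallback="en"):
    # Return the name in the requested language, or in the fallback
    # language if no name was entered for the requested one.
    for code in (lang, fallback):
        row = conn.execute(
            "SELECT name FROM person_name WHERE person_id = ? AND lang = ?",
            (person_id, code),
        ).fetchone()
        if row is not None:
            return row[0]
    return None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO person (id) VALUES (1)")
    set_name(conn, 1, "en", "Amir Elisha Aharoni")
    set_name(conn, 1, "he", "אמיר אלישע אהרוני")
    print(get_name(conn, 1, "he"))
    print(get_name(conn, 1, "kn"))  # no Kannada name yet, falls back to English
```

Keeping the names in a separate table means that adding a language never requires changing the schema; it is just another row.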

The name of Steven Spielberg in many languages in Wikidata, with an option to add more languages

Of course, there’s also the famous problem of falsehoods that programmers believe about names, but this would be a good first step that can provide a good example to other programs.

A Relevant Tower of Babel

The Tower of Babel is frequently used as a symbol of foreign languages. For example, several language software packages are named after it, such as the Babylon electronic dictionary, MediaWiki’s Babel extension and the Babelfish translation service (itself named after the Babel fish from The Hitchhiker’s Guide).

In this post I shall use the Tower of Babel in a somewhat more relevant and specific way: It will speak about multilingualism and about Babel itself.

This is how most people saw the Wikipedia article about the Tower of Babel until today:

The Tower of Babel article. Notice the pointless squares in the Akkadian name. They are called “tofu” in the jargon of internationalization programmers.

And this is how most people will see it from today:

And we have the name written in real Akkadian cuneiform!

Notice how the Akkadian name now appears as actual Akkadian cuneiform, and not as meaningless squares. Even if you, like most people, cannot actually read cuneiform, you probably understand that showing it this way is more correct, useful and educational.

This is possible thanks to the webfonts technology, which was enabled on the English Wikipedia today. It was already enabled in Wikipedias in some languages for many months, mostly in languages of India, which have severe problems with font support in the common operating systems, but now it’s available in the English Wikipedia, where it mostly serves to show parts of text that are written in exotic fonts.

The current iteration of the webfonts support in Wikipedia is part of a larger project: the Universal Language Selector (ULS). I am very proud to be one of its developers. My team in Wikimedia developed it over the last year or so, during which it underwent a rigorous process of design, testing with dozens of users from different countries, development, bug fixing and deployment. In addition to webfonts it provides an easy way to pick the user interface language, and to type in non-English languages (the latter feature is disabled by default in the English Wikipedia; to enable it, click the cog icon near “Languages” in the sidebar, then click “Input” and “Enable input tools”). In the future it will provide even more abilities, so stay tuned.

If you edit Wikipedia, or want to try editing it, one way in which you could help with the deployment of webfonts would be to make sure that all foreign strings in Wikipedia are marked with the appropriate HTML lang attribute; for example, that every Vietnamese string is marked as <span lang="vi" dir="ltr">. This will help the software apply the webfonts correctly, and in the future it will also help spelling and hyphenation software, etc.
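For people who generate such markup from a program rather than by hand, here is a minimal Python sketch of the same idea. The function name lang_span and the short list of right-to-left language codes are assumptions made for the example; the list is far from complete.

```python
from html import escape

# Illustrative only: a few right-to-left language codes, not a full list.
RTL_LANGS = {"ar", "fa", "he", "ur", "yi"}

def lang_span(text, lang):
    """Wrap text in a span that states its language and direction."""
    base = lang.split("-")[0].lower()
    direction = "rtl" if base in RTL_LANGS else "ltr"
    return '<span lang="{}" dir="{}">{}</span>'.format(
        escape(lang, quote=True), direction, escape(text)
    )

if __name__ == "__main__":
    print(lang_span("Tiếng Việt", "vi"))
    print(lang_span("עברית", "he"))
```

The text is escaped before it is wrapped, so a foreign string that happens to contain angle brackets cannot break the surrounding HTML.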

This wouldn’t be possible without the help of many, many people. The developers of Mozilla Firefox, Google Chrome, Safari, Microsoft Internet Explorer and Opera, who developed the support for webfonts in these browsers; The people in Wikimedia who designed and developed the ULS: Alolita Sharma, Arun Ganesh, Brandon Harris, Niklas Laxström, Pau Giner, Santhosh Thottingal and Siebrand Mazeland; The many volunteers who tested ULS and reported useful bugs; The people in Unicode, such as Michael Everson, who work hard to give a number to every letter in every imaginable alphabet and make massive online multilingualism possible; And last but not least, the talented and generous people who developed all those fonts for the different scripts and released them under Free licenses. I send you all my deep appreciation, as a developer and as a reader of Wikipedia.

Always define the language and the direction of your HTML documents, part 02: Backwards English

In part 01 of this series, I showed why it is important to always define the language and the direction of all HTML content and not rely on the defaults: The content may get embedded in a document with a different direction and be displayed incorrectly.

This issue is laughably easy to avoid: If you are writing the content, you are supposed to know in what language it is written, so if it’s English, just write <html lang="en" dir="ltr"> even though these seem to be the defaults. Nineteen or so characters that ensure your content is readable and not displayed backwards. Please do it always and tell all your friends to do it.

The problem is that you don’t only have to explicitly set the language and the direction, but, as silly as it sounds, you have to set them correctly, too. A more subtle, but nevertheless quite frequent and disruptive bug is displaying presumably, but not actually, translated content in a different direction. This happens quite frequently when a website supports the browser language detection feature, known as Accept-Language:

  1. The web server sees that the browser requests content in Hebrew.
  2. The web server sends a response with <html lang="he" dir="rtl">, but because the website is not actually translated, the text is shown in the fallback language, which is usually English.
  3. The user sees the content just like this numbered list, which I intentionally set to dir="rtl": with the numbers and the punctuation on the wrong side, and possibly invisible, because English is not a right-to-left language.

Of course, it can go even worse. Arrows can point the wrong way and buttons and images can overlap and hide each other, rendering the page not just hard to read, but totally unusable.

This bug is also an example of the Software Localization Paradox: It manifests itself when Accept-Language is not English, but most developers install English operating systems and don’t bother to change the preferred language settings in the browser, so they never see how this bug manifests itself. The site developers don’t bother to test for it either.

The solution, of course, is to set a different language and direction only if the site is actually translated, and not to pretend that it’s translated if it’s not.
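Here is a minimal Python sketch of that rule: pick the lang and dir values from the Accept-Language header only among the languages into which the site is really translated, and otherwise fall back to the default language with its own direction. The function names, the tiny list of right-to-left languages and the simplified header parsing are assumptions made for the example; a real site would probably use the negotiation tools of its web framework.

```python
# Illustrative only: a few right-to-left language codes, not a full list.
RTL_LANGS = {"ar", "fa", "he"}

def parse_accept_language(header):
    """Return language tags from the header, most preferred first."""
    items = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        params = params.strip()
        quality = 1.0
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by quality (descending), then by position in the header.
            items.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(items)]

def choose_html_attrs(header, translated, default="en"):
    """Return (lang, dir) for a language the site is actually translated to."""
    lang = default
    for tag in parse_accept_language(header):
        base = tag.split("-")[0]
        if base in translated:
            lang = base
            break
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return lang, direction

if __name__ == "__main__":
    # The browser asks for Hebrew, but the site exists only in English.
    lang, direction = choose_html_attrs("he,en-US;q=0.8,en;q=0.6", {"en"})
    print('<html lang="{}" dir="{}">'.format(lang, direction))
```

The point is that the direction is derived from the language in which the content is actually sent, and never from the language that the browser merely asked for.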

Here are two examples of such brokenness. Both sites are important and useful, but hard to use for people whose Accept-Language is Hebrew, Persian or Arabic.

Here’s how the Mozilla Developer Network website looks in fake Hebrew:

Mozilla Developer Network website, in English, but right-to-left

Notice how the full stops are on the left end and how the text overlaps the images in the tiles on the right-hand side. This is how it is supposed to look, more or less:

Mozilla Developer Network home page in English, left-to-right

I manually changed dir="rtl" to dir="ltr" using the element inspector from Firefox’s developer tools and I also had to tweak a CSS class to move the “mozilla” tab at the top.

The above troubles are reported as bug 816443 – lang and dir attributes must be used only if the page is actually translated.

After showing an example of a web development bug from a site for, ahem, web developers, here is an even funnier example: The home page of Unicode’s CLDR. That’s right: Unicode’s own website shows text with incorrect direction:

The Unicode CLDR website, in English but right-to-left

The only words translated here are “Contents” (תוכן) and “Search this site” (חיפוש באתר זה), which is not so useful. The rest is shown in English, and the direction is broken: Notice the strange alignment of the content and the schedule table. A few months ago that table was so broken that its content wasn’t visible at all, but that was probably patched.

Here’s how it is supposed to look:

The CLDR home page in English, appropriately left-to-right

I tried reporting the CLDR home page direction bug, but it was closed as “out-of-scope”: The CLDR developers say that the Google Sites infrastructure is to blame. This is frustrating, because as far as I know Google Sites doesn’t have a proper bug reporting system and all I can do is write a question about that direction problem in the Google Sites forum and hope that somebody notices it or poke my Googler friends.

One thing that I will not do is switch my Accept-Language to English. Whenever I can, I don’t just want to see the website correctly, but to try to help my neighbor: see the possible problems that can affect other users who use a different language. Somebody has to break the Software Localization Paradox.

Web sight

Because of some not-so-interesting technical reasons I ended up on the mailing list for reporting bugs in Wikipedia’s mobile app (please see the disclaimer at the end).

Reading real Wikipedia readers’ reactions is fascinating.

A lot of the emails there are just empty. People just press the button to report a problem and don’t actually write anything at all.

Sometimes they are just slightly less than empty. For example, quite a lot of people write things like “When will you fix your stupid app already???!?!!”. This may seem pointless and unconstructive, but actually these people think that there is context to what they say, because they see complaints from other people at Google’s or Apple’s app store and they assume that the app’s maintainers are aware of them. Some people also threaten to give the app a low rating in the app store; it’s not really wrong, but it’s not very helpful either.

A lot of the emails are about connectivity problems in Android 2.2.2 and about screen rotation problems on iPad. The developers are aware of both issues and are working on them.

And a whole lot of reports suggest fixes in content, rather than technical problems. Some of them are pointless, for example “The facts on this web sight is wrong and i want it changed to the corrected statement”. It never occurred to that person that it would be helpful to say what information is wrong or what should be written there (it can also be a troll). And some people do make useful suggestions. For example, one person reported that Obama didn’t write “How the Grinch Stole Christmas“. The report was correct: somebody indeed vandalized the article about the children’s book and wrote that its author is Obama. It was an easy fix, so I just fixed it myself and replied, thanking the person for the report and saying that in the future she can fix it herself by pressing the “edit” button.

If I see that fixing the problem will take more than a minute, I just reply with “you can fix it yourself”. This does make me think that a more robust way of telling people that they can fix the problems themselves is needed.


All these issues aside, there is something truly wonderful about this app: People write these emails in their language without caring at all about who will read them. Reporting a bug in Bugzilla is hard for many reasons, one of which is certainly the language. But the app gives the user a completely localized experience, so the users don’t think twice before sending a bug report in their language.

And this is a good thing. Some People from Some Companies told me explicitly that they give up on processing reports from too many people in too many languages; not Wikimedia. Wikimedia may acknowledge that it’s hard, Wikimedia won’t commit to replying to each email, but Wikimedia wouldn’t just shut it down and ignore it completely, either. We would rather think about more efficient ways to get volunteers to reply to people efficiently or to help people fix the issues themselves – that’s what the whole “wiki” idea is about in the first place.


(Important disclaimer: I am involved with this mailing list as a volunteer. It has nothing to do with the paid work that I do for the Wikimedia Foundation. I do not officially represent the Foundation in any actions that I take with regard to that mailing list.)

The Case for Localizing Names

I often help my friends and family members open email accounts. Sometimes they are starting to use the Internet and sometimes they move from old email services (Yahoo, Walla!, ISP) to something modern (like it or not, GMail).

At some point they have to fill in their name, which will appear in the “from” field. And then I have to suggest that they write it in Latin characters, even though most of them speak languages that aren’t written in Latin characters – mostly Hebrew and Russian. Chances are that some day they will send an email to somebody who cannot read Russian or Hebrew, and Latin is relatively better known.

Only relatively, though. It may seem obvious to you that everybody knows the Latin script, but in fact, a lot of people are not comfortable with it at all. There are also other complications: lossy and inconsistent transliteration rules (is Amir אמיר or עמיר?), potential right-to-left rendering problems, and more. And of course, all people are happy to see their name in their language.

And people are also happy to see their friends’ names in their own language and not in a foreign or a neutral language. I have, for example, a lot of friends in India. Most of them write their names in English, but some write it in Marathi or in Malayalam. It’s certainly good for them, but in practice it’s much harder for me to find them this way, so English would be better – but Hebrew or Russian would be better yet.

Finally, there are a lot of people in the world who have more than one linguistic background. Mine are Russian, Hebrew and English, and I am really not such a special case. There are many millions of immigrants who have mixed backgrounds: Punjabi-Hindi-Urdu-English, Kurdish-Turkish-German, Kazakh-Russian-Norwegian, and others, and others and others. From each of these backgrounds they have friends, co-workers and family members, with whom they would love to communicate in the respective language. In each of these backgrounds they have friends who would want to find them using the name under which they know them there and using the appropriate language and writing system.

And sometimes people change their names, too. I did once, and so have many other people.

All this means that people’s names should be translatable, just like books, articles and software interfaces. Facebook and Google+ allow me to add a very limited number of names in foreign languages. Why wouldn’t they let me write my name in four, five, ten languages? This would make it easier for people who speak these languages to find me and to communicate with me. I would go even further and allow people who speak languages that I don’t know well to write my name as they hear it in their language and to add it to my details. Yet again, this would make me easier to find for even more people.

Some degree of automation can be possible. A lot of names are, after all, repetitive, so social networks would be able to suggest to people with common names how their name would be written in other languages.

Wikipedia is actually quite good in this regard: Usually people have the same username across projects, and this username is not necessarily written in Latin letters, but people can customize the appearance of their signature in each project. I did it in a few languages, and people who speak those languages appreciate it.

I can only hope that social networks and email systems will allow as much flexibility as possible with this.

English typing computer

I’m in an Internet cafe in Mumbai. I tried to install Firefox with the Marathi interface, but on the computers here fonts for languages of India are not installed. That’s right – on computers in India fonts for languages of India are not installed. Hence, installing Firefox in Marathi failed at the very first stage, because the fonts are needed for the installation wizard.

Actually, I’m not surprised that these fonts are not installed, because it’s not my first time in India. I know that it happens a lot in this country. I would install them, but I don’t have permission.

I find it incredibly weird – and tragic – that so many people in India don’t even try to use computers in any language except English. The one curious thing that I did find was an “English typing computer” shop. It’s just a place where you can use a computer to write Word documents in Hindi or Marathi, but using an English-based transliteration keyboard rather than the standard Indian Devanagari InScript keyboard, because they find transliteration keyboards easier. Of course, they could just install such a keyboard layout on their computers… but they prefer to go to an “English typing computer” shop.

We, software internationalization people, have so much more work to do.

Always define the language and the direction of your HTML documents, part 01

I received this email from Safari Books Online:

Email in English from Safari Books, oriented like Hebrew

The email is written in English, but notice how the text is aligned unusually to the right. Notice also that the punctuation marks appear at the wrong end of the sentence. I used Firefox developer tools to apply the correct direction, and saw it correctly:

The same email, with corrected left-to-right formatting using Firefox developer tools

This happens because I use GMail with the Hebrew interface. GMail has to guess the direction of the emails that I receive, because in plain text there’s no easy way to specify the direction (I hope to discuss it in a separate post soon). Usually GMail guesses correctly. Ironically, for HTML-formatted emails like this one, GMail often guesses incorrectly, even though in HTML, unlike in plain text, it’s quite easy to specify the direction by simply adding dir="ltr" to the root element of the email.

Unfortunately a lot of HTML authors don’t bother to specify explicit direction. Many are not even aware of this exotic dir attribute. Others think that because “ltr” is the default, they don’t have to specify it. They are wrong: As this email shows, the left-to-right HTML content is embedded in a right-to-left environment, and the “rtl” definition propagates to the embedded content.

You could blame GMail, of course, but it’s much more practical to always define the direction of your HTML content, even if it’s the default. You can never know where your content will end up.
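For those who send HTML email from a program, here is a minimal sketch in Python, using the standard email package, of a message that states its language and direction explicitly. The function name build_html_email is made up for the example. Repeating the attributes on a wrapping div is my own precaution, in case a mail client discards the root element when it embeds the message; I have not verified how any particular client behaves.

```python
from email.message import EmailMessage

def build_html_email(subject, body_html, lang="en", direction="ltr"):
    """Build an email whose HTML part declares its language and direction."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    html = (
        '<html lang="{0}" dir="{1}">'
        '<body><div lang="{0}" dir="{1}">{2}</div></body>'
        "</html>"
    ).format(lang, direction, body_html)
    msg = EmailMessage()
    msg["Subject"] = subject
    # Plain-text alternative for clients that do not show HTML.
    msg.set_content("This message is best viewed as HTML.")
    msg.add_alternative(html, subtype="html")
    return msg

if __name__ == "__main__":
    message = build_html_email("Your order", "<p>Thank you for your order.</p>")
    print(message.get_body(preferencelist=("html",)).get_content())
```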

P.S.: I read this post before publishing and suddenly realized that its style is quite similar to “Best Practices” books, such as Damian Conway’s classic “Perl Best Practices” – it tells you to do something that is not obviously needed, and explains why it is needed nevertheless. I like to acknowledge sources of inspiration. Thank you, Damian.

Yakutsk 2012

When I was about five years old, I saw a map of the world on the wall of my Moscow home. I noticed that the USSR is very, very big. And that it has a lot of rivers, like Ob, Yenisey, and Lena. “Lena”, I thought, “How nice. Like a name of a girl.”

On the Lena river I saw a city called Yakutsk. The name sounded a bit funny to me, but I became curious about it somehow.

And last month I went there.


Yakutsk is the capital of the Sakha Republic, also known as Yakutia – the largest administrative region in the world that is not a country. The largest native ethnic group of Sakha, after which the republic is named, speak a Turkic language of the same name, although it is also frequently called “Yakut”. Even though I spent almost all of my Soviet life in Moscow, I was always very curious about all the other regions and languages of the USSR, so when I discovered Wikipedia, I devoted a lot of time to reading about them and to visiting Wikipedias in these languages, even though I cannot really read them.

A request to start a Wikipedia in Sakha was filed in 2006, and I was quick to support it. After a few months of preparations it was opened. It is now one of the relatively more active Wikipedias in languages of Russia – it has over 8,000 articles, and for a minority language, most speakers of which are bilingual in another major language, this is a good number.

I kept constant and positive contact with Nikolai Pavlov – the founder and the unofficial leader of the Sakha Wikipedia – since the very start of this Wikipedia. It was great to give these people technical and organizational advice: how to write articles effectively, how to choose topics, how to organize meet-ups of Wikipedians. For a long time I dreamt of meeting them in person, but because Yakutsk is so far away from practically any other imaginable place, I didn’t think that it would ever happen. But in April 2012 I met Nikolai at the Turkic Wikimedia Conference in Almaty, Kazakhstan.

A few days after that conference Nikolai suggested that I submit a talk for an IT conference in the North-Eastern Federal University in Yakutsk. At first I thought that it was not really relevant to me, but after reading the description, I decided to give it a try and wrote talk proposals about my favorite topics: MediaWiki and Software Localization. Somewhat surprisingly, the talks were accepted and I received an invitation to present at that conference.

With Nikolai Pavlov, also known as Halan Tul. The unofficial leader of the Sakha Wikipedia and the excellent organizer of my trip to Yakutsk.

I flew from Tel-Aviv to Moscow, and then six more hours from Moscow to Yakutsk. Yakutsk is apparently a modern, bustling and developed city, but with interesting twists. Most notably, because it is in the permafrost area, all the houses are built on piles and all the pipelines are above ground. But actually this is just a small detail, because the general feeling was that it was a whole different country from the European part of Russia, to which I was used, and in a very good way.

I am standing on a new bridge being built

I was most pleasantly surprised by the liveliness of the Sakha language: practically all people there know Russian, but the Sakha speech is frequently heard on the streets, Sakha writing is frequently seen on advertising and store signs, and Sakha songs are played from many passing cars.

Myself standing in front of a classroom, speaking about MediaWiki

Speaking about MediaWiki in Yakutsk

The conference was very varied – with presenters from South Korea, China, Bulgaria, Switzerland and major Russian cities – Moscow, St. Petersburg and others. The topics were very varied, too, but the central topic was using computer technologies for education and human development, so I felt that my talks about Wikipedia and software localization were fitting.

I am standing holding a microphone in front of an audience in a university auditorium. Behind me - a screen with a GNU head, the logo of the Free Software Foundation.

Presenting my main plenary lecture about software localization. One of my main points is that Free Software, represented by the GNU head, is very easy to internationalize.

Besides participating in the conference itself, I also attended many meetings that Nikolai organized for me. It was fascinating to meet all these people.

Meeting the manager of Bichik, the national book publisher. On the wall - portraits of notable Sakha writers.

I spoke to the editor and the manager of the republic’s largest book publishing company – they told me that the local literature has great artistic value, but since fewer than half a million people speak this language, it’s hard to earn a lot of profit from it and to develop it. They also complained that some authors – as well as some deceased authors’ families – are too harsh about copyrights. I suggested that they try to talk with authors and release some works under a Creative Commons license and see whether it gets them more exposure, and they promised to read Lawrence Lessig’s “Free Culture” book.

I am sitting in a classroom and speaking to a group of about ten people.

Meeting Yakutsk linguists and explaining to them how putting their works on Wikipedia will make them much more accessible to the whole world.

I also met with linguists from the university, who work on researching and documenting the Sakha language and other languages of the region, such as Evenki and Yukagir. I suggested that they use Wikimedia resources for storage and documentation of the works they gather, and they liked the idea; I am definitely going to follow up with them on that.

In the offices of Ykt.ru, with the manager of the company - and a Kanban board in the background.

Another great meeting I had was with local tech people – a community of proud local IT geeks, who had lots of ideas for promoting Wikipedias in regional languages, and also the management and the employees of the local Internet portal ykt.ru. Their offices look just like a building of a hi-tech company in the Silicon Valley or in Israel – with cozy rooms and lounges, and a Kanban board. The people made an excellent impression on me, too: we had a very professional and engaging conversation about developing web applications and agile management methodologies.

I am sitting on a couch and the TV crew prepare my microphone for the interview

Preparing for an interview at NVK, the national TV station

I also spoke to several journalists and to the local TV and radio stations, inviting people to read Wikipedia in their own language and to contribute to it. I felt a bit like a celebrity, and well, I hope that it made somebody realize how effective the Internet can be in promoting local cultures and how proud people should be of their own languages.

One last comment is about the Sakha literature, which I mentioned earlier. I return from almost all my trips abroad with a lot of books about the local languages and cultures. And I actually read them. It happened on this trip, too, except this time most of the books were given to me as gifts by all those very nice people that I met. Sakha prose and Olonkho poetry in Russian translation are simply wonderful. In all honesty. This is beautiful world-class literature and it deserves more exposure. If this little blog post made you curious about it, then it’s the most important thing that it could achieve.

(All photos were taken by Nikolai Pavlov, except the one in which he appears.)

Turkic Wikimedia Conference 2012, Almaty: Other Highlights and Summing Up

Other highlights

Of course, the Turkic Wikimedia Conference had many other highlights besides my talks and workshops. Jonas Öberg from Creative Commons delivered a keynote speech about the importance of letting people freely share their works, especially with regard to cultures which are not as well known as the American or the Western European ones, such as that of Kazakhstan. Basically, anybody who is curious about the culture of Kazakhstan will only be able to learn about it from the things that are freely posted online. If it’s gathering dust in the library or locked behind a password in a pay-to-read website, nobody will read it.

Jonas Öberg. By: Ashina. License: CC-BY-SA 3.0.

The Wikimedian Daniel Mietchen, who is an advocate for Open Science, convincingly explained why opening up academic articles and experiments will not just make them cheaper, but also more correct scientifically.

Daniel Mietchen

Daniel also impressed lots of people with his Russian speaking skills: Apparently, he grew up in East Germany, where all children had to study Russian in schools, and he was one of the few children who actually bothered to learn it well. He said that at first he didn’t like to be forced to learn a language that wasn’t useful to him, but when he had to read a book of prose – The Tales of the Late Ivan Petrovich Belkin – as homework, he found it very satisfying, even though it was very hard in the beginning.

Another highlight was a book about editing Wikipedia given to me by one of its authors Irada Alakbarova, a participant from Azerbaijan. It is similar in content and scope to the book written by the French Wikimedians Guillaume Paumier and Florence Devouard, but it’s impressive that Irada is not just an enthusiastic Wikimedian, but also a department head in the Information Technology Institute of the Azerbaijan Academy of Sciences, and the book’s other author Rasim Aliquliyev is the Institute’s director. (In precise Azeri spelling their names are İradə Ələkbərova and Rasim Əliquliyev. The letter Ə is a part of Azerbaijan’s Latin-based writing system, but looks too weird to many English readers.)

İradə Ələkbərova

Irada also told me that some time ago she gathered any information that she could about Wikipedia’s server configuration and used it as an example for teaching configuration of high-performance websites. She was very happy when I told her that the Wikimedia server configuration became even more transparent recently.

Summing up

I have participated in many conferences lately, and this one was unusually satisfying in many ways.

As usual, meeting the people was the best part. This refers both to the people from places like Bashkortostan and Sakha, with whom I had communicated by email for many years, hardly imagining what they look like, and also to people whom I had not known before and who came from countries that I could hardly imagine ever visiting, like Kyrgyzstan and Turkmenistan. The international press mostly reports bad and weird news from these countries, but as often happens, the image created by the media has little to do with the real people – I was stunned by the talent, the originality and the vigor that they demonstrated.

I was not the only one who felt that the conference was a great success, so we have already started to throw around ideas for the location of the next one. Bishkek, Ufa, Baku and Istanbul were suggested, and I would certainly be very happy to go to any of these cities or to meet these wonderful people elsewhere.

Most importantly, this conference left me and the other participants with a long list of exciting tasks to do.

What do the people want? Part 2: Machine translation in their language – Google or Apertium

Another technical issue that bothered many people at the Turkic Wikimedia Conference in Almaty is support for their languages in Google Translate. Though this is not directly related to Wikimedia, I was asked about it repeatedly by the participants, as well as by local journalists who interviewed me. Some people even referred to it as a “conspiracy”.

Tilek Mamutov, giving a talk about Google Translate

Luckily, one of the participants was Tilek Mamutov, a Google employee from Kyrgyzstan, and he delivered a whole talk about it. His main message was that there is no conspiracy, and that to support a language Google mostly needs to process as many texts as possible in that language, preferably with a parallel translation. There are far fewer digital texts in languages like Kyrgyz and Bashkir than there are in German and Spanish, so supporting them is not yet possible.

However, there is hope: a group of volunteers in Kyrgyzstan is working on creating a database of digital translated texts with the specific goal of making it usable in Google Translate. WikiBilim, the Kazakh association that organized the conference, is working on a similar initiative, too.

For my part, I suggested a convenient way to gather texts in these languages: uploading literature in them to Wikisource. I also mentioned the existence of Apertium. Apertium is a Free machine translation engine, which can be adapted to any language. It was developed in Valencia, and the first languages that it started to support are languages that are relevant for Spain: Spanish, Catalan, Basque, English and also the closely-related Esperanto, and it translates between them quite well. It supports a few other languages, too.

And it can support even more languages. Like Google Translate, it also needs as many digital texts as possible to actually start working, and it also needs dictionaries and tables of grammar rules, because it tries several methodologies for translation. Work has already begun on Turkish-Azeri and Turkish-Kyrgyz, and there are projects for Turkish-Chuvash and other language pairs. All these projects need people who can test them, contribute words to the dictionaries and check the grammar rules. So if you want to help complete a Free Turkish-Azeri machine translation system or to create an English-Kyrgyz translation system, contact the Apertium project.

To be continued…


Oh (edit): A correction came from the Apertium developers: Apertium *doesn’t* need any texts, except for testing purposes. The more texts we have, the more we can test, of course, but above all, we need native speakers who understand the grammar of the languages they’re working on and can work with computational formalisms.


