Archive for the 'linguistics' Category

The Case for Localizing Names, part 2

My name is written Amir Elisha Aharoni in English. In Hebrew it’s אמיר אלישע אהרוני, in Russian it’s Амир Элиша Аарони, in Hindi it’s अमीर एलिशा अहरोनि. It could be written in hundreds of other languages in many different ways.

More importantly, if I fill a form in Hebrew, I should write my name in Hebrew and not in English or in any other language.

Based on this simple notion, I wrote a post a year ago in support of localizing people’s names. I basically suggested that it should be possible to have a person’s name written in more than one language in social networks, “from” and “to” fields in email, and in any other relevant place. Facebook allows doing this, but in a very rudimentary way; for example, the number of possible languages is very limited.

Today I am participating in the Open Source Language Summit in the Red Hat offices in Pune. Here we have, among many other talented and interesting people, two developers from the Mifos project, which creates Free software for microfinance. Mifos is being translated in, a software translation site of which I am one of the developers.

Nayan Ambali, one of the Mifos developers, told me that they actually plan to implement a name localization feature in their software. This is not related to software localization, where a pre-defined set of strings is translated. It is something to be translated by the users of Mifos itself. The particular reason why Mifos needs such a feature comes from its nature as microfinance software: financial documents must be filled in the language of each country for legal purposes. Therefore, a Mifos user in the Indian state of Karnataka may need to have her name written in the software in English, Hindi, and Kannada – different languages, which are needed in different documents.

A simple sketch of database structure for storing names in multiple languages

Such a feature is quite simple to implement. In the backend this means that the name must be stored in a separate table that will hold names in different languages; see the sketch I made with Nayan above. On the frontend it will need a widget for adding names in different languages, similar to the one that Wikidata has; see the screenshot below.
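As a rough illustration of the backend part, here is a minimal sketch using Python's built-in sqlite3 module. It is not the actual Mifos design and it does not reproduce the sketch mentioned above: the table names `person` and `person_name`, the column names, and the fallback-to-English rule are all assumptions made up for this example.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical schema: one row per (person, language) pair.
    conn.executescript("""
        CREATE TABLE person (
            id INTEGER PRIMARY KEY
        );
        CREATE TABLE person_name (
            person_id INTEGER NOT NULL REFERENCES person(id),
            language  TEXT NOT NULL,   -- language code such as 'en', 'hi', 'kn'
            name      TEXT NOT NULL,
            PRIMARY KEY (person_id, language)
        );
    """)

def set_name(conn, person_id, language, name):
    # Add the name in this language, or overwrite an existing one.
    conn.execute(
        "INSERT OR REPLACE INTO person_name (person_id, language, name) "
        "VALUES (?, ?, ?)",
        (person_id, language, name),
    )

def get_name(conn, person_id, language, fallback="en"):
    # Look up the name in the requested language; if it is missing,
    # try the fallback language (an assumption of this sketch).
    query = "SELECT name FROM person_name WHERE person_id = ? AND language = ?"
    row = conn.execute(query, (person_id, language)).fetchone()
    if row is None and fallback != language:
        row = conn.execute(query, (person_id, fallback)).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO person (id) VALUES (1)")
set_name(conn, 1, "en", "Amir Elisha Aharoni")
set_name(conn, 1, "he", "אמיר אלישע אהרוני")
set_name(conn, 1, "ru", "Амир Элиша Аарони")

print(get_name(conn, 1, "he"))  # the Hebrew form
print(get_name(conn, 1, "kn"))  # no Kannada name stored, falls back to English
```

Keeping the names in their own table means that adding a language never requires changing the schema; a document generator only has to ask for the language that the document legally requires.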

The name of Steven Spielberg in many languages in Wikidata, with an option to add more languages

Of course, there’s also the famous problem of falsehoods that programmers believe about names, but this would be a good first step that can provide a good example to other programs.


Words for “so”

When I study foreign languages, it’s very important for me to know the words that explain causation. Words like “then”, “therefore”, “so”, “consequently”. And it surprises me that textbooks don’t teach them early. Maybe it says something about how my mind is wired? That to me it’s important to understand the explanations and meanings of everything, and that other people care less about it?

Here are some of those words in a few languages:

  • Catalan: doncs, per això, idò, llavors
  • Spanish: entonces, pues
  • French: donc, puis
  • Russian: поэтому, потому, значит, так что, стало быть
  • Esperanto: do
  • Polish: więc, zatem
  • Italian: perciò, quindi, dunque, pertanto
  • Portuguese: então, por isso, portanto
  • Hebrew: אז, לכן

Can you add it for your language? I tried to find it for Hindi, for example. My textbook (Rupert Snell, “Complete Hindi”) didn’t have anything clear like that in the first few lessons (or maybe I missed it).

I can find some clues at the description of “so” in OmegaWiki, but I’d appreciate human input. Thanks.

A Relevant Tower of Babel

The Tower of Babel is frequently used as a symbol of foreign languages. For example, several language software packages are named after it, such as the Babylon electronic dictionary, MediaWiki’s Babel extension and the Babelfish translation service (itself named after the Babel fish from The Hitchhiker’s Guide).

In this post I shall use the Tower of Babel in a somewhat more relevant and specific way: I will speak about multilingualism and about Babel itself.

This is how most people saw the Wikipedia article about the Tower of Babel until today:

The Tower of Babel article. Notice the pointless squares in the Akkadian name. They are called “tofu” in the jargon of internationalization programmers.

And this is how most people will see it from today:

And we have the name written in real Akkadian cuneiform!

Notice how the Akkadian name now appears as actual Akkadian cuneiform, and not as meaningless squares. Even if you, like most people, cannot actually read cuneiform, you probably understand that showing it this way is more correct, useful and educational.

This is possible thanks to the webfonts technology, which was enabled on the English Wikipedia today. It has already been enabled for many months in Wikipedias in some other languages, mostly languages of India, which have severe problems with font support in the common operating systems, but now it’s available in the English Wikipedia, where it mostly serves to show parts of text that are written in exotic scripts.

The current iteration of the webfonts support in Wikipedia is part of a larger project: the Universal Language Selector (ULS). I am very proud to be one of its developers. My team in Wikimedia developed it over the last year or so, during which it underwent a rigorous process of design, testing with dozens of users from different countries, development, bug fixing and deployment. In addition to webfonts it provides an easy way to pick the user interface language, and to type in non-English languages (the latter feature is disabled by default in the English Wikipedia; to enable it, click the cog icon near “Languages” in the sidebar, then click “Input” and “Enable input tools”). In the future it will provide even more abilities, so stay tuned.

If you edit Wikipedia, or want to try editing it, one way in which you could help with the deployment of webfonts would be to make sure that all foreign strings in Wikipedia are marked with the appropriate HTML lang attribute; for example, that every Vietnamese string is marked as <span lang="vi" dir="ltr">. This will help the software apply the webfonts correctly, and in the future it will also help spelling and hyphenation software, etc.
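To make the markup concrete, here is a small Python sketch that wraps a string in such a span. The helper name `mark_language` and the short list of right-to-left language codes are assumptions for illustration only; this is not how MediaWiki or ULS produce the markup.

```python
from html import escape

# A deliberately short, illustrative set of right-to-left language codes.
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}

def mark_language(text, lang):
    """Wrap text in a span that carries its language and direction."""
    direction = "rtl" if lang in RTL_LANGUAGES else "ltr"
    return '<span lang="{}" dir="{}">{}</span>'.format(
        escape(lang), direction, escape(text)
    )

print(mark_language("Tiếng Việt", "vi"))
print(mark_language("עברית", "he"))
```

Note the straight double quotes around the attribute values: curly quotes pasted from a word processor are not valid attribute delimiters.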

This wouldn’t be possible without the help of many, many people: the developers of Mozilla Firefox, Google Chrome, Safari, Microsoft Internet Explorer and Opera, who developed the support for webfonts in these browsers; the people in Wikimedia who designed and developed the ULS: Alolita Sharma, Arun Ganesh, Brandon Harris, Niklas Laxström, Pau Giner, Santhosh Thottingal and Siebrand Mazeland; the many volunteers who tested ULS and reported useful bugs; the people in Unicode, such as Michael Everson, who work hard to give a number to every letter in every imaginable alphabet and make massive online multilingualism possible; and last but not least, the talented and generous people who developed all those fonts for the different scripts and released them under Free licenses. I send you all my deep appreciation, as a developer and as a reader of Wikipedia.

Marriage in Dictionaries

The definition of marriage is the hottest topic in US news lately.

My favorite place for looking up definitions of English words is, unsurprisingly, the Merriam-Webster dictionary.

And indeed, the editors of M-W’s website noticed the public interest in the definition of marriage, and here’s what they had to write about it:

The word became the subject of renewed scrutiny as the Supreme Court heard arguments in cases seeking to overturn California’s ban on gay marriage and the federal government’s Defense of Marriage Act.

Marriage has become a controversial definition, although its original sense – “the state of being united to a person of the opposite sex” – has not changed.

However, because the word is used in phrases such as “same-sex marriage” and “gay marriage” (by proponents and opponents alike), a second definition – “the state of being united to a person of the same sex in a relationship like that of a traditional marriage” – was added to the dictionary to provide an accurate picture of the word’s current use.

I recently read Herbert Morton’s excellent book The Story of Webster’s Third: Philip Gove’s Controversial Dictionary and Its Critics. It’s excellent because it’s very well written and because it could be a handbook in how to make dictionaries in general: how to balance scientific linguistic precision with usefulness to the general public.

Sadly, this remark about the definition of marriage is a departure from the principles of excellence that guided the editors of Webster’s Third. If the sentence says “same-sex marriage”, then “same-sex” means, literally, “same-sex”; there’s no need to say “the state of being united to a person of the same sex”.

Why not just say that “marriage” is “the state of being united to a person”? Maybe “legally united”, or “religiously united”. Or “united in a family”. It neatly avoids the political problems around sex and gender and all that, and is correct linguistically.

The official dictionary of the Catalan language already did it:

Comparison of two versions of a dictionary definition in the Catalan language.

The Institute of Catalan Studies, which publishes the dictionary, also publishes a list of updates in each edition. In this image you can see how the definition of marriage changed from “a legal union of a man and a woman” to “a legitimate union of two people who promise each other a common life, established through certain rituals or legal formalities”. The last usage example also says: “In some countries the legislation provides for marriage between two persons of the same sex”.

And well, yes, before you ask: of course there is a political background. Catalonia was one of the first jurisdictions that made same-sex marriage equal to different-sex marriage. But from the purely linguistic point of view the newer definition, which doesn’t mention a man and a woman, is perfectly correct. And saying that the definition of “marriage” is different in “marriage” and in “same-sex marriage” is not correct. Simple, really.

Yakutsk 2012

When I was about five years old, I saw a map of the world on the wall of my Moscow home. I noticed that the USSR is very, very big. And that it has a lot of rivers, like Ob, Yenisey, and Lena. “Lena”, I thought, “How nice. Like a name of a girl.”

On the Lena river I saw a city called Yakutsk. The name sounded a bit funny to me, but I became curious about it somehow.

And last month I went there.

Yakutsk is the capital of the Sakha Republic, also known as Yakutia – the largest administrative region in the world that is not a country. The largest native ethnic group of Sakha, after which the republic is named, speak a Turkic language of the same name, although it is also frequently called “Yakut”. Even though I spent almost all of my Soviet life in Moscow, I was always very curious about all the other regions and languages of the USSR, so when I discovered Wikipedia, I devoted a lot of time to reading about them and to visiting Wikipedias in these languages, even though I cannot really read them.

A request to start a Wikipedia in Sakha was filed in 2006, and I was quick to support it. After a few months of preparations it was opened. It is now one of the relatively more active Wikipedias in languages of Russia – it has over 8,000 articles, and for a minority language, most speakers of which are bilingual in another major language, this is a good number.

I have kept constant and positive contact with Nikolai Pavlov – the founder and the unofficial leader of the Sakha Wikipedia – since the very start of this Wikipedia. It was great to give these people technical and organizational advice: how to write articles effectively, how to choose topics, how to organize meet-ups of Wikipedians. For a long time I dreamt of meeting them in person, but because Yakutsk is so far away from practically any other imaginable place, I didn’t think that it would ever happen. But in April 2012 I met Nikolai at the Turkic Wikimedia Conference in Almaty, Kazakhstan.

A few days after that conference Nikolai suggested that I submit a talk for an IT conference in the North-Eastern Federal University in Yakutsk. At first I thought that it wasn’t really relevant to me, but after reading the description, I decided to give it a try and wrote talk proposals about my favorite topics: MediaWiki and Software Localization. Somewhat surprisingly, the talks were accepted and I received an invitation to present at that conference.

With Nikolai Pavlov, also known as Halan Tul. The unofficial leader of the Sakha Wikipedia and the excellent organizer of my trip to Yakutsk.

I flew from Tel-Aviv to Moscow, and then six more hours from Moscow to Yakutsk. Yakutsk is apparently a modern, bustling and developed city, but with interesting twists. Most notably, because it is in the permafrost area, all the houses are built on piles and all the pipelines are above ground. But actually this is just a small detail, because the general feeling was that it was a whole different country from the European part of Russia, to which I was used, and in a very good way.

I am standing on a new bridge being built

I was most pleasantly surprised by the liveliness of the Sakha language: practically all people there know Russian, but the Sakha speech is frequently heard on the streets, Sakha writing is frequently seen on advertising and store signs, and Sakha songs are played from many passing cars.

Speaking about MediaWiki in Yakutsk

The conference was very varied – with presenters from South Korea, China, Bulgaria, Switzerland and major Russian cities – Moscow, St. Petersburg and others. The topics were very varied, too, but the central topic was using computer technologies for education and human development, so I felt that my talks about Wikipedia and software localization were fitting.

Presenting my main plenary lecture about software localization. One of my main points is that Free Software, represented by the GNU head, is very easy to internationalize.

Besides participating in the conference itself, I also attended many meetings that Nikolai organized for me. It was fascinating to meet all these people.

Meeting the manager of Bichik, the national book publisher. On the wall - portraits of notable Sakha writers.

I spoke to the editor and the manager of the republic’s largest book publishing company – they told me that the local literature has great artistic value, but since fewer than half a million people speak this language, it’s hard to earn a lot of profit from it and to develop it. They also complained that some authors – as well as some deceased authors’ families – are too harsh about copyrights. I suggested that they try talking with authors and releasing some works under a Creative Commons license to see whether it gets them more exposure, and they promised to read Lawrence Lessig’s “Free Culture” book.

Meeting Yakutsk linguists and explaining to them how putting their works on Wikipedia will make them much more accessible to the whole world.

I also met with linguists from the university, who work on researching and documenting the Sakha language and other languages of the region, such as Evenki and Yukagir. I suggested that they use Wikimedia resources for storage and documentation of the works they gather, and they liked the idea; I am definitely going to follow up with them on that.

In the offices of, with the manager of the company - and a Kanban board in the background.

Another great meeting I had was with local tech people – a community of proud local IT geeks, who had lots of ideas for promoting Wikipedias in regional languages, and also the management and the employees of the local Internet portal. Their offices look just like a building of a hi-tech company in Silicon Valley or in Israel – with cozy rooms and lounges, and a Kanban board. The people made an excellent impression on me, too: we had a very professional and engaging conversation about developing web applications and agile management methodologies.

Preparing for an interview at NVK, the national TV station

I also spoke to several journalists and to the local TV and radio stations, inviting people to read Wikipedia in their own language and to contribute to it. I felt a bit like a celebrity, and well, I hope that it made somebody realize how effective the Internet can be in promoting local cultures and how proud people should be of their own languages.

One last comment is about the Sakha literature, which I mentioned earlier. I return from almost all my trips abroad with a lot of books about the local languages and cultures. And I actually read them. It happened on this trip, too, except this time most of the books were given to me as gifts by all those very nice people that I met. Sakha prose and Olonkho poetry in Russian translation are simply wonderful. In all honesty. This is beautiful world-class literature and it deserves more exposure. If this little blog post made you curious about it, then it’s the most important thing that it could achieve.

(All photos were taken by Nikolai Pavlov, except the one in which he appears.)

Ones and O’s: The Advantages of Digital Texts in Wikisource

I’ve been asked what the advantages are of using Wikisource over simply uploading scanned books to a website. The people who asked me about this speak languages of India, but my replies apply to all languages.

First, what is Wikisource? It’s a sister project of Wikipedia, which hosts freely-licensed documents that were already published elsewhere. The English Wikisource, for example, hosts many books that passed into the public domain, such as Alice in Wonderland, the Sherlock Holmes stories and Gesenius’ Hebrew Grammar (my favorite pet project). It also hosts many other types of texts, for example speeches by US presidents from Washington to Obama, because according to the American law they are all in the public domain.

And now to the main question: Why bother to type the texts letter-by-letter as digital texts rather than just scanning them? For languages written in the Latin, Cyrillic and some other scripts this question is less important, because for these scripts OCR technology makes the process half-automatic. It’s never fully automatic, because OCR output always has to be proofread, but it still makes the process easier and faster.

For the languages of India it is harder, because as far as i know there’s no OCR software for them, so they have to be typed letter-by-letter. This is very hard work. What is it good for?

In general, an image of a scanned page is a digital ghost: It is only partially useful to a human and it is almost completely useless to a computer. A computer’s heart only beats ones and O’s – it usually doesn’t care whether an image shows a kitten or a text of a poem.

It’s possible – and easy – to copy a digital text

It’s almost impossible to copy text from a scanned image. You can, of course, use some graphics editing software to cut the text and paste it as an image in your document, but that is very slow and the quality of the output will be bad. Why is it useful to copy text from a book that was already published? It’s very useful to people who write papers about literary works. This happens to all children who study literature in their native language in school and to university students and researchers in departments of language and literature. It is also useful if you want to quickly copy a quote from a book to an email, a status update on a social network or a Wikipedia article. Some people would think that copying from a book to a school paper is cheating, but it isn’t; copying another paper about a book may be cheating, but copying quotes from the original book to a paper you’re writing is usually OK and a digitized book just makes it easier and helps you concentrate on the paper.


In the previous point i mentioned copying text to an email from a book. It’s easy if you know what the book is and on which page the text appears. But it’s hard if you don’t know these things, and this happens very often. That’s where searching comes in, but searching works only if the text is digital – it’s very hard for the computer to understand whether an image shows a kitten or a scanned text of a poem, unless a human explains it. (OCR makes it only slightly easier.)


The letters “ht” in “http” and “html”, the names of the central technologies of the web, stand for “hypertext”. Hypertext is a text with links. A printed book only has references that point you to other pages, and then you have to turn pages back and forth. If they point to another book, you’ll have to go to the shelf, find it, and turn pages there. Digital texts can be very easily linked to one another, so you’ll just have to click a link to see where you are referred. This is very useful in scientific books and articles. It is rarely needed in poetry and stories, but it can be added to them too; for example, you can add a footnote that says: “Here the character quotes a line from a poem by Rabindranath Tagore” and link to the poem.


This one is very simple: Scanned images of texts use much more bandwidth than digital texts. In these days of broadband it may not seem very important, but the gap between digital texts and images is really huge, and it may be especially costly, in time and in money, to people who don’t have access to broadband.

Machine Translation

The above points are relatively easy to understand, but now it starts to get less obvious. Most modern machine translation engines, such as Google, Bing and Apertium, rely at least partly on pairs of translated texts. The more texts there are in a language, the better machine translation gets. There are many translated parallel texts in English, Spanish, Russian, German and French, so the machine translation for them works relatively well, but for languages with a smaller web presence it works very badly. It will take time until this influence is actually seen, but it has to begin somewhere.

Linguistic research and education

This is another non-obvious point: Digital texts are useful for linguists, who can analyze texts to find the frequency of words and to find n-grams. Put very simply, n-grams are sequences of words, and it can be assumed that words that frequently come in a sequence probably have some special meaning. Such things are directly useful only to linguists, but the work of linguists is later used by people who write textbooks for language learning. So, the better the digital texts in a language will be, the better textbooks the children who speak that language will get. (The link between advances in linguistic research and school language textbooks was found and described in at least one academic paper by an Israeli researcher.)
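To show how little code this takes once a text is digital, here is a minimal Python sketch that counts word frequencies and n-grams. The sample sentence is made up, and the tokenizer is deliberately naive (it splits on whitespace and strips ASCII punctuation); real linguistic work would need a tokenizer that knows the language and its script.

```python
import string
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, split on whitespace, strip ASCII punctuation.
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]

def ngrams(tokens, n):
    # All sequences of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "The cat sat on the mat, and the cat slept on the mat."
tokens = tokenize(sample)

word_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

None of this is possible with a scanned image: the computer first has to know which words are on the page.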

Language tools

Big collections of digital texts in a language can be easily used to make better language software tools, such as spelling, grammar and style checkers.


And all this brings us back to the thing with which we began: OCR technology. More digital texts will help developers of OCR software to make it better, because they’ll be able to compare existing images of text with proofread digital texts and use the comparison for testing. This is a wonderful way in which non-developers help developers and vice-versa.
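Such a comparison could be sketched with Python's standard difflib module, as below. This is only an illustration of the idea and not how any particular OCR project tests its output; the function names and the sample strings are made up.

```python
import difflib

def ocr_similarity(ocr_text, proofread_text):
    # Similarity between 0.0 and 1.0; 1.0 means the texts are identical.
    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text,
                                      autojunk=False)
    return matcher.ratio()

def ocr_errors(ocr_text, proofread_text):
    # List every place where the OCR output differs from the proofread text,
    # as (operation, what OCR produced, what the proofreader wrote).
    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text,
                                      autojunk=False)
    return [
        (tag, ocr_text[i1:i2], proofread_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

ocr_output = "Tbe cat"
proofread = "The cat"
print(ocr_similarity(ocr_output, proofread))
print(ocr_errors(ocr_output, proofread))
```

The list of differences tells an OCR developer which characters the engine confuses, and the more proofread pages there are, the more reliable such statistics become.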

So these are some of the advantages. The work is hard, but the advantages are really big, even if not immediately obvious.

If you have any more questions about Wikisource, please let me know.

The Secret Spell – how to easily make spelling checkers better

Software localization and language tools are poorly understood by a lot of people in general. Probably the most misunderstood language tool, despite its ubiquity, is spell checking.

Here are some things that most people probably do understand about spelling checkers:

  • Using a spelling checker does not guarantee perfect grammar and correctness of the text. False positives and false negatives happen.
  • Spelling checkers don’t include all possible words – they don’t have names, rare technical terms, neologisms, etc.

And here are some facts about spelling checkers that people often don’t understand. Some of them are so basic that they seem ridiculous, but nevertheless i heard them more than once:

  • Spelling checkers can exist for any language, not just for English.
  • At least in some programs it is possible to check the spelling of several languages at once, in one document.
  • Some spelling checkers intentionally omit some words, because they are too rare to be useful.
  • The same list of words can be used in several programs.
  • Contrariwise, the same language can have several lists of words available.

But probably the most commonly missed fact about spelling checkers is that they are software just like any other: they were created by programmers, they have maintainers, and they have bugs. These bugs can be reported and fixed. This is relatively easy to do with Free Software like Firefox and LibreOffice, whereas proprietary software vendors usually don’t accept bug reports at all. But in fact, even with Free Software it is easy only in theory.

The problem with spelling checkers is that almost any person can easily find lots of missing words in them just by writing email and Facebook updates (and dare i mention, Wikipedia articles). It’s a problem, because there’s no easy way to report them. When the spell checker marks a legitimate word in red, the user can press “Add to dictionary”. This function adds the word to a local file, so it’s useful only for that user on that computer. It’s not even shared with that user’s other computers or mobile devices, and it’s certainly not shared with other people who speak that language and for whom that word can be useful.

The user can report a missing word as a bug in the bug tracking system of the program that he uses to write the texts, the most common examples being Firefox and LibreOffice. Both of these projects use Bugzilla to track bugs. However, filling a whole Bugzilla report form just to report a missing word is way too hard and time-consuming for most users, so they won’t do it. And even if they did, it would be hard for the maintainers of Firefox and LibreOffice to handle that bug report, because the spelling dictionaries are usually maintained by other people.

Now what if reporting a missing word to the spelling dictionary maintainers were as easy as pressing “Add to dictionary”?

The answer is very simple – spelling dictionaries for many languages would quickly start to grow and improve. This is an area that just begs to be crowd-sourced. Sure, big, important and well-supported languages like English, French, Russian, Spanish and German may not really need it, because they have huge dictionaries already. But the benefit for languages without good software support would be enormous. I’m mostly talking about languages of Africa, India, the Pacific and Native American languages, too.

There’s not much to do on the client side: Just let “Add to dictionary” send the information to a server instead of saving it locally. Anonymous reporting should probably be the default, but there can be an option to attach an email address to the report and get the response of the maintainer. The more interesting question is what to do on the server side. Well, that’s not too complicated, either.

When the word arrives, the maintainer is notified and must do something about it. I can think of these possible resolutions:

  • The word is added to the dictionary and distributed to all users in the next released version.
  • The word is an inflected form of an existing word that the dictionary didn’t recognize because of a bug in the inflection logic. The bug is fixed and the fix is distributed to all users in the next released version.
  • The word is correct, but not added to the dictionary which is distributed to general users, because it’s deemed too rare to be useful for most people. It is, however, added to the dictionary for the benefit of linguists and other people who need complete dictionaries. Personal names that aren’t common enough to be included in the dictionary can receive similar treatment.
  • The word is not added to the dictionary, because it’s in the wrong language, but it can be forwarded to the maintainer of the spelling dictionary for that language. (The same can be done for a different spelling standard in the same language, like color/colour in English.)
  • The word is not added to the dictionary, because it’s a common misspelling (like “attendence” would be in English).
  • The word is not added to the dictionary, because it’s complete gibberish.

Some of the points above can be identified semi-automatically, but the ultimate decision should be up to the dictionary maintainer. Mistakes that are reported too often – again, “attendence” may become one – can be filtered out automatically. The IP addresses of abusive users who send too much garbage can be blocked.
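The resolutions and filters above can be sketched in a few lines of Python. This is only an illustration of the workflow, not an existing service: the class `ReportQueue`, the `Resolution` names and the `garbage_limit` threshold are all invented for this example, and the final decision on every word still belongs to the maintainer, who calls `resolve`.

```python
from collections import Counter
from enum import Enum

class Resolution(Enum):
    ADDED = "added to the general dictionary"
    INFLECTION_BUG = "inflection logic fixed"
    RARE = "added only to the complete dictionary"
    WRONG_LANGUAGE = "forwarded to another dictionary"
    MISSPELLING = "common misspelling"
    GIBBERISH = "gibberish"

class ReportQueue:
    def __init__(self, garbage_limit=3):
        self.garbage_limit = garbage_limit   # hypothetical threshold
        self.pending = {}                    # word -> set of senders
        self.resolved = {}                   # word -> Resolution
        self.known_misspellings = set()
        self.garbage_by_sender = Counter()
        self.blocked = set()

    def report(self, word, sender):
        """Queue a word for the maintainer; return False if it is filtered."""
        if sender in self.blocked:
            return False
        if word in self.known_misspellings or word in self.resolved:
            return False
        self.pending.setdefault(word, set()).add(sender)
        return True

    def resolve(self, word, resolution):
        """Record the maintainer's decision about a pending word."""
        senders = self.pending.pop(word, set())
        self.resolved[word] = resolution
        if resolution is Resolution.MISSPELLING:
            # Future reports of this word are filtered out automatically.
            self.known_misspellings.add(word)
        elif resolution is Resolution.GIBBERISH:
            # Senders who keep reporting garbage get blocked.
            for sender in senders:
                self.garbage_by_sender[sender] += 1
                if self.garbage_by_sender[sender] >= self.garbage_limit:
                    self.blocked.add(sender)

queue = ReportQueue(garbage_limit=2)
queue.report("attendence", "192.0.2.1")
queue.resolve("attendence", Resolution.MISSPELLING)
print(queue.report("attendence", "192.0.2.2"))  # filtered: known misspelling
```

Here the sender is identified by an IP address, as in the paragraph above; a real service would also have to store the language of each report.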

The same system for maintaining spelling dictionaries can be used for all languages and reside on the same website. This would be similar to – one website in which all the translations for MediaWiki and related projects are handled. It makes sense on, because the translation requirements for all languages are pretty much the same and the translators help each other. The requirements for spelling dictionaries are also mostly the same for all languages, even though they differ in the implementation of morphology and in other features, so developers of dictionaries for different languages can collaborate.

I already started implementing a web service for doing this. I called it Orthoman – “orthography manager”. I picked Perl and Catalyst for this – Perl is the language that i know best and i heard that Catalyst is a good framework for writing web services. I never wrote a web service from scratch before, so i’m slowish and this “implementation” doesn’t do anything useful yet. If you have a different suggestion for me – Ruby, Python, whatever – you are welcome to propose it to me. If you are a web service implementation genius and can implement the thing i described here in two hours, feel free to do it in any language.