The Secret Spell – how to easily make spelling checkers better

Software localization and language tools are poorly understood by a lot of people. Probably the most misunderstood language tool, despite its ubiquity, is the spelling checker.

Here are some things that most people probably do understand about spelling checkers:

  • Using a spelling checker does not guarantee perfect grammar and correctness of the text. False positives and false negatives happen.
  • Spelling checkers don’t include all possible words – they don’t have names, rare technical terms, neologisms, etc.

And here are some facts about spelling checkers that people often don’t understand. Some of them are so basic that they seem ridiculous, but nevertheless i heard them more than once:

  • Spelling checkers can exist for any language, not just for English.
  • At least in some programs it is possible to check the spelling of several languages at once, in one document.
  • Some spelling checkers intentionally omit some words, because they are too rare to be useful.
  • The same list of words can be used in several programs.
  • Contrariwise, the same language can have several lists of words available.

But probably the most misunderstood fact about spelling checkers is that they are software just like any other: They were created by programmers, they have maintainers, and they have bugs. These bugs can be reported and fixed. This is relatively easy to do with Free Software like Firefox and LibreOffice; proprietary software vendors usually don’t accept bug reports at all. But in fact, even with Free Software it is easy only in theory.

The problem with spelling checkers is that almost any person can easily find lots of missing words in them just by writing email and Facebook updates (and dare i mention, Wikipedia articles). It’s a problem, because there’s no easy way to report them. When the spell checker marks a legitimate word in red, the user can press “Add to dictionary”. This function adds the word to a local file, so it’s useful only for that user on that computer. It’s not even shared with that user’s other computers or mobile devices, and it’s certainly not shared with other people who speak that language and for whom that word can be useful.

The user can report a missing word as a bug in the bug tracking system of the program that he uses to write the texts, the most common examples being Firefox and LibreOffice. Both of these projects use Bugzilla to track bugs. However, filling out a whole Bugzilla report form just to report a missing word is way too hard and time-consuming for most users, so they won’t do it. And even if they did, it would be hard for the maintainers of Firefox and LibreOffice to handle that bug report, because the spelling dictionaries are usually maintained by other people.

Now what if reporting a missing word to the spelling dictionary maintainers were as easy as pressing “Add to dictionary”?

The answer is very simple – spelling dictionaries for many languages would quickly start to grow and improve. This is an area that just begs to be crowd-sourced. Sure, big, important and well-supported languages like English, French, Russian, Spanish and German may not really need it, because they have huge dictionaries already. But the benefit for languages without good software support would be enormous. I’m mostly talking about the languages of Africa, India and the Pacific, and about Native American languages.

There’s not much to do on the client side: Just let “Add to dictionary” send the information to a server instead of saving it locally. Anonymous reporting should probably be the default, but there can be an option to attach an email address to the report and get the response of the maintainer. The more interesting question is what to do on the server side. Well, that’s not too complicated, either.
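The client side described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint URL, the JSON field names and the function names are all made up for this example, and no such service is known to exist at that address.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
REPORT_URL = "https://orthoman.example/api/report"


def build_report(word, language, email=None):
    """Build the payload that "Add to dictionary" would send.

    Reports are anonymous by default; the email address is attached only
    when the user opts in to hearing back from the maintainer.
    """
    word = word.strip()
    if not word:
        raise ValueError("cannot report an empty word")
    report = {"word": word, "language": language}
    if email:
        report["email"] = email
    return report


def encode_report(report):
    """Serialize the report as UTF-8 JSON, keeping non-ASCII letters readable."""
    return json.dumps(report, ensure_ascii=False, sort_keys=True).encode("utf-8")


def send_report(report, url=REPORT_URL, timeout=5):
    """POST the report to the (hypothetical) server and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=encode_report(report),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A program could still save the word locally as it does today and call something like send_report(build_report(word, "te")) in the background, so that the user doesn’t have to wait for the network.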

When the word arrives, the maintainer is notified and must do something about it. I can think of these possible resolutions:

  • The word is added to the dictionary and distributed to all users in the next released version.
  • The word is an inflected form of an existing word that the dictionary didn’t recognize because of a bug in the inflection logic. The bug is fixed and the fix is distributed to all users in the next released version.
  • The word is correct, but not added to the dictionary which is distributed to general users, because it’s deemed too rare to be useful for most people. It is, however, added to an extended dictionary for the benefit of linguists and other people who need complete dictionaries. Personal names that aren’t common enough to be included in the general dictionary can receive similar treatment.
  • The word is not added to the dictionary, because it’s in the wrong language, but it can be forwarded to the maintainer of the spelling dictionary for that language. (The same can be done for a different spelling standard in the same language, like color/colour in English.)
  • The word is not added to the dictionary, because it’s a common misspelling (like “attendence” would be in English).
  • The word is not added to the dictionary, because it’s complete gibberish.

Some of the points above can be identified semi-automatically, but the ultimate decision should be up to the dictionary maintainer. Mistakes that are reported too often – again, “attendence” may become one – can be filtered out automatically. The IP addresses of abusive users who send too much garbage can be blocked.
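The server-side flow above could be sketched roughly as follows. This is a minimal Python sketch, not the actual Orthoman code: the class names, the method names and the blocking threshold are assumptions made for illustration. Words that are already known misspellings are rejected automatically, reports from blocked senders are discarded, and everything else waits for the maintainer’s decision.

```python
from collections import Counter
from enum import Enum


class Resolution(Enum):
    """The possible outcomes listed above, one per bullet."""
    ADDED = "added to the dictionary"
    INFLECTION_BUG = "inflection logic fixed"
    RARE = "kept out of the general dictionary as too rare"
    WRONG_LANGUAGE = "forwarded to another dictionary"
    MISSPELLING = "rejected as a common misspelling"
    GIBBERISH = "rejected as gibberish"


class ReportQueue:
    """Collects incoming words and filters the obvious cases automatically."""

    def __init__(self, known_misspellings=(), max_garbage_per_sender=20):
        self.known_misspellings = set(known_misspellings)
        self.max_garbage_per_sender = max_garbage_per_sender  # assumed threshold
        self.garbage_by_sender = Counter()
        self.pending = {}  # word -> set of senders (e.g. IP addresses)

    def is_blocked(self, sender):
        return self.garbage_by_sender[sender] >= self.max_garbage_per_sender

    def submit(self, word, sender):
        """Return a Resolution if decided automatically, else None (queued)."""
        if self.is_blocked(sender):
            # Reports from blocked senders are discarded as garbage.
            return Resolution.GIBBERISH
        if word in self.known_misspellings:
            return Resolution.MISSPELLING
        self.pending.setdefault(word, set()).add(sender)
        return None

    def resolve(self, word, resolution):
        """Record the maintainer's decision for a queued word."""
        senders = self.pending.pop(word, set())
        if resolution is Resolution.MISSPELLING:
            # Future reports of this word are filtered without human review.
            self.known_misspellings.add(word)
        elif resolution is Resolution.GIBBERISH:
            for sender in senders:
                self.garbage_by_sender[sender] += 1
        return resolution
```

The ultimate decision stays with the maintainer: submit() only short-circuits cases that a human has already decided once.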

The same system for maintaining spelling dictionaries can be used for all languages and reside on the same website. This would be similar to translatewiki.net – one website in which all the translations for MediaWiki and related projects are handled. It makes sense on translatewiki.net, because the translation requirements for all languages are pretty much the same and the translators help each other. The requirements for spelling dictionaries are also mostly the same for all languages, even though they differ in the implementation of morphology and in other features, so developers of dictionaries for different languages can collaborate.

I already started implementing a web service for doing this. I called it Orthoman – “orthography manager”. I picked Perl and Catalyst for this – Perl is the language that i know best and i heard that Catalyst is a good framework for writing web services. I never wrote a web service from scratch before, so i’m rather slow and this “implementation” doesn’t do anything useful yet. If you have a different suggestion for me (Ruby, Python, whatever), you are welcome to propose it. If you are a web service implementation genius and can implement the thing i described here in two hours, feel free to do it in any language.

Kim Jong Il, Tumblr, WebFonts and Firefox

Kim Jong Il died.

Then a humorous blog called “kim jong-il looking at things” surged in popularity.

I looked at it, too, and found it funny.

And then i looked at its about section and became sad. Its about section said: “for a more beautiful experience use google chrome or safari. font-face seems to have an issue with firefox and will display a very bland arial instead of the exquisite amaranth.” Someone reading this may think that it’s a bug in Firefox, but as a matter of fact, Firefox is the browser that implements font-face correctly according to the CSS standard.

This Kim Jong Il blog is hosted on tumblr.com – a nice and stylish blog service. Among other services, tumblr gives its users an option to use web fonts to improve the appearance of their blogs. tumblr’s developers probably only tested this feature with Chrome and Safari and when it didn’t work on Firefox nobody cared – after all, as nice as it is, it’s just another English font.

tumblr.com has the same issue that Wikipedias in Indic languages had after we installed WebFonts there – it tries to load the font files from a different server, but Firefox, according to the standard, doesn’t load a font from a different domain unless that domain is explicitly configured to allow it. We at Wikimedia fixed it immediately after finding it, because for us using web fonts is a way to make our website readable. For tumblr, as for most other English websites, using web fonts is just a way to make the website a little more beautiful.

tumblr.com should fix this bug. I reported this font problem at getsatisfaction.com, hoping that tumblr developers would notice it. The bug hasn’t been fixed yet, even though it’s a one-line fix.
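For readers who wonder what such a one-line fix looks like: the server that hosts the font files has to send an Access-Control-Allow-Origin response header with them. The Python sketch below shows the idea with the standard library’s static file server. It is not tumblr’s or Wikimedia’s actual configuration; the handler name, the helper function and the list of extensions are assumptions made for illustration, and a real website would more likely set the header in its web server configuration.

```python
from http.server import SimpleHTTPRequestHandler

# File extensions treated as fonts in this sketch (an assumed list).
FONT_EXTENSIONS = (".woff", ".woff2", ".ttf", ".otf", ".eot")


def is_font_path(path):
    """True if the requested path, ignoring any query string, is a font file."""
    return path.split("?", 1)[0].lower().endswith(FONT_EXTENSIONS)


class FontHandler(SimpleHTTPRequestHandler):
    """Static file handler that lets pages on other domains load its fonts."""

    def end_headers(self):
        if is_font_path(self.path):
            # The "one line": permit cross-origin use of the font.
            self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()
```

To try it locally, the handler can be passed to http.server.ThreadingHTTPServer, for example ThreadingHTTPServer(("127.0.0.1", 8000), FontHandler).serve_forever(). The value "*" allows every website to use the font; a stricter setup would name only the domains that are supposed to use it.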

tumblr webmasters! If you happen to read this post – please fix this issue. Thank you.

The Software Localization Paradox

Wikimania in Haifa was great. Plenty of people wrote blog posts about it; the world doesn’t need yet another post about how great it was.

What the world does need is more blog posts about the great ideas that grew in the little hallway conversations there. One of the things that i discussed with many people at Wikimania is what i call The Software Localization Paradox. That’s an idea that has been bothering me for about a year. I tried to look for other people who wrote about it online and couldn’t find anything.

Like any other translation, software localization is best done by people who know well both the original language in which the software interface was written, which is usually English, and the target language. People who don’t know English strongly prefer to use software in a language they know. If the software is not available in their language, they will either not use it at all or will have to memorize lots of otherwise meaningless English strings and locations of buttons. People who do know English often prefer to use software in English even if it is available in their native language. The two most frequent explanations for that are that the translation is bad and that people who want to use computers should learn English anyway. The problem is that for various reasons lots of people will never learn English, even if it were mandatory in schools and useful for business. They will have to suffer the bad translations and will have no way to fix them.

I’ve been talking to people at Wikimania about this, especially people from India. (I also spoke to people from Thailand, Russia, Greece and other countries, but Indians were the biggest group.) All of them knew English and at least one language of India. The larger group of Indian Wikipedians to whom i spoke preferred English for most communication, especially online, even if they had computers and mobile phones that supported Indian languages; some of them even preferred to speak English at home with their families. They also preferred reading and writing articles in the English Wikipedia. The second, smaller, group preferred the local language. Most of these people also happened to be working on localizing software, such as MediaWiki and Firefox.

So this is the paradox – to fix localization bugs, someone must notice them, and to notice them, more people who know English must use localized software, but people who know English rarely use localized software. That’s why lately i’ve been evangelizing about it. Even people who know English well should use software in their language – not to boost their national pride, but to help the people who speak that language and don’t know English. They should use the software especially if it’s translated badly, because they are the only ones who can report bugs in the translation or fix the bugs themselves.

(A side note: Needless to say, Free Software is much more convenient for localization, because proprietary software companies are usually too hard to even approach about this matter; they only pay translators if they have a reason to believe that it will increase sales. This is another often overlooked advantage of Free Software.)

I am glad to say that i convinced most people to whom i spoke about it at Wikimania to at least try to use Firefox in their native language and taught them where to report bugs about it. I also challenged them to write at least one article in the Wikipedia in their own language, such as Hindi, Telugu or Kannada – as useful as the English Wikipedia is to the world, the Telugu Wikipedia is much more useful for people who speak Telugu, but not English. I already saw some results.

I am now looking for ideas and verifiable data to develop this concept further. What are the best strategies to convince people that they should use localized software? For example: How economically viable is software localization? What is cheaper for an education department of a country – to translate software for schools or to teach all the students English? Or: How does the absence of localized software affect different geographical areas in Africa, India, the Middle East?

Any ideas about this are very welcome.

Type O Negative, part 2

Since my previous and very negative post about Google+ i have played with it a little more. Apparently, a lot of my misunderstanding was related to actual bugs in its interface – for example, people that i’m not supposed to follow appear in my stream. I guess that it’s understandable, given that the service is so young.

I do have something very nice to say about it – it has an excellent interface for reporting bugs. You simply click the problematic area on the screen, write a description and submit the report. It is very buggy on Firefox, but i can understand that, too, hoping that they will fix it. It does work well in Google Chrome, but i can’t really use it, because Chrome’s right-to-left editing support is very bad. The sad thing is that after the report is submitted i don’t have a way to know what happens to it. Public bug tracking is one of the most common, most appealing, and most overlooked features of Free Software. However, reporting bugs in Free Software projects is a relatively hard process – the interface of bug tracking software such as Bugzilla is intimidating and lots of people don’t even know that they can use it.

I hope that Free Software web frameworks such as MediaWiki (Wikipedia’s engine), WordPress and Drupal will adopt a similar model for reporting bugs and combine it with the already excellent concept of public bug tracking. If that were Google+’s contribution to the web, it would be enough to say that it doesn’t suck.

Number in the Middle

I didn’t measure it, but i probably search Google in English more often than in Hebrew. Under the result link there’s a short summary of the page. Very frequently the first thing that is written in this summary is a date. Google forces right-to-left too strongly on all of the page, so the first number of the date goes to the other end of the summary:

Google search results - right to left

The result is that very, very often i see things like “at most restaurants in 21 Lima and Cusco” and “What if 26 you buy a shite gun”, which don’t make sense.

These are the results in complete left-to-right display:

Google search results - left to right

Dear Google, please fix this bug. It has been annoying me for a long time.
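One possible way to fix this kind of bug is to decide the direction of each summary from its first strongly directional character and to isolate the summary from the direction of the surrounding page. The Python sketch below does this with Unicode directional isolate characters. It only illustrates the idea and says nothing about how Google’s code actually works; the function names are made up for this example.

```python
import unicodedata

# Unicode directional isolate controls.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text):
    """Return "ltr" or "rtl" according to the first strong character.

    Digits, spaces and punctuation have no strong direction and are skipped.
    Returns None if the text has no strongly directional character at all.
    """
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def isolate(text):
    """Wrap the text so that its leading number stays with the words after it."""
    direction = first_strong_direction(text)
    if direction is None:
        return text
    opener = LRI if direction == "ltr" else RLI
    return opener + text + PDI
```

With this, an English summary that begins with a date is wrapped as left-to-right text even on a right-to-left page, so the leading number stays at the beginning. In HTML the same effect is usually achieved by setting the dir attribute on the element that holds the summary.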