Software localization and language tools are poorly understood by many people. Probably the most misunderstood language tool, despite its ubiquity, is the spelling checker.
Here are some things that most people probably do understand about spelling checkers:
- Using a spelling checker does not guarantee perfect grammar and correctness of the text. False positives and false negatives happen.
- Spelling checkers don’t include all possible words – they don’t have names, rare technical terms, neologisms, etc.
And here are some facts about spelling checkers that people often don’t understand. Some of them are so basic that they seem ridiculous, but nevertheless i heard them misunderstood more than once:
- Spelling checkers can exist for any language, not just for English.
- At least in some programs it is possible to check the spelling of several languages at once, in one document.
- Some spelling checkers intentionally omit some words, because they are too rare to be useful.
- The same list of words can be used in several programs.
- Contrariwise, the same language can have several lists of words available.
But probably the most misunderstood fact about spelling checkers is that they are software just like any other: they were created by programmers, they have maintainers, and they have bugs. These bugs can be reported and fixed. This is relatively easy to do with Free Software like Firefox and LibreOffice, whereas proprietary software vendors usually don’t accept bug reports at all. But in fact, even with Free Software it is easy only in theory.
The problem with spelling checkers is that almost any person can easily find lots of missing words in them just by writing email and Facebook updates (and dare i mention, Wikipedia articles). It’s a problem, because there’s no easy way to report them. When the spell checker marks a legitimate word in red, the user can press “Add to dictionary”. This function adds the word to a local file, so it’s useful only for that user on that computer. It’s not even shared with that user’s other computers or mobile devices, and it’s certainly not shared with other people who speak that language and for whom that word can be useful.
The user can report a missing word as a bug in the bug tracking system of the program used to write the text, the most common examples being Firefox and LibreOffice. Both of these projects use Bugzilla to track bugs. However, filling out a whole Bugzilla report form just to report a missing word is way too hard and time-consuming for most users, so they won’t do it. And even if they did, it would be hard for the maintainers of Firefox and LibreOffice to handle that bug report, because the spelling dictionaries are usually maintained by other people.
Now what if reporting a missing word to the spelling dictionary maintainers were as easy as pressing “Add to dictionary”?
The answer is very simple – spelling dictionaries for many languages would quickly start to grow and improve. This is an area that just begs to be crowd-sourced. Sure, big, important and well-supported languages like English, French, Russian, Spanish and German may not really need it, because they have huge dictionaries already. But the benefit for languages without good software support would be enormous. I’m mostly talking about the languages of Africa, India and the Pacific, and about Native American languages.
There’s not much to do on the client side: Just let “Add to dictionary” send the information to a server instead of saving it locally. Anonymous reporting should probably be the default, but there can be an option to attach an email address to the report and get the response of the maintainer. The more interesting question is what to do on the server side. Well, that’s not too complicated, either.
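The client side described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_missing_word_report` and the JSON field names are hypothetical, since no wire format has been defined for this service.

```python
import json
from typing import Optional


def build_missing_word_report(word: str, language: str,
                              email: Optional[str] = None) -> str:
    """Serialize a missing-word report as JSON.

    The field names here are hypothetical, chosen only for illustration.
    """
    word = word.strip()
    if not word:
        raise ValueError("cannot report an empty word")
    report = {"word": word, "language": language}
    # Anonymous by default: the address is attached only if the user opts in.
    if email is not None:
        report["email"] = email
    # ensure_ascii=False keeps non-Latin words readable in the payload.
    return json.dumps(report, ensure_ascii=False, sort_keys=True)
```

An “Add to dictionary” handler could call `build_missing_word_report("colour", "en-GB")` and send the resulting string to the server, in addition to or instead of saving the word locally.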
When the word arrives, the maintainer is notified and must do something about it. I can think of these possible resolutions:
- The word is added to the dictionary and distributed to all users in the next released version.
- The word is an inflected form of an existing word that the dictionary didn’t recognize because of a bug in the inflection logic. The bug is fixed and the fix is distributed to all users in the next released version.
- The word is correct, but not added to the dictionary which is distributed to general users, because it’s deemed too rare to be useful for most people. It is, however, added to the dictionary for the benefit of linguists and other people who need complete dictionaries. Personal names that aren’t common enough to be included in the dictionary can receive similar treatment.
- The word is not added to the dictionary, because it’s in the wrong language, but it can be forwarded to the maintainer of the spelling dictionary for that language. (The same can be done for a different spelling standard in the same language, like color/colour in English.)
- The word is not added to the dictionary, because it’s a common misspelling (like “attendence” in English).
- The word is not added to the dictionary, because it’s complete gibberish.
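The resolutions listed above map naturally onto a small data model. The following is a rough sketch, not the actual Orthoman code: the names `Resolution` and `Report` and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Resolution(Enum):
    """Possible outcomes of a report, one per resolution in the list."""
    ADDED = "added"
    INFLECTION_BUG_FIXED = "inflection bug fixed"
    RARE_WORD = "too rare for the general dictionary"
    WRONG_LANGUAGE = "wrong language or spelling standard"
    COMMON_MISSPELLING = "common misspelling"
    GIBBERISH = "gibberish"


# Resolutions after which the general-purpose dictionary accepts the word.
ACCEPTED_IN_GENERAL = {Resolution.ADDED, Resolution.INFLECTION_BUG_FIXED}


@dataclass
class Report:
    word: str
    language: str
    resolution: Optional[Resolution] = None  # None: waiting for the maintainer

    def resolve(self, resolution: Resolution) -> None:
        """Record the maintainer's decision."""
        self.resolution = resolution

    def accepted_in_general_dictionary(self) -> bool:
        return self.resolution in ACCEPTED_IN_GENERAL
```

A report starts unresolved, and only the maintainer's call to `resolve` changes that, which keeps the final decision in human hands.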
Some of the points above can be identified semi-automatically, but the ultimate decision should be up to the dictionary maintainer. Mistakes that are reported too often – again, “attendence” may become one – can be filtered out automatically. The IP addresses of abusive users who send too much garbage can be blocked.
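The automatic filtering mentioned above could look roughly like this. It is a sketch under assumptions: reports are modeled as hypothetical `(sender, word)` pairs, where the sender could be an IP address, and the function name `triage` is made up for this example. It only prepares a queue; it decides nothing on the maintainer's behalf.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def triage(reports: Iterable[Tuple[str, str]],
           known_misspellings: Set[str],
           blocked_senders: Set[str]) -> List[Tuple[str, int]]:
    """Return (word, report_count) pairs that still need a maintainer's decision.

    Reports from blocked senders and reports of misspellings that were
    already rejected are dropped automatically.
    """
    counts: Counter = Counter()
    for sender, word in reports:
        if sender in blocked_senders:
            continue
        if word in known_misspellings:
            continue
        counts[word] += 1
    # Most-reported words first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Sorting by report count is only a convenience for the maintainer: a frequently reported word is looked at sooner, but it is not accepted automatically.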
The same system for maintaining spelling dictionaries can be used for all languages and reside on the same website. This would be similar to translatewiki.net – one website in which all the translations for MediaWiki and related projects are handled. It makes sense on translatewiki.net, because the translation requirements for all languages are pretty much the same and the translators help each other. The requirements for spelling dictionaries are also mostly the same for all languages, even though they differ in the implementation of morphology and in other features, so developers of dictionaries for different languages can collaborate.
I already started implementing a web service for doing this. I called it Orthoman – “orthography manager”. I picked Perl and Catalyst for this – Perl is the language that i know best and i heard that Catalyst is a good framework for writing web services. I never wrote a web service from scratch before, so i’m slowish and this “implementation” doesn’t do anything useful yet. If you have a different suggestion for me (Ruby, Python, whatever), you are welcome to propose it. If you are a web service implementation genius and can implement the thing i described here in two hours, feel free to do it in any language.
20 thoughts on “The Secret Spell – how to easily make spelling checkers better”
Great idea. I have wondered about the same solution myself before.
However, have you considered just parsing all Wikipedia articles of a given language, then sorting the words by their number of occurrences and checking them against a spell checker?
Yes, that’s another way to add words to spelling dictionaries or to start a spelling dictionary for a language that doesn’t have one. I heard that it was used for some languages already.
There’s more than one way to do it :)
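The corpus approach from this exchange can be sketched as follows. The function name `unknown_words_by_frequency`, the `min_count` threshold and the plain set used as a dictionary are all assumptions for illustration; a real run would read a Wikipedia dump and query an actual spelling checker.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

# Runs of Unicode letters. This is a simplification: scripts that rely on
# combining marks need a proper tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+")


def unknown_words_by_frequency(text: str, dictionary: Set[str],
                               min_count: int = 2) -> List[Tuple[str, int]]:
    """List words missing from the dictionary, most frequent first."""
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    candidates = [(w, n) for w, n in counts.items()
                  if n >= min_count and w not in dictionary]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))
```

The `min_count` threshold keeps one-off typos out of the candidate list, but frequent misspellings would still appear in it, so the output is a list for a maintainer to review and not a ready-made dictionary.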
There is a small problem. For example, somebody writes “אמא” and does not understand why it is marked as incorrect, so they add it to the dictionary and it goes into the mainstream spell checker, while the correct spelling is “אימא”.
I’m not sure how good this idea is, as many common incorrect spellings would find their way into the dictionary, especially since incorrect spellings are very common in Hebrew. For example, MS Word’s spell checker frequently allows them and disallows spellings that are correct according to the Hebrew Language Academy.
I don’t think it is that simple
It’s not a problem. The dictionary maintainer decides whether to add a word to the dictionary. I only propose to crowd-source the reporting – not the verification, not the maintenance and not the release of new versions of the dictionary.
The case of “אמא” would be resolved as “common misspelling”, or maybe another resolution can be added, such as “outdated spelling”.
This is similar to what Google does with search spelling corrections. You’d need a statistical model to do this well — have a look at Peter Norvig’s brilliant talk The Unreasonable Effectiveness of Data: http://www.youtube.com/watch?v=yvDCzhbjYWs
Thank you for the link – i will watch the talk. But i don’t want this to be fully automatic and statistics-based – i only want to make reporting easier. As i envision it, statistics may help with triage, but not take it over.
This is how I imagine it:
This is a voting system for words that should be correct. If a language has official spelling rules, voting shouldn’t be needed – either the word is correct or it is incorrect.
At most, voting and statistics may help the triage. They cannot replace a maintainer, who must have some language qualification or at least good understanding of the spelling rules in the relevant language.
Voting is meant to help moderators better decide whether a word is worth adding to the dictionary. For example, Facebook is highlighted as a misspelling right here, while Mozilla, Firefox, Google and Twitter aren’t. People might want to have the term Facebook added to their dictionaries, and will submit it to the application. Then it is up to the dictionary team to decide what will make it into the dictionary and what will be left out. A voting system is the best way to help them discuss each term or word.
I love the idea, it is essentially easy and it has the potential to make a hell of a lot of difference to many, many languages. This is language support at its best … Thank you Amir
Small note: it is not always possible. It depends on the languages, not on the programs. It is very easy to make a spell checker that works for both English and Hebrew or French and Russian, but it is almost impossible to make a spell checker that works for both Italian and Spanish or Russian and Ukrainian or even English and French, because they share a script and even vocabulary.
It is almost impossible to create a spell checker that works for two different languages in the same text if they share a script and linguistic roots (as, for example, almost all European languages do).
It is possible if you install dictionaries for all the languages and mark the language of the text. In LibreOffice it is done at the status bar at the bottom of the screen. In Word, if i recall correctly, it is somewhere in the formatting menus.
Actually, your approach could even be useful for the large languages: German, for example, currently does not have a dictionary under a license that can be shipped with Firefox right away. I’d be tempted to call for a project to produce a PD/CC0-“licensed” German dictionary if there was such an easy way to get it started.
Hehe, it frustrates me, too. I actually blogged about it once: https://aharoni.wordpress.com/2009/04/13/obnoxious-firefox-licensing/ .
If German speakers find it useful, i am very happy.
Sounds awesome. One thought, though: as someone who is a Perl hacker and has written a web service in Catalyst (BzAPI), and hacked on a non-Catalyst Perl web app (Bugzilla) I would caution against it. I’m afraid I’ve come to the sad conclusion that the Perl 5 community is dying. Bug reports against packages fester, unattended, for months, even when major things are broken and patches are provided. People have either moved to hack on Perl 6, or switched to other languages, or are just focussed on maintaining their legacy apps.
I know not much Python, but even today if I had to write a web app I’d learn it, and Django, rather than do it in Perl. It pains me a great deal to say it, but it’s true.
Thanks. I have a rather different experience with reporting bugs in Perl 5 packages, but i’ll consider your advice.
It seems similar to what they are doing at http://www.speling.org/ . But I think their tools are shell scripts, so probably not very end-user friendly. The web service is probably the easier part (I would probably choose PHP, since it is the easiest to get hosted), compared to getting it integrated into the GUI apps in a usable way.
Are you familiar with Enchant (LGPL license), which supports numerous spell checker backends?
- Aspell/Pspell (intends to replace Ispell)
- Ispell (old as sin, could be interpreted as a de facto standard)
- MySpell/Hunspell (an OOo project, also used by Mozilla)
- Uspell (primarily Yiddish, Hebrew, and Eastern European languages – hosted in AbiWord’s CVS under the module “uspell”)
- AppleSpell (Mac OSX)
Rather than starting completely from scratch, I would suggest looking at whether something like Enchant can be extended or leveraged.
Thanks for the link. I wasn’t familiar with it.
It’s interesting, but i don’t understand how it is related to what i am proposing. I propose a system to report missing words over the network.