Ones and O’s: The Advantages of Digital Texts in Wikisource

I’ve been asked what the advantages are of using Wikisource over simply uploading scanned books to a website. The people who asked me about this speak languages of India, but my replies apply to all languages.

First, what is Wikisource? It’s a sister project of Wikipedia, which hosts freely-licensed documents that were already published elsewhere. The English Wikisource, for example, hosts many books that have passed into the public domain, such as Alice in Wonderland, the Sherlock Holmes stories and Gesenius’ Hebrew Grammar (my favorite pet project). It also hosts many other types of texts, for example speeches by US presidents from Washington to Obama, because according to American law they are all in the public domain.

And now to the main question: Why bother to type the texts letter by letter as digital texts rather than just scanning them? For languages written in the Latin, Cyrillic and some other scripts this question is less important, because for these scripts OCR technology makes the process half-automatic. It’s never fully automatic, because OCR output always has to be proofread, but it still makes the process easier and faster.
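
To make the “half-automatic” part concrete, here is a minimal sketch of the OCR step in Python, using the Tesseract engine through the pytesseract wrapper; the file name and language code are placeholders:

```python
# A minimal OCR sketch using the pytesseract wrapper for the Tesseract engine.
# "page_scan.png" and the language code are placeholders for illustration.
from PIL import Image
import pytesseract

scan = Image.open("page_scan.png")                        # a scanned page image
raw_text = pytesseract.image_to_string(scan, lang="eng")  # the OCR pass

# The output is a first draft, not a finished text: a human still has to
# proofread it against the scan before it can go into Wikisource.
print(raw_text)
```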

For the languages of India it is harder, because as far as i know there’s no OCR software for them, so the texts have to be typed letter by letter. This is very hard work. What is it good for?

In general, an image of a scanned page is a digital ghost: It is only partially useful to a human and it is almost completely useless to a computer. A computer’s heart beats only in ones and O’s – it usually doesn’t care whether an image shows a kitten or the text of a poem.

It’s possible – and easy – to copy a digital text

It’s almost impossible to copy text from a scanned image. You can, of course, use some graphics editing software to cut the text and paste it as an image into your document, but that is very slow and the quality of the output will be bad. Why is it useful to copy text from a book that was already published? It’s very useful to people who write papers about literary works: all children who study literature in their native language in school, and university students and researchers in departments of language and literature. It is also useful if you want to quickly copy a quote from a book into an email, a status update on a social network or a Wikipedia article. Some people would think that copying from a book to a school paper is cheating, but it isn’t; copying another paper about a book may be cheating, but copying quotes from the original book into a paper you’re writing is usually OK, and a digitized book just makes it easier and helps you concentrate on the paper.

Searching

In the previous point i mentioned copying text to an email from a book. It’s easy if you know what the book is and on which page the text appears. But it’s hard if you don’t know these things, and this happens very often. That’s where searching comes in, but searching works only if the text is digital – it’s very hard for the computer to understand whether an image shows a kitten or a scanned text of a poem, unless a human explains it. (OCR makes it only slightly easier.)
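
As a toy illustration of what digital text makes possible, here is a sketch that searches a hypothetical folder of digitized books for a quote – something that no amount of scanning alone would allow:

```python
# A toy full-text search: a plain substring scan over text files.
# "library/" is a hypothetical folder of digitized books; nothing
# comparable works on a folder of scanned images.
from pathlib import Path

query = "quoth the Raven"
for book in Path("library").glob("*.txt"):
    for number, line in enumerate(book.read_text(encoding="utf-8").splitlines(), 1):
        if query.lower() in line.lower():
            print(f"{book.name}, line {number}: {line.strip()}")
```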

Linking

The letters “ht” in “http” and “html”, the names of the central technologies of the web, stand for “hypertext”. Hypertext is text with links. A printed book only has references that point you to other pages, and then you have to turn pages back and forth. If they point to another book, you’ll have to go to the shelf, find it, and turn pages there. Digital texts can be very easily linked to one another, so you just have to click a link to see what it refers to. This is very useful in scientific books and articles. It is rarely needed in poetry and stories, but it can be added to them too; for example, you can add a footnote that says: “Here the character quotes a line from a poem by Rabindranath Tagore” and link to the poem.

Bandwidth

This one is very simple: Scanned images of texts use much more bandwidth than digital texts. In these days of broadband it may not seem very important, but the gap between digital texts and images is really huge, and it may be especially costly, in time and in money, to people who don’t have access to broadband.
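
A back-of-the-envelope calculation, assuming typical values for a scanned page, shows the scale of the gap:

```python
# Back-of-the-envelope comparison, assuming typical values: a 300 DPI
# grayscale scan of an A4 page versus the same page as plain UTF-8 text.
scan_pixels = int(8.27 * 300) * int(11.69 * 300)  # ~2480 x 3507 pixels
scan_bytes = scan_pixels // 10                    # rough 10:1 JPEG compression
text_bytes = 2000 * 2                             # ~2000 characters, ~2 bytes each

print(f"Scanned page: ~{scan_bytes / 1024:.0f} KB")  # hundreds of kilobytes
print(f"Plain text:   ~{text_bytes / 1024:.0f} KB")  # a few kilobytes
print(f"Ratio: ~{scan_bytes / text_bytes:.0f}x")     # two orders of magnitude
```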

Machine Translation

The above points are relatively easy to understand, but now it starts to get less obvious. Most modern machine translation engines, such as Google, Bing and Apertium, rely at least partly on pairs of translated texts. The more texts there are in a language, the better machine translation gets. There are many translated parallel texts in English, Spanish, Russian, German and French, so machine translation for them works relatively well, but for languages with a smaller web presence it works very badly. It will take time until this influence is actually seen, but it has to begin somewhere.
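
For the curious, this is roughly what “pairs of translated texts” look like to a machine translation system: two line-aligned files, one sentence per line, one file per language. The file names here are hypothetical:

```python
# A glimpse of a sentence-aligned parallel corpus: line N of one file is
# the translation of line N of the other. "corpus.en" and "corpus.bn"
# are hypothetical file names; this simple aligned format is what
# statistical MT toolkits are typically trained on.
with open("corpus.en", encoding="utf-8") as en, open("corpus.bn", encoding="utf-8") as bn:
    pairs = [(e.strip(), b.strip()) for e, b in zip(en, bn)]

print(f"{len(pairs)} sentence pairs available for training")
```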

Linguistic research and education

This is another non-obvious point: Digital texts are useful for linguists, who can analyze texts to find the frequency of words and to find n-grams. Put very simply, n-grams are sequences of words, and it can be assumed that words that frequently come in a sequence probably have some special meaning. Such things are directly useful only to linguists, but the work of linguists is later used by people who write textbooks for language learning. So the better the digital texts in a language, the better the textbooks that children who speak that language will get. (The link between advances in linguistic research and school language textbooks was found and described in at least one academic paper by an Israeli researcher.)
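
Here is a minimal sketch of n-gram counting in Python – word bigrams (n = 2) and their frequencies in a hypothetical text file:

```python
# A minimal sketch of n-gram counting: word bigrams (n = 2) and their
# frequencies. "corpus.txt" is a placeholder for a collection of texts.
from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().split()
bigrams = Counter(zip(words, words[1:]))

# Frequent pairs often signal collocations with a special meaning.
for (first, second), count in bigrams.most_common(10):
    print(f"{count:5}  {first} {second}")
```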

Language tools

Big collections of digital texts in a language can be easily used to make better language software tools, such as spelling, grammar and style checkers.
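
At its core, a spelling checker is little more than a big list of known words, and proofread digital texts are exactly where such lists come from. A minimal sketch, with a placeholder word list file (real checkers add morphology and much more):

```python
# The core of a spelling checker is embarrassingly simple: a set of known
# words, built from digitized texts. "wordlist.txt" is a placeholder.
known = set(open("wordlist.txt", encoding="utf-8").read().split())

def misspelled(text):
    """Return the words in the text that the word list doesn't know."""
    return [w for w in text.lower().split() if w.strip('.,;:!?"') not in known]

print(misspelled("Their attendence was noted"))  # flags "attendence"
```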

OCR

And all this brings us back to the thing with which we began: OCR technology. More digital texts will help developers of OCR software to make it better, because they’ll be able to compare existing images of text with proofread digital texts and use the comparison for testing. This is a wonderful way in which non-developers help developers and vice versa.
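
A sketch of what such testing could look like, using Python’s standard difflib module to measure how close the OCR output is to the proofread text (the file names are placeholders):

```python
# Compare the OCR output of a page with the human-proofread version and
# measure how close they are. difflib is in the standard library.
from difflib import SequenceMatcher

ocr_output = open("page_ocr.txt", encoding="utf-8").read()
proofread = open("page_proofread.txt", encoding="utf-8").read()

accuracy = SequenceMatcher(None, ocr_output, proofread).ratio()
print(f"Character-level similarity: {accuracy:.1%}")  # track this across versions
```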

So these are some of the advantages. The work is hard, but the advantages are really big, even if not immediately obvious.

If you have any more questions about Wikisource, please let me know.


In praise of Wiktionary

The Wikimedia Foundation manages the servers for several projects. Wikipedia gets almost all of the attention, and the others get almost none, even though at least some deserve a lot of it.

My personal favorite is Wikisource, a collection of freely-licensed texts that were already published elsewhere. It is similar to Project Gutenberg, but with a somewhat different focus and style.

A multi-volume Latin dictionary (Egidio Forcellini: Totius Latinitatis Lexicon, 1858–87) on a table in the main reading room of the University Library of Graz. Picture taken and uploaded on 15 Dec 2005 by Dr. Marcus Gossler (license: CC-BY-SA). This is the illustration in the English Wiktionary entry "dictionary".

But there’s another project, which deserves more and more attention and praise as the years go by: Wiktionary. Even though i love printed and digital dictionaries, i never became a frequent editor of Wiktionary for two reasons. The first reason is software: MediaWiki runs Wikipedia and all the other Wikimedia projects. It is quite well suited for Wikipedia, which thrives with long encyclopedic articles sorted in a very liberal tree of categories. It’s much less suited for a dictionary, which requires a rather different model of storing, linking and sorting the entries. Some attempts were made to improve this, for example, the many templates and gadgets developed locally in the English Wiktionary and the OmegaWiki project. Both of them have nice ideas that go in the right direction, but still have many implementation problems.

The second reason is problematic methodology. It’s a hard problem to explain, but i’ll try: Writing a good dictionary is a lot harder than writing a good encyclopedia. When you are writing an encyclopedia, you can base your article on one or more reliable sources about the nature and the history of a certain subject. The limits of what needs to be described in an encyclopedic article, at least for important subjects and fairly well-known people, are generally easy to determine. Dictionary compilation works entirely differently: to make a good dictionary, the editor must possess a large and representative collection of texts in a given language, find all instances of a given word, sort them into groups and describe the usage of the given word. Such resources are very hard to find, and there are very few people who have the needed qualifications to use them well.

Despite these problems, i find myself using Wiktionary quite often. Here are a few things for which i actually use Wiktionary repeatedly and successfully:

  • English Internet acronyms: AFAICT, TTYL, IRL, FTW, AYBABTU. They often appear in emails and chat sessions, they are legitimate dictionary terms, and the Wiktionary definitions for them are usually accurate.
  • Catalan, Spanish and Italian verb conjugation tables: I am learning these languages, and i find the verb conjugation tables in Wiktionary complete and very easy to use. I have no reason to think that they have mistakes.
  • Studying Dutch. I studied Dutch for a couple of months a year ago. Unfortunately i couldn’t find the time to go on with it – i hope to come back to it! – but while i did, i intentionally tried to use the Dutch Wiktionary to find words in the translation tasks that i got as homework. I found all the needed words easily, and the explanations and the translations were clear and helpful. Of course, words in homework for beginners are probably simple, but then beginners are probably the most important and frequent users of dictionaries. In any case, the Dutch Wiktionary did the job very well.

Another advantage that Wiktionary has over other paper and digital dictionaries is that it is very richly illustrated. Paper dictionaries usually have few illustrations, if any, because they want to save paper. Commercial digital dictionaries also have few illustrations, because their publishers don’t want to pay a lot of money to photographers and designers. Wiktionary has neither of these problems: Wikipedia is very richly illustrated thanks to the enormous number of images contributed by people, and Wiktionary has direct and easy access to Wikimedia Commons – the same repository of Free images, sounds and video that is used by Wikipedia. And of course, Wiktionary is not made of paper.

So there: Wiktionary may still not be as strong as Wikipedia in completeness and in popularity, but it definitely deserves attention. And the people who work on it despite the enormous difficulties deserve a lot of praise.

The Secret Spell – how to easily make spelling checkers better

Software localization and language tools are poorly understood by a lot of people. Probably the most misunderstood language tool, despite its ubiquity, is spell checking.

Here are some things that most people probably do understand about spelling checkers:

  • Using a spelling checker does not guarantee perfect grammar and correctness of the text. False positives and false negatives happen.
  • Spelling checkers don’t include all possible words – they don’t have names, rare technical terms, neologisms, etc.

And here are some facts about spelling checkers that people often don’t understand. Some of them are so basic that they seem ridiculous, but nevertheless i have heard them more than once:

  • Spelling checkers can exist for any language, not just for English.
  • At least in some programs it is possible to check the spelling of several languages at once, in one document.
  • Some spelling checkers intentionally omit some words, because they are too rare to be useful.
  • The same list of words can be used in several programs.
  • Contrariwise, the same language can have several lists of words available.

But probably the biggest misunderstanding about spelling checkers is that they are software just like any other: they are created by programmers, they have maintainers, and they have bugs. These bugs can be reported and fixed. This is relatively easy to do with Free Software like Firefox and LibreOffice, because proprietary software vendors usually don’t accept bug reports at all. But in fact, even with Free Software it is easy only in theory.

The problem with spelling checkers is that almost any person can easily find lots of missing words in them just by writing email and Facebook updates (and dare i mention, Wikipedia articles). It’s a problem, because there’s no easy way to report them. When the spell checker marks a legitimate word in red, the user can press “Add to dictionary”. This function adds the word to a local file, so it’s useful only for that user on that computer. It’s not even shared with that user’s other computers or mobile devices, and it’s certainly not shared with other people who speak that language and for whom that word can be useful.

The user can report a missing word as a bug in the bug tracking system of the program that he uses to write the texts, the most common examples being Firefox and LibreOffice. Both of these projects use Bugzilla to track bugs. However, filling in a whole Bugzilla report form just to report a missing word is way too hard and time-consuming for most users, so they won’t do it. And even if they did, it would be hard for the maintainers of Firefox and LibreOffice to handle that bug report, because the spelling dictionaries are usually maintained by other people.

Now what if reporting a missing word to the spelling dictionary maintainers were as easy as pressing “Add to dictionary”?

The answer is very simple – spelling dictionaries for many languages would quickly start to grow and improve. This is an area that just begs to be crowd-sourced. Sure, big, important and well-supported languages like English, French, Russian, Spanish and German may not really need it, because they have huge dictionaries already. But the benefit for languages without good software support would be enormous. I’m mostly talking about languages of Africa, India and the Pacific, and Native American languages, too.

There’s not much to do on the client side: Just let “Add to dictionary” send the information to a server instead of saving it locally. Anonymous reporting should probably be the default, but there can be an option to attach an email address to the report and get the response of the maintainer. The more interesting question is what to do on the server side. Well, that’s not too complicated, either.
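
A minimal sketch of that client side, with a hypothetical reporting endpoint – nothing like this exists yet, it is just the idea from the previous paragraph in code:

```python
# A sketch of the client side, assuming a hypothetical reporting endpoint:
# "Add to dictionary" sends the word to a server instead of saving it to a
# local file. Reports are anonymous unless the user attaches an email address.
import json
from urllib import request

def report_word(word, lang, email=None):
    payload = {"word": word, "lang": lang}
    if email:                                 # optional: get the maintainer's response
        payload["email"] = email
    req = request.Request(
        "https://orthoman.example/report",    # hypothetical Orthoman endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

report_word("ubuntal", "en")  # a made-up word, reported anonymously
```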

When the word arrives, the maintainer is notified and must do something about it. I can think of these possible resolutions:

  • The word is added to the dictionary and distributed to all users in the next released version.
  • The word is an inflected form of an existing word that the dictionary didn’t recognize because of a bug in the inflection logic. The bug is fixed and the fix is distributed to all users in the next released version.
  • The word is correct, but not added to the dictionary which is distributed to general users, because it’s deemed too rare to be useful for most people. It is, however, added to the dictionary for the benefit of linguists and other people who need complete dictionaries. Personal names that aren’t common enough to be included in the dictionary can receive similar treatment.
  • The word is not added to the dictionary, because it’s in the wrong language, but it can be forwarded to the maintainer of the spelling dictionary for that language. (The same can be done for a different spelling standard in the same language, like color/colour in English.)
  • The word is not added to the dictionary, because it’s a common misspelling (like “attendence” would be in English.)
  • The word is not added to the dictionary, because it’s complete gibberish.

Some of the points above can be identified semi-automatically, but the ultimate decision should be up to the dictionary maintainer. Mistakes that are reported too often – again, “attendence” may become one – can be filtered out automatically. The IP addresses of abusive users who send too much garbage can be blocked.
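
A sketch of that semi-automatic filtering, with hypothetical data: known misspellings are rejected outright, abusive senders are blocked, and everything else is queued for the maintainer:

```python
# Server-side triage with hypothetical data structures: known misspellings
# are rejected automatically, blocked IP addresses are dropped, and
# everything else waits for the dictionary maintainer's decision.
KNOWN_MISSPELLINGS = {"attendence"}  # reported too often, auto-rejected
BLOCKED_IPS = {"192.0.2.7"}          # senders of too much garbage

review_queue = []

def triage(word, sender_ip):
    if sender_ip in BLOCKED_IPS:
        return "dropped"
    if word.lower() in KNOWN_MISSPELLINGS:
        return "rejected: common misspelling"
    review_queue.append(word)        # the maintainer decides the rest
    return "queued for review"

print(triage("attendence", "203.0.113.5"))  # rejected automatically
```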

The same system for maintaining spelling dictionaries can be used for all languages and reside on the same website. This would be similar to translatewiki.net – one website in which all the translations for MediaWiki and related projects are handled. It makes sense on translatewiki.net, because the translation requirements for all languages are pretty much the same and the translators help each other. The requirements for spelling dictionaries are also mostly the same for all languages, even though they differ in the implementation of morphology and in other features, so developers of dictionaries for different languages can collaborate.

I already started implementing a web service for doing this. I called it Orthoman – “orthography manager”. I picked Perl and Catalyst for this – Perl is the language that i know best, and i heard that Catalyst is a good framework for writing web services. I had never written a web service from scratch before, so i’m slowish and this “implementation” doesn’t do anything useful yet. If you have a different suggestion for me – Ruby, Python, whatever – you are welcome to propose it to me. If you are a web service implementation genius and can implement the thing i described here in two hours, feel free to do it in any language.

Mongol Bichig, or why Microsoft Internet Explorer is better than Firefox, Chrome and Opera

After writing this post I found out that Google Chrome, in fact, does support vertical Mongolian text.

The title of this post is designed to catch the eye. Microsoft Internet Explorer is not better than Firefox, Chrome and Opera – it’s worse than them in every imaginable regard.

Except one: the support for Mongol Bichig, the vertical Mongolian script.

Text in vertical Mongolian

Mongolian script is unique: its letters are connected, similarly to Arabic, and its lines are written vertically. About three million Mongols in the independent republic of Mongolia use this script mostly for historical purposes and use the Cyrillic script in their daily life, but the classical vertical script is the regular script for nearly six million Mongols in China – about twice as many people.

The only browser that is able to display the vertical Mongolian script is Microsoft Internet Explorer. I don’t really know why Microsoft bothered to do it; maybe because the government of the People’s Republic of China demanded it. If that is true, then i salute the government of the People’s Republic of China. And i definitely salute Microsoft. I don’t like Microsoft’s insistence on keeping their code proprietary, but pioneering the support for this script, or any other, is praiseworthy.

I am very sad that at this time i cannot recommend that my Mongolian friends use my favorite browser, Firefox, or other modern browsers such as Google Chrome and Opera. For all their modernity, speed, feature richness and standards compliance, they are useless to the millions of people who want to read and write in the vertical Mongolian script. At most, these browsers can display the script horizontally and with some letters incorrectly rendered. This also means that the only useful operating system for these people is Microsoft Windows.

One explanation that i heard for not supporting the vertical Mongolian script is that the CSS writing modes standard is not completely defined. This is actually a good and even noble reason, but when the most basic ability to read a language is in question, experimental support is better than no support.

So, which modern free browser will be the first to support the Mongolian script? I guess that it will be Firefox, given its excellent track record in supporting Unicode, and that Google Chrome will follow it after three years or so. But if Chrome developers surprise me and get there first, i’ll be just as happy. In any case, i am waiting impatiently, along with more than six million Mongols.

* * *


A completely unrelated postscript, intentionally hidden here, feel free to stop reading now: This morning i woke up to find that my Planet Mozilla feed was filled with reactions to a post by Gervase Markham, a.k.a. Gerv, in which he advocated keeping marriage defined as a union between a man and a woman, essentially opposing gay marriage. A lot of people were angry that anti-gay comments appeared in a Mozilla-related feed, and a lot of people were angry that anything off-topic appeared there. Some people supported Gerv in different ways.

Gerv is a very well-known and very talented Mozilla programmer, and also a devout Christian. His blog is called “Hacking for Christ”. There’s nothing weird or wrong about it – there are many other excellent Christian hackers, like Perl’s Larry Wall and Jonathan Worthington and Mozilla’s Jonathan Kew. Gerv’s comment wasn’t particularly hateful; as it often goes, it focused on the legal side of things. Gerv is also an unusually charming person; i had the pleasure of meeting him in Berlin.

All that said, i support gay marriage, i don’t support Gerv’s comment and i think that he shouldn’t have posted it that way. But once he did, hey – water under the bridge. I care much more about his contributions to Mozilla’s code than about his social, legal and religious opinions.

And the loveliest part of it all is that in one of the many comments to his post, i found a link to the play “8”, about the fight for recognizing gay marriage in California. On one hand, it’s a very well played PR stunt, with highest-league stars such as Brad Pitt, George Clooney, Martin Sheen, Jamie Lee Curtis, Kevin Bacon, Yeardley Smith, John C. Reilly and George Takei. On the other hand, it’s actually worth watching. If this is what came out of that poorly placed blog post, then i’m not complaining.

Traditions I can trace: Wikipedia and Firefox in the Library

Liron Dorfman, a Wikimedia Israel activist, periodically lectures in a library in the north of the country, helping librarians contribute their vast knowledge and experience with reference works to the Free Encyclopedia.

At one of these lectures he called me in a panic and asked for urgent help: He was trying to teach the librarians how to upload images to Commons, Wikipedia’s image repository, and the Upload Wizard got stuck.

My first question, of course, was “Which browser are you using?”

The answer was, unsurprisingly, “Microsoft Internet Explorer”.

So i told him to try another browser. There wasn’t one installed, so i told him to download one. He wanted to download Chrome, but i insisted on Firefox, and he agreed.

So he installed Firefox, tried the Upload Wizard there, and it worked. Win.

It was a nice demonstration of how Firefox can save the day. It would probably have worked in Google Chrome, too; Chrome has many bugs that make it almost unusable to me, but in this case insisting on Firefox was just a matter of personal preference.

Of course, uploading should work in Microsoft Internet Explorer, too; about 30% of Wikipedia readers still use it, and about half of them use the old Internet Explorer 8, which is the newest version available on the still-popular Windows XP. The fact, however, is that for better or worse MediaWiki developers mostly use GNU/Linux and Mac, on which Microsoft Internet Explorer doesn’t run at all, so we don’t even open it unless we have a reason. We usually test new features in it, but it is rare for us to actually use it for browsing the web, and that kind of everyday use is essential for noticing bugs that would otherwise go unnoticed.

We all wish that all our users would stop using the old, proprietary and non-compliant Microsoft Internet Explorer, but we cannot convert millions of people overnight; even the giants Google and Facebook have tried to do that, and so far without great success. Until then, we hope that the people who still use it will at least be able to read and contribute text and media. We can only fix problems if we know about them, so if you use Internet Explorer and encounter a problem in Wikipedia or the websites related to it, please report it at Wikimedia’s bug reporting site.

But if you just stop using Internet Explorer and move to a modern browser, we’ll be quite happy, too.

And to get back to the opening point – never be shy to introduce your friends who still use Microsoft Internet Explorer to Firefox! They’ll thank you in any case, but it works especially well when things break. If you find yourself doing that a lot, then you are already very cool and you should consider going further by becoming a Mozilla Rep.

Firefox Aurora – Mozilla’s biggest breakthrough since Firefox itself

This post encourages you to be a little more adventurous. Please try doing what it says, even if you don’t consider yourself a techie person.

The release of Firefox 4 in March 2011 brought many noticeable innovations in the browser itself, but there was another important innovation that was overlooked and misunderstood by many: A new procedure for testing and releasing new versions.

Before Firefox 4, the release schedule of the Firefox browser was inconsistent and versions were released “when they were ready”. Beta versions were released at rather random dates and quite frequently they were unstable. Nightly builds were appropriately called “Minefield” – they crashed so often that it was impossible to use them for daily web browsing activities.

The most significant breakthrough with regards to the testing of the Firefox browser came a year ago: Mozilla decided on a regular six-week release schedule and introduced the “release channels”: Nightly, Aurora, Beta and Release. The “Release” version is what most people download and use. “Beta” could be called a “Release candidate” – few, if any, changes are made to it before it becomes “Release”. Both “Aurora” and “Nightly” are updated daily and the differences between them are that “Nightly” has more experimental features that come right from the developers’ laptops and that “Aurora” is usually released with translations to all the languages that Firefox supports, while “Nightly” is mostly released in English.

Now here’s the most important part: I use Aurora and Nightly most of the time, and my own experience is that both of them are actually very stable and can be used for daily browsing. It’s possible to install all the versions side by side on one machine and to have them use the same add-ons, preferences, history and bookmarks. This makes it possible for many testers to fully use them for whatever they need a browser for without going back to the stable version. There certainly are surprises and bugs in functionality, but i have yet to encounter one that would make me give up. In comparison, in the old “Minefield” builds the browser would often crash before a tester would even notice these bugs, so it was not very useful for testing.

This change is huge. Looking back at the year of this release schedule, this may be the biggest breakthrough in the world of web browsers since the release of Firefox 1.0 in 2004. In case you forgot, before Firefox was called “Firefox”, it was just “Mozilla”; it was innovative, but too experimental for the casual user: it had a clunky user interface and it couldn’t open many websites, which were built with only Microsoft Internet Explorer in mind. Consequently, it was frequently laughed at. “Firefox” was an effort to take the great innovative thing that Mozilla was, clean it up and make it functional, shiny, inviting and easy to install and use. That effort was an earth-shaking success that revived competition and innovation in Internet technologies.

Aurora does to software testing what Firefox did to web browsing. It makes beta testing easy and fun for many people – it turns testing from a bug-hunting game that only nerds want to play into a fun and unobtrusive thing that anybody can do without even noticing. And it is yet another thing that the Mozilla Foundation does to make the web better for everybody, with everybody’s participation.

A few words about Mozilla’s competitors: The Google Chrome team does something similar with what they call “Canary builds”. I use them to peek into the future of Chrome and i occasionally report bugs in them, but i find them much less stable than Firefox Nightly, so they aren’t as game-changing. Like Minefield from Mozilla’s distant past, they crash too often to be useful as a daily web browser, so i keep going back to Firefox Aurora. Microsoft releases new versions of Microsoft Internet Explorer very rarely, and installing future test versions is way too hard for most people, so it’s not even in the game. Opera is in the middle: It releases new versions of its browser quite frequently and offers beta builds for downloading, but it doesn’t have a public bug tracking system, so i cannot really participate in the development process.

To sum things up: Download Firefox Aurora and start using it as your daily browser and report bugs if you find any. You’ll see that it’s easier than you thought to make the Web better.