I’ve been asked what the advantages are of using Wikisource over simply uploading scanned books to a website. The people who asked me about this speak languages of India, but my replies apply to all languages.
First, what is Wikisource? It’s a sister project of Wikipedia, which hosts freely-licensed documents that were already published elsewhere. The English Wikisource, for example, hosts many books that passed into the public domain, such as Alice in Wonderland, the Sherlock Holmes stories and Gesenius’ Hebrew Grammar (my favorite pet project). It also hosts many other types of texts, for example speeches by US presidents from Washington to Obama, because according to the American law they are all in the public domain.
And now to the main question: Why bother to type the texts letter-by-letter as digital texts rather than just scanning them? For languages written in the Latin, Cyrillic and some other scripts this question is less important, because for these scripts OCR technology makes the process half-automatic. It’s never fully automatic, because OCR output always has to be proofread, but it’s still makes the process easier and faster.
For the languages of India it is harder, because as far as i know there’s no OCR software for them, so they have to be typed letter-by-letter. This is very hard work. What is it good for?
In general, an image of a scanned page is a digital ghost: It is only partially useful to a human and it is almost completely useless to a computer. A computer’s heart only beats ones and O’s – it usually doesn’t care whether an image shows a kitten or a text of a poem.
It’s possible – and easy – to copy a digital text
It’s almost impossible to copy text from a scanned image. You can, of course, use some graphics editing software to cut the text and paste it as an image in your document, but that is very slow and the quality of the output will be bad. Why is it useful to copy text from a book that was already published? It’s very useful to people who write papers about literary works. This happens to all children who study literature in their native language in school and to university students and researchers in departments of language and literature. It is also useful if you want to quickly copy a quote from a book to an email, a status update on a social network or a Wikipedia article. Some people would think that copying from a book to a school paper is cheating, but it isn’t; copying another paper about a book may be cheating, but copying quotes from the original book to a paper you’re writing is usually OK and a digitized book just makes it easier and helps you concentrate on the paper.
In the previous point i mentioned copying text to an email from a book. It’s easy if you know what the book is and on which page the text appears. But it’s hard if you don’t know these things, and this happens very often. That’s where searching comes in, but searching works only if the text is digital – it’s very hard for the computer to understand whether an image shows a kitten or a scanned text of a poem, unless a human explains it. (OCR makes it only slightly easier.)
The letters “ht” in “http” and “html”, the names of the central technologies of the web, stand for “hypertext”. Hypertext is a text with links. A printed book only has references that point you to other pages, and then you have to turn pages back and forth. If they point to another book, you’ll have to go the shelf, find it, and turn pages there. Digital texts can be very easily linked to one another, so you’ll just have to click it to see where you are referred. This is very useful in scientific books and articles. It is rarely needed in poetry and stories, but it can be added to them too; for example, you can add a footnote that says: “Here the character quotes a line from a poem by Rabindranath Tagore” and link to the poem.
This one is very simple: Scanned images of texts use much more bandwidth than digital texts. In these days of broadband it may not seem very important, but the gaps between digital texts and images is really huge, and it may be especially costly, in time and in money, to people who don’t have access to broadband.
The above points are relatively easy to understand, but now it starts to get less obvious. Most modern machine translation engines, such Google, Bing and Apertium rely at least partly on pairs of translated texts. The more texts there are in a language, the better machine translation gets. The are many translated parallel texts in English, Spanish, Russian, German and French, so the machine translation for them works relatively well, but for languages with a smaller web presence it works very badly. It will take time until this influence will actually be seen, but it has to begin somewhere.
Linguistic research and education
This is another non-obvious point: Digital texts are useful for linguists, who can analyze texts to find the frequency of words and to find n-grams. Put very simply, n-grams are sequences of words, and it can be assumed that words that frequently come in a sequence probably have some special meaning. Such things are directly useful only to linguists, but the work of linguists is later used by people who write textbooks for language learning. So, the better the digital texts in a language will be, the better textbooks the children who speak that language will get. (The link between advances in linguistic research and school language textbooks was found and described in at least one academic paper by an Israeli researcher.)
Big collections of digital texts in a language can be easily used to make better language software tools, such as spelling, grammar and style checkers.
And all this brings us back to thing from which we began: OCR technology. More digital texts well help developers of OCR software to make it better, because they’ll be able to compare existing images of text with proofread digital texts and use the comparison for testing. This is a wonderful way in which non-developers help developers and vice-versa.
So these are some of the advantages. The work is hard, but the advantages are really big, even if not immediately obvious.
If you have any more questions about Wikisource, please let me know.