Ones and O’s: The Advantages of Digital Texts in Wikisource

I’ve been asked what the advantages are of using Wikisource over simply uploading scanned books to a website. The people who asked me about this speak languages of India, but my replies apply to all languages.

First, what is Wikisource? It’s a sister project of Wikipedia that hosts freely licensed documents that were already published elsewhere. The English Wikisource, for example, hosts many books that passed into the public domain, such as Alice in Wonderland, the Sherlock Holmes stories and Gesenius’ Hebrew Grammar (my favorite pet project). It also hosts many other types of texts, for example speeches by US presidents from Washington to Obama, because according to American law they are all in the public domain.

And now to the main question: Why bother to type the texts letter-by-letter as digital texts rather than just scanning them? For languages written in the Latin, Cyrillic and some other scripts this question is less important, because for these scripts OCR technology makes the process half-automatic. It’s never fully automatic, because OCR output always has to be proofread, but it still makes the process easier and faster.

For the languages of India it is harder, because as far as i know there’s no OCR software for them, so they have to be typed letter-by-letter. This is very hard work. What is it good for?

In general, an image of a scanned page is a digital ghost: It is only partially useful to a human and it is almost completely useless to a computer. A computer’s heart only beats ones and O’s – it usually doesn’t care whether an image shows a kitten or a text of a poem.

It’s possible – and easy – to copy a digital text

It’s almost impossible to copy text from a scanned image. You can, of course, use some graphics editing software to cut the text and paste it as an image in your document, but that is very slow and the quality of the output will be bad. Why is it useful to copy text from a book that was already published? It’s very useful to people who write papers about literary works. This applies to all children who study literature in their native language in school and to university students and researchers in departments of language and literature. It is also useful if you want to quickly copy a quote from a book to an email, a status update on a social network or a Wikipedia article. Some people would think that copying from a book to a school paper is cheating, but it isn’t; copying another paper about a book may be cheating, but copying quotes from the original book to a paper you’re writing is usually OK, and a digitized book just makes it easier and helps you concentrate on the paper.

Searching

In the previous point i mentioned copying text to an email from a book. It’s easy if you know what the book is and on which page the text appears. But it’s hard if you don’t know these things, and this happens very often. That’s where searching comes in, but searching works only if the text is digital – it’s very hard for the computer to understand whether an image shows a kitten or a scanned text of a poem, unless a human explains it. (OCR makes it only slightly easier.)
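To make the point above concrete, here is a minimal sketch in Python of finding a half-remembered quote in a transcribed book. The `pages` dictionary and the `find_quote` function are hypothetical names made up for this illustration, and real Wikisource search works differently; the only point is that this kind of lookup is trivial for text and impossible for a plain image.

```python
def find_quote(pages, phrase):
    """Return (page_number, line) pairs for lines that contain the phrase.

    The comparison ignores letter case, so "would the fall" also
    matches "Would the fall".
    """
    needle = phrase.casefold()
    hits = []
    for page_number, text in pages.items():
        for line in text.splitlines():
            if needle in line.casefold():
                hits.append((page_number, line.strip()))
    return hits


# Hypothetical transcribed pages, keyed by page number (sample text only).
pages = {
    1: "Alice was beginning to get very tired\nof sitting by her sister on the bank",
    2: "Down, down, down.\nWould the fall never come to an end?",
}

for page_number, line in find_quote(pages, "would the fall"):
    print(page_number, line)
```

With a scanned image there is nothing to iterate over: the computer only sees pixels until a human or an OCR program turns them into characters.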

Linking

The letters “ht” in “http” and “html”, the names of the central technologies of the web, stand for “hypertext”. Hypertext is a text with links. A printed book only has references that point you to other pages, and then you have to turn pages back and forth. If they point to another book, you’ll have to go to the shelf, find it, and turn pages there. Digital texts can be very easily linked to one another, so you’ll just have to click a link to see where you are referred. This is very useful in scientific books and articles. It is rarely needed in poetry and stories, but it can be added to them too; for example, you can add a footnote that says: “Here the character quotes a line from a poem by Rabindranath Tagore” and link to the poem.

Bandwidth

This one is very simple: Scanned images of texts use much more bandwidth than digital texts. In these days of broadband it may not seem very important, but the gap between digital texts and images is really huge, and it may be especially costly, in time and in money, to people who don’t have access to broadband.
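A rough back-of-the-envelope sketch in Python shows the size of the gap. Both constants are assumptions chosen only for illustration, not measurements: real pages and real scans vary a lot. The sample uses a Devanagari letter, which takes 3 bytes in UTF-8.

```python
# Assumed figures, for illustration only.
CHARS_PER_PAGE = 2000          # assumption: characters on one printed page
SCAN_BYTES_PER_PAGE = 300_000  # assumption: one compressed page scan, about 300 KB


def text_bytes(text):
    """Size of a text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# A page filled with a Devanagari letter; each one is 3 bytes in UTF-8.
sample_page = "क" * CHARS_PER_PAGE
page_bytes = text_bytes(sample_page)
ratio = SCAN_BYTES_PER_PAGE / page_bytes

print(page_bytes)  # bytes needed for the page as digital text
print(ratio)       # how many times larger the assumed scan is
```

Under these assumptions the scanned page is tens of times larger than the same page as text, and that is before the text is compressed.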

Machine Translation

The above points are relatively easy to understand, but now it starts to get less obvious. Most modern machine translation engines, such as Google, Bing and Apertium, rely at least partly on pairs of translated texts. The more texts there are in a language, the better machine translation gets. There are many translated parallel texts in English, Spanish, Russian, German and French, so the machine translation for them works relatively well, but for languages with a smaller web presence it works very badly. It will take time until this influence is actually seen, but it has to begin somewhere.

Linguistic research and education

This is another non-obvious point: Digital texts are useful for linguists, who can analyze texts to find the frequency of words and to find n-grams. Put very simply, n-grams are sequences of words, and it can be assumed that words that frequently come in a sequence probably have some special meaning. Such things are directly useful only to linguists, but the work of linguists is later used by people who write textbooks for language learning. So, the better the digital texts in a language are, the better textbooks the children who speak that language will get. (The link between advances in linguistic research and school language textbooks was found and described in at least one academic paper by an Israeli researcher.)
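Word frequencies and n-grams, as described above, can be counted in a few lines of Python once the text is digital. This is only a sketch with a made-up sample sentence; real linguistic work would also need proper tokenization, which simple splitting on spaces does not provide.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Made-up sample text; splitting on whitespace is a simplification.
text = "the cat sat on the mat and the cat slept on the mat"
words = text.split()

word_freq = Counter(words)                # how often each word appears
bigram_freq = Counter(ngrams(words, 2))   # how often each pair of words appears

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
```

Pairs that come up again and again, like "the cat" in this sample, are the kind of sequences that linguists look at more closely.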

Language tools

Big collections of digital texts in a language can be easily used to make better language software tools, such as spelling, grammar and style checkers.

OCR

And all this brings us back to the thing with which we began: OCR technology. More digital texts will help developers of OCR software to make it better, because they’ll be able to compare existing images of text with proofread digital texts and use the comparison for testing. This is a wonderful way in which non-developers help developers and vice-versa.
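The comparison mentioned above can be sketched with Python's standard `difflib` module. The function names and the sample strings are made up for this illustration, and real OCR evaluation usually uses more careful metrics such as character error rate; the idea is only that a proofread text gives developers something to measure OCR output against.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text, proofread_text):
    """Similarity between 0.0 and 1.0; 1.0 means the texts are identical."""
    matcher = SequenceMatcher(None, ocr_text, proofread_text, autojunk=False)
    return matcher.ratio()


def differing_spans(ocr_text, proofread_text):
    """Return (ocr_fragment, proofread_fragment) pairs where the texts differ."""
    matcher = SequenceMatcher(None, ocr_text, proofread_text, autojunk=False)
    return [
        (ocr_text[i1:i2], proofread_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up example: the OCR program misread "h" as "b".
ocr_output = "Tbe cat"
proofread = "The cat"

print(ocr_similarity(ocr_output, proofread))
print(differing_spans(ocr_output, proofread))
```

A list of such differences, collected over many proofread pages, tells OCR developers which letters their software confuses most often.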

So these are some of the advantages. The work is hard, but the advantages are really big, even if not immediately obvious.

If you have any more questions about Wikisource, please let me know.

10 thoughts on “Ones and O’s: The Advantages of Digital Texts in Wikisource”

  1. These are the features I’ve tried to make the National Library of Belarus aware of. Unfortunately, the people from NLB I was working with were told to stop such “innovative collaboration”. Belarus is so Belarus…

  2. Interesting post on the benefits of digitization. What works against it is the lack of a creative aspect in digitization, not even on par with contribution to other wiki projects.

    1. Yes, i understand that some people find it boring. I’m going to write a detailed post about that, too, but basically, you are supposed to pick a text that you like in the first place or which you wanted to read for a while, and then instead of just reading or re-reading it, you type it, too. But there’s more than that.

  3. Really good post, sir. As a frequent user of Wikisource, I have another advantage of digitization to add: the protection and reproduction of books that are very old and are no longer being published. One good example of that is the whole lot of Sanskrit texts.

  4. Thanks!

    For my MA thesis I work on Ælfric’s Lives of Saints. Since there is no adequate digitised edition for it, I have to create one on my own, typing the whole book (hundreds of pages in Old English!). You’ve convinced me to upload the result to Wikisource, instead of hosting it on my website. I’ll do it as soon as I finish the first volume (of two).

  5. Re: “typing the text in”. We don’t have to do that in several of the language domains. The proofread extension in Wikisource means that we can take OCRed text and correct that. This means that we have the best of both worlds – the scan is available to those who want to verify the original text and a searchable cleaned up digital copy is available to those who need it in that format.

  6. I’d point out an additional benefit found in the Hebrew Wikisource domain: the ability to calculate the gematriyot of words in texts. Very useful in certain genres of texts.
