Archive for the 'lexicography' Category

Amir Aharoni’s Little Take on the Lodestar Affair

In case you haven’t heard, an op-ed called “I Am Part of the Resistance Inside the Trump Administration” was published in the New York Times on September 5. It was allegedly written by an anonymous senior person in the White House, and it made a whole lot of noise in the news.

People immediately started guessing who this is. One of the popular guesses is that it’s vice president Mike Pence, because the article uses the word “lodestar”, which is relatively rare, but unusually common in Pence’s past speeches.

And here’s my tiny, tiny conspiracy theory about it: “lodestar” was Merriam-Webster’s word of the day on August 28. Being a dictionary lover, I listen to Merriam-Webster’s Word of the Day podcast every day using Podcast Addict, a simple RSS-based podcast player. But I didn’t hear this episode. If you try to download it using Podcast Addict, you’ll see that the title is “lodestar”, but in fact it’s the episode for “rubric”, the previous day’s word.

It’s kind of weird, but maybe it’s a total coincidence. Maybe the person who wrote the op-ed just follows the word of the day not through the podcast, but elsewhere on the web. And maybe it has nothing to do with Merriam-Webster, and they are just an educated person who knows words like “lodestar”.

But hey, feel free to spread the rumor that Merriam-Webster is trying to subvert the government, or make up whatever other nonsense you want.

Words for “so”

When I study foreign languages, it’s very important for me to know the words that explain causation. Words like “then”, “therefore”, “so”, “consequently”. And it surprises me that textbooks don’t teach them early. Maybe it says something about how my mind is wired? That to me it’s important to understand the explanations and meanings of everything, and that other people care less about it?

Here are some of those words in a few languages:

  • Catalan: doncs, per això, idò, llavors
  • Spanish: entonces, pues
  • French: donc, puis
  • Russian: поэтому, потому, значит, так что, стало быть
  • Esperanto: do
  • Polish: więc, zatem
  • Italian: perciò, quindi, dunque, pertanto
  • Portuguese: então, por isso, portanto
  • Hebrew: אז, לכן

Can you add it for your language? I tried to find it for Hindi, for example. My textbook (Rupert Snell, “Complete Hindi”) didn’t have anything clear like that in the first few lessons (or maybe I missed it).

I can find some clues at the description of “so” in OmegaWiki, but I’d appreciate human input. Thanks.

Ones and O’s: The Advantages of Digital Texts in Wikisource

I’ve been asked what the advantages are of using Wikisource over simply uploading scanned books to a website. The people who asked me about this speak languages of India, but my replies apply to all languages.

First, what is Wikisource? It’s a sister project of Wikipedia, which hosts freely-licensed documents that were already published elsewhere. The English Wikisource, for example, hosts many books that passed into the public domain, such as Alice in Wonderland, the Sherlock Holmes stories and Gesenius’ Hebrew Grammar (my favorite pet project). It also hosts many other types of texts, for example speeches by US presidents from Washington to Obama, because according to American law they are all in the public domain.

And now to the main question: Why bother to type the texts letter-by-letter as digital texts rather than just scanning them? For languages written in the Latin, Cyrillic and some other scripts this question is less important, because for these scripts OCR technology makes the process half-automatic. It’s never fully automatic, because OCR output always has to be proofread, but it still makes the process easier and faster.

For the languages of India it is harder, because as far as i know there’s no OCR software for them, so they have to be typed letter-by-letter. This is very hard work. What is it good for?

In general, an image of a scanned page is a digital ghost: It is only partially useful to a human and it is almost completely useless to a computer. A computer’s heart only beats ones and O’s – it usually doesn’t care whether an image shows a kitten or a text of a poem.

It’s possible – and easy – to copy a digital text

It’s almost impossible to copy text from a scanned image. You can, of course, use some graphics editing software to cut the text and paste it as an image in your document, but that is very slow and the quality of the output will be bad. Why is it useful to copy text from a book that was already published? It’s very useful to people who write papers about literary works. This happens to all children who study literature in their native language in school and to university students and researchers in departments of language and literature. It is also useful if you want to quickly copy a quote from a book to an email, a status update on a social network or a Wikipedia article. Some people would think that copying from a book to a school paper is cheating, but it isn’t; copying another paper about a book may be cheating, but copying quotes from the original book to a paper you’re writing is usually OK and a digitized book just makes it easier and helps you concentrate on the paper.

Searching

In the previous point i mentioned copying text to an email from a book. It’s easy if you know what the book is and on which page the text appears. But it’s hard if you don’t know these things, and this happens very often. That’s where searching comes in, but searching works only if the text is digital – it’s very hard for the computer to understand whether an image shows a kitten or a scanned text of a poem, unless a human explains it. (OCR makes it only slightly easier.)
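Once the pages exist as digital text, finding a half-remembered quote takes only a few lines of code. The following is a minimal sketch, not a description of how Wikisource search actually works; the function name `find_pages` and the sample page texts are made up for illustration.

```python
def find_pages(pages, query):
    """Return the numbers of the pages whose text contains the query.

    `pages` maps a page number to the digital text of that page.
    The comparison ignores letter case.
    """
    needle = query.casefold()
    return [
        number
        for number, text in sorted(pages.items())
        if needle in text.casefold()
    ]


# Made-up sample pages, only to show the call.
sample_pages = {
    1: "A chapter about a quiet afternoon by the river.",
    2: "Suddenly a white Rabbit ran close by.",
    3: "The rabbit took a watch out of its pocket.",
}

print(find_pages(sample_pages, "rabbit"))
```

Nothing comparable can be done with a folder of scanned images, because the letters in them are only pixels.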

Linking

The letters “ht” in “http” and “html”, the names of the central technologies of the web, stand for “hypertext”. Hypertext is a text with links. A printed book only has references that point you to other pages, and then you have to turn pages back and forth. If they point to another book, you’ll have to go to the shelf, find it, and turn pages there. Digital texts can be very easily linked to one another, so you’ll just have to click a link to see where you are referred. This is very useful in scientific books and articles. It is rarely needed in poetry and stories, but it can be added to them too; for example, you can add a footnote that says: “Here the character quotes a line from a poem by Rabindranath Tagore” and link to the poem.

Bandwidth

This one is very simple: Scanned images of texts use much more bandwidth than digital texts. In these days of broadband it may not seem very important, but the gap between digital texts and images is really huge, and it may be especially costly, in time and in money, to people who don’t have access to broadband.
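A rough back-of-the-envelope calculation shows the size of the gap. The numbers below are assumptions, not measurements: a page is taken to hold 2,000 Devanagari letters, and a scan of the same page is assumed to weigh about 200 KB. Real scans vary a lot with resolution and compression.

```python
def utf8_size_bytes(text):
    """Size of a text in bytes when it is stored as UTF-8."""
    return len(text.encode("utf-8"))


# Assumption: a stand-in page of 2,000 Devanagari letters.
# Each such letter takes 3 bytes in UTF-8.
page_text = "\u0915" * 2000

# Assumption: a hypothetical scanned image of one page, about 200 KB.
ASSUMED_SCAN_BYTES = 200 * 1024

text_bytes = utf8_size_bytes(page_text)
ratio = ASSUMED_SCAN_BYTES / text_bytes

print(f"text: {text_bytes} bytes, scan: {ASSUMED_SCAN_BYTES} bytes")
print(f"the scan is about {ratio:.0f} times bigger")
```

Under these assumptions the image is a few dozen times heavier than the text, and the text can be compressed much further.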

Machine Translation

The above points are relatively easy to understand, but now it starts to get less obvious. Most modern machine translation engines, such as Google, Bing and Apertium, rely at least partly on pairs of translated texts. The more texts there are in a language, the better machine translation gets. There are many translated parallel texts in English, Spanish, Russian, German and French, so machine translation for them works relatively well, but for languages with a smaller web presence it works very badly. It will take time until this influence is actually seen, but it has to begin somewhere.

Linguistic research and education

This is another non-obvious point: Digital texts are useful for linguists, who can analyze texts to find the frequency of words and to find n-grams. Put very simply, n-grams are sequences of words, and it can be assumed that words that frequently come in a sequence probably have some special meaning. Such things are directly useful only to linguists, but the work of linguists is later used by people who write textbooks for language learning. So, the better the digital texts in a language are, the better textbooks the children who speak that language will get. (The link between advances in linguistic research and school language textbooks was found and described in at least one academic paper by an Israeli researcher.)
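To make word frequencies and n-grams a bit more concrete, here is a minimal sketch. It assumes that words are separated by spaces and that punctuation was already removed, which is a big simplification for real texts; the helper names and the sample sentence are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    """Split a text into lowercase words on whitespace (a naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all sequences of n consecutive words as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_counts = Counter(tokens)               # how often each word appears
bigram_counts = Counter(ngrams(tokens, 2))  # how often each pair appears

print(word_counts.most_common(2))
print(bigram_counts.most_common(1))
```

With a large proofread corpus instead of one toy sentence, the same counting starts to show which word sequences really belong together in a language.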

Language tools

Big collections of digital texts in a language can be easily used to make better language software tools, such as spelling, grammar and style checkers.

OCR

And all this brings us back to the thing with which we began: OCR technology. More digital texts will help developers of OCR software to make it better, because they’ll be able to compare existing images of text with proofread digital texts and use the comparison for testing. This is a wonderful way in which non-developers help developers and vice-versa.
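One simple way to make such a comparison is to count how many character edits separate the OCR output from the proofread text. The sketch below computes a plain edit distance and a character error rate; it is only an illustration, the function names are made up, and actual OCR projects may measure quality differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_output, proofread):
    """Edit distance divided by the length of the proofread text."""
    if not proofread:
        raise ValueError("the proofread text must not be empty")
    return edit_distance(ocr_output, proofread) / len(proofread)


# A made-up example: the OCR confused the letter "l" with the digit "1" twice.
print(character_error_rate("he1lo wor1d", "hello world"))
```

The lower the rate on pages that volunteers have already proofread, the better the OCR engine is doing for that language and typeface.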

So these are some of the advantages. The work is hard, but the advantages are really big, even if not immediately obvious.

If you have any more questions about Wikisource, please let me know.

In praise of Wiktionary

The Wikimedia Foundation manages the servers for several projects. Wikipedia gets almost all of the attention, and the others get almost none, even though at least some deserve a lot of it.

My personal favorite is Wikisource, a collection of freely-licensed texts that were already published elsewhere. It is similar to Project Gutenberg, but with somewhat different focus and style.

A multi-volume Latin dictionary (Egidio Forcellini: Totius Latinitatis Lexicon, 1858–87) on a table in the main reading room of the University Library of Graz. Picture taken and uploaded on 15 Dec 2005 by Dr. Marcus Gossler (license: CC-BY-SA). This is the illustration in the English Wiktionary entry "dictionary".

But there’s another project, which deserves more and more attention and praise as the years go by: Wiktionary. Even though i love printed and digital dictionaries, i never became a frequent editor of Wiktionary for two reasons. The first reason is software: MediaWiki runs Wikipedia and all the other Wikimedia projects. It is quite well suited for Wikipedia, which thrives with long encyclopedic articles sorted in a very liberal tree of categories. It’s much less suited for a dictionary, which requires a rather different model of storing, linking and sorting the entries. Some attempts were made to improve this, for example, the many templates and gadgets developed locally in the English Wiktionary and the OmegaWiki project. Both of them have nice ideas that go in the right direction, but still have many implementation problems.

The second reason is problematic methodology. It’s a hard problem to explain, but i’ll try: Writing a good dictionary is a lot harder than writing a good encyclopedia. When you are writing an encyclopedia, you can base your article on one or more reliable sources about the nature and the history of a certain subject. The limits of what needs to be described in an encyclopedic article, at least for important subjects and fairly well-known people, are generally easy to determine. Dictionary compilation works entirely differently: to make a good dictionary, the editor must possess a large and representative collection of texts in a given language, find all instances of a given word, sort them into groups and describe the usage of the word. Such resources are very hard to find, and there are very few people who have the needed qualification to use them well.

Despite these problems, i find myself using Wiktionary quite often. Here are a few things for which i actually use Wiktionary repeatedly and successfully:

  • English Internet acronyms: AFAICT, TTYL, IRL, FTW, AYBABTU. They often appear in emails and chat sessions, they are legitimate dictionary terms, and the Wiktionary definitions for them are usually accurate.
  • Catalan, Spanish and Italian verb conjugation tables: I am learning these languages, and i find the verb conjugation tables in Wiktionary complete and very easy to use. I have no reason to think that they have mistakes.
  • Studying Dutch. I studied Dutch for a couple of months a year ago. Unfortunately i couldn’t find the time to go on with it – i hope to come back to it! – but while i did it, i intentionally tried to use the Dutch Wiktionary to find words in the translation tasks that i got as homework. I found all the needed words easily and the explanations and the translations were clear and helpful. Of course, words in homework for beginners are probably simple, but then beginners are probably the most important and frequent users of dictionaries. In any case, the Dutch Wiktionary did the job very well.

Another advantage that Wiktionary has over other paper and digital dictionaries is that it is very richly illustrated. Paper dictionaries usually have few illustrations, if any, because they want to save paper. Commercial digital dictionaries also have few illustrations because their publishers don’t want to pay a lot of money to photographers and designers. Wiktionary doesn’t have either of these problems: Wikipedia is very richly illustrated thanks to the enormous number of images contributed by people, and Wiktionary has direct and easy access to the Wikimedia Commons – the same repository of Free images, sounds and video that is used by Wikipedia. And of course, Wiktionary is not made of paper.

So there: Wiktionary may still not be as strong as Wikipedia in completeness and in popularity, but it definitely deserves attention. And the people who work on it despite the enormous difficulties deserve a lot of praise.

Phonecalls

Nobody answers my phonecalls. Nobody.


Plus, the word “phonecall” doesn’t appear in Firefox’s spelling dictionary, in Google Translate, in the Merriam-Webster Collegiate Dictionary, or even in the Oxford English Dictionary online. I guess that i should write “phone call”, but there are over 400,000 Google hits for “phonecall”. OED, MW, and Google, please wake up.

Houaiss Unicode: Portuguese vs. Hebrew

I bought the Houaiss dictionary of the Portuguese language.

It is very good, with some features that i haven’t seen in any other dictionary. For example, if you search for “gato” (cat), you’ll find a list of collective nouns for cats – bichanada, gataria. I am not familiar with an English dictionary that points me from “cat” to “pack”. It also lists the sounds that cats make – berrar, miar, roncar, ronronar, miada, miado, miau, mio, rom-rom, roufenho and many others. This feature exists for other animals, too.

It also has etymologies, synonyms, paronyms, antonyms, date of first usage, similar-sounding words, and many other lovely features.

I bought a paper edition with a CD-ROM. To install it from the CD-ROM i need to type an obnoxious serial number, but i can live with that. It also works only on Windows, but i can live with that, too, even though i am terribly ashamed of it.

But it does have one particularly obnoxious mis-feature: it doesn’t support Unicode. So i sent them this email:

Hello,

I am only a student of the Portuguese language and i don’t write it so well yet. Feel free to reply in Portuguese.

I bought the Houaiss dictionary, versão monousuário 1.0 junho de 2009. I installed it on my Windows XP PC and i was very disappointed to find out that most of this program doesn’t support Unicode. You probably programmed the strings in some kind of an ANSI encoding and not in Unicode.

I live in Israel and my computer is set to display non-Unicode programs in Hebrew. If you don’t know what i am talking about – in Windows XP, take a look at Control Panel -> Regional and Language Options -> Advanced -> Language for non-Unicode programs. Unfortunately i still have to use some old non-Unicode programs for my work, and these programs need to display Hebrew. To change this setting, i need to reboot the computer, which is very inconvenient, and since i use this computer for work most of the time, i am forced to see Hebrew letters instead of the special Portuguese characters ã, õ, ç etc. in the Houaiss program. Take a look at the attached image to see how it looks on my machine.

Strangely enough, the central pane, where the dictionary article appears, works correctly. For example, the word “Derivação” appears with the right letters. But all the rest is broken: the word list on the left, the Parפnimos (Parônimos) tab at the bottom, the Acepחץes (Acepções) tab at the top appear with Hebrew characters. Hebrew characters also appear in the About box (Ajuda->Sobre) and in the installation program. In the menu itself question marks appear instead of special Portuguese characters: “Conjuga??o” instead of “Conjugação”. They appear as question marks even if i change the setting of “Language for non-Unicode programs” to “Portuguese (Brazil)”!

Note also this: Since the wordlist on the left doesn’t work correctly, i can’t easily search for words which include special characters. For example, if i want to search for the word “parônimo”, i try to type “p-a-r-o-n…”, and the program doesn’t get anywhere near “parônimo”, because you treat ‘o’ and ‘ô’ as different characters. So i need to scroll to it manually.

Besides this very annoying Unicode bug, i am very happy about the dictionary itself, so can you please fix this, so that my satisfação with it would be complete? In 2010 there are no more reasons to produce non-Unicode software. Besides, Windows 2000, which supports Unicode, is listed as a technical requirement to run the program.

Thanks in advance!

I sent this email to producao@objetiva.com.br and immediately received three identical replies from three different email addresses with human names, asking me to confirm that i am a person and not a spam robot by replying. I replied to one of them and received a confirmation that i am not a spam robot. Good to know. Now please fix the Unicode support in your dictionary. It will take one day, including cafezinho breaks and a sword-fight.
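The symptoms described in the email can be reproduced in a few lines of Python. This is only a sketch under assumptions: it guesses that the program stores its strings in the Western European code page (cp1252) and that Windows, when set to Hebrew for non-Unicode programs, reads the same bytes as cp1255. The function names are made up for illustration and say nothing about how the Houaiss program is really written.

```python
import unicodedata


def mojibake(text, written_as="cp1252", read_as="cp1255"):
    """Encode text in one 8-bit code page and read the bytes in another."""
    return text.encode(written_as).decode(read_as)


def lossy_convert(text, target="cp1255"):
    """Convert text to a code page that lacks some of its characters.
    Characters that don't exist there become question marks."""
    return text.encode(target, errors="replace").decode(target)


def fold_accents(text):
    """Remove diacritics, so that 'ô' and 'o' compare as equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_by_prefix(words, typed):
    """Return the words that start with what was typed, ignoring accents."""
    key = fold_accents(typed)
    return [word for word in words if fold_accents(word).startswith(key)]


# Assumption: the byte for 'ô' in cp1252 is shown as a Hebrew letter in cp1255.
print(mojibake("Parônimos"))
print(mojibake("Acepções"))

# Assumption: the menu text passes through a lossy conversion somewhere.
print(lossy_convert("Conjugação"))

# Typing "paron" should be enough to reach "parônimo" in a word list.
print(find_by_prefix(["parolo", "parônimo", "parótida"], "paron"))
```

With Unicode strings none of the first three problems can happen, and the last function shows that accent-insensitive lookup in the word list is not much work either.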

How do you look up words in a Hebrew dictionary?

For this post i would like to get as many comments as possible. If you are more comfortable reading or writing in Russian or in Hebrew, please see:

What is difficult for you?

Is it difficult to find the root of the word? (This is relevant mostly for verbs, but in some dictionaries also for nouns.) How do you prefer to search for verbs – by the root, by the infinitive, by the past (perfect) tense, by the present (participle) tense?

Is it hard for you to separate the prefixes (conjunctions, prepositions) and the suffixes (tense, possession)?

Do you have any trouble reading Hebrew with or without vowel points (niqqud)? Do you need transcription in easy-to-read Latin characters or in IPA?

Do you understand abbreviations such as vt, n.pr.m., adv., impv., זו”נ‎, פעו”י‎, מ”ג‎, נ”ר? Do you notice them at all? Do they bother you in any way?

Do you remember any words that were particularly hard to find? Words or expressions for which you had to open several dictionaries? Words that you couldn’t find at all, anywhere?

Do you have any particular problems with the usage of the letters א‎, ו‎, י for vowels? If you can’t find the word תוכנה, do you know that you should try searching for תכנה? Is there a dictionary that you prefer, because it has a system for the usage of these letters that you like?
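For an electronic dictionary, the retry described above could in principle be automated. The following is only a rough sketch: it produces candidate spellings by optionally dropping each ו or י that stands inside the word, and it ignores א and all the real rules of Hebrew orthography. The function name is made up for illustration.

```python
from itertools import product

VAV = "\u05d5"  # the letter ו
YOD = "\u05d9"  # the letter י


def spelling_variants(word):
    """Return the word together with every spelling obtained by
    dropping some of its interior vav and yod letters."""
    choices = []
    for index, letter in enumerate(word):
        interior = 0 < index < len(word) - 1
        if interior and letter in (VAV, YOD):
            choices.append((letter, ""))  # keep the letter or drop it
        else:
            choices.append((letter,))     # always keep it
    return {"".join(combination) for combination in product(*choices)}


# תוכנה, written with escapes to avoid problems with text direction.
with_vav = "\u05ea\u05d5\u05db\u05e0\u05d4"
print(sorted(spelling_variants(with_vav)))
```

A dictionary program could look up every candidate and show whichever entries exist, so that the reader doesn’t have to guess the spelling system.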

Do you have a preferred dictionary in general or a dictionary that you don’t like? Why? I am talking about mono- and bi-lingual ones, and about printed and electronic: Even-Shoshan, Ben-Yehuda, Gur, Ariel, BDB, Rav-Millim, Alkalai, Sapir, Ha-hove, Morfix etc.

These questions may seem a bit generic, but i am curious mostly about the aspect of using the dictionary and not general language difficulties.

Please write whatever comes to your mind, even if you think that it is embarrassing or too simple. Feel free to answer anonymously or to email me at amir.aharoni@mail.huji.ac.il.

Many, many thanks in advance.

