How Gboard Could Be Better for Hebrew

Oh (edit): Most of these suggestions are implemented as of February 7 2018. The only significant change that still does not seem to be implemented is the Oleh character. Thank you, Google, for your continued improvements of Gboard.

I mostly use the Gboard app for writing on my phone. The Samsung keyboard is generally not bad, but it doesn’t include Hebrew vowels, and I need them.

However, several characters that Hebrew needs are missing from Gboard, and some unnecessary characters could be removed.

These can be removed:

  • Long-pressing the minus (-) in the punctuation keyboard shows the interpunct (·) and the em dash (—). They are unnecessary for Hebrew. The en dash (–) must not be removed, but see below.
  • The low line (_) appears twice in the punctuation keyboard: as its own key to the left of &, and as an option when long-pressing the minus (-). One of them can be removed. I’ll further argue that the en dash (–) is more useful for Hebrew than the low line (_), and the standalone low line can be replaced with the en dash. The low line is not used much anywhere except programming, while the en dash is useful for typing ranges correctly in Hebrew. I’ll readily admit that not a lot of Hebrew speakers know about the en dash’s correct semantics, but not many more people use the low line anyway.

And these should be added:

  • Maqaf (־, U+05be): It’s the Hebrew hyphen. It has different appearance and different direction semantics. It should be available when you long-press the minus in the main keyboard, and can also appear when you long-press the minus in the punctuation keyboard (for example, instead of the unnecessary em dash).
  • Geresh (׳, U+05f3) and Gershayim (״, U+05f4): These punctuation marks are similar in appearance to quotation marks, but they have different semantics. Apple went as far as replacing quotation marks on Hebrew keyboards on its devices with Geresh and Gershayim, which is an exaggeration. The usual quotation marks (‘, “) are used by most people, even though they are not perfect, and they must stay on Gboard where they are. The elegant Hebrew quotation marks (‚’„”) also appear on Gboard and must not be removed. Geresh and Gershayim can be added on the additional punctuation
  • Rafe (U+05bf): It’s a diacritic that looks like a line above a letter, and the opposite of dagesh, which is already available. It can appear when you long-press the letter resh (ר).
  • Oleh (U+05ab): It’s a diacritic that looks like a left-pointing arrow above a letter, and in modern Hebrew it signifies stress. It can appear when you long-press the letter ayin (ע).
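
To make the code points above easy to double-check, here is a minimal sketch that looks up each proposed character in Python's standard unicodedata module. The dictionary keys are just the names used in this post, and the hyphen-minus is included only for comparison: its bidi class differs from the maqaf's, which is one concrete sense in which the two have different direction semantics.

```python
import unicodedata

# The five characters proposed above, keyed by the names used in this post.
PROPOSED = {
    "maqaf": "\u05be",
    "geresh": "\u05f3",
    "gershayim": "\u05f4",
    "rafe": "\u05bf",
    "oleh": "\u05ab",
}


def describe(ch):
    """Return (code point, Unicode name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))


for label, ch in PROPOSED.items():
    print(label, *describe(ch))

# The ASCII hyphen-minus, for comparison with the maqaf.
print("hyphen-minus", *describe("-"))
```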

The five characters that I suggest adding are already part of the standard Hebrew keyboard (SII 1452), which is implemented in Windows 8. They must also be available in Android.

I hope that Google developers see this and make the necessary changes.

Five More Privileges of English Speakers, part 2: Language and Software

For the previous part in the series, see Five Privileges of English Speakers, part 1.

I’m continuing the series of posts in each of which I write about five privileges that English speakers have without giving it a lot of thought. The examples I give mostly come from my experience translating software, Wikipedia articles, blog posts, and some other texts between English, Hebrew, and Russian. Hebrew and Russian are the languages I know best. If you have interesting examples from other languages, I am very interested in hearing them and writing about them.

I’m writing them mostly as they come into my mind, without a particular order, but the five items in this part of the series will focus on usage of the English language in software, and try to show that the dominance of English is not only a consequence of economics and history, but that it’s further reinforced by features of the language itself.

1. Software usually begins its life in English

English is the main language of software development worldwide.

The world’s best-known place for software development is Silicon Valley, an English-speaking place. That’s the place of Facebook, Google, Apple, Oracle and many others. California is also the home of Adobe.

There are several other hubs of software development in the United States: Seattle (Microsoft, Amazon), North Carolina (Red Hat), New York (IBM, CA), Massachusetts (TripAdvisor, Lotus, RSA), and more. The U.S. is also the source for much of computer science research and education, coming from Berkeley, MIT, and plenty of other schools. The U.S. is also the birthplace of the Internet, originally supported by the U.S. Department of Defense and several American universities. The world wide web, which brought the Internet to the masses, was created in Switzerland by an English speaker.

Software is developed in other countries—India, Russia, Israel, France, Germany, Estonia, and many other countries. But the dominance of the U.S. and of the English language is clear. The reason for this is not only that the U.S. is the source for much of computer technologies, but also—and probably more importantly—that the U.S. is the biggest consumer market for software. So developers in all countries tend to optimize the product for the highest-paying consumers, and these only need English.

When engineers write the user interface of their software in English, they often give no thought to other languages at all, or they make translation possible but complicate it with English-centric assumptions about number, gender, text direction, text size, personal names, and plenty of other things, which will be explored in further points.

2. Terminology

English is also the source for much of the computer world’s terminology. Other languages have to adapt terms like smartphone, network, token, download, authentication, and thousands of others.

Some language communities work hard to translate them all meticulously into native words; Icelandic, Lithuanian, French, Chinese, and Croatian are famous examples. This is nice, but requires effort on behalf of terminology committees, who need to keep up with the fast pace of technological development, and on behalf of the software translators, who have to keep up with the committees.

Some just transliterate most of them: keep the term essentially in English, but rewritten in the native alphabet. Hindi and Japanese are examples of that. This seems easy, but it is based on a problematic assumption: that the target language speakers who will use the software know at least some English! This assumption is correct for the translators, who don’t just know the English terms, but are probably also quite accustomed to them, but it’s not necessarily correct for the end users. Thus, the privilege is perpetuated.

Some languages, such as Hebrew, German, and Russian, are mid-way, with language academics and purists pulling to purer native language, engineers pulling to more English-based words, and the general public settling somewhere in between—accepting the neologisms for some terms, and going for English-based words for others.

For the non-English languages it provides fertile ground for arguments between purists and realists, in which the needs of the actual users are frequently forgotten. All the while, English speakers are not even aware of all this.

3. Easy binary logic word formation

One particular area of computer terminology is binary logic. This sounds complicated, but it’s actually simple: in electronics and software opposite notions such as true / false, success / failure, OK / Cancel, and so forth, are very common.

This translates to a great need for words that express opposites: enable / disable, do / undo, log in / log out, delete / undelete, block / unblock, select / deselect, online / offline, connect / disconnect, read / unread, configured / misconfigured.

Notice something? All of the above words are formed with the same root, with the addition of a prefix (un-, dis-, de-, mis-, a-), or with the words “on” and “off”.

A distinct, but closely related need, is words for repetition. Computers are famously good at doing things again and again, and that’s where the prefix re- is handy: reconnect, retry, redo, retransmit.

These features happen to be conveniently built into the English language. While English has extremely simple morphology for declension and conjugation (see the section “Spell-checking” in part 1 of the series), it has a slightly more complex morphology for word formation, but it’s still fairly easy.

It is also productive. That is, a software developer can create new words using it. For example, the MediaWiki software has the concept of “oversight”—hiding a problematic page in such a way that only users with a particular permission can read it. What happens if a page was hidden by mistake? Correct: “unoversight”. This word doesn’t quite exist elsewhere, but it doesn’t sound incorrect, because familiar English word formation rules were used to coin it.

As always happens, English-speaking software engineers either don’t think about it at all, or think that other languages also have similar word formation rules. If you haven’t guessed it already, this is not true. Some other European languages have similar constructs, but not necessarily as consistent as in English. And for Semitic languages like Hebrew it’s a disaster, because in Semitic languages prefixes are used for entirely different things, and the grammar doesn’t have constructs for repetition and negation. So when translating software user interface strings into Hebrew, we have to use different words as opposites. For example, the English pair connect / disconnect is translated as lehitḥabér / lehitnaték: completely different roots, which Hebrew is just lucky to have. Another option is to use negative words like lo and bilti, or bitul, but they are often unnatural or outright wrong. Having to deal with something like “Mark as unread” is every Hebrew software translator’s nightmare, even though it sounds pretty straightforward in English.

English itself also has pairs of negative words that are not formed using the above prefixes, for example next / previous and open / close, but in many other languages they are much more common.
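
For engineers, the practical upshot of the connect / disconnect example can be sketched in a few lines. This is only an illustration under assumed names: the MESSAGES catalog, its keys, and both helper functions are hypothetical, and the Hebrew entries are the transliterations quoted above rather than strings from any real product. The point is that each member of an opposite pair gets its own translatable message instead of being derived from the other with a prefix.

```python
# Hypothetical message catalog: every opposite is a separate, translatable entry.
MESSAGES = {
    "en": {"connect": "connect", "disconnect": "disconnect"},
    # Transliterations as quoted in the text above.
    "he": {"connect": "lehitḥabér", "disconnect": "lehitnaték"},
}


def fragile_opposite(word, prefix="dis"):
    """English-only shortcut: build the opposite by gluing on a prefix."""
    return prefix + word


def message(lang, key):
    """Look up a message by key; no word is ever derived from another."""
    return MESSAGES[lang][key]


# The shortcut happens to work for English...
print(fragile_opposite(message("en", "connect")))
# ...but produces a non-word for Hebrew, where the lookup gives the real term.
print(fragile_opposite(message("he", "connect")), "vs.", message("he", "disconnect"))
```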

4. Verbing

“Verbing weirds language”, as one of the famous Calvin and Hobbes panels says.

Despite being a funny joke in the comic, it’s a real feature of the English language: because of how English morphology and syntax work, nouns can easily jump into the roles of adjectives and verbs without changing the way they are written.

For English, this is a useful simplification, and it works in labeling, as well as in advertising. “Enjoy Coca-Cola” is something more than an imperative. The fact that it’s a short single word and that it’s the same in all genders and numbers, makes it more usable as a call to action than it would be in other languages. And, other than advertising, where are calls to action very common? Software, of course. When you’re trying to tell a user to do something, a word that happens to be both the abstract concept and the imperative is quite useful.

Perhaps the most famous example of this these days is Facebook’s “Like”. Grammatically, what is it in English? Imperative? A noun describing an abstract action? Maybe a plain old noun, as in “chasing likes” (this is a plural noun; English verbs don’t have a plural form!)? Answer: it’s all of them and more.

When translated to Hebrew in Facebook’s interface, it’s Ahávti, which literally means “I loved it”. Actually, this translation is mostly good, because it’s understandable, idiomatic, and colloquial enough without compromising correctness. Still, it’s a verb, which is not imperative, and it’s definitely not a noun, so you cannot use it in a sentence as if it was a noun. Indeed, Hebrew speakers are comfortable using this button, but when they speak and write about this feature, they just use its English name: “like” (in plural láykim). It even became a slightly awkward, but commonly used verb: lelaykék. Something similar happens in Russian.

It would be impossible in Hebrew and Russian to use the exact same word for the noun and the verb, especially in different persons and genders. Sometimes the languages are lucky enough to be able to adapt an English verb in a way that is more or less natural, but sometimes it’s weird, and hurts the user experience.

5. Word length

This one is relatively simple and not unique to English, but should be mentioned anyway: English words are neither very long, nor very short. Examples of languages where words are, on average, longer than in English, are Finnish, Tamil, German, and occasionally Russian. Hebrew tends to be shorter, although sometimes a single English word has to be translated with several Hebrew words, so it can also get longer. This is true for pretty much any language, really.

In designing interfaces, especially for smaller screens, the length of the text is often important. If a button label is too long, it may overflow from the button, or be truncated, making the display ugly, or unusable, or both.

If you’re an English speaker, it probably won’t happen to you, because almost all software is designed with the word length of your language in mind. Other languages are almost always an afterthought.

The good practice for software engineers and designers is to make sure that translated strings can be longer. Their being shorter is rarely a problem, although sometimes a string is so short that the button may become too small to click or tap conveniently.
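
One cheap way to check this before any translation exists is to pad the English labels and see whether the layout still holds. The sketch below is only an illustration: the function names are made up, and the 40% expansion figure and the 10-character button width are arbitrary assumptions, not measurements of any particular language or design.

```python
def pad_for_layout_test(label, extra_percent=40, filler="~"):
    """Return the label padded to simulate a longer translation.

    The target length is the original length plus extra_percent, rounded up.
    """
    target = (len(label) * (100 + extra_percent) + 99) // 100
    return label + filler * (target - len(label))


def fits(label, max_chars):
    """Rough check: does the label fit into a button of max_chars characters?"""
    return len(label) <= max_chars


for english in ["Save", "Settings", "Mark as unread"]:
    padded = pad_for_layout_test(english)
    print(f"{english!r} -> {padded!r}, fits in 10 chars: {fits(padded, 10)}")
```

Counting characters is itself a simplification; real layouts depend on fonts and rendered width, so a check like this can only flag the most obvious overflows.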


Generally, what can you do about these privileges?

Whoever you are, remember it. If you know English, you are privileged: Software is designed more for you than for people who speak other languages.

If you are a software engineer or a designer, at the very least, make your software translatable. Try to stick to good internationalization practices and to standards like Unicode and CLDR. Write explanations for every translatable string in as much detail as possible. Listen to users’ and translators’ complaints patiently: they are not whining, they are trying to improve your software! The more internationalizable it is, the more robust it is for you as a developer, and for your English-speaking users, too, because better design thinking will be going into each of its components, and fewer problematic assumptions will be made.

Five Privileges of English Speakers, part 1

It’s very common today on progressive blogs to urge people to check their privilege.

Being an English speaker, native or non-native, is a privilege.

It’s not discussed as often as other forms of privilege, such as white, male, cis, hetero, or rich privilege. The reason for this is simple: The world’s media is dominated by the English language. English-language movies are more popular in many countries than movies in these countries’ own languages, English-language news networks are quoted by the rest of the world, the world’s most popular social networks are based in the U.S. and are optimized for U.S. audiences, etc.

So, when English speakers discuss privilege among each other, English is not much of an issue, and they dedicate more time to race, gender, wealth, religion, and other factors that differentiate between people in English-speaking countries.

Despite this, I am not the first one to describe English as a privilege. A simple Google search for english language privilege will yield many interesting results.

What I do want to try to do in this series of posts is to list the particular nuances that make English such a privilege in as much detail as possible. I wanted to write this for a long time, but there are many such nuances, so I’ll just do it in batches of five, in no particular order:

1. Keyboard

If you speak English, congratulations: A keyboard on which your language can be written is available on all electronic devices.

All of them.

All desktops, laptops, phones, tablets, watches. The only notable exception I can think of is typewriters, which only makes the point more tragic: technology moved forward and made writing easier in English, but harder in many other languages, where local-language typewriters were replaced with computers with English-only keyboards.

In the very worst case, writing English on a computer will be slightly inconvenient in countries like Germany, France, or Turkey, where the placement of the Latin letters on the keys is slightly different from the U.S. and U.K. QWERTY standard. Oh, poor American tourists.

On a more serious note, though, even though a lot of languages use the Latin alphabet, a lot of them also use a lot of extra diacritics and special characters, and English is one of the very few that doesn’t. Of the top 100 world’s languages by native speakers, only Malay, Kinyarwanda, Somali, and Uzbek have standardized orthographies that can be written in the basic 26-letter Latin alphabet without any extra characters. We can also add Swahili, which has a large number of non-native speakers, but that’s it. With other languages you can get stuck and not be able to write your language at all (Hindi, Chinese, Russian, etc.), or you may have to write in a substandard orthography because you can’t type letters like é or ł (French, Vietnamese, Polish, etc.).

The above is just the teeny-tiny tip of the iceberg; the keyboard problem will be explored in more points later.

2. Spell-checking

English word morphology is laughably simple.

There’s -s for plurals and for third person present tense verbs, there’s -‘s for possession, and there are -ed and -ing verb forms. There are also some contractions (‘d, ‘s, ‘ll, ‘ve), and a long, but finite list of irregular verb forms, and an even shorter list of irregular plural noun forms. And that’s it.

Most languages aren’t like that. In most languages words change with prefixes, suffixes, infixes, clitics, and so on, according to their role in the sentence.

Beyond the fact that English writing is (arguably) easier for children and foreigners to learn, this means that software tools for processing a language are easy to develop for English and hard to develop for other languages.

The first simple example is spell-checking.

English has had not just spelling, but also grammar and style checkers built into common word processors for decades, and many languages of today don’t even have spelling checkers, not to mention grammar, or style, or convenient searching. (See below.)

So in English, when you type “kinh”, most word processors will suggest correcting it to “king”, but then, some of them may also suggest replacing this word with “monarch” to be more inclusive for women, and this is just one of the hundreds of style improvement suggestions that these tools can make. For a lot of other languages, even simple spell-checking of single words hasn’t been developed yet, and grammar checking is a barely-imaginable dream.

3. Autocompletion

Simpler morphology has many other effects.

Even though Russian is my first native language and I speak it more fluently than I speak English, I am much slower when I’m typing in Russian on my phone. In English, the autocompleting keyboard makes it possible to write just two or three letters of a word and let the software complete the rest. In Russian, the ending of the word must be typed, and autocompletion rarely guesses it correctly. Typing an incorrect ending will make a sentence convey incorrect information, or just make it completely ungrammatical.

4. Searching

As yet another consequence of the previous point, English’s very simple morphology makes searching easier.

For example, word processors have a search and replace function. For English, it will likely find all forms of the word, because there are so few of them anyway. But in Hebrew and Arabic, letters are often inserted or changed in the middle of the word according to its grammatical state, and you need to search for each form, which is quite agonizing. It’s comparable to “man” vs. “men” in English, except that in English such changes are very rare, while in many other languages it happens in almost every word.

With search engines that must find words across thousands of documents it gets even harder. Google can easily figure out that if you’re searching for “drive”, you may also be interested in “driving”, “drove”, and “driven”, but Russian has dozens of other forms for this word. A few languages are lucky: special support was developed for them in search engines, and tasks of this kind are automated, but most languages are just left out in the cold. But English barely needs extra support like this in the first place.
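
The idea of searching for all forms of a word can be sketched as follows. The FORMS table is written by hand purely for illustration, and it adds “drives” to the forms listed above; nothing here is how any search engine or word processor actually works, and a real system would need a morphological analyzer for each language. It does show why English is the easy case: the whole table for a verb fits on one line.

```python
import re

# Hand-written, illustrative form lists. For a heavily inflected language
# each entry would be far longer, or would have to be generated by rules.
FORMS = {
    "drive": ["drive", "drives", "driving", "drove", "driven"],
    "man": ["man", "men"],
}


def find_all_forms(text, lemma):
    """Return, in order, every word of text that is a known form of lemma."""
    forms = set(FORMS.get(lemma, [lemma]))
    return [word for word in re.findall(r"\w+", text.lower()) if word in forms]


print(find_all_forms("She drove home; driving is fun. Drive safely.", "drive"))
```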

5. Very little gender

A lot can be said about gendered language, but as far as basic grammar goes, English has very little in the area of gender. “He” and “She”, and that’s about it. There are also man/woman, actor/actress, boy/girl, etc., but these distinctions are rarely relevant in technology.

In many other languages gender is far more pervasive. In Semitic and Slavic languages, a lot of verb forms have gender. In English, the verb “retweeted” is the same in “Helen retweeted you” and “Michael retweeted you”, but in Hebrew the verb is different. Because Twitter doesn’t know that Helen needs a different verb, it uses the masculine verb there, which sounds silly to Hebrew speakers.

I asked Twitter developers about this many times, and they always replied that there’s no field for gender in the user profile. It becomes more and more amusing lately, now that it has become so common —and for good reasons!— to mention what one’s preferred pronouns are in the Twitter profile bio. So people see it, but computers don’t.

On a more practical note, in the relatively rare cases when third person pronouns must be used in software strings, English will often use the singular “they” instead of “he” or “she”. So English-speaking developers do notice it, but not as often as they should, and when they do, they just use the lazy singular-they solution, which is socially acceptable and doesn’t require any extra coding. If only they’d notice it more often, using their software in other languages would be much more convenient for people of all genders.

The only software packages that I know that have reasonably good support for grammatical gender are MediaWiki and Facebook’s software. I once read that Diaspora had a very progressive solution for that, but I don’t know anybody who actually uses it. There may be other software packages that do, but probably very few.
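
What gender support in a message system means can be sketched roughly like this. It is not the actual mechanism of MediaWiki, Facebook, or Diaspora; the TEMPLATES structure, the render function, and the gender labels are all hypothetical, and the bracketed English notes stand in for the real Hebrew verb forms. The message has one variant per grammatical gender, and there is an explicit fallback for users whose gender the software does not know.

```python
# Stand-in strings: bracketed notes replace the real Hebrew verb forms.
TEMPLATES = {
    "retweeted": {
        "feminine": "{name} retweeted you [feminine verb form]",
        "masculine": "{name} retweeted you [masculine verb form]",
        "unknown": "{name} retweeted you [neutral or fallback form]",
    }
}


def render(key, name, gender=None):
    """Pick the variant for the user's gender, falling back to 'unknown'."""
    forms = TEMPLATES[key]
    template = forms.get(gender, forms["unknown"])
    return template.format(name=name)


print(render("retweeted", "Helen", "feminine"))
print(render("retweeted", "Michael", "masculine"))
print(render("retweeted", "Someone"))
```

The English translation of such a message would simply use the same string for all three variants, which is exactly why English-speaking developers rarely notice that the distinction is needed.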


These are just the first five examples of English-language privilege I can think of. There will be many, many more. Stay tuned, and send me your ideas!

Weird GMail Habit: Removing Control Characters

GMail has a weirdish feature that probably very few people except me know about. When using it with a Hebrew user interface, invisible control characters—LRM, RLM, RLE, LRE and the like—are added to some strings to make them appear correctly in a mixed-direction interface.

Most notably, they are added to email addresses. I sometimes want to copy these email addresses as text, and my mouse pointer picks the control characters as well. Of course, these control characters are by themselves invisible to humans, but very much visible to computers, and an email address with these characters is not correct, even if it appears to be the same to human eyes.

It has already become a habit for me to carefully delete and manually restore the first and the last characters of an email address to make sure that the control characters are removed.

It would be better if GMail just used the <bdi> element or CSS bidi isolation. They are fairly well supported in modern browsers and provide better experience.
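
Until that happens, the manual cleanup described above can be automated. The following is a minimal sketch, assuming that the copied text contains only the usual Unicode directional formatting characters; the function name and the sample address are made up for illustration.

```python
# Invisible directional formatting characters: LRM and RLM, the embedding
# and override controls (LRE, RLE, PDF, LRO, RLO), and the isolate controls
# (LRI, RLI, FSI, PDI).
BIDI_CONTROLS = {
    "\u200e", "\u200f",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_bidi_controls(text):
    """Remove directional control characters, leaving visible text as is."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


# A hypothetical address as it might be copied from a mixed-direction page.
copied = "\u202auser@example.com\u202c"
print(len(copied), len(strip_bidi_controls(copied)))
```

The two strings look identical on screen, but only the stripped one is a valid email address.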

Guess Which Software the Only Hebrew TLD Runs

There already are several TLDs in the Arabic script for several Arab countries. There are no TLDs in the Hebrew script yet, although one will probably soon be created for Israel.

There is, however, a test TLD in Hebrew: “טעסט”. (That’s the word “test” in Hebrew characters and according to Yiddish spelling rules.)

And there’s even an actual working domain in it: http://דוגמה.טעסט. That can be translated as “example.test”. The TLD “טעסט” now appears to the left of “דוגמה”, which is the name, because Hebrew is written right-to-left.

And what happens if you use your browser to go to that domain? It redirects you to http://דוגמה.טעסט/עמוד_ראשי. That string in the end (or in the middle if you will) is the standard Hebrew title of a MediaWiki main page, which you can also see on the Hebrew Wikipedia. The hypothesis that MediaWiki is installed there is proven further by using Google site search on the same domain: http://www.google.com/search?q=site:דוגמה.טעסט. Something in the installation is probably broken, because the pages appear blank, but the page titles can only mean one thing: MediaWiki is, or was, being used to test a Hebrew domain name.
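
Under the hood, a non-ASCII domain name like this one is sent over the network in an ASCII form, in which every non-ASCII label is Punycode-encoded and prefixed with “xn--”. Here is a minimal sketch of that conversion using Python's built-in punycode codec. The helper names are made up, and the sketch deliberately skips the normalization and validation steps that real IDNA processing performs, so treat it as an illustration rather than a resolver-grade implementation.

```python
# The domain from the text above, written with escapes to keep the source
# code free of mixed-direction text: "example" + "." + "test" in Hebrew letters.
DOMAIN = "\u05d3\u05d5\u05d2\u05de\u05d4.\u05d8\u05e2\u05e1\u05d8"


def to_ascii_label(label):
    """Encode one label; plain ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_unicode_label(label):
    """Decode one label if it carries the xn-- prefix."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def to_ascii_domain(domain):
    return ".".join(to_ascii_label(part) for part in domain.split("."))


def to_unicode_domain(domain):
    return ".".join(to_unicode_label(part) for part in domain.split("."))


ascii_form = to_ascii_domain(DOMAIN)
print(ascii_form)
print(to_unicode_domain(ascii_form) == DOMAIN)
```

In the ASCII form the labels are always in the order name, dot, TLD, which is a reminder that the TLD appearing on the left is purely a display matter.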

(This post is based on information from Tomer Cohen of Mozilla Israel.)

Facebook, give me my RLM back, please

Facebook doesn’t allow typing LRM and RLM characters in the status field. These are the Unicode characters “Left-to-Right Mark” and “Right-to-Left Mark”. People who type in right-to-left languages such as Arabic, Persian, Urdu or Hebrew need these characters to make their status updates appear properly aligned. If i try to type any of these characters, they are deleted when i save the message. There is no reason to do this. Facebook engineers, please allow your users to use these characters. Thank you.
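
Why these invisible characters matter can be shown with a small sketch. It assumes that the interface picks the direction of a text from its first strongly directional character, which is a common heuristic, but whether Facebook does exactly this is an assumption; the function below is a simplification of that rule, and the sample status is made up.

```python
import unicodedata

LRM = "\u200e"  # Left-to-Right Mark
RLM = "\u200f"  # Right-to-Left Mark


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' based on the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # no strong characters at all


# A Hebrew status that happens to begin with a Latin-script product name.
status = "Gboard \u05d1\u05e2\u05d1\u05e8\u05d9\u05ea"

print(first_strong_direction(status))
print(first_strong_direction(RLM + status))
```

Prefixing the status with RLM flips the detected direction to right-to-left without changing anything visible, and this is exactly what becomes impossible when the character is deleted on save.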

Language teacher

If you search Google for “language teacher” (מורה ללשון) in Hebrew, the autocompletion suggests “language teacher killed herself” (מורה ללשון התאבדה). The word “teacher” is spelled the same for both genders, but the verb is feminine. I don’t know why it happens, because actually searching for it doesn’t yield anything significant.

In Israeli schools where Hebrew is the medium of teaching, “Language” is the class where the grammar of Hebrew is taught… badly.

Roth

miriamruth11-hp; copyright: Google; based on the original illustration by Ora Ayal

Today the logo appearing at the top of Google.co.il honors Miriam Roth, the author of the famous Hebrew children’s book “A Tale of Five Balloons”. She was born on the 16th of February in 1910.

The Google employee who uploaded the image made a mistake: the filename is “miriamruth”, but it should be “miriamroth”. That’s what happens when there’s no proper way to write the vowels: Her last name is written רות, which is how the Biblical name “Ruth”, still common in modern Israel, is written. But the German last name “Roth” is written the same way, because in Hebrew “u” and “o” are usually written using the same letter, Vav.

There is a way to differentiate the sounds: רוּת is “Ruth” and רוֹת is “Roth”. Notice the placement of the dot in relation to the letter in the middle. The sign for “u” is called shuruk, and the sign for “o” is called holam; i wrote the bulk of the articles about them in Wikipedia. Most people don’t type these signs; usually it’s fairly easy to guess the correct pronunciation, but people don’t use these signs even when it’s needed, as is the case with Ruth/Roth, because typing them on the standard Hebrew keyboard is very hard.
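
The difference between the two spellings is easy to see at the level of code points. In this small sketch, which uses only Python's standard unicodedata module, the two pointed names differ in a single combining mark, and removing the marks collapses both into the same three letters. The function name is made up for illustration.

```python
import unicodedata

RUTH = "\u05e8\u05d5\u05bc\u05ea"  # resh, vav with a dot inside (shuruk), tav
ROTH = "\u05e8\u05d5\u05b9\u05ea"  # resh, vav with a dot above (holam), tav


def strip_points(word):
    """Drop combining marks (vowel points), keeping only the letters."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


for name in (RUTH, ROTH):
    print([f"U+{ord(ch):04X}" for ch in name], "->", strip_points(name))

print(strip_points(RUTH) == strip_points(ROTH))
```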

For years this made me very angry, so i asked the Standards Institute of Israel to develop a new standard keyboard in which it will be easy to type these signs. I was successful at convincing the SII to do it. The work is now underway, and i actively participate in the monthly meetings, together with representatives from Hamakor – the Israeli association for free and open source software, Israel Internet Association, IBM, Microsoft, Apple, Google and other companies. I hope that the standard will be published in 2011; the technical implementation of the keyboard layout will take about ten minutes on each operating system, and shortly after that, i hope, it will be distributed to computers using some kind of an auto-update mechanism.

And then, i hope, we’ll start to see at least slightly richer Hebrew typography everywhere. I want it to happen, not just because it’s a nice tradition, but because this will simply make Hebrew easier to read – and will prevent silly mistakes, like pronouncing and writing “Ruth” instead of “Roth”.


See also: Maqaf.

Unbearable Lightness

I was invited to the 10th anniversary celebration of the Catalan Wikipedia in Perpignan. Perpignan is a city in France, but from the Catalan point of view, it’s in Northern Catalonia – a rather large territory, also known as Roussillon, that was a part of Catalonia, but passed under French rule in 1659. Catalan is still spoken by many people there; how many exactly – i’ll have to see. I hope that it’s spoken by many people for a purely practical reason – my Catalan is much better than my French.

The Catalan Wikipedia is one of the first two Wikipedias created after the English one. The English Wikipedia was created on the 15th of January 2001; German and Catalan were created on the 16th of March 2001. Catalans love to tell that although their Wikipedia was created a few minutes after the German, it was the first one to have an actual article.

Since the Catalan Wikipedia is the oldest and the largest version of Wikipedia in a language which isn’t official in any big country (sorry, Andorra), the people behind it want to share their experiences promoting their language with other regional and minorized languages, and this will be discussed at the event. More details on that later.


Direct El-Al flight from Tel-Aviv to Barcelona – 582 USD. Alitalia via Rome, 2 hours wait for connection – 460 USD. Czech Airlines (ČSA) via Prague, 11 hours wait for connection – 367 USD. Guess which one i picked. ČSA, of course – i pay less and i get to spend a day in Prague! Sorry, El-Al.

If you call the Czech Airlines office in Tel-Aviv, you can choose one of the following languages, in that order: English, Russian, German, Czech, French, Spanish, Italian. No Hebrew or Arabic. Apart from that, however, the service is excellent. I spoke in Russian with the service people and they were very polite, helpful and efficient. They were Czech; they spoke Russian with a slight accent, but it was completely correct and easy to understand. I’ll have to wait for the flight itself to see how it is, but until now my impression is very good.


P.S. Typing the word “Czech” is surprisingly hard.

Componenta

Israeli programmers use many words of English origin when they speak Hebrew. (Many of them prefer to write only in English instead of Hebrew, which is a separate issue.)

When they use these English words, they tend to adapt them to Hebrew pronunciation. Some adaptations are simple, for example “router” is pronounced with an Israeli, rather than English [r] sound (some people – not necessarily purists! – use the Hebrew word נַתָּב [natav] for that). “SQL” is rarely pronounced as “sequel” – usually it’s “ess cue el”, and the same goes for MySQL.

But some are harder to explain. For example, “component” is often pronounced [kompoˈnenta]. I heard it in several companies and i don’t quite understand why. Note the [a] in the end and the stress, too: in English it’s supposed to be something in the area of [kʌmˈpoʊnənt] – on the second syllable, not the third. I have never heard an Israeli programmer pronounce it with correct stress when speaking in English – i always hear it as [ˈkomponənt] – with stress on the first syllable and with [o] sounds in the first two syllables.

The only languages available on Google Translate in which this word is anywhere near [komponénta] are Serbian (компонента), German (Komponente), Romanian (componentă) and Spanish and Italian (componente). It may have something to do with them, but the solution is probably more complicated. Does anyone have any idea?