The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

[Image: Emmanuel Macron on Twitter]

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example from the Euroradio account:

[Image: two tweets by Еўрарадыё (euroradio) on Twitter]

Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian or Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.
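The letter-based rule of thumb described above can be sketched in a few lines of Python. This is only an illustration of the reasoning in this post, not a description of how Twitter or Bing identify languages. The function name `guess_cyrillic_language` and the returned language codes are assumptions made for the example, and a real language identifier would weigh much more evidence than four letters.

```python
import unicodedata

# Distinguishing letters mentioned in the post, written as escapes so that
# they cannot be confused with look-alike Latin letters.
SHORT_U = "\u045e"       # ў
DOTTED_I = "\u0456"      # і (Cyrillic, not Latin i)
UKRAINIAN_YE = "\u0454"  # є
HARD_SIGN = "\u044a"     # ъ


def guess_cyrillic_language(text):
    """Return "be", "uk", "ru", or None using only the letter hints above."""
    # NFC turns "у" followed by a combining breve into the single letter ў.
    letters = set(unicodedata.normalize("NFC", text).lower())
    if SHORT_U in letters and DOTTED_I in letters:
        return "be"
    if UKRAINIAN_YE in letters:
        return "uk"
    if HARD_SIGN in letters:
        return "ru"
    return None  # these letters alone do not give enough evidence
```

A text that has ⟨ў⟩ but no ⟨і⟩ returns `None` here, because by the reasoning above it could be Uzbek or just a short Belarusian text.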

And yet, Twitter’s machine translation offers to translate the top tweet from Ukrainian, and the bottom one from Russian!

An even stranger thing happens when you actually try to translate it:

[Image: a single tweet by Еўрарадыё (euroradio) on Twitter, translated from “Russian”]

Notice two weird things here:

  1. After clicking, “Ukrainian” turned into “Russian”!
  2. Since the text is actually written in Belarusian, trying to translate it as if it were Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. Notice that the letter ⟨ў⟩ could not even be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

[Image: tweets with replies by Ntụ Agbasa (blossomozurumba) on Twitter]

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with the addition of diacritical marks, one of the most common of which is the dot below, as in the word ibụọla in this Igbo tweet and the words chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying the language of a text by this feature alone is really not great.

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the tweet, but this does show a curious thing: even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo is supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to skip it completely and offer no translation at all than to offer this strange and meaningless thing. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but the grammar of the three languages is so similar that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, for the languages that they support, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.

Another thing for Twitter to try is to let users specify which languages they write in. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. It could also let users say explicitly which languages they write in. This would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.
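The two suggestions above (offer nothing when the language is unsupported, and let declared languages guide identification) can be sketched together. This is a hypothetical illustration, not Twitter’s or Bing’s actual logic: the function `translation_offer`, its parameters, and the scores in the usage note are all assumptions made for the example.

```python
def translation_offer(candidates, supported, declared=()):
    """Pick a source language to offer translation from, or None.

    candidates: list of (language_code, score) pairs from some identifier.
    supported:  set of language codes the translation engine can handle.
    declared:   languages the author says they write in (optional).
    """
    if not candidates:
        return None
    # Prefer the author's declared languages when any candidate matches them;
    # otherwise fall back to everything the identifier suggested.
    preferred = [c for c in candidates if c[0] in declared]
    pool = preferred or candidates
    best, _score = max(pool, key=lambda c: c[1])
    # Offer nothing rather than translating from a wrong or unsupported language.
    return best if best in supported else None
```

With made-up scores such as `[("uk", 0.45), ("ru", 0.35), ("be", 0.2)]` and an engine that supports only Russian and Ukrainian, the function offers “uk” when nothing is declared, but returns `None` when the author has declared Belarusian, so no meaningless translation is offered.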

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.


A Relevant Tower of Babel

The Tower of Babel is frequently used as a symbol of foreign languages. For example, several language software packages are named after it, such as the Babylon electronic dictionary, MediaWiki’s Babel extension and the Babelfish translation service (itself named after the Babel fish from The Hitchhiker’s Guide).

In this post I shall use the Tower of Babel in a somewhat more relevant and specific way: It will speak about multilingualism and about Babel itself.

This is how most people saw the Wikipedia article about the Tower of Babel until today:

The Tower of Babel article. Notice the pointless squares in the Akkadian name. They are called “tofu” in the jargon of internationalization programmers.

And this is how most people will see it from today:

And we have the name written in real Akkadian cuneiform!

Notice how the Akkadian name now appears as actual Akkadian cuneiform, and not as meaningless squares. Even if you, like most people, cannot actually read cuneiform, you probably understand that showing it this way is more correct, useful and educational.

This is possible thanks to the webfonts technology, which was enabled on the English Wikipedia today. It had already been enabled for many months in Wikipedias in some other languages, mostly languages of India, which have severe problems with font support in the common operating systems. Now it’s available in the English Wikipedia, where it mostly serves to show parts of text that are written in exotic scripts.

The current iteration of the webfonts support in Wikipedia is part of a larger project: the Universal Language Selector (ULS). I am very proud to be one of its developers. My team in Wikimedia developed it over the last year or so, during which it underwent a rigorous process of design, testing with dozens of users from different countries, development, bug fixing and deployment. In addition to webfonts it provides an easy way to pick the user interface language, and to type in non-English languages (the latter feature is disabled by default in the English Wikipedia; to enable it, click the cog icon near “Languages” in the sidebar, then click “Input” and “Enable input tools”). In the future it will provide even more abilities, so stay tuned.

If you edit Wikipedia, or want to try editing it, one way in which you could help with the deployment of webfonts would be to make sure that all foreign strings in Wikipedia are marked with the appropriate HTML lang attribute; for example, that every Vietnamese string is marked as <span lang="vi" dir="ltr">. This will help the software apply the webfonts correctly, and in the future it will also help spelling and hyphenation software, etc.

This wouldn’t be possible without the help of many, many people: the developers of Mozilla Firefox, Google Chrome, Safari, Microsoft Internet Explorer and Opera, who developed the support for webfonts in these browsers; the people in Wikimedia who designed and developed the ULS (Alolita Sharma, Arun Ganesh, Brandon Harris, Niklas Laxström, Pau Giner, Santhosh Thottingal and Siebrand Mazeland); the many volunteers who tested ULS and reported useful bugs; the people in Unicode, such as Michael Everson, who work hard to give a number to every letter in every imaginable alphabet and make massive online multilingualism possible; and last but not least, the talented and generous people who developed all those fonts for the different scripts and released them under Free licenses. I send you all my deep appreciation, as a developer and as a reader of Wikipedia.

Firefox Aurora – Mozilla’s biggest breakthrough since Firefox itself

This post encourages you to be a little more adventurous. Please try doing what it says, even if you don’t consider yourself a techie person.

The release of Firefox 4 in March 2011 brought many noticeable innovations in the browser itself, but there was another important innovation that was overlooked and misunderstood by many: A new procedure for testing and releasing new versions.

Before Firefox 4, the release schedule of the Firefox browser was inconsistent and versions were released “when they were ready”. Beta versions were released at rather random dates and quite frequently they were unstable. Nightly builds were appropriately called “Minefield” – they crashed so often that it was impossible to use them for daily web browsing activities.

The most significant breakthrough with regards to the testing of the Firefox browser came a year ago: Mozilla decided on a regular six-week release schedule and introduced the “release channels”: Nightly, Aurora, Beta and Release. The “Release” version is what most people download and use. “Beta” could be called a “Release candidate” – few, if any, changes are made to it before it becomes “Release”. Both “Aurora” and “Nightly” are updated daily and the differences between them are that “Nightly” has more experimental features that come right from the developers’ laptops and that “Aurora” is usually released with translations to all the languages that Firefox supports, while “Nightly” is mostly released in English.

Now here’s the most important part: I use Aurora and Nightly most of the time and my own experience is that both of them are actually very stable and can be used for daily browsing. It’s possible to install all the versions side-by-side on one machine and to have them use the same add-ons, preferences, history and bookmarks. This makes it possible for many testers to fully use them for whatever they need the browser for in their life without going back to the stable version. There certainly are surprises and bugs in functionality, but i have yet to encounter one that would make me give up. In comparison, in the old “Minefield” builds the browser would often crash before a tester would even notice these bugs, so they were not so useful for testing.

This change is huge. Looking back at the year of this release schedule, this may be the biggest breakthrough in the world of web browsers since the release of Firefox 1.0 in 2004. In case you forgot, before Firefox was called “Firefox”, it was just “Mozilla”; it was innovative, but too experimental for the casual user: it had a clunky user interface and it couldn’t open many websites, which were built with only Microsoft Internet Explorer in mind. Consequently, it was frequently laughed at. “Firefox” was an effort to take the great innovative thing that Mozilla was, clean it up and make it functional, shiny, inviting and easy to install and use. That effort was an earth-shaking success that revived competition and innovation in Internet technologies.

Aurora does to software testing what Firefox did to web browsing. It makes beta testing easy and fun for many people – it turns testing from a bug hunting game that only nerds want to play into a fun and unobtrusive thing that anybody can do without even noticing. And it is yet another thing that the Mozilla Foundation does to make the web better for everybody, with everybody’s participation.

A few words about Mozilla’s competitors: The Google Chrome team does something similar with what they call “Canary builds”. I use them to peek into the future of Chrome and i occasionally report bugs in them, but i find them much less stable than Firefox Nightly, so they aren’t as game-changing. Just like Minefield from Mozilla’s distant past, they crash too often to be useful as a daily web browser, so i keep going back to Firefox Aurora. Microsoft releases new versions of Microsoft Internet Explorer very rarely and installing future test versions is way too hard for most people, so it’s not even in the game. Opera is in the middle: It releases new versions of its browser quite frequently and offers beta builds for downloading, but it doesn’t have a public bug tracking system, so i cannot really participate in the development process.

To sum things up: Download Firefox Aurora and start using it as your daily browser and report bugs if you find any. You’ll see that it’s easier than you thought to make the Web better.

Why Google Chrome Will Make the Web Worse Than Television

I know very few people who still watch television.

Television is boring, pointless and hopelessly outdated. For some reason millions of people still watch it, but it’s a matter of time until the whole industry crumbles like the governments of the USSR and Libya did, and we shall wonder why it took so long. It will be painful to some people who make their living from it, but it will happen.

The future of entertainment and broadcasting is shaping now, and the direction is not bad. With each version of the modern web browsers – Firefox, Chrome and Opera – embedding video into pages is getting easier and works better. Users are forced less and less to install proprietary and unstable plugins. Flash is becoming a thing of the past, with YouTube working without it just as well. Diverse people create excellent music and films in their homes and they are able to publish it instantly. Business models for getting people to pay for DRM-free video and music are improving, too, for everybody’s benefit.

For some reason, however, Google and Microsoft aren’t happy about these perfectly sensible developments. They are proposing to add DRM – Digital Restriction Management – to the HTML standard. This weird document says that “No ‘DRM’ is added to the HTML5 specification”, but a document that speaks about encrypting and “protecting” content is a document about DRM. This is not “protection”, but restriction, and it is defective by design.

Preventing the copying of music and video files is not actually important to Google or to the media production companies. They will find ways to charge money for music and video. They rather want to know who is listening to what, to know what to produce and to whom to sell it. Google is essentially an advertising company, and an advertising company’s biggest asset is demographic data about people’s tastes and customs.

This is a grave privacy concern, of course, but there are enough privacy geeks to write about that. I’m not much of a privacy geek; what i really care about for this matter is the future of culture. Culture has to be interesting, vibrant and constantly innovative. When advertisers and media providers know the tastes of the “consumers” too well, culture tends to repeat itself and become very bad. Much like television in the last few years.

It is highly unlikely that the W3C will accept this proposal and make it standard. W3C dislikes DRM to begin with, Mozilla representatives in the W3C will definitely oppose it, and even Google’s own W3C representative isn’t enthusiastic about it. Nevertheless, it’s easy to imagine that Google will implement this proposal in Chrome, and Microsoft will implement it in Internet Explorer. Then they will set up several websites with “partners” who will provide “content” that cannot be played without this DRM scheme, and this will pull more people into using these browsers and lock them into a nightmare of pointless, recycled, creativity-stifling entertainment.

I am a Mozillian. You may think that this means that i want Firefox’s market share to be 100%. That is not what i want. I love the web and i want it to be great for all people, no matter which browser they use. Building Digital Restriction Management into browsers will make the web, and the whole culture around it, bad and boring.

Don’t let that happen to the web. If you care about culture and arts, use Firefox – a browser that is committed to openness and not to advertising revenue.

Web Fonts and Web Browsers – why Firefox is the best choice for most people who don’t read in the Latin alphabet

In December the Localization team of the Wikimedia Foundation, of which i am a proud member, deployed the support for web fonts in Wikipedias in several languages of India. Put simply, this technology allows anyone with a reasonably modern web browser to read Wikipedia in an exotic language without manually installing exotic fonts on their computer. Tom Morris wrote a very nice blog post that explains why web fonts matter: Web fonts were invented for making web sites niftier, but they are useful for something much more important beyond aesthetics and design – to enable people to read and write in any language effortlessly. People need to be able to read and write effortlessly using a computer, but this notion is so basic that it is frequently overlooked.

Basically, web fonts turn this:

◌◌◌◌ ◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌ ◌◌◌ ◌◌◌◌◌◌, ◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌ ◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌ ◌◌ ◌◌◌◌◌◌ ◌◌◌◌◌◌ ◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌ ◌◌◌◌ ◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌ ◌◌◌ ◌◌◌◌ ◌◌ ◌◌◌ ◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌.◌◌◌◌ ◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌ ◌◌◌ ◌◌◌◌◌◌, ◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌ ◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌ ◌◌ ◌◌◌◌◌◌ ◌◌◌◌◌◌ ◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌ ◌◌◌◌ ◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌◌◌◌◌ ◌◌◌◌ ◌◌◌ ◌◌◌◌ ◌◌ ◌◌◌ ◌◌◌◌◌◌◌◌◌◌◌◌◌◌◌ ◌◌◌◌◌.

into this:

Near the beginning of his career, Einstein thought that Newtonian mechanics was no longer enough to reconcile the laws of classical mechanics with the laws of the electromagnetic field.

Without webfonts, a person who speaks a language that is not written in Latin letters has two choices when seeing “◌◌◌◌ ◌◌◌ ◌◌◌◌◌◌◌◌◌”: to install fonts manually or to try to find that information in English or some other language that is written in Latin. Two frequently ignored facts: 1. most people don’t know how to install fonts on their computers; 2. most people don’t know English.

Web fonts make text readable without any effort from the user. Wikipedia is probably the first major website that uses web fonts for the really important purpose of allowing people to read websites in their language. This post here will highlight some technical details about the deployment.

A spoiler: Firefox rulez.


Microsoft Internet Explorer, not surprisingly, has the most issues with web fonts support. For example, it sometimes shows complete gibberish instead of the actual letters. The situation is especially bad on Windows XP; Windows XP is an old system, but it matters, because lots of people in India and in many other countries still use it – about 17% of Wikipedia’s readers use Internet Explorer on Windows XP. Even though Microsoft Internet Explorer 9 seems to handle web fonts decently, it cannot be installed on Windows XP, so it’s irrelevant to hundreds of millions of people. My advice to them – get Firefox.

Opera sucks here and there, too. For example, on a Mac, Opera may fail to show English (!) words, because it tries to show them in an Indic font, and if an Indic font doesn’t have Latin characters, the display is broken. Google Chrome has similar problems, too.

In Firefox we found practically no issues with web fonts support. The only problem with Firefox during the deployment of WebFonts was that Firefox didn’t load the fonts at all, but that actually happened because Firefox implements the web fonts standard correctly. On our testing site the font files were loaded from the same server as the web page itself, while on the actual Wikipedia the font files are loaded from a different domain to improve performance. The web fonts standard says that by default a browser is not supposed to load fonts from a different domain, unless that domain explicitly allows this. Chrome, Opera and Internet Explorer ignore this rule and load the fonts anyway; Firefox doesn’t. When we noticed it, we asked Wikimedia’s web server administrators to change the configuration to explicitly allow the loading of fonts. Wikimedia’s web server configuration files are open, so you’re welcome to read them.
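The cross-domain rule described above can be sketched as a small Python check. It assumes that the explicit permission is given through the `Access-Control-Allow-Origin` response header, which is the usual mechanism for this. The function names and the example.org and example.net URLs in the usage note are hypothetical, and the logic is a simplification of what a browser really does, not Wikimedia’s actual configuration.

```python
from urllib.parse import urlsplit


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def font_may_load(page_url, font_url, response_headers):
    """Decide whether a standards-following browser would use the font."""
    if origin(page_url) == origin(font_url):
        return True  # same origin: no explicit permission is needed
    # Header names are case-insensitive, so normalize them before the lookup.
    headers = {name.lower(): value for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin", "")
    page = urlsplit(page_url)
    page_origin = f"{page.scheme}://{page.netloc}"
    return allowed == "*" or allowed == page_origin
```

For a page at `https://wiki.example.org/wiki/Babel` and a font at `https://static.example.net/fonts/a.woff`, the check fails when the font server sends no such header and passes once the server answers with `Access-Control-Allow-Origin: *`. This mirrors the difference between our testing site and the real deployment.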

I didn’t make any precise measurements, but from my personal experience Firefox has far fewer issues with support for Unicode, complex fonts and right-to-left text than any other browser. It surely does have issues, but my impression is that Chrome, Internet Explorer and Opera have many more of them.

We reported the font issues that we found in Google Chrome to its developers and we hope that they will be fixed. We also tried to report issues in Opera and Internet Explorer; since there are no public bug tracking systems for these browsers, we cannot track their development.

MozCamp Berlin 2011, part 1

On November 12–13 i participated in MozCamp Berlin. (I’m writing this late-ish, because a day after that i went to India to participate in a Wikimedia conference and not one, but two hackathons. That was a crazy month.)


In the past i participated in small events of the Israeli Mozilla community, but this was my first major Mozilla-centric event.

MozCamp Berlin 2011 group photo. Notice the fox on the left and yours truly on the right.

The biggest thing that i take from this event is the understanding that i belong to this community of people who love the web. I never properly realized it earlier; i somehow thought that loving the web is a given. It is not.

Johnathan Nightingale, director of Firefox Engineering, repeated the phrase “we <3 the web” several times in his keynote speech. And this is the thing that makes the Mozilla community special.

Firefox is not the only good web browser. Opera and Google Chrome are reasonably good, too. Frankly, they are even better than Firefox in some features, though i find them less essential.

Firefox is not the only web browser that strives to implement web standards. Opera, Google Chrome and even recent versions of Microsoft Internet Explorer try to do that, too.

Firefox is not even the only web browser that is Free Software. So is Chromium.

But Firefox and the Mozilla community around it love the web. I don’t really have a solid way to explain it – it’s mostly a feeling. And with other browsers i just don’t have it. They help people surf the web, but they aren’t in the business of loving it.

And this is important, because the Internet is not just a piece of technical infrastructure that helps people communicate, do business and find information and entertainment. The Internet is a culture in itself – worthy of appreciation in itself and worthy of love in itself – and the Mozilla community is there to make it happen.

Some people would understand from this that Firefox is for the nerds who care about the technology more than they care about going out every once in a while. It isn’t. It’s not, in fact, just about a browser. It’s about the web – more and more Mozilla is not just developing a great browser, but also technologies and trends that affect all users of all browsers, rather than target markets. By using Firefox you get as close as you can to the cutting edge, not just of cool new features, but of openness and equality. Some people may find this ideology boring and pointless; i find it important, because without it the Internet would not be where it is today. Imagine an Internet in which the main sites you visit every day are not Facebook, Wikipedia, Google and your favorite blogs, but msn.com… and nothing but msn.com. Without Mozilla that’s how the Internet would probably look today. Without Mozilla something like this may well happen in the future.


Thanks a lot to William Quiviger, Pierros Papadeas, Greg Jost and all the other hard-working people who produced this great event.

More about it in the next couple of posts very soon.

People Speaking – Save

— “How did you say that I can save a Word document as a MediaWiki file?”

— “You need to download an add-on for Microsoft Word. Google for ‘save microsoft word document as mediawiki’.”

— “I did. It brought me to a page about OpenOffice.”

— “Hmm… Great success!”


In LibreOffice, the freer version of OpenOffice, saving as MediaWiki is already available without installing any additional add-ons. It may be so in the latest version of OpenOffice, too. I used this feature to upload to Wikipedia dozens of articles that were written by people who can write well, but don’t want to learn the complicated MediaWiki syntax.

For Microsoft Word there is the “Microsoft Office Word Add-in For MediaWiki”. I tried installing it, but it didn’t actually work. Your mileage may vary.

Roth

miriamruth11-hp; copyright: Google; based on the original illustration by Ora Ayal

Today the logo appearing at the top of Google.co.il honors Miriam Roth, the author of the famous Hebrew children’s book “A Tale of Five Balloons”. She was born on the 16th of February in 1910.

The Google employee who uploaded the image made a mistake: the filename is “miriamruth”, but it should be “miriamroth”. That’s what happens when there’s no proper way to write the vowels: Her last name is written רות, which is how the Biblical name “Ruth”, still common in modern Israel, is written. But the German last name “Roth” is written the same way, because in Hebrew “u” and “o” are usually written using the same letter, Vav.

There is a way to differentiate the sounds: רוּת is “Ruth” and רוֹת is “Roth”. Notice the placement of the dot in relation to the letter in the middle. The sign for “u” is called shuruk, and the sign for “o” is called holam; i wrote the bulk of the articles about them in Wikipedia. Most people don’t type these signs; usually it’s fairly easy to guess the correct pronunciation, but people don’t use these signs even when it’s needed, as is the case with Ruth/Roth, because typing them on the standard Hebrew keyboard is very hard.
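For the technically curious, the difference between the two spellings can be shown at the level of Unicode code points. The following is a minimal Python sketch: the names `RUTH`, `ROTH` and `strip_vowel_points` are made up for the illustration, and it assumes that the shuruk is encoded as Vav followed by the dagesh point (U+05BC) and the holam as the point U+05B9.

```python
import unicodedata

RUTH = "\u05e8\u05d5\u05bc\u05ea"  # resh, vav + dagesh point (shuruk), tav
ROTH = "\u05e8\u05d5\u05b9\u05ea"  # resh, vav + holam point, tav


def strip_vowel_points(text):
    """Drop combining marks (category Mn), leaving only the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# The pointed spellings differ, but without the points both names
# collapse into the same three letters.
print(RUTH == ROTH)                                          # False
print(strip_vowel_points(RUTH) == strip_vowel_points(ROTH))  # True
```

The only thing that separates the two names is a single combining character, and that character is exactly what is hard to type on the standard keyboard.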

For years this made me very angry, so i asked the Standards Institute of Israel to develop a new standard keyboard in which it will be easy to type these signs. I was successful at convincing the SII to do it. The work is now underway, and i actively participate in the monthly meetings, together with representatives from Hamakor – the Israeli association for free and open source software, Israel Internet Association, IBM, Microsoft, Apple, Google and other companies. I hope that the standard will be published in 2011; the technical implementation of the keyboard layout will take about ten minutes on each operating system, and shortly after that, i hope, it will be distributed to computers using some kind of an auto-update mechanism.

And then, i hope, we’ll start to see at least slightly richer Hebrew typography everywhere. I want it to happen, not just because it’s a nice tradition, but because this will simply make Hebrew easier to read – and will prevent silly mistakes, like pronouncing and writing “Ruth” instead of “Roth”.


See also: Maqaf.

Priest

Ray Ozzie, [Bill Gates’] successor as chief software architect, doesn’t have anything close to the confrontational approach that the Microsoft co-founder used to shape the company.

“It’s as though you’ve been running Italy using the Mafia for the last 20 years and you bring in a priest,” said Mark Anderson, publisher of the Strategic News Service technology newsletter.

One year later, Microsoft feels subtle effects of Gates’ transition, Todd Bishop