Archive for the 'search' Category

The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

[Screenshot: Emmanuel Macron on Twitter]

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example from the Euroradio account:

[Screenshot: Еўрарадыё (euroradio) on Twitter, two tweets]

Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian or Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.
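The letter-based reasoning above can be sketched in a few lines of Python. This is only an illustration of the heuristic described in this post, not how Bing or Twitter actually identify languages. The function name `guess_east_slavic_language` is made up for this example, and it only tells apart the three languages discussed here; real language identification needs much more than a few marker letters.

```python
import unicodedata
from typing import Optional

# Marker letters taken from the comparison above (lowercase forms).
SHORT_U = "\u045e"       # Cyrillic short u, used in Belarusian
DOTTED_I = "\u0456"      # Cyrillic dotted i, used in Belarusian and Ukrainian
UKRAINIAN_IE = "\u0454"  # Cyrillic Ukrainian ie
HARD_SIGN = "\u044a"     # Cyrillic hard sign


def guess_east_slavic_language(text: str) -> Optional[str]:
    """Guess 'be', 'uk' or 'ru' from marker letters; None if no marker is found."""
    # NFC makes sure a "u + combining breve" sequence becomes the single short-u letter.
    letters = set(unicodedata.normalize("NFC", text).lower())
    if SHORT_U in letters and DOTTED_I in letters:
        return "be"
    if UKRAINIAN_IE in letters:
        return "uk"
    if HARD_SIGN in letters:
        return "ru"
    # No marker letter: this simple check cannot decide.
    return None
```

A text without any of these letters returns `None`, which is the honest answer for such a crude check: a short Russian sentence, for example, often contains no hard sign at all.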

And yet, Twitter’s machine translation offers to translate the top tweet from Ukrainian, and the bottom one from Russian!

An even stranger thing happens when you actually try to translate it:

[Screenshot: Еўрарадыё (euroradio) on Twitter, a single tweet translated from Russian]

Notice two weird things here:

  1. After clicking, “Ukrainian” turned into “Russian”!
  2. Since the text is actually written in Belarusian, trying to translate it as if it were Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. Notice that the letter ⟨ў⟩ could not even be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

[Screenshot: tweets with replies by Ntụ Agbasa (blossomozurumba) on Twitter]

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with the addition of diacritical marks, one of the most common of which is the dot below, as in the word ibụọla in this Igbo tweet and the words chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying the language of a text by this feature alone is really not great.

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the tweet, but this does show a curious thing: even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo is supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to skip translation completely and offer nothing than to offer this strange and meaningless output. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but the grammar of the three languages is so similar that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, for the languages that they support, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.
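As a rough sketch of what “not giving exclusivity” could look like, here is a small Python function that picks the first backend that supports the source language and offers no translation when none does. Everything in it is hypothetical: the function name `pick_backend`, the `SUPPORTED` table, and especially the language sets, which are illustrative placeholders and not real coverage data for any of these services. It is not how Content Translation or Twitter is actually implemented.

```python
from typing import Dict, List, Optional, Set

# Hypothetical support table. The language sets are made up for illustration;
# they only reflect this post's point that Bing handles neither Belarusian ("be")
# nor Igbo ("ig"), while some other services do handle Belarusian.
SUPPORTED: Dict[str, Set[str]] = {
    "bing": {"fr", "ru", "uk"},
    "yandex": {"be", "ru", "uk"},
    "apertium": {"be"},
}


def pick_backend(
    source_lang: str,
    preferred_order: List[str],
    supported: Dict[str, Set[str]] = SUPPORTED,
) -> Optional[str]:
    """Return the first backend that supports source_lang, or None."""
    for backend in preferred_order:
        if source_lang in supported.get(backend, set()):
            return backend
    # No backend supports this language: better to show no translation link at all.
    return None
```

With this table, `pick_backend("be", ["bing", "yandex", "apertium"])` skips Bing and falls through to Yandex, and `pick_backend("ig", ["bing", "yandex", "apertium"])` returns `None`, so no “Translate” link would be shown.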

Another thing for Twitter to try is to let users specify which languages they write in. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. It could also let users say explicitly which languages they write in. This would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.
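Here is a minimal sketch of how such a preference could help, assuming a language detector that returns a score per candidate language. The function `pick_language`, the shape of the scores, and the numbers in the usage note are all invented for this example; I don’t know what Twitter’s or Bing’s detectors actually return.

```python
from typing import Dict, Optional, Set


def pick_language(scores: Dict[str, float], declared: Set[str]) -> Optional[str]:
    """Pick the best-scoring language, preferring the ones the user declared."""
    if not scores:
        return None
    # Keep only the candidates that the user said they write in.
    candidates = {lang: s for lang, s in scores.items() if lang in declared}
    if not candidates:
        # The preference is optional, so fall back to the detector's own ranking.
        candidates = scores
    return max(candidates, key=lambda lang: candidates[lang])
```

For example, with made-up scores `{"vi": 0.6, "ig": 0.3, "it": 0.1}` and a user who declared Igbo and English, the function returns `"ig"`; with no declared languages it returns `"vi"`, exactly as the detector alone would.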

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.


Number in the Middle

I didn’t measure it, but i probably search Google in English more often than in Hebrew. Under the result link there’s a short summary of the page. Very frequently the first thing that is written in this summary is a date. Google forces right-to-left too strongly on all of the page, so the first number of the date goes to the other end of the summary:

Google search results - right to left


The result is that very, very often i see things like “at most restaurants in 21 Lima and Cusco” and “What if 26 you buy a shite gun”, which don’t make sense.

These are the results in complete left-to-right display:

Google search results - left to right
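One general way to keep a left-to-right snippet intact inside a right-to-left page is to isolate it according to its own first strongly directional character, using the Unicode directional isolate characters. The Python sketch below shows the idea; the function names are made up for this example, and i don’t know how Google actually builds its results page. In HTML the same effect is usually achieved with the `dir` attribute or the `bdi` element rather than with control characters.

```python
import unicodedata
from typing import Optional

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' based on the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # Only digits, spaces, or punctuation: no strong direction found.
    return None


def isolate(text: str) -> str:
    """Wrap text in an isolate that matches its own direction."""
    direction = first_strong_direction(text)
    if direction == "ltr":
        return LRI + text + PDI
    if direction == "rtl":
        return RLI + text + PDI
    return text
```

A summary that begins with a date, such as “21 Jan … at most restaurants in Lima and Cusco”, starts with digits, which have no strong direction, so the first Latin letter decides and the whole summary is wrapped as left-to-right.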


Dear Google, please fix this bug. It has been annoying me for a long time.

Hello, funny person

Hello, funny person. You know who you are. Yes, i see your search engine games. No, i am not really a spam expert. No, i am not going to work for Microsoft. No, i don’t dislike Russians. You can stop spamming me now.

For future reference, please note that WordPress cuts the search engine queries after 40 characters.

Vladimir

If you type владимир (vladimir) in Google and let it guess the popular queries, then Vladimir Putin is second and Vladimir Vysotsky is first.

Thank God.

Search and destroy – new page

WordPress has a nice feature – it is possible to see what people searched for in search engines when they found my blog.

I am creating a new page for the most interesting of them, which will be called “Search and destroy.”

An open letter to Richard M. Stallman

Hello,

I am shopping for a laptop computer and i would like to buy one that is truly free – one that is able to run GNU/Linux without any restricted drivers, binary blobs and proprietary firmware.

I’ve been looking for such a laptop for almost a week now, and unfortunately couldn’t find it. I’ve tried asking about it on Ubuntu and gNewSense forums and local (Israeli) forums of GNU/Linux and Free Software experts, but the best reply i could get was that finding a perfectly Free laptop is just too hard and that at this time i should just give up! That is what Mark Shuttleworth himself said, even though he claims that he is also concerned about the issue of “radical” hardware freedom (see discussion at the bottom of Bug #1).

Why is it so hard?

For example: the hardware database at the FSF website has a list of network cards that support Free Software; this is informative, but in practice i couldn’t find anywhere on the Internet a way to search for laptops that have these cards. A lot of laptop vendors don’t even bother to list the manufacturer and model of the network card in the details of their laptops’ components, because in Windows they all just work and Ubuntu makes it relatively easy to install restricted drivers.

The same is true for video cards, DVD burners, etc.

So, apparently, most people – even Linux users! – don’t care about free firmware. I do care, and i tried my best to do something about it, but my wife urgently needs a laptop to write her thesis, so unfortunately it seems that i’ll have to buy a (partially) restricted system after all.

I thought that you would like to know that there are people that care about this issue, but find it hard to do something about it in practice.

If you do know about a laptop that is fully usable with purely free drivers, please tell me.

Thanks!


N.B.: I have great respect for Mark Shuttleworth and i believe that he is doing his best to help and fix this issue. I regret using the word “claims”, but i already sent the letter to RMS and wanted to post it here without changes.

Fun

Apparently someone arrived at my blog searching for “underwater-sex and fun”.

