The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation

Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.

Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:

[Image: Emmanuel Macron on Twitter]

The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.

Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.

Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example from the Euroradio account:

[Image: two tweets from the Еўрарадыё (euroradio) account on Twitter]

Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian and Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.

And yet, Twitter’s machine translation offers to translate the top tweet from Ukrainian, and the bottom one from Russian!
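The letter-based reasoning above can be written down as a tiny heuristic. This is only a sketch of the idea, not a description of how Twitter or Bing identify languages. The function name `guess_by_letters` is made up for illustration, and the code relies solely on the letter facts stated in this post. It also assumes that the text is in Belarusian, Russian, Ukrainian, or Uzbek; other languages written in Cyrillic are ignored.

```python
import unicodedata
from typing import Optional

# Escapes are used so that look-alike Latin letters cannot sneak in.
SHORT_U = "\u045e"    # ў: Belarusian (and Uzbek Cyrillic)
DOTTED_I = "\u0456"   # і: Cyrillic dotted i, not the Latin letter i
UKR_YE = "\u0454"     # є: of the three languages, only Ukrainian
HARD_SIGN = "\u044a"  # ъ: of the three languages, only Russian


def guess_by_letters(text: str) -> Optional[str]:
    """Guess 'be', 'uk' or 'ru' from distinctive letters, or return None."""
    # NFC turns "у" followed by a combining breve into the single letter "ў".
    letters = set(unicodedata.normalize("NFC", text).lower())
    if SHORT_U in letters and DOTTED_I in letters:
        return "be"
    if SHORT_U in letters:
        return None  # Belarusian or Uzbek; these letters alone cannot decide
    if UKR_YE in letters:
        return "uk"
    if HARD_SIGN in letters:
        return "ru"
    return None  # no distinctive letter found
```

A short text may contain none of these letters, so the function returns None rather than guessing. A real language identifier would combine many more signals, but even this much would be enough to avoid labeling a text that has both ⟨ў⟩ and ⟨і⟩ as Russian or Ukrainian.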

An even stranger thing happens when you actually try to translate it:

[Image: a single tweet from the Еўрарадыё (euroradio) account on Twitter, translated as Russian]

Notice two weird things here:

  1. After clicking, “Ukrainian” turned into “Russian”!
  2. Since the text is actually written in Belarusian, trying to translate it as if it was Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. You can notice how the letter ⟨ў⟩ cannot be transliterated.

Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:

[Image: tweets with replies by Ntụ Agbasa (blossomozurumba) on Twitter]

This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.

The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with the addition of diacritical marks, one of the most common of which is the dot below, as in the word ibụọla in this Igbo tweet and the phrase chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying the language of a text by this feature alone is really not great.

If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the tweet, but this does show a curious thing: even though the same translation engine is used on both sites, the language of the same text is identified differently.

How could this be resolved?

Neither Belarusian nor Igbo is supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to skip these languages completely and offer no translation than to offer this strange and meaningless output. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but their grammar is so similar, that it shouldn’t be that hard. But what to do until that happens?

In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, for the languages that they support, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.
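To make the idea of non-exclusive backends concrete, here is a minimal sketch of choosing a backend per language and offering nothing when no backend supports the language. It is not Content Translation’s or Twitter’s actual code. The `Backend` class, the `pick_backend` function, the backend names, and the language lists are all hypothetical placeholders; real support lists would have to come from each service.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    source_languages: FrozenSet[str]  # codes this backend can translate from


def pick_backend(backends: Sequence[Backend], source_lang: str) -> Optional[Backend]:
    """Return the first backend that supports source_lang, or None."""
    for backend in backends:
        if source_lang in backend.source_languages:
            return backend
    # No backend supports the language: the caller should offer no translation.
    return None


# Illustrative data only, not the real coverage of any service.
BACKENDS = [
    Backend("primary", frozenset({"fr", "ru", "uk"})),
    Backend("fallback", frozenset({"be", "ru", "uk"})),
]
```

With this data, a French tweet would go to the first backend, a Belarusian tweet to the second, and an Igbo tweet would get no “Translate” link at all, which is better than a meaningless translation.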

Another thing for Twitter to try is to let users specify which languages they write in. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. It could also let users say explicitly which languages they write in. This would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.
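One way such a preference could help is as a hint that tilts an otherwise uncertain guess. The sketch below assumes a language identifier that returns a score per candidate language. The function `pick_language`, the `boost` parameter, and the scores in the usage note are invented for illustration and are not taken from any real system.

```python
from typing import Dict, Iterable, Optional


def pick_language(
    scores: Dict[str, float],
    declared: Iterable[str] = (),
    boost: float = 2.0,
) -> Optional[str]:
    """Pick the best-scoring language, favoring languages the user declared."""
    if not scores:
        return None
    declared_set = set(declared)
    # Multiply the score of each declared language by the boost factor.
    adjusted = {
        lang: score * (boost if lang in declared_set else 1.0)
        for lang, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)
```

For example, with made-up scores of 0.45 for Russian, 0.35 for Belarusian, and 0.20 for Ukrainian, the function picks Russian when nothing is declared, and Belarusian when the user has declared that they write in Belarusian. Because the declaration is optional, users who prefer not to share it simply get the unadjusted guess.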

If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.

3 thoughts on “The Curious Problem of Belarusian and Igbo in Twitter and Bing Translation”

  1. “Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but their grammar is so similar, that it shouldn’t be that hard.”
    This is a rather nonsensical statement in today’s MT paradigm, where parallel corpora are king. I’m not sure about Bing, but Google Translate is transitioning to a system that’s utterly agnostic of all grammar constructed via human knowledge (i.e. manually).
    In other words, give them English-Belorussian data, and you’ll have your system.

    1. If it’s indeed all statistics and AI, then Bing could do the same thing that Yandex and Google did.

      But I suspect that it’s not all statistics, and that for languages that are as grammatically similar as Belarusian and Russian are, some rule-based MT could be in place. Google say that they don’t do rule-based stuff, but as long as the source is not open, we cannot know.

      I have the same suspicion about French and Haitian Creole, by the way. Is it a coincidence that Google, Yandex, and Bing support it? Or is there a huge parallel corpus of natural Haitian Creole texts?

  2. We have a similar problem with Welsh, and it seems Twitter is unwilling to offer translation of Welsh messages or a Welsh interface (although Welsh is available on Facebook, Google, SwiftKey, etc.). As with Facebook, a loose and broad network of Welsh enthusiasts even began translating the Twitter interface in the belief that Twitter would use it. But it seems not.

    This is not only frustrating for Welsh speakers, and for people who may not be fluent or the occasional follower who just wishes to see a translation of a tweet; it’s also very detrimental to the language, as online (and voice recognition) presence is going to be absolutely pivotal to a language’s vibrancy and use in the future.

    I fail to see Twitter’s problem with different languages, especially as Welsh is, in comparison with other lesser-used languages (and probably all African languages, with the exception of Swahili), quite advanced in online presence. It’s very bad news for African as well as European languages like Welsh.

    As for (mis)translating, I have a passing interest in Faroese, and one has to translate from Icelandic to get an idea of an article. Or Faroese is recognised as Icelandic. Welsh … well, we get all kinds of exotic languages!

    Twitter and Welsh
    2012 ‘Twitter launch Welsh language Version’ – http://www.walesonline.co.uk/news/wales-news/twitter-launch-welsh-language-version-2024469

    2017 Concern over future of Welsh on Twitter (in Welsh) – https://golwg360.cymru/newyddion/cymru/505093-pryderon-ddyfodol-gymraeg-wefan-twitter

    Canolfan Bedwyr – centre at Bangor University for development of Welsh lang services and technology etc https://www.bangor.ac.uk/canolfanbedwyr/

    Faroese
    Campaign for Faroese on Google Translate https://www.prnewswire.com/news-releases/english-releases/tiny-archipelago-creates-faroe-islands-translate-to-petition-google-translate-to-share-their-language-300530806.html

    SwiftKey
    Is fantastic, you can upload up to 5 languages for predictive texting. Ideal if you live a life in more than one language (hey, you know, like over half the world who aren’t monoglot anglophones!). Includes all European languages and also African ones, not just the usual suspects. https://swiftkey.com/en It even includes Igbo and Faroese. The technology is there, it’s just the old imperial attitude.
