Twitter sometimes offers machine translation for tweets that are not written in the language that I chose in my preferences. Usually I have Hebrew chosen, but for writing this post I temporarily switched to English.
Here’s an example where it works pretty well. I see a tweet written in French, and a little “Translate from French” link:
The translation is not perfect English, but it’s good enough; I never expect machine translation to have perfect grammar, vocabulary, and word order.
Now, out of curiosity I happen to follow a lot of people and organizations who tweet in the Belarusian language. It’s the official language of the country of Belarus, and it’s very closely related to Russian and Ukrainian. All three languages have similar grammar and share a lot of basic vocabulary, and all are written in the Cyrillic alphabet. However, the actual spelling rules are very different in each of them, and they use slightly different variants of Cyrillic: only Russian uses the letter ⟨ъ⟩; only Belarusian uses ⟨ў⟩; only Ukrainian uses ⟨є⟩.
Despite this, Bing gets totally confused when it sees tweets in the Belarusian language. Here’s an example form the Euroradio account:
Both tweets are written in Belarusian. Both of them have the letter ⟨ў⟩, which is used only in Belarusian, and never in Ukrainian and Russian. The letter ⟨ў⟩ is also used in Uzbek, but Uzbek never uses the letter ⟨і⟩. If a text uses both ⟨ў⟩ and ⟨і⟩, you can be certain that it’s written in Belarusian.
And yet, Twitter’s machine translation suggests to translate the top tweet from Ukrainian, and the bottom one from Russian!
An even stranger thing happens when you actually try to translate it:
Notice two weird things here:
- After clicking, “Ukrainian” turned into “Russian”!
- Since the text is actually written in Belarusian, trying to translate it as if it was Russian is futile. The actual output is mostly a transliteration of the Belarusian text, and it’s completely useless. You can notice how the letter ⟨ў⟩ cannot be transliterated.
Something similar happens with the Igbo language, spoken by more than 20 million people in Nigeria and other places in Western Africa:
This is written in Igbo by Blossom Ozurumba, a Nigerian Wikipedia editor, whom I have the pleasure of knowing in real life. Twitter identifies this as Vietnamese—a language of South-East Asia.
The reason for this might be that both Vietnamese and Igbo happen to be written in the Latin alphabet with addition of diacritical marks, one of the most common of which is the dot below, such as in the words ibụọla in this Igbo tweet, and the word chọn lọc in Vietnamese. However, other than this incidental and superficial similarity, the languages are completely unrelated. Identifying that a text is written in a certain language only by this feature is really not great.
If I paste the text of the tweet, “Nwoke ọma, ibụọla chi?”, into translate.bing.com, it is auto-identified as Italian, probably because it includes the word chi, and a word that is written identically happens to be very common in Italian. Of course, Bing fails to translate everything else in the Tweet, but this does show a curious thing: Even though the same translation engine is used on both sites, the language of the same text is identified differently.
How could this be resolved?
Neither Belarusian nor Igbo languages are supported by Bing. If Bing is the only machine translation engine that Twitter can use, it would be better to just skip it completely and not to offer any translation, than to offer this strange and meaningless thing. Of course, Bing could start supporting Belarusian; it has a smaller online presence than Russian and Ukrainian, but their grammar is so similar, that it shouldn’t be that hard. But what to do until that happens?
In Wikipedia’s Content Translation, we don’t give exclusivity to any machine translation backend, and we provide whatever we can, legally and technically. At the moment we have Apertium, Yandex, and YouDao, in languages that support them, and we may connect to more machine translation services in the future. In theory, Twitter could do the same and use another machine translation service that does support the Belarusian language, such as Yandex, Google, or Apertium, which started supporting Belarusian recently. This may be more a matter of legal and business decisions than a matter of engineering.
Another thing for Twitter to try is to let users specify in which languages do they write. Currently, Twitter’s preferences only allow selecting one language, and that is the language in which Twitter’s own user interface will appear. It could also let the user say explicitly in which languages do they write. This would make language identification easier for machine translation engines. It would also make some business sense, because it would be useful for researchers and marketers. Of course, it must not be mandatory, because people may want to avoid providing too much identifying information.
If Twitter or Bing Translation were free software projects with a public bug tracking system, I’d post this as a bug report. Given that they aren’t, I can only hope that somebody from Twitter or Microsoft will read it and fix these issues some day. Machine translation can be useful, and in fact Bing often surprises me with the quality of its translation, but it has silly bugs, too.