I was wondering: in which languages do user interface translations tend to be longer, and in which do they tend to be shorter?
The intuitive answers to these questions are that Chinese and Japanese are very short, English tends to be shorter than the average, Hebrew is shorter than English, and the longest ones are Turkish, Finnish, German, and Tamil. But what if I try to find a more precise answer?
So I made a super-simplistic calculation: I checked the average length of a core MediaWiki user interface message for English and the 150 languages with the highest number of translations.
I sorted them from the shortest average length to the longest. The table is at the end of the post.
Here’s a verbal summary of some interesting points that I found:
- The shortest messages are found, unsurprisingly, in Chinese, Japanese, and Korean.
- Another group of languages that surprised me by having very short translations consists of some Arabic-script languages of South Asia: Saraiki, Punjabi, Sindhi, Pashto, Balochi.
- Three more languages surprised me by being at the shorter end of the list: Meadow Mari (mhr) and Northern Sami (se), which are Finno-Ugric, a family known for agglutinative grammar that tends to make words longer; and Armenian, about which I, for no particular reason, had the impression that its words are longish.
- English is at #23 out of 151, with an average length of 38.
- Hebrew is slightly above English at #22, with 37.95. This surprised me: I was always under the impression that Hebrew tends to be much shorter.
- The longest languages are not quite the ones I thought! The longest ones tend to be the Romance languages: Lombard, French, Portuguese, Spanish, Galician, Arpitan, Romanian, Catalan.
- Three Germanic languages, namely Colognian, German, and Dutch, are at the longer end of the list, but other Germanic languages are not. (Colognian is the longest in my list. The reason for this is not so natural, though: the most prolific translator into it, User:Purodha, liked writing out abbreviations in full, which made many strings longer than they needed to be. He passed away in 2016. May he rest in peace.)
- Other language groups that tend to be longer are Slavic (Belarusian, Russian, Bulgarian, Polish, Ukrainian) and Austronesian (Sakizaya, Ilokano, Tagalog, Bikol, Indonesian).
- Other notable, but not easily grouped languages that tend to be longer are Scottish Gaelic, Greek, Shan, Quechua, Finnish, Hungarian, Basque, and Malayalam. All of them have an average length between 45 and 53 characters.
- Turkish is only slightly above average with 44.1, at #88.
- Tamil is a bit longer, with an average length of 44.6, at #94. Strings in its sister language Malayalam are considerably longer, 49.1.
- The median length is 43, and the average across all languages is 42. Notable languages at these lengths are Mongolian, Serbian, Welsh, Norwegian, Malaysian, Esperanto, Georgian, Balinese, Tatar, Estonian, and Bashkir. (Esperantistoj, ĉu vi ĝojas aŭdi, ke via lingvo aperas preskaŭ ĝuste en la mezo de ĉi tiu listo? In English: Esperantists, are you glad to hear that your language appears almost exactly in the middle of this list?)
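A note on what "length" means in these numbers: the script at the end of the post uses Python's `len()` on each string, which counts Unicode code points, not bytes and not user-perceived characters. For scripts such as Tamil and Malayalam, vowel signs and viramas are separate code points, so a word can count as longer than it looks. The following is only a small illustration of that; the sample strings and the helper names `code_points` and `utf8_bytes` are made up for this sketch and are not MediaWiki messages.

```python
# Arbitrary sample words, written with escapes so the code points are explicit.
samples = {
    "en": "Edit",
    "zh": "\u4e2d\u6587",                    # 2 code points
    "ta": "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd",  # 5 code points, including a vowel sign and a virama
}


def code_points(text):
    """Length as Python's len() sees it: the number of Unicode code points."""
    return len(text)


def utf8_bytes(text):
    """Size of the string once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Print language code, code point count, and UTF-8 byte count for each sample
for code, text in samples.items():
    print(code, code_points(text), utf8_bytes(text))
```

Counting bytes instead of code points would rank the languages quite differently, so the table is best read as a comparison of code point counts only.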
One important factor that I didn’t take into account is that, for various reasons, translators into different languages may choose to translate different messages. One of those reasons may be that people translate shorter messages first because they are usually easier. I addressed this in a very quick and dirty way, by ignoring strings of 300 characters or more. Some time in the (hopefully near) future, I’ll try to come up with a smarter way to calculate it.
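One possible way to reduce this bias, offered only as a sketch and not as the method behind the table, is to compare each language with English over just the messages that the language actually has translated. The function name `length_ratio`, its `max_len` parameter, and the sample dictionaries are all hypothetical; the inputs are assumed to be dictionaries mapping message keys to strings, like the JSON files the script reads.

```python
def length_ratio(english, translated, max_len=300):
    """Ratio of translated to English length over messages present in both.

    Returns None when there are no comparable messages.
    """
    english_total = 0
    translated_total = 0
    for key, source in english.items():
        target = translated.get(key)
        if target is None:
            continue  # not translated: leave it out on both sides
        if len(source) >= max_len or len(target) >= max_len:
            continue  # same cutoff as the quick and dirty approach
        english_total += len(source)
        translated_total += len(target)
    if english_total == 0:
        return None
    return translated_total / english_total


# Made-up sample data for illustration only, not real MediaWiki messages.
english = {"save": "Save", "cancel": "Cancel", "help": "Help page"}
partial = {"save": "Guardar", "cancel": "Cancelar"}

print(length_ratio(english, partial))
```

A ratio above 1 would mean the translations are longer than the English originals for the same set of messages, regardless of which messages a community chose to translate first.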
And here are the full results. Please don’t take them too seriously, and feel free to write your own, better, calculation code!
# | Language code | Average translation length (characters) |
1 | zh-hans | 17.67324825 |
2 | zh-hant | 18.52284388 |
3 | skr-arab | 21.81899964 |
4 | ja | 24.67007612 |
5 | ko | 25.8110372 |
6 | sd | 27.71960396 |
7 | mhr | 28.95451413 |
8 | ps | 32.73647059 |
9 | pnb | 33.03592163 |
10 | bgn | 34.39934667 |
11 | se | 34.69274476 |
12 | hy | 35.02317597 |
13 | su | 35.37706967 |
14 | th | 35.52957892 |
15 | ce | 35.6969602 |
16 | mai | 36.02093909 |
17 | lv | 36.14100906 |
18 | gu | 36.59380971 |
19 | bcc | 36.64866033 |
20 | fy | 37.60139287 |
21 | nqo | 37.94138834 |
22 | he | 37.95259865 |
23 | en | 38.04300371 |
24 | ar | 38.18569036 |
25 | ckb | 38.66867672 |
26 | min | 38.71156958 |
27 | ses | 38.87941712 |
28 | jv | 38.94753377 |
29 | is | 39.0652467 |
30 | alt | 39.39977435 |
31 | az | 39.4337931 |
32 | kab | 39.50967506 |
33 | tk | 39.54990758 |
34 | mr | 39.72049689 |
35 | as | 39.72080166 |
36 | sw | 39.73986071 |
37 | km | 39.77591036 |
38 | azb | 39.92411642 |
39 | nn | 39.96771069 |
40 | yo | 40.00503291 |
41 | io | 40.0528125 |
42 | af | 40.1640678 |
43 | blk | 40.2813059 |
44 | sco | 40.33289474 |
45 | diq | 40.33887373 |
46 | yi | 40.34033476 |
47 | ur | 40.39857651 |
48 | ug-arab | 40.53965184 |
49 | da | 40.55894826 |
50 | my | 40.67551519 |
51 | kk-cyrl | 40.87443182 |
52 | guw | 41.07080182 |
53 | mg | 41.08369028 |
54 | sq | 41.23219241 |
55 | fa | 41.27007299 |
56 | or | 41.27020202 |
57 | ne | 41.33971151 |
58 | rue | 41.40219378 |
59 | lfn | 41.54527278 |
60 | lrc | 41.61281337 |
61 | sah | 41.63293173 |
62 | vi | 41.74578313 |
63 | awa | 41.84093291 |
64 | hi | 41.9257885 |
65 | si | 41.93065693 |
66 | te | 41.99780915 |
67 | mn | 42.18728223 |
68 | lki | 42.21091396 |
69 | bjn | 42.57961538 |
70 | sr-ec | 42.67730151 |
71 | cy | 42.75020408 |
72 | frr | 42.92761394 |
73 | vec | 43.00573682 |
74 | sr-el | 43.13764389 |
75 | nb | 43.34987835 |
76 | krc | 43.54919554 |
77 | ms | 43.5553814 |
78 | hr | 43.55564807 |
79 | eo | 43.57477789 |
80 | nds-nl | 43.59060895 |
81 | ka | 43.60108696 |
82 | ban | 43.64178033 |
83 | bs | 43.681094 |
84 | tt-cyrl | 43.78230132 |
85 | xmf | 43.80860161 |
86 | et | 43.96494239 |
87 | ba | 43.99432099 |
88 | tr | 44.17996604 |
89 | bn | 44.28768449 |
90 | bew | 44.44706174 |
91 | sv | 44.49027333 |
92 | sa | 44.58670931 |
93 | cs | 44.59026764 |
94 | ta | 44.62803055 |
95 | mt | 44.70207417 |
96 | lt | 44.7615 |
97 | roa-tara | 44.79812466 |
98 | fit | 44.79824561 |
99 | dsb | 44.9151957 |
100 | hsb | 44.96197228 |
101 | br | 44.98873461 |
102 | sh-latn | 45.00976709 |
103 | fi | 45.1222031 |
104 | hu | 45.17139303 |
105 | sk | 45.35804702 |
106 | lb | 45.39073034 |
107 | li | 45.5539548 |
108 | id | 45.56471159 |
109 | gsw | 45.63605209 |
110 | sl | 45.75350606 |
111 | be | 45.80325 |
112 | oc | 45.85709988 |
113 | mk | 45.90943939 |
114 | bcl | 45.97070064 |
115 | scn | 46.11905532 |
116 | an | 46.14892665 |
117 | uk | 46.22955524 |
118 | qu | 46.30301842 |
119 | eu | 46.33589404 |
120 | lij | 46.660536 |
121 | pl | 46.76863316 |
122 | hrx | 46.79802761 |
123 | ast | 46.87204161 |
124 | nap | 46.93783147 |
125 | ru | 47.02326139 |
126 | bg | 47.03590259 |
127 | be-tarask | 47.28525242 |
128 | hif-latn | 47.41652614 |
129 | tl | 47.51263001 |
130 | rm | 47.60741067 |
131 | pms | 47.69805527 |
132 | pt-br | 47.84063647 |
133 | ca | 47.92468307 |
134 | ro | 48.22437186 |
135 | nl | 48.4175636 |
136 | ia | 48.48612816 |
137 | it | 48.52347014 |
138 | frp | 48.54542755 |
139 | gl | 48.57820482 |
140 | ml | 49.12108224 |
141 | es | 49.21062944 |
142 | pt | 49.63085602 |
143 | de | 49.77225067 |
144 | szy | 49.84650877 |
145 | shn | 49.92356241 |
146 | fr | 50.15585031 |
147 | lmo | 50.85627837 |
148 | ilo | 50.9798995 |
149 | el | 51.14834894 |
150 | gd | 51.72994269 |
151 | ksh | 53.36332609 |
Here is the Python 3 code I used to create the table. You can run it in the root directory of the core MediaWiki source tree. It’s horrible, please improve it!
import json
import os
import re

# Maps language code -> average message length
languages = {}

# Extracts the language code from a path such as ./languages/i18n/he.json
code_re = re.compile(r"(?P<code>[^/]+)\.json$")


def process_file(filename):
    code_search = code_re.search(filename)
    code = code_search.group("code")

    # Skip the message documentation (qqq) and a few languages
    # that are left out of the comparison
    if code in ('qqq', 'ti', 'lzh', 'yue-hant'):
        return

    with open(filename, "r", encoding="utf-8") as file:
        data = json.load(file)

    # The '@metadata' entry is not a message
    del data['@metadata']

    average_unicode_length(code, data)


def average_unicode_length(language, translations):
    total_translations = len(translations)
    if total_translations < 2200:
        print('Language ' + language + ' has fewer than 2200 translations')
        return

    total_length = 0
    for translation in translations.values():
        # Ignore long strings (300 characters or more)
        if len(translation) < 300:
            total_length += len(translation)

    # Calculate the average length. Note that the divisor still counts
    # the long strings that were skipped above.
    average_length = total_length / total_translations

    languages[language] = average_length


root = "./languages/i18n/"
for file in os.listdir(root):
    if file.endswith(".json"):
        path = os.path.join(root, file)
        process_file(path)

sorted_languages = sorted(
    languages.items(),
    key=lambda item: item[1]
)

# Print the sorted items
for code, length in sorted_languages:
    print(code, '\t', length)
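
One detail of the script above is that it sums only the strings shorter than 300 characters but divides by the number of all strings, long ones included. A minimal variant that divides by the number of strings actually counted might look like the sketch below. The name `average_length`, the `max_len` parameter, and the sample data are hypothetical and not part of the script, and numbers computed this way would not match the table exactly.

```python
def average_length(translations, max_len=300):
    """Mean length of the strings shorter than max_len.

    Returns None if no string is short enough to be counted.
    """
    lengths = [len(text) for text in translations.values() if len(text) < max_len]
    if not lengths:
        return None
    return sum(lengths) / len(lengths)


# Made-up sample: two short strings and one that exceeds the cutoff.
sample = {"a": "x" * 10, "b": "y" * 30, "c": "z" * 400}

# Only "a" and "b" are counted, both in the sum and in the divisor.
print(average_length(sample))
```

With the script's divisor, the same sample would be averaged over three strings instead of two, which pulls the result down for languages that have many long messages.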