Average Lengths of MediaWiki Translations

I was wondering: in which languages do user interface translations tend to be longer, and in which are they shorter?

The intuitive answers to these questions are that Chinese and Japanese are very short, English tends to be shorter than the average, Hebrew is shorter than English, and the longest ones are Turkish, Finnish, German, and Tamil. But what if I try to find a more precise answer?

So I made a super-simplistic calculation: I checked the average length of a core MediaWiki user interface message for English and the 150 languages with the highest number of translations.

I sorted them from the shortest average length to the longest. The table is at the end of the post.

Here’s a verbal summary of some interesting points that I found:

  1. The shortest messages are found, unsurprisingly, in Chinese, Japanese, and Korean.
  2. Another group that surprised me with very short translations consists of some Arabic-script languages of South Asia: Saraiki, Punjabi, Sindhi, Pashto, Balochi.
  3. Three more languages surprised me by being at the shorter end of the list: Meadow Mari (mhr) and Northern Sami (se), which are Finno-Ugric, a family known for agglutinative grammar that tends to make words longer; and Armenian, about which I, for no particular reason, had the impression that its words are longish.
  4. English is at #23 out of 151, with an average length of 38.04.
  5. Hebrew is slightly shorter than English, at #22 with 37.95. This surprised me: I was always under the impression that Hebrew tends to be much shorter.
  6. The longest languages are not quite the ones I thought! The longest ones tend to be the Romance languages: Lombard, French, Portuguese, Spanish, Galician, Arpitan, Romanian, Catalan.
  7. Three Germanic languages, namely Colognian, German and Dutch, are on the longer end of the list, but the other Germanic languages are not. (Colognian is the longest in my list. The reason for this is not so natural, though: the most prolific translator into it, User:Purodha, liked writing out abbreviations in full, which made many strings longer than they could be. He passed away in 2016. May he rest in peace.)
  8. Other language groups that tend to be longer are Slavic (Belarusian, Russian, Bulgarian, Polish, Ukrainian) and Austronesian (Sakizaya, Ilokano, Tagalog, Bikol, Indonesian).
  9. Other notable, but not easily grouped languages that tend to be longer are Scottish Gaelic, Greek, Shan, Quechua, Finnish, Hungarian, Basque, and Malayalam. All of them have an average length between 45 and 53 characters.
  10. Turkish is only slightly above average with 44.2, at #88.
  11. Tamil is a bit longer, with an average length of 44.6, at #94. Strings in its sister language Malayalam are considerably longer, 49.1.
  12. The median length is 43, and the average for everyone is 42. Notable languages at these lengths are Mongolian, Serbian, Welsh, Norwegian, Malaysian, Esperanto, Georgian, Balinese, Tatar, Estonian, and Bashkir. (Esperantistoj, ĉu vi ĝojas aŭdi, ke via lingvo aperas preskaŭ ĝuste en la mezo de ĉi tiu listo?)
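The last point can be checked with a few lines of Python. This is only a sketch: it assumes that "the average for everyone" means the plain mean of the per-language averages, the function name `summarize` is made up for this example, and the sample uses just five rows from the full results at the end of the post, so its numbers differ from the real median and mean.

```python
from statistics import mean, median


def summarize(averages):
    """Return (median, mean) of the per-language average lengths."""
    values = list(averages.values())
    return median(values), mean(values)


# Five rows copied from the results table, rounded to two decimals.
sample = {"zh-hans": 17.67, "en": 38.04, "eo": 43.57, "fr": 50.16, "ksh": 53.36}
middle, average = summarize(sample)
print(f"median={middle:.2f} mean={average:.2f}")
```

To get the real figures, pass in the whole dictionary of per-language averages instead of the sample.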

One important factor that I didn’t take into account is that translators into different languages may choose to translate different messages, for various reasons. One of those reasons may be that people translate shorter messages first because they are usually easier. I addressed this in a very quick and dirty way, by ignoring strings of 300 characters or more. Some time in the (hopefully near) future, I’ll try to find a smarter way to calculate it.
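One possible direction for a smarter calculation, as a rough sketch and not what produced the table below: compare languages only on the message keys that every one of them has translated, and divide by the number of strings that were actually counted. The function name `common_key_averages`, the `max_length` parameter, and the toy data are all hypothetical.

```python
def common_key_averages(translations_by_language, max_length=300):
    """Average message length per language over the shared message keys.

    translations_by_language maps a language code to a dict of
    message key -> translated string (the shape of a MediaWiki i18n
    JSON file once '@metadata' has been removed).
    """
    if not translations_by_language:
        return {}

    # Keep only the keys that every language has translated.
    key_sets = [set(messages) for messages in translations_by_language.values()]
    common_keys = set.intersection(*key_sets)

    averages = {}
    for code, messages in translations_by_language.items():
        lengths = [
            len(messages[key])
            for key in common_keys
            if len(messages[key]) < max_length
        ]
        # Divide by the number of strings counted, not by all strings.
        if lengths:
            averages[code] = sum(lengths) / len(lengths)
    return averages


# Toy data, invented for illustration only.
sample = {
    "en": {"save": "Save", "cancel": "Cancel", "help": "Help"},
    "de": {"save": "Speichern", "cancel": "Abbrechen"},
}
print(common_key_averages(sample))
```

With 150 languages the shared set of keys may turn out to be small, so comparing each language against English on their pairwise shared keys could be a more practical variant.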

And here are the full results. Please don’t take them too seriously, and feel free to write your own, better, calculation code!

#    Language code  Average translation length
1    zh-hans        17.67324825
2    zh-hant        18.52284388
3    skr-arab       21.81899964
4    ja             24.67007612
5    ko             25.8110372
6    sd             27.71960396
7    mhr            28.95451413
8    ps             32.73647059
9    pnb            33.03592163
10   bgn            34.39934667
11   se             34.69274476
12   hy             35.02317597
13   su             35.37706967
14   th             35.52957892
15   ce             35.6969602
16   mai            36.02093909
17   lv             36.14100906
18   gu             36.59380971
19   bcc            36.64866033
20   fy             37.60139287
21   nqo            37.94138834
22   he             37.95259865
23   en             38.04300371
24   ar             38.18569036
25   ckb            38.66867672
26   min            38.71156958
27   ses            38.87941712
28   jv             38.94753377
29   is             39.0652467
30   alt            39.39977435
31   az             39.4337931
32   kab            39.50967506
33   tk             39.54990758
34   mr             39.72049689
35   as             39.72080166
36   sw             39.73986071
37   km             39.77591036
38   azb            39.92411642
39   nn             39.96771069
40   yo             40.00503291
41   io             40.0528125
42   af             40.1640678
43   blk            40.2813059
44   sco            40.33289474
45   diq            40.33887373
46   yi             40.34033476
47   ur             40.39857651
48   ug-arab        40.53965184
49   da             40.55894826
50   my             40.67551519
51   kk-cyrl        40.87443182
52   guw            41.07080182
53   mg             41.08369028
54   sq             41.23219241
55   fa             41.27007299
56   or             41.27020202
57   ne             41.33971151
58   rue            41.40219378
59   lfn            41.54527278
60   lrc            41.61281337
61   sah            41.63293173
62   vi             41.74578313
63   awa            41.84093291
64   hi             41.9257885
65   si             41.93065693
66   te             41.99780915
67   mn             42.18728223
68   lki            42.21091396
69   bjn            42.57961538
70   sr-ec          42.67730151
71   cy             42.75020408
72   frr            42.92761394
73   vec            43.00573682
74   sr-el          43.13764389
75   nb             43.34987835
76   krc            43.54919554
77   ms             43.5553814
78   hr             43.55564807
79   eo             43.57477789
80   nds-nl         43.59060895
81   ka             43.60108696
82   ban            43.64178033
83   bs             43.681094
84   tt-cyrl        43.78230132
85   xmf            43.80860161
86   et             43.96494239
87   ba             43.99432099
88   tr             44.17996604
89   bn             44.28768449
90   bew            44.44706174
91   sv             44.49027333
92   sa             44.58670931
93   cs             44.59026764
94   ta             44.62803055
95   mt             44.70207417
96   lt             44.7615
97   roa-tara       44.79812466
98   fit            44.79824561
99   dsb            44.9151957
100  hsb            44.96197228
101  br             44.98873461
102  sh-latn        45.00976709
103  fi             45.1222031
104  hu             45.17139303
105  sk             45.35804702
106  lb             45.39073034
107  li             45.5539548
108  id             45.56471159
109  gsw            45.63605209
110  sl             45.75350606
111  be             45.80325
112  oc             45.85709988
113  mk             45.90943939
114  bcl            45.97070064
115  scn            46.11905532
116  an             46.14892665
117  uk             46.22955524
118  qu             46.30301842
119  eu             46.33589404
120  lij            46.660536
121  pl             46.76863316
122  hrx            46.79802761
123  ast            46.87204161
124  nap            46.93783147
125  ru             47.02326139
126  bg             47.03590259
127  be-tarask      47.28525242
128  hif-latn       47.41652614
129  tl             47.51263001
130  rm             47.60741067
131  pms            47.69805527
132  pt-br          47.84063647
133  ca             47.92468307
134  ro             48.22437186
135  nl             48.4175636
136  ia             48.48612816
137  it             48.52347014
138  frp            48.54542755
139  gl             48.57820482
140  ml             49.12108224
141  es             49.21062944
142  pt             49.63085602
143  de             49.77225067
144  szy            49.84650877
145  shn            49.92356241
146  fr             50.15585031
147  lmo            50.85627837
148  ilo            50.9798995
149  el             51.14834894
150  gd             51.72994269
151  ksh            53.36332609

Here is the Python 3 code I used to create the table. You can run it in the root directory of the core MediaWiki source tree. It’s horrible, please improve it!

import json
import os
import re

languages = {}
code_re = re.compile(r"(?P<code>[^/]+)\.json$")


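def process_file(filename):
    """Load one i18n JSON file and record its average message length."""
    code_search = code_re.search(filename)
    code = code_search.group("code")
    # 'qqq' holds message documentation rather than translations;
    # the other listed codes are skipped as well.
    if code in ('qqq', 'ti', 'lzh', 'yue-hant'):
        return

    with open(filename, "r", encoding="utf-8") as file:
        data = json.load(file)

    # Drop the metadata entry if present; it is not a message.
    data.pop('@metadata', None)
    average_unicode_length(code, data)


def average_unicode_length(language, translations):
    """Store the average length, in Unicode code points, for a language."""
    total_translations = len(translations)
    if total_translations < 2200:
        print('Language ' + language + ' has fewer than 2200 translations')
        return

    total_length = 0

    for translation in translations.values():
        # Strings of 300 characters or more add nothing to the total...
        if len(translation) < 300:
            total_length += len(translation)

    # ...but they are still counted in the divisor, which lowers the average.
    average_length = total_length / total_translations
    languages[language] = average_length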

root = "./languages/i18n/"
for file in os.listdir(root):
    if file.endswith(".json"):
        path = os.path.join(root, file)
        process_file(path)

sorted_languages = sorted(
    languages.items(),
    key=lambda item: item[1]
)

# Print the sorted items
for code, length in sorted_languages:
    print(code, '\t', length)
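A note on what `len()` measures here: for a Python 3 `str` it counts Unicode code points, not bytes and not what a reader perceives as characters. In scripts that write vowels as combining signs, one visible unit can count as several code points, which may partly explain some of the numbers above. A small illustration follows; the helper name `measure` and the sample words are made up for this example and are not MediaWiki messages.

```python
import unicodedata


def measure(word):
    """Return (code points, UTF-8 bytes, combining marks) for a string."""
    # General categories Mn, Mc and Me are marks that attach to a base letter.
    marks = sum(1 for ch in word if unicodedata.category(ch).startswith("M"))
    return len(word), len(word.encode("utf-8")), marks


samples = {
    "en": "help",
    "he": "עזרה",
    "hi": "सहायता",
}

for code, word in samples.items():
    code_points, utf8_bytes, marks = measure(word)
    print(code, code_points, utf8_bytes, marks, sep="\t")
```

Counting by code points is still a reasonable choice for a rough comparison, but a count of grapheme clusters or of rendered width would give a different ranking for some scripts.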
