Average Lengths of MediaWiki Translations

I was wondering: in which languages do user interface translations tend to be longer, and in which do they tend to be shorter?

The intuitive answers to these questions are that Chinese and Japanese are very short, English tends to be shorter than the average, Hebrew is shorter than English, and the longest ones are Turkish, Finnish, German, and Tamil. But what if I try to find a more precise answer?

So I made a super-simplistic calculation: I checked the average length of a core MediaWiki user interface message for English and the 150 languages with the highest number of translations.
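
One detail worth spelling out is what "length" means in this calculation. The script at the end of the post measures each string with Python's len(), which counts Unicode code points rather than bytes or visually distinct characters. Here is a small sketch of the difference. The helper name describe and the sample words are made up for illustration and are not taken from the MediaWiki message files.

```python
import unicodedata


def describe(text):
    """Return (code points, UTF-8 bytes, code points that are not marks)."""
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    # Rough approximation of "visible letters": skip characters whose Unicode
    # category starts with "M" (marks such as vowel signs and viramas).
    non_marks = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return code_points, utf8_bytes, non_marks


# Hypothetical sample words, written with escapes so the code points are explicit.
samples = {
    "en": "Edit",
    "zh-hans": "\u7f16\u8f91",               # 2 code points
    "he": "\u05e2\u05e8\u05d9\u05db\u05d4",  # 5 code points
    "ta": "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd",  # 5 code points, two of them marks
}

for code, word in samples.items():
    print(code, describe(word))
```

So a script that writes vowel signs as separate code points, such as Tamil or Malayalam, may look longer by this measure than it appears on screen.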

I sorted them from the shortest average length to the longest. The table is at the end of the post.

Here’s a verbal summary of some interesting points that I found:

The shortest messages are found, unsurprisingly, in Chinese, Japanese, and Korean.
Another group that surprised me by having very short translations is the Arabic-script languages of South Asia: Saraiki, Punjabi, Sindhi, Pashto, Balochi.
Three more languages surprised me by being at the shorter end of the list: Meadow Mari (mhr) and Northern Sami (se), which are Finno-Ugric, a family known for agglutinative grammar that tends to make words longer; and Armenian, which I had assumed, for no particular reason, to have longish words.
English is at #23 out of 151, with an average length of 38.
Hebrew is slightly above English at #22, with 37.95. This surprised me: I was always under the impression that Hebrew tends to be much shorter.
The longest languages are not quite the ones I thought! The longest ones tend to be the Romance languages: Lombard, French, Portuguese, Spanish, Galician, Arpitan, Romanian, Catalan.
Three Germanic languages, namely Colognian, German, and Dutch, are on the longer end of the list, but not all Germanic languages are. (Colognian is the longest in my list. The reason for this is not so natural, though: the most prolific translator into it, User:Purodha, liked writing out abbreviations in full, which made many strings longer than they needed to be. He passed away in 2016. May he rest in peace.)
Other language groups that tend to be longer are Slavic (Belarusian, Russian, Bulgarian, Polish, Ukrainian) and Austronesian (Sakizaya, Ilokano, Tagalog, Bikol, Indonesian).
Other notable, but not easily grouped languages that tend to be longer are Scottish Gaelic, Greek, Shan, Quechua, Finnish, Hungarian, Basque, and Malayalam. All of them have an average length between 45 and 53 characters.
Turkish is only slightly above average with 44.2, at #88.
Tamil is a bit longer, with an average length of 44.6, at #94. Strings in its sister language Malayalam are considerably longer, 49.1.
The median length is 43, and the average across all languages is 42. Notable languages at these lengths are Mongolian, Serbian, Welsh, Norwegian, Malay, Esperanto, Georgian, Balinese, Tatar, Estonian, and Bashkir. (Esperantists, are you glad to hear that your language appears almost exactly in the middle of this list?)

One important factor that I didn’t take into account is that, for various reasons, translators into different languages may choose to translate different messages, and one of those reasons may be that people translate shorter messages first because they are usually easier. I addressed this in a very quick and dirty way, by ignoring strings of 300 characters or more. Some time in the (hopefully near) future, I’ll try to find a smarter way to calculate it.
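
One possible way to reduce that bias is to compare languages only on the message keys that all of them have translated, and to divide by the number of strings actually counted. This is a sketch under my own assumptions, not the method used for the table below. The function name average_on_shared_keys, the max_length parameter, and the toy data are all hypothetical.

```python
def average_on_shared_keys(translations_by_language, max_length=300):
    """Average string length per language over keys translated in every language.

    translations_by_language maps a language code to a {message key: text} dict.
    """
    if not translations_by_language:
        return {}

    # Keep only the message keys that every language has translated.
    shared_keys = set.intersection(
        *(set(messages) for messages in translations_by_language.values())
    )

    averages = {}
    for code, messages in translations_by_language.items():
        lengths = [
            len(messages[key])
            for key in shared_keys
            if len(messages[key]) < max_length
        ]
        # Divide by the number of strings counted, not by all translations.
        if lengths:
            averages[code] = sum(lengths) / len(lengths)
    return averages


# Toy data, not real MediaWiki messages.
toy = {
    "aa": {"save": "Save", "cancel": "Cancel", "help": "Help"},
    "bb": {"save": "Sauvegarder", "cancel": "Annuler"},
}
print(average_on_shared_keys(toy))
```

With 151 languages, the set of keys shared by all of them may be much smaller than the full message list, so comparing each language only against English could be a more practical variant.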

And here are the full results. Please don’t take them too seriously, and feel free to write your own, better, calculation code!

#	Language code	Average translation length
1	zh-hans	17.67324825
2	zh-hant	18.52284388
3	skr-arab	21.81899964
4	ja	24.67007612
5	ko	25.8110372
6	sd	27.71960396
7	mhr	28.95451413
8	ps	32.73647059
9	pnb	33.03592163
10	bgn	34.39934667
11	se	34.69274476
12	hy	35.02317597
13	su	35.37706967
14	th	35.52957892
15	ce	35.6969602
16	mai	36.02093909
17	lv	36.14100906
18	gu	36.59380971
19	bcc	36.64866033
20	fy	37.60139287
21	nqo	37.94138834
22	he	37.95259865
23	en	38.04300371
24	ar	38.18569036
25	ckb	38.66867672
26	min	38.71156958
27	ses	38.87941712
28	jv	38.94753377
29	is	39.0652467
30	alt	39.39977435
31	az	39.4337931
32	kab	39.50967506
33	tk	39.54990758
34	mr	39.72049689
35	as	39.72080166
36	sw	39.73986071
37	km	39.77591036
38	azb	39.92411642
39	nn	39.96771069
40	yo	40.00503291
41	io	40.0528125
42	af	40.1640678
43	blk	40.2813059
44	sco	40.33289474
45	diq	40.33887373
46	yi	40.34033476
47	ur	40.39857651
48	ug-arab	40.53965184
49	da	40.55894826
50	my	40.67551519
51	kk-cyrl	40.87443182
52	guw	41.07080182
53	mg	41.08369028
54	sq	41.23219241
55	fa	41.27007299
56	or	41.27020202
57	ne	41.33971151
58	rue	41.40219378
59	lfn	41.54527278
60	lrc	41.61281337
61	sah	41.63293173
62	vi	41.74578313
63	awa	41.84093291
64	hi	41.9257885
65	si	41.93065693
66	te	41.99780915
67	mn	42.18728223
68	lki	42.21091396
69	bjn	42.57961538
70	sr-ec	42.67730151
71	cy	42.75020408
72	frr	42.92761394
73	vec	43.00573682
74	sr-el	43.13764389
75	nb	43.34987835
76	krc	43.54919554
77	ms	43.5553814
78	hr	43.55564807
79	eo	43.57477789
80	nds-nl	43.59060895
81	ka	43.60108696
82	ban	43.64178033
83	bs	43.681094
84	tt-cyrl	43.78230132
85	xmf	43.80860161
86	et	43.96494239
87	ba	43.99432099
88	tr	44.17996604
89	bn	44.28768449
90	bew	44.44706174
91	sv	44.49027333
92	sa	44.58670931
93	cs	44.59026764
94	ta	44.62803055
95	mt	44.70207417
96	lt	44.7615
97	roa-tara	44.79812466
98	fit	44.79824561
99	dsb	44.9151957
100	hsb	44.96197228
101	br	44.98873461
102	sh-latn	45.00976709
103	fi	45.1222031
104	hu	45.17139303
105	sk	45.35804702
106	lb	45.39073034
107	li	45.5539548
108	id	45.56471159
109	gsw	45.63605209
110	sl	45.75350606
111	be	45.80325
112	oc	45.85709988
113	mk	45.90943939
114	bcl	45.97070064
115	scn	46.11905532
116	an	46.14892665
117	uk	46.22955524
118	qu	46.30301842
119	eu	46.33589404
120	lij	46.660536
121	pl	46.76863316
122	hrx	46.79802761
123	ast	46.87204161
124	nap	46.93783147
125	ru	47.02326139
126	bg	47.03590259
127	be-tarask	47.28525242
128	hif-latn	47.41652614
129	tl	47.51263001
130	rm	47.60741067
131	pms	47.69805527
132	pt-br	47.84063647
133	ca	47.92468307
134	ro	48.22437186
135	nl	48.4175636
136	ia	48.48612816
137	it	48.52347014
138	frp	48.54542755
139	gl	48.57820482
140	ml	49.12108224
141	es	49.21062944
142	pt	49.63085602
143	de	49.77225067
144	szy	49.84650877
145	shn	49.92356241
146	fr	50.15585031
147	lmo	50.85627837
148	ilo	50.9798995
149	el	51.14834894
150	gd	51.72994269
151	ksh	53.36332609

Here is the Python 3 code I used to create the table. You can run it in the root directory of the core MediaWiki source tree. It’s horrible, please improve it!

import json
import os
import re

languages = {}
code_re = re.compile(r"(?P<code>[^/]+)\.json$")


def process_file(filename):
    code_search = code_re.search(filename)
    code = code_search.group("code")
    if code in ('qqq', 'ti', 'lzh', 'yue-hant'):
        return

    with open(filename, "r", encoding="utf-8") as file:
        data = json.load(file)
        # The metadata entry is not a translation, so drop it
        del data['@metadata']
        average_unicode_length(code, data)


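def average_unicode_length(language, translations):
    # Skip languages that have too few translations to compare fairly
    total_translations = len(translations)
    if total_translations < 2200:
        print('Language ' + language + ' has fewer than 2200 translations')
        return

    total_length = 0

    # len() counts Unicode code points; strings of 300 or more are ignored
    for translation in translations.values():
        if len(translation) < 300:
            total_length += len(translation)

    # Calculate the average length. Note that the divisor is the number of
    # all translations, including the long ones skipped above.
    average_length = total_length / total_translations
    languages[language] = average_length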

root = "./languages/i18n/"
for file in os.listdir(root):
    if file.endswith(".json"):
        path = os.path.join(root, file)
        process_file(path)

sorted_languages = sorted(
    languages.items(),
    key=lambda item: item[1]
)

# Print the sorted items
for code, length in sorted_languages:
    print(code, '\t', length)
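
One thing to keep in mind when improving the code: in the script above, strings of 300 characters or more are left out of the sum, but the divisor is still the total number of translations, so every skipped string pulls the average down. The following sketch compares that calculation with one that divides only by the number of strings counted. The helper name and the toy data are hypothetical.

```python
def average_lengths(strings, max_length=300):
    """Return (average over all strings, average over counted strings only).

    Assumes at least one string is shorter than max_length.
    """
    kept = [len(s) for s in strings if len(s) < max_length]
    total_length = sum(kept)
    # Same arithmetic as the script above: skipped strings still count
    # in the divisor.
    over_all = total_length / len(strings)
    # Alternative: divide by the number of strings that were actually summed.
    over_kept = total_length / len(kept)
    return over_all, over_kept


# Toy data: three short strings and one long string that gets skipped.
toy_strings = ["a" * 10, "b" * 20, "c" * 30, "d" * 400]
print(average_lengths(toy_strings))  # (15.0, 20.0)
```

If some languages have translated more of the long messages than others, the first number is not quite comparable across them, which may be worth fixing in a better version.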

Published by aharoni
