Archive for the 'Perl' Category

Always define the language and the direction of your HTML documents, part 01

I received this email from Safari Books Online:

Email in English from Safari Books, oriented like Hebrew

The email is written in English, but notice how the text is aligned unusually to the right. Notice also that the punctuation marks appear at the wrong end of the sentence. I used Firefox developer tools to apply the correct direction, and then it displayed correctly:

The same email, with corrected left-to-right formatting using Firefox developer tools

This happens because I use Gmail with the Hebrew interface. Gmail has to guess the direction of the emails that I receive, because in plain text there’s no easy way to specify the direction (I hope to discuss it in a separate post soon). Usually Gmail guesses correctly. Ironically, for HTML-formatted emails like this one, Gmail often guesses incorrectly, even though in HTML, unlike in plain text, it’s quite easy to specify the direction by simply adding dir="ltr" to the root element of the email.

Unfortunately a lot of HTML authors don’t bother to specify explicit direction. Many are not even aware of this exotic dir attribute. Others think that because “ltr” is the default, they don’t have to specify it. They are wrong: As this email shows, the left-to-right HTML content is embedded in a right-to-left environment, and the “rtl” definition propagates to the embedded content.

You could blame Gmail, of course, but it’s much more practical to always define the direction of your HTML content, even if it’s the default. You can never know where your content will end up.
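
To make the advice concrete, here is a minimal Python sketch of a pre-send check that adds an explicit direction to the root element when the author left it out. The function name `ensure_direction` is my own invention, and the regular expressions assume reasonably tidy markup; a real mail pipeline would probably use a proper HTML parser.

```python
import re

# The opening tag of the root element, matched case-insensitively.
_ROOT_TAG = re.compile(r"<(html)\b([^>]*)>", re.IGNORECASE)
# A dir attribute, but not something like data-dir.
_DIR_ATTR = re.compile(r"(?<![\w-])dir\s*=", re.IGNORECASE)


def ensure_direction(html, direction="ltr"):
    """Return html with an explicit dir attribute on the root element."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")

    def add_dir(match):
        tag, attrs = match.group(1), match.group(2)
        if _DIR_ATTR.search(attrs):
            return match.group(0)  # the author already chose a direction
        return f'<{tag}{attrs} dir="{direction}">'

    # Only the first <html ...> tag is touched; everything else is left as-is.
    return _ROOT_TAG.sub(add_dir, html, count=1)
```

An existing dir value is never overwritten, so right-to-left content that was marked correctly stays right-to-left.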

P.S.: I read this post before publishing and suddenly realized that its style is quite similar to “Best Practices” books, such as Damian Conway’s classic “Perl Best Practices” – it tells you to do something that is not obviously needed, and explains why it is needed nevertheless. I like to acknowledge sources of inspiration. Thank you, Damian.


The Secret Spell – how to easily make spelling checkers better

Software localization and language tools are poorly understood by a lot of people in general. Probably the most misunderstood language tool, despite its ubiquity, is spell checking.

Here are some things that most people probably do understand about spelling checkers:

  • Using a spelling checker does not guarantee perfect grammar and correctness of the text. False positives and false negatives happen.
  • Spelling checkers don’t include all possible words – they don’t have names, rare technical terms, neologisms, etc.

And here are some facts about spelling checkers that people often don’t understand. Some of them are so basic that they seem ridiculous, but nevertheless i heard them more than once:

  • Spelling checkers can exist for any language, not just for English.
  • At least in some programs it is possible to check the spelling of several languages at once, in one document.
  • Some spelling checkers intentionally omit some words, because they are too rare to be useful.
  • The same list of words can be used in several programs.
  • Contrariwise, the same language can have several lists of words available.

But probably the most overlooked fact about spelling checkers is that they are software just like any other: they were created by programmers, they have maintainers, and they have bugs. These bugs can be reported and fixed. This is relatively easy to do with Free Software like Firefox and LibreOffice, whereas proprietary software vendors usually don’t accept bug reports at all. But in fact, even with Free Software it is easy only in theory.

The problem with spelling checkers is that almost any person can easily find lots of missing words in them just by writing email and Facebook updates (and dare i mention, Wikipedia articles). It’s a problem, because there’s no easy way to report them. When the spell checker marks a legitimate word in red, the user can press “Add to dictionary”. This function adds the word to a local file, so it’s useful only for that user on that computer. It’s not even shared with that user’s other computers or mobile devices, and it’s certainly not shared with other people who speak that language and for whom that word can be useful.

The user can report a missing word as a bug in the bug tracking system of the program that he uses to write the texts, the most common examples being Firefox and LibreOffice. Both of these projects use Bugzilla to track bugs. However, filling out a whole Bugzilla report form just to report a missing word is way too hard and time-consuming for most users, so they won’t do it. And even if they did, it would be hard for the maintainers of Firefox and LibreOffice to handle that bug report, because the spelling dictionaries are usually maintained by other people.

Now what if reporting a missing word to the spelling dictionary maintainers were as easy as pressing “Add to dictionary”?

The answer is very simple – spelling dictionaries for many languages would quickly start to grow and improve. This is an area that just begs to be crowd-sourced. Sure, big, important and well-supported languages like English, French, Russian, Spanish and German may not really need it, because they have huge dictionaries already. But the benefit for languages without good software support would be enormous. I’m mostly talking about the languages of Africa, India and the Pacific, and about Native American languages, too.

There’s not much to do on the client side: Just let “Add to dictionary” send the information to a server instead of saving it locally. Anonymous reporting should probably be the default, but there can be an option to attach an email address to the report and get the response of the maintainer. The more interesting question is what to do on the server side. Well, that’s not too complicated, either.
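
As a rough illustration of how little the client needs to do, here is a Python sketch of the report that “Add to dictionary” could send. The field names and the `build_report` function are assumptions of mine, not part of any existing spelling checker or of the service described in this post; actually sending the report over the network is left out.

```python
import json


def build_report(word, language, email=None):
    """Build the JSON body for a missing-word report (hypothetical format)."""
    word = word.strip()
    if not word:
        raise ValueError("there is no word to report")
    report = {"word": word, "language": language}
    # Anonymous by default: the address is included only if the user opts in.
    if email is not None:
        report["email"] = email
    return json.dumps(report, ensure_ascii=False, sort_keys=True)
```

Keeping `ensure_ascii=False` means that words in non-Latin scripts travel as readable text rather than as escape sequences, which matters for exactly the languages that would benefit most.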

When the word arrives, the maintainer is notified and must do something about it. I can think of these possible resolutions:

  • The word is added to the dictionary and distributed to all users in the next released version.
  • The word is an inflected form of an existing word that the dictionary didn’t recognize because of a bug in the inflection logic. The bug is fixed and the fix is distributed to all users in the next released version.
  • The word is correct, but not added to the dictionary which is distributed to general users, because it’s deemed too rare to be useful for most people. It is, however, added to the dictionary for the benefit of linguists and other people who need complete dictionaries. Personal names that aren’t common enough to be included in the dictionary can receive similar treatment.
  • The word is not added to the dictionary, because it’s in the wrong language, but it can be forwarded to the maintainer of the spelling dictionary for that language. (The same can be done for a different spelling standard in the same language, like color/colour in English.)
  • The word is not added to the dictionary, because it’s a common misspelling (like “attendence” would be in English).
  • The word is not added to the dictionary, because it’s complete gibberish.

Some of the points above can be identified semi-automatically, but the ultimate decision should be up to the dictionary maintainer. Mistakes that are reported too often – again, “attendence” may become one – can be filtered out automatically. The IP addresses of abusive users who send too much garbage can be blocked.
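
The resolutions and the automatic filtering can be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the names `Resolution` and `ReportQueue`, the in-memory storage and the blocking threshold are all hypothetical, and only gibberish (not honest misspellings) counts against a reporter.

```python
from collections import Counter, defaultdict
from enum import Enum


class Resolution(Enum):
    ADDED = "added to the dictionary"
    INFLECTION_FIXED = "inflection bug fixed"
    TOO_RARE = "added to the complete dictionary only"
    WRONG_LANGUAGE = "forwarded to another dictionary"
    MISSPELLING = "rejected: common misspelling"
    GIBBERISH = "rejected: gibberish"


class ReportQueue:
    def __init__(self, garbage_limit=3):
        self.garbage_limit = garbage_limit
        self.decisions = {}              # word -> Resolution chosen by the maintainer
        self.pending = defaultdict(set)  # word -> IPs that reported it
        self.garbage = Counter()         # IP -> number of gibberish reports
        self.blocked = set()

    def submit(self, word, ip):
        """Return the known Resolution, or None if the word awaits review."""
        if ip in self.blocked:
            raise PermissionError(f"{ip} is blocked")
        decision = self.decisions.get(word)
        if decision is None:
            self.pending[word].add(ip)
        elif decision is Resolution.GIBBERISH:
            self._count_garbage(ip)
        return decision

    def resolve(self, word, decision):
        """Record the maintainer's decision; it is reused for repeat reports."""
        self.decisions[word] = decision
        reporters = self.pending.pop(word, set())
        if decision is Resolution.GIBBERISH:
            for ip in reporters:
                self._count_garbage(ip)

    def _count_garbage(self, ip):
        self.garbage[ip] += 1
        if self.garbage[ip] >= self.garbage_limit:
            self.blocked.add(ip)
```

Once the maintainer resolves “attendence” as a misspelling, every later report of it is answered from `decisions` without bothering the maintainer again, which is the automatic filtering described above.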

The same system for maintaining spelling dictionaries can be used for all languages and reside on the same website. This would be similar to translatewiki.net – one website in which all the translations for MediaWiki and related projects are handled. It makes sense on translatewiki.net, because the translation requirements for all languages are pretty much the same and the translators help each other. The requirements for spelling dictionaries are also mostly the same for all languages, even though they differ in the implementation of morphology and in other features, so developers of dictionaries for different languages can collaborate.

I already started implementing a web service for doing this. I called it Orthoman – “orthography manager”. I picked Perl and Catalyst for this – Perl is the language that i know best and i heard that Catalyst is a good framework for writing web services. I never wrote a web service from scratch before, so i’m slowish and this “implementation” doesn’t do anything useful yet. If you have a different suggestion for me – Ruby, Python, whatever – you are welcome to propose it to me. If you are a web service implementation genius and can implement the thing i described here in two hours, feel free to do it in any language.

Arab Inventors in Wikipedia

The famous provocative Russian designer and blogger Artemy Lebedev wrote in his blog today (my translation from Russian):

European (Christian) consciousness is built differently than the Eastern (Muslim).

The main unique property of the European culture is the ability to invent and create new things, technologies, items and products. Arab peoples are absolutely unable to invent something. Do we know anything Arabic? A television? A telephone? A car? At least one thing? My main complaint towards Islam is this – as a culture it is so egotistic, that I feel suffocated there.

Though very provocative in his use of language and in his criticism of ugly design, Lebedev is usually very secularist and anti-nationalistic. Sometimes, though, he does make some shocking and scathing remarks about ethnic and religious groups, such as this one.

It did make me think, however. Everybody knows that in the Middle Ages Arabs made many important advances in literature, medicine, astronomy, mathematics and other fields, but i really couldn’t think of an Arab inventor from the recent centuries. So i went to Wikipedia, opened Category:Inventors and descended to Category:Inventors by nationality.

There was only one Arab country listed: United Arab Emirates. Other prominent Muslim countries were Pakistan, Afghanistan, Iran and Turkey. Hmm. So i went to the page List of inventors, hoping that it would be more inclusive and easy to search. It didn’t help much – i found very few Arabs there, and they were mostly medieval characters.

And then i recalled that it’s the English Wikipedia. So i went to Category:Inventors by nationality in the Arabic Wikipedia. There i found several sub-categories for Arab countries: Saudi Arabia, Tunisia, Algeria, Lebanon and Egypt. There was no category for UAE, even though one existed in the English Wikipedia, and none of the categories i found in Arabic had an English counterpart; the one that existed for Algerian inventors was deleted a few months ago, because it was empty.

I went over the articles in these categories in the Arabic Wikipedia. Most of them didn’t have an English counterpart. There was an article in English about Hassan Kamel Al-Sabbah, a Lebanese engineer, so i created Category:Lebanese inventors for him and now there are two Arab countries under Category:Inventors by nationality in English.

There was also an article in English about Ahmed Zewail, an Egyptian chemist, and a couple of other scientists. All of them are probably great people, but reading the articles about them in English it seemed to me that even though it’s correct to call them “scientists” and maybe “discoverers”, they probably aren’t inventors. Of course, it’s possible that i misunderstood something, but it may also mean that for the people who tagged these people as “inventors”, this word had a somewhat different meaning. This may or may not mean that the Arabic word used in the category name, مخترع, covers both inventions and discoveries. The Al-Mawrid Arabic-English dictionary, which i use most of the time, says that this word means “inventor, creator, originator, innovator, maker, author”.


So, there’s a little lesson in cultural divide to be learned here. No, i don’t agree with Artemy Lebedev – i am certain that Arabs can and do invent things and the existence of articles about alleged inventors from Arab countries in the Arabic Wikipedia probably means that this is true. But currently chauvinistic people can take a look in the English Wikipedia, see that it has almost no Arab inventors and keep being sure that Arabs are, indeed, stupid and incapable of invention. Since Wikipedia is so easily available, they probably won’t bother to search for information elsewhere.

Unfortunately, my understanding of the Arab culture and language is too limited, but surely there must be an Arab who will take this challenge and improve the coverage of Arab inventors in the Wikipedia in English and other languages.

One way to do this would be to run the script that i wrote for finding and categorizing articles without interlanguage links; if you know Arabic and Perl, please contact me and i’ll gladly help you to set it up for the Arabic Wikipedia.

Expansionists

A group of right-wing Zionists wanted to try to get Wikipedia to represent their opinions better, so they tried to organize a course about it. They didn’t exactly have an instructor, so a couple of prominent Hebrew Wikipedia editors volunteered to help them: To give them a lecture about Wikipedia and the way it assures political neutrality. I know both of them personally and i believe that they did their job honestly.

Apparently, this piece of news was so important, that it reached The Guardian (Wikipedia editing courses launched by Zionist groups). Richard Stallman mentioned it in his “Political Notes” blog, saying: “Israeli expansionists are planning courses on how to slip their views into Wikipedia without triggering resistance.”

It’s actually not completely incorrect to call them “expansionists”, although it reminds me too strongly of the Soviet press, which used the exact same term to describe Israel. But it’s quite ridiculous to say that anyone in his right mind can plan to slip his views into Wikipedia without triggering resistance. It’s 2010 outside, and even people with strong political opinions already know that Wikipedia is supposed to be neutral. It doesn’t necessarily succeed at it, but to slip in views without triggering resistance? That’s patent nonsense.

One of the expansionists was quoted in the Guardian: “We don’t want to change Wikipedia or turn it into a propaganda arm, we just want to show the other side”. What do you know, an expansionist Zionist land-grabbing settler said a sentence that makes sense!

But what’s most disappointing about this whole thing is that it is a complete non-issue. Because if it is a significant newsworthy issue, then so are the lectures about Wikipedia that i gave to the left-wing youth movement Hanoar Haoved Vehalomed, to the Israeli Marine Mammal Research and Assistance Center, to the Rehovot Perl Mongers group and to several groups of the Hebrew University students and staff. These were the same lectures: Free culture, spreading knowledge, editing history, citing sources, neutral point of view. We the Wikipedians who volunteer to lecture on the website we are so passionate about say the same things about Wikipedia to left-wing people, to right-wing people, to programmers, to students and to scientists.

But hey, i’m happy to have read that silly Guardian article, because it made me realize that Wikipedia won: It is perceived as a website that is difficult to trick. Well, it really is.

Advocacy for the Uncool: SVN vs. git and Cygwin vs. the World

There are two Free Software packages that many Free Software people love to hate: Cygwin and Subversion.


Cygwin is a Unix-like environment on Windows. It gives the user a shell, and it’s possible to install Perl, Python, Ruby, GNU make, gcc, vim and many other familiar tools from the GNU world there. It’s even possible to run X windows using it.

I mostly use it for running Perl on Windows. There are two other major versions of Perl for Windows: ActiveState and Strawberry. Every now and then i try using them and i get immediately frustrated: from my experience, Cygwin is much more stable and predictable. Failure to install a CPAN module on Cygwin is much rarer than on ActiveState and Strawberry. Maybe i install the wrong modules, but for the modules that i need Cygwin did the job better.

Cygwin is not without problems. But surprisingly often it does the job more readily than ActiveState, Strawberry and GNU/Linux. Nevertheless, Free Software people tend to call me names when i tell them that i use Cygwin. “You should expect problems when you run an emulator instead of running real Linux!”, they say. Well, what do you know – sometimes, i have to run Windows, that’s a fact of life, and there are stupid problems with Linux, too.


Another stupid holy war in the Free Software community is Git vs. Subversion (SVN for short). Both are source code management (SCM) systems. The “cool” Free Software people say that git is better, because git lets you create your own repositories, because git is faster, because git is easier.

I can see the principal advantage in having a local repository, which is the way git works. I can work offline and make as many commits as i like. In SVN i need to go online for every commit. But that, in practice, is the only disadvantage that SVN has. People say that SVN sucks at branching and merging. They like to quote Linus Torvalds: “Did you ever try to merge using SVN? Did you enjoy the experience?” Well, i have news for them: I tried branching and merging using Perforce, Mercurial, ClearCase, SVN and git – and i didn’t enjoy the experience in any of them. So git also sucks at branching and merging, but the difference is that with git i lost data, too. Every single time i tried to branch and merge using git, i cursed the hell out of it, copied the files i wanted to change to a backup directory, deleted the repository, recreated it, and did the merge manually. Every single time.

Besides, every time i try to use git, i feel like a fucking scientologist, forced to look up every single word in the help files: how the hell am i supposed to remember the difference between “pull” and “fetch” or between “branch”, “clone” and “checkout”? To understand what “fetch” is, i need to understand what the fuck “head”, “tag”, “object” and “ref” are. Go on and tell me that i should sit down and learn git properly, but i didn’t have to sit down and learn SVN. It just worked without forcing me to understand things.

Call me stupid and old-fashioned, but SVN didn’t give me a headache. Ever.


So, cool kids, go on, keep being cool, keep telling people that Cygwin and SVN suck. But every now and then do a reality check, please. You find it fun to use git? Great. Just don’t force it on other people.

To the developers of Cygwin and SVN i want to say: Thank you. You deserve far more appreciation than you get.

People Speaking – Programming Perl

— “Hey, what are you doing?! Do you want to program in Perl?” (Hadar, to our cat, when she saw him jumping on her laptop keyboard.)

Regular expressions

I love regular expressions. I cannot live without regular expressions. I cannot understand how people can go on with their lives without using at least one regular expression every couple of hours. I don’t understand why schools teach algebra instead of regular expressions. I didn’t use any algebra in my life, ever. I used a lot of regular expressions and they are no less mathematical.

