Perl and Unicode

January 15, 2012

Great article on Perl and Unicode.

The days of just flinging strings around are over. It’s well established that modern programs need to be capable of communicating funny accented letters, and things like euro symbols. This means that programmers need new habits. It’s easy to program Unicode capable software, but it does require discipline to do it right.

There’s a lot to know about character sets, and text encodings. It’s probably best to spend a full day learning all this, but the basics can be learned in minutes.

These are not the very basics, though. It is assumed that you already know the difference between bytes and characters, and realise (and accept!) that there are many different character sets and encodings, and that your program has to be explicit about them. Recommended reading is “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky.

A word of caution: you don’t really need to care that Perl stores strings internally as UTF-8:

Please, unless you’re hacking the internals, or debugging weirdness, don’t think about the UTF-8 flag at all. That means that you very probably shouldn’t use is_utf8, _utf8_on or _utf8_off at all.

Perl’s internal format happens to be UTF-8. Unfortunately, Perl can’t keep a secret, so everyone knows about this. That is the source of much confusion. It’s better to pretend that the internal format is some unknown encoding, and that you always have to encode and decode explicitly.
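
To make “always encode and decode explicitly” a bit more concrete, here is a minimal sketch using the core Encode module. The sample string and the variable names are my own, made up for illustration; the only assumption is that the incoming bytes really are UTF-8. The pattern is: decode bytes into characters when data comes in, work on characters, and encode back to bytes when data goes out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket:
# the text "café €" encoded as UTF-8 (assumed input encoding).
my $bytes = "caf\xC3\xA9 \xE2\x82\xAC";
my $byte_count = length $bytes;    # length() of a byte string counts bytes

# Input boundary: bytes in, characters out.
my $text = decode('UTF-8', $bytes);
my $char_count = length $text;     # length() of decoded text counts characters

# Work on characters, without caring how Perl stores them internally.
my $upper = uc $text;

# Output boundary: characters in, bytes out.
my $out = encode('UTF-8', $upper);

print "bytes: $byte_count, characters: $char_count\n";
print $out, "\n";
```

Note that nothing here touches the UTF-8 flag. The program only states which encoding the outside world uses, once on the way in and once on the way out.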