Friday, 20 July 2007

Perl, unicode, utf-8, mysql

Handling Unicode/UTF-8 in Perl is fairly simple once you understand the "two string approaches": byte strings (octets) and character (wide-char) strings.

By default, Perl 5.8 treats a UTF-8 string as a plain sequence of bytes (1-4 bytes per character). You can decode ("compress") such octets into a Unicode character string with:
$wide_char_string = Encode::decode_utf8($octets);


Such a decoded string has the unicode flag set. You can check it with:
Encode::is_utf8($checked_string)


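If the octets contained non-ASCII characters (Latin-1/2, Russian, etc.), the decoded string is shorter than the original, because several octets become one character:
length($octets) > length($wide_char_string)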
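The decoding step above can be sketched as follows. This is a minimal example using only the core Encode module; the sample octets are an assumption chosen for illustration (two non-ASCII letters encoded as UTF-8).

```perl
use strict;
use warnings;
use Encode ();

# Sample input (assumed for illustration):
# "\xC5\xBC" is the UTF-8 encoding of U+017C,
# "\xC3\xB3" is the UTF-8 encoding of U+00F3.
my $octets = "\xC5\xBC\xC3\xB3";

# Decode the UTF-8 octets into a character string.
my $wide_char_string = Encode::decode_utf8($octets);

printf "octets: length %d, flag %s\n",
    length($octets), Encode::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length %d, flag %s\n",
    length($wide_char_string), Encode::is_utf8($wide_char_string) ? 'on' : 'off';
```

Four octets become two characters, and only the decoded string carries the flag.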


Remember: the unicode flag does not mean the string MUST contain wide characters.
A flagged string can contain wide characters (ord > 255), but it can also hold only characters below 256, including UTF-8 octets that were "unpacked" into it one octet per character instead of being decoded.

Use Test::utf8 to check whether a string really contains wide characters.
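If Test::utf8 is not at hand, the difference between "flagged" and "has wide chars" can be shown with core modules only. This is a rough sketch; has_wide_chars is a hypothetical helper written for this example, not part of Encode or Test::utf8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if any character has a code point above 255.
sub has_wide_chars {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

# One Latin-1 octet decoded to U+00E9: flagged, but not wide.
my $latin = Encode::decode('iso-8859-1', "\xE9");

# UTF-8 octets decoded to U+017C: flagged and wide.
my $wide = Encode::decode_utf8("\xC5\xBC");

# The same UTF-8 octets wrongly decoded as Latin-1: flagged,
# two characters (U+00C5, U+00BC), none of them wide.
my $unpacked = Encode::decode('iso-8859-1', "\xC5\xBC");

for my $case ([latin => $latin], [wide => $wide], [unpacked => $unpacked]) {
    my ($name, $string) = @$case;
    printf "%-8s length %d, flag %s, wide chars %s\n",
        $name, length($string),
        Encode::is_utf8($string)  ? 'on'  : 'off',
        has_wide_chars($string)   ? 'yes' : 'no';
}
```

All three strings carry the flag, but only the second one holds a wide character.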

FAQ / typical problems:



- check that all data in your program uses the same representation: either unicode (wide-char) strings or UTF-8 byte strings. Wide-char strings are typically a better approach than byte strings (see perldoc Encode). All inputs/outputs such as files, DBI and network need to be converted to your chosen internal format.

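- do not use the "use utf8" pragma unless you really need it. It only tells Perl that the source file itself is written in UTF-8, so string literals with non-ASCII characters in that file get the unicode flag. It does not affect data read from outside, and the flag still does not mean the strings contain wide chars!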

(octet_string eq unicode_string) == false
For non-ASCII content you cannot compare such strings directly; decode the octets or encode the wide-char string first, because the two forms are natively different!
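The points above (one internal format, conversion at the I/O boundary, no direct comparison) can be illustrated like this. It is a sketch using core modules only; an in-memory filehandle stands in for a real file, and the sample letter is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xC5\xBC";    # UTF-8 bytes of U+017C (2 octets)
my $chars  = "\x{17C}";     # the same letter as 1 character

# Direct comparison fails; comparison after conversion succeeds.
my $raw_equal     = ($octets eq $chars) ? 1 : 0;
my $decoded_equal = (Encode::decode_utf8($octets) eq $chars) ? 1 : 0;
my $encoded_equal = ($octets eq Encode::encode_utf8($chars)) ? 1 : 0;
print "raw: $raw_equal, decoded: $decoded_equal, encoded: $encoded_equal\n";

# Output boundary: the encoding layer turns characters into octets.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $chars;
close($out) or die "close: $!";

# Input boundary: the encoding layer turns octets back into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $read_back = <$in>;
close($in);

printf "stored %d octets, read back %d character(s)\n",
    length($buffer), length($read_back);
```

With the layers in place, the rest of the program only ever sees wide-char strings.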

Perl, utf8, MySQL



Two approaches to get unicode working with MySQL:

1) after connecting to the database, run $dbh->do("SET NAMES 'utf8'");
All unicode strings fetched from the database will be octets, without the unicode flag.

2) open the DBI connection with the mysql_enable_utf8 flag (available since DBD::mysql >= 4)
All unicode strings fetched from the database will have wide chars and the unicode flag.

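Only the second approach works correctly with the AutoReconnect flag (mysql_auto_reconnect): SET NAMES applies to the current session only, so it is lost when the connection is re-established.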

Good news: for inserts/updates it makes no difference whether you pass octets or wide-char strings.
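A minimal sketch of the second approach follows. utf8_connect_args and connect_utf8 are hypothetical helper names, and the database, host, user and password values are placeholders. DBI is loaded only inside connect_utf8, so the argument-building part runs without a database.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the DBI->connect arguments for approach 2.
sub utf8_connect_args {
    my (%opt) = @_;
    my $dsn = sprintf 'DBI:mysql:database=%s;host=%s',
        $opt{database}, $opt{host};
    my %attr = (
        RaiseError           => 1,
        mysql_enable_utf8    => 1,    # fetched strings come back as wide-char strings
        mysql_auto_reconnect => 1,    # the utf8 setting is a connection attribute
    );
    return ($dsn, $opt{user}, $opt{password}, \%attr);
}

# Hypothetical helper: connects using the arguments above.
sub connect_utf8 {
    my (%opt) = @_;
    require DBI;    # loaded lazily, needs DBI and DBD::mysql installed
    return DBI->connect(utf8_connect_args(%opt));
}

# Placeholder values, for illustration only.
my @args = utf8_connect_args(
    database => 'test',
    host     => 'localhost',
    user     => 'someuser',
    password => 'secret',
);
print "DSN: $args[0]\n";
```

Calling connect_utf8 with the same options returns a database handle whose fetched strings are already decoded.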

More to read



See also: Martin Fowler - utf8 in perl
