Perl 5.8 by default treats a UTF-8 string as a plain sequence of bytes (1-4 bytes per character). You can decode these octets into a unicode (character) string with:
$wide_char_string = Encode::decode_utf8($octets)
A string decoded this way has the unicode flag set; you can check it with:
Encode::is_utf8($checked_string)
If the octets contain non-ASCII characters (latin1/2, russian, etc.), the decoded string is shorter, because length() now counts characters instead of bytes:
length($octets) > length($wide_char_string)
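A minimal sketch of the decode step and the length difference, using only the core Encode module. The sample word and the variable names $octet_len and $char_len are illustrative choices, not part of the original text:

```perl
use strict;
use warnings;
use Encode ();

# "Zażółć" written as raw UTF-8 octets (hex escapes keep this file pure ASCII).
my $octets = "Za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";

# Decode the octets into a character string.
my $wide_char_string = Encode::decode_utf8($octets);

my $octet_len = length($octets);            # counts bytes
my $char_len  = length($wide_char_string);  # counts characters

printf "octets: %d, characters: %d\n", $octet_len, $char_len;
printf "flag on octets:  %s\n", Encode::is_utf8($octets)           ? 'yes' : 'no';
printf "flag on decoded: %s\n", Encode::is_utf8($wide_char_string) ? 'yes' : 'no';
```

Each of the four accented letters takes two bytes in UTF-8, so the octet string is 10 bytes long while the decoded string has 6 characters.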
Remember: the unicode flag does not mean the string MUST contain wide characters.
Wide characters (ord > 255) can be in such a unicode string, but it can also hold only characters below 256, or even raw UTF-8 octets that were flagged without being decoded.
Use Test::utf8 to check whether you have wide chars or not.
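To make the difference between "flagged" and "has wide chars" concrete, here is a small sketch that uses only core functions rather than Test::utf8. The helper has_wide_chars is a hypothetical name introduced for this example:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the string has at least one character above 255.
sub has_wide_chars {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

# "café" decoded: flagged, but é is U+00E9 (233), so no wide characters.
my $latin = Encode::decode_utf8("caf\xC3\xA9");

# "łódź" decoded: flagged, and ł is U+0142 (322), a wide character.
my $wide = Encode::decode_utf8("\xC5\x82\xC3\xB3d\xC5\xBA");

# Octets flagged without decoding: still one "character" per byte.
my $upgraded = "caf\xC3\xA9";
utf8::upgrade($upgraded);

for my $case ([latin => $latin], [wide => $wide], [upgraded => $upgraded]) {
    my ($label, $string) = @$case;
    printf "%-8s flag=%d wide=%d length=%d\n",
        $label, Encode::is_utf8($string) ? 1 : 0,
        has_wide_chars($string), length($string);
}
```

All three strings carry the flag, but only the second one contains a wide character. The third is the misleading case: it is flagged, yet its 5 characters are still the undecoded bytes.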
FAQ / typical problems:
- check that all string sources in your program use the same representation: either decoded unicode (wide) strings or UTF-8 byte strings. Typically wide-char strings are a better approach than byte strings (see perldoc Encode). All inputs/outputs like files, DBI, network need to be converted to your chosen internal format.
- do not use the "use utf8" pragma unless you really need it. It only tells Perl that the source file itself is written in UTF-8, so string literals with non-ASCII characters get the unicode flag, but it does not mean they have wide chars!
(octet_string eq unicode_string) == false
You cannot compare an octet string with a unicode string without decode/encode: as soon as the text contains non-ASCII characters, they are natively different!
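The comparison problem can be reproduced in a few lines. This is only a sketch with an illustrative sample word; the variable names ending in _equal are made up for the example:

```perl
use strict;
use warnings;
use Encode ();

my $octet_string   = "\xC5\xBC\xC3\xB3\xC5\x82w";          # "żółw" as UTF-8 octets
my $unicode_string = Encode::decode_utf8($octet_string);  # same text as characters

# Comparing the two representations directly fails.
my $naive_equal = ($octet_string eq $unicode_string) ? 1 : 0;

# Bring both sides to the same representation first.
my $decoded_equal = (Encode::decode_utf8($octet_string) eq $unicode_string) ? 1 : 0;
my $encoded_equal = ($octet_string eq Encode::encode_utf8($unicode_string)) ? 1 : 0;

print "naive:   $naive_equal\n";
print "decoded: $decoded_equal\n";
print "encoded: $encoded_equal\n";
```

Whether you decode the octets or encode the characters does not matter for the comparison itself; what matters is that both operands are in the same form, which is why choosing one internal format for the whole program pays off.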
Perl, utf8, MySQL
Two approaches to get unicode working with MySQL:
1) after connecting to the database, do("SET NAMES 'utf8'");
All unicode strings will come back as UTF-8 octets, without the unicode flag.
2) open the DBI connection with the mysql_enable_utf8 flag (available since DBD::mysql 4)
All unicode strings will have wide chars and unicode flag.
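Only the second approach works correctly with the auto-reconnect flag (mysql_auto_reconnect), presumably because SET NAMES is a per-session setting that is lost when the driver silently reconnects.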
Good news: for inserts/updates it makes no difference whether you use octets or wide-char strings.
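A rough sketch of the second approach, with the first one shown as a comment. The DSN, credentials, the users table and the RUN_DB_EXAMPLE environment variable are all hypothetical placeholders; the connection is attempted only when that variable is set, so the snippet can be run without a database:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical connection details; replace with your own.
my $dsn = 'DBI:mysql:database=test;host=localhost';
my ($user, $password) = ('user', 'secret');

# Approach 2: ask DBD::mysql to return flagged character strings.
my %attr = (
    RaiseError        => 1,
    mysql_enable_utf8 => 1,
);

# RUN_DB_EXAMPLE is a made-up switch for this sketch.
if ($ENV{RUN_DB_EXAMPLE}) {
    require DBI;
    my $dbh = DBI->connect($dsn, $user, $password, \%attr);

    # Approach 1 would instead drop mysql_enable_utf8 from %attr and run:
    # $dbh->do("SET NAMES 'utf8'");

    # "users" and "name" are assumed to exist only for illustration.
    my ($name) = $dbh->selectrow_array('SELECT name FROM users LIMIT 1');
    printf "flag on fetched value: %s\n",
        (defined $name && Encode::is_utf8($name)) ? 'yes' : 'no';

    $dbh->disconnect;
}
```

Passing the flag as a connection attribute keeps the setting with the connection parameters rather than in a statement that has to be re-run by hand.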
More to read
See also: Martin Fowler - utf8 in perl