Wednesday, 8 August 2007

Mysql, convert from latin1-unicode to unicode-unicode

If you have your unicode data stored in a default latin1 mysql data fields, there is a simple way to migrate it to a proper unicode/utf8 coding:

1) change text fields (char/varchar/text) format to BLOB to loose coding info
2) change BLOB to varchar/text setting proper coding.

ALTER TABLE `news` CHANGE `body` `body` BLOB NULL DEFAULT NULL;
ALTER TABLE `news` CHANGE `body` `body` TEXT CHARACTER SET utf8 NULL DEFAULT NULL;

Description: if you have 2/3-bytes sequences as a one non-ascii character in latin1 field, it means you have unicode (eg. utf8) text there - but for mysql it is still a latin1 string. It can work for some time if you do all your data operations outside database in a programming language and connect to MySQL as a latin1 client. But probably this is not what you would like to have as a long-term solution, e.g. you would like to use SQL for sort/comparisons, any reporting tools so you need to have data properly encoded.
If you do convert directly unicode-in-latin1 to unicode (by single alter table) it would become a 4-byte-mess, as MySQL will take existing utf8 multi-byte sequences as proper latin1 characters. The workaround is to hide coding info for MySQL by going through temporary BLOB step to keep all data untouched.

Be aware: After few several latin1-unicode conversions, I would check in the first step if you do not have a mix of latin1/utf8 coding and perhaps deal with it by conversion in external scripts.

2 comments:

Anonymous said...

Thanks a lot for this post, this is going to help us a lot

Aashish said...

This was such a great tip! Thank you!!