Character encoding
Breaking and unbreaking your data
Maciej Dobrzanski
maciek@psce.com | @mushupl
Brussels, 1 Feb 2015

Character Encoding - MySQL DevRoom - FOSDEM 2015


Character encoding configuration in MySQL has always been a bit confusing. With too many options to set, unclear relationships between them, and the default settings that make MySQL incompatible with most languages, it is a headache to many users, many of whom end up with broken data. This lecture will provide an overview of the character set support in MySQL, guidelines on how to use it correctly, and will demonstrate several methods of detecting and repairing mangled data.

Published in: Software

  1. Character encoding
     Breaking and unbreaking your data
     Maciej Dobrzanski
     maciek@psce.com | @mushupl
     Brussels, 1 Feb 2015
     01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  2. Character Encoding
     • Binary representation of glyphs
     • Each character can be represented by 1 or more bytes
     • Popular schemes
       • ASCII
       • Unicode
         • UTF-8, UTF-16, UTF-32
       • Language specific character sets
         • US (Latin US)
         • Europe (Latin 1, Latin 2)
         • Asia (EUC-KR, GB18030)
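
The "1 or more bytes" point can be checked outside the database. The following is a minimal Python sketch, not part of the original slides; the sample characters and the list of codecs are chosen for illustration only.

```python
# Illustrative only: how many bytes each sample character needs in a few encodings.
samples = ["A", "ü", "ń", "東"]
encodings = ["ascii", "latin-1", "iso8859-2", "utf-8", "utf-16-le", "utf-32-le"]


def byte_length(ch, encoding):
    """Return the encoded length in bytes, or None if the encoding lacks the character."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# table[character][encoding] -> number of bytes, or None when not representable
table = {ch: {enc: byte_length(ch, enc) for enc in encodings} for ch in samples}

for ch, row in table.items():
    print(ch, row)
```

A `None` entry marks a character the encoding cannot represent at all, which is the situation that later leads to lost data.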
  3. Character Encoding
     • Character set defines the visual interpretation of binary information
     • One glyph can be associated with several numeric codes
     • One numeric code may be used to represent several different glyphs
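
Both directions of that mapping can be shown with a single byte and a single letter. This is a small Python sketch added for illustration; the byte 0xF1 and the letter 'ń' are examples picked here, not taken from the slides.

```python
# One numeric code, several glyphs: the same byte read under two character sets.
code = bytes([0xF1])
as_latin1 = code.decode("latin-1")    # ISO 8859-1 reads 0xF1 as 'ñ'
as_latin2 = code.decode("iso8859-2")  # ISO 8859-2 reads 0xF1 as 'ń'

# One glyph, several numeric codes: the same letter written in three encodings.
glyph = "ń"
codes = {enc: glyph.encode(enc).hex().upper()
         for enc in ("iso8859-2", "utf-8", "utf-16-be")}

print(as_latin1, as_latin2, codes)
```

The bytes alone do not say which reading is correct; that information has to come from the configured character set.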
  4. Please state the nature of the emergency
     • Application configuration
     • Database configuration
     • Table/column definitions
  5. Problem #1: We are all born Swedish
     • MySQL uses latin1 by default
       • MySQL 5.7 too
     • Is anyone actually aware of that?
     • Why Swedish?
       • latin1_swedish_ci is the default collation
  6. Problem #1
     • Let’s build an application

         mysql> SELECT @@global.character_set_server, @@session.character_set_client;
         +-------------------------------+--------------------------------+
         | @@global.character_set_server | @@session.character_set_client |
         +-------------------------------+--------------------------------+
         | latin1                        | latin1                         |
         +-------------------------------+--------------------------------+
         1 row in set (0.00 sec)

         mysql> CREATE SCHEMA fosdem;
         Query OK, 1 row affected (0.00 sec)

         mysql> USE fosdem;
         mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
         Query OK, 0 rows affected (0.15 sec)

         mysql> SHOW CREATE TABLE locations\G
         *************************** 1. row ***************************
                Table: locations
         Create Table: CREATE TABLE `locations` (
           `city` varchar(30) NOT NULL
         ) ENGINE=InnoDB DEFAULT CHARSET=latin1
         1 row in set (0.00 sec)
  7. Problem #1
  8. Problem #1
  9. Problem #1
     • Everything is correct… NOT!

         mysql> SET NAMES utf8;
         Query OK, 0 rows affected (0.00 sec)

         mysql> select * from locations;
         +--------------------+
         | city               |
         +--------------------+
         | Berlin             |
         | Kraków             |
         | 東京都             |
         +--------------------+
         3 rows in set (0.00 sec)

         mysql> SET NAMES latin1;
         Query OK, 0 rows affected (0.00 sec)

         mysql> select * from locations;
         +-----------+
         | city      |
         +-----------+
         | Berlin    |
         | Kraków    |
         | 東京都    |
         +-----------+
         3 rows in set (0.00 sec)
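
Why the same rows can look fine over one connection setting and wrong over another can be sketched in Python. This is an illustration under stated assumptions, not the speaker's demo: it assumes a UTF-8 application writing through a connection declared as latin1, and it uses Python's `latin-1` codec as a stand-in for MySQL's latin1 (which is closer to cp1252 and differs for a few byte values).

```python
original = "Kraków"

# The application sends UTF-8 bytes, but the connection says "this is latin1".
wire_bytes = original.encode("utf-8")
server_view = wire_bytes.decode("latin-1")   # what the server believes it stored

# Reading back with latin1: the server returns the bytes untouched and a UTF-8
# terminal happens to render them correctly.
seen_with_names_latin1 = server_view.encode("latin-1").decode("utf-8")

# Reading back with utf8: the server transcodes its latin1 characters to UTF-8,
# so every original byte turns into its own character.
seen_with_names_utf8 = server_view.encode("utf-8").decode("utf-8")

print(seen_with_names_latin1)  # looks correct
print(seen_with_names_utf8)    # garbled
```

The data only looks correct as long as every client repeats the same wrong declaration.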
  10. Problem #1
      • Let’s fix this
      • Or can we ignore it?
        • Ruby may not like it

          # grep character-set-server /etc/mysql/my.cnf
          character-set-server = utf8

          mysql> SELECT @@global.character_set_server, @@session.character_set_client;
          +-------------------------------+--------------------------------+
          | @@global.character_set_server | @@session.character_set_client |
          +-------------------------------+--------------------------------+
          | utf8                          | utf8                           |
          +-------------------------------+--------------------------------+
          1 row in set (0.00 sec)

          ...we are fixing our tables here...

          mysql> SHOW CREATE TABLE locations\G
          *************************** 1. row ***************************
                 Table: locations
          Create Table: CREATE TABLE `locations` (
            `city` varchar(30) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8
          1 row in set (0.00 sec)
  11. Problem #1: The good news
      • It’s usually fixable
  12. Problem #2: Settings, defaults, inheritance
      • Where do you set character sets in MySQL?
        • Session settings
          • character_set_server
          • character_set_client
          • character_set_connection
          • character_set_database
          • character_set_results
        • Schema level defaults
        • Table level defaults
        • Column charsets
  13. Problem #2
      • Having fixed our problem #1, we continue to develop our application

          mysql> SELECT @@session.character_set_server, @@session.character_set_client;
          +--------------------------------+--------------------------------+
          | @@session.character_set_server | @@session.character_set_client |
          +--------------------------------+--------------------------------+
          | utf8                           | utf8                           |
          +--------------------------------+--------------------------------+
          1 row in set (0.00 sec)

          mysql> USE fosdem;
          mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL);
          Query OK, 0 rows affected (0.13 sec)
  14. Problem #2
  15. Problem #2
  16. Problem #2
      • Why is the table character set latin1?

          mysql> SELECT @@session.character_set_server, @@session.character_set_client;
          +--------------------------------+--------------------------------+
          | @@session.character_set_server | @@session.character_set_client |
          +--------------------------------+--------------------------------+
          | utf8                           | utf8                           |
          +--------------------------------+--------------------------------+
          1 row in set (0.00 sec)

          mysql> USE fosdem;
          mysql> SHOW CREATE TABLE people\G
          *************************** 1. row ***************************
                 Table: people
          Create Table: CREATE TABLE `people` (
            `first_name` varchar(30) NOT NULL,
            `last_name` varchar(30) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1
          1 row in set (0.00 sec)
  17. Problem #2
      • What’s all this, then?

          mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
          +--------------------------+----------------------------+
          | Variable_name            | Value                      |
          +--------------------------+----------------------------+
          | character_set_client     | utf8                       |
          | character_set_connection | utf8                       |
          | character_set_database   | latin1                     |
          | character_set_filesystem | binary                     |
          | character_set_results    | utf8                       |
          | character_set_server     | utf8                       |
          | character_set_system     | utf8                       |
          | character_sets_dir       | /usr/share/mysql/charsets/ |
          +--------------------------+----------------------------+
          8 rows in set (0.00 sec)

          mysql> SHOW CREATE DATABASE fosdem\G
          *************************** 1. row ***************************
                 Database: fosdem
          Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
          1 row in set (0.00 sec)
  18. Problem #2
      • Can we fix this?

          mysql> SET NAMES utf8;
          Query OK, 0 rows affected (0.00 sec)

          mysql> SELECT last_name, HEX(last_name) FROM people;
          +------------+----------------------+
          | last_name  | HEX(last_name)       |
          +------------+----------------------+
          | Lemon      | 4C656D6F6E           |
          | Müller     | 4DFC6C6C6572         |
          | Dobrza?ski | 446F62727A613F736B69 |
          +------------+----------------------+
          3 rows in set (0.00 sec)

          mysql> SET NAMES latin2;
          Query OK, 0 rows affected (0.00 sec)

          mysql> SELECT last_name, HEX(last_name) FROM people;
          +------------+----------------------+
          | last_name  | HEX(last_name)       |
          +------------+----------------------+
          | Lemon      | 4C656D6F6E           |
          | Müller     | 4DFC6C6C6572         |
          | Dobrza?ski | 446F62727A613F736B69 |
          +------------+----------------------+
          3 rows in set (0.00 sec)

      • We can’t! :-(
        • 0x3F is '?', so my 'ń' was lost
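
The stored bytes shown above are what a lossy conversion into latin1 produces. The Python sketch below reproduces them; it is an illustration only, using Python's `latin-1` codec with `errors="replace"` as an approximation of the server substituting '?' for characters the target character set cannot hold.

```python
names = ["Lemon", "Müller", "Dobrzański"]

# Convert each name to latin1, replacing anything unrepresentable with '?'.
stored = {name: name.encode("latin-1", errors="replace") for name in names}
hexes = {name: data.hex().upper() for name, data in stored.items()}

for name in names:
    print(name, stored[name].decode("latin-1"), hexes[name])
```

'ü' exists in latin1 and survives as 0xFC, while 'ń' does not and is stored as 0x3F. After that point no connection setting can bring it back, because the original byte sequence is gone.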
  19. Problem #2: The bad news
      • It may not be enough to configure the server correctly
      • A mismatch between client and server can permanently break data
        • Implicit conversion inside MySQL server
  20. Problem #2: Settings, defaults, inheritance
      • Where do you set character sets in MySQL?
        • Session settings
          • character_set_server
          • character_set_client
          • character_set_connection
          • character_set_database
          • character_set_results
        • Schema level defaults – affect new tables
        • Table level defaults – affect new columns
        • Column charsets
  21. Problem #2: Settings, defaults, inheritance

          master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
          +-------------------------------+--------------------------------+
          | @@global.character_set_server | @@session.character_set_client |
          +-------------------------------+--------------------------------+
          | latin1                        | utf8                           |
          +-------------------------------+--------------------------------+
          1 row in set (0.00 sec)

          master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
          Query OK, 1 row affected (0.00 sec)

          master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
          *************************** 1. row ***************************
                 Database: fosdem
          Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
          1 row in set (0.00 sec)
  22. Problem #2: Settings, defaults, inheritance

          master [localhost] {msandbox} ((none)) > USE fosdem;
          Database changed

          master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
          Query OK, 0 rows affected (0.62 sec)

          master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
          *************************** 1. row ***************************
                 Table: test
          Create Table: CREATE TABLE `test` (
            `a` varchar(300) DEFAULT NULL,
            KEY `a` (`a`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1
          1 row in set (0.00 sec)
  23. Problem #2: Settings, defaults, inheritance

          master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
          Query OK, 0 rows affected (0.08 sec)
          Records: 0 Duplicates: 0 Warnings: 0

          master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
          *************************** 1. row ***************************
                 Table: test
          Create Table: CREATE TABLE `test` (
            `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
            KEY `a` (`a`)
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8
          1 row in set (0.00 sec)
  24. Problem #2: Settings, defaults, inheritance

          master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
          Query OK, 0 rows affected (0.74 sec)
          Records: 0 Duplicates: 0 Warnings: 0

          master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
          *************************** 1. row ***************************
                 Table: test
          Create Table: CREATE TABLE `test` (
            `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
            `b` varchar(10) DEFAULT NULL,
            KEY `a` (`a`)
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8
          1 row in set (0.00 sec)
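
The inheritance shown across the last four slides can be summarised as "defaults are copied at creation time, never re-evaluated". The toy Python model below is a sketch of that rule only; the `Table` class and its method names are hypothetical and have nothing to do with MySQL internals.

```python
class Table:
    """Toy model: a table copies the schema default once, columns copy the table default once."""

    def __init__(self, schema_default):
        self.default = schema_default   # copied at CREATE TABLE time
        self.columns = {}

    def add_column(self, name, charset=None):
        # An explicit charset wins; otherwise the current table default is copied in.
        self.columns[name] = charset or self.default

    def alter_default(self, charset):
        # Like ALTER TABLE ... DEFAULT CHARSET: existing columns keep what they had.
        self.default = charset


server_charset = "latin1"
schema_default = server_charset    # CREATE SCHEMA without an explicit charset
test = Table(schema_default)       # CREATE TABLE test (...)
test.add_column("a")               # inherits latin1
test.alter_default("utf8")         # only affects columns added later
test.add_column("b")               # inherits utf8

print(test.default, test.columns)
```

The result mirrors the final `SHOW CREATE TABLE` output: column `a` stays latin1, column `b` and the table default are utf8.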
  25. I f**ckd up. What do I do?
      • Let’s start with what you shouldn’t do
        • Keep calm and don’t start by changing something
      • Analyze the situation
        • Why did the problem occur in the first place?
      • Reassess the damage
        • Is it consistent?
          • Are all rows broken in the same way?
          • Are some rows bad, but others are okay?
          • Are all bad in several different ways?
        • Is it actually repairable?
          • No character mapping occurred during writes (e.g. unicode over latin1/latin1)
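
One way to approach the "are all rows broken in the same way?" question is to classify a sample of values before touching anything. The Python sketch below is a rough heuristic added for illustration, not a method from the talk; the function name and the category labels are hypothetical, and a '?' can of course also be legitimate data.

```python
def classify(value):
    """Guess how a string read from a latin1 column was damaged, if at all."""
    if value.isascii():
        # Pure ASCII is identical in latin1 and utf8; a '?' may mark a lost character.
        return "suspect-lossy" if "?" in value else "ascii"
    try:
        # If the latin1 bytes form valid UTF-8, the row is probably UTF-8 stored as latin1.
        value.encode("latin-1").decode("utf-8")
        return "utf8-in-latin1"
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "genuine-latin1-or-other"


sample = ["Berlin", "Krak\u00c3\u00b3w", "Müller", "Dobrza?ski"]
report = {value: classify(value) for value in sample}
print(report)
```

A table whose rows all fall into one category is a much better repair candidate than one with a mix.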
  26. I f**ckd up. What else shouldn’t I do, then?
      • Do not rush things as you may easily go from bad to worse
      • Do not start fixing this on a replication slave
      • You can’t fix this by fixing tables one by one on a live database
        • Unless you really have everything in one table
      • Do not use: ALTER TABLE … DEFAULT CHARSET = …
        • It only changes the default character set for new columns
      • Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
        • It’s not for fixing broken encoding
      • Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …
  27. I f**ckd up. So how do I fix it?
      • What needs to be fixed?
        • Schema default character set
          • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
        • Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
          • What about ENUM?
          • Use INFORMATION_SCHEMA to grab a list
        • What about other tables?
          • They too (eventually), but it’s not critical

          SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
            FROM information_schema.columns c
           WHERE c.table_schema = 'fosdem'
             AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)((.+))?$'
           GROUP BY candidate_table;
  28. I f**ckd up. So how do I fix it?
      • Option 1 – requires downtime
        • Dump and restore
      • Dump the data preserving the bad configuration and drop the old database

          bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
          mysql> DROP SCHEMA fosdem;

      • Correct table definitions in the dump file
        • Edit DEFAULT CHARSET in all CREATE TABLE statements
      • Create the database again and import the data back

          mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
          bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
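
The "edit DEFAULT CHARSET in all CREATE TABLE statements" step is usually done with an editor or a one-line substitution. As a hedged illustration, the same edit in Python might look like the sketch below; the function name `retarget_charset` is hypothetical, the sample dump text is made up, and the sketch only rewrites table-level `DEFAULT CHARSET=` clauses, so a real dump with per-column `CHARACTER SET` clauses still needs a manual review.

```python
import re


def retarget_charset(dump_sql, old="latin1", new="utf8"):
    """Rewrite table-level DEFAULT CHARSET=<old> clauses; leave everything else untouched."""
    pattern = re.compile(r"(DEFAULT CHARSET=)" + re.escape(old) + r"\b")
    return pattern.sub(lambda m: m.group(1) + new, dump_sql)


# Made-up fragment in the shape mysqldump produces.
dump = (
    "CREATE TABLE `locations` (\n"
    "  `city` varchar(30) NOT NULL\n"
    ") ENGINE=InnoDB DEFAULT CHARSET=latin1;\n"
    "INSERT INTO `locations` VALUES ('latin1 is just a word here');\n"
)

fixed = retarget_charset(dump)
print(fixed)
```

Anchoring the pattern on `DEFAULT CHARSET=` keeps the substitution away from row data that merely contains the word "latin1".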
  29. I f**ckd up. So how do I fix it?
      • Option 2 – requires downtime
        • Perform a two step conversion with ALTER TABLE
          • Original encoding -> VARBINARY/BLOB -> Target encoding
          • Conversion from/to BINARY/BLOB removes character set context
      • How?
        • Stop applications
        • On each table, for each text column perform:

            ALTER TABLE tbl MODIFY col_name VARBINARY(255);
            ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;

          • You may specify multiple columns per ALTER TABLE
        • Fix the problems (application and/or db configs)
        • Restart applications
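
The reason the detour through a binary type works, while a direct character set conversion does not, can be illustrated in Python. This is a sketch under the assumption that the column holds UTF-8 bytes declared as latin1; Python's `latin-1` codec again stands in for MySQL's latin1, and the variable names are made up.

```python
original = "Kraków"
raw = original.encode("utf-8")        # the bytes actually sitting in the column
mislabeled = raw.decode("latin-1")    # what the server thinks those bytes mean

# Direct conversion (CONVERT TO CHARACTER SET style): each latin1 *character*
# is transcoded to UTF-8, which bakes the damage in as double encoding.
transcoded = mislabeled.encode("utf-8")

# Two-step conversion: drop the character set context (keep the bytes as-is),
# then declare the unchanged bytes to be UTF-8.
as_binary = mislabeled.encode("latin-1")
relabeled = as_binary.decode("utf-8")

print(transcoded)  # longer than raw: double-encoded
print(relabeled)   # the original text
```

The binary step changes only the label, never the bytes, which is exactly what a mislabeled column needs.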
  30. I f**ckd up. So how do I fix it?
      • Option 3 – online character set fix; no downtime*
        • Thanks to our plugin for pt-online-schema-change
          • and a tiny patch for pt-online-schema-change that goes with the plugin
      • How?
        • Start pt-online-schema-change on all tables – one by one
          • Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
        • Wait until all tables have been converted
        • Stop applications
        • Fix the problems (application and/or db configs)
        • Rotate tables – takes just 1 minute
        • Restart applications
        • Et voilà
  31. GOTCHAs!
      • Data space requirements may change during conversion
        • Latin1 uses 1 byte per character, utf8 will need to assume 3 bytes
        • VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
        • Key length limit is 767 bytes
      • Data type and/or index length changes may be required
        • Test and plan this ahead
      • There may be more problems than you think
        • Detect irrecoverable problems with a simple stored function

          CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT)
          RETURNS tinyint(1)
          BEGIN
            RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), "")
                    = IFNULL(CONVERT(`value_after` USING binary), ""));
          END;;
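
The idea behind `cnv_test_conversion` is a byte-level comparison: the old value, taken back to its latin1 bytes, should equal the bytes of the repaired value. A loose Python analog is sketched below. It assumes `value_before` is the text as read from the latin1 column and `value_after` is the repaired UTF-8 text; the function name is hypothetical and this is not a translation of the SQL semantics in every detail.

```python
def conversion_is_lossless(value_before, value_after):
    """Compare the latin1 bytes of the old value with the UTF-8 bytes of the new one."""
    before = value_before or ""   # mirrors IFNULL(..., "")
    after = value_after or ""
    try:
        return before.encode("latin-1") == after.encode("utf-8")
    except UnicodeEncodeError:
        # The old value contains characters latin1 cannot hold: not a clean round trip.
        return False


checks = {
    "clean": conversion_is_lossless("Krak\u00c3\u00b3w", "Kraków"),
    "lost": conversion_is_lossless("Dobrza?ski", "Dobrzański"),
    "nulls": conversion_is_lossless(None, None),
}
print(checks)
```

Rows that fail such a check are the ones where a character was already replaced or dropped before the repair started.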
  32. GOTCHAs!

          master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
          Query OK, 0 rows affected, 1 warning (1.23 sec)
          Records: 0 Duplicates: 0 Warnings: 1

          master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
          *************************** 1. row ***************************
            Level: Warning
             Code: 1071
          Message: Specified key was too long; max key length is 767 bytes
          1 row in set (0.00 sec)

          master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
          *************************** 1. row ***************************
                 Table: test
          Create Table: CREATE TABLE `test` (
            `a` varchar(300) DEFAULT NULL,
            `b` varchar(10) DEFAULT NULL,
            KEY `a` (`a`(255))
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8
          1 row in set (0.00 sec)
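
The `(255)` prefix in the index above follows from simple arithmetic on the 767-byte key limit. The Python sketch below works through it; the constant and function names are made up, and the 64KB figure is treated as a rough upper bound that ignores any per-row or per-column overhead.

```python
MAX_KEY_BYTES = 767          # key length limit quoted on the slide
MAX_TEXT_BYTES = 64 * 1024   # "up to 64KB" for VARCHAR/TEXT, overhead ignored


def max_indexable_chars(bytes_per_char):
    """Longest index prefix, in characters, that fits the key length limit."""
    return MAX_KEY_BYTES // bytes_per_char


latin1_prefix = max_indexable_chars(1)   # 1 byte per character
utf8_prefix = max_indexable_chars(3)     # utf8 has to assume 3 bytes per character

# VARCHAR(300) indexed in full needs 300 * 3 bytes under utf8.
needs_prefix = 300 * 3 > MAX_KEY_BYTES

# Worst-case number of 3-byte characters that fit into the 64KB bound.
max_utf8_chars = MAX_TEXT_BYTES // 3

print(latin1_prefix, utf8_prefix, needs_prefix, max_utf8_chars)
```

This is why a column that indexed fine under latin1 gets its index silently shortened to a 255-character prefix after the switch to utf8.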
  33. How to do it right?
      • Set character-set-server during initial configuration
      • When creating new schemas, always specify the desired charset
        • CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
        • ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
      • When creating new tables, also explicitly specify the charset
        • CREATE TABLE people (…) DEFAULT CHARSET = utf8
      • And don’t forget to configure applications too
      • You can try to force charset on the clients
        • init-connect = "SET NAMES utf8"
        • It might also break applications that don’t want to talk to MySQL using utf8
  34. Oh, and one more thing…
  35. WebScaleSQL
      • We are sharing WebScaleSQL packages with the MySQL Community!
      • Check out http://www.psce.com/blog for details
      • Follow @dbasquare to receive updates

      What is WebScaleSQL?
      WebScaleSQL is a collaboration among engineers from several companies, such as Facebook, Twitter, Google and LinkedIn, that face the same challenges in deploying MySQL at scale and seek greater performance from a database technology tailored to their needs.
