字符集和排序规则简介
MySQL提供了多种字符集和排序规则选择,其中字符集设置和数据存储以及客户端与MySQL实例的交互相关,排序规则和字符串的对比规则相关
字符集的设置可以在MySQL实例、数据库、表、列四个级别
MySQL设置字符集支持在InnoDB, MyISAM, Memory三个存储引擎
查看当前MySQL支持的字符集的方式有两种,一种是通过查看information_schema.character_set系统表,一种是通过命令show character set
查看
mysql> SHOW CHARACTER SET;
+----------+---------------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+---------------------------------+---------------------+--------+
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| binary | Binary pseudo charset | binary | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| cp852 | DOS Central European | cp852_general_ci | 1 |
| cp866 | DOS Russian | cp866_general_ci | 1 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
| macce | Mac Central European | macce_general_ci | 1 |
| macroman | Mac West European | macroman_general_ci | 1 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
| utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 |
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_0900_ai_ci | 4 |
+----------+---------------------------------+---------------------+--------+
41 rows in set (0.00 sec)
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+---------+-----------------------------+-------------------+--------+
4 rows in set (0.00 sec)
每个指定的字符集都会有一个或多个支持的排序规则,可以通过两种方式查看,一种是查看information_schema.collations表,另一种是通过show collation
命令查看
不同的字符集不可能有相同的排序规则
每个字符集都会有一个默认的排序规则
mysql> SHOW COLLATION WHERE Charset = 'latin1';
+-------------------+---------+----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+-------------------+---------+----+---------+----------+---------+---------------+
| latin1_bin | latin1 | 47 | | Yes | 1 | PAD SPACE |
| latin1_danish_ci | latin1 | 15 | | Yes | 1 | PAD SPACE |
| latin1_general_ci | latin1 | 48 | | Yes | 1 | PAD SPACE |
| latin1_general_cs | latin1 | 49 | | Yes | 1 | PAD SPACE |
| latin1_german1_ci | latin1 | 5 | | Yes | 1 | PAD SPACE |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 | PAD SPACE |
| latin1_spanish_ci | latin1 | 94 | | Yes | 1 | PAD SPACE |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 | PAD SPACE |
+-------------------+---------+----+---------+----------+---------+---------------+
8 rows in set (0.01 sec)
上述各行排序规则代表含义:
每个字符集可以对应多个排序规则,但每个排序规则只能对应一个字符集,例如下面的对应关系会报错:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid for CHARACTER SET 'latin1'
排序规则命名规则
排序规则的命令通常是以对应的字符集的名字为开头,并以自己的特定属性结尾,比如排序规则utf8_general_ci和latin1_swedish_ci就分别是对应utf8和latin1字符集的排序规则
当排序规则特指某种语言时,则中间的部分就为这种语言的名字,比如utf8_turkish_ci和utf8_hungarian_ci就代表UTF8字符集中的土耳其语和匈牙利语
排序规则名字的结尾字符代表是否大小写敏感,重音敏感以及是否是二进制的
当排序规则名字中没有指定_as或者_ai时,则是否口音敏感是由_ci或者_cs决定,当使用的是_ci是则暗指_ai,反之则暗指_as
比如latin1_general_ci显示指定大小写不敏感,也暗指口音不敏感,而latin1_general_cs显示指定大小写敏感,也暗指口音敏感
对Unicode的字符集来说,对应的排序规则也可能会包含unicode排序算法的版本号,如果没有这个版本号显示则表示是基于4.0.0这个版本,比如utf8_unicode_520_ci和utf8_unicode_ci
字符集和排序规则举例
mysql8默认情况下的字符集是utf8mb4
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.13 |
+-----------+
1 row in set (0.00 sec)
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)
在Latin1下创建的表默认字符集也是utf8mb4
mysql> create table temp(name varchar(100),name2 varchar(100));
Query OK, 0 rows affected (0.14 sec)
mysql> show create table temp;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| temp | CREATE TABLE `temp` (
`name` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
`name2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
ASCII码(American Standard Code for Information Interchange,美国信息交换标准代码)是基于拉丁字母的一套电脑编码系统。它主要用于显示现代英语
ASCII的局限在于只能显示26个基本拉丁字母、阿拉伯数目字和英式标点符号,因此只能用于显示现代美国英语(而且在处理英语当中的外来词如naïve、café、élite等等时,所有重音符号都不得不去掉,即使这样做会违反拼写规则)
Latin字符集:ISO 8859-1,正式编号为ISO/IEC 8859-1:1998,又称Latin-1或“西欧语言”,是国际标准化组织内ISO/IEC 8859的第一个8位字符集。它以ASCII为基础,在空置的0xA0-0xFF的范围内,加入96个字母及符号,藉以供使用附加符号的拉丁字母语言使用
在latin字符集的表中插入英文和中文字符串后
mysql> create database test;
Query OK, 1 row affected (0.01 sec)
mysql> use test
Database changed
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)
mysql> set names latin1;
Query OK, 0 rows affected (0.01 sec)
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)
CREATE TABLE `temp` (
`name` varchar(100) DEFAULT NULL,
`name2` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.10 sec)
mysql> insert into temp values('china','中国');
Query OK, 1 row affected (0.04 sec)
mysql> select * from temp;
+-------+--------+
| name | name2 |
+-------+--------+
| china | 中国 |
+-------+--------+
1 row in set (0.00 sec)
Latin1是一种很常见的字符集,这种字符集是单字节编码,向下兼容ASCII,其编码范围是0x00-0xFF,0x00-0x7F之间完全和ASCII一致,0x80-0x9F之间是控制字符,0xA0-0xFF之间是文字符号。很明显,Latin1覆盖了所有的单字节,因此,可以将任意字符串保存在latin1字符集中,而不用担心有内容不符合latin1的编码规范而被抛弃。——gbk和utf8是多字节编码,没有这种特性。
mysql使用者经常利用Latin1的这种全覆盖特性,将其它类型的字符串,gbk,utf8,big5等,保存在latin1列中。保存的过程中没有数据丢失,只要原样取出来,便又是合法的gbk/utf8/big字符串。如果将gbk字符串保存在utf8列中,则gbk字符串中那些不符合utf8编码格式的内容,会被抛弃,保存的内容无法原样取出,数据实际上遭到了破坏。
综上,如果我们看到一个字段的字符集是latin1的,那么,他保存的可能是任何编码的字符串
所谓单字节编码就是挨着一个个来,比如圣诞节到了,你要送妹子一箱苹果,为制造浪漫,商铺提供两种包装方式,一是按个数来,即单个苹果包装进一个盒子,来一个包装一个,这样,妹子在拆完所有的盒子后完完整整的可以还原为一个个完整的和一箱完好无损的苹果,二是按重量来,每个盒子限重2两、3两、6两,这样在包装时,若刚好重3两的当然可以完整的放进一个盒子,但是若不够或者多了,勉不了要切开苹果,或者再往盒子中添加其他的部分苹果,这样的话,妹子再无论怎样拆开盒子,都会得到一箱残缺不堪的苹果了,因为你在按照这种包装方式进行时,已经破坏了单个苹果的完整性,现在还原不回来了~我们的字符集编码转换就是在做这种重新包装的工作,latin1恰好就像单个苹果包装,而utf8就像第二种方式。
mysql> set names utf8;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)
mysql> insert into temp values('china','中国');
ERROR 1366 (HY000): Incorrect string value: '\xE4\xB8\xAD\xE5\x9B\xBD' for column 'name2' at row 1
mysql> select * from temp;
+-------+---------------+
| name | name2 |
+-------+---------------+
| china | ä¸å›½ |
+-------+---------------+
1 row in set (0.00 sec)
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from temp;
+-------+--------+
| name | name2 |
+-------+--------+
| china | 中国 |
+-------+--------+
1 row in set (0.00 sec)
区分到小写与不区分大小写
mysql> create table temp1(name varchar(10) charset latin1 collate
-> latin1_general_ci,name2 varchar(10) charset latin1 collate
-> latin1_general_cs);
Query OK, 0 rows affected (0.07 sec)
mysql> insert into temp1 values('a','a'),('A','A');
Query OK, 2 rows affected (0.07 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> select count(distinct name),count(distinct name2) from temp1;
+----------------------+-----------------------+
| count(distinct name) | count(distinct name2) |
+----------------------+-----------------------+
| 1 | 2 |
+----------------------+-----------------------+
1 row in set (0.00 sec)
实例级别字符集和排序规则(以5.7做例子)
实例级别的字符集可以在实例启动的时候指定,默认的字符集是latin1,可以通过--character-set-server
参数指定实例的字符集,通过--collation-server
参数指定实例的排序规则
当仅指定了字符集没有指定排序规则时,则默认的排序规则是该字符集的默认排序规则
以下三个启动实例的方式的设置效果是一样的:
shell> mysqld
shell> mysqld --character-set-server=latin1
shell> mysqld --character-set-server=latin1 --collationserver=latin1_swedish_c
指定实例级别的字符集的作用是当create database语句中没有指定字符集,则创建的数据库默认是该字符集和排序规则
可以通过MySQL系统变量查看实例的字符集和排序规则,变量名分别是character_set_server和collation_server
mysql> show variables like 'character_set_server';
+----------------------+--------+
| Variable_name | Value |
+----------------------+--------+
| character_set_server | latin1 |
+----------------------+--------+
1 row in set (0.00 sec)
mysql> show variables like 'collation_server';
+------------------+-------------------+
| Variable_name | Value |
+------------------+-------------------+
| collation_server | latin1_swedish_ci |
+------------------+-------------------+
1 row in set (0.00 sec)
mysql> create database course2;
Query OK, 1 row affected (0.00 sec)
mysql> show create database course2;
+----------+--------------------------------------------------------------------+
| Database | Create Database |
+----------+--------------------------------------------------------------------+
| course2 | CREATE DATABASE `course2` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+--------------------------------------------------------------------+
root@ip-172-31-18-152:/usr/local/mysql/bin# cat /etc/my.cnf
[mysqld]
……
port=3307
character_set_server=utf8
root@ip-172-31-18-152:/usr/local/mysql/bin# /etc/init.d/mysql.server restart
[ ok ] Restarting mysql.server (via systemctl): mysql.server.service.
mysql> create database course3;
Query OK, 1 row affected (0.00 sec)
mysql> show create database course3;
+----------+------------------------------------------------------------------+
| Database | Create Database |
+----------+------------------------------------------------------------------+
| course3 | CREATE DATABASE `course3` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+------------------------------------------------------------------+
default-character-set是MySQL老版本中的“character_set_server”,在新版本中该参数改成了character_set_server,如果在5.7版本中继续使用老的参数会导致数据库无法启动
2017-05-26T10:46:49.151800Z 0 [ERROR] unknown variable 'defaultcharacter-set=utf8'
2017-05-26T10:46:49.151811Z 0 [ERROR] Aborting
数据库级别字符集和排序规则
实例上的每个数据库都有一个字符集和排序规则,我们可以通过create database
和alter database
语句来指定和修改
数据库的属性值会存放在数据库所在文件夹下的db.opt文件里
通过create/alter database语句里指定字符集和排序规则,可以创建不同于实例级别的字符集和排序规则,而且每个数据库都可以创建成不一样
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
。同样,当仅指定了字符集而没有指定排序规则时,则会使用该字符集的默认排序规则;当仅指定了排序规则而没有字符集时,则在该排序规则名称上含有的字符集会被使用,当数据库创建时没有指定这两项,则使用实例级别的字符集和排序规则
alter database语句修改的字符集仅对数据库中后续创建的表产生作用,不会修改已经创建表的字符集
mysql> show create table temp1;
+-------+------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------------------------+
| temp1 | CREATE TABLE `temp1` (
`id` int(11) DEFAULT NULL,
`name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
mysql> alter database course2 character set utf8;
mysql> show create table temp1;
+-------+------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------------------------+
| temp1 | CREATE TABLE `temp1` (
`id` int(11) DEFAULT NULL,
`name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
mysql> create table temp2(id int,name varchar(10));
mysql> show create table temp2;
+-------+----------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------------------------------------------------+
| temp2 | CREATE TABLE `temp2` (
`id` int(11) DEFAULT NULL,
`name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
表级别字符集和排序规则
每个表都有自己的字符集和排序规则,可以通过create table和altertable语句指定或修改表的字符集和排序规则
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;
当创建和修改表时没有指定排序规则时会使用字符集默认的排序规则;当创建和修改表时没有指定字符集,则使用排序规则对应的字符集;当两个属性都没有指定时,会使用数据库级别的字符集和排序规则
mysql> create table temp1(id int,name varchar(100)) character set utf8 collate utf8_general_ci;
mysql> show create database course;
+----------+-------------------------------------------------------------------+
| Database | Create Database |
+----------+-------------------------------------------------------------------+
| course | CREATE DATABASE `course` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> create table temp2(id int,name varchar(100));
Query OK, 0 rows affected (0.02 sec)
mysql> show create table temp2;
+-------+--------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+--------------------------------------------------------------------------------------------------------------+
| temp2 | CREATE TABLE `temp2` (
`id` int(11) DEFAULT NULL,
`name` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
alter table表的字符集不会改变已经存在的字段的字符集和字段里的数据
mysql> show create table temp2;
+-------+----------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+----------------------------------------------------------------------------------------------------------+
| temp2 | CREATE TABLE `temp2` (
`id` int(11) DEFAULT NULL,
`name` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
mysql> insert into temp2 values(1,'中国');
mysql> select * from temp2;
+------+--------+
| id | name |
+------+--------+
| 1 | 中国 |
mysql> alter table temp2 character set utf8;
mysql> select * from temp2;
+------+--------+
| id | name |
+------+--------+
| 1 | 中国 |
mysql> show create table temp2;
+-------+------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------+
| temp2 | CREATE TABLE `temp2` (
`id` int(11) DEFAULT NULL,
`name` varchar(100) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
列级别字符集和排序规则
每个字符串字段(char, varchar, text)都有自己的字符集和排序规则,可以通过create table和alter table指定和修改字段的字符集和排序规则
CREATE TABLE t1 ( col1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci );
ALTER TABLE t1 MODIFY col1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_swedish_ci;
同样,当字段上只指定了字符集时,排序规则会使用该字符集对应的默认排序规则;当字段上只指定了排序规则时,则字符集会使用对应的字符集;当字段上两者都没有定义时,则使用表级别的字符集和排序规则CREATE TABLE t1 ( col1 CHAR(10) CHARACTER SET utf8 ) CHARACTER SET latin1 COLLATE latin1_bin;
如果通过alter table语句修改某一列的字符集时,MySQL会试图转换其中已有的数据,这样的转换有可能会导致数据丢失
mysql> insert into temp2 values('中国');
Query OK, 1 row affected (0.02 sec)
mysql> select * from temp2;
+--------+
| name |
+--------+
| 中国 |
mysql> alter table temp2 modify name varchar(100) character set greek;
Query OK, 1 row affected, 1 warning (0.03 sec)
Records: 1 Duplicates: 0 Warnings: 1
mysql> select * from temp2;
+--------+
| name |
+--------+
| ???? |
字符串的字符集和排序规则
对于在SQL语句中出现的字符串,也会有自己的字符集和排序规则,通常是由character_set_connection和collation_connection两个系统参数决定的;也可以显示指定字符串的字符集和排序规则
[_charset_name]'string' [COLLATE collation_name]
SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8'abc' COLLATE utf8_danish_ci;
mysql> select case when 'a'='A' then 1 else 0 end;
+-------------------------------------+
| case when 'a'='A' then 1 else 0 end |
+-------------------------------------+
| 1 |
mysql> set collation_connection=latin1_general_cs;
Query OK, 0 rows affected (0.00 sec)
mysql> select case when 'a'='A' then 1 else 0 end;
+-------------------------------------+
| case when 'a'='A' then 1 else 0 end |
+-------------------------------------+
| 0 |
mysql> select case when 'a'=_latin1 'A' collate latin1_general_cs then 1 else 0 end;
+-----------------------------------------------------------------------+
| case when 'a'=_latin1 'A' collate latin1_general_cs then 1 else 0 end |
+-----------------------------------------------------------------------+
| 0 |
当charset和collation都指定时,使用这两者作为字符串的字符集和排序规则;当仅指定charset时,则排序规则使用该字符集对应的默认排序规则;当仅指定排序规则时,则使用character_set_connection参数对应的字符集,且必须保证排序规则是字符集允许的;当两者都没有指定时,则使用character_set_connection和collation_connection两个参数指定的字符集和排序规则;比如SELECT _utf8'Müller';会使用utf8字符集和默认的utf8_general_ci排序规则;比如SELECT 'Müller' COLLATE utf8_general_ci;会使用系统连接的字符集,但如果字符集不是utf8就会报错
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'Müller' COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
国家字符集
标准SQL中会有两种预定义的字符集数据类型,分别是nchar和nvarchar。 在MySQL中这种预定义的字符集就是UTF8,所以以下定义的最终结果是相同的:
CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)
以下定义的最终结果也是相同:
VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NVARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)
mysql> create table temp(name nchar(10),name2 nvarchar(10),name3 varchar(10));
Query OK, 0 rows affected (0.02 sec)
mysql> show create table temp;
+-------+-----------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-----------------------------------------------------------------------------------------+
| temp | CREATE TABLE `temp` (
`name` char(10) CHARACTER SET utf8 DEFAULT NULL,
`name2` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
`name3` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
对字符串来说,可以使用N'字符串'的方式表明使用国家字符集,比如以下三种表达方式的结果是相同的:
SELECT N'some text';
SELECT n'some text';
SELECT _utf8'some text';
字符集和排序规则举例
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
字段c1上的字符集和排序规则都已经显示定义,所以会直接使用字段显示指定的
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
字段c1上仅定义了字符集没有排序规则,则会使用latin1的默认排序规则,就是latin1_swedish_ci 而不是latin1_danish_ci
mysql> create table temp(
name varchar(10) character set latin1 collate latin1_general_ci,
name2 varchar(10) character set latin1 collate latin1_general_cs);
mysql> insert into temp values('a','a'),('A','A'),('b','b'),('B','B');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> select * from temp order by name;#默认升序
+------+-------+
| name | name2 |
+------+-------+
| a | a |
| A | A |
| b | b |
| B | B |
+------+-------+
4 rows in set (0.00 sec)
mysql> select * from temp order by name2;
+------+-------+
| name | name2 |
+------+-------+
| A | A |
| a | a |
| B | B |
| b | b |
+------+-------+
CREATE TABLE t1 ( c1 CHAR(10) ) DEFAULT CHARACTER SETlatin1 COLLATE latin1_danish_ci;
字段c1没有定义字符集和排序规则,则会使用表级的字符集和排序规则
CREATE DATABASE d1 DEFAULT CHARACTER SET latin2COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1 ( c1 CHAR(10) );
在字段和表级都没有明确指定使用哪种字符集和排序规则,则会沿用数据库的字符集和排序规则
连接级字符集和排序规则
每个数据库客户端连接都有自己的字符集和排序规则属性,客户端发送的语句的字符集是由character_set_client决定,而与服务端交互时会根据character_set_connection和collation_connection两个参数将接收到的语句转化。当涉及到显示字符串的比较时,由collation_connection参数决定,而当比较的是字段里的字符串时则根据字段本身的排序规则决定
character_set_result参数决定了语句的执行结果以什么字符集返回给客户端
客户端可以很方便的调整字符集和排序规则,比如使用SET NAMES 'charset_name' [COLLATE 'collation_name']表明后续的语句都以该字符集格式传送给服务端,而执行结果也以此字符集格式返回
比如一个set names语句相当于执行了以下三行语句:
SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = charset_name;
或者执行SET CHARACTER SET 'charset_name'
命令,此命令和set names
非常类似,唯一不同是将connection的字符集设置为当前数据库的字符集,所以相当于执行以下三行语句:
SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = @@character_set_database;
mysql> set names latin1 collate latin1_general_cs;
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like ‘%collation_connection%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_general_cs |
3 rows in set (0.01 sec)
mysql> select case when 'a'='A' then 1 else 0 end;
+-------------------------------------+
| case when 'a'='A' then 1 else 0 end |
+-------------------------------------+
| 0 |
+-------------------------------------+
应用程序配置字符集和排序规则
当应用程序使用MySQL默认的字符集和排序规则时,则应用程序端不需要特别的设置,而当不是使用默认属性时则需要特别设置。比如我们可以灵活的的为每一个数据库设置不同的字符集和排序规则,对应用程序来说就可以通过set names命令与所操作的数据库保持一致;在服务器端我们可以通过设置--character_set_server和--collation_server两个参数来指定实例的默认字符集和排序规则,并将此作为每个数据库和表的默认字符集和排序规则,比如:
[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci
而在应用程序端,也需要通过set names命令保证每个数据库链接和数据库保持一致
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.00 sec)
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
我们也可以通过参数 --init_connect="SET NAMES 'utf8'" 来让每个客户端连接都自动设置字符集,但缺点是对拥有super权限的用户不生效;init_connect表示服务器为每个连接的客户端执行的字符串。字符串由一个或多个SQL语句组成。要想指定多个语句,用分号间隔开;
SET GLOBAL init_connect='SET AUTOCOMMIT=0;set names utf8’;
# 配置
[mysqld]
init_connect='SET AUTOCOMMIT=0;set names latin1’'
mysql> grant select on course to cdq@'localhost' identified by 'cdq';
root@ip-172-31-18-152:/usr/local/mysql/bin# ./mysql -u cdq -p
mysql> show variables like '%character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.01 sec)
mysql> show variables like '%autocommit%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | OFF |
排序规则在SQL中的使用
在SQL语句中指定排序规则可以覆盖之前定义的默认排序规则
- 和order by语句使用:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
- 和as语句使用:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
- 和group by语句使用:
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
- 和聚合函数使用:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
- 和distinct一起使用:
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
mysql> select * from temp;
+------+-------+
| name | name2 |
+------+-------+
| a | a |
| A | A |
| b | b |
| B | B |
+------+-------+
mysql> select count(distinct name2), count(distinct name2 collate latin1_general_ci) from temp;
+-----------------------+-------------------------------------------------+
| count(distinct name2) | count(distinct name2 collate latin1_general_ci) |
+-----------------------+-------------------------------------------------+
| 4 | 2 |
+-----------------------+-------------------------------------------------+
- 在where条件中使用:
SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
- 在having中使用:
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
在大部分的语句中,使用何种排序规则是明确的,比如如下几个语句就是使用X列上的排序规则:
SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;
但也有复杂的情况,比如当列和字符串都有各自的排序规则时:
SELECT x FROM T WHERE x = 'Y';
Concat(X,'Y')
首先计算每种情况的排序规则权重:
- 当有显示的写明优先级的,则权重为0
- 当两个有不同排序规则的字符串连接在一起的,则权重为1
- 字段和本地参数的排序规则,权重为2
- 由部分字符串函数返回的系统常量所带的排序规则,权重为3
- 字符串自带的排序规则,权重为4
......
在最终选择使用哪种排序规则上,规则为:
- 优先使用权重最低的
- 如果两者拥有相同的优先级,则
1)如果两者都是Unicode,或者都不是Unicode,则返回错误
2)如果一边是Unicode而另一边不是,则将不是Unicode的一边转化成Unicode字符集,比如:SELECT CONCAT(utf8_column, latin1_column) FROM t1;
latin1字符集的列会自动转化为utf8的字符集,再执行字符串连接
3)当两边都是相同字符集,且一边是_bin而另一边是_ci/_cs的排序规则时,则使用_bin的排序规则
通过命令coercibility可以查看权重,比如:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0
mysql> SELECT COERCIBILITY(VERSION());
-> 3
mysql> SELECT COERCIBILITY('A');
-> 4
mysql> SELECT COERCIBILITY(1000);
-> 5
mysql> show create table temp;
+-------+------------------------------------------------------------+
| Table | Create Table |
+--------------------------------------------------------------------+
| temp | CREATE TABLE `temp` (
`name` varchar(10) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL,
`name2` varchar(10) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
mysql> select * from temp where name='A';
+------+-------+
| name | name2 |
+------+-------+
| a | a |
| A | A |
mysql> select * from temp where name='A' collate latin1_general_cs;
+------+-------+
| name | name2 |
+------+-------+
| A | A |
mysql> select * from temp where name collate latin1_general_ci='A' collate latin1_general_cs;
ERROR 1267 (HY000): Illegal mix of collations (latin1_general_ci,EXPLICIT) and (latin1_general_cs,EXPLICIT) for operation '='
排序规则使用举例
德语中的元音情况
比如表T的X列有这样的值
Muffler
Müller
MX Systems
MySQL
当执行如下语句时:SELECT X FROM T ORDER BY X COLLATE collation_name;
我们不能比较两个属于不同校对的不同字符值。
CREATE TABLE `temp` (
`name` varchar(10) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL,
`name2` varchar(10) CHARACTER SET latin1 COLLATE latin1_general_cs DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
mysql> select * from temp;
+------+-------+
| name | name2 |
+------+-------+
| a | a |
| A | A |
| b | b |
| B | B |
+------+-------+
4 rows in set (0.00 sec)
mysql> select * from temp where name=name2;
ERROR 1267 (HY000): Illegal mix of collations (latin1_general_ci,IMPLICIT) and
(latin1_general_cs,IMPLICIT) for operation '='
mysql> select _utf8 '语言 Language 言語 язык';
+---------------------------------+
| 语言 Language 言語 язык |
+---------------------------------+
| 语言 Language 言語 язык |
+---------------------------------+
1 row in set (0.00 sec)
mysql> select _latin1 '语言 Language 言語 язык';
+--------------------------------------------------------+
| è¯è¨€ Language 言語 Ñзык |
+--------------------------------------------------------+
| è¯è¨€ Language 言語 Ñзык |
+--------------------------------------------------------+
1 row in set (0.00 sec)
Unicode字符集
Unicode(Universal Code)是一种在计算机上使用的字符编码。Unicode 是为了解决传统的字符编码方案的局限而产生的,它为每种语言中的每个字符设定了统一并且唯一的二进制编码,以满足跨语言、跨平台进行文本转换、处理的要求。Unicode存在不同的编码方案,包括Utf-8,Utf-16和Utf-32。Utf表示Unicode Transformation Format。
其中:
Charset:字符集名字。
Description:对每个字符集的简短描述。
Default collation:每个字符集的默认校对。
Maxlen:每个字符集所保留的最大字节数
MySQL支持的Unicode字符集包括以下几种:
BMP(BMP-Basic Multilingual Plane):基本多语言范围。如果将BMP之外的扩展字符转换成utf8等仅支持BMP的字符集时,不识别的字符会转化成?号;除UTF8外,客户端字符集不支持设置为其他的Unicode字符集, 所以set names和set character set命令在Unicode要求下仅支持UTF8
UTF8字符集
UTF8(Unicode Transformation Format with 8-bit units)是一种存放Unicode数据的编码规则。
- 基础拉丁字母,数字,标点符号会占用一个字节
- 扩展的拉丁字符,希腊语,斯拉夫语,阿拉伯语等会占用两个字节
- 韩语,中文,日语字符会占用3个或4个字节
在MySQL中,UTF8字符集不支持扩展字符,且仅占用最多3个字节
CREATE TABLE `temp1` (
`name` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
mysql> insert into temp1 values(12),('ab'),('中国'),('a中'),('中a');
mysql> select length(name),char_length(name) from temp1;
+--------------+-------------------+
| length(name) | char_length(name) |
+--------------+-------------------+
| 2 | 2 |
| 2 | 2 |
| 6 | 2 |
| 4 | 2 |
| 4 | 2 |
+--------------+-------------------+
大多数的Unicode字符集都会有一个一般意义上的排序规则,比如以_general结尾命名,以_bin命名的二进制排序规则,当然也有指定特定语言的排序规则
比如在“Classical Latin”的排序规则中, I和J的对比结果是相等的, 而U和V的对比结果也是相等的
不同字符集下字符空间消耗
其中M在非二进制字段中代表定义的字符长度数值,L代表真实的字符所占用的字节数。当定义varchar字段时,最大的长度为65553个字节,且这个最大长度是整行数据共享的限制,当存储比如Latin字符集时,每个字节存储一个字符,而当在多字节字符集比如UTF8时,由于每个字符最大要求3个字节的存储,所以字段的最大创建长度就是21844个字符
mysql> create table temp2(name varchar(21845)) charset=utf8;
ERROR 1118 (42000): Row size too large. The maximum row size forthe used table type, not counting BLOBs, is 65535. This
includesstorage overhead, check the manual. You have to change some columns to TEXT or BLOBs
中文字符集和排序规则
big5 (Big5 繁体中文) collations:
big5_bin
big5_chinese_ci (default)
gb2312 (GB2312 简体中文) collations:
gb2312_bin
gb2312_chinese_ci (default)
gbk (GBK简体总问) collations:
gbk_bin
gbk_chinese_ci (default)
gb18030 (中国国家标准GB18030) collations:
gb18030_bin:二进制排序规则
gb18030_chinese_ci (default):若是字母就按照字母排序,如果是汉字则按照拼音排序
gb18030_unicode_520_c:Unicode排序规则
gb18030是中国官方推出的中文字符集标准
Set names utf8;
CREATE TABLE `colum_charset` (
`c1` char(10) CHARACTER SET utf8 NOT NULL,
`c2` char(10) CHARACTER SET gbk DEFAULT NULL,
`c3` varchar(10) CHARACTER SET gb2312 DEFAULT NULL,
`c4` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
mysql> insert into colum_charset values("测试","测试","测试","测试");
mysql> select * from colum_charset;
+--------+--------+--------+------+
| c1 | c2 | c3 | c4 |
+--------+--------+--------+------+
| 测试 | 测试 | 测试 | ?? |
+--------+--------+--------+------+
MySQL字符集参数
查看当前数据库字符集
mysql> show variables like 'character%';
+--------------------------+----------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/charsets/ |
+--------------------------+----------------------------------+
8 rows in set (0.01 sec)
查看当前数据库字符集校对
mysql> show variables like 'collation%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8mb4_general_ci |
| collation_server | utf8mb4_general_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)
在my.cnf或my.ini中指定
[mysqld]
character_set_server=utf8
影响参数:character_set_server 和 character_set_database
注意:修改后要重启数据库才能生效。
[client]
default-character-set=latin1
影响参数:character_set_client,character_set_connection 和character_set_results。
注意:修改后无需重启数据库。
在启动参数前指定 ./mysqld --character-set-server=utf8 &
影响参数:character_set_server 和 character_set_database
临时指定
mysql> SET character_set_client = utf8;
mysql> SET character_set_connection = utf8;
mysql> SET character_set_database = utf8;
mysql> SET character_set_results = utf8;
mysql> SET character_set_server = utf8;
#其中character_set_client,character_set_connection 和character_set_results,可通过一句话指定
mysql> SET NAMES 'utf8';
MySQL字符集转化
a. MySQL Server收到请求时将请求数据从 character_set_client 转换为 character_set_connection;
b. 进行内部操作前将请求数据从character_set_connection转换为内部操作字符集,其确定方法如下:
使用每个数据字段的CHARACTER SET设定值;
若上述值不存在,则使用对应数据表的DEFAULT CHARACTER SET设定值(MySQL扩展,非SQL标准);
若上述值不存在,则使用对应数据库的DEFAULT CHARACTER SET设定值;
若上述值不存在,则使用character_set_server设定值。
c.将操作结果从内部操作字符集转换为character_set_results
MySQL字符集乱码
产生乱码的根本原因在于:
a. 客户机没有正确地设置client字符集,导致原先的SQL语句被转换成connection所指字符集,而这种转换,是会丢失信息的,如果client是utf8格式,那么如果转换成gb2312格式,这其中必定会丢失信息,反之则不会丢失。一定要保证connection的字符集大于client字符集才能保证转换不丢失信息。
b. 数据库字体没有设置正确,如果数据库字体设置不正确,那么connection字符集转换成database字符集照样丢失编码,原因跟上面一样。
比如向默认字符集为utf8的数据表插入utf8编码的数据前连接字符集设置为latin1,查询时设置连接字符集为utf8
插入时根据MySQL服务器的默认设置,character_set_client、character_set_connection和character_set_results均为latin1;
插入操作的数据将经过latin1=>latin1=>utf8的字符集转换过程,这一过程中每个插入的汉字都会从原始的3个字节变成6个字节保存;
查询时的结果将经过utf8=>utf8的字符集转换过程,将保存的6个字节原封不动返回,产生乱码
mysql> set names latin1;
mysql> create table temp(name varchar(10)) charset utf8;
mysql> insert into temp values('中国');
mysql> select * from temp;
+--------+
| name |
+--------+
| 中国 |
+--------+
mysql> set names utf8;
mysql> select * from temp;
+---------------+
| name |
+---------------+
| ?????? |
+---------------+
注意:存储字符集编码比插入时字符集大时,如果原封不动返回数据会出现乱码,不过可通过修改查询字符集,避免乱码,即不会丢失数据
又比如向默认字符集为latin1的数据表插入utf8编码的数据前设置了连接字符集为utf8
插入时根据连接字符集设置,character_set_client、character_set_connection和character_set_results均为utf8;
插入数据将经过utf8=>utf8=>latin1的字符集转换,若原始数据中含有\u0000~\u00ff范围以外的Unicode字 符,会因为无法在latin1字符集中表示而被转换为“?”(0×3F)符号,以后查询时不管连接字符集设置如何都无法恢复其内容了
mysql> set names utf8;
mysql> create table temp(name varchar(10)) charset latin1;
mysql> insert into temp values('中国');
mysql> select * from temp;
+------+
| name |
+------+
| ?? |
+------+
mysql> set names latin1;
mysql> select * from temp;
+------+
| name |
+------+
| ?? |
+------+
数据不完整了,且无法恢复。
乱码终极解决方案
- 首先要明确你的客户端是何种编码格式,这是最重要的(IE6一般用utf8,命令行一般是gbk,一般程序是gb2312)
- 确保你的数据库使用utf8格式,很简单,所有编码通吃。
- 一定要保证connection字符集大于等于client字符集,不然就会信息丢失,比如:latin1 < gb2312 < gbk < utf8,若设置set character_set_client = gb2312,那么至少connection的字符集要大于等于gb2312,否则就会丢失信息
- 以上三步做正确的话,那么所有中文都被正确地转换成utf8格式存储进了数据库,为了适应不同的浏览器,不同的客户端,你可以修改character_set_results来以不同的编码显示中文字体,由于utf8是大方向,因此web应用还是倾向于使用utf8格式显示中文的。