两个不同的字符集不能有相同的排序规则。且每个字符集都有一个默认的排序规则。
本篇大部分内容出自官网文档...
查看当前支持的字符集
INFORMATION_SCHEMA.CHARACTER_SETS
INFORMATION_SCHEMA.CHARACTER_SETS 表中可以查看可用的字符集。
MAXLEN 列表示:The maximum number of bytes required to store one character.
MariaDB [(none)]> select * from INFORMATION_SCHEMA.CHARACTER_SETS;
+--------------------+----------------------+-----------------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN |
+--------------------+----------------------+-----------------------------+--------+
| big5 | big5_chinese_ci | Big5 Traditional Chinese | 2 |
| dec8 | dec8_swedish_ci | DEC West European | 1 |
| cp850 | cp850_general_ci | DOS West European | 1 |
| hp8 | hp8_english_ci | HP West European | 1 |
| koi8r | koi8r_general_ci | KOI8-R Relcom Russian | 1 |
| latin1 | latin1_swedish_ci | cp1252 West European | 1 |
| latin2 | latin2_general_ci | ISO 8859-2 Central European | 1 |
...
...
或者查看包含 utf8 的字符集,可以使用LIKE子句或WHERE子句。
MariaDB [(none)]> SELECT * FROM INFORMATION_SCHEMA.CHARACTER_SETS
-> WHERE CHARACTER_SET_NAME LIKE '%utf8%';
+--------------------+----------------------+---------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN |
+--------------------+----------------------+---------------+--------+
| utf8 | utf8_general_ci | UTF-8 Unicode | 3 |
| utf8mb4 | utf8mb4_general_ci | UTF-8 Unicode | 4 |
+--------------------+----------------------+---------------+--------+
2 rows in set (0.000 sec)
SHOW CHARACTER SET
相对于
INFORMATION_SCHEMA.CHARACTER_SETS 表,SHOW CHARACTER SET 语句更常用于查看支持的字符集。
MariaDB [(none)]> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
...
...
或者查看包含 utf8 的字符集,可以使用LIKE子句或WHERE子句。
MariaDB [(none)]> SHOW CHARACTER SET LIKE '%utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+--------------------+--------+
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.005 sec)
查看当前支持的排序规则Collation
一个给定的字符集至少有一个排序规则,大多数字符集有多个。
使用
INFORMATION_SCHEMA.COLLATIONS 表或 SHOW COLLATION 语句,可以显示一个字符集的排序规则。
SHOW COLLATION默认显示所有的排序规则,使用WHERE、LIKE等子句可以过滤显示指定字符集的排序规则。
查看默认字符集 utf8mb4 的排序规则。
MariaDB [(none)]> SHOW COLLATION WHERE Charset = 'utf8mb4';
+------------------------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+------------------------------+---------+------+---------+----------+---------+
| utf8mb4_general_ci | utf8mb4 | 45 | Yes | Yes | 1 |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 |
| utf8mb4_icelandic_ci | utf8mb4 | 225 | | Yes | 8 |
...
...
...
| utf8mb4_unicode_nopad_ci | utf8mb4 | 1248 | | Yes | 8 |
| utf8mb4_unicode_520_nopad_ci | utf8mb4 | 1270 | | Yes | 8 |
+------------------------------+---------+------+---------+----------+---------+
33 rows in set (0.008 sec)
[为什么推荐使用utf8mb4?]MySQL中的utf8和utf8mb4字符集
真正支持存储Unicode字符
MySQL/MariaDB 的 utf8 编码并不是真正的UTF-8编码。在MySQL/MariaDB中,utf8最多只支持3个字节,上面的查询中已经可以看到。
而简体中文在 utf8 编码中,一般占3个字节,使用扩展区的中文会占用4个字节。UTF-8编码是可变长编码,英文等字母字符占1个字节,其他复杂字符会占用2-4个字节。
但是目前常使用的emoji表情、繁体字、特殊字符、不常见汉字等,会占用4个字节。这就导致,
如果在 MySQL/MariaDB 中使用 utf8 编码,将无法存储4个字节的汉字、emoji等表情。MySQL/MariaDB 的 utf8 编码占用最多3个字节是由于历史原因造成的。
utf8mb4才是真正的UTF-8编码,可以最多支持4个字节。
所以实际开发中,推荐使用 utf8mb4 编码字符集,这是真正的Unicode编码。尤其是在用到中文、emoji等字符的环境中。否则就会遇到字节数不正确,报错不正确的字符值(Incorrect string value)
MySQL 8.0开始,默认的字符编码已经改为utf8mb4,MariaDB 10.0仍为latin1。
UTF-8编码是U+2528D,属于CJK Unified Ideographs Extension B(中日韩统一表意文字扩充B)字符集的字符,处于第二辅助平面(SIP,表意文字补充平面),最多支持4个字节。
而Mysql的utf8编码则属于常见的基本多文种平面(BMP,即Unicode编码范围在0000-FFFF之内)的字符,最多支持3个字节。
关于utf8mb4_bin排序规则
使用utf8mb4字符集时,通常推荐使用utf8mb4_unicode_ci(或utf8mb4_general_ci)排序规则。
但,如果想要区分区大小写且同时区分重音,推荐使用utf8mb4_bin。一般区分大小写的也都区分重音。utf8_unicode_cs并不是很推荐!
指定字符集和排序规则
修改MySQL服务器的排序规则
Linux系统下,修改 /etc/my.cnf 或 /etc/my.cnf.d/server.cnf 文件;Windows系统下,修改MySQL/Mariadb安装目录下data/my.ini文件,
在对应[mysql]、[mysqld]下设置:
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
设置完后重启(restart)MySQL/MariaDB。
还可以使用命令行,设置字符集和排序规则:
mysqld --character-set-server=utf8 --collation-server=utf8_unicode_cs
创建数据库时指定
创建数据库时指定字符集和排序规则的语法如下:
CREATE DATABASE IF NOT EXISTS database_name
DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;
创建并查看一个数据库的排序规则和字符集:
CREATE DATABASE IF NOT EXISTS CHARSETTest
DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- 查看数据库
show create database CHARSETTest;
-- 返回
-- CREATE DATABASE `CHARSETTest` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */
或:
CREATE DATABASE IF NOT EXISTS CHARSETTest1
DEFAULT CHARSET latin1 COLLATE latin1_swedish_ci;
-- 查看数据库
show create database CHARSETTest1;
-- 返回
-- CREATE DATABASE `CHARSETTest1` /*!40100 DEFAULT CHARACTER SET latin1 */
指定表和列的字符集和排序规则
如下,可以在创建一个表时,指定字符集和排序规则DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci。不指定时,默认使用所在数据库的字符集和排序规则。
同时,可以在列名后,指定当前列的排序规则,如 id char(4) COLLATE utf8mb4_unicode_ci。
CREATE TABLE `Product` (
`id` char(4) COLLATE utf8mb4_unicode_ci NOT NULL,
`name` varchar(100) COLLATE utf8mb4_unicode_ci NOT NULL,
`type` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
`sale_price` int(11) DEFAULT NULL,
`purchase_price` int(11) DEFAULT NULL,
`regist_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
通过SHOW CREATE TABLE tablename;语句,可以完整查看创建表和列的语句,及其中用到的字符集、排序规则。
对应的,可以通过 ALTER TABLE 语句实现对列使用的字符编码的修改。
修改数据库的字符集和排序规则
ALTER DATABASE <db_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
示例:
MariaDB [test]> ALTER DATABASE CHARSETTest1 CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Query OK, 1 row affected (0.008 sec)
MariaDB [test]> show create database CHARSETTest1;
+--------------+-----------------------------------------------------------------------------------------------------+
| Database | Create Database |
+--------------+-----------------------------------------------------------------------------------------------------+
| CHARSETTest1 | CREATE DATABASE `CHARSETTest1` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
+--------------+-----------------------------------------------------------------------------------------------------+
1 row in set (0.006 sec)
修改表的字符集
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
示例:
MariaDB [test]> ALTER TABLE OrderTest CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Query OK, 4 rows affected (0.035 sec)
Records: 4 Duplicates: 0 Warnings: 0
MariaDB [test]> show create table OrderTest;
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| OrderTest | CREATE TABLE `OrderTest` (...) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.002 sec)
修改某一个列的排序规则
ALTER TABLE t_name MODIFY c_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL;
查看不同对象的字符编码或排序规则
SHOW VARIABLES查看MySQL系统设置的排序规则
SHOW VARIABLES系统变量可以返回和排序规则相关的各个级别的默认设置。
如下,显示与字符集和排序规则相关的所有变量的设置:
MariaDB [(none)]> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.001 sec)
最后三行分别是服务器、数据库和连接的默认排序规则。
SELECT @@collation_server;
查看当前数据库的字符集和排序规则
全局变量
use <db_name>;
SELECT @@character_set_database, @@collation_database;
如下查看:
MariaDB [test]> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1 | latin1_swedish_ci |
+--------------------------+----------------------+
1 row in set (0.004 sec)
如果未指定数据库时,将返回MySQL系统级别的字符集和排序规则。
使用information_schema.schemata
查询
information_schema.schemata表,也可以获取数据库的排序规则。而且不需要切换数据库,可直接查看指定数据库的collate。
如下:
MariaDB [(none)]> SELECT
-> default_character_set_name,
-> default_collation_name
-> FROM information_schema.schemata
-> WHERE schema_name = 'test';
+----------------------------+------------------------+
| default_character_set_name | default_collation_name |
+----------------------------+------------------------+
| latin1 | latin1_swedish_ci |
+----------------------------+------------------------+
1 row in set (0.001 sec)
使用环境变量variables
USE db_name;
show variables like "character_set_database";
-- or:
-- show variables like "collation_database";
查看指定表的排序规则
创建一个存放ABab的数据表。后续操作基于此表演示。
CREATE TABLE OrderTest( letter char(1) NOT NULL ); INSERT INTO OrderTest values('B'),('b'),('A'),('a');
通过查询 INFORMATION_SCHEMA.TABLES 表,可以查看某个表的排序规则。
比如:
SELECT TABLE_SCHEMA -- 会有警告,结果为dbname
, TABLE_NAME
, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 't_name';
MariaDB [test]> SELECT TABLE_SCHEMA
-> , TABLE_NAME
-> , TABLE_COLLATION
-> FROM INFORMATION_SCHEMA.TABLES
-> WHERE TABLE_NAME = 'OrderTest';
+--------------+------------+-------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION |
+--------------+------------+-------------------+
| test | ordertest | latin1_swedish_ci |
+--------------+------------+-------------------+
1 row in set, 2 warnings (0.003 sec)
除此之外,还可以使用 SHOW TABLE STATUS LIKE 't_name'; 或 SHOW TABLE STATUS where name like 't_name'; 语句查看表的 Collation 。(缺点是无法选择输出列...)
MariaDB [test]> SHOW TABLE STATUS LIKE 'OrderTest'\G
*************************** 1. row ***************************
Name: OrderTest
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 4
Avg_row_length: 4096
Data_length: 16384
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2021-08-10 10:38:06
Update_time: 2021-08-10 10:38:06
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
Max_index_length: 0
Temporary: N
1 row in set (0.001 sec)
查看列的字符集和排序规则
INFORMATION_SCHEMA.COLUMNS表
在
INFORMATION_SCHEMA.COLUMNS 表中,可以查看列的排序规则。
SELECT TABLE_NAME
, COLUMN_NAME
, character_set_name
, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 't_name';
如果想要查看所在数据库,可以在附加上TABLE_SCHEMA列。
如下:
MariaDB [(none)]> SELECT TABLE_NAME
-> , COLUMN_NAME
-> , character_set_name
-> , COLLATION_NAME
-> FROM INFORMATION_SCHEMA.COLUMNS
-> WHERE TABLE_NAME = 'OrderTest';
+------------+-------------+--------------------+-------------------+
| TABLE_NAME | COLUMN_NAME | character_set_name | COLLATION_NAME |
+------------+-------------+--------------------+-------------------+
| ordertest | letter | latin1 | latin1_swedish_ci |
+------------+-------------+--------------------+-------------------+
1 row in set (0.011 sec)
SHOW FULL COLUMNS FROM <table>
SHOW FULL COLUMNS FROM <table>可以返回一个表所有列的信息,包括排序规则。
MariaDB [test]> SHOW FULL COLUMNS FROM OrderTest;
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
| letter | char(1) | latin1_swedish_ci | NO | | NULL | | select,insert,update,references | |
+--------+---------+-------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.005 sec)
SQL语句中的显式使用排序规则
使用COLLATE子句指定排序规则
默认查询使用默认排序规则,不区分大小写:
MariaDB [test]> select * from OrderTest where letter='a';
+--------+
| letter |
+--------+
| A |
| a |
+--------+
2 rows in set (0.000 sec)
指定字符集,查找严格区分大小写:
MariaDB [test]> select * from OrderTest where letter collate latin1_general_cs ='a';
+--------+
| letter |
+--------+
| a |
+--------+
1 row in set (0.007 sec)
MySQL/MariaDB排序中如何严格按照字符的码点排序?
从下面的查询,可以看到,MySQL/MariaDB的Order By子句排序查询中,指定排序列的Collation为区分大小写的latin1_general_cs,并不会使结果按照字符码点严格排序。
MariaDB [test]> select * from OrderTest order by letter;
+--------+
| letter |
+--------+
| A |
| a |
| B |
| b |
+--------+
4 rows in set (0.000 sec)
MariaDB [test]> select * from OrderTest order by letter collate latin1_general_cs;
+--------+
| letter |
+--------+
| A |
| a |
| B |
| b |
+--------+
4 rows in set (0.002 sec)
而要想实现按照字符的Unicode码值排序,可以使用二进制的latin1_bin、utf8mb4_bin等排序规则。如下:
MariaDB [test]> select * from OrderTest order by letter collate latin1_bin;
+--------+
| letter |
+--------+
| A |
| B |
| a |
| b |
+--------+
4 rows in set (0.002 sec)