1 UTF-8 character set
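<?php
// Tell mbstring to treat strings as UTF-8 (the source file itself is saved as UTF-8).
mb_internal_encoding('UTF-8');
var_dump(mb_internal_encoding());

// 3 ASCII letters (1 byte each) + 3 accented letters (2 bytes each in UTF-8) = 9 bytes.
$string = 'cioèòà';
var_dump(
    substr($string, 0, 6),    // first 6 bytes, may split a character
    mb_substr($string, 0, 6), // first 6 characters
    mb_strcut($string, 0, 6)  // at most 6 bytes, whole characters only
);

// Each Chinese character takes 3 bytes in UTF-8.
$string = '这样就不会出现乱码了- ^ ^ - ';
var_dump(
    substr($string, 0, 7),
    mb_substr($string, 0, 7),
    mb_strcut($string, 0, 7)
);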
Output:
[dev@dev02 test]$ php testMb.php
string(5) "UTF-8"
string(6) "cioèb
string(9) "cioèòà"
string(5) "cioè"
string(7) "这样
string(21) "这样就不会出现"
string(6) "这样"
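The byte counts in the output above can be checked directly. The following is a minimal sketch, assuming PHP 7+ with the mbstring extension loaded. The string is written with \u{} escapes so that its bytes do not depend on how the file is saved, and the encoding is passed explicitly instead of relying on mb_internal_encoding().

```php
<?php
$string = "cio\u{E8}\u{F2}\u{E0}"; // 'cioèòà' as UTF-8 bytes

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteLength = strlen($string);
$charLength = mb_strlen($string, 'UTF-8');

// substr() stops after exactly 6 bytes, which splits the two-byte "ò".
$rawCut = substr($string, 0, 6);
$rawCutIsValid = mb_check_encoding($rawCut, 'UTF-8');

// mb_strcut() backs up to the previous character boundary instead.
$safeCut = mb_strcut($string, 0, 6, 'UTF-8');

var_dump($byteLength, $charLength, bin2hex($rawCut), $rawCutIsValid, strlen($safeCut));
```

The hex dump of the substr() result ends in a lone lead byte, which is what the terminal shows as a broken character.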
2 ISO-8859-1 character set
<?php
// ISO-8859-1 is a single-byte encoding: mbstring now treats every byte as one character,
// even though the strings below are still stored as UTF-8.
mb_internal_encoding('ISO-8859-1');
var_dump(mb_internal_encoding());

$string = 'cioèòà';
var_dump(
    substr($string, 0, 6),
    mb_substr($string, 0, 6),
    mb_strcut($string, 0, 6)
);

$string = '这样就不会出现乱码了- ^ ^ - ';
var_dump(
    substr($string, 0, 7),
    mb_substr($string, 0, 7),
    mb_strcut($string, 0, 7)
);
Output:
[dev@dev02 test]$ php testMb.php
string(10) "ISO-8859-1"
string(6) "cioèb
string(6) "cioèb
string(6) "cioèb
string(7) "这样
string(7) "这样
string(7) "这样
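All three functions return the same broken bytes here because a single-byte encoding makes "one character" mean "one byte". A small sketch of the same effect, assuming PHP 7+ with mbstring and passing the encoding as the fourth argument rather than setting it globally:

```php
<?php
$string = "\u{8FD9}\u{6837}\u{5C31}"; // '这样就' as UTF-8 bytes (3 bytes per character)

// Declared as ISO-8859-1, every byte counts as one character, so this is
// the same as substr($string, 0, 4) and splits the second character.
$asLatin1 = mb_substr($string, 0, 4, 'ISO-8859-1');

// Declared as UTF-8, the same call shape returns whole characters.
$asUtf8 = mb_substr($string, 0, 2, 'UTF-8');

var_dump($asLatin1 === substr($string, 0, 4), strlen($asLatin1), strlen($asUtf8));
```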
3 UTF-16BE character set (Unicode, 16-bit code units, 2 or 4 bytes per character)
<?php
// mbstring now reads the bytes as 2-byte UTF-16 code units,
// even though the strings below are still stored as UTF-8.
mb_internal_encoding('UTF-16BE');
var_dump(mb_internal_encoding());

$string = 'cioèòà';
var_dump(
    substr($string, 0, 6),
    mb_substr($string, 0, 6),
    mb_strcut($string, 0, 6)
);

$string = '这样就不会出现乱码了- ^ ^ - ';
var_dump(
    substr($string, 0, 7),
    mb_substr($string, 0, 7),
    mb_strcut($string, 0, 7)
);
Output:
[dev@dev02 test]$ php testMb.php
string(8) "UTF-16BE"
string(6) "cioèb
string(8) "cioèòb
string(6) "cioèb
string(7) "这样
string(14) "这样就不亢
string(6) "这样"
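The lengths above (8, 14, and 6 bytes) follow from the bytes being grouped in pairs. As a rough sketch, assuming PHP 7+ with mbstring, the difference between mislabelling UTF-8 bytes as UTF-16BE and actually converting them can be seen like this:

```php
<?php
$utf8 = "\u{8FD9}\u{6837}\u{5C31}\u{4E0D}"; // '这样就不', 12 bytes in UTF-8

// Mislabelled: UTF-8 bytes read as UTF-16BE are grouped into 2-byte units,
// so a 7-byte request cannot end in the middle of a unit.
$misread = mb_strcut($utf8, 0, 7, 'UTF-16BE');

// Properly converted: each of these characters is one 2-byte unit in UTF-16BE.
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
$firstTwo = mb_substr($utf16, 0, 2, 'UTF-16BE');

var_dump(strlen($misread), strlen($utf16), bin2hex($firstTwo));
```

In the mislabelled case the cut happens to land on a UTF-8 character boundary only because 6 is a multiple of both 2 and 3.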
4 ASCII
<?php
// ASCII is also a single-byte encoding, so every byte is again treated as one character.
mb_internal_encoding('ASCII');
var_dump(mb_internal_encoding());

$string = 'cioèòà';
var_dump(
    substr($string, 0, 6),
    mb_substr($string, 0, 6),
    mb_strcut($string, 0, 6)
);

$string = '这样就不会出现乱码了- ^ ^ - ';
var_dump(
    substr($string, 0, 7),
    mb_substr($string, 0, 7),
    mb_strcut($string, 0, 7)
);
Output:
[dev@dev02 test]$ php testMb.php
string(5) "ASCII"
string(6) "cioèb
string(6) "cioèb
string(6) "cioèb
string(7) "这样
string(7) "这样
string(7) "这样
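In short, to take substrings of Chinese text, set the internal encoding to UTF-8 (the encoding the strings are actually stored in) and then use mb_substr or mb_strcut. mb_substr counts in characters, while mb_strcut counts in bytes but never returns a partial character: trailing bytes that do not form a complete character are dropped.
In the first example, mb_strcut($string, 0, 7) asks for 7 bytes, but the output is [string(6) "这样"], only 6 bytes, because the seventh byte is just the first of the three bytes of the next character.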
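The two choices can be wrapped in small helpers. This is only a sketch: firstChars and firstBytes are hypothetical names, PHP 7+ with mbstring is assumed, and the encoding is passed explicitly so the result does not depend on mb_internal_encoding().

```php
<?php
// Hypothetical helper: keep at most $maxChars characters.
function firstChars(string $text, int $maxChars): string
{
    return mb_substr($text, 0, $maxChars, 'UTF-8');
}

// Hypothetical helper: keep at most $maxBytes bytes without splitting a character.
function firstBytes(string $text, int $maxBytes): string
{
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

$text = "\u{8FD9}\u{6837}\u{5C31}\u{4E0D}"; // '这样就不' as UTF-8 bytes

var_dump(strlen(firstChars($text, 3))); // 3 characters, 9 bytes
var_dump(strlen(firstBytes($text, 7))); // 2 whole characters, 6 bytes
```

firstChars suits display limits such as "show the first N characters", while firstBytes suits storage limits expressed in bytes.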