The Unicode Standard, Version 12.0
Chapter 1 Introduction
The Unicode Standard is the universal character encoding standard for written characters and text.
It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software.
As the default encoding of HTML and XML, the Unicode Standard provides the underpinning for the World Wide Web and the global business environments of today.
Required in new Internet protocols and implemented in all modern operating systems and computer languages such as Java and C#, Unicode is the basis of software that must function all around the world.
With Unicode, the information technology industry has replaced proliferating character sets with data stability, global interoperability and data interchange, simplified software, and reduced development costs.
While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII’s limited ability to encode only the upper- and lowercase letters A through Z.
It provides the capacity to encode all characters used for the written languages of the world; more than 1 million characters can be encoded.
No escape sequence or control code is required to specify any character in any language.
The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.
The Unicode Standard specifies a numeric value (code point) and a name for each of its characters.
In this respect, it is similar to other character encoding standards from ASCII onward.
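The pairing of a code point with a name can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is an illustrative assumption, and the names returned depend on the version of the Unicode Character Database bundled with the interpreter.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' (hypothetical helper)."""
    # ord() yields the numeric code point; unicodedata.name() yields the character name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))        # U+0041 LATIN CAPITAL LETTER A
print(describe("\u09EB"))   # U+09EB BENGALI DIGIT FIVE

# The mapping also works in the other direction, from name to character.
assert unicodedata.lookup("LATIN CAPITAL LETTER A") == "A"
```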
In addition to character codes and names, other information is crucial to ensure legible text: a character’s case, directionality, and alphabetic properties must be well defined.
The Unicode Standard defines these and other semantic values, and it includes application data such as case mapping tables and character property tables as part of the Unicode Character Database.
Character properties define a character’s identity and behavior; they ensure consistency in the processing and interchange of Unicode data.
Unicode Character Database
The Unicode Character Database (UCD) consists of a set of files that define the Unicode character properties and internal mappings.
For each property, the files determine the assignment of property values to each code point.
The UCD also supplies recommended property aliases and property value aliases for textual parsing and display in environments such as regular expressions.
▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
The properties include the following:
• Name
• General Category (basic partition into letters, numbers, symbols, punctuation, and so on)
• Other important general characteristics (whitespace, dash, ideographic, alphabetic, noncharacter, deprecated, and so on)
• Display-related properties (bidirectional class, shaping, mirroring, width, and so on)
• Casing (upper, lower, title, folding, both simple and full)
• Numeric values and types
• Script and Block
• Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, and so on)
• Age (version of the standard in which the code point was first designated)
• Boundaries (grapheme cluster, word, line, and sentence)
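Several of the properties listed above can be queried programmatically. The sketch below assumes Python's standard `unicodedata` module, which exposes only a subset of the UCD (for example, it has no Script, Block, Age, or boundary properties); the function name `properties` and the dictionary keys are illustrative choices, not UCD terminology.

```python
import unicodedata

def properties(ch: str) -> dict:
    """Collect the UCD properties that the unicodedata module exposes for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),       # e.g. 'Lu' = uppercase letter
        "bidi_class": unicodedata.bidirectional(ch),         # e.g. 'L', 'R'
        "mirrored": bool(unicodedata.mirrored(ch)),          # mirrored in right-to-left text
        "combining_class": unicodedata.combining(ch),        # canonical combining class
        "east_asian_width": unicodedata.east_asian_width(ch),
        "numeric_value": unicodedata.numeric(ch, None),      # None if not numeric
        "decomposition": unicodedata.decomposition(ch),      # '' if none
    }

for sample in ("A", "(", "\u00BD", "\u00E9", "\u05D0"):
    print(f"U+{ord(sample):04X}", properties(sample))
```

The version of the database behind these answers is available as `unicodedata.unidata_version`, so results for recently added characters may differ between interpreter releases.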
▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems.
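To make the three encoding forms concrete, the following sketch counts how many code units each form needs for a few sample characters. It relies on Python's built-in codecs; the function name `code_units` is an assumption for illustration, and the little-endian codec variants are chosen only to avoid counting a byte order mark.

```python
def code_units(ch: str) -> dict:
    """Number of code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

for sample in ("A", "\u00E9", "\u4E2D", "\U0001F600"):
    print(f"U+{ord(sample):04X}", code_units(sample))

# ASCII text is byte-for-byte unchanged in UTF-8, which is what makes
# UTF-8 easy to use with existing ASCII-based systems.
assert "ASCII text".encode("utf-8") == "ASCII text".encode("ascii")
```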
The Unicode Standard is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646.
The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters.
The majority of the common characters used in the major languages of the world are encoded in the first 65,536 code points, also known as the Basic Multilingual Plane (BMP).
The overall capacity for more than 1 million characters is more than sufficient for all known character encoding requirements, including full coverage of all minority and historic scripts of the world.
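The figures above follow from simple arithmetic. The sketch below assumes the division of the code space into 17 planes of 65,536 code points each, a detail that is not stated in this passage; the helper names are illustrative.

```python
PLANES = 17                        # assumption: plane count, not stated in the text above
CODE_POINTS_PER_PLANE = 0x10000    # 65,536, the size of the BMP

TOTAL_CODE_POINTS = PLANES * CODE_POINTS_PER_PLANE   # 1,114,112
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1               # 0x10FFFF

def plane_of(code_point: int) -> int:
    """Plane number of a code point: the bits above the low 16."""
    return code_point >> 16

def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane (plane 0)."""
    return plane_of(code_point) == 0

print(TOTAL_CODE_POINTS)        # 1114112
print(hex(MAX_CODE_POINT))      # 0x10ffff
print(in_bmp(ord("\u4E2D")))    # True
print(in_bmp(0x1F600))          # False
```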
-
1-1 Coverage
The Unicode Standard, Version 12.0, contains 137,929 characters from the world’s scripts.
These characters are more than sufficient not only for modern communication for the world’s languages, but also to represent the classical forms of many languages.
The standard includes the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia and Africa.
Many archaic and historic scripts are encoded.
The Han script includes 87,887 unified ideographic characters defined by national, international, and industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore.
In addition, the Unicode Standard contains many important symbol sets, including currency symbols, punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats, and emoji.
For overall character and code range information, see Chapter 2, General Structure.
Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel, or private-use characters, nor does it encode logos or graphics.
Graphologies unrelated to text, such as dance notations, are likewise outside the scope of the Unicode Standard.
Font variants are explicitly not encoded.
The Unicode Standard reserves 6,400 code points in the BMP for private use, which may be used to assign codes to characters not included in the repertoire of the Unicode Standard.
Another 131,068 private-use code points are available outside the BMP, should 6,400 prove insufficient for particular applications.
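The two private-use counts can be checked against the code point ranges involved. The ranges in this sketch (U+E000..U+F8FF in the BMP, and U+F0000..U+FFFFD and U+100000..U+10FFFD in planes 15 and 16) are not given in the passage above and are supplied here as background; the constant names are illustrative.

```python
import unicodedata

# Private-use ranges (inclusive); supplied as background, not quoted from the text above.
BMP_PRIVATE_USE = range(0xE000, 0xF8FF + 1)
PLANE_15_PRIVATE_USE = range(0xF0000, 0xFFFFD + 1)
PLANE_16_PRIVATE_USE = range(0x100000, 0x10FFFD + 1)

print(len(BMP_PRIVATE_USE))                                    # 6400
print(len(PLANE_15_PRIVATE_USE) + len(PLANE_16_PRIVATE_USE))   # 131068

def is_private_use(ch: str) -> bool:
    """True if the character's General Category is Co (private use)."""
    return unicodedata.category(ch) == "Co"

print(is_private_use("\uE000"))   # True
print(is_private_use("A"))        # False
```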
Standards Coverage
The Unicode Standard is a superset of all characters in widespread use today.
It contains the characters from major international and national standards as well as prominent industry character sets.
For example, Unicode incorporates the ISO/IEC 6937 and ISO/IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426.
Important national standards contained within Unicode include ANSI Z39.64, KS X 1001, JIS X 0208, JIS X 0212, JIS X 0213, GB 2312, GB 18030, HKSCS, and CNS 11643.
Industry code pages and character sets from Adobe, Apple, Fujitsu, Hewlett-Packard, IBM, Lotus, Microsoft, NEC, and Xerox are fully represented as well.
The Unicode Standard is fully conformant with the International Standard ISO/IEC 10646:2017, Information Technology - Universal Coded Character Set (UCS), known as the Universal Character Set (UCS). For more information, see Appendix C, Relationship to ISO/IEC 10646.
New Characters
The Unicode Standard continues to respond to new and changing industry demands by encoding important new characters.
As the universal character encoding, the Unicode Standard also responds to scholarly needs.
To preserve world cultural heritage, important archaic scripts are encoded as consensus about the encoding is developed.
-
1-2 Design Goals
The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.
The pre-existing legacy character encodings were both inconsistent and incomplete: two encodings could use the same codes for two different characters and use different codes for the same characters, while none of the encodings handled any more than a small fraction of the world’s languages.
Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption.
Programs often were written only to support particular encodings, making development of international versions expensive.
As a result, developing countries were particularly hard-hit, as it was not economically feasible to adapt specific versions of programs for smaller markets.
Technical fields such as mathematics were also disadvantaged, because they were forced to use special fonts to represent arbitrary characters, often leading to garbled content.
The designers of the Unicode Standard envisioned a uniform method of character identification that would be more efficient and flexible than previous encoding systems.
The new system would satisfy the needs of technical and multilingual computing and would encode a broad range of characters for all purposes, including worldwide publication.
The Unicode Standard was designed to be:
• Universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets.
• Efficient. Plain text is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text.
• Unambiguous. Any given Unicode code point always represents the same character.
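The synchronization claim under "Efficient" can be illustrated with UTF-8, one of the standard's encoding forms. The sketch below relies on a property of UTF-8 that the list above does not spell out: every continuation byte has the bit pattern 10xxxxxx, so the start of a character can be found from any byte offset without external state. The function name `sync_to_boundary` is hypothetical.

```python
def sync_to_boundary(data: bytes, index: int) -> int:
    """Move index backward to the first byte of the UTF-8 character that contains it."""
    # Continuation bytes match 10xxxxxx; lead bytes and ASCII bytes do not.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

data = "a\u4E2Db".encode("utf-8")   # 1 byte + 3 bytes + 1 byte
start = sync_to_boundary(data, 3)    # offset 3 is inside the 3-byte character
print(start)                         # 1
print(data[start:start + 3].decode("utf-8") == "\u4E2D")   # True
```

At most three backward steps are needed for well-formed UTF-8, since no character occupies more than four bytes.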
-
1-3 Text Handling
The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide.
The specifications give programmers extensive descriptions and a vast amount of data about the handling of text, including how to:
• divide words and break lines
• sort text in different languages
• format numbers, dates, times, and other elements appropriate to different locales
• display text for languages whose written form flows from right to left, such as Arabic or Hebrew
• display text in which the written form splits, combines, and reorders, such as for the languages of South Asia
• deal with security concerns regarding the many look-alike characters from writing systems around the world
Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible.
With the Unicode Standard as the foundation of text representation, all of the text on the Web can be stored, searched, and matched with the same program code.
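As one illustration of the look-alike concern mentioned in the list above, the sketch below compares two strings that render almost identically but contain different code points. Python's `unicodedata` module does not expose the Script property, so the first word of each character name is used as a rough script hint; that heuristic and the function name `script_hints` are assumptions made for this example, not the detection method defined by the Unicode security specifications.

```python
import unicodedata

def script_hints(text: str) -> set:
    """Rough script hint per letter: the first word of its character name (heuristic only)."""
    hints = set()
    for ch in text:
        if ch.isalpha():
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints

genuine = "paypal"
spoofed = "p\u0430yp\u0430l"   # U+0430 CYRILLIC SMALL LETTER A replaces Latin 'a'

print(genuine == spoofed)      # False
print(script_hints(genuine))   # {'LATIN'}
print(len(script_hints(spoofed)) > 1)   # True: letters from more than one script
```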
Characters and Glyphs
The difference between identifying a character and rendering it on screen or paper is crucial to understanding the Unicode Standard’s role in text processing.
The character identified by a Unicode code point is an abstract entity, such as “LATIN CAPITAL LETTER A” or “BENGALI DIGIT FIVE”.
The mark made on screen or paper, called a glyph, is a visual representation of the character.
The Unicode Standard does not define glyph images.
That is, the standard defines how characters are interpreted, not how glyphs are rendered.
Ultimately, the software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen.
The Unicode Standard does not specify the precise shape, size, or orientation of onscreen characters.
Text Elements
The successful encoding, processing, and interpretation of text requires appropriate definition of useful elements of text and the basic rules for interpreting text.
The definition of text elements often changes depending on the process that handles the text.
For example, when searching for a particular word or character written with the Latin script, one often wishes to ignore differences of case.
However, correct spelling within a document requires case sensitivity.
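The contrast between case-insensitive searching and case-sensitive spelling can be sketched with Python's `str.casefold`, which applies Unicode case folding. The function name `contains_caseless` is illustrative, and a production search would typically also normalize the text first, which this sketch omits.

```python
def contains_caseless(haystack: str, needle: str) -> bool:
    """Substring search that ignores case differences by folding both sides."""
    return needle.casefold() in haystack.casefold()

text = "The Unicode Standard"

print(contains_caseless(text, "STANDARD"))   # True: search ignores case
print("STANDARD" in text)                    # False: exact comparison is case-sensitive

# Case folding is more than lowercasing: the sharp s folds to "ss".
print(contains_caseless("STRASSE", "stra\u00DFe"))   # True
```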
The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters.
An encoded character is represented by a number from 0 to 10FFFF₁₆, called a code point.
A text element, in turn, is represented by a sequence of one or more encoded characters.
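A short sketch can show both points: the valid range of code points, and a single text element built from more than one encoded character. The example of "e" followed by a combining acute accent, and the use of NFC normalization to compare it with the precomposed letter, are illustrative choices that go beyond what this passage states.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF

def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code point range 0..10FFFF (hexadecimal)."""
    return 0 <= n <= MAX_CODE_POINT

print(is_code_point(0x10FFFF))   # True
print(is_code_point(0x110000))   # False

# One user-perceived text element, two encoded characters:
decomposed = "e\u0301"           # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(decomposed))           # 2

# NFC normalization yields the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), composed == "\u00E9")   # 1 True
```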