Characters and Grapheme Clusters
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string. NSString has a large inventory of methods for properly handling Unicode strings, which in general make Unicode compliance easy, but there are a few precautions you should observe.
NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.
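As a rough illustration of what counting in UTF-16 units means, the following Swift sketch compares the length NSString reports with the number of user-perceived characters. The sample string and variable names are invented for the example, and Swift's String.count (which counts grapheme clusters) appears only as a point of comparison; it is not part of the NSString API described here.

```swift
import Foundation

// "e" followed by COMBINING ACUTE ACCENT (U+0301), then U+1F600, an emoji
// that lies outside the Basic Multilingual Plane. Sample data only.
let sample = "e\u{0301}\u{1F600}"
let nsSample = NSString(string: sample)

// NSString reports its length in UTF-16 units: 1 + 1 + 2.
let utf16UnitCount = nsSample.length

// character(at:) returns a single UTF-16 unit, not a user-perceived character.
let firstUnit = nsSample.character(at: 0)   // U+0065, the bare letter "e"

// Swift's String counts grapheme clusters instead.
let perceivedCount = sample.count

print(utf16UnitCount, firstUnit, perceivedCount)   // prints "4 101 2"
```

The two user-perceived characters occupy four "characters" in NSString terms, which is why indexes and ranges are best treated as opaque substring boundaries rather than character counts.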
The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider.
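To make the surrogate-pair mechanics concrete, here is a minimal Swift sketch. It writes the pair-to-code-point arithmetic out by hand instead of calling the CFString conversion functions, and both helpers (codePoint and splitsSurrogatePair) are hypothetical names introduced for this example. The surrogate checks use the Swift standard library's UTF16.isLeadSurrogate and UTF16.isTrailSurrogate.

```swift
import Foundation

// U+1F600 is outside the Basic Multilingual Plane, so UTF-16 needs two units.
let emoji = NSString(string: "\u{1F600}")
let high = emoji.character(at: 0)   // 0xD83D, from the range 0xD800...0xDBFF
let low = emoji.character(at: 1)    // 0xDE00, from the range 0xDC00...0xDFFF

// Hypothetical helper: combine a surrogate pair into its UTF-32 value.
func codePoint(high: UInt16, low: UInt16) -> UInt32 {
    return ((UInt32(high) - 0xD800) << 10) + (UInt32(low) - 0xDC00) + 0x10000
}

// Hypothetical helper: true if a substring boundary at `index` would fall
// between the two halves of a surrogate pair.
func splitsSurrogatePair(_ string: NSString, at index: Int) -> Bool {
    guard index > 0, index < string.length else { return false }
    return UTF16.isLeadSurrogate(string.character(at: index - 1))
        && UTF16.isTrailSurrogate(string.character(at: index))
}

let combined = codePoint(high: high, low: low)
let unsafeCut = splitsSurrogatePair(emoji, at: 1)
print(String(combined, radix: 16), unsafeCut)   // prints "1f600 true"
```

A check like splitsSurrogatePair covers only this one constraint; it does nothing for combining marks or other multi-code-point sequences.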
In many writing systems, a single character may be composed of a base letter plus an accent or other decoration. The number of possible letters and accents precludes Unicode from representing each combination as a single code point, so in general such combinations are represented by a base character followed by one or more combining marks. For compatibility reasons, Unicode does have single code points for a number of the most common combinations; these are referred to as precomposed forms, and Unicode normalization transformations can be used to convert between precomposed and decomposed representations. However, even if a string is fully precomposed, there are still many combinations that must be represented using a base character and combining marks. For most text processing, substring ranges should be arranged so that their boundaries do not separate a base character from its associated combining marks.
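The difference between precomposed and decomposed forms can be seen in a short Swift sketch. It uses NSString's precomposedStringWithCanonicalMapping and decomposedStringWithCanonicalMapping properties; the sample strings are assumptions chosen for illustration, including "q" plus an acute accent as a combination that has no precomposed code point.

```swift
import Foundation

// Decomposed: "e" followed by COMBINING ACUTE ACCENT (U+0301).
let decomposed = NSString(string: "e\u{0301}")

// Canonical composition folds the pair into the single code point U+00E9.
let precomposed = NSString(string: decomposed.precomposedStringWithCanonicalMapping)

// Canonical decomposition goes back the other way.
let roundTrip = NSString(string: precomposed.decomposedStringWithCanonicalMapping)

// "q" plus the same accent has no precomposed code point, so it stays as
// base character + combining mark even after composition.
let noSingleForm = NSString(
    string: NSString(string: "q\u{0301}").precomposedStringWithCanonicalMapping)

print(decomposed.length, precomposed.length, roundTrip.length, noSingleForm.length)
// prints "2 1 2 2"
```

The last value shows why normalizing to precomposed form does not remove the need to keep a base character together with its combining marks.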
In addition, there are writing systems in which characters represent a combination of parts that are more complicated than accent marks. In Korean, for example, a single Hangul syllable can be composed of two or three subparts known as jamo. In the Indic and Indic-influenced writing systems common throughout South and Southeast Asia, single written characters often represent combinations of consonants, vowels, and marks such as viramas, and the Unicode representations of these writing systems often use code points for these individual parts, so that a single character may be composed of multiple code points. For most text processing, substring ranges should also be arranged so that their boundaries do not separate the jamo in a single Hangul syllable, or the components of an Indic consonant cluster.
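A small Swift sketch of the Hangul case, assuming the syllable 한 as sample data: the same syllable can be stored as one code point or as three conjoining jamo, and a boundary placed inside the jamo sequence leaves a fragment. Swift's String.count and string equality are used here only to show what counts as one user-perceived character; the variable names are illustrative.

```swift
import Foundation

let syllable = "\u{D55C}"                       // 한 as a single code point
let jamoSequence = "\u{1112}\u{1161}\u{11AB}"   // the same syllable as three jamo

let jamoString = NSString(string: jamoSequence)
let jamoUnits = jamoString.length           // 3 UTF-16 units
let jamoPerceived = jamoSequence.count      // 1 grapheme cluster

// Swift compares strings by canonical equivalence, so both spellings match.
let sameSyllable = (jamoSequence == syllable)

// A naive one-unit prefix keeps only the leading consonant jamo.
let fragment = jamoString.substring(to: 1)

print(jamoUnits, jamoPerceived, sameSyllable, fragment.unicodeScalars.count)
// prints "3 1 true 1"
```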
In general, these combinations (surrogate pairs, base characters plus combining marks, Hangul jamo, and Indic consonant clusters) are referred to as grapheme clusters. In order to take them into account, you can use NSString’s rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex: methods, or CFStringGetRangeOfComposedCharactersAtIndex. These can be used to adjust string indexes or substring ranges so that they fall on grapheme cluster boundaries, taking into account all of the constraints mentioned above. These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.
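The following Swift sketch shows one way these methods might be used, under their Swift spellings rangeOfComposedCharacterSequence(at:) and rangeOfComposedCharacterSequences(for:). The sample word and the loop that walks the string cluster by cluster are illustrative assumptions, not a prescribed pattern.

```swift
import Foundation

// c, a, f, e, COMBINING ACUTE ACCENT (U+0301), s: six UTF-16 units.
let word = NSString(string: "cafe\u{0301}s")

// Index 4 points at the combining accent; the cluster covers "e" plus accent.
let cluster = word.rangeOfComposedCharacterSequence(at: 4)   // location 3, length 2

// Taking the first four units would strand the accent, so widen the range
// to the nearest grapheme cluster boundaries before extracting.
let naive = NSRange(location: 0, length: 4)
let adjusted = word.rangeOfComposedCharacterSequences(for: naive)   // location 0, length 5
let adjustedPrefix = word.substring(with: adjusted)

// Walk the string one grapheme cluster at a time.
var clusters: [String] = []
var index = 0
while index < word.length {
    let range = word.rangeOfComposedCharacterSequence(at: index)
    clusters.append(word.substring(with: range))
    index = NSMaxRange(range)
}

print(adjustedPrefix, clusters.count)
```

Advancing by NSMaxRange(range) instead of by one keeps every step on a cluster boundary, whatever mix of surrogate pairs and combining marks the string contains.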
In some cases, Unicode algorithms deal with multiple characters in ways that go beyond even grapheme cluster boundaries. Unicode casing algorithms may convert a single character into multiple characters when going from lowercase to uppercase; for example, the standard uppercase equivalent of the German character “ß” is the two-letter sequence “SS”. Localized collation algorithms in many languages consider multiple-character sequences as single units; for example, the sequence “ch” is treated as a single letter for sorting purposes in some European languages. In order to deal properly with cases like these, it is important to use standard NSString methods for such operations as casing, sorting, and searching, and to use them on the entire string to which they are to apply. Use NSString methods such as lowercaseString, uppercaseString, capitalizedString, compare: and its variants, rangeOfString: and its variants, and rangeOfCharacterFromSet: and its variants, or their CFString equivalents. These all take into account the complexities of Unicode string processing, and the searching and sorting methods in particular have many options to control the types of equivalences they are to recognize.
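A brief Swift sketch of whole-string casing, comparison, and searching follows. It uses the Swift spellings of the methods named above (uppercased, compare(_:options:), range(of:), and range(of:options:)); the sample words and the particular option sets are assumptions chosen for the example.

```swift
import Foundation

// Casing can change the length: "ß" (U+00DF) uppercases to "SS".
let street = NSString(string: "stra\u{00DF}e")
let upper = street.uppercased   // "STRASSE"

// Comparison options control which equivalences are recognized.
let comparison = NSString(string: "R\u{00E9}sum\u{00E9}")
    .compare("resume", options: [.caseInsensitive, .diacriticInsensitive])

// By default, searching treats canonically equivalent sequences as equal,
// so a precomposed "é" is found in a string that stores "e" + U+0301.
let haystack = NSString(string: "cafe\u{0301}")
let found = haystack.range(of: "\u{00E9}")

// With .literal, only an exact unit-by-unit match counts.
let literal = haystack.range(of: "\u{00E9}", options: .literal)

print(upper, comparison == .orderedSame, found.length, literal.location == NSNotFound)
```

Applying the operations to the whole string matters here: uppercasing or comparing unit by unit would miss both the “ß” expansion and the equivalence between the two spellings of “é”.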
In some less common cases, it may be necessary to tailor the definition of grapheme clusters to a particular need. The issues involved in determining and tailoring grapheme cluster boundaries are covered in detail in Unicode Standard Annex #29, which gives a number of examples and some algorithms. The Unicode standard in general is the best source for information about Unicode algorithms and the considerations involved in processing Unicode strings.
If you are interested in grapheme cluster boundaries from the point of view of cursor movement and insertion point positioning, and you are using the Cocoa text system, you should know that on OS X v10.5 and later, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an “fi” ligature in Latin script, may require an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.