09. Characters and Grapheme Clusters

Related link:
https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html#//apple_ref/doc/uid/TP40008025-SW1

Characters and Grapheme Clusters

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string. NSString has a large inventory of methods for properly handling Unicode strings, which in general make Unicode compliance easy, but there are a few precautions you should observe.

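  • In short: with NSString, and with Unicode strings generally, it is usually better to work with substrings than with individual characters, because one user-perceived character may be stored as several characters in the string. NSString's methods make Unicode-correct handling straightforward, provided you observe a few precautions.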

NSString objects are conceptually UTF-16 with platform endianness. That doesn't necessarily imply anything about their internal storage mechanism; what it means is that NSString lengths, character indexes, and ranges are expressed in terms of UTF-16 units, and that the term “character” in NSString method names refers to 16-bit platform-endian UTF-16 units. This is a common convention for string objects. In most cases, clients don't need to be overly concerned with this; as long as you are dealing with substrings, the precise interpretation of the range indexes is not necessarily significant.

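  • In short: NSString is conceptually UTF-16 in platform byte order. This says nothing about internal storage; it means that lengths, character indexes, and ranges count UTF-16 units, and that "character" in a method name means one 16-bit UTF-16 unit. As long as you work with substrings, the exact meaning of the range indexes rarely matters.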
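The difference between code points and UTF-16 units can be checked outside Cocoa. The following is a minimal Python sketch, not NSString API; the helper name `utf16_length` is hypothetical and simply counts 16-bit units the way the paragraph above says NSString lengths are expressed.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units (hypothetical helper, not an NSString call)."""
    # Each UTF-16 unit is two bytes; "utf-16-le" avoids a byte order mark.
    return len(text.encode("utf-16-le")) // 2


samples = {
    "ASCII letters": "abc",
    "precomposed e-acute": "\u00e9",
    "e + combining acute": "e\u0301",
    "emoji outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    # len() counts code points in Python; utf16_length() counts UTF-16 units.
    print(f"{label}: code points={len(text)}, UTF-16 units={utf16_length(text)}")
```

Only the last sample differs: it is one code point but two UTF-16 units, so a length or index expressed in UTF-16 units counts it twice.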

The vast majority of Unicode code points used for writing living languages are represented by single UTF-16 units. However, some less common Unicode code points are represented in UTF-16 by surrogate pairs. A surrogate pair is a sequence of two UTF-16 units, taken from specific reserved ranges, that together represent a single Unicode code point. CFString has functions for converting between surrogate pairs and the UTF-32 representation of the corresponding Unicode code point. When dealing with NSString objects, one constraint is that substring boundaries usually should not separate the two halves of a surrogate pair. This is generally automatic for ranges returned from most Cocoa methods, but if you are constructing substring ranges yourself you should keep this in mind. However, this is not the only constraint you should consider.

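  • In short: most code points used for living languages fit in a single UTF-16 unit, but less common ones need a surrogate pair, two UTF-16 units from reserved ranges that together encode one code point. CFString can convert between surrogate pairs and UTF-32. Substring boundaries should not split a surrogate pair; ranges returned by most Cocoa methods already respect this, but ranges you build yourself may not. This is not the only constraint to consider.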
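The arithmetic behind a surrogate pair is small enough to write out. This Python sketch follows the standard UTF-16 encoding rules; the function names are hypothetical and are not the CFString functions the text mentions.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

A substring range that ends between the two units would leave a lone surrogate on each side, which is why boundaries must not fall inside the pair.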

In many writing systems, a single character may be composed of a base letter plus an accent or other decoration. The number of possible letters and accents precludes Unicode from representing each combination as a single code point, so in general such combinations are represented by a base character followed by one or more combining marks. For compatibility reasons, Unicode does have single code points for a number of the most common combinations; these are referred to as precomposed forms, and Unicode normalization transformations can be used to convert between precomposed and decomposed representations. However, even if a string is fully precomposed, there are still many combinations that must be represented using a base character and combining marks. For most text processing, substring ranges should be arranged so that their boundaries do not separate a base character from its associated combining marks.

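  • In short: a single character is often a base letter plus an accent or other mark. Unicode cannot assign a code point to every combination, so these are generally a base character followed by one or more combining marks. Precomposed forms exist for the most common combinations, and normalization converts between precomposed and decomposed forms, but even a fully precomposed string can still contain combining sequences. Substring boundaries should not separate a base character from its combining marks.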
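Normalization between precomposed and decomposed forms can be tried with Python's standard `unicodedata` module. This is a sketch of the Unicode behavior described above, not of any NSString method; the sample strings are chosen for illustration.

```python
import unicodedata

precomposed = "\u00e9"    # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two spellings look identical on screen but differ code point by code point.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Some combinations have no precomposed form, so they stay as
# base + combining mark even after composing normalization (NFC).
q_tilde = "q\u0303"       # "q" followed by COMBINING TILDE
composed = unicodedata.normalize("NFC", q_tilde)
print(len(composed))                                            # 2
```

The last case is the one the paragraph warns about: normalizing to a precomposed form does not remove the need to keep base characters and combining marks together.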

In addition, there are writing systems in which characters represent a combination of parts that are more complicated than accent marks. In Korean, for example, a single Hangul syllable can be composed of two or three subparts known as jamo. In the Indic and Indic-influenced writing systems common throughout South and Southeast Asia, single written characters often represent combinations of consonants, vowels, and marks such as viramas, and the Unicode representations of these writing systems often use code points for these individual parts, so that a single character may be composed of multiple code points. For most text processing, substring ranges should also be arranged so that their boundaries do not separate the jamo in a single Hangul syllable, or the components of an Indic consonant cluster.

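  • In short: some writing systems combine parts more complex than accents. A Korean Hangul syllable can be composed of two or three jamo. In the Indic and Indic-influenced scripts of South and Southeast Asia, one written character often combines consonants, vowels, and marks such as viramas, each with its own code point. Substring boundaries should not separate the jamo of a Hangul syllable or the components of an Indic consonant cluster.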
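Both cases can be inspected with Python's `unicodedata` module. The sketch below is illustrative only: the sample syllable and the Devanagari conjunct are assumptions chosen for the example, and `code_point_names` is a hypothetical helper.

```python
import unicodedata


def code_point_names(text: str) -> list:
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# One precomposed Hangul syllable (U+D55C) decomposes into three jamo.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)
print([hex(ord(ch)) for ch in jamo])   # ['0x1112', '0x1161', '0x11ab']
print(code_point_names(jamo))

# A Devanagari conjunct written as consonant + virama + consonant:
# three code points that readers perceive as one written character.
conjunct = "\u0915\u094d\u0937"
print(len(conjunct))                   # 3
print(code_point_names(conjunct))
```

Splitting either sequence at an interior index would leave fragments that no longer render, or read, as the original character.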

In general, these combinations (surrogate pairs, base characters plus combining marks, Hangul jamo, and Indic consonant clusters) are referred to as grapheme clusters. In order to take them into account, you can use NSString's rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex: methods, or CFStringGetRangeOfComposedCharactersAtIndex. These can be used to adjust string indexes or substring ranges so that they fall on grapheme cluster boundaries, taking into account all of the constraints mentioned above. These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.

  • In short: these combinations are collectively called grapheme clusters. Use NSString's rangeOfComposedCharacterSequencesForRange: or rangeOfComposedCharacterSequenceAtIndex:, or CFStringGetRangeOfComposedCharactersAtIndex, to move indexes and substring ranges onto grapheme cluster boundaries. These should be the default way to find the boundaries of user-perceived characters in code.
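To make the idea of "adjusting an index to a cluster boundary" concrete, here is a deliberately simplified Python sketch. It is not how NSString implements these methods: it only keeps surrogate pairs together and attaches combining marks (Unicode categories starting with "M") to the preceding unit, and it ignores Hangul jamo, Indic clusters, and the other rules a real implementation covers. All function names are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def starts_cluster(units: List[int], i: int) -> bool:
    """True if a (simplified) cluster may begin at unit index i."""
    unit = units[i]
    if _is_low(unit) and i > 0 and _is_high(units[i - 1]):
        return False  # second half of a surrogate pair
    if _is_high(unit) and i + 1 < len(units) and _is_low(units[i + 1]):
        code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    else:
        code_point = unit
    # Combining marks belong to whatever precedes them.
    if i > 0 and unicodedata.category(chr(code_point)).startswith("M"):
        return False
    return True


def composed_range_at(units: List[int], index: int) -> Tuple[int, int]:
    """Return (location, length) of the simplified cluster containing index."""
    start = index
    while start > 0 and not starts_cluster(units, start):
        start -= 1
    end = index + 1
    while end < len(units) and not starts_cluster(units, end):
        end += 1
    return start, end - start


units = utf16_units("ae\u0301\U0001F600z")
print(composed_range_at(units, 2))  # (1, 2): "e" plus its combining accent
print(composed_range_at(units, 4))  # (3, 2): both halves of the surrogate pair
print(composed_range_at(units, 0))  # (0, 1): plain "a"
```

In real Cocoa code the methods named above should be preferred, since they account for all of the constraints rather than the two handled here.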

In some cases, Unicode algorithms deal with multiple characters in ways that go beyond even grapheme cluster boundaries. Unicode casing algorithms may convert a single character into multiple characters when going from lowercase to uppercase; for example, the standard uppercase equivalent of the German character “ß” is the two-letter sequence “SS”. Localized collation algorithms in many languages consider multiple-character sequences as single units; for example, the sequence “ch” is treated as a single letter for sorting purposes in some European languages. In order to deal properly with cases like these, it is important to use standard NSString methods for such operations as casing, sorting, and searching, and to use them on the entire string to which they are to apply. Use NSString methods such as lowercaseString, uppercaseString, capitalizedString, compare: and its variants, rangeOfString: and its variants, and rangeOfCharacterFromSet: and its variants, or their CFString equivalents. These all take into account the complexities of Unicode string processing, and the searching and sorting methods in particular have many options to control the types of equivalences they are to recognize.

  • In short: some Unicode algorithms work across more than one grapheme cluster. Case conversion can turn one character into several (the uppercase of German "ß" is "SS"), and localized collation can treat sequences such as "ch" as a single letter. Use the standard NSString methods for casing, sorting, and searching (lowercaseString, uppercaseString, capitalizedString, compare:, rangeOfString:, rangeOfCharacterFromSet:, and their variants or CFString equivalents), and apply them to the entire string.
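The casing point can be demonstrated with Python's built-in string methods, which also implement Unicode case mappings. This is a sketch of the Unicode behavior, not of NSString's uppercaseString; the Greek final sigma case is an additional illustration that the text above does not mention.

```python
word = "stra\u00dfe"  # "straße"

# Uppercasing the whole string changes its length: "ß" becomes "SS".
upper = word.upper()
print(upper, len(word), len(upper))    # STRASSE 6 7

# Per-character results show that the mapping is not one-to-one.
print([len(ch.upper()) for ch in word])  # [1, 1, 1, 1, 2, 1]

# Context matters too: a capital sigma at the end of a word lowercases
# to the final form U+03C2, which a per-character mapping cannot know.
greek = "\u039f\u0394\u039f\u03a3"       # "ΟΔΟΣ"
whole = greek.lower()
per_char = "".join(ch.lower() for ch in greek)
print(whole == per_char)                 # False
print(hex(ord(whole[-1])), hex(ord(per_char[-1])))  # 0x3c2 0x3c3
```

Both results support the advice above: run casing on the entire string the operation applies to, and do not assume the output has the same length or that characters can be converted independently.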

In some less common cases, it may be necessary to tailor the definition of grapheme clusters to a particular need. The issues involved in determining and tailoring grapheme cluster boundaries are covered in detail in Unicode Standard Annex #29, which gives a number of examples and some algorithms. The Unicode standard in general is the best source for information about Unicode algorithms and the considerations involved in processing Unicode strings.

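  • In short: in less common cases you may need to tailor the definition of a grapheme cluster. Unicode Standard Annex #29 covers determining and tailoring cluster boundaries in detail, with examples and algorithms, and the Unicode standard is the best general source on Unicode algorithms and string processing.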

If you are interested in grapheme cluster boundaries from the point of view of cursor movement and insertion point positioning, and you are using the Cocoa text system, you should know that on OS X v10.5 and later, NSLayoutManager has API support for determining insertion point positions within a line of text as it is laid out. Note that insertion point boundaries are not identical to glyph boundaries; a ligature glyph in some cases, such as an “fi” ligature in Latin script, may require an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.

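  • In short: for cursor movement and insertion point positioning in the Cocoa text system, NSLayoutManager on OS X v10.5 and later has API for determining insertion point positions within a laid-out line of text. Insertion point boundaries are not the same as glyph boundaries; a ligature glyph such as Latin "fi" may need an internal insertion point on a user-perceived character boundary. See Cocoa Text Architecture Guide for more information.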