UnicodeStandard-12.0
Chapter 3 Conformance
-
3-6 Combination
Combining Character Sequences
D50 Graphic character: A character with the General Category of Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
• Graphic characters specifically exclude the line and paragraph separators (Zl, Zp), as well as the characters with the General Category of Other (Cn, Cs, Cc, Cf).
• The interpretation of private-use characters (Co) as graphic characters or not is determined by the implementation.
• For more information, see Chapter 2, General Structure, especially Section 2.4, Code Points and Characters, and Table 2-3.
D51 Base character: Any graphic character except for those with the General Category of Combining Mark (M).
• Most Unicode characters are base characters. In terms of General Category values, a base character is any code point that has one of the following categories: Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
• Base characters do not include control characters or format controls.
• Base characters are independent graphic characters, but this does not preclude the presentation of base characters from adopting different contextual forms or participating in ligatures.
• The interpretation of private-use characters (Co) as base characters or not is determined by the implementation. However, the default interpretation of private-use characters should be as base characters, in the absence of other information.
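The category tests in D50 and D51 can be sketched in Python with the standard unicodedata module. This is an illustration only, not part of the standard: the names is_graphic, is_base, and private_use_as_base are hypothetical, and the results depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata

def is_graphic(ch):
    """D50: General Category L, M, N, P, S, or Zs."""
    gc = unicodedata.category(ch)
    return gc[0] in "LMNPS" or gc == "Zs"

def is_base(ch, private_use_as_base=True):
    """D51: a graphic character that is not a Combining Mark (M).

    Private-use characters (Co) are left to the implementation; the
    hypothetical flag mirrors the default of treating them as base characters.
    """
    gc = unicodedata.category(ch)
    if gc == "Co":
        return private_use_as_base
    return is_graphic(ch) and gc[0] != "M"

print(is_graphic("\u0301"), is_base("\u0301"))  # COMBINING ACUTE ACCENT: graphic, not base
print(is_graphic("\u2028"))                     # LINE SEPARATOR (Zl): not graphic
```

Under these assumptions a combining mark is graphic but not a base character, while controls, format characters, and the line and paragraph separators are neither.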
D51a Extended base: Any base character, or any standard Korean syllable block.
• This term is defined to take into account the fact that sequences of Korean conjoining jamo characters behave as if they were a single Hangul syllable character, so that the entire sequence of jamos constitutes a base.
• For the definition of standard Korean syllable block, see D134 in Section 3.12, Conjoining Jamo Behavior.
D52 Combining character: A character with the General Category of Combining Mark (M).
• Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
• All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
• The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
• These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
• The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either ZERO WIDTH JOINER or ZERO WIDTH NON-JOINER. The combining character is said to apply to that base character.
• There may be no such base character, such as when a combining character is at the start of text or follows a control or format character (for example, a carriage return, tab, or right-left mark). In such cases, the combining characters are called isolated combining characters.
• With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
• The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
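The relationship between General Category and canonical combining class stated under D52 can be checked with a short Python sketch. The helper name mark_info is hypothetical, and the values shown are those of the Unicode Character Database as exposed by the runtime's unicodedata module.

```python
import unicodedata

def mark_info(ch):
    """Return (General Category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def is_combining(ch):
    """D52: General Category Mn, Mc, or Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

for cp in (0x0301, 0x093F, 0x20DD):
    gc, ccc = mark_info(chr(cp))
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={gc} ccc={ccc}")
```

U+0301 has a non-zero combining class, whereas U+093F (a spacing mark) and U+20DD (an enclosing mark) are combining characters whose combining class is zero.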
D53 Nonspacing mark: A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).
• The position of a nonspacing mark in presentation depends on its base character. It generally does not consume space along the visual baseline in and of itself.
• Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an “i” may affect spacing (“î”), as might the character U+20DD COMBINING ENCLOSING CIRCLE.
D54 Enclosing mark: A nonspacing mark with the General Category of Enclosing Mark (Me).
• Enclosing marks are a subclass of nonspacing marks that surround a base character, rather than merely being placed over, under, or through it.
D55 Spacing mark: A combining character that is not a nonspacing mark.
• Examples include U+093F DEVANAGARI VOWEL SIGN I. In general, the behavior of spacing marks does not differ greatly from that of base characters.
• Spacing marks such as U+0BCA TAMIL VOWEL SIGN O may be rendered on both sides of a base character, but are not enclosing marks.
D56 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
• When identifying a combining character sequence in Unicode text, the definition of the combining character sequence is applied maximally. For example, in the sequence <c, dot-below, caron, acute, a>, the entire sequence <c, dot-below, caron, acute> is identified as the combining character sequence, rather than the alternative of identifying <c, dot-below> as a combining character sequence followed by a separate (defective) combining character sequence <caron, acute>.
D56a Extended combining character sequence: A maximal character sequence consisting of either an extended base followed by a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER; or a sequence of one or more characters where each is a combining character, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
• Combining character sequence is commonly abbreviated as CCS, and extended combining character sequence is commonly abbreviated as ECCS.
D57 Defective combining character sequence: A combining character sequence that does not start with a base character.
• Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are not ill-formed. (See D84.)
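The maximal matching of D56 and the defective case of D57 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, private-use and unassigned code points are treated like controls rather than as base characters, a lone base character is emitted as its own segment for convenience, and Korean syllable blocks (D56a) are not handled.

```python
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"

def is_mark_or_joiner(ch):
    # Combining character (Mn, Mc, Me), ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
    return unicodedata.category(ch)[0] == "M" or ch in (ZWJ, ZWNJ)

def is_base(ch):
    gc = unicodedata.category(ch)
    return gc[0] in "LNPS" or gc == "Zs"

def split_sequences(text):
    """Return a list of (segment, is_defective) pairs, matching maximally."""
    out, i, n = [], 0, len(text)
    while i < n:
        start = i
        if is_mark_or_joiner(text[i]):
            defective = True              # no base character starts this run
        else:
            defective = False
            i += 1                        # a base, or a control/format character
            if not is_base(text[start]):
                out.append((text[start:i], False))
                continue
        while i < n and is_mark_or_joiner(text[i]):
            i += 1                        # absorb the whole run of marks and joiners
        out.append((text[start:i], defective))
    return out

# <c, dot-below, caron, acute, a> from the example above
print(split_sequences("c\u0323\u030C\u0301a"))
# A mark following a control character forms a defective sequence
print(split_sequences("\n\u0301x"))
```

In the first call the three marks stay attached to "c" as one sequence; in the second, the acute accent after the line feed has no base character and is reported as defective.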
Grapheme Clusters
D58 Grapheme base: A character with the property Grapheme_Base, or any standard Korean syllable block.
• Characters with the property Grapheme_Base include all base characters (with the exception of U+FF9E..U+FF9F) plus most spacing marks.
• The concept of a grapheme base is introduced to simplify discussion of the graphical application of nonspacing marks to other elements of text. A grapheme base may consist of a spacing (combining) mark, which distinguishes it from a base character per se. A grapheme base may also itself consist of a sequence of characters, in the case of the standard Korean syllable block.
• For the definition of standard Korean syllable block, see D134 in Section 3.12, Conjoining Jamo Behavior.
D59 Grapheme extender: A character with the property Grapheme_Extend.
• Grapheme extender characters consist of all nonspacing marks, ZERO WIDTH JOINER, ZERO WIDTH NON-JOINER, U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK, U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK, and a small number of spacing marks.
• A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
• ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
• The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
• The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
• The Grapheme_Extend property is used in the derivation of the set of characters with the value Grapheme_Cluster_Break = Extend, but is not identical to it. See Section 3, “Grapheme Cluster Boundaries” in UAX #29 for details.
D60 Grapheme cluster: The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.”
• This definition of “grapheme cluster” is generic. The specification of grapheme cluster boundary segmentation in UAX #29 includes two alternatives, for “extended grapheme clusters” and for “legacy grapheme clusters.” Furthermore, the segmentation algorithm in UAX #29 is tailorable.
• The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
• A grapheme cluster is similar, but not identical, to a combining character sequence. A combining character sequence starts with a base character and extends across any subsequent sequence of combining marks, nonspacing or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching.
• A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.
• For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties. For example, <x, macron> behaves in line breaking or bidirectional layout as if it were the character x.
D61 Extended grapheme cluster: The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.”
• Extended grapheme clusters are defined in a parallel manner to legacy grapheme clusters, but also include sequences of spacing marks.
• Grapheme clusters and extended grapheme clusters may not have any particular linguistic significance, but are used to break up a string of text into units for processing.
• Grapheme clusters and extended grapheme clusters may be adjusted for particular processing requirements, by tailoring the rules for grapheme cluster segmentation specified in Unicode Standard Annex #29, “Unicode Text Segmentation.”
Application of Combining Marks
A number of principles in the Unicode Standard relate to the application of combining marks. These principles are listed in this section, with an indication of which are considered to be normative and which are considered to be guidelines.
In particular, guidelines for rendering of combining marks in conjunction with other characters should be considered as appropriate for defining default rendering behavior, in the absence of more specific information about rendering. It is often the case that combining marks in complex scripts or even particular, general-use nonspacing marks will have rendering requirements that depart significantly from the general guidelines. Rendering processes should, as appropriate, make use of available information about specific typographic practices and conventions so as to produce the best rendering of text.
To help in the clarification of the principles regarding the application of combining marks, a distinction is made between dependence and graphical application.
D61a Dependence: A combining mark is said to depend on its associated base character.
• The associated base character is the base character in the combining character sequence that a combining mark is part of.
• A combining mark in a defective combining character sequence has no associated base character and thus cannot be said to depend on any particular base character. This is one of the reasons why fallback processing is required for defective combining character sequences.
• Dependence concerns all combining marks, including spacing marks and combining marks that have no visible display.
D61b Graphical application: A nonspacing mark is said to apply to its associated grapheme base.
• The associated grapheme base is the grapheme base in the grapheme cluster that a nonspacing mark is part of.
• A nonspacing mark in a defective combining character sequence is not part of a grapheme cluster and is subject to the same kinds of fallback processing as for any defective combining character sequence.
• Graphical application concerns visual rendering issues and thus is an issue for nonspacing marks that have visible glyphs. Those glyphs interact, in rendering, with their grapheme base.
Throughout the text of the standard, whenever the situation is clear, discussion of combining marks often simply talks about combining marks “applying” to their base. In the prototypical case of a nonspacing accent mark applying to a single base character letter, this simplification is not problematical, because the nonspacing mark both depends (notionally) on its base character and simultaneously applies (graphically) to its grapheme base, affecting its display. The finer distinctions are needed when dealing with the edge cases, such as combining marks that have no display glyph, graphical application of nonspacing marks to Korean syllables, and the behavior of spacing combining marks.
The distinction made here between notional dependence and graphical application does not preclude spacing marks or even sequences of base characters from having effects on neighboring characters in rendering. Thus spacing forms of dependent vowels (matras) in Indic scripts may trigger particular kinds of conjunct formation or may be repositioned in ways that influence the rendering of other characters. (See Chapter 12, South and Central Asia-I, for many examples.) Similarly, sequences of base characters may form ligatures in rendering. (See “Cursive Connection and Ligatures” in Section 23.2, Layout Controls.)
The following listing specifies the principles regarding application of combining marks. Many of these principles are illustrated in Section 2.11, Combining Characters, and Section 7.9, Combining Marks.
P1 [Normative] Combining character order: Combining characters follow the base character on which they depend.
• This principle follows from the definition of a combining character sequence.
• Thus the character sequence <U+0061 “a” LATIN SMALL LETTER A, U+0308 “◌̈” COMBINING DIAERESIS, U+0075 “u” LATIN SMALL LETTER U> is unambiguously interpreted (and displayed) as “äu”, not “aü”. See Figure 2-18.
P2 [Guideline] Inside-out application. Nonspacing marks with the same combining class are generally positioned graphically outward from the grapheme base to which they apply.
• The most numerous and important instances of this principle involve nonspacing marks applied either directly above or below a grapheme base. See Figure 2-21.
• In a sequence of two nonspacing marks above a grapheme base, the first nonspacing mark is placed directly above the grapheme base, and the second is then placed above the first nonspacing mark.
• In a sequence of two nonspacing marks below a grapheme base, the first nonspacing mark is placed directly below the grapheme base, and the second is then placed below the first nonspacing mark.
• This rendering behavior for nonspacing marks can be generalized to sequences of any length, although practical considerations usually limit such sequences to no more than two or three marks above and/or below a grapheme base.
• The principle of inside-out application is also referred to as default stacking behavior for nonspacing marks.
P3 [Guideline] Side-by-side application. Notwithstanding the principle of inside-out application, some specific nonspacing marks may override the default stacking behavior and are positioned side-by-side over (or under) a grapheme base, rather than stacking vertically.
• Such side-by-side positioning may reflect language-specific orthographic rules, such as for Vietnamese diacritics and tone marks or for polytonic Greek breathing and accent marks. See Table 2-6.
• Side-by-side positioning may also reflect certain writing conventions, such as for titlo letters in the Old Church Slavonic manuscript tradition.
• When positioned side-by-side, the visual rendering order of a sequence of nonspacing marks reflects the dominant order of the script with which they are used. Thus, in Greek, the first nonspacing mark in such a sequence will be positioned to the left side above a grapheme base, and the second to the right side above the grapheme base. In Hebrew, the opposite positioning is used for side-by-side placement.
• The combining parentheses diacritical marks U+1ABB..U+1ABD are also positioned in a side-by-side manner, surrounding other diacritics, as described in the subsection “Combining Diacritical Marks Extended: U+1AB0–U+1AFF” in Section 7.9, Combining Marks.
P4 [Guideline] Traditional typographical behavior will sometimes override the default placement or rendering of nonspacing marks.
• Because of typographical conflict with the descender of a base character, a combining comma below placed on a lowercase “g” is traditionally rendered as if it were an inverted comma above. See Figure 7-1.
• Because of typographical conflict with the ascender of a base character, a combining háček (caron) is traditionally rendered as an apostrophe when placed, for example, on a lowercase “d”. See Figure 7-1.
• The relative placement of vowel marks in Arabic cannot be predicted by default stacking behavior alone, but depends on traditional rules of Arabic typography. See Figure 9-5.
P5 [Normative] Nondistinct order. Nonspacing marks with different, non-zero combining classes may occur in different orders without affecting either the visual display of a combining character sequence or the interpretation of that sequence.
• For example, if one nonspacing mark occurs above a grapheme base and another nonspacing mark occurs below it, they will have distinct combining classes. The order in which they occur in the combining character sequence does not matter for the display or interpretation of the resulting grapheme cluster.
• The introduction of the combining class for characters and its use in canonical ordering in the standard is to precisely define canonical equivalence and thereby clarify exactly which such alternate sequences must be considered as identical for display and interpretation. See Figure 2-24.
• In cases of nondistinct order, the order of combining marks has no linguistic significance. The order does not reflect how “closely bound” they are to the base. After canonical reordering, the order may no longer reflect the typed-in sequence. Rendering systems should be prepared to deal with common typed-in sequences and with canonically reordered sequences. See Table 5-3.
• Inserting a combining grapheme joiner between two combining marks with nondistinct order prevents their canonical reordering. For more information, see “Combining Grapheme Joiner” in Section 23.2, Layout Controls.
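Nondistinct order can be observed with any normalizer that performs canonical reordering. The following Python sketch uses unicodedata.normalize from the standard library; the choice of marks and the variable names are illustrative assumptions, not taken from the standard.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # combining class 230
DOT_BELOW = "\u0323"   # combining class 220
ACUTE = "\u0301"       # combining class 230
GRAVE = "\u0300"       # combining class 230
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0

def nfd(text):
    return unicodedata.normalize("NFD", text)

# Different non-zero classes: either typed order yields the same canonical order.
above_first = "q" + DOT_ABOVE + DOT_BELOW
below_first = "q" + DOT_BELOW + DOT_ABOVE
same_after_reordering = nfd(above_first) == nfd(below_first)

# Same class: the order is distinctive and is preserved.
same_class_stays_distinct = nfd("q" + ACUTE + GRAVE) != nfd("q" + GRAVE + ACUTE)

# A CGJ between the marks prevents them from being reordered.
blocked = "q" + DOT_ABOVE + CGJ + DOT_BELOW
cgj_blocks_reordering = nfd(blocked) == blocked

print(same_after_reordering, same_class_stays_distinct, cgj_blocks_reordering)
```

All three values are True: marks of different classes are reordered into one canonical order, marks of the same class keep their typed order, and the joiner (class 0) stops reordering across it.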
P6 [Guideline] Enclosing marks surround their grapheme base and any intervening nonspacing marks.
• This implies that enclosing marks successively surround previous enclosing marks. See Figure 3-1.
• Dynamic application of enclosing marks, particularly sequences of enclosing marks, is beyond the capability of most fonts and simple rendering processes. It is not unexpected to find fallback rendering in cases such as that illustrated in Figure 3-1.
P7 [Guideline] Double diacritic nonspacing marks, such as U+0360 COMBINING DOUBLE TILDE, apply to their grapheme base, but are intended to be rendered with glyphs that encompass a following grapheme base as well.
• Because such double diacritic display spans combinations of elements that would otherwise be considered grapheme clusters, the support of double diacritics in rendering may involve special handling for cursor placement and text selection. See Figure 7-9 for an example.
P8 [Guideline] When double diacritic nonspacing marks interact with normal nonspacing marks in a grapheme cluster, they “float” to the outermost layer of the stack of rendered marks (either above or below).
• This behavior can be conceived of as a kind of looser binding of such double diacritics to their bases. In effect, all other nonspacing marks are applied first, and then the double diacritic will span the resulting stacks. See Figure 7-10 for an example.
• Double diacritic nonspacing marks are also given a very high combining class, so that in canonical order they appear at or near the end of any combining character sequence. Figure 7-11 shows an example of the use of CGJ to block this reordering.
• The interaction of enclosing marks and double diacritics is not well defined graphically. Many fonts and rendering processes may not be able to handle combinations of these marks. It is not recommended to use combinations of these together in the same grapheme cluster.
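The effect of the high combining class of double diacritics on canonical order can be seen in a brief Python sketch. The variable names and the sample string are assumptions for illustration; the combining class values come from the Unicode Character Database via the unicodedata module.

```python
import unicodedata

DOUBLE_TILDE = "\u0360"  # COMBINING DOUBLE TILDE
ACUTE = "\u0301"         # COMBINING ACUTE ACCENT

double_class = unicodedata.combining(DOUBLE_TILDE)
acute_class = unicodedata.combining(ACUTE)

# Typed with the double diacritic first, then an ordinary accent on the same base.
typed = "a" + DOUBLE_TILDE + ACUTE + "b"
canonical = unicodedata.normalize("NFD", typed)

print(double_class, acute_class)
print([f"U+{ord(c):04X}" for c in canonical])
```

Because the double tilde has a higher combining class than the acute accent, canonical ordering moves it after the acute, to the end of the combining character sequence on "a".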
This treatment of the application of combining marks with respect to Korean syllables follows from the implications of canonical equivalence. It should be noted, however, that older implementations may have supported the application of an enclosing combining mark to an entire Indic consonant conjunct or to a sequence of grapheme clusters linked together by combining grapheme joiners. Such an approach has a number of technical problems and leads to interoperability defects, so it is strongly recommended that implementations do not follow it.
For more information on the recommended use of the combining grapheme joiner, see the subsection “Combining Grapheme Joiner” in Section 23.2, Layout Controls. For more discussion regarding the application of combining marks in general, see Section 7.9, Combining Marks.
-
3-7 Decomposition
D62 Decomposition mapping: A mapping from a character to a sequence of one or more characters that is a canonical or compatibility equivalent, and that is listed in the character names list or described in Section 3.12, Conjoining Jamo Behavior.
• Each character has at most one decomposition mapping. The mappings in Section 3.12, Conjoining Jamo Behavior, are canonical mappings. The mappings in the character names list are identified as either canonical or compatibility mappings (see Section 24.1, Character Names List).
D63 Decomposable character: A character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior.
• A decomposable character is also referred to as a precomposed character or composite character.
• The decomposition mappings from the Unicode Character Database are also given in Section 24.1, Character Names List.
D64 Decomposition: A sequence of one or more characters that is equivalent to a decomposable character. A full decomposition of a character sequence results from decomposing each of the characters in the sequence until no characters can be further decomposed.
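The recursive nature of a full decomposition (D64) can be sketched in Python using unicodedata.decomposition, which returns the raw mapping field from the Unicode Character Database. The function name full_decomposition and its compat flag are hypothetical. The sketch makes two simplifying assumptions: it does not reorder marks, and it does not cover the arithmetic Hangul syllable mappings of Section 3.12, for which this call returns an empty mapping.

```python
import unicodedata

def full_decomposition(text, compat=False):
    """Apply decomposition mappings recursively until nothing decomposes further.

    With compat=False only canonical mappings are applied; with compat=True
    compatibility mappings (tagged like "<compat>") are applied as well.
    """
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if not mapping:
            out.append(ch)                     # no mapping: keep as is
            continue
        fields = mapping.split()
        is_compat = fields[0].startswith("<")  # e.g. "<compat> 03BC"
        if is_compat and not compat:
            out.append(ch)
            continue
        if is_compat:
            fields = fields[1:]                # drop the formatting tag
        expansion = "".join(chr(int(f, 16)) for f in fields)
        out.append(full_decomposition(expansion, compat))
    return "".join(out)

# U+1EA5 maps to <U+00E2, U+0301>, and U+00E2 maps further to <U+0061, U+0302>.
print([f"U+{ord(c):04X}" for c in full_decomposition("\u1EA5")])
```

The example needs two rounds of mapping, which is why the definition speaks of decomposing until no character can be further decomposed.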
Compatibility Decomposition
D65 Compatibility decomposition: The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.11, Normalization Forms.
• The decomposition mappings from the Unicode Character Database are also given in Section 24.1, Character Names List.
• Some compatibility decompositions remove formatting information.
D66 Compatibility decomposable character: A character whose compatibility decomposition is not identical to its canonical decomposition. It may also be known as a compatibility precomposed character or a compatibility composite character.
• For example, U+00B5 MICRO SIGN has no canonical decomposition mapping, so its canonical decomposition is the same as the character itself. It has a compatibility decomposition to U+03BC GREEK SMALL LETTER MU. Because MICRO SIGN has a compatibility decomposition that is not equal to its canonical decomposition, it is a compatibility decomposable character.
• For example, U+03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL canonically decomposes to the sequence <U+03D2 GREEK UPSILON WITH HOOK SYMBOL, U+0301 COMBINING ACUTE ACCENT>. That sequence has a compatibility decomposition of <U+03A5 GREEK CAPITAL LETTER UPSILON, U+0301 COMBINING ACUTE ACCENT>. Because GREEK UPSILON WITH ACUTE AND HOOK SYMBOL has a compatibility decomposition that is not equal to its canonical decomposition, it is a compatibility decomposable character.
• This term should not be confused with the term “compatibility character,” which is discussed in Section 2.3, Compatibility Characters.
• Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.
• Some widely used and indispensable characters, such as NBSP, are compatibility decomposable characters for historical reasons. Their use is not discouraged.
• A large number of compatibility decomposable characters are used in phonetic and mathematical notation, where their use is not discouraged.
• For historical reasons, some characters that might have been given a compatibility decomposition were not, in fact, decomposed. The Normalization Stability Policy prohibits adding decompositions for such cases in the future, so that normalization forms will stay stable. See the subsection “Policies” in Section B.3, Other Unicode Online Resources.
• Replacing a compatibility decomposable character by its compatibility decomposition may lose round-trip convertibility with a base standard.
D67 Compatibility equivalent: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.
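The tests in D66 and D67 can be expressed in terms of the NFD and NFKD normalization forms, which produce the full canonical and compatibility decompositions. The following Python sketch assumes the standard unicodedata module; the function names are hypothetical.

```python
import unicodedata

def canonical_decomposition(text):
    return unicodedata.normalize("NFD", text)

def compatibility_decomposition(text):
    return unicodedata.normalize("NFKD", text)

def is_compatibility_decomposable(ch):
    """D66: the compatibility decomposition differs from the canonical one."""
    return compatibility_decomposition(ch) != canonical_decomposition(ch)

def compatibility_equivalent(a, b):
    """D67: the full compatibility decompositions are identical."""
    return compatibility_decomposition(a) == compatibility_decomposition(b)

for ch in ("\u00B5", "\u03D3", "\u00A0", "\u00E0"):
    nfd = " ".join(f"U+{ord(c):04X}" for c in canonical_decomposition(ch))
    nfkd = " ".join(f"U+{ord(c):04X}" for c in compatibility_decomposition(ch))
    print(f"U+{ord(ch):04X}: NFD=<{nfd}> NFKD=<{nfkd}> "
          f"compat-decomposable={is_compatibility_decomposable(ch)}")
```

MICRO SIGN, GREEK UPSILON WITH ACUTE AND HOOK SYMBOL, and NBSP are reported as compatibility decomposable, while U+00E0, which has only a canonical decomposition, is not.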
Canonical Decomposition
D68 Canonical decomposition: The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Section 3.12, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.11, Normalization Forms.
• The decomposition mappings from the Unicode Character Database are also printed in Section 24.1, Character Names List.
• A canonical decomposition does not remove formatting information.
D69 Canonical decomposable character: A character that is not identical to its canonical decomposition. It may also be known as a canonical precomposed character or a canonical composite character.
• For example, U+00E0 LATIN SMALL LETTER A WITH GRAVE is a canonical decomposable character because its canonical decomposition is to the sequence <U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT>. U+212A KELVIN SIGN is a canonical decomposable character because its canonical decomposition is to U+004B LATIN CAPITAL LETTER K.
D70 Canonical equivalent: Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.
• For example, the sequences <o, combining-diaeresis> and <ö> are canonical equivalents. Canonical equivalence is a Unicode property. It should not be confused with language-specific collation or matching, which may add other equivalencies. For example, in Swedish, ö is treated as a completely different letter from o and is collated after z. In German, ö is weakly equivalent to oe and is collated with oe. In English, ö is just an o with a diacritic that indicates that it is pronounced separately from the previous letter (as in coöperate) and is collated with o.
• By definition, all canonical-equivalent sequences are also compatibility-equivalent sequences.
For information on the use of decomposition in normalization, see Section 3.11, Normalization Forms.
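Canonical equivalence (D70) can be tested by comparing full canonical decompositions, which the NFD normalization form provides. A minimal Python sketch follows, assuming the standard unicodedata module; the function names are hypothetical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """D70: the full canonical decompositions are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """D67: the full compatibility decompositions are identical."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

pairs = [
    ("o\u0308", "\u00F6"),  # <o, combining diaeresis> and ö
    ("\u212A", "K"),        # KELVIN SIGN and LATIN CAPITAL LETTER K
    ("\u00B5", "\u03BC"),   # MICRO SIGN and GREEK SMALL LETTER MU
]
for a, b in pairs:
    print(canonically_equivalent(a, b), compatibility_equivalent(a, b))
```

The first two pairs are canonical equivalents and therefore also compatibility equivalents. The third pair is compatibility equivalent only, which shows that the implication does not run in the other direction.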
-
3-8 Surrogates
D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
D72 High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.
D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
D74 Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.
• High-surrogate and low-surrogate code points are designated only for that use.
• High-surrogate and low-surrogate code units are used only in the context of the UTF-16 character encoding form.
D75 Surrogate pair: A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.
• Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode Encoding Forms.)
• Isolated surrogate code units have no interpretation on their own. Certain other isolated code units in other encoding forms also have no interpretation on their own. For example, the isolated byte 80₁₆ has no interpretation in UTF-8; it can be used only as part of a multibyte sequence. (See Table 3-7.)
• Sometimes high-surrogate code units are referred to as leading surrogates. Low-surrogate code units are then referred to as trailing surrogates. This is analogous to usage in UTF-8, which has leading bytes and trailing bytes.
• For more information, see Section 23.6, Surrogates Area, and Section 5.4, Handling Surrogate Pairs in UTF-16.
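The relationship between a supplementary code point and its surrogate pair can be sketched in Python. The arithmetic below follows the UTF-16 encoding form defined in Section 3.9 rather than anything stated in this section, and the function names are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> DC00..DFFF
    return high, low

def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate code unit into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
print(chr(0x1F600).encode("utf-16-be").hex().upper())
```

Both lines print the same four hexadecimal digits per code unit, since Python's UTF-16 encoder produces the same pair of code units for a supplementary character.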