Mapping of Unicode characters

From Wikinfo


Unicode’s Universal Character Set potentially supports over 1 million code points (1,114,112 = 2^20 + 2^16, or 17 × 2^16; hexadecimal 110000).

As of Unicode 5.0.0, 102,012 (9.2%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 872,582 (78.3%) unassigned. The number of assigned code points is made up as follows:

(See the summary table for a more detailed breakdown).

Unicode characters can be categorized in many ways. Every character is assigned a script (though many are assigned the common or inherited scripts, where they inherit the script from the adjacent character). In Unicode a script is a coherent writing system that includes letters but may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages.

Characters are assigned in blocks. Block boundaries usually fall on multiples of 16 code points: many blocks, for example, span 128 or 256 code points. Every character is also assigned a general category and subcategory. The general categories are: letter, mark, number, punctuation, symbol, separator, or other (control, formatting, and other non-graphical code points).
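As a minimal sketch of the general category property, Python's standard unicodedata module reports the two-letter category code of a character; the first letter gives the major class and the second the subcategory. The MAJOR_CLASS table and the describe helper are illustrative names, not part of Unicode or of Python.

```python
import unicodedata

# Major classes keyed by the first letter of the two-letter category code.
# The wording of the labels is an illustrative choice.
MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other (control, format, unassigned, ...)",
}


def describe(ch):
    """Return (category code, major class label) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASS[code[0]]


if __name__ == "__main__":
    for ch in ["A", "\u0301", "7", "!", "+", " ", "\n"]:
        code, label = describe(ch)
        print(f"U+{ord(ch):04X}  {code}  {label}")
```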

The blocks of characters are assigned to planes. Most characters are currently assigned to the first plane, the Basic Multilingual Plane. This helps ease the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets (16 bits). The characters outside the first plane usually have very specialized or rare use.

The first 256 code points correspond with those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode refers to these as a Latin script block, these two blocks contain many characters that are commonly useful outside of the Latin script.
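The correspondence with ISO 8859-1 and ASCII can be checked with a short sketch that relies only on Python's built-in "latin-1" and "ascii" codecs. The variable names are illustrative.

```python
# Decode every possible byte value as ISO 8859-1 (Python codec "latin-1").
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")

# Each byte value maps to the Unicode code point with the same number.
same_as_latin1 = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))

# The first 128 code points coincide with ASCII.
ascii_text = bytes(range(128)).decode("ascii")
same_as_ascii = ascii_text == decoded[:128]

if __name__ == "__main__":
    print(same_as_latin1, same_as_ascii)
```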


Planes

Unicode code points are logically divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:

  • Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP). This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
  • Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP).
  • Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP)
  • Planes 3 to 13 (30000–DFFFF) are unassigned
  • Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP)
  • Plane 15 (F0000–FFFFF) reserved for the Private Use Area (PUA)
  • Plane 16 (100000–10FFFF), reserved for the Private Use Area (PUA)
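Because each plane holds exactly 2^16 code points, the plane number of a code point is simply its value shifted right by 16 bits. The sketch below assumes the plane abbreviations from the list above; plane_of and plane_name are hypothetical helper names.

```python
# Abbreviations taken from the plane list above; other planes are unassigned.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA", 16: "PUA"}


def plane_of(code_point):
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane spans 0x10000 code points


def plane_name(code_point):
    """Return the plane abbreviation, or 'unassigned' for planes 3 to 13."""
    return PLANE_NAMES.get(plane_of(code_point), "unassigned")


if __name__ == "__main__":
    for cp in (0x0041, 0x1D11E, 0x20000, 0xE0001, 0x10FFFD):
        print(f"U+{cp:04X} -> plane {plane_of(cp)} ({plane_name(cp)})")
```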

Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively blocked out for every current and ancient writing system (script) the Unicode Consortium has been able to identify (see [1]). While Unicode may eventually need to use another of the 11 spare planes for ideographic characters, other planes would remain available if previously unknown scripts with tens of thousands of characters were discovered. The limit of 1,114,112 code points is therefore unlikely to be reached in the near future.

Basic Multilingual Plane

The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.

[Figure: Roadmap of the Unicode Basic Multilingual Plane. Each numbered box represents 256 code points.]

The colours used in the roadmap are:

  • Black = Latin scripts and symbols
  • Light Blue = Linguistic scripts
  • Blue = Other European scripts
  • Orange = Middle Eastern and SW Asian scripts
  • Light Orange = African scripts
  • Green = South Asian scripts
  • Purple = Southeast Asian scripts
  • Red = East Asian scripts
  • Light Red = Unified CJK Han
  • Yellow = Canadian Aboriginal scripts
  • Magenta = Symbols
  • Dark Grey = Diacritics
  • Light Grey = UTF-16 surrogates and private use
  • Cyan = Miscellaneous characters
  • White = Unused


As of Unicode 5.0, the BMP includes the following blocks:

  • Supplemental Punctuation (2E00–2E7F)
  • CJK Radicals Supplement (2E80–2EFF)
  • Kangxi Radicals (2F00–2FDF)
  • Ideographic Description Characters (2FF0–2FFF)
  • CJK Symbols and Punctuation (3000–303F)
  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Bopomofo (3100–312F)
  • Hangul Compatibility Jamo (3130–318F)
  • Kanbun (3190–319F)
  • Bopomofo Extended (31A0–31BF)
  • CJK Strokes (31C0–31EF)
  • Katakana Phonetic Extensions (31F0–31FF)
  • Enclosed CJK Letters and Months (3200–32FF)
  • CJK Compatibility (3300–33FF)
  • CJK Unified Ideographs Extension A (3400–4DBF)
  • Yijing Hexagram Symbols (4DC0–4DFF)
  • CJK Unified Ideographs (4E00–9FFF)
  • Yi Syllables (A000–A48F)
  • Yi Radicals (A490–A4CF)
  • Modifier Tone Letters (A700–A71F)
  • Latin Extended-D (A720–A7FF)
  • Syloti Nagri (A800–A82F)
  • Phags-pa (A840–A87F)
  • Hangul Syllables (AC00–D7AF)
  • High Surrogates (D800–DB7F)
  • High Private Use Surrogates (DB80–DBFF)
  • Low Surrogates (DC00–DFFF)
  • Private Use Area (E000–F8FF)
  • CJK Compatibility Ideographs (F900–FAFF)
  • Alphabetic Presentation Forms (FB00–FB4F)
  • Arabic Presentation Forms-A (FB50–FDFF)
  • Variation Selectors (FE00–FE0F)
  • Vertical Forms (FE10–FE1F)
  • Combining Half Marks (FE20–FE2F)
  • CJK Compatibility Forms (FE30–FE4F)
  • Small Form Variants (FE50–FE6F)
  • Arabic Presentation Forms-B (FE70–FEFF)
  • Halfwidth and Fullwidth Forms (FF00–FFEF)
  • Specials (FFF0–FFFF)
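Since blocks are contiguous, non-overlapping ranges, finding the block of a code point is a range lookup. The sketch below uses a binary search over a deliberately small, hypothetical table containing only a handful of the ranges listed above; a real table would hold every block.

```python
from bisect import bisect_right

# A small excerpt of the block list above, sorted by starting code point.
BLOCKS = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xE000, 0xF8FF, "Private Use Area"),
]
STARTS = [start for start, _end, _name in BLOCKS]


def block_of(code_point):
    """Return the block name for code_point, or None if not in the table."""
    i = bisect_right(STARTS, code_point) - 1  # last block starting <= cp
    if i >= 0:
        _start, end, name = BLOCKS[i]
        if code_point <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in "あカ中한A":
        print(f"U+{ord(ch):04X}", block_of(ord(ch)))
```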

Future additions

Several scripts are expected to be included in the BMP in the next revision of Unicode. These scripts, and their proposed code point ranges, are the following:

Several other scripts are proposed for inclusion in the BMP, including:

Supplementary Multilingual Plane

Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.

As of Unicode 5.0, Plane One includes the following scripts:

Many other scripts are proposed for inclusion in Plane One, including:

Supplementary Ideographic Plane

Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40,000 Unified Han Ideographs that have previously been seldom used in daily written communications.

Unused planes

Unicode has not yet assigned any characters to Planes 3 through 13. It is not anticipated that these planes will be needed, given the total sizes of the known writing systems left to be encoded. However, the number of possible symbol characters that could arise outside of the context of writing systems is potentially limitless. The UCS and Unicode take requests for symbols on a case-by-case basis.

Supplementary Special-purpose Plane

Plane 14 (E in hexadecimal), the Supplementary Special-purpose Plane (SSP), currently contains non-graphical characters in two blocks of 128 and 240 characters. The first block is for language tag characters for use when language cannot be indicated through other protocols (such as the xml:lang attribute in XML). The other block contains glyph variation selectors to indicate an alternate glyph for a character that cannot be determined by context.

Private use planes

Two planes (planes 15 and 16) have been set aside for character assignment by parties outside the ISO and the Unicode Consortium. Use of such characters will have limited interoperability. Software and fonts that support Unicode will not necessarily support character assignments made by other parties. In particular, if the characters have unusual properties, such as right-to-left directionality, other implementations may treat them inappropriately.

Plane mapping tables

Unicode mapping tables
BMP                     SMP          SIP                       SSP
0000–0FFF   8000–8FFF   10000–10FFF  20000–20FFF  28000–28FFF  E0000–E0FFF
1000–1FFF   9000–9FFF                21000–21FFF  29000–29FFF
2000–2FFF   A000–AFFF   12000–12FFF  22000–22FFF  2A000–2AFFF
3000–3FFF   B000–BFFF                23000–23FFF
4000–4FFF   C000–CFFF   1D000–1DFFF  24000–24FFF  2F000–2FFFF
5000–5FFF   D000–DFFF                25000–25FFF
6000–6FFF   E000–EFFF                26000–26FFF
7000–7FFF   F000–FFFF                27000–27FFF

Graphical characters

By far the most common Unicode characters are graphical characters. Graphical characters all have some visual representation, or glyph, associated with them. While Unicode does not specify the concrete glyphs for these characters, it does specify recommended or prototypical glyphs. The actual glyph used by textual display software will depend on the font files used and on whether those fonts provide support for contextual and non-contextual glyph variations.

Script-specific characters


Main article: Writing system

In Unicode, a script is an abstract, coherent and unified writing system supporting one or more concrete writing systems, which in turn support the written forms of one or more languages. Some scripts support one and only one language, for example Armenian. Other scripts, like Latin, support many different writing systems: English, French, German, Italian, and Latin, to name just a few. Some languages also make use of multiple alternate writing systems. Turkish, for example, was written in the Arabic script until the early 20th century, when it transitioned to Latin. For a list of languages supported by each script see the list of languages by writing system.

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”) while English has no such character. Nor does English make use of the diacritic combining ring above for any character. In general, the languages sharing the same script share many of the same characters. Despite these peripheral differences between the Swedish and English writing systems, they are said to use the same Latin script. So the Unicode abstraction of writing systems is a basic organizing technique. The differences between different alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
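The relationship between ‘å’ and the combining diacritic can be made concrete with Python's standard unicodedata module. This is a small sketch; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e5"  # å as a single code point

# Canonical decomposition (NFD) splits å into a base letter plus a mark.
decomposed = unicodedata.normalize("NFD", precomposed)
names = [unicodedata.name(ch) for ch in decomposed]

# Canonical composition (NFC) puts the pair back together.
recomposed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(names)
    print(recomposed == precomposed)
```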

While all characters have the property of belonging to a script, many characters, such as symbols, indicate “common” or “inherited” for their script property. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, the individual scripts often have their own punctuation and diacritics. So many scripts include not only letters, but also diacritic and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode already includes over 60 scripts supporting hundreds or even thousands of languages throughout the world. Unicode is actively working on many more, as indicated by its roadmap.

Unihan characters

Main article: Unihan

Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. The Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these different glyphs were treated as the same character. This unification is referred to as "Han unification", with the resulting character repertoire sometimes referred to as Unihan.

Besides the Unihan ideographs, Han unification also provides Han unified punctuation, symbols, numerals, ideograph stroke characters and ideographic description characters.

Phonetic characters


Unicode includes letters and marks from the International Phonetic Alphabet (IPA) and those supporting other phonetic writing systems. Essentially these characters are used as graphemes for phonemes. In terms of script or writing system, these phonetic alphabets are basically one writing system. What distinguishes the various phonetic alphabets are their glyphs. However, as with numerals, the UCS often focuses more on the presentational forms or glyphs given to these phonemes by the various phonetic alphabets. This is in contrast to the alternate names of these characters provided by the Unicode NamesList property, which typically reflect the common phoneme semantics shared by those various writing systems regardless of the glyphs used. These differences manifest in the two sets of names given to these characters: the canonical UCS name and the NamesList property names. Similarly, Unicode assigns the value “Latin” to the script property of many of these characters. However, the primary purpose of these characters' inclusion in the character set is to support the various phonetic writing systems. These phonetic writing systems, in many ways, constitute a single unified writing system of their own, despite borrowing glyphs from the Latin, Greek and Cyrillic scripts.

Numerals

Main article: Unicode numerals

Numerals (often called numbers in Unicode) are characters that denote a number. The same Arabic-Indic numerals are used widely in various writing systems throughout the world and all share the same semantics for denoting numbers. However, the glyphs representing these numerals differ widely from one writing system to another. To support these glyph differences, Unicode includes duplicate encodings of these numerals within many of the script blocks. These digits are repeated in 23 separate blocks, and twice in Arabic. Six additional blocks contain the digits again as rich text or legacy software compatibility characters.
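The shared semantics of these duplicated digits are exposed through each character's decimal value. The sketch below uses Python's unicodedata.decimal on the digit three from a few blocks; the selection of blocks is arbitrary and the variable names are illustrative.

```python
import unicodedata

# The digit three as encoded in several different blocks.
threes = {
    "Basic Latin": "3",
    "Arabic": "\u0663",           # ARABIC-INDIC DIGIT THREE
    "Arabic (extended)": "\u06f3",  # EXTENDED ARABIC-INDIC DIGIT THREE
    "Devanagari": "\u0969",
    "Fullwidth": "\uff13",
}

# Different code points and glyphs, same numeric meaning.
values = {block: unicodedata.decimal(ch) for block, ch in threes.items()}

# Python's int() accepts any sequence of Unicode decimal digits.
nine_ninety = int("\u0669\u0669\u0660")

if __name__ == "__main__":
    print(values)
    print(nine_ninety)
```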

Unicode also includes several less common numerals: Roman numerals, counting rod numerals, Cuneiform numerals and ancient Greek numerals.

Numerals invariably involve composition of glyphs as a limited number of characters are composed to make other numerals. For example the sequence 9 - 9 - 0 in Arabic-Indic numerals composes the numeral for nine hundred and ninety (990). In Roman numerals, the same number is expressed by the composed numeral Ⅹↀ or ⅩⅯ. Each of these is a distinct numeral for representing the same abstract number. The semantics of the numerals differ in particular in their composition. The Arabic-Indic decimal digits are positional-value compositions, while the Roman numerals are sign-value and they are additive and subtractive depending on their composition.
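The two composition schemes can be contrasted in a short sketch. It assumes a simplified sign-value rule in which a numeral standing directly before a larger one is subtracted and every other numeral is added; positional_value and sign_value are hypothetical helpers, and the numeric values come from Python's unicodedata module.

```python
import unicodedata


def positional_value(digits):
    """Evaluate decimal digits by position: each step shifts by a power of ten."""
    total = 0
    for ch in digits:
        total = total * 10 + unicodedata.decimal(ch)
    return total


def sign_value(numerals):
    """Evaluate sign-value numerals with a simple additive/subtractive rule."""
    values = [int(unicodedata.numeric(ch)) for ch in numerals]
    total = 0
    for i, value in enumerate(values):
        # A smaller value directly before a larger one is subtracted.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


if __name__ == "__main__":
    print(positional_value("990"))
    print(sign_value("\u2169\u216f"))  # ROMAN NUMERAL TEN, ONE THOUSAND
    print(sign_value("\u2169\u2180"))  # ROMAN NUMERAL TEN, ONE THOUSAND C D
```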

Punctuation and diacritics

Unicode includes several blocks for unified diacritics and other combining marks and also blocks for unified punctuation. However, when a mark or punctuation character is intended primarily for use within a particular script, the character is assigned to that particular script’s blocks. Therefore authors will find these types of characters throughout the Unicode character database. Unicode categorizes them as:

  • Punctuation
      • connector (Pc)
      • dash (Pd)
      • open (Ps)
      • close (Pe)
      • initial (Pi)
      • final (Pf)
      • other (Po)
  • Mark
      • non-spacing (Mn)
      • spacing-combining (Mc)
      • enclosing (Me)

Symbols

Unicode has dozens of blocks dedicated to symbols that are useful regardless of one’s writing system. Other script-specific symbols are often included within a particular script’s blocks. Symbols are categorized as:

  • Symbol
      • math (Sm)
      • currency (Sc)
      • modifier (Sk)
      • other (So)

Music notation

Unicode devotes a block of 256 characters to musical symbols. Since Unicode focuses on characters laid out in one dimension, these characters do not encode pitch or other parts of Western music expressed in the vertical dimension. Therefore the music symbols are more suited to discussions of music symbols themselves or to discussing rhythm within the prose of a document. To encode more complex musical information some other data format is necessary, such as MusicXML or MIDI.

Compatibility characters

In discussing Unicode and the UCS, many often refer to compatibility characters. Compatibility characters are graphical characters whose use is discouraged by the Unicode Consortium. The Unicode Consortium's glossary defines a compatibility character as:

A character that would not have been encoded except for compatibility and round-trip convertibility with other standards

However, the definition is more complicated than the glossary reveals. One of the properties given to characters by the Unicode Consortium is the character's decomposition or compatibility decomposition. Most characters have no value for this property, but over 5 thousand characters do have a compatibility decomposition mapping that compatibility character to one or more other characters. By setting a character's decomposition property, Unicode establishes that character as a compatibility character. The reasons for these compatibility designations are varied and are discussed in further detail below. The term decomposition can sometimes confuse because a character's decomposition can, in some cases, be a singleton. In these cases the decomposition of one character is simply another equivalent or approximately equivalent character.

Canonical and Non-canonical

The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Those without a keyword are termed canonical equivalent or canonical decomposable characters. These characters have the closest relationship. Other keywords include: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, and <compat>. These keywords provide some indication of the relation between the compatibility character and its compatibility decomposition character sequence.

However, the compatibility characters, whether canonical or not, fall into three basic categories:

  1. characters corresponding to multiple alternate glyph forms and precomposed diacritics, to support software and font implementations that do not include complete Unicode text layout capabilities;
  2. characters included from other character sets or otherwise added to the UCS that constitute rich text rather than the plain text goals of Unicode;
  3. some other characters that are semantically distinct, but visually similar.

Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, software users may be confused when performing a find on a page for a capital Latin letter ‘I’ and their software application fails to find the visually similar Roman numeral ‘Ⅰ’.
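The decomposition property and its keyword can be inspected with Python's unicodedata.decomposition, which returns the raw mapping as a string. The sketch below splits that string into a keyword and a list of code points; decomposition_info is a hypothetical helper, and the label "canonical" for mappings without a keyword is this sketch's own convention.

```python
import unicodedata


def decomposition_info(ch):
    """Return (keyword, [code points]) for ch, or None if it has no mapping."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None
    parts = raw.split()
    keyword = "canonical"  # no keyword means a canonical decomposition
    if parts[0].startswith("<"):
        keyword = parts.pop(0)
    return keyword, [int(p, 16) for p in parts]


if __name__ == "__main__":
    for ch in ["a", "\u00e5", "\u212b", "\u2160", "\ufb01", "\u00bd", "\u2460"]:
        print(f"U+{ord(ch):04X}", decomposition_info(ch))
```

The Angstrom sign (U+212B) is an example of a singleton: its decomposition is the single character U+00C5.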

Compatibility Blocks

Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters. These compatibility blocks contain none of the semantically distinct compatibility characters and so they fall unambiguously into the set of discouraged characters. Unicode recommends that authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using the finite set of circled or enclosed alphanumerics, to give just one example.

Unfortunately, there are a small number of characters even within the compatibility blocks that themselves are not compatibility characters and therefore may confuse authors. The “Enclosed CJK Letters and Months” block contains a single non-compatibility character: the ‘Korean Standard Symbol’ (㉿ U+327F). This symbol and 12 other characters have been included in these blocks for no known reasons. The “CJK Compatibility Ideographs” block contains these non-compatibility unified Han ideographs:

  1. (U+FA0E): 﨎
  2. (U+FA0F): 﨏
  3. (U+FA11): 﨑
  4. (U+FA13): 﨓
  5. (U+FA14): 﨔
  6. (U+FA1F): 﨟
  7. (U+FA21): 﨡
  8. (U+FA23): 﨣
  9. (U+FA24): 﨤
  10. (U+FA27): 﨧
  11. (U+FA28): 﨨
  12. (U+FA29): 﨩

These thirteen characters are not compatibility characters, and their use is not discouraged in any way.

Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:

Alphabetic Presentation Forms (1)

  1. Hebrew Point Judeo-Spanish Varika (U+FB1E): ﬞ. This is a glyph variant of Hebrew Point Rafe (U+05BF): ֿ , though Unicode provides no compatibility mapping.

Arabic Presentation Forms (4)

  1. “Ornate Left Parenthesis” (U+FD3E): ﴾. A glyph variant for U+0029 ‘)’
  2. “Ornate Right Parenthesis” (U+FD3F): ﴿. A glyph variant for U+0028 ‘(’
  3. “Ligature Bismillah Ar-Rahman Ar-Raheem” (U+FDFD): ﷽. Bismillah Ar-Rahman Ar-Raheem is a ligature for Teh Marbuta (U+0629), Lam (U+0644), Meem (U+0645), Seen (U+0633), Beh (U+0628), (بسملة)
  4. “Arabic Tail Fragment” (U+FE73): ﹳ for supporting text systems without contextual glyph handling

CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)

  1. Sesame Dot (U+FE45): ﹅
  2. White Sesame Dot (U+FE46): ﹆

Enclosed Alphanumerics (21 rich text variants)

  1. 11 Negative Circled Numbers (0 and 11 through 20) (U+24FF and U+24EB through U+24F4): ⓿, ⓫ – ⓴
  2. 10 Double Circled Numbers (1 through 10) (U+24F5 through U+24FE): ⓵ – ⓾

Compatibility characters and normalization

Main article: Unicode normalization

Normalization is the process by which Unicode-conforming software first performs compatibility decomposition before making comparisons or collating text strings. This is similar to other operations needed when, for example, a user performs a case-insensitive or diacritic-insensitive search within some text. In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).
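A lossless use of normalization can be sketched as follows: the stored text is left untouched, and a normalized key is derived only for the comparison. The choice of the NFKC form and the helper names compat_key and compat_find are assumptions of this sketch, not a prescribed method.

```python
import unicodedata


def compat_key(text):
    """Return a comparison key with compatibility differences folded away."""
    return unicodedata.normalize("NFKC", text)


def compat_find(haystack, needle):
    """Report whether needle occurs in haystack after normalizing both."""
    return compat_key(needle) in compat_key(haystack)


stored = "Chapter \u2160"  # ends with ROMAN NUMERAL ONE, not the letter I

plain_hit = "I" in stored             # naive search misses the numeral
normalized_hit = compat_find(stored, "I")

if __name__ == "__main__":
    print(plain_hit, normalized_hit)
    print(stored.endswith("\u2160"))  # the stored text itself is unchanged
```

A lossy variant would instead overwrite the stored string with its normalized form, after which the original Roman numeral could no longer be recovered.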

Non-graphical characters

Many characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000) is used in C programming environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string. The string ends once the program reads the null character.
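The null-terminated convention can be observed from Python through the standard ctypes module, which exposes C-style character buffers. In this sketch the buffer contents and size are arbitrary illustrative values.

```python
import ctypes

# A 16-byte C character buffer; the bytes after the first NUL are leftovers.
buf = ctypes.create_string_buffer(b"plane\x00junk", 16)

# .value reads the buffer the way C string functions do: up to the first NUL.
c_string = buf.value

# .raw returns every byte of the buffer, including the NUL and what follows.
whole = buf.raw

if __name__ == "__main__":
    print(c_string)
    print(len(whole))
```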

Legacy control characters

The legacy control characters come from the ASCII and ISO 8859-1 character sets and are sometimes referred to as C0 and C1 respectively. Many of these characters play no explicit role in Unicode text handling, though they are still used in mainframe computing environments. Others, like the null character and many whitespace characters, are still commonly used in text processing. Other common control characters are tabulation or tab (U+0009), line feed (U+000A), carriage return (U+000D) and next line (U+0085). These are included among whitespace characters because, though they have no visual glyph, they do insert vertical or horizontal spacing between the display of characters.

Unicode introduced separators

In an attempt to simplify the several new line characters used in legacy text, UCS introduces its own new line characters to separate either lines or paragraphs: the line separator (U+2028) and paragraph separator (U+2029) characters.
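As a brief illustration, Python's str.splitlines treats the legacy line-ending controls and the two Unicode separators alike as line boundaries. The sample text is made up for this sketch.

```python
# One string mixing legacy line endings with the Unicode separators.
text = "one\ntwo\r\nthree\u0085four\u2028five\u2029six"

# splitlines() recognizes LF, CR LF, NEL (U+0085), U+2028 and U+2029.
lines = text.splitlines()

if __name__ == "__main__":
    print(lines)
```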

Language tags

Unicode includes 128 characters as language tags. The characters essentially mirror the 128 ASCII characters, except that when used they identify the subsequent text as belonging to a particular language according to BCP 47. For example, to mark subsequent text as the variant of English written in the United States, the initiating ‘Language Tag’ character (U+E0001) is followed by the sequence ‘Tag Small Letter e’ (U+E0065), ‘Tag Small Letter n’ (U+E006E), ‘Tag Hyphen-minus’ (U+E002D), ‘Tag Small Letter u’ (U+E0075) and ‘Tag Small Letter s’ (U+E0073).
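Because the tag characters mirror ASCII at a fixed offset, the sequence in the example can be derived mechanically by adding E0000 (hexadecimal) to each ASCII code. The sketch below is illustrative only; encode_language_tag and decode_language_tag are hypothetical helper names.

```python
TAG_BASE = 0xE0000          # offset between an ASCII character and its tag twin
LANGUAGE_TAG = "\U000E0001"  # the initiating Language Tag character


def encode_language_tag(tag):
    """Return the invisible tag-character sequence for an ASCII tag like 'en-us'."""
    if not tag.isascii():
        raise ValueError("language tags are spelled with ASCII characters")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in tag.lower())


def decode_language_tag(sequence):
    """Recover the ASCII tag from a sequence produced by encode_language_tag."""
    if not sequence.startswith(LANGUAGE_TAG):
        raise ValueError("sequence does not start with U+E0001")
    return "".join(chr(ord(c) - TAG_BASE) for c in sequence[1:])


if __name__ == "__main__":
    encoded = encode_language_tag("en-us")
    print(" ".join(f"U+{ord(c):05X}" for c in encoded))
    print(decode_language_tag(encoded))
```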

These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might substitute different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, the decimal digits 0 through 9 might be displayed differently depending on the language in which they appear.

Interlinear annotation

Three formatting characters provide support for interlinear annotation (U+FFF9, U+FFFA, U+FFFB). This may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for such annotation. The W3C ruby markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.

Bidirectional text control

Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) right alongside English and the Arabic letters will flow from right to left and the Latin letters from left to right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example when one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may complicate this further, for example when an author wants left-to-right characters overridden so that they flow from right to left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.
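The character properties that drive this behaviour, and the names of the seven control characters, can be looked up with Python's unicodedata module. This is a small sketch; the sample letters are arbitrary.

```python
import unicodedata

# Directionality is a property of each character.
latin_class = unicodedata.bidirectional("a")        # Latin letter
hebrew_class = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
arabic_class = unicodedata.bidirectional("\u0628")  # ARABIC LETTER BEH

# The seven explicit controls mentioned in the text.
BIDI_CONTROLS = [0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E]
control_names = {cp: unicodedata.name(chr(cp)) for cp in BIDI_CONTROLS}

if __name__ == "__main__":
    print(latin_class, hebrew_class, arabic_class)
    for cp, name in control_names.items():
        print(f"U+{cp:04X}  {name}")
```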

Variation Selectors

Many characters map to alternate glyphs depending on the context. For example Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.

However, for other glyph substitutions, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with the characters or glyphs referred to as gaiji, where different glyphs are used for the same character either historically or for ideographs for family names. This is one of the gray areas in distinguishing between a glyph and a character. If a family name differs slightly from the ideograph character it derives from, then is that a simple glyph variant or a character variant? As of Unicode 3.2 and 4.0, the character set includes 256 variation selectors, so that these combining mark characters can select from 256 possible character or glyph variations for the preceding character. Unicode does not as yet provide any registry for these variations, so the issue of interoperable variation registration is left to other parties.
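The 256 variation selectors are split between the BMP block (FE00–FE0F) and a supplementary block in Plane 14 (E0100–E01EF). The sketch below counts them and tests membership; is_variation_selector is a hypothetical helper name.

```python
# The two ranges that hold variation selectors.
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

# 16 selectors in the BMP plus 240 in Plane 14.
total_selectors = sum(end - start + 1 for start, end in VS_RANGES)


def is_variation_selector(code_point):
    """Return True if code_point lies in one of the variation selector ranges."""
    return any(start <= code_point <= end for start, end in VS_RANGES)


if __name__ == "__main__":
    print(total_selectors)
    print(is_variation_selector(0xFE0F), is_variation_selector(0x0041))
```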

Other Special-purpose characters

Several characters fall between the non-graphical control and formatting characters and full-fledged graphical characters.

Joiners and Non-joiners

Word Joiner (U+2060), Zero-width joiner (U+200D), Zero-width non-joiner (U+200C), Zero-width space (U+200B), Combining Grapheme Joiner (U+034F).

Invisible Separator

Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted such as in a two-dimensional index like i⁣j.

Invisible Times and Function Application

Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematics text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation.

Spaces

The space character (U+0020) typically input by the space bar on a keyboard serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents for the space character. These spaces include:

  1. Space (U+0020)
  2. En Quad (U+2000)
  3. Em Quad (U+2001)
  4. En Space (U+2002)
  5. Em Space (U+2003)
  6. Three-Per-Em Space (U+2004)
  7. Four-Per-Em Space (U+2005)
  8. Six-Per-Em Space (U+2006)
  9. Figure Space (U+2007)
  10. Punctuation Space (U+2008)
  11. Thin Space (U+2009)
  12. Hair Space (U+200A)
  13. Medium Mathematical Space (U+205F)

Aside from the original ASCII space, the other spaces are all compatibility characters. In this context this means that they effectively add no semantic content to the text, but instead provide styling control. Within Unicode, this non-semantic styling control is often referred to as rich text and is outside the thrust of Unicode’s goals. Rather than using different spaces in different contexts, this styling could instead be handled through intelligent text layout software.
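The compatibility relationship between these spaces and the ordinary space can be checked with compatibility normalization. The sketch assumes the NFKC form as provided by Python's unicodedata module; the list simply repeats the code points given above.

```python
import unicodedata

# The sized spaces listed above (everything except U+0020 itself).
STYLING_SPACES = [
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
    0x2007, 0x2008, 0x2009, 0x200A, 0x205F,
]

# Compatibility normalization maps each of them to the plain space.
folded = {cp: unicodedata.normalize("NFKC", chr(cp)) for cp in STYLING_SPACES}
all_fold_to_space = all(result == " " for result in folded.values())

if __name__ == "__main__":
    print(all_fold_to_space)
```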

Line-break control characters

Several characters are designed to help control line-breaks either by discouraging them (no-break characters) or suggesting line breaks such as the soft or shy hyphen (U+00AD). Such characters, though designed for styling, are probably indispensable for the intricate types of line-breaking they make possible.

  1. Shy Hyphen (U+00AD)
  2. Non-breaking Hyphen (U+2011)
  3. No-break Space (U+00A0)
  4. Narrow No-break Space (U+202F)
  5. Zero-width space (U+200B)

Whitespace characters

Whitespace characters are not a separate group of characters; instead, Unicode provides a list of characters it deems whitespace for interoperability support. Software implementations and other standards may use the term to denote a slightly different set of characters. Whitespace characters are typically designated for programming environments, where they often have no syntactic meaning and are ignored by machine interpreters. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace characters, as well as the Unicode-introduced line separator and paragraph separator. The core space character (U+0020) is also designated as a whitespace character, as are the other space characters listed above, with the exception of the zero-width space (U+200B).
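Python's str.isspace is one example of an implementation-specific notion of whitespace. It is shown here purely as an illustration and is not guaranteed to coincide exactly with Unicode's own list; the sample code points are an arbitrary selection.

```python
# Code points to probe, with informal labels used only for printing.
SAMPLES = {
    0x0009: "tab",
    0x000A: "line feed",
    0x0020: "space",
    0x0085: "next line",
    0x00A0: "no-break space",
    0x2003: "em space",
    0x2028: "line separator",
    0x200B: "zero-width space",
    0x0041: "letter A",
}

# What this particular implementation considers whitespace.
python_whitespace = {cp: chr(cp).isspace() for cp in SAMPLES}

if __name__ == "__main__":
    for cp, label in SAMPLES.items():
        print(f"U+{cp:04X}  {label:18}  {python_whitespace[cp]}")
```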

Private use characters

The UCS includes over 100,000 code points for private use. These code points can be assigned characters with specific properties by individuals, organizations and software vendors outside the ISO and the Unicode Consortium. A Private Use Area (PUA) is one of several ranges reserved for private use; the Unicode standard does not specify any characters for these ranges.

The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). Plane Fifteen (U+F0000 to U+FFFFD) and Plane Sixteen (U+100000 to U+10FFFD) are completely reserved for private use as well.
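As a rough sketch, the three ranges above can be written down directly and used both to test a code point and to reproduce the private-use total given in the summary table. The names `PRIVATE_USE_RANGES` and `is_private_use` are illustrative, not part of any standard API.

```python
# Private-use ranges exactly as given in the text (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # PUA in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Plane Fifteen
    (0x100000, 0x10FFFD),  # Plane Sixteen
]

def is_private_use(cp: int) -> bool:
    """True if the code point lies in one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

# Sizes of the three ranges and their sum.
sizes = [hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES]
total = sum(sizes)
print(sizes, total)

print(is_private_use(0xF8FF))  # last code point of the BMP PUA
print(is_private_use(0x0041))  # LATIN CAPITAL LETTER A
```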

The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways. Similarly, the ConScript Unicode Registry aims to coordinate the mapping, within the PUAs, of scripts not yet encoded in or rejected by Unicode. The Medieval Unicode Font Initiative uses the PUA to encode various ligatures, precomposed characters, and symbols found in medieval texts.

One example of Private Use Area usage is Apple's use of U+F8FF for the Apple logo.

In Microsoft Windows, these characters can be created using the Private Character Editor, a limited font editor that comes with Windows.

Special code points

At the simplest level, each character in the UCS represents a code point and a particular semantic function. For graphical characters, the semantic function is often implied by the character's name and by the script or block in which it is included. A graphical character may also have a recommended glyph that helps define the meaning of the character. Ideographs for languages in China, Japan, Korea and Vietnam include many other rich properties that participate in defining the semantic role of a character.

However, the UCS and Unicode designate other code points for other purposes. Those code points may have no or few character properties associated with them.

Surrogates

The 2,048 surrogates are not characters, but are reserved for use in UTF-16 to specify code points outside the Basic Multilingual Plane. They are divided into "high surrogates" (D800–DBFF) and "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point.

A surrogate pair denotes the code point

10000₁₆ + (H − D800₁₆) × 400₁₆ + (L − DC00₁₆)

where H and L are the numeric values of the high and low surrogates respectively.

Since high surrogate values in the range DB80 to DBFF always produce values in the Private Use planes, the high surrogate range can be further divided into (normal) high surrogates (D800–DB7F) and "high private use surrogates" (DB80–DBFF).
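The surrogate arithmetic above can be sketched in a few lines. The function names below are hypothetical; the constants are the ones from the formula, and the inverse mapping is derived from it by splitting the offset from 0x10000 into its upper and lower 10 bits.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

def encode_surrogate_pair(cp: int) -> tuple:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# U+1D11E lies in the Musical Symbols block.
print(hex(decode_surrogate_pair(0xD834, 0xDD1E)))

# The first "high private use surrogate" starts Plane Fifteen.
print(hex(decode_surrogate_pair(0xDB80, 0xDC00)))

print([hex(u) for u in encode_surrogate_pair(0x10FFFF)])
```

The second call shows why DB80 marks the boundary: every pair whose high surrogate is DB80 or greater decodes to U+F0000 or above.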

Noncharacters

Unicode reserves several code points as noncharacters. These code points are guaranteed never to have a character assigned to them. Software implementations are therefore free to use these code points internally. However, noncharacters should never be included in text interchanged between implementations. One inherently useful example of a noncharacter is the code point U+FFFE, which is the byte-swapped form of the byte order mark (U+FEFF). If a stream of text begins with this noncharacter, this is a good indication that the text has been interpreted with the incorrect endianness.
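A minimal sketch of a noncharacter test follows. It assumes the usual layout of the 66 noncharacters: the 32 code points U+FDD0 through U+FDEF mentioned in the totals table, plus the last two code points of each of the 17 planes. The name `is_noncharacter` is illustrative. The second half shows the byte-order case described above using Python's UTF-16 codecs.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Count over the whole code space, 0 through 10FFFF.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)

# A byte order mark written little-endian but read big-endian
# comes out as the noncharacter U+FFFE.
swapped = "\uFEFF".encode("utf-16-le").decode("utf-16-be")
print(f"U+{ord(swapped):04X}", is_noncharacter(ord(swapped)))
```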

Summary table of UCS characters assignments

Description of Table Columns and Rows

The following table lists all of the blocks currently assigned characters as of April 2007 (Unicode 5.0). Blocks are grouped according to their function.

  • The first column lists the name of the group.

Working backwards:

  • The last four columns indicate the boundaries of the block (its starting and ending code points) in both hexadecimal and decimal notation.
  • The prior column (labeled Seq, for sequence) indicates the position of the block in code point order. This sequence is just an ordering based on the current block assignments. As new blocks are assigned or carved out of the existing unassigned blocks, the sequence numbers would change (though the order would remain the same).
  • Unallocd indicates the number of unallocated code points represented by a potential block.
  • Allocd indicates the number of code points allocated to the block, whether actually provisioned or reserved for potential future use of the block.
  • Excl and Incl: Some blocks contain unrelated characters best treated within other categories. In this case the characters are all tallied in one place in terms of the allocated and reserved characters. The unrelated characters are subtracted (Excl) from the present row and added (Incl) to another.
  • Resrvd indicates the number of code points set aside in the block for related characters but not yet assigned.
  • Provd indicates the characters provisioned in the block: those actually assigned characters (Allocd − Excl + Incl − Resrvd = Provd).
  • Compat indicates the number of characters in the block considered compatibility characters. The issue of compatibility characters is complicated; however, they generally represent characters included for compatibility with legacy text processing systems or legacy character sets. Unicode’s separation of glyph from character implies that far fewer characters are required for text processing. The various variant glyphs are instead stored as font data, rather than stored as text data (see Unicode compatibility character section). Compatibility characters are typically ligatures such as ffi or precomposed diacritic letters such as å.
  • Core indicates the number of provisioned characters in the block less the discouraged compatibility characters and deprecated (strongly discouraged) characters.
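The relationship between the columns can be checked mechanically. The sketch below uses a hypothetical `BlockRow` record and three rows copied from the tables that follow; it recomputes Provd from the formula above and the block size from the hexadecimal boundaries.

```python
from dataclasses import dataclass

@dataclass
class BlockRow:
    """One table row; blank Excl, Incl and Resrvd cells are treated as 0."""
    name: str
    start: int
    end: int
    allocd: int
    excl: int = 0
    incl: int = 0
    resrvd: int = 0

    @property
    def size(self) -> int:
        # Number of code points between the block boundaries, inclusive.
        return self.end - self.start + 1

    @property
    def provd(self) -> int:
        # Provd = Allocd - Excl + Incl - Resrvd
        return self.allocd - self.excl + self.incl - self.resrvd

rows = [
    BlockRow("Arabic", 0x0600, 0x06FF, allocd=256, resrvd=21),
    BlockRow("Latin-1 Supplement", 0x0080, 0x00FF, allocd=128, excl=32),
    BlockRow("General Punctuation", 0x2000, 0x206F, allocd=112, excl=19, resrvd=6),
]

for row in rows:
    print(row.name, row.size, row.provd)
```

The computed Provd values (235, 96 and 87) agree with the Provd. column of the corresponding rows.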

Though the table names unallocated blocks, those blocks could potentially be allocated for any purpose. For example, unused code point blocks within the general area of the BMP dedicated to Unihan ideographs could instead be allocated to modern scripts. The names merely indicate the general region of the plane in which the blocks are situated.

Totals

Allocd Unallocd Nonchars Potential Code Points
Grand Totals 241,498 + 872,548 + 66  = 1,114,112
Code Points in Unallocated Planes 720,874
Non-Characters (Unallocated Planes) 22 Unicode designates 32 other non-characters in the Arabic Presentation Forms-A block from U+FDD0 to U+FDEF for a total of 66 noncharacters designated so far.
Non-Characters (Planes in Current Use) 12
Non-Characters in BMP Arabic Presentation Forms-A block -32 32
Private Use Allocation 137,468
Surrogates 2,048
93,978 5,349 99,327 2,684 155 155 102,014 151,674
Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd

Modern Scripts

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
A Modern Scripts 17,722 999 18,721 1,454 14 79 20,240 944 (sequences 1-4, 8-55, 99-100, 102-103, 105, 111-112)
1 Arabic Blocks 253 12 265 39 304 0 (sequences 13 & 15)
1.1 Arabic 223 12 235 21 256 13 0600 06FF 1536 1791
1.2 Arabic Supplement 30 30 18 48 15 0750 077F 1872 1919
2 Armenian 85 1 86 10 96 11 0530 058F 1328 1423
3 Balinese 110 11 121 7 128 54 1B00 1B7F 6912 7039
4 Bengali 87 5 92 36 128 20 0980 09FF 2432 2559
5 Bopomofo Blocks 64 0 64 16 80 (sequences 100, 103)
5.1 Bopomofo 40 40 8 48 100 3100 312F 12544 12591
5.2 Bopomofo Extended 24 24 8 32 103 31A0 31BF 12704 12735
6 Buginese 30 30 2 32 52 1A00 1A1F 6656 6687
7 Buhid 20 20 12 32 43 1740 175F 5952 5983
8 Cherokee 85 85 11 96 37 13A0 13FF 5024 5119
9 Coptic Blocks 128 0 128 14 14 128 (sequences 8, 87)
9.1 Coptic 114 114 14 128 87 2C80 2CFF 11392 11519
9.2 Coptic, Greek and 14 14 0 14 0 8 0370 03FF 880 1023
10 Cyrillic Blocks 223 52 275 29 304 (sequences 9-10)
10.1 Cyrillic 203 52 255 1 256 9 0400 04FF 1024 1279
10.2 Cyrillic Supplement 20 20 28 48 10 0500 052F 1280 1327
11 Devanagari 99 11 110 18 128 19 0900 097F 2304 2431
12 Ethiopic Blocks 461 0 461 51 512 (sequences 35-36, 90)
13.1 Ethiopic 356 356 28 384 35 1200 137F 4608 4991
13.2 Ethiopic Extended 79 79 17 96 90 2D80 2DDF 11648 11743
13.3 Ethiopic Supplement 26 26 6 32 36 1380 139F 4992 5023
14 Georgian Blocks 121 0 121 23 144
14.1 Georgian 83 83 13 96 33 10A0 10FF 4256 4351
14.2 Georgian Supplement 38 38 10 48 88 2D00 2D2F 11520 11567
15 Glagolitic 94 94 2 96 85 2C00 2C5F 11264 11359
16 Greek Blocks 79 267 346 40 14 400 (sequences 8, 60)
16.1 Greek and Coptic 79 34 113 17 14 144 8 0370 03FF 880 1023
16.2 Greek Extended 0 233 233 23 256 60 1F00 1FFF 7936 8191
17 Gujarati 84 84 44 128 22 0A80 0AFF 2688 2815
18 Gurmukhi 72 6 78 50 128 21 0A00 0A7F 2560 2687
19 Hangul Blocks 11,412 0 11,412 28 11,440 (sequences 34, 120)
19.1 Hangul Jamo 240 240 16 256 34 1100 11FF 4352 4607
19.2 Hangul Syllables 11,172 11,172 12 11,184 120 AC00 D7AF 44032 55215
20 Hanunoo 23 23 9 32 42 1720 173F 5920 5951
21 Hebrew 87 87 25 112 12 0590 05FF 1424 1535
22 Japanese Blocks 159 62 221 3 224 (sequences 98-99, 102, 105)
22.1 Katakana 64 32 96 0 96 99 30A0 30FF 12448 12543
22.2 Katakana Phonetic Extensions 16 16 0 16 105 31F0 31FF 12784 12799
22.3 Hiragana 63 30 93 3 96 98 3040 309F 12352 12447
22.4 Kanbun 16 16 0 16 102 3190 319F 12688 12703
23 Kannada 81 5 86 42 128 26 0C80 0CFF 3200 3327
24 Khmer Blocks 144 2 146 14 160 (sequences 45, 51)
24.1 Khmer 112 2 114 14 128 45 1780 17FF 6016 6143
24.2 Khmer Symbols 32 32 0 32 51 19E0 19FF 6624 6655
25 Lao 62 3 65 63 128 30 0E80 0EFF 3712 3839
26 Latin Blocks 268 524 792 247 65 1,104 (sequences 1-4, 59, 86, 115)
26.1 Latin, Basic 95 95 0 33 128 1 0000 007F 0 127
26.2 Latin-1 Supplement 35 61 96 0 32 128 2 0080 00FF 128 255
26.3 Latin Extended Additional 0 246 246 10 256 59 1E00 1EFF 7680 7935
26.4 Latin Extended-A 14 114 128 0 128 3 0100 017F 256 383
26.5 Latin Extended-B 105 103 208 0 208 4 0180 024F 384 591
26.6 Latin Extended-C 17 17 15 32 86 2C60 2C7F 11360 11391
26.7 Latin Extended-D 2 2 222 224 115 A720 A7FF 42784 43007
27 Limbu 66 66 14 80 48 1900 194F 6400 6479
28 Malayalam 75 3 78 50 128 27 0D00 0D7F 3328 3455
29 Mongolian 155 155 21 176 46 1800 18AF 6144 6319
30 Myanmar 77 1 78 82 160 32 1000 109F 4096 4255
31 New Tai Lue 80 80 16 96 50 1980 19DF 6528 6623
32 NKo 59 59 5 64 17 07C0 07FF 1984 2047
33 Ogham 29 29 3 32 39 1680 169F 5760 5791
34 Oriya 76 5 81 47 128 23 0B00 0B7F 2816 2943
35 Phags-pa 56 56 8 64 118 A840 A87F 43072 43135
36 Runic 81 81 15 96 40 16A0 16FF 5792 5887
37 Sinhala 77 4 81 47 128 28 0D80 0DFF 3456 3583
38 Syloti Nagri 44 44 4 48 116 A800 A82F 43008 43055
39 Syriac 77 77 3 80 14 0700 074F 1792 1871
40 Tagalog 20 20 12 32 41 1700 171F 5888 5919
41 Tagbanwa 18 18 14 32 44 1760 177F 5984 6015
42 Tai Le 35 35 13 48 49 1950 197F 6480 6527
43 Tamil 68 4 72 56 128 24 0B80 0BFF 2944 3071
44 Telugu 81 1 82 46 128 25 0C00 0C7F 3072 3199
45 Thaana 50 50 14 64 16 0780 07BF 1920 1983
46 Thai 86 1 87 41 128 29 0E00 0E7F 3584 3711
47 Tibetan 176 19 195 61 256 31 0F00 0FFF 3840 4095
48 Tifinagh 55 55 25 80 89 2D30 2D7F 11568 11647
49 Unified Canadian Aboriginal Syllabics 630 630 10 640 38 1400 167F 5120 5759
50 Yi Blocks 1,220 0 1,220 12 1,232 0 (sequences 111 & 112)
50.1 Yi Radicals 55 55 9 64 112 A490 A4CF 42128 42191
50.2 Yi Syllables 1,165 1,165 3 1,168 111 A000 A48F 40960 42127
51 Unallocated Script Blocks 944 (sequences 18, 47, 53, 55)
51.1 256 18 0800 08FF 2048 2303
51.2 80 47 18B0 18FF 6320 6399
51.3 224 53 1A20 1AFF 6688 6911
51.4 384 55 1B80 1CFF 7040 7423

Ancient Scripts

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
B Ancient Scripts 1,783 0 1,783 313 2,096 51,152 (sequences 138-162)
1 Linear B Syllabary 88 88 40 128 138 10000 1007F 65536 65663
2 Linear B Ideograms 123 123 5 128 139 10080 100FF 65664 65791
3 Aegean Numbers 57 57 7 64 140 10100 1013F 65792 65855
4 Ancient Greek Numbers 75 75 5 80 141 10140 1018F 65856 65935
5 Old Italic 35 35 13 48 143 10300 1032F 66304 66351
6 Gothic 27 27 5 32 144 10330 1034F 66352 66383
7 Ugaritic 31 31 1 32 146 10380 1039F 66432 66463
8 Old Persian 50 50 14 64 147 103A0 103DF 66464 66527
9 Deseret 80 80 0 80 149 10400 1044F 66560 66639
10 Shavian 48 48 0 48 150 10450 1047F 66640 66687
11 Osmanya 40 40 8 48 151 10480 104AF 66688 66735
12 Cypriot Syllabary 55 55 9 64 153 10800 1083F 67584 67647
13 Phoenician 27 27 5 32 155 10900 1091F 67840 67871
14 Kharoshthi 65 65 31 96 157 10A00 10A5F 68096 68191
15 Cuneiform 879 879 145 1,024 159 12000 123FF 73728 74751
16 Cuneiform Numbers and Punctuation 103 103 25 128 160 12400 1247F 74752 74879
17 Ancient Script Unallocated 51,152 (sequences 142, 145, 148, 152, 154, 156, 158, 161-162)
17.1 368 142 10190 102FF 65936 66303
17.2 48 145 10350 1037F 66384 66431
17.3 32 148 103E0 103FF 66528 66559
17.4 848 152 104B0 107FF 66736 67583
17.5 192 154 10840 108FF 67648 67839
17.6 224 156 10920 109FF 67872 68095
17.7 5,536 158 10A60 11FFF 68192 73727
17.8 128 161 12480 124FF 74880 75007
17.9 43,776 162 12500 1CFFF 75008 118783

Phonetics

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
C Phonetic 277 118 395 5 400 0 (sequences 5-6, 56-57, 114)
1 IPA Extensions 96 96 0 96 5 0250 02AF 592 687
2 Phonetic Extensions 67 61 128 0 128 56 1D00 1D7F 7424 7551
3 Phonetic Extensions Supplement 27 37 64 0 64 57 1D80 1DBF 7552 7615
4 Spacing Modifier Letters 60 20 80 0 80 6 02B0 02FF 688 767
5 Modifier Tone Letters 27 27 5 32 114 A700 A71F 42752 42783

Unified Diacritics

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
D Unified Diacritics 156 4 160 79 1 240 0 (sequences 7, 58, 64, 131)
1 Combining Diacritical Marks 107 4 111 0 1 112 7 0300 036F 768 879
2 Combining Diacritical Marks Supplement 13 13 51 64 58 1DC0 1DFF 7616 7679
3 Combining Diacritical Marks for Symbols 32 32 16 48 64 20D0 20FF 8400 8447
4 Combining Half Marks 4 4 12 16 131 FE20 FE2F 65056 65071

Unified Punctuation

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
E Unified Punctuation 88 25 113 108 19 240 0 (sequences 61 & 92)
1 General Punctuation 62 25 87 6 19 112 61 2000 206F 8192 8303
2 Supplemental Punctuation 26 26 102 128 92 2E00 2E7F 11776 11903

Unified Symbols

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
F Unified Symbols 2,528 90 2,618 241 18 55 2,896 10,414 (sequences 63, 65-71, 73-84, 91, 95, 97, 109, 167-169, 171)
1 Arrows Blocks 268 6 274 0 18 0 256 0 (sequences 67, 79, 81, 84)
1.1 Arrows 106 6 112 0 112 67 2190 21FF 8592 8703
1.2 Supplemental Arrows-A 16 16 0 16 79 27F0 27FF 10224 10239
1.3 Supplemental Arrows-B 128 128 0 128 81 2900 297F 10496 10623
1.4 Miscellaneous Symbols and Arrows 18 18 0 18 0 84 2B00-2B11 11008-11025
2 Braille Patterns 256 256 0 256 80 2800 28FF 10240 10495
3 Control Pictures 39 39 25 64 70 2400 243F 9216 9279
4 Counting Rod Numerals 18 18 14 32 168 1D360 1D37F 119648 119679
5 Currency Symbols 22 22 26 48 63 20A0 20CF 8352 8399
6 Geometrical Symbols 256 0 256 0 256 0 (sequences 73-75)
6.1 Geometric Shapes 96 96 0 96 75 25A0 25FF 9632 9727
6.2 Box Drawing 128 128 0 128 73 2500 257F 9472 9599
6.3 Block Elements 32 32 0 32 74 2580 259F 9600 9631
7 Letterlike Symbols 38 4 42 1 37 80 65 2100 214F 8448 8527
8 Math 632 47 679 9 688 0 (sequences 68, 78, 82-83)
8.1 Mathematical Operators 214 42 256 0 256 68 2200 22FF 8704 8959
8.2 Supplemental Mathematical Operators 251 5 256 0 256 83 2A00 2AFF 10752 11007
8.3 Miscellaneous Mathematical Symbols-A 39 39 9 48 78 27C0 27EF 10176 10223
8.4 Miscellaneous Mathematical Symbols-B 128 128 0 128 82 2980 29FF 10624 10751
9 Miscellaneous Symbols 818 2 820 122 18 960 0 (sequences 69, 76-77, 84)
9.1 Miscellaneous Symbols and Arrows 238 238 0 18 256 84 2B00 2BFF 11008 11263
9.2 Miscellaneous Symbols 176 176 80 256 76 2600 26FF 9728 9983
9.3 Miscellaneous Technical 230 2 232 24 256 69 2300 23FF 8960 9215
9.4 Dingbats 174 174 18 192 77 2700 27BF 9984 10175
10 Number Forms 19 31 50 14 64 66 2150 218F 8528 8591
11 Optical Character Recognition 11 11 21 32 71 2440 245F 9280 9311
12 Tai Xuan Jing Symbols 87 87 9 96 167 1D300 1D35F 119552 119647
13 Yijing Hexagram Symbols 64 64 0 64 109 4DC0 4DFF 19904 19967
14 Unallocated Symbol Blocks 48 (sequences 91 & 95)
14.1 32 91 2DE0 2DFF 11744 11775
14.2 16 95 2FE0 2FEF 12256 12271
15 Unallocated Symbol Blocks 10,366 (sequences 169 & 171)
15.1 128 169 1D380 1D3FF 119680 119807
15.2 10,238 171 1D800 1FFFD 120832 131069

Music Notation

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
G Music Notation and Symbols 522 13 535 57 592 176 (sequences 163-166)
1 Byzantine Musical Symbols 246 246 10 256 163 1D000 1D0FF 118784 119039
2 Musical Symbols 206 13 219 37 256 164 1D100 1D1FF 119040 119295
3 Ancient Greek Musical Notation 70 70 10 80 165 1D200 1D24F 119296 119375
4 Unallocated Musical Blocks 176 (sequence 166)
4.1 176 166 1D250 1D2FF 119376 119551

Unihan CJKV Blocks

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
H Unified CJK Blocks 70,426 7 70,433 127 70,560 22,320 (sequences 93, 96, 104, 108, 110, 113-114, 117, 119, 120-121, 172-173)
1 Unified CJK Support Blocks 200 7 207 49 256 0 (sequences 93, 96-97, 104)
1.1 CJK Radicals Supplement 113 2 115 13 128 93 2E80 2EFF 11904 12031
1.2 Ideographic Description Characters 12 12 4 16 96 2FF0 2FFF 12272 12287
1.3 CJK Strokes 16 16 32 48 104 31C0 31EF 12736 12783
1.4 CJK Symbols and Punctuation 59 5 64 0 64 97 3000 303F 12288 12351
2 Unified Han Ideographs 70,226 70,226 78 70,304 0 (sequences 108, 110, 172)
2.1 CJK Unified Ideographs Extension A 6,582 6,582 10 6,592 108 3400 4DBF 13312 19903
2.2 CJK Unified Ideographs 20,924 20,924 68 20,992 110 4E00 9FFF 19968 40959
2.3 CJK Unified Ideographs Extension B 42,720 42,720 42,720 172 20000 2A6DF 131072 173791
3 Unallocated Unihan 22,320 (sequences 113, 117, 119, 121, 173)
3.1 20,768 173 2A6E0 2F7FF 173792 194559
3.2 560 113 A4D0 A6FF 42192 42751
3.3 16 117 A830 A83F 43056 43071
3.4 896 119 A880 ABFF 43136 44031
3.5 80 121 D7B0 D7FF 55216 55295

Legacy Compatibility Blocks

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
I Legacy Compatibility Blocks 41 3,054 3,095 232 0 1 3,328 1,502 (sequences 62, 72, 94, 101, 106-107, 126-128, 130, 132-135, 170, 174)
1 Enclosed Alphanumerics 21 139 160 0 160 72 2460 24FF 9312 9471
2 Superscripts and Subscripts 0 34 34 14 48 62 2070 209F 8304 8351
3 Alphabetic Presentation Forms 1 57 58 22 80 127 FB00 FB4F 64256 64335
4 Arabic Compatibility 4 731 735 96 1 832
4.1 Arabic Presentation Forms-A 3 592 595 93 688 128 FB50 FDFF 64336 65023
4.2 Arabic Presentation Forms-B 1 139 140 3 1 144 134 FE70 FEFF 65136 65279
5 CJK and Ideograph Compatibility 15 2,093 2,108 100 2,208
5.1 KangXi Radicals 0 214 214 10 224 94 2F00 2FDF 12032 12255
5.2 Hangul Compatibility Jamo 0 94 94 2 96 101 3130 318F 12592 12687
5.3 CJK Compatibility 0 256 256 0 256 107 3300 33FF 13056 13311
5.4 CJK Compatibility Ideographs 12 455 467 45 512 126 F900 FAFF 63744 64255
5.5 Vertical Forms 0 10 10 6 16 130 FE10 FE1F 65040 65055
5.6 CJK Compatibility Forms 2 30 32 0 32 132 FE30 FE4F 65072 65103
5.7 Small Form Variants 0 26 26 6 32 133 FE50 FE6F 65104 65135
5.8 Halfwidth and Fullwidth Forms 0 225 225 15 240 135 FF00 FFEF 65280 65519
5.9 CJK Compatibility Ideographs Supplement 0 542 542 2 544 174 2F800 2FA1F 194560 195103
5.10 Enclosed CJK Letters and Months 1 241 242 14 256 106 3200 32FF 12800 13055
6 Unallocated Compatibility Blocks 1,502
6.1 1,502 175 2FA20 2FFFD 195104 196605

Other Compatibility Blocks

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
J Other Compatibility Blocks 0 1,033 1,033 28 37 0 1,024 0 (sequences 65 & 170)
1 Letterlike Symbols 0 37 37 0 37 0 65 2100 214F 8448 8527
2 Mathematical Alphanumeric Symbols 0 996 996 28 1,024 170 1D400 1D7FF 119808 120831

Special-purpose characters

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
K Control Characters 435 6 441 86 0 396 65,166 (sequences 1-2, 7, 61, 134, 136, 129, 177, 178)
1 ASCII/8859-1 Controls 65 0 65 0 65 0 0 0 (sequences 1 & 2)
1.1 C0 (Latin, Basic) 33 33 33 0 1 0000 001F & 007F 0 31 & 127
1.2 C1 (Latin-1 Supplement) 32 32 32 0 2 0080 009F 128 159
2 Byte Order Mark 1 1 1 0 134 FEFF 65279
3 Combining Grapheme Joiner 1 1 1 0 7 034F 847
4 General Punctuation 13 6 19 0 19 0 0 0 (sequence 61)
4.1 Bidi Characters 7 0 7 7 0 61 200E-200F, 202A-202E
4.2 Other Formatting 6 0 6 0 6 0 61 200C, 200D, 2028-2029, 2060, 2063
4.3 Deprecated 0 6 6 6 0 61 206A-206F 8298-8303
5 Specials 5 5 7 12 136 FFF0 FFFD 65520 65533
6 Tags 95 95 33 128 177 E0000 E007F 917504 917631
7 Variation Selectors 256 256 256
7.1 Variation Selectors 16 16 16 129 FE00 FE0F 65024 65039
7.2 Variation Selectors Supplement 240 240 240 178 E0100 E01EF 917760 917999
8 Unallocated Special-Purpose 0 0 0 65,166 (sequences 179-181)
8.1 128 179 E0080 E00FF 917632 917759
8.2 16 180 E01F0 E01FF 918000 918015
8.3 65,022 181 E0200 EFFFD 918016 983037

Surrogates

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
L Surrogates 0 0 0 0 0 0 2,048 0 (sequences 122-124)
1 High Private Use Surrogates 128 123 DB80 DBFF 56192 56319
2 High Surrogates 896 122 D800 DB7F 55296 56191
3 Low Surrogates 1,024 124 DC00 DFFF 56320 57343

Private use characters

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
M Private Use Areas 0 0 0 0 0 0 137,468 0 (sequences 125, 182-183)
1 Private Use Area 6,400 125 E000 F8FF 57344 63743
2 Supplementary Private Use Area-A 65,534 182 F0000 FFFFD 983040 1048573
3 Supplementary Private Use Area-B 65,534 183 100000 10FFFD 1048576 1114109

Unused Planes

Script-Block Name Core Compat Provd. Resrvd Incl. Excl. Allocd Unallocd Seq Hex Start Hex End Dec Start Dec End
N Planes 3 through 13 0 0 0 0 0 0 720,874 176
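The figure for the unused planes, and the grand total in the Totals section, follow from simple arithmetic. The sketch below assumes that each plane holds 65,536 code points, of which the final two are noncharacters counted separately; the variable names are illustrative.

```python
PLANE_SIZE = 0x10000      # 65,536 code points per plane
NONCHARS_PER_PLANE = 2    # the last two code points of every plane

unused_planes = range(3, 14)  # Planes 3 through 13, eleven planes in all

# Code points in the unused planes, not counting their noncharacters.
unallocated_in_unused = len(unused_planes) * (PLANE_SIZE - NONCHARS_PER_PLANE)
nonchars_in_unused = len(unused_planes) * NONCHARS_PER_PLANE
print(unallocated_in_unused, nonchars_in_unused)

# Grand totals row: Allocd + Unallocd + Nonchars = potential code points.
grand_total = 241_498 + 872_548 + 66
print(grand_total, grand_total == 17 * PLANE_SIZE)
```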

    This page uses content from Wikipedia. The original article was at Mapping of Unicode characters.
    The list of authors can be seen in the page history. The text of this Wikinfo article is available under the GNU Free Documentation License and the Creative Commons Attribution-Share Alike 3.0 license.
