@nitely
Last active April 26, 2018 03:40
unicode stuff

Combining marks are insufficient to break a string into graphemes

Q: So is a combining character sequence the same as a “character”?
A: That depends. For a programmer, a Unicode code point represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme: a minimally distinctive unit of writing in the context of a particular writing system.
For example, å (A + COMBINING RING or A-RING) is a grapheme in the Danish writing system, while KA + VIRAMA + TA + VOWEL SIGN U is one in the Devanagari writing system. Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes. Moreover, there are a number of other cases where a user would not count “characters” the same way as a programmer would: where there are invisible characters such as the RIGHT-TO-LEFT MARK (RLM) used in BIDI, compatibility composites such as “Dz”, “ij”, or Roman numerals, and so on.

see http://unicode.org/faq/char_combmark.html
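To make the point in the title concrete, here is a minimal sketch of the naive approach: glue every combining mark onto the preceding code point and call each resulting chunk a "character". The function name `split_on_combining_marks` is made up for this illustration, and "combining mark" is assumed to mean general category Mn, Mc, or Me. It handles the a + ring example, but it cuts the Devanagari sequence from the FAQ in two.

```python
import unicodedata


def split_on_combining_marks(text):
    """Naive splitter: attach each mark (Mn/Mc/Me) to the chunk before it."""
    chunks = []
    for ch in text:
        if chunks and unicodedata.category(ch).startswith("M"):
            # Combining mark: extend the current chunk.
            chunks[-1] += ch
        else:
            # Anything else starts a new chunk.
            chunks.append(ch)
    return chunks


a_ring = "a\u030a"  # LATIN SMALL LETTER A + COMBINING RING ABOVE
conjunct = "\u0915\u094d\u0924\u0941"  # KA + VIRAMA + TA + VOWEL SIGN U

# Two code points, one chunk: the naive rule works here.
print(len(a_ring), len(split_on_combining_marks(a_ring)))

# Four code points, two chunks (KA+VIRAMA, TA+VOWEL SIGN U),
# although the FAQ counts the whole sequence as one grapheme.
print(len(conjunct), len(split_on_combining_marks(conjunct)))
```

TA is a letter, not a mark, so a rule based only on combining marks has no reason to join it to KA + VIRAMA.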

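Probably the best-known grapheme is "\r\n" (CR LF). Unicode TR29 (UAX #29, Unicode Text Segmentation) treats it as a single grapheme cluster: rule GB3 forbids a break between CR and LF, even though neither code point is a combining mark.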
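A small sketch of the CR LF case, using only the standard library. Both code points have general category Cc (control), so a splitter that looks only at combining marks sees two separate "characters". The helper `split_keeping_crlf` is a hypothetical name and implements nothing but the CR LF rule; it is not a TR29 segmenter.

```python
import unicodedata

crlf = "\r\n"

# Two code points, both controls (Cc), neither a combining mark.
categories = [unicodedata.category(ch) for ch in crlf]
print(len(crlf), categories)


def split_keeping_crlf(text):
    """Split into code points, but never break between CR and LF."""
    chunks = []
    for ch in text:
        if ch == "\n" and chunks and chunks[-1] == "\r":
            # LF directly after a lone CR: keep the pair together.
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks


print(split_keeping_crlf("a\r\nb"))
```

A real implementation would apply the full rule set from TR29, typically through a dedicated library rather than hand-written rules.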
