@pavan538
Last active September 17, 2018 12:05
Unicode
  • Code point: The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains many tables listing characters and their corresponding code points. For example:

    U+12CA is a code point that represents one particular character.
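As a small illustration of the notation above, the following sketch (assuming Python 3) converts between the integer value, the U+ notation, and the character itself. The variable names are chosen here for illustration only.

```python
# U+12CA written as a Python integer literal in base 16.
code_point = 0x12CA

# chr() maps a code point to the one-character string it identifies;
# ord() maps a one-character string back to its code point.
character = chr(code_point)

print(code_point)                    # 4810
print(f"U+{code_point:04X}")         # U+12CA
print(ord(character) == code_point)  # True
```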

  • A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented in memory as a sequence of bytes (values from 0 through 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
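To make the distinction between code points and bytes concrete, here is a minimal sketch, assuming Python 3 and using UTF-8 as the example encoding. The string is written with a `\u2e80` escape so the snippet does not depend on the source file's own encoding.

```python
text = "hello\u2e80"  # five ASCII letters followed by U+2E80

# A string is a sequence of code points...
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])
# ['0x68', '0x65', '0x6c', '0x6c', '0x6f', '0x2e80']

# ...and an encoding turns that sequence into bytes (values 0 through 255).
encoded = text.encode("utf-8")
print(list(encoded))
# [104, 101, 108, 108, 111, 226, 186, 128]
```

Note that the six code points become eight bytes: under UTF-8, U+2E80 needs three bytes while each ASCII letter needs one.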

  • A character is represented on a screen or on paper by a set of graphical elements that’s called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used.

  • The rules for converting a Unicode string into the ASCII encoding, for example, are simple; for each code point:

    • If the code point is < 128, each byte is the same as the value of the code point.
    • If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
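The two rules above can be sketched by hand. This is only an illustration under the assumption of Python 3, not how the built-in codec is actually implemented; `encode_ascii` is a hypothetical helper name.

```python
def encode_ascii(text):
    """Hypothetical helper that applies the two ASCII rules by hand."""
    out = bytearray()
    for position, ch in enumerate(text):
        code_point = ord(ch)
        if code_point < 128:
            # Rule 1: the byte has the same value as the code point.
            out.append(code_point)
        else:
            # Rule 2: the string cannot be represented in ASCII.
            raise UnicodeEncodeError(
                "ascii", text, position, position + 1,
                "ordinal not in range(128)",
            )
    return bytes(out)


print(encode_ascii("hello"))                             # b'hello'
print(encode_ascii("hello") == "hello".encode("ascii"))  # True
```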
  • Let's say we have the Unicode string u'hello'. Encoding it to ASCII succeeds and produces b'hello', because every code point is below 128.

  • Let's say we have the Unicode string u'⺀'. Encoding it to ASCII raises the following error:

    UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 0: ordinal not in range(128)
    
  • Let's say we have a Unicode string that mixes ASCII and non-ASCII characters, such as u'hello⺀'. Encoding it to ASCII raises the following error, because the character at position 5 is outside the ASCII range:

    UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)
    
  • Let's say we have the Unicode string u'hello⺀'. Encoding it to UTF-8 produces b'hello\xe2\xba\x80'.
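The encoding cases above can be reproduced with a short sketch, assuming Python 3. The variable `failed_at` is introduced here only to capture the position reported by the exception.

```python
text = "hello\u2e80"  # the mixed string from the examples; U+2E80 is not ASCII

print("hello".encode("ascii"))  # b'hello'
print(text.encode("utf-8"))     # b'hello\xe2\xba\x80'

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    print(exc.start, exc.reason)  # 5 ordinal not in range(128)
    failed_at = exc.start
```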

  • Let's say we have the byte string b'\xe2\xba\x80'. Decoding it as ASCII raises the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
    
  • Let's say we have the encoded bytes b'hello\xe2\xba\x80'. Decoding them as UTF-8 produces 'hello⺀'.
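Decoding goes in the opposite direction, from bytes back to a string. A minimal sketch, assuming Python 3, is shown below; `bad_index` is a name introduced here for illustration. Because this example decodes the longer byte string, the failing byte is at position 5 rather than position 0.

```python
data = b"hello\xe2\xba\x80"

# UTF-8 can interpret the three trailing bytes as the single code point U+2E80.
decoded = data.decode("utf-8")
print(ascii(decoded))  # 'hello\u2e80'

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that is not valid ASCII.
    print(exc.start, hex(data[exc.start]))  # 5 0xe2
    bad_index = exc.start
```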

  • Let's say we have the string 'hello⺀' (in Python 3, the same value as u'hello⺀'). Encoding it to ASCII raises the following error:

    UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)
    
  • Let's say we have the string 'hello⺀'. Encoding it with utf8 (an alias for utf-8) produces b'hello\xe2\xba\x80'.
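The last two examples repeat the earlier ones without the u prefix. The sketch below, assuming Python 3.3 or later, checks that both literal forms give the same string, that "utf8" and "utf-8" name the same codec, and that encoding followed by decoding returns the original text.

```python
with_prefix = u"hello\u2e80"
without_prefix = "hello\u2e80"

# In Python 3 both literals produce a str with the same value.
print(with_prefix == without_prefix)  # True
print(type(with_prefix) is str)       # True

# "utf8" is an alias for "utf-8", so both produce the same bytes.
print(without_prefix.encode("utf8") == without_prefix.encode("utf-8"))  # True

# Encoding and then decoding with the same codec returns the original string.
round_trip = without_prefix.encode("utf-8").decode("utf-8")
print(round_trip == without_prefix)   # True
```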
