- Code point: The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually written in base 16. In the standard, a code point is written using the notation `U+12CA` to mean the character with value 0x12CA (4,810 decimal). The Unicode standard contains many tables listing characters and their corresponding code points. For example, `U+12CA` is a code point, which represents one particular character.
- A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a sequence of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
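As a rough illustration of the terms above, the following Python 3 sketch moves between a code point, its character, and an encoded byte sequence. The variable names are made up for this example, and UTF-8 is used only as one possible encoding.

```python
# U+12CA written as a Python integer literal.
code_point = 0x12CA
print(code_point)  # 4810

# chr() maps a code point to a one-character string; ord() maps it back.
char = chr(code_point)
assert ord(char) == code_point

# A string is a sequence of code points.
text = "hello" + char
code_points = [ord(c) for c in text]
print([f"U+{cp:04X}" for cp in code_points])

# An encoding turns those code points into bytes (integers 0 through 255).
encoded = text.encode("utf-8")
print(list(encoded))
```

Iterating over a `bytes` object yields plain integers, which makes it easy to see that every encoded value fits in the range 0 through 255.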
- A character is represented on a screen or on paper by a set of graphical elements called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details depend on the font being used.
- The rules for converting a Unicode string into the ASCII encoding, for example, are simple. For each code point:
  - If the code point is < 128, the corresponding byte has the same value as the code point.
  - If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a `UnicodeEncodeError` exception in this case.)
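The two ASCII rules above can be sketched by hand as follows. `encode_ascii` is a hypothetical helper written only for illustration; in practice `str.encode("ascii")` does this work, and this sketch is not how Python implements it internally.

```python
def encode_ascii(text: str) -> bytes:
    """Hypothetical helper: apply the two ASCII rules one code point at a time."""
    out = bytearray()
    for position, char in enumerate(text):
        code_point = ord(char)
        if code_point < 128:
            # Rule 1: the byte has the same value as the code point.
            out.append(code_point)
        else:
            # Rule 2: the string cannot be represented in ASCII.
            raise UnicodeEncodeError(
                "ascii", text, position, position + 1,
                "ordinal not in range(128)",
            )
    return bytes(out)


print(encode_ascii("hello"))  # b'hello'
```

Calling `encode_ascii("hello⺀")` raises `UnicodeEncodeError` at position 5, the index of the first code point that is 128 or greater.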
- Let's say we have a Unicode string `u'hello'`. If we encode it to ASCII, it succeeds, because every code point is below 128; the result is `b'hello'`.
- Let's say we have a Unicode string `u'⺀'`. If we try to encode it to ASCII, it raises an error: `UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 0: ordinal not in range(128)`
- Let's say we have a Unicode string that combines ASCII characters with a character from another script, `u'hello⺀'`. If we try to encode it to ASCII, it raises an error: `UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)`
- Let's say we have a Unicode string `u'hello⺀'` and encode it to UTF-8. The output is `b'hello\xe2\xba\x80'`.
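The encoding cases described so far can be reproduced with a short Python 3 script. `try_encode` is a hypothetical helper introduced here so that a failure is returned as a message instead of stopping the script.

```python
def try_encode(text, encoding):
    """Hypothetical helper: return the encoded bytes, or the error text on failure."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return f"{type(exc).__name__}: {exc}"


for sample in ["hello", "⺀", "hello⺀"]:
    print(repr(sample), "->", try_encode(sample, "ascii"))

# UTF-8 can represent every code point, so this succeeds.
print(try_encode("hello⺀", "utf-8"))  # b'hello\xe2\xba\x80'
```

Only the pure-ASCII sample encodes with the `ascii` codec; the other two report the position of the first character that does not fit.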
- Let's say we have the UTF-8 encoded bytes `b'\xe2\xba\x80'`. If we try to decode them as ASCII, it raises an error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)`
- Let's say we have the encoded bytes `b'hello\xe2\xba\x80'`. If we decode them as UTF-8, the output is `'hello⺀'`.
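Decoding goes the other way, from bytes back to a string. A minimal Python 3 sketch of the two decoding cases above follows; `try_decode` is a hypothetical helper that returns the error text instead of raising.

```python
def try_decode(data, encoding):
    """Hypothetical helper: return the decoded string, or the error text on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


# 0xe2 is 128 or greater, so the ascii codec rejects the first byte.
print(try_decode(b"\xe2\xba\x80", "ascii"))

# The same three bytes are a valid UTF-8 sequence for U+2E80.
print(try_decode(b"hello\xe2\xba\x80", "utf-8"))  # hello⺀
```

The bytes must be decoded with the encoding that produced them; the `ascii` codec only accepts byte values below 128.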
- Let's say we have a plain string `'hello⺀'` (in Python 3 this is the same as `u'hello⺀'`). If we try to encode it to ASCII, it raises an error: `UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)`
- Let's say we have the string `'hello⺀'`. If we encode it to UTF-8, the output is `b'hello\xe2\xba\x80'`.
Last active
September 17, 2018 12:05