Unicode

The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually written in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains many tables listing characters and their corresponding code points.

Strictly speaking, U+12CA is a code point, which represents one particular character (in this case, ETHIOPIC SYLLABLE WI).
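
As a small illustration of the notation, the following sketch uses Python's built-in `chr` and `ord` and the standard `unicodedata` module to move between a character, its integer code point, and its U+ form. The variable names are only illustrative.

```python
import unicodedata

# Build the character from its code point, written here in base 16.
ch = chr(0x12CA)

code_point = ord(ch)               # back to the integer value
notation = f"U+{code_point:04X}"   # the U+XXXX form used by the standard

print(code_point)            # 4810
print(notation)              # U+12CA
print(unicodedata.name(ch))  # official character name from the Unicode database
```
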

A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented in memory as a sequence of bytes (meaning, values from 0 through 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.

A character is represented on a screen or on paper by a set of graphical elements that is called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details depend on the font being used.

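
To make the difference between code points and bytes concrete, here is a minimal sketch that lists both for the sample string used later in these notes. UTF-8 is chosen only as an example encoding.

```python
text = "hello⺀"

# One integer per character: the code points of the string.
code_points = [ord(ch) for ch in text]

# One integer per byte: the same string after encoding to UTF-8.
byte_values = list(text.encode("utf-8"))

print([hex(cp) for cp in code_points])
print(byte_values)
print(len(text), len(byte_values))  # 6 code points, 8 bytes
```

The last character is a single code point but occupies three bytes, which is why the two lengths differ.
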
The rules for converting a Unicode string into the ASCII encoding, for example, are simple; for each code point:
- If the code point is < 128, each byte is the same as the value of the code point.
- If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
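
The two rules above can be sketched by hand as follows. `encode_ascii` is a hypothetical helper written for illustration; it is not how Python's built-in codec is implemented, only a direct transcription of the rules.

```python
def encode_ascii(text: str) -> bytes:
    """Encode text to ASCII following the two rules above (illustrative only)."""
    out = bytearray()
    for pos, ch in enumerate(text):
        code_point = ord(ch)
        if code_point >= 128:
            # Rule 2: the string cannot be represented in ASCII.
            raise UnicodeEncodeError(
                "ascii", text, pos, pos + 1, "ordinal not in range(128)"
            )
        # Rule 1: the byte has the same value as the code point.
        out.append(code_point)
    return bytes(out)


print(encode_ascii("hello"))  # b'hello'
```

For valid input the helper should agree with the built-in `"hello".encode("ascii")`.
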

Suppose we have the Unicode string `u'hello'`. Encoding it to ASCII succeeds and produces `b'hello'`, because every code point is below 128.

Suppose instead we have the Unicode string `u'⺀'` (U+2E80). Encoding it to ASCII fails with the following error, because its code point is 128 or greater:
```
UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 0: ordinal not in range(128)
```
Suppose we have a Unicode string that mixes ASCII characters with a non-ASCII character, such as `u'hello⺀'`. Encoding it to ASCII also fails, and the reported position points at the first character that cannot be encoded:
```
UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)
```
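
The three cases above can be reproduced with the built-in `str.encode`. This sketch catches the exception rather than letting it propagate, so the failing position can be inspected; the `results` dictionary is just an illustrative way to collect the outcomes.

```python
results = {}

for text in ("hello", "⺀", "hello⺀"):
    try:
        results[text] = text.encode("ascii")
        print(repr(text), "->", results[text])
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        results[text] = exc.start
        print(repr(text), "failed at position", exc.start, "-", exc.reason)
```
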

Encoding the same Unicode string `u'hello⺀'` to UTF-8 succeeds, and the output is `b'hello\xe2\xba\x80'`.
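
A short sketch of the UTF-8 case, showing the bytes produced for each character. The per-character loop is added only to show where the three trailing bytes come from.

```python
text = "hello⺀"
encoded = text.encode("utf-8")
print(encoded)  # b'hello\xe2\xba\x80'

# Per-character view: ASCII characters take one byte each, U+2E80 takes three.
for ch in text:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8"))
```
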

Going the other way, suppose we have the bytes object `b'\xe2\xba\x80'`. Decoding it as ASCII fails, because the byte values are outside the ASCII range:
```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
```
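
The decoding failure can be reproduced and inspected in the same way as the encoding failure. In this sketch, `position` and `failed_byte` are illustrative names for the details carried by the exception.

```python
data = b"\xe2\xba\x80"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    position = exc.start        # index of the first byte that could not be decoded
    failed_byte = data[position]  # integer value of that byte
    print(f"byte 0x{failed_byte:02x} at position {position}: {exc.reason}")
```
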

Suppose we have the encoded bytes `b'hello\xe2\xba\x80'`. Decoding them as UTF-8 succeeds, and the output is `'hello⺀'`.
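
As a quick check, decoding and then re-encoding with the same codec should give back the original bytes. A minimal sketch:

```python
data = b"hello\xe2\xba\x80"

text = data.decode("utf-8")
print(text)  # hello⺀

# Encoding the decoded string again returns the bytes we started with.
round_trip = text.encode("utf-8")
print(round_trip == data)  # True
```
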

In Python 3 every `str` is a Unicode string and the `u` prefix is optional, so the same results hold for a plain string literal. Encoding `'hello⺀'` with ASCII raises:

```
UnicodeEncodeError: 'ascii' codec can't encode character '\u2e80' in position 5: ordinal not in range(128)
```

Encoding `'hello⺀'` with UTF-8 produces `b'hello\xe2\xba\x80'`.

pavan538/unicode_in_python.md