Created
June 17, 2019 06:03
Unicode and Character Sets (No Excuses!)
What Are a Character Set and an Encoding?
=========================================
Character encoding is the key concept in converting a stream of bytes into characters.
Two things are needed to convert bytes to characters: a character set and an encoding.
================= ####### ================= ####### ==================
In the early days of computing, there were different methods of representing characters.
ASCII was the code used for English text. It needs only 7 bits per character, although each character was stored in an 8-bit byte.
The numbers from 32 to 127 represent the printable characters (letters, digits, and punctuation).
For example, space is 32 and the letter A is 65.
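The ASCII numbering can be checked directly. This is a minimal sketch using Python's built-in `ord` and `chr`; the variable names `ascii_text` and `codes` are only illustrative.

```python
# ASCII assigns each character a number below 128.
print(ord(" "))  # 32
print(ord("A"))  # 65
print(chr(65))   # A

# Every character of an English word has a code that fits in 7 bits.
ascii_text = "Hello"
codes = [ord(ch) for ch in ascii_text]
print(codes)                        # [72, 101, 108, 108, 111]
print(all(c < 128 for c in codes))  # True
```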
The spare eighth bit left the values from 128 to 255 free, and other languages needed more characters than ASCII offered.
Each OEM character set used the numbers from 128 upward for its own purposes.
Eventually this was codified in the ANSI standard: everyone agreed that values below 128 were the same as ASCII.
Different countries then used the values from 128 upward for their own letters, in what were called code pages.
Asian languages, however, have thousands of characters, which do not fit in 8 bits.
This gave rise to DBCS, the double-byte character set, in which some characters are stored in one byte and others in two.
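Both problems can be reproduced with the legacy codecs that ship with Python. The sketch below is only an illustration: the code pages `cp437` (original IBM PC) and `cp862` (Hebrew DOS) and the double-byte encoding `shift_jis` are my choice of examples, not ones named in these notes.

```python
raw = bytes([0x82])  # a single byte with a value above 127

# The same byte decodes to a different letter under each code page.
as_us = raw.decode("cp437")      # Latin small letter e with acute
as_hebrew = raw.decode("cp862")  # Hebrew letter gimel
print(as_us, as_hebrew)

# A double-byte character set mixes 1-byte and 2-byte characters.
print(len("A".encode("shift_jis")))   # 1
print(len("日".encode("shift_jis")))  # 2
```

Without knowing which code page produced the byte, there is no way to tell which letter was meant.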
================= ####### ================= ####### ==================
Then the Unicode character set was developed.
In Unicode, a letter maps to a code point, which is still just a theoretical concept. A letter in Unicode is a platonic ideal, independent of how it is drawn or stored.
Each code point is written as a number like U+0639 (the Arabic letter Ain) or U+0041 (the English letter A).
For example, Hello corresponds to the code points
U+0048 U+0065 U+006C U+006C U+006F.
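The mapping from letters to code points can be listed with a few lines of Python. This is a small sketch; `code_points` is a hypothetical helper name, not a library function.

```python
def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points("Hello"))
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# The Arabic letter Ain, written as an escape to avoid bidirectional text.
print(code_points("\u0639"))  # ['U+0639']
```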
Then the brilliant UTF-8 encoding was invented.
UTF-8 is a system for storing a string of Unicode code points,
those magic U+ numbers, in memory using 8-bit bytes.
In UTF-8, every code point from 0-127 is stored in a single byte.
Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed up to 6 bytes, but the current standard, RFC 3629, limits UTF-8 to 4.)
Hello, which was U+0048 U+0065 U+006C U+006C U+006F,
will be stored as 48 65 6C 6C 6F,
which, behold! is the same as it was stored in ASCII, and ANSI,
and every OEM character set on the planet.
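The byte values quoted for Hello can be reproduced with `str.encode`. This is a minimal sketch and assumes Python 3.8 or later, where `bytes.hex` accepts a separator argument.

```python
hello = "Hello"
utf8_bytes = hello.encode("utf-8")

print(utf8_bytes.hex(" "))                  # 48 65 6c 6c 6f
print(utf8_bytes == hello.encode("ascii"))  # True

# A code point at 128 or above needs more than one byte.
print("é".encode("utf-8").hex(" "))  # c3 a9
```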
================= ####### ================= ####### ==================
UTF-7, UTF-8, UTF-16, and UTF-32 all have the nice property of being able to store any code point correctly.
UTF-8, UTF-16, and UTF-32 are different ways to store Unicode code points.
UTF-32 is a fixed-width encoding, while UTF-8 and UTF-16 are variable-length.
UTF-32: each code point takes 4 bytes.
UTF-16: each code point takes 2 or 4 bytes.
UTF-8 : each code point takes 1 to 4 bytes.
UTF-8 is backward compatible with ASCII, so English text looks exactly the same as in ASCII encoding.
That is, code points 0-127 are stored in a single byte, and code points 128 and above are stored in 2, 3, or 4 bytes.
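The sizes listed above can be compared side by side. In this sketch, `byte_sizes` and `ENCODINGS` are hypothetical names, and the little-endian codec variants are used so that Python does not prepend a byte order mark to the output.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_sizes(ch):
    """Return the encoded length in bytes of ch under each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", byte_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```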
Because the same bytes mean different things in different encodings, an application must state which encoding it uses.
For example, an HTML page declares its charset near the start of the <head> section, as seen below:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
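A short sketch of what goes wrong when the declared charset does not match the bytes actually sent. The sample word and the choice of `latin-1` as the mistaken encoding are assumptions made for illustration.

```python
data = "café".encode("utf-8")  # the bytes a server might send

print(data.decode("utf-8"))    # café  (correct charset)
print(data.decode("latin-1"))  # cafÃ©  (wrong charset, garbled text)
```

The bytes are identical in both cases; only the encoding used to interpret them differs.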
References:
===============
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html