junaidtk · June 17, 2019 06:03
diff --git a/Character set and Encoding b/Character set and Encoding
 What Are Character Sets and Encodings?
 ======================================

 Character encoding is a key concept in converting byte streams into characters.
 Two things matter when converting bytes to characters: the character set and the encoding.

 ================= ####### ================= ####### ==================

 In the early days of computing there were several different ways of representing characters.
 ASCII was used to represent English text. It is a 7-bit code, although each character was usually stored in an 8-bit byte.
 Codes 32 to 126 are the printable characters (letters, digits and punctuation); codes below 32 and code 127 are control characters.
 Space is 32 and the letter A is 65.
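The two values quoted above are easy to check. A minimal sketch in Python (the choice of language and the variable names are assumptions of this example, not part of the original notes), using the built-in `ord` and `chr`:

```python
# Look up the numeric code of a character, and go back again.
space_code = ord(" ")   # code for the space character
a_code = ord("A")       # code for the capital letter A

print(space_code, a_code)
print(chr(a_code))

# Every ASCII code fits in 7 bits, i.e. it is below 128.
ascii_text = "Hello, world"
all_seven_bit = all(ord(ch) < 128 for ch in ascii_text)
print(all_seven_bit)
```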

 To handle other languages, other encoding schemes were needed.
 In the OEM character sets, the values from 128 upward were used by each vendor for its own purposes.
 Eventually this was codified in the ANSI standard: everyone agreed that values below 128 are the same as ASCII,
 while different countries used the values from 128 upward for their own letters.
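To see why this was a problem, the same byte can be decoded under several legacy character sets. This is only a sketch: the three code pages chosen here (cp437, cp1252, cp866) are examples picked for illustration and are not named in the notes above.

```python
# One byte value above 127, interpreted under three legacy code pages.
raw = bytes([0xE9])

as_cp437 = raw.decode("cp437")    # original IBM PC (OEM) code page
as_cp1252 = raw.decode("cp1252")  # Windows Western European ("ANSI") code page
as_cp866 = raw.decode("cp866")    # DOS Cyrillic code page

# The same byte turns into three different characters.
print(as_cp437, as_cp1252, as_cp866)

# Below 128 all of them agree with ASCII.
low = b"Hello"
same_below_128 = (
    low.decode("cp437")
    == low.decode("cp1252")
    == low.decode("cp866")
    == low.decode("ascii")
)
print(same_below_128)
```

A file written on one machine and read on another with a different code page would therefore show the wrong letters for every byte above 127.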

 Many Asian languages have thousands of characters, which cannot fit in 8 bits.
 This gave rise to DBCS, the double-byte character set, in which some characters are stored in one byte and others in two.
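As an illustration of a double-byte character set, the sketch below uses Shift JIS, a Japanese encoding. Shift JIS is an assumption of this example (the notes do not name a specific DBCS); it is used here because Python ships a codec for it.

```python
# In a double-byte character set, some characters take one byte, others two.
latin = "A".encode("shift_jis")        # ASCII letter
kana = "\u3042".encode("shift_jis")    # HIRAGANA LETTER A

print(len(latin), len(kana))

# A mixed string has more bytes than characters, so byte offsets
# and character offsets no longer line up.
mixed_text = "A\u3042"
mixed = mixed_text.encode("shift_jis")
print(len(mixed_text), len(mixed))
```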

 ================= ####### ================= ####### ==================

 Then the Unicode character set was developed.
 In Unicode, each letter maps to a code point, which is a purely theoretical concept: the letter is a platonic ideal, independent of how it is stored.
 A code point is written as a number such as U+0639 (the Arabic letter Ain) or U+0041 (the letter A).

 For example, the string Hello corresponds to the code points
 U+0048 U+0065 U+006C U+006C U+006F.
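The mapping from characters to code points can be reproduced with a few lines of Python. The helper name `code_points` is hypothetical and exists only for this sketch.

```python
def code_points(text):
    """Return the Unicode code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

hello_points = code_points("Hello")
print(" ".join(hello_points))
```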


 Then the brilliant UTF-8 encoding was invented.
 UTF-8 is a system for storing a string of Unicode code points,
 those U+ numbers, in memory using 8-bit bytes.
 In UTF-8, every code point from 0 to 127 is stored in a single byte.
 Only code points 128 and above are stored using 2, 3, or 4 bytes.
 (The original design allowed up to 6 bytes, but RFC 3629 restricted UTF-8 to at most 4.)
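To make the byte counts concrete, here is a hand-written sketch of the UTF-8 bit layout. The function name `utf8_encode` is hypothetical, and in real code the built-in `str.encode("utf-8")` should be used instead; the sketch only shows how the bits of a code point are spread over 1 to 4 bytes.

```python
def utf8_encode(code_point):
    """Encode a single Unicode code point as UTF-8 bytes (illustrative only)."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

# U+0041 (A) takes one byte, U+0639 (Arabic letter Ain) takes two.
print(utf8_encode(0x41).hex(), utf8_encode(0x639).hex())
```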


 Hello, which was U+0048 U+0065 U+006C U+006C U+006F,
 is stored as 48 65 6C 6C 6F,
 which is exactly how it was stored in ASCII, in ANSI,
 and in every OEM character set on the planet.
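This claim can be checked directly. In the sketch below, cp1252 and cp437 stand in for "ANSI" and "OEM" character sets; that choice is an assumption of the example.

```python
text = "Hello"

# Encode as UTF-8 and show the bytes in hexadecimal.
utf8_bytes = text.encode("utf-8")
hex_dump = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_dump)

# Pure ASCII text produces the same bytes in the legacy encodings.
identical = all(
    text.encode(name) == utf8_bytes
    for name in ("ascii", "cp1252", "cp437")
)
print(identical)
```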

 ================= ####### ================= ####### ==================

 UTF-7, UTF-8, UTF-16, and UTF-32 can all store any code point correctly.
 UTF-8, UTF-16, and UTF-32 are different ways to store Unicode code points.
 UTF-32 is a fixed-width encoding, while UTF-8 and UTF-16 are variable-length.

 UTF-32: each code point takes 4 bytes.
 UTF-16: each code point takes 2 or 4 bytes.
 UTF-8 : each code point takes 1 to 4 bytes.
         UTF-8 is backward compatible with ASCII, so English text looks the same as in ASCII.
         That is, code points 0-127 are stored in a single byte and code points 128 and above in 2, 3, or 4 bytes.
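The sizes listed above can be measured for a few sample characters. The sample characters and the helper name `encoded_sizes` are choices made for this sketch. The `-le` codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

samples = [
    "A",            # U+0041, ASCII letter
    "\u00e9",       # U+00E9, e with acute accent
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, an emoji outside the 16-bit range
]

sizes = {ch: encoded_sizes(ch) for ch in samples}
for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):04X}  utf-8={u8}  utf-16={u16}  utf-32={u32}")
```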
        
        
 So an application needs to specify which encoding it uses.
 In HTML, the charset is declared near the start of the page, as shown below:
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
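What goes wrong when the declared encoding does not match the real one can be shown in a few lines. This is a sketch with an arbitrary sample word; cp1252 is used as the "wrong guess" purely for illustration.

```python
original = "caf\u00e9"   # "cafe" with an accented e

# The text is stored as UTF-8 bytes.
stored = original.encode("utf-8")

# Decoding with the encoding that was actually used gives the text back.
right = stored.decode("utf-8")

# Decoding with a different encoding silently produces garbage
# instead of raising an error.
wrong = stored.decode("cp1252")

print(right)
print(wrong)
```

The bytes are the same in both cases; only the stated encoding differs, which is why it has to be declared.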



 References
 ==========
 https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html