trevnorris · April 28, 2016 21:16
diff --git a/on-iso-8859-1.txt b/on-iso-8859-1.txt
 ASCII, the American Standard Code for Information Interchange (or the IANA
 preferred US-ASCII), is a 7-bit encoding. Work on it began in 1960 in the X3.2
 subcommittee of the ASA, the American Standards Association, and it was first
 published in 1963. The most recent update was in 1986, as the ANSI X3.4-1986
 standard [1]. It was also standardized as ISO/IEC 646:1991 [2] and published
 by ECMA as ECMA-6 [3].

 Since ASCII was first standardized there have been proprietary extensions that
 use the full 8-bit space. IBM was the first to do so, with its introduction of
 "code pages" (more commonly called character encodings). Those who've used
 MS-DOS will recognize CP437 [4]. Other companies, such as Apple, also
 developed their own character encodings.

 (discrepancy in timeline. ref https://github.com/nodejs/node-eps/pull/15#issuecomment-215564862
 and fix later).

 Amidst all of this, the ISO, the International Organization for
 Standardization, released the ISO 8859 standard describing its own 8-bit ASCII
 extensions. The most commonly used part is ISO 8859-1, which was first released
 in 1987 as ISO 8859-1:1987 [5] and later revised and released as ISO/IEC
 8859-1:1998 [6][7].

 Notice that in its code table the ranges 0x00-0x1F and 0x7F-0x9F are empty,
 because ISO/IEC 8859-1 only specifies graphical characters. Then in 1992 the
 IANA registered the character map ISO_8859-1:1987 [8], also referred to as
 ISO-8859-1 and aliased as latin1. This map assigns the C0 and C1 control
 characters [9] to the code points left unassigned in the original
 specification.

 ECMA-35 standardizes the layout of 8-bit character sets [10]. Specifically,
 the byte range 0x00-0x1F is identified as CL and used for the primary set of
 control characters, and 0x80-0x9F is identified as CR and is "either a
 supplementary set of control functions, or unused" (sec 8.1). The reason that
 each 7-bit page layout should be the same is outlined in section 9. Basically
 it's a description of how to invoke graphical characters, i.e. character
 glyphs, from a different 7-bit space.
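
 As a rough illustration of that layout, the sketch below classifies a byte
 value into the four areas of the 8-bit code table. The names GL and GR (the
 two graphic areas) come from ECMA-35 but are not used elsewhere in this
 document, and the helper name area_of is made up for this example.

```python
def area_of(byte: int) -> str:
    """Return the 8-bit code-table area name for a byte value (0-255)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0x00-0xFF")
    if byte <= 0x1F:
        return "CL"  # primary control functions (the C0 set)
    if byte <= 0x7F:
        return "GL"  # graphic characters; also holds the SPACE and DELETE positions
    if byte <= 0x9F:
        return "CR"  # supplementary control functions (the C1 set), or unused
    return "GR"      # graphic characters in the upper half


# Line feed, "A", a C1 position, and a Latin-1 letter position.
for sample in (0x0A, 0x41, 0x85, 0xE9):
    print(f"0x{sample:02X} -> {area_of(sample)}")
```

 Only the GR area differs between the parts of ISO 8859; the other three areas
 are shared, which is what makes the layout predictable.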

 The ISO also maintains the ISO/IEC 10646 standard alongside the Unicode
 Consortium. To verify the standard we'll first take a look at the official
 8859-1 to Unicode mapping [11], which shows basically a 1-to-1 map of character
 to Unicode code point. We can then refer to the official Unicode chart "C1
 Controls and Latin-1 Supplement" in version 8 [12]. Here we can see the C1
 control characters are assigned to the range 0x0080-0x009F.
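
 The 1-to-1 mapping can be checked with a few lines of Python. This is only a
 sketch, and it assumes that Python's built-in "latin-1" codec follows the IANA
 ISO-8859-1 character map, control characters included. The variable names are
 made up for this example.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decode all 256 bytes in one go with the built-in codec.
decoded = all_bytes.decode("latin-1")

# True only if each byte value equals the code point it decodes to.
identity = all(ord(ch) == b for b, ch in zip(all_bytes, decoded))
print(identity)

# Code points produced by the bytes 0x80 through 0x9F.
c1 = [f"U+{ord(ch):04X}" for ch in decoded[0x80:0xA0]]
print(c1[0], c1[-1])
```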

 Now the WHATWG has an encoding spec [13] stating that ISO-8859-1 is an alias
 for windows-1252. While not technically correct, as defined above, for all
 intents and purposes of the browser windows-1252 can be seen as a superset of
 ISO-8859-1, since the CR control area doesn't contain any graphical glyphs. In
 fact, if you look at the Unicode map [14] you can see that the unused codes in
 the CR range are simply undefined. This means windows-1252 is technically a
 superset of ISO/IEC 8859-1:1998, while using the C0 control set.
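
 To see where the two encodings actually differ, the sketch below walks the
 bytes 0x80-0x9F and records what Python's built-in "cp1252" codec does with
 each one. It assumes that codec tracks the vendor mapping in [14]; the
 function and variable names are made up for this example.

```python
def compare_high_controls():
    """Split the bytes 0x80-0x9F by how the cp1252 codec treats them."""
    remapped = {}   # byte value -> graphical character assigned by cp1252
    undefined = []  # byte values cp1252 leaves without any character
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        try:
            remapped[value] = raw.decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return remapped, undefined


remapped, undefined = compare_high_controls()
print(len(remapped), "bytes carry glyphs;", [hex(v) for v in undefined], "are undefined")

# Outside 0x80-0x9F the two codecs should agree byte for byte.
same_elsewhere = all(
    bytes([v]).decode("cp1252") == bytes([v]).decode("latin-1")
    for v in list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
)
print(same_elsewhere)
```

 Under latin-1 the same 32 bytes all decode to C1 control characters, so the
 only disagreement between the two encodings is confined to that range.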

 So for anyone planning on scraping the web, make sure you pay attention to the
 charset of the document and compare it against what browsers are doing. If the
 document is declared as latin1, it most likely isn't; browsers will parse it
 as windows-1252, so a scraper needs to do the same.
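
 A scraper could follow the browsers with something like the sketch below. It
 is not a full implementation of the WHATWG spec: the label set is only a
 partial list of the labels the spec maps to windows-1252, and the names
 WINDOWS_1252_LABELS, c1_passthrough and decode_like_browser are made up for
 this example. It also assumes that the five bytes Python's "cp1252" codec
 rejects should become the same-valued code points, which is my reading of the
 WHATWG index.

```python
import codecs

# Partial list of labels the WHATWG encoding spec resolves to windows-1252.
WINDOWS_1252_LABELS = {
    "ascii", "us-ascii", "iso-8859-1", "iso_8859-1:1987",
    "latin1", "l1", "cp1252", "windows-1252",
}


def _c1_passthrough(error):
    """Map bytes that cp1252 leaves undefined to same-valued code points."""
    if isinstance(error, UnicodeDecodeError):
        chunk = error.object[error.start:error.end]
        return "".join(chr(b) for b in chunk), error.end
    raise error


codecs.register_error("c1_passthrough", _c1_passthrough)


def decode_like_browser(body: bytes, declared_charset: str) -> str:
    """Decode body, treating latin1-style labels as windows-1252."""
    label = declared_charset.strip().lower()
    if label in WINDOWS_1252_LABELS:
        return body.decode("cp1252", errors="c1_passthrough")
    # Anything else goes to Python's own codec lookup, which raises
    # LookupError for labels it does not know.
    return body.decode(label)


page = b"\x93quoted\x94"
print(repr(decode_like_browser(page, "latin1")))  # curly quotes
print(repr(page.decode("latin-1")))               # C1 control characters
```

 The last two lines show the practical difference: the same bytes come out as
 curly quotes when handled the way a browser would, and as invisible control
 characters when the declared charset is taken at face value.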


    [1] http://sliderule.mraiow.com/w/images/7/73/ASCII.pdf
    [2] http://www.iso.org/iso/catalogue_detail.htm?csnumber=4777
    [3] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-006.pdf
    [4] http://www-01.ibm.com/software/globalization/cp/cp00437.html
    [5] http://www.iso.org/iso/catalogue_detail.htm?csnumber=16338
    [6] http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=28245
    [7] http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf
    [8] http://tools.ietf.org/html/rfc1345
    [9] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
    [10] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf
    [11] http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
    [12] http://unicode.org/charts/PDF/U0080.pdf
    [13] https://encoding.spec.whatwg.org/
    [14] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT