Little history and references on the evolution and existence of ISO-8859-1
ASCII, the American Standard Code for Information Interchange (or US-ASCII,
the name preferred by the IANA), is a 7-bit encoding developed by the X3.2
subcommittee of the ASA, the American Standards Association, and first
published in 1963 as ASA X3.4-1963. The most recent update is the ANSI
X3.4-1986 standard [1]. It was also standardized as ISO/IEC 646:1991 [2] and
by ECMA as ECMA-6 [3].

Since ASCII was first standardized there have been proprietary extensions that
use the full 8-bit space. First to do so was IBM, with its introduction of
"code pages" (more commonly called character encodings). Those who've used
MS-DOS will recognize CP437 [4]. Along with IBM, other companies such as Apple
also developed their own character encodings.

(Discrepancy in timeline. Ref
https://github.com/nodejs/node-eps/pull/15#issuecomment-215564862
and fix later.)

Amidst all of this, the ISO, International Organization for Standardization,
released the ISO 8859 standard describing its own 8-bit ASCII extensions. The
most commonly used part is ISO 8859-1, which was first released in 1987 as ISO
8859-1:1987 [5] and later revised and released as ISO/IEC 8859-1:1998 [6][7].
Notice that byte ranges 0x00-0x1F and 0x7F-0x9F are empty in its code table.
This is because ISO/IEC 8859-1 only specifies graphical characters. Then in
1992 the IANA registered the character map ISO_8859-1:1987 [8], also referred
to as ISO-8859-1 and aliased as latin1. This map assigns the C0 and C1 control
characters [9] to the code points left unassigned in the original
specification.

ECMA-35 standardizes the layout of 8-bit character sets [10]. Specifically,
byte range 0x00-0x1F is identified as CL and used for the primary set of
control characters, and 0x80-0x9F is identified as CR and is "either a
supplementary set of control functions, or unused" (sec 8.1). The reason that
each 7-bit page layout should be the same is outlined in section 9. Basically
it's a description of how to invoke graphical characters, i.e. character
glyphs, from a different 7-bit space.

The ISO also maintains the ISO/IEC 10646 standard alongside the Unicode
Consortium. To verify the mapping we'll first take a look at the official
8859-1 to Unicode mapping [11], which shows a 1-to-1 map of each byte value to
the Unicode code point with the same value. We can then refer to the official
Unicode chart "C1 Controls and Latin-1 Supplement" in version 8 [12]. Here we
can see that the C1 control characters occupy the range U+0080-U+009F.

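The 1-to-1 mapping can be checked quickly in Python. This is only a sketch: it
assumes Python's built-in "latin-1" codec, which follows the IANA ISO-8859-1
map (control characters included), and the variable names are illustrative.

```python
import unicodedata

# Decode every possible byte value with the IANA ISO-8859-1 map.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Collect any byte whose decoded code point differs from its own value.
mismatches = [b for b, ch in zip(all_bytes, decoded) if ord(ch) != b]

# General categories Unicode assigns to U+0080-U+009F ("Cc" means control).
c1_categories = {unicodedata.category(chr(cp)) for cp in range(0x80, 0xA0)}

print(mismatches)
print(c1_categories)
```

An empty mismatch list means every byte maps to the code point of the same
value, and a single "Cc" category means the whole 0x80-0x9F range is treated
as control characters.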
Now the WHATWG has an encoding spec [13] stating that ISO-8859-1 is a label
(alias) for windows-1252. While not technically correct, as shown above, for
all intents and purposes of the browser windows-1252 can be seen as a superset
of ISO-8859-1, since the CR control area doesn't contain any graphical glyphs.
In fact, if you look at the Unicode map [14] you can see that the unused codes
in the CR range are simply undefined. This means windows-1252 is technically a
superset of ISO/IEC 8859-1:1998 that keeps the C0 control set and fills most of
the CR range with graphical characters.

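To see where the two encodings actually diverge, here is a rough comparison
using Python's "latin-1" and "cp1252" codecs. The helper name `decode_cp1252`
and the list names are made up for this sketch. It also relies on Python's
cp1252 codec raising an error for the bytes left undefined in [14]; the WHATWG
spec instead maps those bytes to the matching C1 control code points.

```python
def decode_cp1252(byte_value):
    """Return the cp1252 character for one byte, or None if undefined."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None

differing = []   # bytes that decode to different characters
undefined = []   # bytes with no cp1252 mapping in Python's codec
for byte_value in range(256):
    latin1_char = bytes([byte_value]).decode("latin-1")
    cp1252_char = decode_cp1252(byte_value)
    if cp1252_char is None:
        undefined.append(byte_value)
    elif cp1252_char != latin1_char:
        differing.append(byte_value)

print([hex(b) for b in undefined])
print(len(differing), hex(min(differing)), hex(max(differing)))
```

Every byte reported by either list falls inside 0x80-0x9F, so outside the CR
range the two encodings agree.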
So for anyone planning on scraping the web, make sure you pay attention to the
declared charset of the document and compare it against what browsers actually
do. If the document is declared as latin1, it most likely isn't, and browsers
will decode it as windows-1252 regardless. To match them, parse the document as
windows-1252.

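For a scraper, that advice might look something like the following. It is a
minimal sketch, not a full implementation of the WHATWG spec [13]: the function
`decode_like_browser` and the constant `WINDOWS_1252_LABELS` are hypothetical
names, the label set is only a subset of what the spec lists, and undefined
bytes become U+FFFD here rather than C1 controls as in browsers.

```python
# Subset of the labels the WHATWG Encoding spec maps to windows-1252.
WINDOWS_1252_LABELS = {
    "iso-8859-1", "iso_8859-1", "latin1", "l1",
    "us-ascii", "ascii", "cp1252", "windows-1252",
}

def decode_like_browser(body: bytes, declared: str) -> str:
    """Decode body, treating latin1-style labels as windows-1252."""
    label = declared.strip().lower()
    codec = "cp1252" if label in WINDOWS_1252_LABELS else label
    # errors="replace" turns undecodable bytes into U+FFFD (an assumption
    # of this sketch; browsers map undefined cp1252 bytes to C1 controls).
    return body.decode(codec, errors="replace")

# Bytes as they commonly appear in pages declared as latin1.
sample = b"\x93caf\xe9\x94"
as_declared = sample.decode("latin-1")             # C1 controls around "café"
as_browser = decode_like_browser(sample, "latin1")  # curly quotes around "café"

print(repr(as_declared))
print(repr(as_browser))
```

Decoding strictly as ISO-8859-1 turns 0x93 and 0x94 into invisible C1 control
characters, while the windows-1252 interpretation yields the curly quotes the
page author almost certainly intended.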
[1] http://sliderule.mraiow.com/w/images/7/73/ASCII.pdf
[2] http://www.iso.org/iso/catalogue_detail.htm?csnumber=4777
[3] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-006.pdf
[4] http://www-01.ibm.com/software/globalization/cp/cp00437.html
[5] http://www.iso.org/iso/catalogue_detail.htm?csnumber=16338
[6] http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=28245
[7] http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf
[8] http://tools.ietf.org/html/rfc1345
[9] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
[10] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf
[11] http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
[12] http://unicode.org/charts/PDF/U0080.pdf
[13] https://encoding.spec.whatwg.org/
[14] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT