@trevnorris
Last active April 28, 2016 21:16
Little history and references on the evolution and existence of ISO-8859-1
ASCII, the American Standard Code for Information Interchange (or US-ASCII, the
name preferred by the IANA), is a 7-bit encoding first standardized in 1963 as
ASA X3.4-1963, the work of the X3.2 subcommittee of the ASA, the American
Standards Association. The most recent update was in 1986, as the ANSI
X3.4-1986 standard [1]. The same code is also standardized as ISO/IEC 646:1991
[2] and by ECMA as ECMA-6 [3].
Since ASCII was first standardized there have been proprietary extensions that
use the full 8-bit space. IBM was the first to do so, with its introduction of
"code pages" (more commonly called character encodings). Those who've used
MS-DOS will recognize CP437 [4]. Along with IBM, other companies such as Apple
also developed their own character encodings.
(discrepancy in timeline. ref https://github.com/nodejs/node-eps/pull/15#issuecomment-215564862
and fix later).
Amidst all of this, the ISO, the International Organization for
Standardization, released the ISO 8859 standard describing its own 8-bit ASCII
extensions. The most commonly used part is ISO 8859-1, which was first released
in 1987 as ISO 8859-1:1987 [5] and later revised and released as ISO/IEC
8859-1:1998 [6][7]. Notice that in its code table the character ranges
0x00-0x1F and 0x7F-0x9F are empty. This is because ISO/IEC 8859-1 specifies
only graphic characters. Then in 1992 the IANA registered the character map
ISO_8859-1:1987 [8], also referred to as ISO-8859-1 and aliased as latin1. This
map assigns the C0 and C1 control characters [9] to the code points left
unassigned in the original specification.
ECMA-35 standardizes the layout of 8-bit character sets [10]. Specifically,
byte range 0x00-0x1F is identified as CL and is used for the primary set of
control characters, and 0x80-0x9F is identified as CR and is "either a
supplementary set of control functions, or unused" (sec 8.1). The reason that
each 7-bit page layout should be the same is outlined in section 9. Basically
it's a description of how to invoke graphic characters, i.e. character glyphs,
from a different 7-bit space.
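
The byte ranges above can be sketched as a small classifier. This is only an
illustration: the function name ecma35_area is hypothetical, and the GL
(0x20-0x7F) and GR (0xA0-0xFF) area names are taken from ECMA-35 itself rather
than from the text above.

```python
def ecma35_area(byte: int) -> str:
    """Return the ECMA-35 8-bit code area that a byte value falls in.

    Hypothetical helper for illustration only.
    """
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("expected a value in 0x00-0xFF")
    if byte <= 0x1F:
        return "CL"  # primary control set (C0)
    if byte <= 0x7F:
        # Graphic area, left half. 0x20 (SPACE) and 0x7F (DELETE) sit at
        # its edges and have special roles.
        return "GL"
    if byte <= 0x9F:
        return "CR"  # supplementary control set (C1), or unused
    return "GR"      # graphic area, right half


if __name__ == "__main__":
    for value in (0x1B, 0x41, 0x85, 0xE9):
        print(hex(value), ecma35_area(value))
```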
The ISO also maintains the ISO/IEC 10646 standard alongside the Unicode
Consortium. To verify the control character assignments we'll first take a look
at the official 8859-1 to Unicode mapping [11], which shows basically a 1-to-1
map of character to Unicode code point. We can then refer to the "C1 Controls
and Latin-1 Supplement" chart of the official Unicode standard, version 8 [12].
Here we can see that the C1 control characters occupy the range 0x0080-0x009F.
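
The 1-to-1 mapping can be checked quickly. The sketch below assumes Python's
built-in "latin-1" codec, which decodes all 256 byte values, and the standard
unicodedata module; the variable names are only illustrative.

```python
import unicodedata

# Decode every possible byte value with Python's "latin-1" codec.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
is_identity = code_points == list(range(256))

# Bytes 0x80-0x9F land on U+0080-U+009F, whose general category is
# "Cc" (control), matching the C1 Controls block of the Unicode chart.
c1_categories = {unicodedata.category(ch) for ch in decoded[0x80:0xA0]}

print(is_identity)    # True
print(c1_categories)  # {'Cc'}
```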
Now the WHATWG has an encoding spec [13] stating that ISO-8859-1 is an alias
for windows-1252. While not technically correct, as defined above, for all
intents and purposes of the browser windows-1252 can be seen as a superset
of ISO-8859-1, since the CR control area doesn't contain any graphical glyphs.
In fact, if you look at the Unicode map [14] you can see that unused codes in
the CR range are simply undefined. This means windows-1252 is technically a
superset of ISO/IEC 8859-1:1998, while using the C0 control set.
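
To see where the two encodings actually diverge, here is a rough sketch that
decodes each byte value with Python's "latin-1" and "cp1252" codecs and
records the differences. The function name differing_bytes is hypothetical.
Note that Python's cp1252 codec follows the vendor mapping [14] and rejects
the undefined bytes, whereas the WHATWG spec [13] maps them to the matching C1
code points instead.

```python
def differing_bytes() -> dict:
    """Map each byte value that decodes differently to its cp1252 result.

    A value of None means the byte is undefined in Python's cp1252 codec.
    """
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # undefined in the vendor mapping
        if cp1252 != latin1:
            diffs[value] = cp1252
    return diffs


diffs = differing_bytes()
undefined = [value for value, ch in diffs.items() if ch is None]

# Every difference falls inside the CR range 0x80-0x9F.
print(all(0x80 <= value <= 0x9F for value in diffs))
print([hex(value) for value in undefined])
```

Outside 0x80-0x9F the two codecs agree byte for byte, which is why treating
latin1 as windows-1252 is harmless for documents that contain no C1 controls.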
So for anyone planning on scraping the web, make sure you pay attention to the
charset of the document and compare it against what browsers are doing. If the
document is specified as latin1, it most likely isn't. The document would need
to be parsed as windows-1252 instead.
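
For a scraper, that advice might look like the sketch below. It is a minimal
illustration, not a full implementation of the WHATWG label lookup: the names
WINDOWS_1252_LABELS and decode_like_browser are hypothetical, and the label set
is only a small subset of the labels the encoding spec [13] assigns to
windows-1252.

```python
# Partial, illustrative subset of labels that browsers treat as windows-1252.
WINDOWS_1252_LABELS = {
    "latin1",
    "l1",
    "iso-8859-1",
    "iso_8859-1",
    "iso8859-1",
    "us-ascii",
    "ascii",
    "cp1252",
    "windows-1252",
}


def decode_like_browser(body: bytes, declared_charset: str) -> str:
    """Decode a response body, treating latin1-style labels as windows-1252."""
    label = declared_charset.strip().lower()
    codec = "cp1252" if label in WINDOWS_1252_LABELS else label
    # Assumption: bytes undefined in Python's cp1252 codec become U+FFFD
    # here; browsers map them to C1 control code points instead.
    return body.decode(codec, errors="replace")


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotes in windows-1252 but C1 controls in
    # ISO-8859-1.
    print(decode_like_browser(b"\x93quoted\x94", "ISO-8859-1"))
```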
[1] http://sliderule.mraiow.com/w/images/7/73/ASCII.pdf
[2] http://www.iso.org/iso/catalogue_detail.htm?csnumber=4777
[3] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-006.pdf
[4] http://www-01.ibm.com/software/globalization/cp/cp00437.html
[5] http://www.iso.org/iso/catalogue_detail.htm?csnumber=16338
[6] http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=28245
[7] http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf
[8] http://tools.ietf.org/html/rfc1345
[9] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf
[10] http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf
[11] http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
[12] http://unicode.org/charts/PDF/U0080.pdf
[13] https://encoding.spec.whatwg.org/
[14] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT