I figured it out. And learned a valuable lesson: in exploratory data analysis work, rule number one is "Read the damn docs. Carefully."
North Carolina's State Board of Elections makes a lot of useful data public. This is probably one of the best ways to ensure electoral integrity, and it gives all citizens the ability to see and use data they've already paid for. The alternative is a relatively closed system, like in many northeastern states, where only big-money interests and their clients, like the major parties, have access to anything close to raw data.
Historical voter registration snapshots are taken at every election, so that analysts and historians can see what the state of the electorate was for each vote.
The state also provides weekly running updates of current voter registrations.
Here's where things get interesting. Currently, those weekly updates are encoded in Latin 1 (ISO-8859-1). But the historical files are encoded in UTF-16LE. It's no secret: the handy data layout file that accompanies each dataset clearly states what encoding is used.
This is the layout of the current weekly updates:
/***********************************************************************************
* name: layout_ncvoter.txt
* purpose: Contains all legally available voter specific information. Personal
* identifying information (PII) such as birth date and drivers license
* number are not included. Voter registrations with a voter_status_desc
* of 'Removed' are omitted whenever the most recent last voted date is
* greater than 10 years.
* This is a weekly point-in-time snapshot current per file date and time.
* updated: 02/11/2022
* format: tab delimited
* instructions:
* 1) extract using a file archiving and compression program (eg. WinZip)
* 2) can be linked to ncvhis file by ncid
***********************************************************************************/
This is from the layout file for the historical snapshot files:
/* *******************************************************************************
* name: layout_VR_Snapshot.txt
* purpose: Layout for the VR_SNAPSHOT_YYYYMMDD file. This file contains a denormalized
* point-in-time snapshot of information for active and inactive voters
* as-well-as removed voters going back for a period of ten years.
* updated: 02/15/2021
* format: tab delimited column names in first row
* encoding: UTF-16 LE
******************************************************************************* */
Oh wait, the current voter layout file doesn't actually say how it's encoded. Huh.
Anyway, back to the snapshot files, which I needed to access for a research project. As I wrote, these are encoded in UTF-16LE. The "LE" stands for "Little Endian". It refers to the order in which the two bytes of each 16-bit "word" are stored: least significant byte first. A BOM (Byte Order Mark) at the very start of the file can announce which ordering is in use.
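To make that concrete, here's a quick sketch using nothing but Python's built-in codecs (the two-letter string is just an example I picked):

# The same two characters, stored in each byte order.
text = "NC"
print(text.encode("utf-16le"))  # b'N\x00C\x00'  -> low byte of each 16-bit unit first
print(text.encode("utf-16be"))  # b'\x00N\x00C'  -> high byte first
print(text.encode("utf-16"))    # on a little-endian machine: b'\xff\xfeN\x00C\x00'
                                # the leading \xff\xfe is the BOM saying "little endian"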
So, the Board's current working format for its weekly voter registration updates is Latin 1 (ISO-8859-1), which doesn't do most "special characters". Western European accents like the umlaut fit, but you can't store a Chinese or Russian name in Latin 1. Or emojis. For some reason emojis have been someone's priority since the beginning. F-ing emojis. The demand for characters beyond plain ASCII produced a multiplicity of competing character sets among vendors: IBM had theirs, Microsoft had theirs, Adobe, Sun and the rest all had theirs. Starting in the 90s, Sun made a play to get everyone on UTF-8. It was simple, backward compatible with ASCII, and handled whatever you threw at it. But it wasn't invented by IBM or Microsoft (or Sun's Java team), so there was friction (to this day a US Windows console still defaults to IBM Code Page 437, while Windows applications default to Microsoft's own Code Page 1252).
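You can see those limits in action with a minimal sketch (the names here are invented for illustration):

name = "Müller"                  # Western European accents fit in Latin 1
print(name.encode("latin-1"))    # b'M\xfcller'
try:
    "Ольга 李".encode("latin-1")  # Cyrillic and Chinese characters don't
except UnicodeEncodeError as err:
    print(err)                   # 'latin-1' codec can't encode characters ...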
UTF-16 was "better" in that it too could express the full Unicode character set, but two competing byte orders emerged: Big Endian (BE) and Little Endian (LE). Sun's Java team and Microsoft pushed UTF-16 (Sun's operating system and server teams pushed UTF-8). At some point the architects threw up their hands and declared that it didn't matter: everyone just had to rewrite their software to detect the endianness (yes, that's a real word) of the encoding, and all would be well. Well, surprise, surprise, that didn't happen. Even much new software can't tell the difference, which is why the file command on my Linux box kept telling me the state's UTF-16 encoded file was "data" (binary) and stopped there, leaving me to guess what that meant.
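If you'd rather not guess, the BOM check a detection tool performs is simple enough to write yourself. A rough sketch (the function name and the fallback are my own choices):

# Peek at the first two bytes of a file for a UTF-16 byte order mark.
def sniff_bom(path):
    with open(path, "rb") as f:
        head = f.read(2)
    if head == b"\xff\xfe":
        return "utf-16le"  # BOM says little endian
    if head == b"\xfe\xff":
        return "utf-16be"  # BOM says big endian
    return None            # no BOM: the raw bytes alone won't tell you

# May well print None: a file shipped without a BOM is exactly why
# tools shrug and call it "data".
print(sniff_bom("data/VR_Snapshot_20201103.txt"))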
As for what difference it all makes in the practical task of reading in the data: with Python pandas, for example, here it is.

import pandas as pd

# Wrong: treating the snapshot like the weekly files reads each UTF-16
# character as a pair of Latin 1 bytes and garbles every value
dfc = pd.read_csv('data/VR_Snapshot_20201103.txt', encoding='ISO-8859-1', sep='\t')

# Right: the snapshot layout file says UTF-16 LE
dfc = pd.read_csv('data/VR_Snapshot_20201103.txt', encoding='utf-16le', sep='\t')
That's it.
References:
"Voter Registration Data". NCSBOE, https://www.ncsbe.gov/results-data/voter-registration-data/.
"List of Standard Encodings". The Python Standard Library, https://docs.python.org/3/library/codecs.html#standard-encodings.
"pandas.read_csv". Pandas API Reference, https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.
"file". _Ubuntu Manuals, https://manpages.ubuntu.com/manpages/jammy/man1/file.1.html.
Jimmy Zhang. "A Beginner-Friendly Guide to Unicode in Python". freeCodeCamp, 18 July 2018, https://www.freecodecamp.org/news/a-beginner-friendly-guide-to-unicode-d6d45a903515/.
Kealan Parr. "What is Endianness? Big-Endian vs Little-Endian Explained with Examples". freeCodeCamp, 1 February 2021, https://www.freecodecamp.org/news/what-is-endianness-big-endian-vs-little-endian/.
The above all last accessed on 17 August 2022.