Skip to content

Instantly share code, notes, and snippets.

@NicholasBallard
Created October 24, 2019 18:21
Show Gist options
  • Save NicholasBallard/8256171078874d67ba45b8f4b984605b to your computer and use it in GitHub Desktop.
Save NicholasBallard/8256171078874d67ba45b8f4b984605b to your computer and use it in GitHub Desktop.

The team's machine is assuming a default encoding when opening the CSV in an editor, then on save is replacing non-ASCII characters with ? placeholders.

Explanation: Emojis are 2 4-bit characters... 2*4 = 8 bits, that's why it's UTF-8, meaning Unicode Text Format 8-bit. It's a newer format, machines will often assume ASCII characters (a-zA-Z0-9!@#$%^&*()_+, etc.)

Here is the Excel program on my machine defaulting to an ASCII format when opening a CSV:

image

Since I'm not opening the CSV file in a text editor defaulting to UTF-8 support, the program will have an algorithm to replace sequences of bits it doesn't understand. In the case of emojis, like the smiley face 😀, what the machine sees is actually the tear-jerking sequence of bits as follows: \U0001f604, which is actually two blocks of sequences (that make up the UTF-8 character). If I were to open the text file in a program that expects UTF-8 characters by default, like Visual Studio Code, it would display the emojis.

image (1)

A program like Excel or LibreOffice will not:

image (2)

But can be set to interpret UTF-8 characters:

image (3)

My computer interpreting a Unicode string as an emoji:

image (4)

To prevent having to check that the different programs across teams are interpreting and saving files correctly (meaning there is no replacement algorithm at work, like the one LibreOffice uses to change bit sequences it doesn't understand to ?), using the Excel XLSX file format will preserve encoding, because it is a binary file with XML-like structure when decoded.

To conclude, having an XLSX download feature will save a lot of ambiguity moving data between teams. In the meantime, importing text files and saving them carefully with the spreadsheet application will prevent changing the text contents of the files.

Source file is downloaded from the Trinity dashboard page:

image (5)

Unzipped then opened from the CSV file:

image (6)

When imported correctly into a spreadsheet program, preserves emoji encoding:

image (7)

Specifically how the team can do this, is by using Excel's PowerQuery editor to change the default encoding to UTF-8:

image (8)

Excel>Data>Get Data>From File>From Text/CSV>File Origin>65001: Unicode (UTF-8)>Load

💾

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment