Skip to content

Instantly share code, notes, and snippets.

@NicholasBallard
Last active October 25, 2019 19:39
Show Gist options
  • Save NicholasBallard/10c966aee5c5c7a5ce78797d7a463b87 to your computer and use it in GitHub Desktop.
Save NicholasBallard/10c966aee5c5c7a5ce78797d7a463b87 to your computer and use it in GitHub Desktop.
How to import text into spreadsheet programs without changing the underlying bytes of the file.

Directions for Saving Files from the Oracle Bronto Portal

In the Bronto portal: clipboard

File saved to computer as a text file (csv): clipboard

We can see the encoding is correctly UTF-8 and the non-ASCII characters (the emojis) are preserved:

clipboard

This is where the problems come in.

If we choose to open the file in a spreadsheet editing program, like Excel, LibreOffice, Apache Open Office, WPS Office, etc., then the two-step process of

  1. opening the file with an assumed encoding, and
  2. saving the file to disk

will lock us in to the character selection the software chooses. Not our choice.

When opening any text file, there is no protection for the text character encoding like there is with binary. (This is because the bit interpretation is ambiguous. Bring this up at parties if you want to be popular.)

On opening a text file in a spreadsheet editor, the program makes a guess as to how to interpret the bits for our human eyes to read.

On LibreOffice.: clipboard

See how it messes up the emojis? It replaces at UTF-8 symbol like 😥 with ASCII symbols like 📞. That is because in the default encoding for this text file, there is no such thing as emojis. So it puts characters it does have in its language that closest match the bytes we're giving it.

Now when we save the file in non-text format (ie. xlsx), and open it back up, we can see the text embedded in the file is changed. This is permanent and cannot be converted or undone with any degree of accuracy.

clipboard

And it doesn't matter the program, the actual bytes stored in the file are altered:

clipboard

Even stripping out the XML-like data structure of the binary file and converting back to CSV doesn't help:

clipboard

clipboard

This is UTF-8 encoding, but the original bit sequence of the text file is changed. Meaning the computer is interpreting "right" the wrong bits.

clipboard


What the !@$#%&^%$ was the point of all that?!

To summarize all this with two lines of text, of course.

  1. Import CSV or any text data into a spreadsheet program using UTF-8 encoding.
  2. When saving back to text, like .csv, ensure the encoding is UTF-8 text encoding.

Here is how to do that on Excel: https://gist.github.com/NicholasBallard/8256171078874d67ba45b8f4b984605b

It's even easier in LibreOffice, where the initial prompt screen asks what encoding you want to use. Select UTF-8. Usually it defaults to ASCII (because of the byte-order marker, or BOM, in the text file you cannot see), so change it:

clipboard

Thanks for encoding your text correctly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment