File saved to computer as a text file (csv):
We can see the encoding is correctly UTF-8 and the non-ASCII characters (the emojis) are preserved:
This is where the problems come in.
If we choose to open the file in a spreadsheet editing program, like Excel, LibreOffice, Apache Open Office, WPS Office, etc., then the two-step process of
- opening the file with an assumed encoding, and
- saving the file to disk
will lock us in to the character selection the software chooses. Not our choice.
When opening any text file, there is no protection for the text character encoding like there is with binary. (This is because the bit interpretation is ambiguous. Bring this up at parties if you want to be popular.)
On opening a text file in a spreadsheet editor, the program makes a guess as to how to interpret the bits for our human eyes to read.
See how it messes up the emojis? It replaces at UTF-8 symbol like 😥 with ASCII symbols like 📞
. That is because in the default encoding for this text file, there is no such thing as emojis. So it puts characters it does have in its language that closest match the bytes we're giving it.
Now when we save the file in non-text format (ie. xlsx
), and open it back up, we can see the text embedded in the file is changed. This is permanent and cannot be converted or undone with any degree of accuracy.
And it doesn't matter the program, the actual bytes stored in the file are altered:
Even stripping out the XML-like data structure of the binary file and converting back to CSV doesn't help:
This is UTF-8 encoding, but the original bit sequence of the text file is changed. Meaning the computer is interpreting "right" the wrong bits.
To summarize all this with two lines of text, of course.
- Import CSV or any text data into a spreadsheet program using UTF-8 encoding.
- When saving back to text, like
.csv
, ensure the encoding is UTF-8 text encoding.
Here is how to do that on Excel: https://gist.github.com/NicholasBallard/8256171078874d67ba45b8f4b984605b
It's even easier in LibreOffice, where the initial prompt screen asks what encoding you want to use. Select UTF-8. Usually it defaults to ASCII (because of the byte-order marker, or BOM, in the text file you cannot see), so change it:
Thanks for encoding your text correctly!