@cjolly
Last active April 30, 2017 09:48
How to identify malformed characters or illegal byte sequence in files

Legacy Data

When dealing with legacy data, it's common to run into malformed or illegal byte sequences in files. Tracking down the bytes causing the issue is often really difficult, especially when the file has thousands of rows.

Here's a trick I pretty much stumbled upon:

nl file.txt | sort

sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `  9009\tThis is a line without errors\r' and `  9010\tLine\222s got strange chars\r'.

nl adds line numbers to the file, and sort blows up while comparing the line that contains the malformed characters. The error message quotes the two numbered lines it was comparing, so it contains the line number in question, in this case 9010. As the message itself hints, this relies on sort running in a locale other than C.

At the very least this should give you a good starting point for troubleshooting upstream.
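
If sort doesn't fail in your locale, a more explicit check is possible. The following is a minimal sketch, assuming the file is supposed to be UTF-8 and that iconv is installed; find_bad_lines is a hypothetical helper name, not something from nl or sort. It feeds each line to iconv separately and prints the numbers of the lines iconv rejects.

```shell
#!/bin/sh
# find_bad_lines FILE: print the number of every line that is not valid UTF-8.
# Hypothetical helper; assumes the file is meant to be UTF-8.
find_bad_lines() {
  n=0
  # The "|| [ -n "$line" ]" part also handles a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # iconv exits non-zero when the input is not valid in the source encoding.
    if ! printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$n"
    fi
  done < "$1"
}
```

Calling find_bad_lines file.txt prints one line number per offending line. It starts one iconv process per line, so it is slow on very large files, but it doesn't depend on how sort behaves in the current locale.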

cjolly commented Aug 5, 2014

You can also run iconv file.txt, which reports the line and position of the character sequence that's causing issues.

iconv: file.txt:8904:209: cannot convert
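
Without -f, iconv assumes the input is in the current locale's encoding, so the result can differ between machines. Naming the expected encoding makes the check repeatable. This is a rough sketch: check_encoding is a hypothetical helper name, and the default of UTF-8 is an assumption about what the file should contain.

```shell
#!/bin/sh
# check_encoding FILE [ENCODING]: succeed only if FILE decodes cleanly as
# ENCODING (UTF-8 when omitted). Hypothetical helper, not an iconv feature.
check_encoding() {
  # The converted text is discarded; only the exit status matters.
  # iconv's own error message still goes to stderr.
  iconv -f "${2:-UTF-8}" -t UTF-8 "$1" >/dev/null
}
```

For example, check_encoding file.txt || echo "file.txt is not valid UTF-8". The wording of iconv's error message varies between implementations, so the sketch relies only on the exit status.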

yamoinza commented Feb 1, 2017

The nl tip didn't work for me. With nl in the pipeline, the error from sort seems to stop appearing.
iconv does report ":3:157: cannot convert" on "pequeño" though, which seems strange.