Created March 2, 2020 19:42
garbled_domains
2019-02-13,domains,https://uk.godaddy.com/domains,0.0,1.0,0.0,154.0,https://uk.godaddy.com/,worldwide,desktop, |
These are valid UTF-8 characters (\u2229\u2557\u2510), not a BOM. It looks like a bug upstream. The easiest way to discard them is to run iconv -c with us-ascii as the target encoding; the -c flag drops any character that cannot be converted.

If other columns do have Unicode, you'd probably want to strip the non-ASCII characters from that column only (assuming Unicode domains are punycoded), either in the source query or in an ETL step. If there is a mix of punycoded and UTF-8 domain names, then I'd write a Python script to normalize them.
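A minimal sketch of what such a normalization script might look like, using only the Python standard library. The helper names (`strip_non_ascii`, `normalize_domain`) and the `STRAY` constant are hypothetical, and the sketch assumes the stray characters are exactly the three code points quoted above and that every domain should end up in punycoded ASCII form. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so treat it as a starting point rather than a finished ETL step.

```python
# Hypothetical helpers for cleaning a domain column; names are illustrative only.

# The three stray characters quoted in the comment above.
STRAY = "\u2229\u2557\u2510"


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (similar in effect to iconv -c targeting us-ascii)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def normalize_domain(domain: str) -> str:
    """Return a lowercase, punycoded ASCII form of a domain.

    Accepts domains that are already punycoded as well as UTF-8 ones.
    Assumption: the stray sequence should simply be removed wherever it occurs.
    """
    cleaned = domain.replace(STRAY, "").strip().lower()
    # The built-in "idna" codec leaves pure-ASCII labels unchanged and
    # punycodes labels that contain non-ASCII characters.
    return cleaned.encode("idna").decode("ascii")


if __name__ == "__main__":
    for raw in ["\u2229\u2557\u2510M\u00fcnchen.de", "xn--mnchen-3ya.de", "uk.godaddy.com"]:
        print(normalize_domain(raw))
```

Use `strip_non_ascii` only on columns that are expected to be pure ASCII; on a column holding UTF-8 domain names it would silently delete the non-ASCII letters, which is why `normalize_domain` punycodes them instead.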