Created
January 25, 2016 22:40
possible encoding problem with ipfs-api
# run this with LC_ALL and LANG env vars set to "en_US.utf8"
import ipfsApi
utf8_filename = u"clich\xe9.txt".encode('utf8')
with open(utf8_filename, "w") as f:
    f.write("this is just a test")
c = ipfsApi.Client('127.0.0.1', 5001)
response = c.add(utf8_filename)
# note unicode strings in the response
print response
print utf8_filename
print response['Name']
# why is this False?
# it also displays a warning: UnicodeWarning: Unicode unequal comparison
# failed to convert both arguments to Unicode - interpreting them as being unequal
print response['Name'] == utf8_filename
# prints True: why is this necessary?
print response['Name'].encode('latin-1') == utf8_filename
print "done."
`utf8_filename` is just a byte string: Python doesn't remember where it came from or what the bytes meant. So when it's asked to compare it with a Unicode string, it doesn't know which encoding to use to convert the bytes into Unicode (Python 2 falls back to its default ASCII codec, which cannot decode the non-ASCII bytes). Hence it displays a warning and returns False from the comparison.

In the second comparison you avoid the problem by explicitly converting the Unicode string into bytes (using Latin-1), and there's no problem comparing byte strings.

They should not, however, compare equal: UTF-8 and Latin-1 will encode any non-ASCII character to different sequences of bytes (as long as it's in Latin-1's repertoire). It would seem that `response['Name']` has been corrupted somewhere, by the bytes being decoded using Latin-1 instead of UTF-8, making it `u'clich\xc3\xa9.txt'` instead.
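The suspected corruption can be reproduced without an IPFS daemon. The following is only a sketch: it simulates the mis-decoding by hand rather than calling `ipfsApi`, so it shows what the symptom looks like, not where in the library it happens. The helper name `repair_mojibake` is hypothetical and introduced purely for illustration.

```python
# -*- coding: utf-8 -*-
# Simulates the hypothesised bug: UTF-8 bytes decoded as Latin-1.
# Runs unchanged on Python 2 and Python 3; does not touch ipfsApi.

original = u"clich\xe9.txt"

# What the gist passes to c.add(): the UTF-8 encoded filename.
utf8_filename = original.encode("utf-8")

# Assumption: somewhere the bytes are decoded with the wrong codec.
mojibake = utf8_filename.decode("latin-1")


def repair_mojibake(text):
    """Hypothetical helper: undo a Latin-1 mis-decoding of UTF-8 bytes."""
    return text.encode("latin-1").decode("utf-8")


# The corrupted name has one extra character and differs from the original.
assert len(mojibake) == len(original) + 1
assert mojibake != original

# Re-encoding with Latin-1 recovers the UTF-8 bytes exactly, which is
# why the gist's second comparison comes out True.
assert mojibake.encode("latin-1") == utf8_filename

# Encoding the *correct* Unicode string as Latin-1 would not match.
assert original.encode("latin-1") != utf8_filename

# Round-tripping through the right codecs restores the real name.
assert repair_mojibake(mojibake) == original
```

If `response['Name']` behaves like `mojibake` here, then decoding the name as UTF-8 (or re-encoding with Latin-1 and decoding as UTF-8, as in `repair_mojibake`) is a workaround, not a fix: the wrong codec would still be applied somewhere between the daemon's reply and the returned dictionary.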