@hideaki-t
Created July 4, 2013 19:51
find 4-byte UTF-8 characters
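# Python 3: iterating over a bytes object yields ints.
with open('lex.csv', 'rb') as lex:
    for i, f in enumerate(lex, 1):   # i: 1-based line number
        for j, g in enumerate(f):    # j: byte offset within the line
            if g >= 0b11110000:      # 11110xxx: lead byte of a 4-byte UTF-8 sequence
                print(i, f[j:j+4], f[j:j+4].decode('utf-8'))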
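
An alternative way to run the same search, sketched here under the assumption that the input is meant to be UTF-8: decode each line and report every character whose code point is above U+FFFF, since those are exactly the characters UTF-8 encodes in four bytes. The function name `find_four_byte_chars` and the in-memory sample are hypothetical and not part of the original gist. Malformed bytes are replaced rather than raising, which may or may not be what you want.

```python
import io


def find_four_byte_chars(stream):
    """Yield (line_number, char) for each character that takes 4 bytes in UTF-8.

    `stream` is any binary file-like object. The name and signature are
    illustrative only.
    """
    for line_no, raw in enumerate(stream, 1):
        # errors='replace' keeps the scan going past malformed byte sequences.
        text = raw.decode('utf-8', errors='replace')
        for ch in text:
            if ord(ch) > 0xFFFF:  # outside the BMP, so 4 bytes in UTF-8
                yield line_no, ch


if __name__ == '__main__':
    # Hypothetical sample standing in for lex.csv.
    sample = io.BytesIO('abc\nsushi \U0001F363\n\u3042\n'.encode('utf-8'))
    for line_no, ch in find_four_byte_chars(sample):
        print(line_no, 'U+%04X' % ord(ch), ch.encode('utf-8'))
```

To scan a real file, pass an object opened in binary mode, for example `find_four_byte_chars(open('lex.csv', 'rb'))`. Working on decoded text avoids slicing raw bytes by hand, at the cost of decoding every line.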