@sotsugov
Last active August 29, 2015 14:06
Convert UTF-8 Unicode (with BOM) text to plain UTF-8 by stripping the BOM in place
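import codecs
import os
import sys

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)  # the UTF-8 BOM is 3 bytes: EF BB BF

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0  # write offset: where the next chunk lands after shifting
        chunk = chunk[BOMLEN:]
        while chunk:
            # Write the chunk BOMLEN bytes earlier than where it was read.
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            # Jump back to the read position, which is BOMLEN bytes ahead.
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        # The file is now BOMLEN bytes too long; cut off the stale tail.
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()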
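
The script above shifts the file contents in place, so it never holds more than BUFSIZE bytes in memory. For small files, a simpler variant that reads the whole file and rewrites it may be enough. The following is a minimal sketch of that alternative, not part of the original gist; the function name strip_bom_small is hypothetical, and it assumes the file fits comfortably in memory.

```python
import codecs
import os
import tempfile


def strip_bom_small(path):
    """Remove a leading UTF-8 BOM by rewriting the whole file.

    Hypothetical helper for illustration. Returns True if a BOM was
    removed, False if the file did not start with one.
    """
    with open(path, "rb") as fp:
        data = fp.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do; leave the file untouched
    with open(path, "wb") as fp:
        fp.write(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    # Demo on a throwaway temporary file.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    with open(demo_path, "wb") as fp:
        fp.write(codecs.BOM_UTF8 + b"hello")
    strip_bom_small(demo_path)
    with open(demo_path, "rb") as fp:
        print(fp.read())
    os.remove(demo_path)
```

The trade-off is memory for simplicity: the in-place version above is the better fit for large files, while this one is easier to read and reuse as a function.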