Created October 19, 2021 17:35
Gist: FilipDominec/f321f2523e9fc8948fea72fabd18c5aa
Helps fix the diacritics mess in legacy websites. Uses the chardet module to detect each file's character encoding; accepts multiple files and prints a table comparing the detected encoding with the one the file itself declares.
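To illustrate the kind of mess this is about, here is a minimal sketch of what happens when Windows-1250 bytes are read as ISO-8859-2. The two encodings agree on many Central European letters but not all, so only some characters break. The function name `decode_both_ways` and the sample sentence are illustrative assumptions, not part of the script.

```python
# Illustrative sketch only; decode_both_ways is a hypothetical helper.

def decode_both_ways(text: str):
    """Encode text as Windows-1250, then decode it correctly and incorrectly."""
    raw = text.encode('cp1250')         # bytes as a Windows-1250 page would store them
    right = raw.decode('cp1250')        # decoded with the matching codec
    wrong = raw.decode('iso-8859-2')    # decoded with the similar but wrong codec
    return right, wrong


if __name__ == '__main__':
    right, wrong = decode_both_ways('Příliš žluťoučký kůň')
    print(right)
    print(repr(wrong))   # repr() makes the stray control characters visible
```

Letters such as `á` or `ř` share a byte value in both encodings and survive, while `š`, `ž` and `ť` do not, which is why a wrongly labelled page often looks almost right.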
#!/usr/bin/python3
#-*- coding: utf-8 -*-
import chardet, pathlib, sys

# Short tags looked up in chardet's result (first three characters) and searched for in the file header
known_enc = {'Win':'Windows-1250', 'ISO':'ISO-8859-2', '1250':'Windows-1250', 'utf':'utf8' }

for fn in sys.argv[1:]:
    data = pathlib.Path(fn).read_bytes()  # read each file only once
    # chardet reports None when it cannot guess an encoding
    found_enc = chardet.detect(data)['encoding'] or 'unknown'
    if found_enc[:3] in known_enc:
        found_enc = known_enc[found_enc[:3]]
    print(f'{fn:20s} auto-detected encoding {found_enc:14s}', end='')

    fileheader = data[:500]  # a charset declaration is expected near the top of the file
    if b'charset=' in fileheader:
        print(' --> file defines encoding ', end='')
        for k in known_enc:
            if k.encode() in fileheader:  # case-sensitive substring match
                print(k, end='')
        print()
    else:
        print(' --> file DOES NOT define encoding')
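The header check in the script only reports which of the known short tags appear in the first 500 bytes. As a possible refinement, the declared charset name itself could be extracted with a regular expression. The sketch below is an assumption about how that might look, not the author's method; `declared_charset` and its pattern are hypothetical and use only the standard `re` module.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches charset=NAME with optional whitespace and an optional opening quote.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(header: bytes):
    """Return the lower-cased charset name declared in header, or None."""
    match = _CHARSET_RE.search(header)
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()


if __name__ == '__main__':
    sample = b'<meta http-equiv="Content-Type" content="text/html; charset=Windows-1250">'
    print(declared_charset(sample))
```

Lower-casing the result makes it easier to compare with chardet's guess, since declared and detected names often differ only in letter case.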