@krisk0
Created September 15, 2025 12:51
Find .md files that contain non-ASCII characters other than the long dash '—' and the ellipsis '…'. If a file 'some.md' needs sanitizing, save the sanitized version as 'some.ii'.
#!/usr/bin/python3
'''
Find all .md files that contain non-ASCII characters other than '—' and '…'.
If such a file is found, sanitize it and save the result as NAME.ii
'''
import glob, re

# Despite its name, this pattern matches the characters that are NOT allowed:
# anything outside ASCII except '—' and '…'
g_allowed = re.compile(r'[^\x00-\x7F—…]')

def do_with_file(i):
    # Skip files that are already clean
    with open(i, 'rb') as ii:
        if test_file(ii):
            return
    print("Caught file " + i)
    # 'some.md' -> 'some.ii'
    o = i[:-3] + '.ii'
    with open(i, 'rb') as ii:
        with open(o, 'wb') as oo:
            conv(ii, oo)

def test_file(i):
    # Return True if no line contains a disallowed character
    for j in i:
        k = j.decode('utf-8')
        if g_allowed.search(k):
            return False
    return True

def conv(i, o):
    # Copy i to o line by line, dropping disallowed characters
    for j in i:
        k = j.decode('utf-8')
        c = g_allowed.sub('', k)
        o.write(c.encode('utf-8'))

for f in glob.glob('./**/*.md', recursive=True):
    do_with_file(f)
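To see what the filter does without touching any files, here is a minimal in-memory sketch. It uses the same character class as the script; the helper name `sanitize_stream`, the sample text, and the idea of reporting whether anything was removed are assumptions made for illustration and are not part of the gist.

```python
import io
import re

# Same character class as the script: anything outside ASCII except '—' and '…'
disallowed = re.compile(r'[^\x00-\x7F—…]')

def sanitize_stream(src, dst):
    """Copy src to dst line by line, dropping disallowed characters.

    Hypothetical helper: returns True if at least one character was removed.
    """
    changed = False
    for raw in src:
        text = raw.decode('utf-8')
        clean = disallowed.sub('', text)
        changed = changed or clean != text
        dst.write(clean.encode('utf-8'))
    return changed

# Sample input: 'é' should be dropped, '—' and '…' should survive
sample = 'caf\u00e9 \u2014 done\u2026\nplain line\n'.encode('utf-8')
out = io.BytesIO()
was_dirty = sanitize_stream(io.BytesIO(sample), out)
result = out.getvalue().decode('utf-8')
print(was_dirty, repr(result))
```

Iterating over a binary stream splits on the newline byte, which never occurs inside a multi-byte UTF-8 sequence, so decoding each line separately is safe for valid UTF-8 input. Invalid input would raise `UnicodeDecodeError`, as it would in the script itself.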