@xthezealot
Last active August 3, 2023 14:07
Normalize unicode file names (converts UTF-8 NFD to NFC). Required by macOS clients through AFP/NFS/SMB. Tested on Synology DSM 6.2 with built-in Python 2.7.12.

NFCFN.py

Normalize unicode file names (converts UTF-8 NFD to NFC).
Required by macOS clients through AFP/NFS/SMB.

Tested on Synology DSM 6.2 with built-in Python 2.7.12.
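
To illustrate what the conversion does, here is a minimal sketch using Python's standard `unicodedata` module. It is written for Python 3, and the sample name `café.txt` is only an illustration, not something taken from the script. The same visible name can be stored as two different code point sequences, and NFC folds the decomposed one into the composed one.

```python
from unicodedata import normalize

# "é" stored as "e" + COMBINING ACUTE ACCENT (decomposed form, NFD)
nfd_name = "cafe\u0301.txt"
# "é" stored as one precomposed code point (composed form, NFC)
nfc_name = "caf\u00e9.txt"

# Both names render the same way, yet they are different strings.
same_before = nfd_name == nfc_name

# Normalizing the decomposed name to NFC makes the two strings equal.
converted = normalize("NFC", nfd_name)
same_after = converted == nfc_name

print(len(nfd_name), len(converted), same_before, same_after)
```

The decomposed name is one code point longer than the composed one, which is the kind of difference the script reports next to each rename.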

Usage

# 1. Activate SSH on your NAS

# 2. On your computer, open a new console/terminal and connect to your server:
ssh YourUser@YourNasAddress

# 3. Go to the directory where you want to save the `nfcfn.py` script:
cd /volume1/YourSharedFolder/PathToScript

# 4. Download the latest version:
wget https://gist.githubusercontent.com/xthezealot/9a65fac2c7b916c4d84e66188bf06bec/raw/nfcfn.py

# 5. Run it with Python to check the result:
python nfcfn.py -cr /volume1/YourSharedFolder

# 6. When you are sure, add the `-p` flag to actually rename the files:
python nfcfn.py -crp /volume1/YourSharedFolder

nfcfn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Normalize unicode file names."""
from __future__ import unicode_literals

from argparse import ArgumentParser
from os import rename, walk
from os.path import exists, isfile, join, split
from sys import version_info
from unicodedata import normalize


def bytes_saved(old, new):
    """Return a colored tag with the length difference between old and new.

    Lengths are counted in characters (code points), not in encoded bytes.
    """
    diff = len(new) - len(old)
    s = "[\033["
    if diff < 0:
        s += "32m" + str(diff)  # green: the name got shorter
    elif diff > 0:
        s += "31m+" + str(diff)  # red: the name got longer
    else:
        s += "34m="  # blue: same length
    s += " byte"
    if abs(diff) > 1:
        s += "s"
    return s + "\033[0m]"


def norm(root, file, form, proceed):
    """Normalize one name in root and return the name present on disk afterwards."""
    # Assumption: the replacements are the fullwidth forms (U+FF0F, U+FF3C,
    # U+FF1A), so that normalization cannot leave a path separator or a
    # colon in a file name.
    normed = (
        normalize(form, file)
        .replace("/", "\uff0f")
        .replace("\\", "\uff3c")
        .replace(":", "\uff1a")
    )
    if file == normed:
        return file
    old = join(root, file)
    new = join(root, normed)
    if exists(new):
        print(
            "%s \033[31mcannot be renamed as\033[0m %s \033[31malready exists\033[0m"
            % (old, normed)
        )
        return file
    print("%s ▶︎ %s %s" % (old, normed, bytes_saved(file, normed)))
    if proceed:
        rename(old, new)
        return normed
    return file


def main():
    """Normalize unicode file names."""
    parser = ArgumentParser(description="Normalize unicode file names.")
    parser.add_argument("source", help="the source file or directory")
    parser.add_argument(
        "-c",
        "--compatibility",
        action="store_true",
        help='normalize with compatibility (ex: "\ufb01" becomes "fi")',
    )
    parser.add_argument("-p", "--proceed", action="store_true", help="rename files")
    parser.add_argument(
        "-r",
        "--recursive",
        action="store_true",
        help="go through directories recursively",
    )
    args = parser.parse_args()
    if version_info < (3,):
        # Python 2 passes arguments as bytes; decode them to unicode.
        args.source = unicode(args.source, "utf8")
    norm_form = "NFKC" if args.compatibility else "NFC"
    if isfile(args.source):
        # Source is a file
        head, tail = split(args.source)
        norm(head, tail, norm_form, args.proceed)
    else:
        # Source is a directory
        for root, dirs, files in walk(args.source):
            # Update dirs in place so that walk() descends into the renamed
            # directories instead of the old names, which no longer exist.
            dirs[:] = [norm(root, d, norm_form, args.proceed) for d in dirs]
            for f in files:
                norm(root, f, norm_form, args.proceed)
            if not args.recursive:
                break


if __name__ == "__main__":
    main()

@rennewpeter

rennewpeter commented Jul 16, 2021

@ArthurWhite Thanks. I tested the script with Python 3 on DSM 7 in Task Scheduler and it worked. The script runs daily. You saved my life: I have 25 TB of albums, and with this, Synology Drive search is working again. Respect.

@xthezealot
Author

You're welcome! 👍

@janusn

janusn commented Jul 16, 2021

@ArthurWhite

I recently ran the script and noticed some more side effects of the -c option that I think most people do not want.
For example:

  • /volume2/Media/shared/Movies/Alien³ (1992) ▶︎ Alien3 (1992) [= byte]
  • /volume2/Media/shared/Movies/DESTINY (2017) ▶︎ DESTINY (2017) [= byte]

I am not challenging your recommendation. I am genuinely curious and would like to learn more. Could you enlighten me about why the -c option is a good idea?
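
The renames listed above follow from how the two normalization forms differ. As a rough sketch for Python 3 (the helper name `compat_changes` is hypothetical and not part of nfcfn.py), one can list the characters that `-c` (NFKC) would rewrite but plain NFC would leave alone. It checks one character at a time, which is a simplification of normalizing the whole name.

```python
from unicodedata import normalize


def compat_changes(name):
    """Return (char, replacement) pairs that NFKC rewrites beyond NFC."""
    changes = []
    for ch in normalize("NFC", name):
        folded = normalize("NFKC", ch)
        if folded != ch:
            changes.append((ch, folded))
    return changes


name = "Alien\u00b3 (1992)"  # the superscript three from the first example

nfc_unchanged = normalize("NFC", name) == name  # NFC keeps the superscript
nfkc_name = normalize("NFKC", name)             # NFKC folds it to a plain digit

print(nfc_unchanged)
print(nfkc_name)
print(compat_changes(name))
```

Running such a check on a folder before using `-c` shows which names would change only because of compatibility folding.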

@xthezealot
Author

xthezealot commented Jul 16, 2021

Simply because with some fonts, different strings may look exactly the same although they are in fact not "equal".

So for example, if you have a file that contains ﬁ (1 char, the U+FB01 ligature) in its name, and you search for fi (2 chars), you may not find this file.
And different versions of the same OS can behave differently.
If you're not aware of this kind of subtlety, it can be tricky. It's not always as obvious as ³ vs. 3.
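
A minimal sketch of that search mismatch, assuming Python 3 and a made-up file name: a plain substring search for `fi` misses a name that starts with the ligature until the name is normalized with NFKC. NFC alone does not help here.

```python
from unicodedata import normalize

stored_name = "\ufb01nal report.txt"  # begins with the one-character ligature
query = "fi"                           # two ordinary ASCII letters

found_raw = query in stored_name                        # search the name as stored
found_nfc = query in normalize("NFC", stored_name)      # NFC keeps the ligature
found_nfkc = query in normalize("NFKC", stored_name)    # NFKC expands it to "fi"

print(found_raw, found_nfc, found_nfkc)
```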

That's why I recommend sticking as close as possible to ASCII characters for file names.
Even spaces in a file path can cause problems with some apps (I have experienced it), but that would be too strict a rule…

So yes, it's an old-school recommendation but still a "good practice" when heterogeneous OSes and file systems share the same data, which is the purpose of a NAS.

But if only you have access to your NAS and always through the same device, OK, the -c option is not a big win.

@janusn

janusn commented Jul 16, 2021

Thank you very much for your explanation. I understand now. 👍🏼

I think I'd better omit the -c option in my use case though. It confuses a few tools I use.
