@xthezealot
Last active August 3, 2023 14:07
Normalize unicode file names (converts UTF-8 NFD to NFC). Required by macOS clients through AFP/NFS/SMB. Tested on Synology DSM 6.2 with built-in Python 2.7.12.

NFCFN.py

Normalize unicode file names (converts UTF-8 NFD to NFC).
Required by macOS clients through AFP/NFS/SMB.

Tested on Synology DSM 6.2 with built-in Python 2.7.12.
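
To illustrate what the NFD-to-NFC conversion means, here is a minimal sketch using the standard `unicodedata` module (Python 3 syntax). The file name `café.txt` is a made-up example and is not taken from the script.

```python
from unicodedata import normalize

# Hypothetical file name, written with escapes so the code points are explicit.
nfd_name = "cafe\u0301.txt"            # "e" + U+0301 COMBINING ACUTE ACCENT (decomposed)
nfc_name = normalize("NFC", nfd_name)  # "é" as the single code point U+00E9 (composed)

print(nfd_name == nfc_name)                    # False: same appearance, different code points
print(len(nfd_name), len(nfc_name))            # 9 8
print(normalize("NFD", nfc_name) == nfd_name)  # True: NFC and NFD convert back and forth
```

Both strings render identically, which is why a name stored in one form may not be found by a client that looks it up in the other form.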

Usage

# 1. Activate SSH on your NAS

# 2. On your computer, open a new console/terminal and connect to your server:
ssh [email protected]

# 3. Go to the directory where you want to save the `nfcfn.py` script:
cd /volume1/YourSharedFolder/PathToScript

# 4. Download the latest version:
wget https://gist.githubusercontent.com/xthezealot/9a65fac2c7b916c4d84e66188bf06bec/raw/nfcfn.py

# 5. Run it with Python to check the result:
python nfcfn.py -cr /volume1/YourSharedFolder

# 6. When you are sure, add the `-p` flag to actually rename the files:
python nfcfn.py -crp /volume1/YourSharedFolder

nfcfn.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Normalize unicode file names."""
from __future__ import unicode_literals

from argparse import ArgumentParser
from os import rename, walk
from os.path import exists, isfile, join, split
from sys import version_info
from unicodedata import normalize


def bytes_saved(old, new):
    """Return a colored label for the length difference between old and new string."""
    # Note: len() counts the characters (code points) of the unicode strings.
    diff = len(new) - len(old)
    s = "[\033["
    if diff < 0:
        s += "32m" + str(diff)  # green: the name got shorter
    elif diff > 0:
        s += "31m+" + str(diff)  # red: the name got longer
    else:
        s += "34m="  # blue: same length
    s += " byte"
    if abs(diff) > 1:
        s += "s"
    return s + "\033[0m]"


def norm(root, file, form, proceed):
    """Do the normalization."""
    # NFKC maps the fullwidth slash, backslash and colon to their ASCII
    # counterparts, which are unsafe in file names, so map them back
    # (this also affects any ASCII backslash or colon already in the name).
    normed = (
        normalize(form, file)
        .replace("/", "\uff0f")  # FULLWIDTH SOLIDUS
        .replace("\\", "\uff3c")  # FULLWIDTH REVERSE SOLIDUS
        .replace(":", "\uff1a")  # FULLWIDTH COLON
    )
    if file != normed:
        old = join(root, file)
        new = join(root, normed)
        if exists(new):
            # Warn instead of failing when the target name is already taken.
            print("%s \033[31mcannot be renamed as\033[0m %s \033[31malready exists\033[0m" % (old, normed))
        else:
            print("%s ▶︎ %s %s" % (old, normed, bytes_saved(file, normed)))
            if proceed:
                rename(old, new)


def main():
    """Normalize unicode file names."""
    parser = ArgumentParser(description="Normalize unicode file names.")
    parser.add_argument("source", help="the source file or directory")
    parser.add_argument(
        "-c",
        "--compatibility",
        action="store_true",
        help='normalize with compatibility (ex: "fi" becomes "fi")',
    )
    parser.add_argument("-p", "--proceed", action="store_true", help="rename files")
    parser.add_argument(
        "-r",
        "--recursive",
        action="store_true",
        help="go through directories recursively",
    )
    args = parser.parse_args()
    if version_info < (3,):
        # Python 2 only: decode the command-line bytes to a unicode string.
        args.source = unicode(args.source, "utf8")
    norm_form = "NFKC" if args.compatibility else "NFC"
    # Source is a file
    if isfile(args.source):
        head, tail = split(args.source)
        norm(head, tail, norm_form, args.proceed)
    # Source is a directory
    else:
        for root, dirs, files in walk(args.source):
            for d in dirs:
                norm(root, d, norm_form, args.proceed)
            for f in files:
                norm(root, f, norm_form, args.proceed)
            if not args.recursive:
                break


if __name__ == "__main__":
    main()

@janusn

janusn commented Mar 23, 2020

I have found out why. There was a fullwidth slash (/) in the original filename. It was converted to a normal slash (/) by nfcfn.py, and the new slash was interpreted as a path separator. I globally replaced those characters with a division slash (∕) and the script completed successfully. I think it would be better for the script to handle them itself.

There is another problematic character, the fullwidth backslash (\), which should likewise be replaced with a small reverse slash (﹨).

These fullwidth characters are very common in the Asian languages and file names.
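
The behavior described in this comment can be reproduced with `unicodedata` alone. The sketch below (Python 3 syntax) first shows that NFKC turns the fullwidth slash into an ASCII slash while NFC leaves it alone, then shows one possible way to keep such characters untouched. The helper name `nfkc_keep_separators` and the sample name are hypothetical and are not part of nfcfn.py.

```python
import re
from unicodedata import normalize

FULLWIDTH_SLASH = "\uff0f"      # U+FF0F FULLWIDTH SOLIDUS
FULLWIDTH_BACKSLASH = "\uff3c"  # U+FF3C FULLWIDTH REVERSE SOLIDUS
PROTECTED = {FULLWIDTH_SLASH, FULLWIDTH_BACKSLASH}

print(normalize("NFKC", FULLWIDTH_SLASH) == "/")             # True: becomes a path separator
print(normalize("NFC", FULLWIDTH_SLASH) == FULLWIDTH_SLASH)  # True: NFC keeps it as is

# Split on the protected characters, keeping them as separate pieces.
_SPLIT = re.compile("([\uff0f\uff3c])")


def nfkc_keep_separators(name):
    """Hypothetical helper: NFKC-normalize a name but leave protected characters alone."""
    pieces = _SPLIT.split(name)
    return "".join(p if p in PROTECTED else normalize("NFKC", p) for p in pieces)


# Made-up name: fullwidth slash plus fullwidth digits "1980".
sample = "AC\uff0fDC \uff11\uff19\uff18\uff10"
print("/" in normalize("NFKC", sample))       # True
print("/" in nfkc_keep_separators(sample))    # False
```

Normalizing the pieces separately is a simplification: it assumes no combining mark directly follows a protected character.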

@janusn

janusn commented Mar 24, 2020

I am not familiar with Python. Sorry that I did not read your code.
Having read your code, I now understand that I can avoid the problem by not passing the -c option.

The compatibility version, NFKC normalization, may introduce invalid characters. I think NFC normalization is a better option for file name conversion.
At least 3 characters that are commonly used in Asian languages become problematic after conversion to NFKC form. They are:

I would recommend omitting option -c from the arguments to avoid possible problems.

@xthezealot
Author

xthezealot commented May 9, 2020

I added exceptions for these 3 characters.

But --compatibility is still useful for some of us who just don't like having ambiguous special chars in their filenames.

@bilhackmac

Thx a lot, you saved my day (even my week)

@xthezealot
Author

> Thx a lot, you saved my day (even my week)

With pleasure! 👍

@mauthner

Arthur, you are my hero! After more than a year of searching for a solution to this sticky problem, I found your script (and the description of how to use it), and it works brilliantly. Just one point took me some time to figure out:

  • If a folder's name needs to be normalized, the script renames it but then does not descend into its contents. The script has to be run again for each level of subfolders.

This is not a problem, because it's quickly done (and I mostly have just three levels of subfolders). You just have to be aware of it.
So, thank you very much…
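
One way around the "run it once per level" behavior described above is to walk the tree bottom-up, so that a directory is renamed only after everything inside it has been handled. This is a sketch of that idea, not the script's actual strategy; `normalize_tree` is a hypothetical helper, and it skips (rather than merges) any entry whose normalized name already exists.

```python
import os
from unicodedata import normalize


def normalize_tree(top, form="NFC", proceed=False):
    """Hypothetical helper: normalize names under `top`, deepest entries first.

    Returns the list of (old_path, new_path) pairs that were (or would be) renamed.
    """
    renamed = []
    # topdown=False yields a directory after all of its subdirectories, so the
    # paths of entries still to be visited are never invalidated by a rename.
    for root, dirs, files in os.walk(top, topdown=False):
        for name in dirs + files:
            normed = normalize(form, name)
            if normed == name:
                continue
            old = os.path.join(root, name)
            new = os.path.join(root, normed)
            if os.path.exists(new):
                print("skipped, target already exists: %s" % new)
                continue
            renamed.append((old, new))
            if proceed:
                os.rename(old, new)
    return renamed
```

Calling `normalize_tree("/volume1/YourSharedFolder")` without `proceed=True` only reports what would change, mirroring the script's dry-run default.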

@mauthner

Just another observation: if a folder would get the same name as an already existing one after normalization, the script stops and I get this message:

Traceback (most recent call last):
  File "nfcfn.py", line 82, in <module>
    main()
  File "nfcfn.py", line 74, in main
    norm(root, d, norm_form, args.proceed)
  File "nfcfn.py", line 39, in norm
    rename(old, new)
OSError: [Errno 39] Directory not empty

@miketou

miketou commented Mar 4, 2021

thanks man. you literally saved me one week of my life. why on earth is it that difficult to have a universal standard for filename parameters in the year 2021... pfff
thanks

@datiti

datiti commented Jul 12, 2021

Thanks !!!!! This is the end of "error 43" :D

@rennewpeter

rennewpeter commented Jul 15, 2021

Please help. I get this error:
Traceback (most recent call last):
  File "/volume1/homes/Wenner Peter/Drive/Script/nfcfn.py", line 82, in <module>
    main()
  File "/volume1/homes/Wenner Peter/Drive/Script/nfcfn.py", line 67, in main
    if isfile(args.source):
  File "/usr/lib/python2.7/genericpath.py", line 37, in isfile
    st = os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 37: ordinal not in range(128)

i want to run this in synology task scheduler daily

which python should i install?
in the package center i see:

  • Python Module - TRIED
  • Python3 - NOW INSTALLED

Community:

  • Python - TRIED
  • Python 3.8
  • Python3

Any suggestion?

@xthezealot
Author

Thanks for the feedback.

@mauthner, going recursively after renaming a directory would require another strategy which I don't currently have time to code. Sorry.
For the "already existing directory", I added a warning. The script won't fail anymore over this one.

@rennewpeter, the built-in Python version (2.7.18 if your DSM is up to date) should be sufficient. Python 3 is not required.
The error simply shows that you provided a directory path that cannot be parsed by Python.
Maybe your terminal or your Synology uses an uncommon encoding. It's difficult to investigate without details.
But I already edited the script to use the unicode function instead of decode. You can redownload it and retry.
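
The `UnicodeEncodeError` reported above can be reproduced in isolation: `u'\xf3'` is the letter "ó", which the ASCII codec cannot represent. The sketch below assumes the path ended up being encoded with an ASCII default, as can happen when a job runs without a UTF-8 locale; that cause is an assumption and has not been verified on DSM. The path is a made-up example.

```python
# Hypothetical path containing "ó" (U+00F3); the u prefix works on Python 2.7 and 3.3+.
path = u"/volume1/homes/\u00f3"

try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc.reason)  # ordinal not in range(128)

# UTF-8 can encode the same path and round-trips without loss.
encoded = path.encode("utf-8")
print(encoded.decode("utf-8") == path)  # True
```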

@rennewpeter

rennewpeter commented Jul 16, 2021

@ArthurWhite thx. I tested the script with Python 3 and DSM 7 in Task Scheduler and it worked. The script runs daily. So you saved my life: I have 25TB of albums, and with this, Synology Drive search is working again. Respect

@xthezealot
Author

You're welcome! 👍

@janusn

janusn commented Jul 16, 2021

@ArthurWhite

I recently ran the script and noticed some more side effects of the -c option which I think most people do not want.
For example:

  • /volume2/Media/shared/Movies/Alien³ (1992) ▶︎ Alien3 (1992) [= byte]
  • /volume2/Media/shared/Movies/DESTINY (2017) ▶︎ DESTINY (2017) [= byte]

I am not challenging your recommendation. I am genuinely curious and would like to learn more. Could you enlighten me about why the -c option is a good idea?

@xthezealot
Author

xthezealot commented Jul 16, 2021

Simply because with some fonts, different strings may look exactly the same although they are in fact not "equal".

So for example, if you have a file that contains fi (1 char) in its name, and you search for fi (2 chars), you may not find this file.
And different versions of the same OS can behave differently.
If you're not aware of this kind of subtlety, it can be tricky. It's not always as obvious as ³ vs. 3.

That's why I recommend sticking as close as possible to ASCII characters for file names.
Even spaces in a file path can cause problems with some apps, I experienced it, but that's too much of a rule…

So yes, it's an old-school recommendation but still a "good practice" when heterogeneous OSes and file systems share the same data, which is the purpose of a NAS.

But if only you have access to your NAS and always through the same device, OK, the -c option is not a big win.
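
The ligature example from this comment can be checked directly with `unicodedata` (Python 3 syntax). This is only an illustration; the file names are made up, apart from `Alien³`, which comes from the output quoted earlier in the thread.

```python
from unicodedata import normalize

ligature_name = "\ufb01le.txt"  # starts with U+FB01 LATIN SMALL LIGATURE FI (1 char)
plain_name = "file.txt"         # starts with "f" + "i" (2 chars)

print("fi" in ligature_name)                           # False: a search for "fi" misses it
print(normalize("NFC", ligature_name) == plain_name)   # False: NFC keeps the ligature
print(normalize("NFKC", ligature_name) == plain_name)  # True: NFKC expands it
print(normalize("NFKC", "Alien\u00b3"))                # Alien3
```

This is the trade-off discussed in the thread: NFKC (the `-c` option) makes look-alike names compare equal, at the cost of rewriting characters such as `³` that some users want to keep.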

@janusn

janusn commented Jul 16, 2021

Thank you very much for your explanation. I understand now. 👍🏼

I think I'd better omit the -c option in my use case though. It confuses a few tools I use.
