I apologize for the wall of text; there are a lot of concerns raised in the WASI case-sensitivity ticket, and I wanted to explain how everything works without just pointing to dense specs and code.
Case-insensitivity is important because it is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity. As a practical matter, this means Linux developers must manually enforce case-insensitivity. And despite much wailing and gnashing of teeth, even Linux has recently added support for case-insensitivity on a per-directory basis. Distros that care about usability will eventually adopt case-insensitivity, even if it is just for the home directories.
WASI’s current proposal trades Unix’s opaque bytes model for UTF-8 filenames, but rejects case-insensitivity. From a usability engineering perspective, this carries on the tradition of programmers alienating users because it makes their lives easier. Curiously, it also risks alienating developers on every other platform and makes the WASI filesystem abstraction much thicker.
I would instead like to propose a specification modeled on Rust's overflow handling: precisely defined implementation-dependent behavior that can be made deterministic, but mostly “just works” when backed by a sloppy/fast implementation that relies on the native file system. That specification is Unicode’s caseless matching, which is not the fractal of complexity most i18n issues devolve into:
- The actual `toCasefold` function is a context-free mapping of single codepoints to case-folded codepoints.
- Only ~1,500 codepoints case-fold.
- Only ~225 case-folded codepoints are outside the BMP.
- Legacy systems based on UCS-2 (Windows) should be largely compatible.
- ~1,400 codepoints map to a single case-folded codepoint.
- Optional `full` variant of `toCasefold` maps ~100 codepoints to 2-3 casefolded codepoints.
- Locale independent:
- Only 2 codepoints are language dependent (Turkic i) and are skipped in both `simple` and `full`.
- I do not believe any FS implements locale dependent case-insensitivity, including NTFS.
- Unicode specifically discourages this: "In most environments, such as in file systems, text is not and cannot be tagged with language information. ... For such environments, a constant, language-independent, default case folding is required."
- Case-folds are immutable across versions of Unicode for assigned codepoints:
- Any codepoint folded in Unicode 5+ will casefold to the same codepoint(s) in all future versions.
- Runtimes can prevent any non-determinism by rejecting unassigned codepoints at runtime.
- They could also assume unassigned codepoints do not fold and emit a warning.
- Most languages are unicase; only 0.13% of existing codepoints case-fold.
- Could also fall back to BoB model.
- Older runtimes can be easily polyfilled to support new codepoints.
- Relatively few new case-folds; it should be exceedingly rare to encounter these in the wild.
- ~30/year since 2006, trending downward.
- Mostly ancient/arcane texts and native alphabets.
- Still important to eventually support those communities.
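To make the points above concrete, here is a minimal sketch in Python. It assumes Python's built-in `str.casefold()` (Unicode full case folding, locale independent) is an acceptable stand-in for `toCasefold`, and it treats general category `Cn` as "unassigned in this Unicode version". The names `fold_key`, `same_name`, and `UnassignedCodepointError` are hypothetical and not part of any WASI proposal.

```python
import unicodedata


class UnassignedCodepointError(ValueError):
    """Raised when a name uses a codepoint this Unicode version does not define."""


def fold_key(name: str) -> str:
    """Return a caseless comparison key, rejecting unassigned codepoints."""
    for ch in name:
        # General category "Cn" covers unassigned codepoints (and noncharacters).
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodepointError(f"U+{ord(ch):04X} is unassigned")
    # str.casefold() applies full case folding and never consults the locale.
    return name.casefold()


def same_name(a: str, b: str) -> bool:
    """True when two names are caseless matches under the key above."""
    return fold_key(a) == fold_key(b)


if __name__ == "__main__":
    print(same_name("README.md", "ReadMe.MD"))  # True
    print(fold_key("Straße"))                   # strasse (a "full" fold to two codepoints)
```

A runtime that prefers a warning over an error could log and return `name.casefold()` instead of raising, which corresponds to the "assume unassigned codepoints do not fold" option in the list.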
I believe the biggest sticking point would be the addition of new case-folds. I think parameterizing a runtime based on a Unicode version is a reasonable requirement, as using new codepoints which also casefold would rarely happen in practice. And it’s not as if WASM runtimes will never add new functionality, which also changes behavior.
An ideal WASI implementation would provide a deterministic filesystem, using something like sandboxfs. But most implementations would get by using their native file system’s case insensitivity. It’s true that these differ in edge cases, but as Ted Tso put it, “the world is converging enough that the latest versions of Mac OS X’s APFS and Windows NTFS behave pretty much the same way.”
I’m also unsure whether to use NFC or NFD normalization prior to case-folding. Most text on the web is already in NFC, but NFC has an NFD processing step so 🤷.
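For reference, Unicode defines canonical caseless matching as comparing `NFD(toCasefold(NFD(X)))`. The sketch below parameterizes the normalization form so both options can be tried; it again assumes Python's `str.casefold()` as the folding step, and `canonical_fold` is a hypothetical helper name.

```python
import unicodedata


def canonical_fold(name: str, form: str = "NFD") -> str:
    """Normalize, case-fold, then normalize again (folding can denormalize text)."""
    normalized = unicodedata.normalize(form, name)
    folded = normalized.casefold()
    return unicodedata.normalize(form, folded)


if __name__ == "__main__":
    composed = "\u00c9t\u00e9"      # "Été" with precomposed letters
    decomposed = "E\u0301te\u0301"  # same text with combining acute accents
    # Plain case folding does not unify the two spellings...
    print(composed.casefold() == decomposed.casefold())  # False
    # ...but either normalization form does.
    for form in ("NFC", "NFD"):
        print(form, canonical_fold(composed, form) == canonical_fold(decomposed, form))
```

Either form yields the same equivalence classes for matching; the choice mainly affects which byte sequence ends up being stored or compared.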
Notes:
- Ignore rants and specs covering generic case mapping, special casing, or case conversion (uppercase, lowercase, or titlecase).
- We care about case folding/caseless matching/toCasefold.
- Unicode Casemap FAQ
- Unicode Spec Chapter 5 Section 18: Caseless Matching.
- Unicode Data CaseFold.txt
- Unicode caseless matching stability policy.
- LWN Case Insensitive Ext4.
- Summarizes and links to relevant discussions and changesets.
- Super helpful in understanding design mistakes and implementation details of HFS+, APFS, Samba, and EXT4.
- Explainer on Ext4 and F2FS, provides some additional implementation details.
- ZFS unicode normalization behavior.
- MSDN Unicode handling in NTFS, NTFS never uses locale.
- Linux FS tree, best source of filename handling for any OS.
There isn't a specific proposal, but the ticket starts with "I think it would be a good idea to standardize on if the file-system APIs should be case-sensitive or not" and there are suggestions of codifying case-sensitive behavior through lints of the filesystem and failing when case differs.
Furthermore, if you want to remove non-deterministic behavior from the filesystem, then you have to sort directories and files based on filenames, not their position in the FS b-tree. That requires a detailed specification.
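As a rough illustration of what such a specification would pin down, the sketch below lists a directory in codepoint order instead of whatever order the filesystem returns. Codepoint order is an assumption made here for simplicity, and `list_sorted` is a hypothetical helper.

```python
import os


def list_sorted(directory: str) -> list[tuple[str, bool]]:
    """Return (name, is_dir) pairs sorted by Unicode codepoint, not on-disk order."""
    with os.scandir(directory) as entries:
        listing = [(entry.name, entry.is_dir()) for entry in entries]
    # Python compares str values codepoint by codepoint, so a plain sort suffices.
    listing.sort(key=lambda pair: pair[0])
    return listing
```

Note that codepoint order places uppercase ASCII before lowercase (`C.txt` sorts before `a.txt`), which is one of the details a specification would have to state explicitly.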
The big concern I have is that baking case-sensitivity into the runtime now would cause much larger engineering headaches later.
If you are just providing access to the underlying filesystem (like a user's download folder) then you should just do whatever the underlying filesystem does.
Linux introduced per-directory support for case-insensitivity last year in EXT4 and F2FS. This will be ported to other filesystems. So a mostly-correct implementation will be fast.
As I noted in the other ticket, emulating case-insensitive semantics without the help of the filesystem is not as bad as your assumption (`O(2**chars)`). Case-insensitive lookups are only required when a case-sensitive lookup fails, in which case you have to perform a directory walk of the filesystem, `O(directories)`.

Samba and WINE get slow for deeply nested paths (`foo/bar/...`, `Foo/bar/...`). However, we aren't building a compatibility layer for unmodified Windows binaries. We can just fail if there are two or more directories or files that match when casefolded.

Even with all the extra work that Samba and WINE do, all of my development work happens in virtual machines which mount a machine-local SMB share.
Deterministic directory iteration would incur a constant factor overhead, dominated by filesystem access. This can be turned into `O(log files)` by maintaining an index via `inotify`.

My focus was on how Rust punted on the question of overflow handling, as it was too difficult to settle in time for 1.0. But they did so in a way that they could back out of the decision they made without breaking existing code.
So in this case, we would specify that files are entries in an NFC/NFD casefolded namespace, sorted according to Unicode codepoints or possibly some default Unicode sort order. Runtimes could detect usage of unassigned codepoints and case-insensitive naming conflicts and throw an error or display a warning.
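A runtime-side conflict check along these lines might look like the sketch below. It assumes NFC as the normalization form (the choice is still open) and Python's `str.casefold()` as the folding function; `namespace_key` and `find_conflicts` are hypothetical names.

```python
import unicodedata
from collections import defaultdict


def namespace_key(name: str) -> str:
    """Key under which a name lives in the casefolded namespace (NFC assumed)."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


def find_conflicts(names):
    """Map each contested key to the sorted list of names that collide on it."""
    groups = defaultdict(list)
    for name in names:
        groups[namespace_key(name)].append(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    listing = ["Makefile", "makefile", "README"]
    for key, clash in find_conflicts(listing).items():
        print(f"warning: {clash} all map to {key!r}")
```

Whether a non-empty result becomes a hard error or just a warning is the policy decision left to the runtime.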
There are some CVEs that exploit differences between application and filesystem Unicode handling. But forcing filenames to be valid UTF-8 will also introduce that hazard.
But given that Linux just does whatever Unicode says and only archaic and newly codified native alphabets add new Unicode casing rules....