indolering/WASI-Case-Folding.md

Created October 21, 2020 04:51

Proposal for Unicode case-folding of WASI filenames.

I apologize for the wall of text; there are a lot of concerns raised in the WASI case-sensitivity ticket, and I wanted to explain how everything works without just pointing to dense specs and code.

Case-insensitivity is important because it is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity. As a practical matter, this means Linux developers must manually enforce case-insensitivity. And despite much wailing and gnashing of teeth, even Linux has recently added support for case-insensitivity on a per-directory basis. Distros that care about usability will eventually adopt case-insensitivity, even if it is just for the home directories.

WASI’s current proposal trades Unix’s opaque bytes model for UTF-8 filenames, but rejects case-insensitivity. From a usability engineering perspective, this carries on the tradition of programmers alienating users because doing so makes the programmers' own lives easier. Curiously, it also risks alienating developers on every other platform and makes the WASI filesystem abstraction much thicker.

I would instead like to propose a specification modeled on Rust's overflow handling: precisely defined, implementation-dependent behavior that can be made deterministic, but mostly “just works” when backed by a sloppy/fast implementation that relies on the native file system. That specification is Unicode’s caseless matching, which is not the fractal of complexity most i18n issues devolve into:

- The actual toCasefold function is a context-free mapping of single codepoints to case-folded codepoints.
- Only ~1,500 codepoints case-fold.
- Only ~225 case-folded codepoints are outside the BMP.
  - Legacy systems based on UCS-2 (Windows) should be largely compatible.
- ~1,400 codepoints map to a single case-folded codepoint.
  - Optional full variant of toCasefold maps ~100 codepoints to 2-3 case-folded codepoints.
- Locale independent:
  - Only 2 codepoints are language dependent (Turkic i), and their Turkic mappings are skipped in both simple and full folding.
  - I do not believe any FS implements locale-dependent case-insensitivity, including NTFS.
  - Unicode specifically discourages this: "In most environments, such as in file systems, text is not and cannot be tagged with language information. ... For such environments, a constant, language-independent, default case folding is required."
- Case-folds are immutable across versions of Unicode for assigned codepoints:
  - Any codepoint folded in Unicode 5+ will casefold to the same codepoint(s) in all future versions.
  - Runtimes can prevent any non-determinism by rejecting unassigned codepoints at runtime.
  - They could also assume unassigned codepoints do not fold and emit a warning.
    - Most languages are unicase; only 0.13% of existing codepoints case-fold.
    - Could also fall back to BoB model.
  - Older runtimes can be easily polyfilled to support new codepoints.
  - Relatively few new case-folds are added; it should be exceedingly rare to encounter these in the wild.
    - ~30/year since 2006, trending downward.
    - Mostly ancient/arcane texts and native alphabets.
    - Still important to eventually support those communities.
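
To make the points above concrete, here is a minimal Python sketch. It assumes `str.casefold()` (Python's full default case folding) is an acceptable stand-in for toCasefold, and it uses `unicodedata.category()` to reject codepoints with general category `Cn` (unassigned codepoints and noncharacters). The names `fold_filename`, `same_name`, and `UnassignedCodepointError` are hypothetical and not part of any WASI API.

```python
import unicodedata


class UnassignedCodepointError(ValueError):
    """Hypothetical error for names the bundled Unicode tables cannot vouch for."""


def fold_filename(name: str) -> str:
    """Return a locale-independent caseless key for a filename.

    Codepoints with general category "Cn" (unassigned or noncharacter in the
    Unicode version bundled with this Python) are rejected, because a later
    Unicode version could assign them and give them a case folding.
    """
    for ch in name:
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodepointError(
                f"U+{ord(ch):04X} is not assigned in Unicode "
                f"{unicodedata.unidata_version}"
            )
    # str.casefold() applies the full (one-to-many) default case folding
    # and never consults the process locale.
    return name.casefold()


def same_name(a: str, b: str) -> bool:
    """True if two filenames are a caseless match under the key above."""
    return fold_filename(a) == fold_filename(b)
```

With this sketch, `same_name("Straße", "STRASSE")` is true because full folding maps `ß` to `ss`. A runtime that preferred the warn-and-continue option could log instead of raising.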

I believe the biggest sticking point would be the addition of new case-folds. I think parameterizing a runtime based on a Unicode version is a reasonable requirement, as using new codepoints which also casefold would rarely happen in practice. And it’s not as if WASM runtimes will never add new functionality, which also changes behavior.

An ideal WASI implementation would provide a deterministic filesystem, using something like sandboxfs. But most implementations would get by using their native file system’s case insensitivity. It’s true that these differ in edge cases, but as Ted Tso put it, “the world is converging enough that the latest versions of Mac OS X’s APFS and Windows NTFS behave pretty much the same way.”

I’m also unsure whether to use NFC or NFD normalization prior to case-folding. Most text on the web is already in NFC, but NFC normalization itself begins with a canonical decomposition (NFD) step, so 🤷.
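
For reference, Unicode defines canonical caseless matching as comparing NFD(toCasefold(NFD(X))), which gives one possible answer for the comparison key (the stored form could still be NFC). The sketch below assumes Python's `unicodedata.normalize()` and `str.casefold()`; the function name `canonical_caseless_key` is made up for illustration.

```python
import unicodedata


def canonical_caseless_key(name: str) -> str:
    """NFD(casefold(NFD(name))), following Unicode's canonical caseless matching."""
    decomposed = unicodedata.normalize("NFD", name)
    folded = decomposed.casefold()
    # Folding can yield sequences that are no longer normalized,
    # so the result is normalized a second time.
    return unicodedata.normalize("NFD", folded)


precomposed = "\u00c9t\u00e9.txt"    # "Été.txt" using precomposed letters
combining = "E\u0301te\u0301.TXT"    # same name spelled with combining accents

# The raw strings differ, but their keys are identical.
keys_match = canonical_caseless_key(precomposed) == canonical_caseless_key(combining)
```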

Notes:

- Ignore rants and specs covering generic case mapping, special casing, or case conversion (uppercase, lowercase, or titlecase).
- We care about case folding/caseless matching/toCasefold.
- Unicode Casemap FAQ
- Unicode Spec Chapter 5 Section 18: Caseless Matching.
- Unicode Data CaseFold.txt
- Unicode caseless matching stability policy.
- LWN Case Insensitive Ext4.
  - Summarizes and links to relevant discussions and changesets.
  - Super helpful in understanding design mistakes and implementation details of HFS+, APFS, Samba, and EXT4.
- Explainer on Ext4 and F2FS, provides some additional implementation details.
- ZFS unicode normalization behavior.
- MSDN Unicode handling in NTFS, NTFS never uses locale.
- Linux FS tree, best source of filename handling for any OS.

indolering commented Oct 24, 2020

> > WASI’s current proposal [...] rejects case-insensitivity
>
> I don't think WASI's current proposal takes a stand either way.

There isn't a specific proposal, but the ticket starts with "I think it would be a good idea to standardize on if the file-system APIs should be case-sensitive or not" and there are suggestions codifying case-sensitive behavior through lints of the filesystem and failing when case differs.

Furthermore, if you want to remove non-deterministic behavior from the filesystem, then you have to sort directories and files based on filenames, not their position in the FS b-tree. That requires a detailed specification.

> You may be right about convergence, particularly in user-facing systems, and that Linux distros will move to case-insensitive for at least some things

The big concern I have is that baking case-sensitivity into the runtime now would cause much larger engineering headaches later.

> ...but it still seems like it could be a long time before we could rely on all filesystems we want to run WASI on being case-insensitive.
> ...
> Is there an assumption here that the host filesystem will only be modified through WASI APIs?

If you are just providing access to the underlying filesystem (like a user's download folder), then you should just do whatever the underlying filesystem does.

> We're not trying to foist case-sensitivity on anyone; we're mainly just not sure how to efficiently implement case-insensitivity if the underlying OS doesn't do it for us.

Linux introduced per-directory support for case-insensitivity last year in EXT4 and F2FS. This will be ported to other filesystems. So a mostly-correct implementation will be fast.

As I noted in the other ticket, emulating case-insensitive semantics without the help of the filesystem is not as bad as your assumption (O(2**chars)). Case-insensitive lookups are only required when a case-sensitive lookup fails, in which case you have to perform a directory walk of the filesystem, which is O(directories).

Samba and WINE get slow for deeply nested paths (foo/bar/..., Foo/bar/...). However, we aren't building a compatibility layer for unmodified Windows binaries. We can just fail if there are two or more directories or files that match when casefolded.
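
A rough sketch of that lookup strategy in Python: try the exact (case-sensitive) name first, scan the one directory only when that fails, and refuse to guess when several entries fold to the same key. The names `match_entry`, `resolve`, and `AmbiguousNameError` are hypothetical, and `str.casefold()` is assumed as the folding function.

```python
import os
from typing import List, Optional


class AmbiguousNameError(LookupError):
    """Hypothetical error: several entries differ only by case."""


def match_entry(wanted: str, entries: List[str]) -> Optional[str]:
    """Pick the entry matching `wanted`, preferring an exact match."""
    if wanted in entries:
        return wanted
    key = wanted.casefold()
    matches = [e for e in entries if e.casefold() == key]
    if len(matches) > 1:
        # Fail instead of guessing between e.g. "foo" and "Foo".
        raise AmbiguousNameError(f"{wanted!r} matches {sorted(matches)}")
    return matches[0] if matches else None


def resolve(root: str, relative_path: str) -> Optional[str]:
    """Resolve a '/'-separated path under `root`, one component at a time."""
    current = root
    for component in relative_path.split("/"):
        candidate = os.path.join(current, component)
        if os.path.lexists(candidate):
            # Fast path: the case-sensitive lookup succeeded.
            current = candidate
            continue
        # Slow path: scan this one directory for a caseless match.
        try:
            entries = os.listdir(current)
        except (FileNotFoundError, NotADirectoryError):
            return None
        found = match_entry(component, entries)
        if found is None:
            return None
        current = os.path.join(current, found)
    return current
```

The slow path costs one directory listing per path component that misses, so a fully wrong-cased path costs one scan per level rather than anything exponential.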

Even with all the extra work that Samba and WINE do, all of my development work happens in virtual machines which mount a machine-local SMB share.

Deterministic directory iteration would incur a constant factor overhead, dominated by filesystem access. This can be turned into O(log files) by maintaining an index via inotify.
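
As an illustration, a sorted in-memory index might look like the sketch below. It assumes something else (for example an inotify watcher, not shown) calls `add` and `remove` when the directory changes; the class name `SortedDirIndex` is hypothetical. Membership tests use binary search and are O(log files); inserting into a Python list still shifts elements, so a real implementation would likely want a balanced tree.

```python
import bisect
import os


class SortedDirIndex:
    """Directory entries kept in codepoint order, independent of on-disk order."""

    def __init__(self, path: str) -> None:
        # Python compares str values codepoint by codepoint.
        self.names = sorted(os.listdir(path))

    def add(self, name: str) -> None:
        """Record a newly created entry (e.g. from a change notification)."""
        i = bisect.bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            self.names.insert(i, name)

    def remove(self, name: str) -> None:
        """Forget a deleted entry; unknown names are ignored."""
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            del self.names[i]

    def __contains__(self, name: str) -> bool:
        i = bisect.bisect_left(self.names, name)
        return i < len(self.names) and self.names[i] == name

    def __iter__(self):
        return iter(self.names)
```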

> I'd like to better understand the analogy to Rust's overflow handling here. Rust has a concept of release mode and debug mode, but Wasm execution doesn't currently. So, who picks the mode, and when do they pick it?

My focus was on how Rust punted on the question of overflow handling, as it was too difficult to settle in time for 1.0. But they did so in a way that let them back out of the decision later without breaking existing code.

So in this case, we would specify that files are entries in an NFC/NFD case-folded namespace, sorted according to Unicode codepoints or possibly some default Unicode sort order. Runtimes could detect usage of unassigned codepoints and case-insensitive naming conflicts and throw an error or display a warning.
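
A minimal sketch of such a check, assuming NFC as the normalization form (the choice between NFC and NFD is still open) and `str.casefold()` as the folding function. The names `namespace_key`, `find_conflicts`, and `check_directory` are hypothetical.

```python
import unicodedata
import warnings
from collections import defaultdict


def namespace_key(name: str) -> str:
    """Key of a name in the folded namespace (NFC is an assumption here)."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


def find_conflicts(names):
    """Map each folded key to the names that collide on it."""
    groups = defaultdict(list)
    for name in names:
        groups[namespace_key(name)].append(name)
    return {k: sorted(v) for k, v in sorted(groups.items()) if len(v) > 1}


def check_directory(names, strict=False):
    """Warn (or raise, if strict) on collisions; return names in key order."""
    for key, group in find_conflicts(names).items():
        message = f"names {group} collide under case folding (key {key!r})"
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    # Ties on the folded key are broken by the raw name, so the order
    # does not depend on how the host filesystem enumerated the entries.
    return sorted(names, key=lambda n: (namespace_key(n), n))
```

A deterministic runtime would call this with `strict=True`; a sloppy/fast one could run it as a lint and only warn.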

> Also, are there any cases where sloppy/fast implementations would lead to data loss?

There are some CVEs that exploit differences between application and filesystem Unicode handling. But forcing filenames to be valid UTF-8 will also introduce that hazard.

But given that Linux just does whatever Unicode says and only archaic and newly codified native alphabets add new Unicode casing rules....

indolering commented Oct 31, 2020

Personal notes on handling normalization in the filesystem:
