@indolering
Created October 21, 2020 04:51
Proposal for Unicode case-folding of WASI filenames.

I apologize for the wall of text. A lot of concerns were raised in the WASI case-sensitivity ticket, and I wanted to explain how everything works without just pointing to dense specs and code.

Case-insensitivity is important because it is required by end users: Windows, OS X, and Android all enforce some level of case-insensitivity. As a practical matter, this means Linux developers must manually enforce case-insensitivity. And despite much wailing and gnashing of teeth, even Linux has recently added support for case-insensitivity on a per-directory basis. Distros that care about usability will eventually adopt case-insensitivity, even if it is just for the home directories.

WASI’s current proposal trades Unix’s opaque bytes model for UTF-8 filenames, but rejects case-insensitivity. From a usability engineering perspective, this carries on the tradition of programmers alienating users because doing so makes the programmers’ own lives easier. Curiously, it also risks alienating developers on every other platform and makes the WASI filesystem abstraction much thicker.

I would instead like to propose a specification modeled on Rust’s overflow handling: precisely defined, implementation-dependent behavior that can be made deterministic, but mostly “just works” when backed by a sloppy/fast implementation that relies on the native file system. That specification is Unicode’s caseless matching, which is not the fractal of complexity that most i18n issues devolve into:

  • The actual toCasefold function is a context-free mapping of single codepoints to case-folded codepoints.
  • Only ~1,500 codepoints case-fold.
  • Only ~225 case-folded codepoints are outside the BMP.
    • Legacy systems based on UCS-2 (Windows) should be largely compatible.
  • ~1,400 codepoints map to a single case-folded codepoint.
    • Optional full variant of toCasefold maps ~100 codepoints to 2-3 casefolded codepoints.
  • Locale independent:
    • Only 2 codepoints are language dependent (Turkic i) and are skipped in both simple and full.
    • I do not believe any FS implements locale-dependent case-insensitivity, including NTFS.
    • Unicode specifically discourages this: “In most environments, such as in file systems, text is not and cannot be tagged with language information. ... For such environments, a constant, language-independent, default case folding is required.”
  • Case-folds are immutable across versions of Unicode for assigned codepoints:
    • Any codepoint folded in Unicode 5+ will casefold to the same codepoint(s) in all future versions.
    • Runtimes can prevent any non-determinism by rejecting unassigned codepoints at runtime.
    • They could also assume unassigned codepoints do not fold and emit a warning.
      • Most languages are unicase, only 0.13% of existing codepoints case-fold.
      • Could also fall back to the BoB model.
    • Older runtimes can be easily polyfilled to support new codepoints.
    • Relatively few new case-folds are added, so it should be exceedingly rare to encounter these in the wild.
      • ~30/year since 2006, trending downward.
      • Mostly ancient/arcane texts and native alphabets.
      • Still important to eventually support those communities.
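
To make the "context-free mapping" point concrete, here is a rough sketch in Python. It leans on Python's built-in str.casefold(), which applies Unicode full case folding; the helper names fold_per_codepoint and caseless_equal are mine, purely for illustration, and are not part of any WASI proposal.

```python
def fold_per_codepoint(name: str) -> str:
    # Fold every codepoint in isolation; no neighbouring characters and no
    # locale are consulted. (Helper name is hypothetical.)
    return "".join(ch.casefold() for ch in name)


def caseless_equal(a: str, b: str) -> bool:
    # Default caseless match: two names match if their folded forms are equal.
    return fold_per_codepoint(a) == fold_per_codepoint(b)


print(caseless_equal("README.TXT", "readme.txt"))  # True
print(fold_per_codepoint("Straße"))                # strasse (full folding: ß -> ss)
print(caseless_equal("I", "\u0131"))               # False: the Turkic mapping is skipped
```

Because each codepoint folds independently, folding the whole string and folding it character by character give the same result, which is what keeps the mapping table small and the implementation simple.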

I believe the biggest sticking point would be the addition of new case-folds. I think parameterizing a runtime based on a Unicode version is a reasonable requirement, as using new codepoints which also casefold would rarely happen in practice. And it’s not as if WASM runtimes will never add new functionality, which also changes behavior.
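
As a sketch of what "parameterizing on a Unicode version" could look like, the snippet below rejects any codepoint that is unassigned in the Unicode tables the runtime ships with. It uses Python's unicodedata module as a stand-in for a runtime's tables; fold_strict is a hypothetical name, and raising an error (rather than warning and passing the codepoint through unfolded) is just one of the options listed above.

```python
import unicodedata

# Version of the Unicode tables bundled with this interpreter.
UNICODE_VERSION = unicodedata.unidata_version


def fold_strict(name: str) -> str:
    # Hypothetical helper: refuse names containing codepoints that this
    # Unicode version does not assign (general category "Cn"), because a
    # later version could give them a case-fold.
    for ch in name:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                f"U+{ord(ch):04X} is unassigned in Unicode {UNICODE_VERSION}"
            )
    return name.casefold()


print(UNICODE_VERSION)
print(fold_strict("Makefile"))
```

Since folds for assigned codepoints never change, two runtimes built against different Unicode versions can only disagree on names that the older one rejects outright.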

An ideal WASI implementation would provide a deterministic filesystem, using something like sandboxfs. But most implementations would get by using their native file system’s case insensitivity. It’s true that these differ in edge cases, but as Ted Tso put it, “the world is converging enough that the latest versions of Mac OS X’s APFS and Windows NTFS behave pretty much the same way.”
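
For the deterministic end of that spectrum, the core bookkeeping is small. Below is a toy, in-memory sketch of a case-preserving, case-insensitive directory; CaselessDir and its methods are hypothetical names of my own and have nothing to do with how sandboxfs or any real WASI runtime is implemented.

```python
from typing import Dict, Optional


class CaselessDir:
    """Toy directory: remembers names as created, matches them caselessly."""

    def __init__(self) -> None:
        # Maps the case-folded key to the name exactly as it was created.
        self._entries: Dict[str, str] = {}

    def create(self, name: str) -> None:
        key = name.casefold()
        if key in self._entries:
            raise FileExistsError(
                f"{name!r} collides with existing {self._entries[key]!r}"
            )
        self._entries[key] = name

    def lookup(self, name: str) -> Optional[str]:
        # Returns the stored spelling, or None if nothing matches.
        return self._entries.get(name.casefold())


d = CaselessDir()
d.create("Report.TXT")
print(d.lookup("report.txt"))  # Report.TXT
```

A runtime that delegates to the native file system skips this table entirely and inherits whatever folding the host applies, which is where the edge-case differences come from.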

I’m also unsure whether to use NFC or NFD normalization prior to case-folding. Most text on the web is already in NFC, but NFC has an NFD processing step, so 🤷.
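
For reference, Unicode's own definition of canonical caseless matching uses NFD on both sides of the fold: NFD(toCasefold(NFD(X))). A minimal sketch of that, using Python's unicodedata.normalize and str.casefold (the function name canonical_caseless_key is made up for this example):

```python
import unicodedata


def canonical_caseless_key(name: str) -> str:
    # NFD(toCasefold(NFD(name))): decompose, fold, then decompose again,
    # since folding can produce sequences that are no longer normalized.
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


precomposed = "R\u00c9SUM\u00c9.txt"      # É as a single codepoint
combining = "re\u0301sume\u0301.txt"     # e followed by a combining acute

# Folding alone does not unify the two spellings...
print(precomposed.casefold() == combining.casefold())  # False
# ...but the normalized key does.
print(canonical_caseless_key(precomposed) == canonical_caseless_key(combining))  # True
```

If the stored key should be NFC instead, applying NFC to the result as a final step should give the same set of matches, since two strings have equal NFC forms exactly when they have equal NFD forms.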

Notes: