path-operation-notes.md

Some definitions:

A "safe" transformation is a transformation of a path A into a path B such that any file system operation on path A will produce the same behaviour if given path B.
"normalize" means remove redundant directory separators (including trailing separators) and replace non-canonical separators with canonical separators. On Windows, backslash is the canonical separator, but forward slash is also recognised as a separator; on POSIX, forward slash is the only separator character and is therefore canonical. Normalization is trivially "safe": it doesn't add or remove any path components, so it can't really result in any change in behaviour.
"simplify" means normalize, and also remove leading or internal '.' components from the path, retaining a trailing '.' component if there is one. Some examples: './a' becomes 'a', 'a/./b' becomes 'a/b', 'a/./b/./.' becomes 'a/b/.', and '.' is left unchanged. I think simplification according to these rules is safe, but I'm not absolutely certain. Retention of one trailing '.' component is necessary because test -e '/root/.' behaves differently to test -e '/root': testing the existence of '/root' only requires search (execute) permission on the '/' directory, whereas testing the existence of '/root/.' requires search permission on the '/root' directory, which is typically accessible only to the root user.

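To make the two definitions concrete, here is a minimal Python sketch for POSIX-style paths only. The function bodies are my own illustration rather than a settled design: they ignore Windows separators and drive prefixes, and they assume a leading '//' may be collapsed to '/' (POSIX leaves that case implementation-defined).

```python
def normalize(p: str) -> str:
    """Collapse runs of '/' and drop trailing separators (POSIX-style only)."""
    absolute = p.startswith("/")
    # Empty strings between separators are the redundant ones; drop them.
    parts = [c for c in p.split("/") if c]
    return ("/" if absolute else "") + "/".join(parts)


def simplify(p: str) -> str:
    """Normalize, then drop every '.' component except a trailing one."""
    n = normalize(p)
    absolute = n.startswith("/")
    parts = [c for c in n.split("/") if c]
    last = len(parts) - 1
    # Keep a '.' only when it is the final component of the path.
    kept = [c for i, c in enumerate(parts) if c != "." or i == last]
    return ("/" if absolute else "") + "/".join(kept)


if __name__ == "__main__":
    for example in ["a//b/", "./a", "a/./b", "a/./b/./.", ".", "/root/."]:
        print(repr(example), "->", repr(normalize(example)), repr(simplify(example)))
```

Note that normalize leaves './a' alone (it only touches separators), while simplify turns it into 'a'.
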
Operations I want:

path.normalize(p)
path.simplify(p)
path.join(...)
path.split(p)
path.dirname(p)
path.basename(p)

All operations should at least return normalized paths. I'm undecided whether they should all return simplified paths (if so, then we drop path.simplify from the list and just keep path.normalize). All operations should accept un-normalized paths.

There are various rules that I think are desirable for these operations:

1. Given a non-empty input path, the output should always be non-empty.
2. path.normalize and path.simplify should be idempotent. (difficult to imagine how else they could work)
3. If you give path.join a single path, it should be equivalent to path.normalize.
4. path.join(path.split(p)) == path.normalize(p)
5. path.join(path.dirname(p), path.basename(p)) == path.normalize(p)

Unfortunately, rules 1, 4 and 5 are incompatible with each other, unless operations are expected to return simplified (not just normalized) paths. Consider the normalized path 'a'. Given rule 1, both dirname('a') and basename('a') must return non-empty paths. The only sensible choice for dirname('a') is '.'. However, if you then join dirname to basename, you get './a', which breaks rule 5. Alternatively, you can implement join such that join('.', 'a') gives you 'a', but then you get join(split('./a')) == 'a', which breaks rule 4. (Or you can just break rule 1 from the start, with dirname('a') == '').
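The conflict can be reproduced with a small Python sketch. Everything in it is a hypothetical illustration: it handles POSIX-style paths only, the operations normalize but do not simplify, and `join_plain` and `join_dropping_dot` are names I made up for the two possible join behaviours.

```python
def normalize(p):
    # Collapse repeated '/' and drop trailing separators.
    absolute = p.startswith("/")
    body = "/".join(c for c in p.split("/") if c)
    return ("/" if absolute else "") + body


def split(p):
    # One list entry per component, with '/' as the first entry if absolute.
    n = normalize(p)
    parts = [c for c in n.split("/") if c]
    return (["/"] if n.startswith("/") else []) + parts


def dirname(p):
    head, sep, _ = normalize(p).rpartition("/")
    if not sep:
        return "."  # rule 1: the result must not be empty
    return head or "/"


def basename(p):
    n = normalize(p)
    return n.rpartition("/")[2] or n


def join_plain(*parts):
    # Join and normalize, keeping every component.
    return normalize("/".join(parts))


def join_dropping_dot(*parts):
    # Variant that silently drops leading '.' components.
    n = join_plain(*parts)
    while n.startswith("./"):
        n = n[2:]
    return n


# Rule 5 fails with the plain join: './a' is not normalize('a').
print(join_plain(dirname("a"), basename("a")), "vs", normalize("a"))

# Rule 4 fails with the dot-dropping join: 'a' is not normalize('./a').
print(join_dropping_dot(*split("./a")), "vs", normalize("./a"))
```

Either way, one of the two printed pairs disagrees, which is the incompatibility described above.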

Note that if all operations return simplified paths anyway, then (I think) you can meet all the above rules. In that case, split('./a') would be expected to return a single component ('a'), so you have join(split('./a')) == 'a' without breaking rule 4, and you can keep rule 1 with dirname('a') == '.', because simplification will drop the extra '.' component when you join it to basename('a').
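As a rough check of that claim, here is a Python sketch in which every operation simplifies its result. It is only an illustration (POSIX-style paths, made-up implementations), it reads "normalize" in the rules about join, split, dirname and basename as "simplify", and it checks the rules on a handful of sample paths rather than proving them.

```python
def simplify(p):
    # Drop redundant separators and every '.' except a trailing one.
    absolute = p.startswith("/")
    parts = [c for c in p.split("/") if c]
    last = len(parts) - 1
    kept = [c for i, c in enumerate(parts) if c != "." or i == last]
    return ("/" if absolute else "") + "/".join(kept)


def join(*parts):
    return simplify("/".join(parts))


def split(p):
    s = simplify(p)
    parts = [c for c in s.split("/") if c]
    return (["/"] if s.startswith("/") else []) + parts


def dirname(p):
    head, sep, _ = simplify(p).rpartition("/")
    if not sep:
        return "."  # non-empty result for a single-component path
    return head or "/"


def basename(p):
    s = simplify(p)
    return s.rpartition("/")[2] or s


def check_rules(p):
    s = simplify(p)
    # Non-empty input gives non-empty output.
    assert s and dirname(p) and basename(p) and all(split(p))
    # simplify is idempotent.
    assert simplify(s) == s
    # join of a single path is just simplify.
    assert join(p) == s
    # join undoes split.
    assert join(*split(p)) == s
    # join undoes dirname/basename.
    assert join(dirname(p), basename(p)) == s


SAMPLES = ["a", "./a", "a/./b", "a/./b/./.", ".", "/", "/a", "/.", "a//b/", "/root/."]

for sample in SAMPLES:
    check_rules(sample)
print("all rules hold on the sample paths")
```

The key step is that join('.', 'a') simplifies to 'a', so dirname('a') can stay '.' without breaking the dirname/basename round trip.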

The only problem with simplifying at every operation is that it arguably makes the system less easy to understand (because it's removing path components behind your back). I think for most path operations that's fine, but it gets scarier when you add a glob matching operation.

So... how should these operations work? Should they simplify all paths, on the basis that you usually want simplified paths anyway, that simplification should still be safe, and that it allows all 5 rules to be kept? Or should I break one of the rules, and if so, which one? (Currently I believe rule 5 is the least important and the least surprising one to break.)

johnbartholomew/path-operation-notes.md