IPLD is a system for creating decentralized systems based on content-addressable data primitives. One of the things people frequently want to do when building applications is handle files, and describe filesystems. As a result, we've compiled some thoughts and recommendations about how to describe filesystems in IPLD, and also produced some specifications that we suggest systems use and build upon.
There will not be "one true way" to describe filesystems in IPLD. IPLD is an open-ended ecosystem and there can be many different ways to accomplish goals using IPLD. What we discuss in this document will be just one way that we've thought through particularly thoroughly. If it fits your needs as an application designer, we hope you will use it; if not, we hope it is at least useful inspiration.
This document will also introduce some of the features of IPLD that we suggest using to describe filesystems, and demonstrate how we apply them. Even if you build different filesystem descriptions than ours here, these tools and the rough idea of how we compose them should probably be highly reusable.
Unixfsv2 is the name of a set of conventions we propose for handling filesystems in IPLD.
(There is also a unixfsv1! However, it's very different, and comes from IPFS, before IPLD was recognizably extracted. We won't talk much more about it here.)
Unixfsv2 is designed to fit naturally within the IPLD Data Model, is described in IPLD Schemas for clarity, and leverages features like IPLD ADLs (Advanced Data Layouts) to solve tricky problems like large data sharding in a nicely layered way.
The type for files in Unixfsv2 is quite simple; a file is just a big blob:
type File bytes
using { ADL="FBL" }
The Flexible Byte Layout ADL is used to allow support for arbitrarily large data. (The FBL ADL is good at random access, and it's not strict about internal tree structure, and it can also simply be a bunch of bytes with no linking at all for small data.)
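To make the random-access claim concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the FBL specification: a layout is modelled as either a plain bytes value or a list of (length, part) pairs, links between blocks are left out, and the names fbl_length and fbl_read are made up for this example.

```python
# Simplified model (an assumption for illustration): a layout node is either
#   - bytes                      (small data, no nesting at all), or
#   - a list of (length, part)   (each part is itself a layout node).
# Real FBL data may also put parts behind links to other blocks; omitted here.

def fbl_length(node) -> int:
    """Total byte length, read from the recorded lengths without visiting leaves."""
    if isinstance(node, bytes):
        return len(node)
    return sum(length for length, _ in node)


def fbl_read(node, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset`, skipping parts outside the range."""
    if isinstance(node, bytes):
        return node[offset:offset + size]
    out = []
    for length, part in node:
        if size <= 0:
            break
        if offset >= length:
            # The requested range starts after this part: skip it entirely.
            offset -= length
            continue
        take = min(size, length - offset)
        out.append(fbl_read(part, offset, take))
        offset = 0
        size -= take
    return b"".join(out)


small = b"hello world"
nested = [(5, b"hello"), (6, [(1, b" "), (5, b"world")])]
```

Both small and nested describe the same eleven bytes, which is the sense in which the layout is not strict about internal tree structure. Because every entry records its length, a reader can skip whole subtrees that fall outside the requested range.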
Some applications want to describe single named files. (Think: attachments in an issue tracker, etc.) This use case involves no attributes, no directory structures, etc -- just names.
We propose the following simple structure be used for this:
type NamedFile struct {
    filename String
    body File
}
Recall that for small quantities of bytes, the File type can still just be an inline set of bytes -- meaning the entire NamedFile structure could be in one block if it wants to; or, it could be the start of a document spanning multiple blocks, depending on the behavior of the ADL working on File.
Directories are essentially a map from filename to a file... or another directory! They may also involve some sort of attributes.
Let's focus on the map structure first (this is just for didactic purposes; what we actually recommend is below):
type Filename string
type File bytes
using { ADL="FBL" }
type Directory {Filename:AnyFile}
using { ADL="HAMT" }
type AnyFile union {
    | "f" File
    | "d" &Directory
} representation keyed
Notice how when we use a Directory type, we don't use the NamedFile type at all.
Filenames are already part of the structure of a Directory: it would be redundant to use the NamedFile type.
However, both still eventually lead to a File type -- this is the most important part to share, since it's likely the largest piece of data.
We use another ADL here: the Directory type uses a "HAMT" (Hash Array Mapped Trie).
A HAMT is a system for sharding a map across multiple blocks of data (it's somewhat similar to a B+ tree, but also has some rules which result in a canonicalized form, which has nice emergent behaviors when used in decentralized systems).
Using this ADL means we can support directories of nearly any size.
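The sharding and canonical-form ideas can be sketched in a few lines of Python. This is a toy model and not the IPLD HAMT specification: nodes are plain in-memory dicts rather than blocks, the bit width, bucket size, and hash function are arbitrary choices for the example, removal is not implemented, and the names slot_index, insert, lookup, and build are hypothetical.

```python
import hashlib

BIT_WIDTH = 3     # 2**3 = 8 slots per node (arbitrary choice for this sketch)
BUCKET_SIZE = 2   # entries a slot may hold before it becomes a child node


def slot_index(key: str, depth: int) -> int:
    """Pick a slot at `depth` by reading the next BIT_WIDTH bits of the key's hash."""
    digest = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    shift = 256 - BIT_WIDTH * (depth + 1)
    return (digest >> shift) & ((1 << BIT_WIDTH) - 1)


def insert(node: dict, key: str, value, depth: int = 0) -> None:
    """Insert into a node; a slot is either a sorted bucket (list) or a child (dict)."""
    idx = slot_index(key, depth)
    slot = node.get(idx)
    if slot is None:
        node[idx] = [(key, value)]
    elif isinstance(slot, dict):
        insert(slot, key, value, depth + 1)
    else:
        entries = [e for e in slot if e[0] != key] + [(key, value)]
        if len(entries) <= BUCKET_SIZE:
            node[idx] = sorted(entries)      # sorted buckets keep the form canonical
        else:
            child = {}                       # bucket overflow: push entries one level down
            for k, v in entries:
                insert(child, k, v, depth + 1)
            node[idx] = child


def lookup(node: dict, key: str, depth: int = 0):
    """Return the value stored for `key`, or None if it is absent."""
    slot = node.get(slot_index(key, depth))
    if slot is None:
        return None
    if isinstance(slot, dict):
        return lookup(slot, key, depth + 1)
    for k, v in slot:
        if k == key:
            return v
    return None


def build(pairs) -> dict:
    """Build a root node from (key, value) pairs, in the order given."""
    root = {}
    for key, value in pairs:
        insert(root, key, value)
    return root
```

In this toy model, building from the same set of entries in a different order produces an identical tree, which is the property that lets two independent parties arrive at the same blocks (and therefore the same hashes) for the same directory contents.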
Note how in the AnyFile union, the Directory member is prefixed with an & symbol.
This means a link should be there -- the Directory data will be in a new block, and we'll point to it here with a CID.
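As a rough illustration of what the keyed representation means for the data itself, the sketch below decodes an AnyFile node in Python. It assumes a keyed union appears in the Data Model as a map with exactly one entry whose key names the member; the Link class, the placeholder CID string, and the decode_any_file function are hypothetical stand-ins for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for a CID pointing at another block."""
    cid: str


# Union keys from the schema above, mapped to the member they select.
MEMBERS = {"f": "File", "d": "Directory"}


def decode_any_file(node):
    """Return (member_name, payload) for a keyed-union AnyFile node."""
    if not isinstance(node, dict) or len(node) != 1:
        raise ValueError("a keyed union is a map with exactly one entry")
    (key, payload), = node.items()
    if key not in MEMBERS:
        raise ValueError(f"unknown union key: {key!r}")
    if key == "d" and not isinstance(payload, Link):
        # The & in the schema means a link is expected, not inline data.
        raise ValueError("the Directory member must be a link")
    return MEMBERS[key], payload


file_node = {"f": b"hello"}                        # file bytes held inline
dir_node = {"d": Link("example-directory-cid")}    # directory lives in another block
```

The file member carries its bytes directly, while the directory member carries only the link; the directory's own entries are found by loading the block that link names.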
Okay, let's expand that a little bit. (This'll be more the real thing.)
We also need to account for attributes. Right now, let's keep that to an Attribs type, and we'll decide what it actually is in the next section.
Also, let's throw in another file type -- symlinks.
Here's what we get now:
type Filename string
type File bytes
using { ADL="FBL" }
type Directory {Filename:DirEnt}
using { ADL="HAMT" }
type Symlink struct {
    target String
}
type DirEnt struct {
    attribs Attribs
    content AnyFile
}
type AnyFile union {
    | "f" File
    | "d" &Directory
    | "l" Symlink
} representation keyed
type Attribs struct {
    # we'll discuss this in the next section;
    # for now, it's enough to reserve the position where it's used.
}
Here, the Attribs info is embedded into directories.
The number of blocks expected and the ways they will be sharded are the same in this schema as in the previous one, despite the added types!
Notice, also, how easy it was to add the Symlink type to the AnyFile union.
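To show how the pieces compose, here is a small Python sketch that walks a path through directories shaped like the schema above. It rests on several assumptions made for illustration: directories are plain dicts (the HAMT layer is skipped), blocks is an in-memory dict standing in for a block store, and the Link class, the walk function, and the CID strings are hypothetical. Symlinks are returned as-is rather than followed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for a CID pointing at another block."""
    cid: str


def walk(blocks: dict, directory: dict, parts: list) -> dict:
    """Follow path segments through Directory maps and return the final DirEnt."""
    if not parts:
        raise ValueError("path must have at least one segment")
    entry = None
    for i, name in enumerate(parts):
        entry = directory[name]              # KeyError if the name is absent
        content = entry["content"]
        if i < len(parts) - 1:
            if "d" not in content:
                raise NotADirectoryError(name)
            # Directories are linked, so descending means loading another block.
            directory = blocks[content["d"].cid]
    return entry


# A tiny example tree: the root directory and one linked subdirectory.
blocks = {
    "example-docs-cid": {
        "readme.md": {"attribs": {}, "content": {"f": b"hi"}},
    },
}
root = {
    "docs": {"attribs": {}, "content": {"d": Link("example-docs-cid")}},
    "latest": {"attribs": {}, "content": {"l": {"target": "docs"}}},
}
```

Each DirEnt carries its attribs next to its content, so reading the attributes of an entry needs no extra block loads beyond the directory that contains it.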
Ahh, "attributes". Here be dragons.
There are many different concepts of "attributes" out there. Windows filesystem attributes. Mac filesystem attributes. POSIX filesystem attributes (which ones?). Tar format attributes. Zip format attributes. "Xtended" attributes.
Many of these concepts of "attributes" are close -- but none of them are exactly equal to any one of the others. What, then, do we do about this?
IPLD Schemas to the rescue: We're not going to pick a single approach here, but rather outline several: applications can choose which of these concepts of "attributes" they want to plug into their overall schema.
For example:
type Attribs struct {
    executable Bool
}
This is one of the simpler attribute models one could use: is the file "executable"? (This is more of a unix'y concept than a windows concept -- but it's also a brazen simplification of the unix concept.)
Or, we could describe a much larger set of attributes:
type Attribs struct {
    mtime Int # Time since the Epoch, in 1-second granularity.
    posix Int # The familiar unixy 0777 mask packing.
    sticky Bool
    setuid Bool
    setgid Bool
    uid Int
    gid Int
}
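For readers less familiar with the 0777 mask packing mentioned in the posix field's comment, the Python sketch below unpacks such an integer into the usual rwx string. It also shows how the simpler executable flag could be derived from the larger model; that derivation rule (any execute bit set) is an assumption of this example, and the function names are made up.

```python
def unpack_mode(mode: int) -> str:
    """Render the low nine permission bits as an 'rwxr-xr-x' style string."""
    out = []
    for shift in (6, 3, 0):                  # owner, group, other
        bits = (mode >> shift) & 0o7
        out.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(out)


def is_executable(mode: int) -> bool:
    """One possible simplification: executable if any execute bit is set."""
    return bool(mode & 0o111)


# An Attribs value in the larger model (all values are made-up examples).
attribs = {
    "mtime": 0, "posix": 0o755,
    "sticky": False, "setuid": False, "setgid": False,
    "uid": 1000, "gid": 1000,
}
```

Here unpack_mode(attribs["posix"]) gives "rwxr-xr-x", and is_executable collapses the same integer into the single Bool used by the simpler model.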
Or, we could make a third set, which includes the posix and mtime fields above, but ditches uid/gid/setuid/setgid/sticky.
One can make two different schemas, one with each of these definitions of Attribs, and use either of them -- or both. Or more than two!
Remember: IPLD Schemas use structural typing, not nominative typing -- which means they can be applied as pattern recognition for data.
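The "pattern recognition" idea can be sketched as follows in Python. This is a deliberately crude stand-in for real schema matching: each Attribs variant is reduced to a map of field name to expected kind, every field is treated as required, and the names SHAPES, matches, and recognize are hypothetical.

```python
# Three Attribs variants from the text, reduced to field -> kind maps.
SHAPES = {
    "simple": {"executable": bool},
    "full": {
        "mtime": int, "posix": int,
        "sticky": bool, "setuid": bool, "setgid": bool,
        "uid": int, "gid": int,
    },
    "posix-and-mtime": {"mtime": int, "posix": int},
}


def matches(shape: dict, data) -> bool:
    """True if `data` has exactly the fields of `shape`, each of the right kind."""
    if not isinstance(data, dict) or set(data) != set(shape):
        return False
    # Use `type(...) is` so that a Python bool is not accepted where an int is expected.
    return all(type(data[name]) is kind for name, kind in shape.items())


def recognize(data) -> list:
    """Return the names of every variant that the data fits."""
    return [name for name, shape in SHAPES.items() if matches(shape, data)]
```

Nothing in the data names the schema it was written with; a reader simply tries the shapes it knows about and proceeds with whichever one fits, for example recognize({"executable": True}) or recognize({"mtime": 0, "posix": 0o644}).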
Unixfsv2 will probably not be a do-all, end-all system. Unixfsv2 is aiming to provide a simple standard for filesystems and directories that can be large in size. There are many more things Unixfsv2 is not trying to solve. For example:
- Signed (but freely readable) contents
- Encryption
- "Capability" systems (encrypted or otherwise)
- Efficient tracking of partial mutations, or conflict resolution
... and so on!
We hope that we can make as many parts of this spec reusable (piecemeal if necessary) as possible.
In general, we hope that we can have shared conventions for data structures on the leaves -- e.g., it's especially useful if we can have everyone agree on the File type, because if those leaves on a large DAG of filesystem content are shared by two different systems, even if the directory structure over them is distinctive, a great deal of the overall content will be deduplicatable.
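The deduplication argument can be illustrated with a toy content-addressed store in Python. The sketch assumes a bare SHA-256 hex digest as a stand-in for a CID and flat path-to-key dicts as a stand-in for each system's directory structure; the names put and add_tree are made up for this example.

```python
import hashlib


def put(store: dict, data: bytes) -> str:
    """Store bytes under the hash of their content and return that key."""
    key = hashlib.sha256(data).hexdigest()   # stand-in for a CID
    store[key] = data                        # identical content lands on the same key
    return key


def add_tree(store: dict, files: dict) -> dict:
    """Add every file body to the store; return this tree's own path -> key index."""
    return {path: put(store, body) for path, body in files.items()}


store = {}
# Two systems with different directory layouts over partly identical files.
tree_a = add_tree(store, {"src/main.c": b"int main(){}", "README": b"hello"})
tree_b = add_tree(store, {"backup/main.c": b"int main(){}", "NOTES": b"other"})
```

The two trees disagree about names and layout, yet tree_a["src/main.c"] and tree_b["backup/main.c"] resolve to the same key, so the store holds three blobs rather than four.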
Hopefully sharing vocabulary and patterns of design is still useful :)
// todo discuss further
As mentioned earlier -- probably there won't be "one" "true" understanding of filesystem "attributes".
But maybe we can compile a small set of them which a wide range of different projects agree to recognize. This can make interop between different projects easier.
- Unixfsv2 gets some early drafts and discussion here: https://github.com/ipld/specs/blob/caa5af41702b026683f2c35c2dc701fc88c31f98/design/history/exploration-reports/2019.06-unixfsv2-spike-01.md
- This concept of various schemas is also discussed in this gist: https://gist.github.com/warpfork/3948bd951e93c0f0b4e355d78b736f83
- Filesystem attributes (with a particular lean towards linux perspectives) and tree layouts were workshopped once upon a time in this shared document: https://hackmd.io/4lqtycvdQN2WTspBLpy3qw
- The issues and pull-requests in this repo contain various discussions of filesystems and attributes: https://github.com/ipld/legacy-unixfs-v2/