@csabahenk
Last active December 13, 2016 12:02
RFC: Filetree Schema Language

Filetree Schema Language is a format that's intended to make it easy to specify certain file tree layouts.

The idea is that the performance dataset tool (yet to be written) will be capable of producing actual file trees upon feeding it with a file tree schema.

I tried to strike a good trade-off between versatility and simplicity.

Syntactically, this manifests in being based on JSON -- that's easy to write by humans while providing the processing utility with properly typed structural data. A lot of abbreviations are added to make writing easy and expressive (eg. you can specify the empty directory as "DIR" or {}, while the full form would be [ "DIR", {"entries": {} } ]).

Semantically, it manifests as having good support for recursive constructs, while enforcing some naming conventions (ie. giving up some freedom in terms of naming the files); also, we stick with the most basic filesystem semantics, at the level of VFAT -- there is no metadata and the types known are restricted to regular files and directories. If some specific demand requires more, please speak up and I will try to adjust it accordingly.

To get some feel of it, I attach a patch against Vijaykumar's crefi tool (revision ecc49c04) which adds the dumpschema fop; so that with --fop=dumpschema the tool spits out a description of the file tree that would be created, instead of doing any changes on the filesystem. Due to the above mentioned naming conventions, the schema just approximates the names used by Vijaykumar, but the layout and the content of the files will be faithfully represented.

Enjoy!

Csaba

  • Filetree Schema Language
    • semantic sketch

    • syntax elements

      • size spec
    • file types

      • STRING
      • BINARY
      • NULL
      • LOOP
      • RANDOM
      • CALLOUT
      • abbreviations
    • the directory type

      • DIR
      • the entries object
      • abbreviations
      • expanding directory schemas
    • file tree schema

      • the ROOT attribute
      • the VERSION attribute
      • user label attributes
      • consistency requirements
      • expanding file tree schemas
      • abbreviations

Filetree Schema Language

Version 1.0.

Filetree Schema Language is a JSON-based domain specific language to describe file trees.

Text documents adhering to the specification of the Filetree Schema Language are simply referred to as "file tree schemas".

semantic sketch

The semantic atom we choose is the entity, which, in this discourse, is either a file or a directory; and a file tree is a tree of entities. (So hereby we neglect other possible filesystem objects, like symlinks or device files; also, the identity of an entity is given by its position in the tree, so there are no absolute (position-independent) identifiers like inodes.)

A file tree schema describes (potentially ambiguously) a file tree. A file tree schema is a tree of entity schemas which describe (potentially ambiguously) an entity. Let's see an example.

{
	"foo" : {
			"bar": ["STRING", "aa"],
			"baz": "NULL"
	},
	"quux": {}
}

is a file tree schema that describes the file tree we can get by executing the following commands in an empty directory:

mkdir foo quux
touch foo/baz
echo aa > foo/bar

Within these, ["STRING", "aa"], "NULL", {}, { "bar": ["STRING", "aa"], "baz": "NULL" } and the whole thing are entity schemas (the first two are file schemas, the latter three are directory schemas).

This is very straightforward, but it's hard to see what it is good for. Let's see then an ambiguous file schema:

[ "RANDOM", {"size": ["1k", "1m"] } ]

describes a file with random content of random size in between 1 kilobyte and 1 megabyte.

This is a little less trivial but the real purpose is to describe large trees with constant size schemas. Eg. the directory chain a/a/a/a/a will be possible to specify with the following schema:

{
	"ROOT":  ["entry", 5],
 
	"entry" : {
		"a": "SELF"
	}
}

and the length of the chain can be anything by replacing the 5 with the chosen value. (This is also an example of a file tree schema which is not an entity schema: the labels ROOT and entry do not correspond to directory entries.)

Or would we like an increase in width? Eg. how could we describe the binary file tree that's produced by

mkdir -p a{0,1}/a{0,1}/a{0,1}/a{0,1}/a{0,1}
touch a{0,1}/a{0,1}/a{0,1}/a{0,1}/a{0,1}/b

? This schema will do that:

{
	"ROOT": ["entry", 4],
 
	"entry" : {
		"a2": ["SELF", {"b": "NULL"}]
	}
}

That is, an index after the entry name describes an enumerated list of entries up to that index.

syntax elements

Entity schemas are specified, identified and referenced by means of labels.

Labels are strings, and they are grouped in two ways:

  • reserved vs. user: reserved ones have a meaning defined in this spec; user ones are specific to a particular file tree schema. Reserved are written ALL_CAPS, user ones should start with a lowercase letter.
  • topological vs. type vs. special: topological ones have a role to define the tree layout; type ones give type information for an entity. Special ones are, well, special.

All user labels are topological; that is, all type and special labels are reserved.

There is a fixed set of reserved labels, as follows:

  • special: VERSION
  • topological: ROOT, SELF, NONE
  • type: DIR, NULL, STRING, BINARY, LOOP, RANDOM, CALLOUT

DIR is the label representing directories; all the other type ones represent files, in various ways. An entity schema is given in the form

[ <TYPE>, {<attribute>: <value>,...} ]

where TYPE is a type label; the optional and mandatory attributes are specific to TYPE as discussed below.

Any string can serve as user label adhering to the syntactical restriction above (start with lowercase).

size spec

One particular value is size. Mathematically, a size is given by a closed non-empty interval of integers, which means a random value within that interval for any entity instantiating the schema. Note that in JSON all numbers are floats; thus in terms of JSON, we consider a dotless float to be an integer.

Thus:

  • a size spec is either exact (represents a single value) or fuzzy (represents a proper interval).
  • An exact size spec is:
    • either a dotless float;
    • or a string consisting of digits followed by "K", "M", or "G" (case-insensitively), whereby "K", "M", and "G" stand for, respectively, multipliers of kilo (1<<10), mega (1<<20) and giga (1<<30) for the number represented by the digits.
  • A fuzzy size spec is an array [a, b] with a, b being exact size specs, so that a < b.

Example:

["1k", 1536]
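The size spec rules can be sketched in Python roughly as follows. This is only an illustration, not part of the spec: the function names are made up, sizes are assumed to be non-negative, and a fuzzy spec is assumed to require a < b (a proper interval).

```python
import random
import re

# Digits followed by a case-insensitive K, M or G multiplier.
_EXACT = re.compile(r"([0-9]+)([KkMmGg])")
_MULTIPLIER = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_exact(spec):
    """Return the integer denoted by an exact size spec."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(spec, int) and not isinstance(spec, bool) and spec >= 0:
        return spec
    if isinstance(spec, str):
        match = _EXACT.fullmatch(spec)
        if match:
            return int(match.group(1)) * _MULTIPLIER[match.group(2).upper()]
    raise ValueError("not an exact size spec: %r" % (spec,))


def parse_size_spec(spec):
    """Return the inclusive (low, high) bounds of an exact or fuzzy size spec."""
    if isinstance(spec, list):
        if len(spec) != 2:
            raise ValueError("a fuzzy size spec needs exactly two bounds")
        low, high = parse_exact(spec[0]), parse_exact(spec[1])
        if low >= high:
            raise ValueError("fuzzy size spec must satisfy a < b")
        return low, high
    value = parse_exact(spec)
    return value, value


def pick_size(spec, rng=random):
    """Pick a concrete size for one entity instantiating the schema."""
    low, high = parse_size_spec(spec)
    return rng.randint(low, high)  # randint includes both end points
```

For instance, parse_size_spec(["1k", 1536]) yields the bounds (1024, 1536), and pick_size draws a fresh value within them for each instance.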

file types

For each file type we give the attributes with type and default value. Type of the value is either a native JSON type or a size spec. Attributes are optional unless declared mandatory.

STRING

Attributes: data (string, defaults to ""), size (size spec, defaults to length of data).

[In the sequel we shorten such a description as data (string: ""), size (size: len(data)).]

Represents a file with data as its content, iterated circularly up to size. Eg.

[ "STRING", {"data": "abc", "size": 5 } ]

represents a file with content "abcab"
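A minimal Python sketch of the "iterated circularly up to size" rule. It assumes that data is encoded as UTF-8, that size counts bytes and has already been resolved to an integer, and that empty data with a positive size is an error (the text above does not say what should happen in that case).

```python
from itertools import cycle, islice


def cycle_to_size(data, size):
    """Repeat the bytes of data circularly until exactly size bytes are produced."""
    if size == 0:
        return b""
    if not data:
        raise ValueError("cannot fill a non-empty file from empty data")
    return bytes(islice(cycle(data), size))


def render_string(attrs):
    """Content of a STRING file given its attribute object."""
    data = attrs.get("data", "").encode("utf-8")  # encoding is an assumption
    size = attrs.get("size", len(data))           # defaults to len(data)
    return cycle_to_size(data, size)
```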

BINARY

Attributes: data (string: ""), encoding (one of "hex", "base64", "quoted", mandatory), size (size: len(decoded data)).

Represents a file with content of data decoded from encoding, iterated circularly up to size. Eg.

[ "BINARY", {"data": "\"zero\\0separated\\0list\"", "encoding": "quoted" } ]

represents a file with content "zero\0separated\0list".

NULL

Attributes: size (size: 0).

Represents a file of size size with 0 bytes as content. In fact it's a special case of BINARY, ie.

[ "NULL", {"size": "1M" } ]

is the same as

[ "BINARY", {"data": "00", "encoding": "hex", "size": "1M" } ]
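The equivalence of NULL and BINARY can be illustrated with a short Python sketch. It only covers the "hex" and "base64" encodings; "quoted" is left out because its exact escaping rules are not spelled out here. The size is assumed to be already resolved to an integer, and the function names are hypothetical.

```python
import base64
import binascii


def decode(data, encoding):
    """Decode the data attribute of a BINARY schema."""
    if encoding == "hex":
        return binascii.unhexlify(data)
    if encoding == "base64":
        return base64.b64decode(data)
    raise NotImplementedError("encoding not covered by this sketch: %r" % encoding)


def render_binary(attrs):
    """Content of a BINARY file: decoded data, repeated circularly up to size."""
    raw = decode(attrs.get("data", ""), attrs["encoding"])  # encoding is mandatory
    size = attrs.get("size", len(raw))
    if size == 0:
        return b""
    if not raw:
        raise ValueError("cannot fill a non-empty file from empty data")
    repeats = -(-size // len(raw))  # ceiling division
    return (raw * repeats)[:size]


def render_null(attrs):
    """Content of a NULL file, expressed through BINARY as in the text above."""
    return render_binary({"data": "00", "encoding": "hex",
                          "size": attrs.get("size", 0)})
```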

LOOP

Attributes: file (string, mandatory), size (size: size of the file named file).

Represents a file with content taken from file, iterated circularly up to size. Eg.

[ "LOOP", {"file": "/etc/services" } ]

is a replica of /etc/services.

RANDOM

Attributes: size (size: 0).

Represents a file with random binary content of size size. Eg.

[ "RANDOM", {"size": ["1k", "1m"] } ]

represents a file with random content of random size in between 1 kilobyte and 1 megabyte.
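In Python, such content could be produced along these lines. This is a sketch only; it takes the interval bounds as integers, and the choice of os.urandom as the randomness source is an assumption.

```python
import os
import random


def render_random(low, high):
    """Random bytes whose length is drawn from the closed interval [low, high]."""
    size = random.randint(low, high)  # both end points are included
    return os.urandom(size)
```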

CALLOUT

Attributes: command (string or array of strings, mandatory), size (size: size of output).

Represents a file with content taken from the output of command, iterated circularly up to size.

If command is a string, it will be passed on to shell; if command is an array of strings, it will be directly executed by fork/exec (in particular, the execution will happen in execlp(3) style, ie. initial member of the command array is searched along the PATH environment variable and that will be executed with command as the argument vector). The path of the file to be created will be passed on to the program in the ENTRYPATH environment variable (it is guaranteed to be a relative path without ".." components). Eg.

[ "CALLOUT", {"command": "echo -n $ENTRYPATH | md5sum | cut -d ' ' -f 1"} ]

represents a file with its content being the MD5 sum of its path from the top of the file tree.
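The two ways of running command might be sketched in Python as follows. The helper name run_callout is hypothetical; the sketch relies on subprocess.run passing a string to the shell when shell=True and searching PATH for the first member of an argument list otherwise (on POSIX systems).

```python
import os
import subprocess
import sys


def run_callout(command, entry_path):
    """Run a CALLOUT command and return its standard output as bytes."""
    # The path of the file being created is exported as ENTRYPATH.
    env = dict(os.environ, ENTRYPATH=entry_path)
    # A string goes through the shell; an array is executed directly.
    use_shell = isinstance(command, str)
    result = subprocess.run(command, shell=use_shell, env=env,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


# Array form: the current Python interpreter echoes ENTRYPATH back.
echo_path = [sys.executable, "-c",
             "import os, sys; sys.stdout.write(os.environ['ENTRYPATH'])"]
```

The returned bytes would then be cut or repeated circularly according to the size attribute, as for the other file types.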

abbreviations

The following abbreviations can be applied:

  • [ <TYPE>, {} ] can be shortened either as [ <TYPE> ] or just <TYPE>. Thus, eg., "NULL" specifies an empty file, abbreviating [ "NULL", {} ].
  • if the only given attribute is the first one (wrt. the order it occurs in the type definitions above), the enclosing JSON object can be omitted. Thus, eg., for the STRING type the first attribute is data, which lets us write [ "STRING", "aa" ] instead of [ "STRING", { "data": "aa" } ].
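Undoing these abbreviations is mechanical. The Python sketch below (hypothetical names) takes the first attribute of each file type from the definitions above, and assumes that a second array member which is not a JSON object is always the value of that first attribute.

```python
# First attribute of each file type, in the order used by the definitions above.
FIRST_ATTRIBUTE = {
    "STRING": "data",
    "BINARY": "data",
    "NULL": "size",
    "LOOP": "file",
    "RANDOM": "size",
    "CALLOUT": "command",
}


def normalize_file_schema(schema):
    """Expand an abbreviated file schema to the full [TYPE, {attributes}] form."""
    if isinstance(schema, str):
        schema = [schema]  # "NULL" abbreviates [ "NULL" ]
    if not isinstance(schema, list) or not 1 <= len(schema) <= 2:
        raise ValueError("not a file schema: %r" % (schema,))
    type_label = schema[0]
    if not isinstance(type_label, str) or type_label not in FIRST_ATTRIBUTE:
        raise ValueError("not a file type label: %r" % (type_label,))
    if len(schema) == 1:
        return [type_label, {}]  # [ "NULL" ] abbreviates [ "NULL", {} ]
    attrs = schema[1]
    if not isinstance(attrs, dict):
        # A bare value stands for the first attribute of the type.
        attrs = {FIRST_ATTRIBUTE[type_label]: attrs}
    return [type_label, dict(attrs)]
```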

the directory type

DIR

Attributes: entries (object: {})

Represents a directory with entries serving as a template of its content. Eg.

[ "DIR", {"entries": {"bar": ["STRING", "aa"], "baz": "NULL"} } ]

represents a directory with two files in it, bar, with content aa, and the empty file baz.

the entries object

In the above example entries is a literal specification of the content; but in general it's a template that's expanded according to certain rules.

We will refer to the attributes (or names) of entries as entry name schemas and to the values as entry specs.

entry name schemas

  • an entry name schema is a string which

    • must contain a non-numerical character
    • cannot be identical with ".."
  • if the last character of an entry name schema is not a digit, then it's a literal entry name schema, ie. an instance of the containing directory schema will have an entry of the same name.

    [ "DIR", {"entries": {"baz": "NULL"} } ]

    represents a directory with a single empty file in it, named baz.

  • if an entry name schema ends with a digit, then it is split as

    <entry name schema> = <name base><multiplicity specifier>
    

    where <multiplicity specifier> is a sequence of digits, and the last character of <name base> is not a digit. The multiplicity of the name is the integer denoted by the <multiplicity specifier>. The actual entry names are indexed copies of the <name base>, ie. with printf syntax, "<name base>%0ld" % k, for 0 ≤ k < multiplicity, with l being the number of digits in the decimal representation of multiplicity - 1 (so the indices are zero-padded to a common width).

    So for example

    [ "DIR", {"entries": {"baz3": "NULL"} } ]

    represents a directory with 3 empty files in it, named baz0, baz1 and baz2.

  • An exception is ".", called inline entry name schema. It has a special value (which is not an entry spec), see below.

This relation between entry names and entry name schemas will be referred to as derived from. So in the above example the actual entry names baz0, baz1 and baz2 are derived from the entry name schema baz3.
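The derivation of entry names can be sketched in Python like this. The function name is hypothetical, only ASCII digits are treated as digits, and the inline name "." is assumed to be filtered out by the caller before this function is used.

```python
import re

# <name base> ends with a non-digit; <multiplicity specifier> is the digit tail.
_INDEXED = re.compile(r"(.*[^0-9])([0-9]+)", re.DOTALL)


def derive_entry_names(name_schema):
    """Return the list of entry names derived from an entry name schema."""
    if name_schema == "..":
        raise ValueError('".." is not a valid entry name schema')
    if not re.search(r"[^0-9]", name_schema):
        raise ValueError("an entry name schema needs a non-numerical character")
    match = _INDEXED.fullmatch(name_schema)
    if match is None:
        return [name_schema]  # literal entry name schema
    base, digits = match.groups()
    multiplicity = int(digits)
    if multiplicity == 0:
        return []
    width = len(str(multiplicity - 1))  # digits of the largest index
    return ["%s%0*d" % (base, width, k) for k in range(multiplicity)]
```

So derive_entry_names("baz3") gives baz0, baz1 and baz2, while a schema like "f11" gives the zero-padded names f00 through f10.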

entry specs

In their full form, entry specs are arrays of the following formats:

  • either [ <schema reference>, <stacking level> ]
  • or [ "SELF", <schema reference>, <stacking level> ]

where:

  • <schema reference> is an entry schema, a user label, or NONE;
  • <stacking level> is a non-negative integer (dotless float).

These are called, respectively, non-self-referent and self-referent entry specs.

A directory schema is self-contained if no user label occurs in it -- ie. the schema references of the entry specs of its entries object are entry schemas, and these entry schemas (should they be directory schemas) also fulfill this condition, and so on.

You can notice that in the above examples the entry specs are not arrays. That's because there we used abbreviated forms. Let's therefore discuss available abbreviations.

the inline entry name schema

The value of the inline entry name schema is an array of schema references.

abbreviations

The abbreviations discussed above, for the file schemas, also apply for directories. Thus, for example, "DIR" is a valid abbreviation for the schema that specifies the empty directory, ie. [ "DIR", {"entries": {}} ].

However, in case of directories we also allow to omit the type label and use solely the entries object; in that way, {} is also an abbreviation for the empty schema.

Furthermore, entry specs can be abbreviated:

  • the stacking level of 0 can be omitted
  • [ <schema reference> ] can be unboxed, ie. abbreviated as <schema reference>
  • [ "SELF", "NONE" ] can be abbreviated as "SELF"

[ <schema reference> ] can be unboxed also when it occurs as the value of the inline entry name schema.

expanding directory schemas

A single directory schema represents a tree of its instances. Producing this tree from the given schema is referred to as "expanding the directory schema".

User labels are references; they resolve to entry schemas (we'll discuss later how this association is defined). Therefore, as a zeroth step of the expansion, we can recursively substitute all user labels in the schema with their corresponding entry schemas (cyclical references won't be allowed so this is a finite process).

Recall, "." is the inline entry name schema and its value is an array of schema references. These are required to resolve to directory schemas, with the semantics that their entries object is merged into the current one. So, also as a pre-processing step, we can perform these object mergers (in case of an attribute conflict, the later ones overwrite the earlier ones, and the entries object of the actual directory schema trumps the others).
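The merge step might look like this in Python. It is a sketch under simplifying assumptions: every reference under "." is either a user label or an entries object written as a plain JSON object, the caller supplies a resolve function that maps a reference to an entries object which itself no longer contains ".", and the names used are hypothetical.

```python
def merge_inline(entries, resolve):
    """Merge the entries objects referenced under "." into entries."""
    inline = entries.get(".", [])
    if not isinstance(inline, list):
        inline = [inline]  # a single reference may be unboxed
    merged = {}
    for reference in inline:
        # Later references overwrite earlier ones on a name conflict.
        merged.update(resolve(reference))
    # The directory's own entries trump everything that was inlined.
    merged.update({name: spec for name, spec in entries.items() if name != "."})
    return merged


# Hypothetical label definitions used to try the function out.
definitions = {
    "base": {"x": "NULL", "y": "NULL"},
    "extra": {"y": ["STRING", "e"]},
}


def lookup(reference):
    """Resolve a user label; an entries object resolves to itself."""
    return definitions[reference] if isinstance(reference, str) else reference
```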

So we can restrict ourselves to the expansion of self-contained directory schemas where the entries object does not have "." among its attributes.

To each entry that instantiates the given entry schema we will assign a stacking level. (Note that the stacking level will have a significance only for directories and directory schemas, but formally we define this assignment for any entry and entry schema.)

Assume that

  • D is a directory schema;
  • n: X is an entry name schema / entry spec pair in its entries object;
  • d is an instance of D with stacking level s;
  • m is an entry name in d derived from n.

Then

  • if X is of the form [ E, S ] (for some entry schema E and stacking level S), then the entry at m will be an instance of E with stacking level S;
  • if X is of the form [ "SELF", E, S ] and s = 0, then again, the entry at m will be an instance of E with stacking level S;
  • if X is of the form [ "SELF", E, S ] and s > 0, then the entry at m will be an instance of D with stacking level s - 1.

A special case is when E is NONE; in that case, instance of E is to be understood as there is no entry at m (that is, {} and {"foo": "NONE"} are the same).

So, for example, an instance of

{ "a": ["SELF", { "b": "SELF"}, 2 ] }

with stacking level 2 will unfold into the directory chain a/a/a/b/b.
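The expansion rules can be tried out with the following Python sketch, which lists the paths of an instance instead of creating them. It handles self-contained directory schemas without "." entries. Two things are assumptions of the sketch rather than statements of the spec: an array whose first member is a type label is read as an entity schema (not as an entry spec), and a DIR array is expected in its full form with an entries attribute. All function names are hypothetical.

```python
import re

TYPE_LABELS = {"DIR", "NULL", "STRING", "BINARY", "LOOP", "RANDOM", "CALLOUT"}
_INDEXED = re.compile(r"(.*[^0-9])([0-9]+)", re.DOTALL)


def entry_names(name_schema):
    """Entry names derived from an entry name schema."""
    match = _INDEXED.fullmatch(name_schema)
    if match is None:
        return [name_schema]
    base, digits = match.groups()
    count = int(digits)
    width = len(str(max(count - 1, 0)))
    return ["%s%0*d" % (base, width, k) for k in range(count)]


def full_entry_spec(spec):
    """Return (is_self_referent, schema_reference, stacking_level)."""
    if spec == "SELF":
        return True, "NONE", 0
    is_array = isinstance(spec, list) and len(spec) > 0
    # Assumption: an array starting with a type label is an entity schema.
    if is_array and not (isinstance(spec[0], str) and spec[0] in TYPE_LABELS):
        if spec[0] == "SELF":
            rest = spec[1:]
            return True, (rest[0] if rest else "NONE"), (rest[1] if len(rest) > 1 else 0)
        return False, spec[0], (spec[1] if len(spec) > 1 else 0)
    return False, spec, 0


def entries_of(schema):
    """Entries object of a directory schema, or None for a file schema."""
    if isinstance(schema, dict):
        return schema
    if schema == "DIR":
        return {}
    if isinstance(schema, list) and schema and schema[0] == "DIR":
        attrs = schema[1] if len(schema) > 1 else {}
        return attrs.get("entries", {})
    return None


def expand(schema, level=0, prefix=""):
    """Yield the relative paths of an instance; directory paths end with '/'."""
    entries = entries_of(schema)
    if entries is None:
        return  # a file has no children
    for name_schema, spec in entries.items():
        is_self, target, target_level = full_entry_spec(spec)
        if is_self and level > 0:
            # Self-reference: repeat the directory schema one level lower.
            target, target_level = schema, level - 1
        if target == "NONE":
            continue  # no entry at this name
        for name in entry_names(name_schema):
            path = prefix + name
            if entries_of(target) is None:
                yield path
            else:
                yield path + "/"
                yield from expand(target, target_level, path + "/")
```

Calling list(expand({"a": ["SELF", {"b": "SELF"}, 2]}, 2)) lists the five nested directories of the chain, ending with a/a/a/b/b/.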

file tree schema

A file tree schema is a JSON object with certain labels as attributes. That is, a file tree schema

  • must have a ROOT attribute;
  • must have a VERSION attribute;
  • optionally might have any number of user labels as attributes.

the ROOT attribute

The value of the ROOT attribute is a non-self-referent entry spec.

the VERSION attribute

VERSION serves for versioning the file tree schema format itself. Its value must be 1. Further revisions of this document might specify or allow higher numeric values (possibly non-integers).

user label attributes

Values of user label attributes are entity schemas. They are regarded to be the definitions of their respective labels.

consistency requirements

Let's call a JSON object that matches the above conditions a quasi file tree schema. (Sorry for the weird name, we need to call such objects somehow, and they are not yet necessarily file tree schemas, as further conditions apply.)

Basically, what we require is that each user label should be defined and the definitions should be non-circular.

Formally, we can define the following bipartite graph between the user labels and entry schemas occurring in a quasi file tree schema (considering the { "ROOT": <entry spec> } pair of the quasi file tree schema to be an honorary directory schema):

  • there is an edge from a user label to an entry schema if the entry schema is the definition of the label;
  • there is an edge from an entry schema to a user label if it's a directory schema and some entry spec in its entries object includes the given label.

Then our requirement is:

  • there should be an edge going out from each user label in the graph;
  • the graph should not contain a directed cycle.

If a quasi file tree schema meets this condition, we call it a file tree schema.
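A rough Python sketch of this check follows. It collapses the bipartite graph into a label-to-labels graph (each definition is replaced by the set of user labels it mentions), which preserves both requirements. Reading an array that starts with a type label as an entity schema is an assumption of the sketch, and the function names are hypothetical.

```python
RESERVED = {"VERSION", "ROOT", "SELF", "NONE",
            "DIR", "NULL", "STRING", "BINARY", "LOOP", "RANDOM", "CALLOUT"}
FILE_TYPES = {"NULL", "STRING", "BINARY", "LOOP", "RANDOM", "CALLOUT"}


def refs_in_entries(entries):
    """User labels mentioned by the entry specs of an entries object."""
    found = set()
    for name, spec in entries.items():
        if name == "." and isinstance(spec, list):
            for reference in spec:  # array of schema references
                found |= refs_in_spec(reference)
        else:
            found |= refs_in_spec(spec)
    return found


def refs_in_spec(spec):
    """User labels mentioned by an entry spec or an entity schema."""
    if isinstance(spec, str):
        return set() if spec in RESERVED else {spec}
    if isinstance(spec, dict):
        return refs_in_entries(spec)  # abbreviated directory schema
    if isinstance(spec, list) and spec:
        head = spec[0]
        if head == "DIR":
            attrs = spec[1] if len(spec) > 1 else {}
            return refs_in_entries(attrs.get("entries", {}))
        if isinstance(head, str) and head in FILE_TYPES:
            return set()  # file schemas reference nothing
        if head == "SELF":
            return refs_in_spec(spec[1]) if len(spec) > 1 else set()
        return refs_in_spec(head)  # [ <schema reference>, <stacking level> ]
    return set()


def check_consistency(doc):
    """Raise ValueError unless the quasi file tree schema doc is consistent."""
    definitions = {k: v for k, v in doc.items() if k not in ("ROOT", "VERSION")}
    graph = {label: refs_in_spec(schema) for label, schema in definitions.items()}
    graph["ROOT"] = refs_in_spec(doc["ROOT"])
    for label, refs in graph.items():
        missing = refs - set(definitions)
        if missing:
            raise ValueError("undefined user labels: %s" % sorted(missing))
    state = {}

    def visit(label):
        # Depth-first search; meeting an "active" label again closes a cycle.
        if state.get(label) == "active":
            raise ValueError("circular definition involving %r" % label)
        if state.get(label) != "done":
            state[label] = "active"
            for reference in graph[label]:
                visit(reference)
            state[label] = "done"

    for label in graph:
        visit(label)


def is_consistent(doc):
    """Boolean convenience wrapper around check_consistency."""
    try:
        check_consistency(doc)
    except ValueError:
        return False
    return True
```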

expanding file tree schemas

Let the value of the ROOT attribute be of the form [ E, S ], where E is a schema reference and S is a stacking level; then an instance of this file tree schema is an instance of the entry schema referred to by E with stacking level S.

So, for example, an instance of

{
	"ROOT": [ "entry", 2 ],
	"VERSION": 1,
	"entry": { "a": ["SELF", { "b": "SELF"}, 2 ] }
}

is the directory chain a/a/a/b/b.

abbreviations

As an abbreviation, the VERSION attribute might be omitted.

The entry spec that is the value of the ROOT attribute can be abbreviated as entry specs in general.

{ "ROOT": E }, with E being a self-contained entry schema such that neither "ROOT" nor "VERSION" is among its entry name schemas, can be abbreviated by E. In other words, any self-contained entry schema E, with neither "ROOT" nor "VERSION" among its entry name schemas, is identified with the file tree schema that has E as its ROOT with stacking level 0.

So for example

{
	"ROOT": "entry",
	"VERSION": 1,
	"entry": { "a": "NULL" }
}

can be abbreviated as

{ "ROOT": { "a": "NULL" } }

and then in turn,

{ "a": "NULL" }

and this expands to the file tree that's obtained by

touch a
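
Undoing the top-level abbreviations can be sketched in Python as follows. The function name is hypothetical, and the sketch only restores the ROOT and VERSION attributes; it leaves the ROOT entry spec itself in whatever (possibly abbreviated) form it was given.

```python
def normalize_tree_schema(doc):
    """Wrap a bare entry schema into { "ROOT": ... } and fill in VERSION."""
    if isinstance(doc, dict) and "ROOT" in doc:
        full = dict(doc)
    elif isinstance(doc, dict) and "VERSION" in doc:
        # A bare entry schema must not use "VERSION" as an entry name schema.
        raise ValueError('"VERSION" given without "ROOT"')
    else:
        full = {"ROOT": doc}  # a bare entry schema E abbreviates { "ROOT": E }
    full.setdefault("VERSION", 1)  # VERSION may be omitted
    if full["VERSION"] != 1:
        raise ValueError("unsupported VERSION: %r" % (full["VERSION"],))
    return full
```
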
diff --git a/crefi.py b/crefi.py
index c3e8c9c..4156cf3 100755
--- a/crefi.py
+++ b/crefi.py
@@ -11,6 +11,7 @@ import string
 import errno
 import logging
 import tarfile
+import json
 datsiz = 0
 timr = 0
@@ -281,6 +282,20 @@ def bytes2human(byts):
 def multipledir(mnt_pnt,brdth,depth,files,fop, file_type="text",inter="1000", size="100K",mins="10K",maxs="500K",rand=False,l=10, randname=False):
+    if fop == "dumpschema":
+        tree = singledir(mnt_pnt, files, fop, file_type, inter,
+                         size, mins, maxs, rand, l, randname)
+        for fn, tree0 in tree.items():
+            pass
+        lnam = "level" + str(brdth)
+        tree = { "ROOT": {}, "filedesc": tree0 }
+        if depth == 1:
+            tree["ROOT"][lnam] = "filedesc"
+        else:
+            tree["ROOT"][lnam] = ["dirlayout", depth - 1]
+            tree["dirlayout"] = { fn: "filedesc", "level": "SELF" }
+        return tree
+
     files_count = 1
     size = human2bytes(size)
     maxs = human2bytes(maxs)
@@ -345,7 +360,65 @@ def multipledir(mnt_pnt,brdth,depth,files,fop, file_type="text",inter="1000", si
+def tarscript(sizemin, sizerange=0):
+    sl = ['size=' + str(sizemin)]
+    if sizerange:
+        sl.append('size=$(($size + (`od -A n -N 4 -t u4 /dev/urandom` % ' + str(sizerange) + ')))')
+    sl.extend(['d=`mkstemp -d`',
+               'cd $d || exit 1',
+               'rmdir $d',
+               """f=`basename "$ENTRYPATH" | sed 's/\.tar\.gz\.//'`""",
+               'dd count=1 bs=$size if=/dev/urandom of="$f"',
+               'tar cz "$f"',
+               'rm "$f"'])
+    return "; ".join(sl)
+
 def singledir(mnt_pnt, files, fop, file_type="text",inter="1000", size="100K",mins="10K",maxs="500K",rand=False,l=10, randname=False):
+    if fop == "dumpschema":
+        if file_type == "tar":
+            if rand:
+                mins = human2bytes(mins)
+                maxs = human2bytes(maxs)
+                sizemin, sizerange = mins, maxs - mins
+            else:
+                size = human2bytes(size)
+                sizemin, sizerange = size, 0
+            tree = ["CALLOUT", tarscript(sizemin, sizerange) ]
+        else:
+            if file_type == "text":
+                tree = ["LOOP", {"file": "/etc/services" } ]
+            elif file_type == "sparse":
+                tree = ["NULL", {}]
+            elif file_type == "binary":
+                tree = ["RANDOM", {}]
+            else:
+                logger.info("Not a valid file type")
+                sys.exit(1)
+            def mksizspec(s):
+                try:
+                    int(s[:-1])
+                    if not s[-1].upper() in ['K', 'M', 'G']:
+                        raise ValueError
+                except ValueError:
+                    s = human2bytes(s)
+                return s
+            if rand:
+                sizspec = [ mksizspec(s) for s in (mins, maxs) ]
+            else:
+                sizspec = mksizspec(size)
+            tree[1]["size"] = sizspec
+        if randname:
+            fn = get_filename(l)
+            x = "~~"
+        else:
+            fn = "file"
+            x = ""
+        if file_type == "tar":
+            fn += ".tar.gz."
+        else:
+            fn += x
+        fn += str(files)
+        return { fn: tree }
     files_count = 1
     os.chdir(mnt_pnt)
@@ -431,20 +504,23 @@ if __name__ == '__main__':
     parser.add_option("-I", dest="inter", type="int", default=100,
                       help="print number files created of interval [defailt: %default]")
     parser.add_option("--fop", action="store", type="string", dest="fop", default="create",
-                      help="fop to be performed on the files ( create,rename,chmod,chown,chgrp,symlink,hardlink) [default: %default]")
+                      help="fop to be performed on the files ( create,rename,chmod,chown,chgrp,symlink,hardlink,dumpschema) [default: %default]")
     parser.add_option("-R", dest="randname", action="store_false", default=True,
                       help="To disable random file name [default: Enabled]")
-
+    mnt_pnt = None
     (option,args) = parser.parse_args()
-    if not args:
+    if not args and option.fop != "dumpschema":
         print "usage: <script> [option] <MNT_PT>"
         print ""
         sys.exit(1)
+    mnt_pnt = os.path.abspath(args[0])
     logger = setupLogger("testlost")
-    args[0] = os.path.abspath(args[0])
     if option.dir:
-        singledir(args[0], option.files, option.fop, option.file_type, option.inter, option.size, option.min, option.max, option.random, option.flen, option.randname)
+        tree = singledir(mnt_pnt, option.files, option.fop, option.file_type, option.inter, option.size, option.min, option.max, option.random, option.flen, option.randname)
     else:
-        multipledir(args[0], option.brdth, option.depth, option.files,option.fop, option.file_type, option.inter, option.size, option.min, option.max, option.random, option.flen, option.randname)
+        tree = multipledir(mnt_pnt, option.brdth, option.depth, option.files,option.fop, option.file_type, option.inter, option.size, option.min, option.max, option.random, option.flen, option.randname)
+    if option.fop == "dumpschema":
+        json.dump(tree, sys.stdout, indent=4)
+        print