A proc string looks like "foo a b", where foo is the name of a proc and a b a space-separated list of arguments. Below are simple backup-agnostic examples of how to write one (last argument to scat). See README for real use-case examples: full proc strings for backup and restore.
Hello, World:
# stdout of echo serves as data of the seed chunk, fed to proc
# "write", which writes to stdout:
$ echo "Hello, World!" | scat "write -"
Hello, World!
Procs may be chained as a pipe-separated list:
# Proc "cmd" feeds chunk data to stdin of a command and captures its
# stdout as data of a new chunk:
$ echo "Hello, World!" | scat "cmd cat | write -"
Hello, World!
# Proc "cmdout" produces new data:
$ scat "cmdout echo Hello, World! | write -" < /dev/null
Hello, World!
# More chaining:
$ echo -n "Hello, " | scat "cmd cat | write - | cmdout echo World! | write -"
Hello, World!
$ echo "Hello, World!" | scat "cmd gpg --batch -e -r 00828C1D | cmd gpg --batch -d | write -"
Hello, World!
$ echo "Hello, World!" | scat "cmdin tee hello" && cat hello
Hello, World!
A chain is actually just another proc with special syntax for convenience to specify its args (0..n procs) separated by pipes instead of spaces, relaxing the need for parentheses. Since a chain is itself a proc also, it may be passed as argument to other procs, surrounded with curly brackets ({}), as in:
"split | { checksum | index - }"
Important: Procs are non-blocking. In the above, the chain piped to split is run for every chunk output by split without waiting for the last one to be processed. To avoid resource hogging, limit the number of concurrent instances of a proc with backlog:
"split | backlog 8 { checksum | index - }"
Parentheses may surround the arguments to avoid ambiguity when passing procs as argument to other procs:
"backlog 8 cp(foo)"
Example: Split file foo, write chunks to bar/:
$ echo hello > foo
$ scat "split | { checksum | index - | cp bar }" < foo > foo_index
$ ls bar
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
For restoring, we need a list of all the chunks produced during backup. Proc index does that: it lists checksums of chunks output by its containing chain, preserving order. Note that it's part of a subchain ({}), following split: see index.
Re-create foo from chunk files in bar/:
$ scat "uindex | ucp bar | uchecksum | join -" < foo_index > foo
$ cat foo
hello
The following lists document procs, their purpose and arguments. Some arguments are more complex types than strings or ints: they may be procs too, or other types like dynprocs, stores, copiers, etc. See corresponding lists.
A proc can be thought of as a function that takes a chunk as input, may use its data (feed to a command, check integrity, etc.), modify its properties (checksum, target size, etc.) and possibly return new data as one or more new chunks as output (output of command, parity shards, etc.) fed to the next proc in the chain. They are classified as different types according to the nature of their action.
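One way to picture this model is a minimal Python sketch in which a proc is a function from one chunk to a list of chunks. The names Chunk, checksum, split_fixed and chain are hypothetical and chosen for illustration only; they are not scat's actual API, the sketch runs sequentially whereas real procs are non-blocking, and fixed-size splitting stands in for the real split.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    data: bytes
    checksum: Optional[str] = None  # unset until a mutator assigns it


Proc = Callable[[Chunk], List[Chunk]]


def checksum(chunk: Chunk) -> List[Chunk]:
    """Mutator: assigns a property, leaves data as-is, returns the chunk."""
    chunk.checksum = hashlib.sha256(chunk.data).hexdigest()
    return [chunk]


def split_fixed(size: int) -> Proc:
    """Producer (hypothetical): returns new chunks with no checksum."""
    def proc(chunk: Chunk) -> List[Chunk]:
        d = chunk.data
        return [Chunk(d[i:i + size]) for i in range(0, len(d), size)]
    return proc


def chain(*procs: Proc) -> Proc:
    """Feeds every chunk output by one proc to the next proc."""
    def run(chunk: Chunk) -> List[Chunk]:
        chunks = [chunk]
        for p in procs:
            chunks = [out for c in chunks for out in p(c)]
        return chunks
    return run


out = chain(split_fixed(4), checksum)(Chunk(b"Hello, World!"))
print([c.data for c in out])
```

Placing checksum after the producer matters here: the chunks returned by split_fixed carry no checksum until the mutator runs on them.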
Proc types shouldn't really be a concern when choosing which to use, apart from understanding exactly what happens to the data transiting a chain and getting the order right. For instance, it does matter to know which proc produces new data to correctly place index and checksum within a chain.
Types:
- mutator: modifies properties leaving data as-is, returns the chunk
- ex: assigning a checksum
- producer: produces new data by returning one or more new chunks
- ex: compressing data: 1→1 (new data, no checksum)
- ex: splitting into smaller chunks: 1→n (new data, no checksum)
- ex: reading an index from a chunk's data: 1→n (checksum, empty data)
- ex: joining data and parity shards: n→1 (new data, no checksum)
- passthrough: doesn't modify properties nor produce new data, returns the chunk
- ex: integrity check
- delegator: doesn't modify properties nor produce new data, passes the chunk through other proc(s)
- ex: limiting the number of concurrent instances of a proc
More procs exist but aren't exposed for use in a proc string: cascade on error, path-based command, etc. See procs/.
Note about examples:
- The examples below aren't usable as standalone proc strings and are to be interpreted as extracts of larger proc strings. See README for usable examples and adapt from this list.
- The backlog recommendation is deliberately not followed either, for simplicity.
Usage: split()
Content-Defined Chunking with default chunk size (min: 512KiB, max: 8MiB)
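As a rough illustration of content-defined chunking with min/max bounds, the sketch below cuts wherever a rolling hash of the most recent bytes hits a chosen bit pattern. It uses a gear-style rolling hash, which is an assumption: the algorithm scat actually uses is not stated here. The names cdc_split, GEAR and avg_bits are hypothetical, and the sizes are tiny for demonstration.

```python
import random

_rng = random.Random(0)  # fixed seed so cut points are reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_split(data: bytes, min_size: int, max_size: int, avg_bits: int = 6):
    """Yield chunks whose boundaries depend on content, not on offsets."""
    cut_mask = (1 << avg_bits) - 1
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes are shifted out of the low bits over time.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= min_size and (h & cut_mask) == 0
        if at_boundary or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing chunk may be shorter than min_size


data = bytes(random.Random(1).getrandbits(8) for _ in range(5000))
chunks = list(cdc_split(data, min_size=16, max_size=256))
```

Because a cut depends only on nearby bytes, an insertion early in a file tends to shift only the chunks around it, which is what makes de-duplication of later chunks possible.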
- type: producer
Ex:
"split | checksum | cp my_dir"
Usage: split2(min max)
Idem split with custom min/max chunk size
- args:
  - min (bytes)
  - max (bytes)
- type: producer
Ex:
"split2 1mib 4mib"
Usage: index(path)
De-duplicates chunks and writes an index file to path
Tracks chunks output by the containing chain and writes a list of their checksums to path, preserving order.
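The relationship between the index and de-duplication can be sketched as follows. This is a simplified model under stated assumptions: an in-memory dict stands in for a store, the index is a plain list of SHA-256 hex names, and every occurrence of a chunk is listed so that order can be restored. The functions backup and restore are hypothetical names, not scat procs.

```python
import hashlib


def backup(chunks, store, index):
    """Store each chunk once under its SHA-256 name; record order in index."""
    for data in chunks:
        name = hashlib.sha256(data).hexdigest()
        index.append(name)            # every occurrence is listed, in order
        store.setdefault(name, data)  # duplicates are written only once


def restore(index, store):
    """Re-create the original stream by reading chunks in index order."""
    return b"".join(store[name] for name in index)


store, index = {}, []
backup([b"aaaa", b"bbbb", b"aaaa"], store, index)
print(len(index), "index entries,", len(store), "stored chunks")
```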
Note: index is special in that it's called at the end of the chain as well, with a reference to the chunk that entered the chain. That chunk must have its checksum assigned, otherwise the chain's output chunks can't be tracked properly. As a consequence, the main chain couldn't look like split | checksum | index - because the seed chunk doesn't have a checksum before split. Rather: split | { checksum | index - }.
- args:
  - path (string) path to index file, or - for stdout
- type: passthrough
- requires: checksum
Ex:
"split | { checksum | index - | cp my_dir }"
Checksums should generally be placed twice in a chain: an initial checksum before index, before the first producer proc, to detect duplicates, and a final checksum after the last producer proc.
Ex:
"split | { checksum | index - | gzip | checksum }"
Important: Some commands are not deterministic, such as cmd gpg -e. Two identical chunks encrypted by this proc, though they decrypt to identical original data, will result in different encrypted data and thus different checksums, so the output chunk is considered new and is re-written/uploaded. To prevent such behaviour, place the final checksum before it:
"split | { checksum | index - | gzip | checksum | cmd gpg --batch -e -r 00828C1D }"
Usage: uindex()
Reads the index from a chunk's data
Returns empty chunks (no data) with their checksum and target size assigned, ready for data retrieval by following procs.
- type: producer
Ex:
"uindex | ucp my_dir | uchecksum | join -"
Usage: backlog(nslots proc)
Limits the number of concurrent instances of proc to nslots at a time
Since procs are non-blocking, it is highly recommended to wrap a proc immediately following split or uindex with backlog as chunks usually come out of them faster than they get processed by the rest of the chain, causing goroutines to be spawned uncontrollably. Without backlog, expect high memory usage, "too many open files" errors, etc.
To serialize the execution of a proc, pass 1 as nslots. This is the equivalent of a mutex, ensuring only a single instance runs at a time.
If proc is a chain, concurrency may be further limited by nesting backlogs within it.
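The slot-limiting idea can be sketched with a counting semaphore. The code below is a Python analogy, not scat's implementation (which relies on goroutines): the backlog wrapper, slow_proc and the stats dict are hypothetical names used to show that at most nslots calls run at once even when more workers are available.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def backlog(nslots, proc):
    """Wrap proc so that at most nslots calls run at the same time."""
    slots = threading.Semaphore(nslots)

    def limited(chunk):
        with slots:  # blocks while all slots are taken
            return proc(chunk)
    return limited


stats = {"running": 0, "peak": 0}
lock = threading.Lock()


def slow_proc(chunk):
    with lock:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
    time.sleep(0.01)  # simulate work on the chunk
    with lock:
        stats["running"] -= 1
    return chunk


limited = backlog(2, slow_proc)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(limited, range(20)))
print("peak concurrency:", stats["peak"])
```

With nslots set to 1 the semaphore behaves like a mutex, which matches the serialization case described above.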
- args:
  - nslots (int) max number of instances
  - proc (proc)
- type: delegator
Ex:
# integrity check with 8 workers:
"uindex | backlog 8 uchecksum"
# writing ordered chunks requires a mutex:
"split | backlog 1 { sort | write - }"
# ...equivalent of:
"split | join -"
# process with 8 workers, write 4 files at a time:
"split | backlog 8 checksum | backlog 4 cp(my_dir)"
# ...or:
"split | backlog 8 { checksum | backlog 4 cp(my_dir) }"
Usage: checksum()
Computes and assigns checksums
- type: mutator
Ex: see index
Usage: uchecksum()
Integrity check
- type: passthrough
- requires: checksum
Ex:
"uindex | ucp my_dir | uchecksum"
Usage: gzip()
Compresses data in gzip format
- type: producer
Ex:
"gzip | checksum | cp my_dir"
Usage: ugzip()
Uncompresses data compressed by gzip
- type: producer
Ex:
"ucp my_dir | uchecksum | ugzip"
Usage: parity(ndata nparity)
Reed-Solomon erasure coding
Splits chunks into ndata data shards and nparity parity shards for error correction.
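To give an intuition for what data and parity shards buy you, here is a deliberately simplified sketch for the 2 data + 1 parity case. It uses plain XOR parity, not Reed-Solomon, so it only illustrates the shard layout and single-shard recovery; the function names and the zero-padding scheme are assumptions made for this example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def parity_2_1(data: bytes):
    """Split data into 2 equal-length data shards plus 1 XOR parity shard."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")  # pad so both shards match in length
    return [a, b, xor_bytes(a, b)]


def recover(shards, size: int) -> bytes:
    """Rebuild the original data when at most one shard is None."""
    a, b, p = shards
    if a is None:
        a = xor_bytes(b, p)
    if b is None:
        b = xor_bytes(a, p)
    return (a + b)[:size]  # strip the padding


data = b"Hello, World!"
a, b, p = parity_2_1(data)
print(recover([None, b, p], len(data)))
```

Any one of the three shards can be lost and the original is still recoverable, which is the property that the shard counts ndata and nparity generalize.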
- args:
  - ndata (int) number of data shards
  - nparity (int) number of parity shards
- type: producer
Ex:
"parity 2 1 | checksum"
Usage: uparity(ndata nparity)
Joins chunks split by parity into the original bigger chunk, recovering any error (failed integrity check, missing data)
- args: see parity
- type: producer
- requires: checksum, group (ndata + nparity)
Ex:
"uchecksum | group 3 | uparity 2 1"
Usage: group(size)
Aggregates size contiguous chunks into one for procs that work with fixed-sized groups of chunks
For instance, parity(2 1) creates 3 shard chunks from one original and uparity needs those 3 grouped together to recreate the original. Use group before uparity: see example.
- args:
  - size (int) group size
- type: producer
Ex:
"group 3 | uparity 2 1"
Usage: cmd(name arg...)
Filters a chunk's data through a command
- args:
  - name (string) command executable name: relative to $PATH or absolute path
  - 0..n arg (string) command arguments
- type: producer
- stdin ← chunk data
- stdout → chunk data
Ex:
"cmd gpg --batch --encrypt -r 00828C1D"
"cmd gpg --batch --decrypt"
Usage: cmdin(name arg...)
Runs a command using a chunk's data as stdin
- args: see cmd
- type: passthrough
- stdin ← chunk data
- stdout → (discarded)
Ex:
"cmdin tee /tmp/out"
"cmdin ssh bankmon dd of=/tmp/out"
Usage: cmdout(name arg...)
Runs a command to produce new data
- args: see cmd
- type: producer
- stdin ← (none)
- stdout → chunk data
Ex:
"cmdout date | write - | cmdout echo Hello | write -"
Usage: concur(max dynproc)
Feeds chunks to procs returned by dynproc, running only max of them at a time, concurrently
- args:
  - max (int) max number of instances
  - dynproc (dynproc)
- type: delegator
Ex:
# one transfer at a time:
"concur 1 mincopies(2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)
)"
Usage: multireader(copier...)
Retrieves data from copiers, randomly alternating between them and cascading on error (failover)
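The read strategy described above (random alternation plus failover) can be sketched as below. This is an illustrative model only: readers are plain Python callables, and multiread, broken and healthy are hypothetical names rather than anything exposed by scat.

```python
import random


def multiread(name, readers, rng=random):
    """Try readers in random order; return the first successful read."""
    errors = []
    for reader in rng.sample(readers, len(readers)):
        try:
            return reader(name)
        except Exception as err:  # cascade to the next reader on error
            errors.append(err)
    raise IOError(f"all {len(readers)} readers failed for {name}: {errors}")


def broken(name):
    raise IOError("store unreachable")


def healthy(name):
    return b"data of " + name.encode()


print(multiread("abc", [broken, healthy]))
```

Shuffling before each read spreads load across stores, while the loop guarantees a failing store does not fail the whole read as long as one copy is reachable.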
- args:
  - 0..n copier (copier)
- type: delegator
Ex:
"multireader(
a=rclone(drive:tmp/a)
b=scp(bankmon tmp/b)
)"
Usage: sort()
Sorts chunks by their original order
Since procs are non-blocking, chunks get out of order as they advance through a chain. But order is important at the time of re-assembling them into the original stream. sort buffers them until achieving a contiguous series and returns them in order.
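The buffering behaviour can be sketched as a small reorder buffer. The sketch assumes each chunk carries a sequence number starting at 0, which is a hypothetical representation of "original order" chosen for this example; in_order is not a scat function.

```python
def in_order(numbered_chunks):
    """Yield chunks by sequence number, buffering any that arrive early."""
    pending = {}
    next_seq = 0
    for seq, chunk in numbered_chunks:
        pending[seq] = chunk
        # Release the longest contiguous run that is now available.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


arrivals = [(2, b"c"), (0, b"a"), (1, b"b"), (4, b"e"), (3, b"d")]
ordered = list(in_order(arrivals))
print(b"".join(ordered))
```

Memory use grows with how far ahead chunks arrive, which is one more reason to bound concurrency upstream.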
- type: passthrough
Ex:
"sort | write -"
Usage: write(path)
Writes a chunk's data to path
- args:
  - path (string) path to write to, or - for stdout
- type: passthrough
Ex:
"write -"
Usage: join(path)
Joins chunks data in their original order, writing the concatenation to path. Short for backlog 1 { sort | write path }.
- args: see write
- type: passthrough
Ex:
"uindex | ucp my_dir | uchecksum | join -"
Every store "foo" is also available as two procs:
- "foo" (write)
- "ufoo" (read)
See corresponding stores
Ex:
"rclone drive:tmp"
"urclone drive:tmp"
A dynproc is similar to a function that takes a chunk as input and returns a variable number of procs to process that chunk.
Usage: stripe(min excl copier...)
Striping and N-copies duplication
Ensures there exist at least min copies of each chunk among all given copiers, creating missing ones as needed. Chunks are striped across stores by interleaving them in a Round-Robin fashion.
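A minimal sketch of round-robin placement with a minimum copy count follows. It ignores quotas, existing copies and the excl guarantee, and the rotation rule (advance the starting store by one per chunk) is an assumption made for illustration; scat's actual placement logic may differ. The function name stripe_placement is hypothetical.

```python
def stripe_placement(chunks, stores, min_copies):
    """Assign each chunk to min_copies distinct stores, rotating the start."""
    if min_copies > len(stores):
        raise ValueError("not enough stores for the requested copies")
    placement = {}
    start = 0
    for chunk in chunks:
        placement[chunk] = [
            stores[(start + k) % len(stores)] for k in range(min_copies)
        ]
        start = (start + 1) % len(stores)  # interleave across stores
    return placement


plan = stripe_placement(["c0", "c1", "c2", "c3"], ["a", "b", "c"], 2)
for chunk, targets in plan.items():
    print(chunk, "->", targets)
```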
If chunks are grouped with group, then stripe may guarantee that at least excl chunks within that group are put on distinct stores from the others. Required for guaranteeing recoverability from parity so that any nparity stores may be lost while maintaining ability to recompute original data from the remaining >= ndata shards.
On consecutive runs, existing copies will be reused as much as possible while meeting the min and excl requirements, making new copies as necessary to meet them. Returns an error if not possible, whether for lack of provided stores, or not enough of them available with quota left.
Stores are filled up to their quota and a little bit over due to concurrency during writes/uploads causing imprecision in calculation. In theory, quota overage may reach up to group size × max chunk size × concurrency.
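As a worked example of that upper bound, the snippet below plugs in illustrative values: a group of 3 (as produced by parity 2 1), the default 8MiB max chunk size of split, and an assumed concurrency of 4. The concurrency figure and the helper name are hypothetical.

```python
MIB = 1024 ** 2


def max_quota_overage(group_size, max_chunk_size, concurrency):
    """Theoretical bound: group size x max chunk size x concurrency."""
    return group_size * max_chunk_size * concurrency


overage = max_quota_overage(group_size=3, max_chunk_size=8 * MIB, concurrency=4)
print(overage // MIB, "MiB")  # 96 MiB
```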
- args:
  - min (int) guarantee of minimum number of copies
  - excl (int) guarantee of minimum number of exclusive chunks within a group
  - 0..n copier (quotaRes)
- type of returned procs: passthrough
- requires: checksum, group (for excl > 0)
Ex:
# RAID 1: make 2 copies
"stripe(2 0
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
)"
# RAID 5: ensure exclusivity for ndata shards
"parity 2 1 | group 3 | stripe(1 2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
c=rclone(drive2:tmp/c)
)"
Usage: mincopies(min copier...)
N-copies duplication
Idem stripe with no guarantee of exclusivity of chunks across stores. Short for stripe(min 0 copier...).
- args: see stripe
- type of returned procs: see stripe
- requires: see stripe
Ex:
"mincopies(2
a=scp(bankmon tmp/a)
b=rclone(drive:tmp/b)=2gib
)"
A store represents a storage facility, local or remote. It provides a proc for writes/uploads, another for reads (or downloads) and can list existing entries, such as files or objects in buckets.
Filenames are hexadecimal SHA256 checksum hashes (64 chars). Ex: aeef70b69d4e9dc8eb95bea114c4e992831e4185ec93145c4c893b5811079bea
Usage: rclone(remote)
Cloud storage via rclone command
Note: The remote must already be configured via rclone config.
- args:
  - remote (string) name of remote and directory in the form of "<remote>:<dir>"
- requires: checksum
Ex:
"rclone drive:tmp/backup"
Usage: cp(dir level...)
Local filesystem storage in directory dir
If levels are specified, chunks are nested within subdirectories named after successive leading characters of their checksum: one subdirectory per level, each level characters long.
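The nesting rule can be reproduced with a few lines of Python. The function nested_path is a hypothetical helper written to match the paths shown in this section's examples; it is not part of scat.

```python
def nested_path(checksum: str, *levels: int) -> str:
    """Build the path relative to dir for a chunk with the given checksum."""
    parts, pos = [], 0
    for n in levels:
        parts.append(checksum[pos:pos + n])  # next n characters of the checksum
        pos += n
    parts.append(checksum)  # the file keeps its full checksum as its name
    return "/".join(parts)


c = "fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1"
print(nested_path(c, 3, 2))
```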
- args:
  - dir (string) path to directory
  - 0..n level (int) nesting levels
- requires: checksum
Ex:
"cp path/to/foo"
# ...writes chunks to: (relative to dir)
fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
"cp path/to/foo 4"
# ...writes to:
fd9f/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
"cp path/to/foo 3 2"
# ...writes to:
fd9/fe/fd9fef3929a98c3c7ef810762cfd233a6f3b2a4e8eaae95b6c4baa17b09320a1
Usage: scp(host dir level...)
Remote file system storage via SSH
Requires the following GNU-compatible commands:
- (local) ssh for remote command execution
- (remote) dd for streaming file transfer
- (remote) find for listing existing files
Note: ssh and helper commands are used for file transfer instead of scp or sftp because scp only takes path arguments and that would require writing many temp files. Instead, files are streamed through dd without buffering to disk. For listing, neither sftp nor ls are used either due to necessary path escaping and inflexible output formatting which would have required error-prone parsing. With ssh + find, paths and file info are passed around in a manner that eliminates ambiguity: environment variables and NUL-separated strings.
- args:
  - host (string) first argument to ssh: [user@]hostname
  - dir (string) path to directory
  - 0..n level (int) see cp
- requires: checksum
Ex:
"scp bankmon /tmp"
"scp bankmon /tmp 4"
Arguments to above types
Note: In equal sign-separated pairs such as copier=limit, spaces are not allowed around =.
Quota resource, with or without quota limit
Format: copier=max or copier
- args:
  - copier (copier)
  - max (bytes) default = unlimited
Ex:
"a=scp(bankmon tmp/a)"
"b=rclone(drive:tmp/b)=2gib"
Format: id=store
- args:
  - id (string) used for internal book-keeping such as quota and stats computation
  - store (store)
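To make the two nested formats concrete (id=store, optionally followed by =max), here is a small parsing sketch. It assumes the store is always written with parentheses, as in the examples of this document, and it is not how scat itself parses arguments; parse_quota_res is a hypothetical name.

```python
def parse_quota_res(text):
    """Split 'id=store(args)[=max]' into (id, store, max or None)."""
    ident, sep, rest = text.partition("=")
    if not sep or not ident or not rest:
        raise ValueError(f"expected id=store, got {text!r}")
    if ident.endswith(" ") or rest.startswith(" "):
        raise ValueError("spaces are not allowed around '='")
    store, sep, quota = rest.partition(")=")
    if sep:
        store += ")"  # put back the parenthesis consumed by the separator
    else:
        quota = None  # no limit given: unlimited
    return ident, store, quota


print(parse_quota_res("a=scp(bankmon tmp/a)"))
print(parse_quota_res("b=rclone(drive:tmp/b)=2gib"))
```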
Ex:
"foo=rclone(drive:tmp/bar)"
Format: sequence of non-space characters
Ex:
"path/to/file"
Format: <int><unit>
Size in bytes
Ex:
1024MiB
1GiB
1000MiB
1gb
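A size string in this format could be parsed along the following lines. The unit table is an assumption: it treats units case-insensitively (as the lowercase examples elsewhere in this document suggest), reads "ib" units as powers of 1024 and plain "b" units as powers of 1000. Which units scat accepts and how it interprets them is not specified here, and parse_size is a hypothetical helper.

```python
import re

# Assumed unit table: binary (KiB, MiB, ...) vs decimal (kB, MB, ...) multiples.
UNITS = {
    "b": 1,
    "kb": 1000, "mb": 1000 ** 2, "gb": 1000 ** 3, "tb": 1000 ** 4,
    "kib": 1024, "mib": 1024 ** 2, "gib": 1024 ** 3, "tib": 1024 ** 4,
}


def parse_size(text: str) -> int:
    """Convert '<int><unit>' to a number of bytes."""
    m = re.fullmatch(r"(\d+)([a-z]+)", text.strip().lower())
    if not m or m.group(2) not in UNITS:
        raise ValueError(f"invalid size: {text!r}")
    return int(m.group(1)) * UNITS[m.group(2)]


for s in ["1024MiB", "1GiB", "1000MiB", "1gb"]:
    print(s, "=", parse_size(s), "bytes")
```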
Format: numeric characters
Ex:
"123"

#!/usr/bin/env ruby
$stdin.each_line do |line|
  line.chomp =~ /^##(#*)\s*/ or next
  level, text = $1.size, $'
  anchor = text.downcase.tr(" ", "-").gsub(/[^\w-]/, "")
  print "%s* [%s](#%s)\n" % ["\t"*level, text, anchor]
end

00_TOC.md: 01_PROCSTRING.md
	./gentoc < $< > $@