How Arq stores your backup data - https://www.arqbackup.com/arq_data_format.txt
Arq stores backup data in a format similar to that of the open-source version | |
control system 'git'. | |
Content-Addressable Storage | |
--------------------------- | |
At the most basic level, Arq stores "blobs" using the SHA1 hash of the | |
contents as the name, much like git. Because of this, each unique blob is only | |
stored once. If 2 files on your system have the same contents, only 1 copy of | |
the contents will be stored. If the contents of a file change, the SHA1 hash is | |
different and the file is stored as a different blob. | |
Files are blobs, and commits and trees are blobs as well. | |
(It's not quite that simple actually. To make the names less susceptible to | |
lookup tables, Arq actually calculates the SHA1 hash of the computerUUID | |
concatenated with the blob's data. But we'll use "SHA1" as shorthand throughout | |
this document for this SHA1-derived identifier.) | |
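The naming rule in the parenthetical above can be sketched in a few lines of Python. This is only an illustration: the function name blob_id is hypothetical, and it assumes the computerUUID is encoded as UTF-8 text and placed before the blob data, which this document does not spell out.

```python
import hashlib


def blob_id(computer_uuid: str, data: bytes) -> str:
    """Return the SHA1-derived identifier for a blob.

    Assumption: the computerUUID is UTF-8 text placed before the data;
    the format description only says the two are concatenated.
    """
    return hashlib.sha1(computer_uuid.encode("utf-8") + data).hexdigest()


uuid = "600150F6-70BB-47C6-A538-6F3A2258D524"
# Identical contents map to the same name, so they are stored only once.
assert blob_id(uuid, b"same contents") == blob_id(uuid, b"same contents")
# Changed contents produce a different name, hence a different blob.
assert blob_id(uuid, b"same contents") != blob_id(uuid, b"new contents")
```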
"Computer UUID" | |
--------------- | |
When you first run Arq and add a target ("destination"), it creates a | |
"universally unique identifier" (UUID) for your computer (referred to below as | |
the "computerUUID"). All backup objects are stored with that as a prefix. | |
Encryption Dat File | |
------------------- | |
The first time you add a folder to Arq for backing up, it prompts you to choose | |
an encryption password. Arq creates 2 randomly-generated encryption keys. The | |
first key is used for encrypting/decrypting; the second key is used for | |
creating HMACs. | |
Arq stores those keys, encrypted with the encryption password you chose, in a | |
file called /<computerUUID>/encryptionv2.dat. You can change your encryption | |
password at any time by decrypting this file with the old encryption password | |
and then re-encrypting it with your new encryption password. | |
The encryptionv2.dat file format is: | |
header                    45 4e 43 52    ENCR
                          59 50 54 49    YPTI
                          4f 4e 56 32    ONV2
salt                      xx xx xx xx
                          xx xx xx xx
HMACSHA256                xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
IV                        xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
                          xx xx xx xx
encrypted master keys     xx xx xx xx
                          ...
To create the encryptionv2.dat file: | |
1. Generate a random salt. | |
2. Generate a random IV. | |
3. Generate 2 random 32-byte "master keys" (64 bytes total). | |
4. Derive 64-byte encryption key from user-supplied encryption password using PBKDF2/HMACSHA1 (200000 rounds) and the salt from step 1. | |
5. Encrypt the master keys with AES256-CBC using the first 32 bytes of the derived key from step 4 and IV from step 2. | |
6. Calculate the HMAC-SHA256 of (IV + encrypted master keys) using the second 32 bytes of the derived key from step 4. | |
7. Concatenate the items as described in the file format shown above. | |
To get the 2 "master keys": | |
1. Copy salt from the 8 bytes after the header. | |
2. Derive 64-byte encryption key from user-supplied encryption password using PBKDF2/HMACSHA1 (200000 rounds) and the salt from step 1. | |
3. Calculate HMAC-SHA256 of (IV + encrypted master keys) using second 32 bytes of key from step 2, and verify against HMAC-SHA256 in the file. | |
4. Decrypt the encrypted master keys with AES256-CBC using the first 32 bytes of the derived key from step 2 and the IV from the file to get the 2 32-byte "master keys".
Note: We use HMACSHA1 as the PRF with PBKDF2 because that's the only one available on Windows (in .NET). | |
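The first three steps of "To get the 2 master keys" can be sketched with the Python standard library alone. The function names are hypothetical, the password is assumed to be encoded as UTF-8, and the 16-byte IV length is read off the layout shown above. The final AES256-CBC decryption is left out because the standard library has no AES implementation.

```python
import hashlib
import hmac

HEADER = b"ENCRYPTIONV2"  # the 12 header bytes shown in the layout
SALT_LEN = 8
HMAC_LEN = 32
IV_LEN = 16


def parse_encryption_dat(buf: bytes):
    """Split an encryptionv2.dat buffer into (salt, mac, iv, ciphertext)."""
    if buf[:len(HEADER)] != HEADER:
        raise ValueError("missing ENCRYPTIONV2 header")
    pos = len(HEADER)
    salt = buf[pos:pos + SALT_LEN]
    pos += SALT_LEN
    mac = buf[pos:pos + HMAC_LEN]
    pos += HMAC_LEN
    iv = buf[pos:pos + IV_LEN]
    pos += IV_LEN
    return salt, mac, iv, buf[pos:]


def derive_key(password: str, salt: bytes, rounds: int = 200000) -> bytes:
    """PBKDF2/HMACSHA1 to 64 bytes: first 32 for AES, last 32 for HMAC.

    Assumption: the password is encoded as UTF-8 before derivation.
    """
    return hashlib.pbkdf2_hmac("sha1", password.encode("utf-8"), salt,
                               rounds, dklen=64)


def verify_encryption_dat(buf: bytes, password: str) -> bool:
    """Check the stored HMAC-SHA256 over (IV + encrypted master keys)."""
    salt, mac, iv, ciphertext = parse_encryption_dat(buf)
    derived = derive_key(password, salt)
    expected = hmac.new(derived[32:], iv + ciphertext, hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)
```

A wrong password derives a different HMAC key, so verification fails before any decryption is attempted.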
EncryptedObject | |
--------------- | |
We use the term "EncryptedObject" throughout this document as shorthand to | |
describe an object containing data in the following format: | |
header                             41 52 51 4f    ARQO
HMACSHA256                         xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
master IV                          xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
                                   xx xx xx xx
encrypted data IV + session key    xx xx xx xx
                                   ...
ciphertext                         xx xx xx xx
                                   ...
To create an EncryptedObject: | |
1. Generate a random session key (Arq reuses it for up to 256 objects before replacing it). | |
2. Generate a random "data IV". | |
3. Encrypt plaintext with session key and data IV. | |
4. Generate a random "master IV". | |
5. Encrypt (data IV + session key) with AES256-CBC using the first "master key" from the Encryption Dat File and the "master IV". | |
6. Calculate HMAC-SHA256 of (master IV + "encrypted data IV + session key" + ciphertext) using the second 32-byte "master key".
7. Assemble the data in the format shown above. | |
To get the plaintext: | |
1. Calculate HMAC-SHA256 of (master IV + "encrypted data IV + session key" + ciphertext) and verify against HMAC-SHA256 in the file using the second "master key" from the Encryption Dat File. | |
2. Decrypt "encrypted data IV + session key" using the first "master key" from the Encryption Dat File and the "master IV". | |
3. Decrypt the ciphertext using the session key and data IV.
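Step 1 of "To get the plaintext" (the HMAC check) can be sketched as follows. The function names are hypothetical. The sketch also assumes the "encrypted data IV + session key" field is 64 bytes long, on the reasoning that a 16-byte IV plus a 32-byte session key is 48 bytes, which AES-CBC with PKCS#7 padding would expand to 64; this document does not state that length, so verify it against real data.

```python
import hashlib
import hmac

HEADER = b"ARQO"
HMAC_LEN = 32
IV_LEN = 16
# Assumption (not stated in this document): 16-byte data IV + 32-byte
# session key, padded by AES-CBC to 64 bytes.
ENC_IV_AND_KEY_LEN = 64


def split_encrypted_object(buf: bytes):
    """Return (mac, master_iv, encrypted_iv_and_key, ciphertext)."""
    if buf[:len(HEADER)] != HEADER:
        raise ValueError("missing ARQO header")
    pos = len(HEADER)
    mac = buf[pos:pos + HMAC_LEN]
    pos += HMAC_LEN
    master_iv = buf[pos:pos + IV_LEN]
    pos += IV_LEN
    enc_iv_and_key = buf[pos:pos + ENC_IV_AND_KEY_LEN]
    pos += ENC_IV_AND_KEY_LEN
    return mac, master_iv, enc_iv_and_key, buf[pos:]


def verify_encrypted_object(buf: bytes, hmac_master_key: bytes) -> bool:
    """Verify the HMAC-SHA256 using the second "master key"."""
    mac, master_iv, enc_iv_and_key, ciphertext = split_encrypted_object(buf)
    expected = hmac.new(hmac_master_key,
                        master_iv + enc_iv_and_key + ciphertext,
                        hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)
```

Verifying the HMAC first means a corrupted or tampered object is rejected before either decryption step runs.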
Folder Configuration Files | |
-------------------------- | |
Each time you add a folder for backup, Arq creates a UUID for it and stores 2 | |
objects at the target: | |
object: /<computer_uuid>/buckets/<folder_uuid> | |
This file contains:
1. the 9-byte header "encrypted" | |
2. an EncryptedObject containing a plist like this: | |
<plist version="1.0"> | |
<dict> | |
<key>AWSRegionName</key> | |
<string>us-east-1</string> | |
<key>BucketUUID</key> | |
<string>408E376B-ECF7-4688-902A-1E7671BC5B9A</string> | |
<key>BucketName</key> | |
<string>company</string> | |
<key>ComputerUUID</key> | |
<string>600150F6-70BB-47C6-A538-6F3A2258D524</string> | |
<key>LocalPath</key> | |
<string>/Users/stefan/src/company</string> | |
<key>LocalMountPoint</key> | |
<string>/</string>
<key>StorageType</key> | |
<integer>1</integer> | |
<key>VaultName</key> | |
<string>arq_408E376B-ECF7-4688-902A-1E7671BC5B9A</string> | |
<key>VaultCreatedTime</key> | |
<real>12345678.0</real> | |
<key>Excludes</key> | |
<dict> | |
<key>Enabled</key> | |
<false></false> | |
<key>MatchAny</key> | |
<true></true> | |
<key>Conditions</key> | |
<array></array> | |
</dict> | |
</dict> | |
</plist> | |
Only Glacier-backed folders have "VaultName" and "VaultCreatedTime" keys. | |
NOTE: The folder's UUID and name are called "BucketUUID" and "BucketName" | |
in the plist; this is a holdover from previous iterations of Arq and is not | |
to be confused with S3's "bucket" concept. | |
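Once the EncryptedObject has been decrypted, the plist can be read with Python's plistlib. The sketch below is a rough illustration with hypothetical function names; it handles only the 9-byte header and the already-decrypted XML, not the decryption itself.

```python
import plistlib

ENCRYPTED_PREFIX = b"encrypted"  # the 9-byte header


def strip_encrypted_prefix(file_bytes: bytes) -> bytes:
    """Return the EncryptedObject bytes that follow the 9-byte header."""
    if not file_bytes.startswith(ENCRYPTED_PREFIX):
        raise ValueError('folder configuration lacks the "encrypted" header')
    return file_bytes[len(ENCRYPTED_PREFIX):]


def parse_folder_config(plist_xml: bytes) -> dict:
    """Parse the decrypted plist XML into a dictionary."""
    return plistlib.loads(plist_xml)


def is_glacier_folder(config: dict) -> bool:
    """Only Glacier-backed folders carry the two Vault keys."""
    return "VaultName" in config and "VaultCreatedTime" in config
```

For example, parse_folder_config(xml)["BucketUUID"] would give the folder's UUID, and parse_folder_config(xml)["Excludes"]["Enabled"] the exclusion flag.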
Commits, Trees and Blobs | |
------------------------ | |
When Arq backs up a folder, it creates 3 types of objects: "commits", "trees" | |
and "blobs". | |
Each backup that you see in Arq corresponds to a "commit" object in the backup | |
data. Its name is the SHA1 of its contents. The commit contains the SHA1 of a | |
"tree" object in the backup data. This tree corresponds to the folder you're | |
backing up. | |
Each tree contains "nodes"; each node has either the SHA1 of another tree, or | |
the SHA1 of a file (or multiple SHA1s, see "Tree format" below). | |
All commits, trees and blobs are stored as EncryptedObjects (see | |
"EncryptedObject" above). | |
Commit Format | |
------------- | |
A "commit" contains the following bytes (see "Data Format Documentation" below | |
for explanation of [String], [UInt32], [Date], etc): | |
43 6f 6d 6d 69 74 56 30 31 31 "CommitV011" | |
[String:"<author>"] | |
[String:"<comment>"] | |
[UInt64:num_parent_commits] (this is always 0 or 1) | |
( | |
[String:parent_commit_sha1] /* can't be null */ | |
[Bool:parent_commit_encryption_key_stretched] /* present for Commit version >= 4 */
) /* repeat num_parent_commits times */ | |
[String:tree_sha1] /* can't be null */
[Bool:tree_encryption_key_stretched] /* present for Commit version >= 4 */
[Bool:tree_is_compressed] /* present for Commit version 8 and 9 only; indicates Gzip compression or none */ | |
[CompressionType:tree_compression_type] /* present for Commit version >= 10 */ | |
[String:"file://<hostname><path_to_folder>"] | |
[String:"<merge_common_ancestor_sha1>"] /* only present for Commit version 7 or *older* (was never used) */ | |
[Bool:is_merge_common_ancestor_encryption_key_stretched] /* only present for Commit version 4 to 7 */ | |
[Date:creation_date] | |
[UInt64:num_failed_files] /* only present for Commit version 3 or later */ | |
( | |
[String:"<relative_path>"] /* only present for Commit version 3 or later */ | |
[String:"<error_message>"] /* only present for Commit version 3 or later */ | |
) /* repeat num_failed_files times */ | |
[Bool:has_missing_nodes] /* only present for Commit version 8 or later */ | |
[Bool:is_complete] /* only present for Commit version 9 or later */ | |
[Data:config_plist_xml] /* a copy of the XML file as described above */ | |
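As a rough illustration, the field list above can be read sequentially from a decrypted commit. The sketch handles Commit version 10 or later only, so every version-conditional field for older commits is left out. The Reader class, parse_commit and the dictionary keys are hypothetical names, and the [Date] value is assumed to be an unsigned 64-bit millisecond count.

```python
import struct
from datetime import datetime, timezone


class Reader:
    """Sequential reader for the primitive encodings used in a commit."""

    def __init__(self, buf: bytes):
        self.buf, self.pos = buf, 0

    def take(self, n: int) -> bytes:
        chunk = self.buf[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("unexpected end of data")
        self.pos += n
        return chunk

    def unpack(self, fmt: str):
        return struct.unpack(fmt, self.take(struct.calcsize(fmt)))[0]

    def boolean(self) -> bool:
        return self.take(1) == b"\x01"

    def string(self):
        if not self.boolean():  # isNotNull flag
            return None
        return self.take(self.unpack(">Q")).decode("utf-8")

    def date(self):
        if not self.boolean():  # isNotNull flag
            return None
        millis = self.unpack(">Q")  # assumption: unsigned milliseconds
        return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

    def data(self) -> bytes:
        return self.take(self.unpack(">Q"))


def parse_commit(buf: bytes) -> dict:
    """Parse a decrypted commit of version 10 or later."""
    r = Reader(buf)
    magic = r.take(10).decode("ascii")
    if not magic.startswith("CommitV"):
        raise ValueError("not a commit")
    version = int(magic[7:])
    if version < 10:
        raise ValueError("this sketch only handles Commit version >= 10")
    c = {"version": version, "author": r.string(), "comment": r.string()}
    c["parents"] = [(r.string(), r.boolean()) for _ in range(r.unpack(">Q"))]
    c["tree_sha1"] = r.string()
    c["tree_key_stretched"] = r.boolean()
    c["tree_compression_type"] = r.unpack(">i")
    c["location"] = r.string()
    c["creation_date"] = r.date()
    c["failed_files"] = [(r.string(), r.string()) for _ in range(r.unpack(">Q"))]
    c["has_missing_nodes"] = r.boolean()
    c["is_complete"] = r.boolean()
    c["config_plist_xml"] = r.data()
    return c
```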
Tree Format | |
----------- | |
A tree contains the following bytes: | |
54 72 65 65 56 30 31 39 "TreeV019"
[Bool:xattrs_are_compressed] /* present for Tree versions 12-18 */ | |
[CompressionType:xattrs_compression_type] /* present for Tree version >= 19; indicates Gzip compression or none */ | |
[Bool:acl_is_compressed] /* present for Tree versions 12-18 */ | |
[CompressionType:acl_compression_type] /* present for Tree version >= 19; indicates Gzip compression or none */ | |
[BlobKey:xattrs_blob_key] | |
[UInt64:xattrs_size] | |
[BlobKey:acl_blob_key] | |
[Int32:uid] | |
[Int32:gid] | |
[Int32:mode] | |
[Int64:mtime_sec] | |
[Int64:mtime_nsec] | |
[Int64:flags] | |
[Int32:finderFlags] | |
[Int32:extendedFinderFlags] | |
[Int32:st_dev] | |
[Int32:st_ino] | |
[UInt32:st_nlink] | |
[Int32:st_rdev] | |
[Int64:ctime_sec] | |
[Int64:ctime_nsec] | |
[Int64:st_blocks] | |
[UInt32:st_blksize] | |
[UInt64:aggregate_size_on_disk] /* only present for Tree version 11 to 16 (never used) */ | |
[Int64:create_time_sec] /* only present for Tree version 15 or later */ | |
[Int64:create_time_nsec] /* only present for Tree version 15 or later */ | |
[UInt32:missing_node_count] /* only present for Tree version 18 or later */ | |
( | |
[String:"<missing_node_name>"] /* only present for Tree version 18 or later */ | |
) /* repeat <missing_node_count> times */ | |
[UInt32:node_count] | |
( | |
[String:"<file name>"] /* can't be null */ | |
[Node] | |
) /* repeat <node_count> times */ | |
Each [Node] contains the following bytes: | |
[Bool:isTree] | |
[Bool:data_are_compressed] /* present for Tree versions 12-18 */ | |
[CompressionType:data_compression_type] /* present for Tree version >= 19; indicates Gzip compression or none */ | |
[Bool:xattrs_are_compressed] /* present for Tree versions 12-18 */ | |
[CompressionType:xattrs_compression_type] /* present for Tree version >= 19; indicates Gzip compression or none */ | |
[Bool:acl_is_compressed] /* present for Tree versions 12-18 */ | |
[CompressionType:acl_compression_type] /* present for Tree version >= 19; indicates Gzip compression or none */ | |
[Int32:data_blob_keys_count] | |
( | |
[BlobKey:data_blob_key] | |
) /* repeat <data_blob_keys_count> times */ | |
[UInt64:data_size]
[String:"<thumbnail sha1>"] /* only present for Tree version 18 or earlier (never used) */ | |
[Bool:is_thumbnail_encryption_key_stretched] /* only present for Tree version 14 to 18 */ | |
[String:"<preview sha1>"] /* only present for Tree version 18 or earlier (never used) */ | |
[Bool:is_preview_encryption_key_stretched] /* only present for Tree version 14 to 18 */ | |
[BlobKey:xattrs_blob_key] | |
[UInt64:xattrs_size] | |
[BlobKey:acl_blob_key] | |
[Int32:uid] | |
[Int32:gid] | |
[Int32:mode] | |
[Int64:mtime_sec] | |
[Int64:mtime_nsec] | |
[Int64:flags] | |
[Int32:finderFlags] | |
[Int32:extendedFinderFlags] | |
[String:"<finder file type>"] | |
[String:"<finder file creator>"] | |
[Bool:is_file_extension_hidden] | |
[Int32:st_dev] | |
[Int32:st_ino] | |
[UInt32:st_nlink] | |
[Int32:st_rdev] | |
[Int64:ctime_sec] | |
[Int64:ctime_nsec] | |
[Int64:create_time_sec] | |
[Int64:create_time_nsec] | |
[Int64:st_blocks] | |
[UInt32:st_blksize] | |
Notes: | |
- A Node can have multiple data SHA1s if the file is very large. Arq breaks up | |
large files into multiple blobs using a rolling checksum algorithm. This way | |
Arq only backs up the parts of a file that have changed. | |
- "<xattrs_blob_key>" is the key of a blob containing the sorted extended | |
attributes of the file (see "XAttrSet Format" below). Note this means | |
extended-attribute sets are "de-duplicated". | |
- "<acl_blob_key>" is the SHA1 of the blob containing the result of acl_to_text() | |
on the file's ACL. Note this means the ACLs are "de-duplicated". | |
- "create_time_sec" and "create_time_nsec" contain the value of the | |
ATTR_CMN_CRTIME attribute of the file.
XAttrSet Format | |
--------------- | |
Each XAttrSet blob contains the following bytes: | |
58 41 74 74 72 53 65 74 56 30 30 32 "XAttrSetV002" | |
[UInt64:xattr_count] | |
( | |
[String:"<xattr name>"] /* can't be null */ | |
[Data:xattr_data] | |
) /* repeat <xattr_count> times */
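A minimal reader for a decrypted XAttrSet blob might look like this. The function name parse_xattr_set is hypothetical, and the sketch assumes the name/data pair repeats xattr_count times.

```python
import struct

MAGIC = b"XAttrSetV002"


def parse_xattr_set(buf: bytes) -> dict:
    """Return {xattr name: xattr data} from a decrypted XAttrSet blob."""
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("not an XAttrSetV002 blob")
    pos = len(MAGIC)

    def take(n: int) -> bytes:
        nonlocal pos
        chunk = buf[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("unexpected end of data")
        pos += n
        return chunk

    def u64() -> int:
        return struct.unpack(">Q", take(8))[0]

    attrs = {}
    for _ in range(u64()):  # xattr_count
        if take(1) != b"\x01":  # isNotNull flag of the [String]
            raise ValueError("xattr name can't be null")
        name = take(u64()).decode("utf-8")
        attrs[name] = take(u64())  # [Data]: length, then bytes
    return attrs
```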
More on Object Storage | |
---------------------- | |
In general, each blob is stored as an object with a path of the form: | |
/<computer_uuid>/objects/<sha1> | |
But for small files, the overhead associated with putting and getting the | |
objects to/from the storage destination makes backing them up very inefficient. | |
So, small files (files under 64KB in length) are stored in "packs", which are | |
explained below. | |
Packs | |
----- | |
Each folder configured for backup maintains 2 "packsets", one for trees and | |
commits, and one for all other small files. The packsets are named: | |
<folder_uuid>-trees | |
<folder_uuid>-blobs | |
Small files are separated into 2 packsets because the trees and commits are | |
cached locally (so that Arq gives reasonable performance for browsing backups); | |
all other small blobs don't need to be cached. | |
A packset is a set of "packs". When Arq is backing up a folder, it combines | |
small files into a single larger packfile; when the packfile reaches 10MB, it | |
is stored at the destination. Also, when Arq finishes backing up a folder it | |
stores its unsaved packfiles no matter their sizes. | |
When storing a pack, Arq stores the packfile as: | |
/<computer_uuid>/packsets/<folder_uuid>-(blobs|trees)/<sha1>.pack | |
It also stores an index of the SHA1s contained in the pack as: | |
/<computer_uuid>/packsets/<folder_uuid>-(blobs|trees)/<sha1>.index | |
Pack Index Format | |
----------------- | |
magic number ff 74 4f 63 | |
version (2) 00 00 00 02 network-byte-order | |
fanout[0] 00 00 00 02 (4-byte count of SHA1s starting with 0x00) | |
... | |
fanout[255] 00 00 f0 f2 (4-byte count of total objects == count of SHA1s starting with 0xff or smaller) | |
object[0] 00 00 00 00 (8-byte network-byte-order offset) | |
00 00 00 00 | |
00 00 00 00 (8-byte network-byte-order data length) | |
00 00 00 00 | |
00 xx xx xx (sha1 starting with 00) | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
00 00 00 00 (4 bytes for alignment) | |
object[1] 00 00 00 00 (8-byte network-byte-order offset) | |
00 00 00 00 | |
00 00 00 00 (8-byte network-byte-order data length) | |
00 00 00 00 | |
00 xx xx xx (sha1 starting with 00) | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
00 00 00 00 (4 bytes for alignment) | |
object[2] 00 00 00 00 (8-byte network-byte-order offset) | |
00 00 00 00 | |
00 00 00 00 (8-byte network-byte-order data length) | |
00 00 00 00 | |
00 xx xx xx (sha1 starting with 00) | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
00 00 00 00 (4 bytes for alignment) | |
... | |
object[f0f1] 00 00 00 00 (8-byte network-byte-order offset) | |
00 00 00 00 | |
00 00 00 00 (8-byte network-byte-order data length) | |
00 00 00 00 | |
ff xx xx xx (sha1 starting with ff) | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
00 00 00 00 (4 bytes for alignment) | |
Glacier archiveId not null 01 (1 byte) /* Glacier only */ | |
Glacier archiveId strlen 00 00 00 00 (network-byte-order 8 bytes) /* Glacier only */ | |
00 00 00 08 /* Glacier only */ | |
Glacier archiveId string xx xx xx xx (n bytes) /* Glacier only */ | |
xx xx xx xx /* Glacier only */ | |
Glacier pack size 00 00 00 00 (8-byte network-byte-order data length) /* Glacier only */ | |
00 00 00 00 /* Glacier only */ | |
20-byte SHA1 of all of the xx xx xx xx | |
above xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
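The index layout above can be read with a fixed-size struct per entry. This is only a sketch with hypothetical names. It covers the non-Glacier layout and does not interpret the Glacier-only fields. Following the fanout[255] comment, it treats the last fanout value as the total object count.

```python
import hashlib
import struct

MAGIC = bytes.fromhex("ff744f63")
# offset (8 bytes), data length (8 bytes), SHA1 (20 bytes), 4 alignment bytes
ENTRY = struct.Struct(">QQ20s4x")


def parse_pack_index(buf: bytes):
    """Return a list of (sha1_hex, offset, length) from a pack index."""
    if buf[:4] != MAGIC:
        raise ValueError("bad magic number")
    (version,) = struct.unpack_from(">I", buf, 4)
    if version != 2:
        raise ValueError("unsupported index version")
    # The trailing 20 bytes are the SHA1 of everything before them.
    if hashlib.sha1(buf[:-20]).digest() != buf[-20:]:
        raise ValueError("index checksum mismatch")
    fanout = struct.unpack_from(">256I", buf, 8)
    count = fanout[255]  # total number of objects
    pos = 8 + 256 * 4
    entries = []
    for _ in range(count):
        offset, length, sha1 = ENTRY.unpack_from(buf, pos)
        entries.append((sha1.hex(), offset, length))
        pos += ENTRY.size
    return entries
```

Each returned offset and length locate one object inside the matching .pack file.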
Pack File Format | |
---------------- | |
signature 50 41 43 4b ("PACK") | |
version (2) 00 00 00 02 (network-byte-order 4 bytes) | |
object count 00 00 00 00 (network-byte-order 8 bytes) | |
object count 00 00 f0 f2 | |
object[0] mimetype not null 01 (1 byte) (this is usually zero) | |
object[0] mimetype strlen 00 00 00 00 (network-byte-order 8 bytes) (this isn't here if not-null is zero) | |
00 00 00 08 | |
object[0] mimetype string xx xx xx xx (n bytes) | |
xx xx xx xx | |
object[0] name not null 01 (1 byte) (this is usually zero) | |
object[0] name strlen 00 00 00 00 (network-byte-order 8 bytes) (this isn't here if not-null is zero) | |
00 00 00 08 | |
object[0] name string xx xx xx xx (n bytes) | |
xx xx xx xx | |
object[0] data length 00 00 00 00 (network-byte-order 8 bytes) | |
00 00 00 06 | |
object[0] data xx xx xx xx (n bytes) | |
xx xx | |
... | |
object[f0f1] mimetype not null 01 (1 byte) (this is usually zero)
object[f0f1] mimetype len 00 00 00 00 (network-byte-order 8 bytes) (this isn't here if not-null is zero)
00 00 00 08
object[f0f1] mimetype str xx xx xx xx (n bytes)
xx xx xx xx
object[f0f1] name not null 01 (1 byte) (this is usually zero)
object[f0f1] name strlen 00 00 00 00 (network-byte-order 8 bytes) (this isn't here if not-null is zero)
00 00 00 08
object[f0f1] name string xx xx xx xx (n bytes)
xx xx xx xx
object[f0f1] data length 00 00 00 00 (network-byte-order 8 bytes)
00 00 00 04
object[f0f1] data 12 34 12 34
20-byte SHA1 of all of the xx xx xx xx | |
above xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
xx xx xx xx | |
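Reading a packfile sequentially follows directly from the layout above. The sketch below uses the hypothetical name parse_pack and assumes the trailing SHA1 covers every byte that precedes it, as the layout's last entry suggests.

```python
import hashlib
import struct


def parse_pack(buf: bytes):
    """Return a list of (mimetype, name, data) tuples from a packfile."""
    if buf[:4] != b"PACK":
        raise ValueError("bad signature")
    version, count = struct.unpack_from(">IQ", buf, 4)
    if version != 2:
        raise ValueError("unsupported pack version")
    pos = 16  # 4-byte signature + 4-byte version + 8-byte object count

    def take(n: int) -> bytes:
        nonlocal pos
        chunk = buf[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("unexpected end of data")
        pos += n
        return chunk

    def u64() -> int:
        return struct.unpack(">Q", take(8))[0]

    def string():
        if take(1) == b"\x00":  # not-null flag is zero: no length, no bytes
            return None
        return take(u64()).decode("utf-8")

    objects = []
    for _ in range(count):
        mimetype = string()
        name = string()
        objects.append((mimetype, name, take(u64())))
    # The final 20 bytes are the SHA1 of everything read so far.
    if hashlib.sha1(buf[:pos]).digest() != buf[pos:pos + 20]:
        raise ValueError("pack checksum mismatch")
    return objects
```

In practice the index gives the offset and length of a single object, so a reader rarely needs to walk the whole pack like this.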
Data Format Documentation Conventions | |
------------------------------------- | |
We used a few shortcuts in some of the data format explanations above: | |
[BlobKey:value] | |
A [BlobKey] is stored as: | |
[String:sha1] /* can't be null */ | |
[Bool:is_encryption_key_stretched] /* only present for Tree version 14 or later, Commit version 4 or later */ | |
[UInt32:storage_type] /* 1==S3, 2==Glacier; only present for Tree version 17 or later */ | |
[String:archive_id] /* only present for Tree version 17 or later, if storage_type==2 */ | |
[UInt64:archive_size] /* only present for Tree version 17 or later, if storage_type==2 */ | |
[Date:archive_upload_date] /* only present for Tree version 17 or later, if storage_type==2 */ | |
[Bool:value] | |
A [Bool] is stored as 1 byte, either 00 or 01. | |
[String:"<string>"] | |
A [String] is stored as: | |
00 or 01 isNotNull flag | |
if not null: | |
00 00 00 00 8-byte network-byte-order length | |
00 00 00 0c | |
xx xx xx xx UTF-8 string data | |
xx xx xx xx | |
xx xx xx xx | |
[UInt32:<the_number>] | |
A [UInt32] is stored as: | |
00 00 00 00 network-byte-order uint32_t | |
[Int32:<the_number>] | |
An [Int32] is stored as: | |
00 00 00 00 network-byte-order int32_t | |
[UInt64:<the_number>] | |
A [UInt64] is stored as: | |
00 00 00 00 network-byte-order uint64_t | |
00 00 00 00 | |
[Int64:<the_number>] | |
An [Int64] is stored as: | |
00 00 00 00 network-byte-order int64_t | |
00 00 00 00 | |
[Date:<the_date>] | |
A [Date] is stored as: | |
00 or 01 isNotNull flag | |
if not null: | |
00 00 01 26 8-byte network-byte-order milliseconds | |
a8 79 09 48 since the first instant of 1 January 1970, GMT. | |
[Data:<xattr_data>] | |
A [Data] is stored as: | |
[UInt64:<length>] data length | |
xx xx xx xx bytes | |
xx xx xx xx | |
xx xx xx xx | |
... | |
[CompressionType] | |
Compression type is stored as an [Int32]. | |
0 == none | |
1 == Gzip | |
2 == LZ4 |
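The conventions above translate almost directly into Python's struct module, where ">" selects network byte order. The encoder names below are hypothetical, and the [Date] value is assumed to be an unsigned millisecond count.

```python
import struct

COMPRESSION_TYPES = {"none": 0, "gzip": 1, "lz4": 2}


def encode_bool(value: bool) -> bytes:
    return b"\x01" if value else b"\x00"


def encode_uint32(n: int) -> bytes:
    return struct.pack(">I", n)


def encode_int32(n: int) -> bytes:
    return struct.pack(">i", n)


def encode_uint64(n: int) -> bytes:
    return struct.pack(">Q", n)


def encode_int64(n: int) -> bytes:
    return struct.pack(">q", n)


def encode_string(s) -> bytes:
    """isNotNull flag, then an 8-byte length and the UTF-8 bytes."""
    if s is None:
        return b"\x00"
    raw = s.encode("utf-8")
    return b"\x01" + encode_uint64(len(raw)) + raw


def encode_date(millis) -> bytes:
    """isNotNull flag, then milliseconds since 1 January 1970 GMT."""
    if millis is None:
        return b"\x00"
    return b"\x01" + encode_uint64(millis)  # assumption: unsigned


def encode_data(payload: bytes) -> bytes:
    return encode_uint64(len(payload)) + payload


def encode_compression_type(name: str) -> bytes:
    return encode_int32(COMPRESSION_TYPES[name])


# The 12-byte [String] example above: flag 01, length 00..0c, then the bytes.
assert encode_string("x" * 12)[:9] == bytes.fromhex("01000000000000000c")
# The [Date] example above: flag 01, then 00 00 01 26 a8 79 09 48.
assert encode_date(0x00000126A8790948) == bytes.fromhex("0100000126a8790948")
```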