There is a project that I've spent the last two to three months working on that uses Firebase. The project includes a web app and an iOS application and focuses heavily on real-time user interaction. We've really enjoyed working with Firebase and the Firebase web and iOS SDKs. They make this real-time programming much simpler than rolling our own data-syncing solution for the server and multiple client languages.
The project is due for release soon (in the next two weeks), and we have no effective way in place to back up our data. Firebase offers a "private backups" feature on the "Bonfire" plan, but we obviously don't want to pay the $150 / month until we absolutely have to. Until we reach a point where the Bonfire plan makes sense, we are forced to roll our own solution.
Must-haves:
- Backups of up to 10GB. The Candle plan offers up to 10GB storage / month.
- Files stored in an archivable format.
- Restore only the data you need (e.g. restore the `users` ref, but not the `messages` ref).
- Run the backups as a cron job on a low-powered cloud instance (e.g. an AWS EC2 micro instance).
Nice-to-haves:
- Incremental backups for efficiency. The Candle plan offers 50GB transfer / month.
- Files pushed off to Amazon S3 or similar for safekeeping.
Since Firebase offers this service in one of their higher-level plans, they're obviously not going to make it trivial to replicate. We should account for the following constraints.
- GET requests over the Firebase REST API cannot return more than 200MB of data.
- High-level `.child` requests are not practical due to the amount of data.
- Monthly transfer is capped at 50GB.
The simplest thing to do is to just copy your data, starting at the root, into a single JSON file. Several existing solutions on GitHub implement this approach (add links). It is now also possible with the Firebase CLI (add links).
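For reference, a minimal sketch of that naive export could look like the following, assuming a Node runtime with a global `fetch`; the project URL and secret are placeholders, and `format=export` is used so priority data is preserved.

```js
// Naive full export: one GET from the root, written straight to a file.
const fs = require('fs');

const BASE = 'https://myfirebase.firebaseio.com'; // placeholder project URL
const SECRET = process.env.FIREBASE_SECRET;       // placeholder legacy secret

async function naiveBackup() {
  const res = await fetch(`${BASE}/.json?format=export&auth=${SECRET}`);
  fs.writeFileSync('backup.json', JSON.stringify(await res.json()));
}
```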
The problem: once you exceed 200MB of data in your Firebase, this approach fails.
I was pleased when I saw that the Firebase REST API offers the URL parameter `shallow=true`. This parameter makes your GET request return only the keys at a given path.
For example, let's say `GET https://myfirebase.firebaseio.com/users.json` returns:
```json
{
  "abc": {
    "name": "Jack"
  },
  "def": {
    "name": "Jill"
  }
}
```
then `GET https://myfirebase.firebaseio.com/users.json?shallow=true` returns:
```json
{
  "abc": true,
  "def": true
}
```
and `GET https://myfirebase.firebaseio.com/users/abc/name.json?shallow=true` returns `"Jack"`.
So my immediate reaction was "that's perfect! I can just use this to incrementally build my backup without worrying about request sizes!" It worked like this simplified snippet:
```js
// Start at the root and crawl the tree breadth-first with shallow GET requests.
const BASE = 'https://myfirebase.firebaseio.com';

async function shallowCrawl() {
  const paths = [''];
  const store = {};
  while (paths.length > 0) {
    // Pop off the first path and make a shallow GET request
    const path = paths.shift();
    const res = await fetch(`${BASE}/${path}.json?shallow=true`);
    const data = await res.json();
    // If the data returned is an object, take its keys and add them
    // to the array of paths that will be requested. Otherwise, store
    // the value at its path.
    if (data !== null && typeof data === 'object') {
      Object.keys(data).forEach(key => paths.push(path ? `${path}/${key}` : key));
    } else {
      store[path] = data;
    }
  }
  return store;
}
```
Using this approach, you end up with a `store` object that contains all of the paths to individual properties in your Firebase. For example, our nested JSON users structure turns into a nice flat structure:
```
{
  'users/abc/name': 'Jack',
  'users/def/name': 'Jill'
}
```
This is convenient for restoring: just make a `PUT` to each path with its corresponding piece of data to rebuild your entire Firebase. Also, if you incrementally append this data to a file, the backup can run on very little memory, perfect for a cheap EC2 instance.
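As a rough sketch of the restore side, using the same placeholder URL and secret as before:

```js
// Replay the flat store: PUT each primitive value back at its own path.
const BASE = 'https://myfirebase.firebaseio.com'; // placeholder project URL
const SECRET = process.env.FIREBASE_SECRET;       // placeholder legacy secret

async function restore(store) {
  for (const [path, value] of Object.entries(store)) {
    await fetch(`${BASE}/${path}.json?auth=${SECRET}`, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(value),
    });
  }
}
```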
The problem: it's extremely slow to make this many `GET` requests. To be more specific, it took about 2 hours to download ~6,000 records with 3 levels of nesting at a total size of 20MB. Some profiling showed the process spent ~90% of its running time waiting on network requests. This was after "optimizing" to take up to 1,000 paths off the front of the paths array and make all 1,000 requests in parallel.
Making one huge request doesn't work due to Firebase's constraints. Making thousands of individual requests doesn't work due to network latency. The next step is to find a happy spot in between.
Using the REST API parameters `orderBy`, `startAt`, and `limitToFirst`, we can retrieve a large collection more efficiently by sharding it into multiple requests.
The following pseudocode does this:
```
limit = 10, start = "", count = limit
while count == limit:
    results = GET /users.json?format=export&orderBy="$key"&startAt=start&limitToFirst=limit
    store the results
    count = results.length
    start = key of the last result
```
So here we are requesting 10 users ordered by their keys. Then we take the key of the last user and use that as the starting point for the next ten users. Because `startAt` is inclusive, you get 10 users with the first request and 9 new users with each subsequent request. You can of course tweak the `limit` value to get more than 10 (9) users at a time.
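Here is a hedged JavaScript sketch of that pagination loop (same placeholder URL and secret); note that when ordering by key, the `orderBy` and `startAt` values have to be sent as JSON-quoted strings.

```js
// Page through a collection with orderBy/startAt/limitToFirst. Because
// startAt is inclusive, every page after the first repeats one key.
const BASE = 'https://myfirebase.firebaseio.com'; // placeholder project URL
const SECRET = process.env.FIREBASE_SECRET;       // placeholder legacy secret

async function pagedFetch(ref, limit = 10) {
  const all = {};
  let start = '';
  let count = limit;
  while (count === limit) {
    const query = `orderBy=${encodeURIComponent('"$key"')}` +
      `&startAt=${encodeURIComponent(JSON.stringify(start))}` +
      `&limitToFirst=${limit}`;
    const res = await fetch(`${BASE}/${ref}.json?format=export&auth=${SECRET}&${query}`);
    const page = (await res.json()) || {};
    const keys = Object.keys(page).sort();
    keys.forEach(key => { all[key] = page[key]; });
    count = keys.length;
    if (count > 0) start = keys[keys.length - 1];
  }
  return all;
}
```

Calling `pagedFetch('users')` reproduces the pseudocode above with pages of 10.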
Perhaps a simpler approach would be to first request all of the IDs at this location using a shallow request, sort them, chop off 10 at a time, and use the first and last as our `startAt` and `endAt` bounds. This assumes that the keys alone do not exceed 200MB, which by my estimates would be somewhere around 700,000 records.
```
keys = GET /users.json?shallow=true
keys.sort()
while keys.length > 0:
    shardKeys = keys.splice(0, 10)
    startKey = shardKeys[0]
    endKey = shardKeys[shardKeys.length - 1]
    results = GET /users.json?format=export&orderBy="$key"&startAt=startKey&endAt=endKey
    store the results
```
Cool. Now we feel a little smarter. We easily fall within Firebase's request constraints and we are much faster than requesting each property individually.
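For completeness, a JavaScript sketch of this keys-first sharding (same placeholder URL and secret), assuming the key list itself fits in a single shallow request:

```js
// Fetch all keys shallowly, then pull the records in ranges of `shardSize`.
const BASE = 'https://myfirebase.firebaseio.com'; // placeholder project URL
const SECRET = process.env.FIREBASE_SECRET;       // placeholder legacy secret

async function shardedFetch(ref, shardSize = 10) {
  const res = await fetch(`${BASE}/${ref}.json?shallow=true&auth=${SECRET}`);
  const keys = Object.keys((await res.json()) || {}).sort();
  const all = {};
  while (keys.length > 0) {
    const shard = keys.splice(0, shardSize);
    const query = `orderBy=${encodeURIComponent('"$key"')}` +
      `&startAt=${encodeURIComponent(JSON.stringify(shard[0]))}` +
      `&endAt=${encodeURIComponent(JSON.stringify(shard[shard.length - 1]))}`;
    const page = await fetch(`${BASE}/${ref}.json?format=export&auth=${SECRET}&${query}`)
      .then(r => r.json());
    Object.assign(all, page);
  }
  return all;
}
```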
~~The problem: Firebase structures will have various levels of nesting. Imagine a new social network based on pokes. Each user has a `pokes` object in which ~10K pokes have been recorded. One could argue this is poorly-architected data, but poorly-architected data still needs to be backed up. Can we make the backup solution smart enough to shard requests against each user's `pokes` object? ...keep reading~~
The next question to answer becomes: how do we know which collections to shard?
Firebase gives us a way to declare security and validation rules for our data as one big JSON-like file. If you are worried about backing up your data you've probably set up some of these rules. We can insert some "clues" within our rules to tell our backup solution where to shard requests when doing backups.
First, it's important to note that we can retrieve our Firebase rules with a simple GET request:

```
GET https://myfirebase.firebaseio.com/.settings/rules/.json?auth=myfirebasesecret
```
Next, we insert some "clues" into our rules in the form of keys with empty object values. Here is part of our actual `users` rules. Notice the subtle `"backup:shard:10": {}` about halfway down. We can use this "clue" to tell our backup script that objects at this location should be requested 10 at a time.
```json
{
  "rules": {
    "users": {
      ".read": "auth != null",
      ".indexOn": ["email", "facebookId", "name"],
      "backup:shard:10": {},
      "$userId": {
        ".write": "auth.uid === $userId",
        ".validate": "newData.hasChildren(['id', 'email', 'name', 'updatedAt', 'createdAt', 'provider'])"
      }
    },
```
At this point we figure out how to actually implement this. A rough draft would be:
- Fetch the Firebase rules and strip out any comments so we can turn it into a nice object.
- Traverse the object so that we get an array of paths. For example, in the above users object we care about `['rules/users', 'rules/users/$userId']`.
- For each of these paths, peek into the object to see if that path has a child with a `"backup:..."` key.
- If the child has a `"backup:..."` key, fetch the collection according to that key (e.g. sharding 10 requests at a time), as sketched below.
- If the child has no such key, then fetch the entire collection in one request.
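To make the traversal step concrete, here is a sketch of how the script could walk the parsed rules and collect the clues. It assumes the rules have already been fetched and parsed into a plain object; the function name and output shape are illustrative, not part of any library.

```js
// Walk the parsed rules and return entries like { path: 'users', shardSize: 10 }.
// (The "backup:ignore" clue introduced below could be collected the same way.)
function collectShardClues(node, path = '', found = []) {
  for (const [key, child] of Object.entries(node)) {
    if (key.startsWith('backup:shard:')) {
      found.push({ path, shardSize: parseInt(key.split(':')[2], 10) });
    } else if (!key.startsWith('.') && child && typeof child === 'object') {
      // Recurse into nested rule nodes (including $wildcard children) to look
      // for deeper clues; '.read', '.indexOn', etc. are skipped.
      collectShardClues(child, path ? `${path}/${key}` : key, found);
    }
  }
  return found;
}
```

Running this against the `users` rules above (`collectShardClues(rules.rules)`) would yield `[{ path: 'users', shardSize: 10 }]`.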
To make it a little more robust, we can add another backup rule, `"backup:ignore"`, for children that we don't care about backing up. For example, we have a child `user-activity` which is used to store timestamps for the "x users online" and "y users typing" indicators.
To make it more concrete, consider these example rules for a site where users can play chess and poker games:
```json
{
  "rules": {
    "users": {
      "backup:shard:10": {},
      "$userId": { ... }
    },
    "games": {
      "chess": {
        "backup:shard:20": {},
        "$chessGameId": { ... }
      },
      "poker": {
        "backup:shard:5": {},
        "$pokerGameId": { ... }
      }
    }
  }
}
```
We want to shard user requests 10 at a time, chess game requests 20 at a time, and poker game requests 5 at a time.
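Tying the clues to the earlier fetch strategies, a dispatcher for a single path could look roughly like this; `shardedFetch` is the hypothetical helper sketched earlier, and the clue lookup is assumed to also record `backup:ignore` paths.

```js
// Pick a fetch strategy for one data path based on its backup clue.
const BASE = 'https://myfirebase.firebaseio.com'; // placeholder project URL
const SECRET = process.env.FIREBASE_SECRET;       // placeholder legacy secret

async function backupPath(path, clue) {
  if (clue && clue.ignore) {
    return null; // e.g. skip user-activity entirely
  }
  if (clue && clue.shardSize) {
    // e.g. users 10 at a time, chess games 20, poker games 5
    return shardedFetch(path, clue.shardSize);
  }
  // No clue found: assume the collection is small enough for one request.
  const res = await fetch(`${BASE}/${path}.json?format=export&auth=${SECRET}`);
  return res.json();
}
```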
Hi,
Your solution sounds great, but where is it? :) You wrote two months ago that you were going to release it in the next two weeks. I just want to know if you are still working on this solution, or if I should move on to something else because you have retired from this project. Your solution sounds great and I would not be able to pull this off alone, but I need to move on in case you don't work on this anymore. Thanks for a short heads-up in advance...