MongoDB schema comparison: large objects vs. hash buckets

Original Schema

The original-insert.php script creates 2,000 documents with the following schema:

{
    "_id" : ObjectId("52b3815de84df102428b4567"),
    "appver" : NumberLong(123456),
    "deploytype" : "production",
    "g" : NumberLong(123456),
    "guid" : "e91b59a9-eccd-5cbc-bfee-724e4394c842",
    "timeslice" : {
        "y" : NumberLong(13),
        "mo" : NumberLong(12),
        "d" : NumberLong(19),
        "h" : NumberLong(18),
        "m" : NumberLong(29)
    },
    "random" : "52b382730663e",
    "stats" : {
        "eventnames" : {
            "baz-baz-baz-baz-baz-baz-0" : NumberLong(0)
        },
        "events" : NumberLong(0)
    }
}

The stats.eventnames object has 20,000 keys, which share one of three prefixes. The sum of their values is stored in stats.events.

For the update test (original-update.php), we find a document by appver, deploytype, g, guid, and timeslice. For the atomic update, we $set the random field to a new string, $inc five random keys in stats.eventnames, and finally $inc stats.events by 5. The test executes 10,000 updates (each of the 2,000 documents is processed 5 times).
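
For illustration, a single update pairs a query on the identifying fields with a newObj along these lines (the field values are taken from the sample document above; the two event counters shown are hypothetical examples):

$query = [
    'appver' => 123456,
    'deploytype' => 'production',
    'g' => 123456,
    'guid' => 'e91b59a9-eccd-5cbc-bfee-724e4394c842',
    'timeslice' => ['y' => 13, 'mo' => 12, 'd' => 19, 'h' => 18, 'm' => 29],
];

$newobj = [
    '$set' => ['random' => uniqid()],
    '$inc' => [
        'stats.events' => 5,
        // five randomly chosen counters, e.g.:
        'stats.eventnames.foo-foo-foo-foo-foo-foo-42' => 1,
        'stats.eventnames.bar-bar-bar-bar-bar-bar-7' => 1,
        // ...
    ],
];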

This benchmark is meant to stress MongoDB's walking of BSON objects with many keys at the same level. The keys intentionally share a common prefix so that the string comparison for each key in the object is more expensive (i.e. it cannot short-circuit as quickly).

Bucket Schema

The bucket-insert.php script creates 2,000 documents with the following schema:

{
    "_id" : ObjectId("52b3815de84df102428b4567"),
    "appver" : NumberLong(123456),
    "deploytype" : "production",
    "g" : NumberLong(123456),
    "guid" : "e91b59a9-eccd-5cbc-bfee-724e4394c842",
    "timeslice" : {
        "y" : NumberLong(13),
        "mo" : NumberLong(12),
        "d" : NumberLong(19),
        "h" : NumberLong(18),
        "m" : NumberLong(29)
    },
    "random" : "52b382730663e",
    "stats" : {
        "eventnames" : {
            "15" : {
                "baz-baz-baz-baz-baz-baz-0" : NumberLong(0)
            }
        },
        "events" : NumberLong(0)
    }
}

Unlike in the original schema, keys that would otherwise appear directly in the stats.eventnames object are bucketed by the first two characters of their MD5 hash. The sum of the final values is still stored in stats.events.
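
A minimal sketch of the bucketing, mirroring the create_eventnames() function in bucket-insert.php:

$eventname  = 'baz-baz-baz-baz-baz-baz-0';
$bucketname = substr(md5($eventname), 0, 2); // first two hex characters of the hash (e.g. "15" in the sample above)

// Each counter is then addressed by a two-level dotted path
$path = "stats.eventnames.$bucketname.$eventname";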

The update test (bucket-update.php) performs the same updates as the original script, but the $inc operations target events by their bucket path. This test executes 10,000 updates (each of the 2,000 documents is processed 5 times).

This benchmark is meant to measure the improvement gained by avoiding many string comparisons while iterating through BSON object keys. Conceivably, multiple levels of bucketing could be implemented (up to a point of diminishing returns).
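
Such a multi-level scheme is not implemented by these scripts, but a second level could hypothetically be derived from the next two hash characters, along these lines:

$hash    = md5($eventname);
$bucket1 = substr($hash, 0, 2);
$bucket2 = substr($hash, 2, 2);

// Hypothetical two-level bucket path (not used in this benchmark)
$path = "stats.eventnames.$bucket1.$bucket2.$eventname";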

Benchmark Results

While inserts are rather inefficient (PHP creates the documents on demand and does not use batch insertion), the query/newObj arguments for all update operations are generated in advance so that updates can be issued to MongoDB as quickly as possible. A write concern of zero is also used so that PHP does not wait for a response from MongoDB before issuing the next update operation.
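
In the update scripts this amounts to passing the write concern as a per-operation option:

// Fire-and-forget update; the driver does not wait for a server acknowledgement
$c->update($query, $newobj, ['w' => 0]);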

Original

$ time php original-update.php
Updated 10000 documents in 18.351067 seconds (avg: 0.001835/document)

real	0m33.740s
user	0m12.820s
sys	0m0.428s

And mongostat output during the benchmark:

insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    *0     *0     *0     *0      21     1|0       0  8.03g  16.3g  7.38g      0  test:0.0%          0       0|0     0|0     1k    96m     3   18:18:14 
    *0     *0    373     *0       7     3|0       0  8.03g  16.3g  7.41g      0 test:66.9%          0       0|0     0|1   211k    27m     3   18:18:15 
    *0     *0    514     *0       0     1|0       0  8.03g  16.3g  6.64g      0 test:98.3%          0       0|0     0|1   292k     2k     3   18:18:16 
    *0     *0    532     *0       0     1|0       0  8.03g  16.3g  6.96g      0 test:98.0%          0       0|0     0|1   302k     2k     3   18:18:17 
    *0     *0    527     *0       0     1|0       0  8.03g  16.3g  6.33g      0 test:98.5%          0       0|0     0|1   299k     2k     3   18:18:18 
    *0     *0    522     *0       0     1|0       0  8.03g  16.3g  6.25g      0 test:98.2%          0       0|0     0|1   296k     2k     3   18:18:19 
    *0     *0    508     *0       0     1|0       0  8.03g  16.3g  6.19g      0 test:98.4%          0       0|0     0|1   288k     2k     3   18:18:20 
    *0     *0    534     *0       0     1|0       0  8.03g  16.3g   6.5g      0 test:98.2%          0       0|0     0|1   303k     2k     3   18:18:21 
    *0     *0    520     *0       0     1|0       0  8.03g  16.3g  6.29g      0 test:98.3%          0       0|0     0|1   295k     2k     3   18:18:22 
    *0     *0    526     *0       0     1|0       0  8.03g  16.3g  6.31g      0 test:98.4%          0       0|0     0|1   298k     2k     3   18:18:23 
    *0     *0    503     *0       0     2|0       0  8.03g  16.3g  6.12g      0 test:98.2%          0       0|0     0|1   285k     2k     3   18:18:24 
    *0     *0    532     *0       0     1|0       0  8.03g  16.3g  6.44g      0 test:98.2%          0       0|0     0|1   302k     2k     3   18:18:25 
    *0     *0    524     *0       0     1|0       0  8.03g  16.3g  6.34g      0 test:98.6%          0       0|0     0|1   297k     2k     3   18:18:26 
    *0     *0    522     *0       0     1|0       0  8.03g  16.3g  6.34g      0 test:98.1%          0       0|0     0|1   296k     2k     3   18:18:27 
    *0     *0    513     *0       0     1|0       0  8.03g  16.3g  6.15g      0 test:98.4%          0       0|0     0|1   291k     2k     3   18:18:28 
    *0     *0    520     *0       0     1|0       0  8.03g  16.3g  6.34g      0 test:98.3%          0       0|0     0|1   295k     2k     3   18:18:29 
    *0     *0    524     *0       0     1|0       0  8.03g  16.3g   6.3g      0 test:98.3%          0       0|0     0|1   297k     2k     3   18:18:30 
    *0     *0    525     *0       0     1|0       0  8.03g  16.3g  6.24g      0 test:98.3%          0       0|0     0|1   298k     2k     3   18:18:31 
    *0     *0    472     *0       0     1|0       0  8.03g  16.3g  5.98g      0 test:90.8%          0       0|1     0|1   268k     2k     3   18:18:32 
    *0     *0    519     *0       0     2|0       0  8.03g  16.3g  6.29g      0 test:98.1%          0       0|0     0|1   295k     2k     3   18:18:33 
    *0     *0    290     *0       0     1|0       0  8.03g  16.3g  6.27g      0 test:56.5%          0       0|0     0|0   165k     2k     2   18:18:34 
    *0     *0     *0     *0       0     1|0       0  8.03g  16.3g  5.98g      0     .:1.3%          0       0|0     0|0    62b     2k     2   18:18:35

Bucket

$ time php bucket-update.php 
Updated 10000 documents in 1.469424 seconds (avg: 0.000147/document)

real	0m15.805s
user	0m11.916s
sys	0m0.508s

And mongostat output during the benchmark:

insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    *0     *0     *0     *0      22     1|0       0  8.03g  16.3g  7.41g      0  test:0.1%          0       0|0     0|0     1k   101m     3   18:20:16 
    *0     *0   4210     *0       2     3|0       0  8.03g  16.3g  7.42g      0 test:71.6%          0       0|0     0|0     2m     4m     3   18:20:17 
    *0     *0   3542     *0       0     1|0       0  8.03g  16.3g  7.42g      0 test:87.8%          0       0|0     0|1     2m     2k     3   18:20:18 
    *0     *0   2248     *0       0     1|0       0  8.03g  16.3g  6.29g      0 test:46.4%          0       0|0     0|0     1m     2k     2   18:20:19 
    *0     *0     *0     *0       0     1|0       0  8.03g  16.3g  6.26g      0     .:0.2%          0       0|0     0|0    62b     2k     2   18:20:20

<?php
// bucket-insert.php: creates 2,000 documents using the bucketed schema

function create_guid() {
    $chars = md5(uniqid(rand(), true));

    return sprintf(
        '%s-%s-%s-%s-%s',
        substr($chars, 0, 8),
        substr($chars, 8, 4),
        substr($chars, 12, 4),
        substr($chars, 16, 4),
        substr($chars, 20, 12)
    );
}

function create_timeslice() {
    $date = getdate();

    return [
        'y' => $date['year'] % 100,
        'mo' => $date['mon'],
        'd' => $date['mday'],
        'h' => $date['hours'],
        'm' => $date['minutes'],
    ];
}

function create_eventnames($n) {
    static $prefixes = [
        'foo-foo-foo-foo-foo-foo',
        'bar-bar-bar-bar-bar-bar',
        'baz-baz-baz-baz-baz-baz',
    ];

    $object = new \stdClass();

    for ($i = 0; $i < $n; $i++) {
        $eventname = sprintf('%s-%d', $prefixes[rand(0, 2)], $i);
        // Bucket each event counter by the first two characters of its MD5 hash
        $bucketname = substr(md5($eventname), 0, 2);

        if ( ! isset($object->$bucketname)) {
            $object->$bucketname = new \stdClass();
        }

        $object->$bucketname->$eventname = 0;
    }

    return $object;
}

function create_document() {
    return [
        '_id' => new MongoId(),
        'appver' => 123456,
        'deploytype' => 'production',
        'g' => 123456,
        'guid' => create_guid(),
        'timeslice' => create_timeslice(),
        'random' => uniqid(),
        'stats' => [
            'eventnames' => create_eventnames(20000),
            'events' => 0,
        ],
    ];
}

$m = new MongoClient();
$c = $m->test->bucket;
$c->drop();

for ($i = 0; $i < 2000; $i++) {
    $c->insert(create_document(), ['w' => 0]);
}

<?php
// bucket-update.php: runs 10,000 updates against the bucketed schema

$m = new MongoClient();
$c = $m->test->bucket;

$c->ensureIndex([
    'appver' => 1,
    'deploytype' => 1,
    'g' => 1,
    'guid' => 1,
    'timeslice' => 1,
]);

// Pre-generate the query/newObj pairs so the timed loop only issues updates
$updates = [];

foreach ($c->find() as $document) {
    $query = array_intersect_key($document, [ 'appver' => 1, 'deploytype' => 1, 'g' => 1, 'guid' => 1, 'timeslice' => 1 ]);

    $newobj = [
        '$set' => [ 'random' => uniqid() ],
        '$inc' => [
            'stats.events' => 5,
        ],
    ];

    // Increment one random event counter in each of five random buckets
    foreach (array_rand($document['stats']['eventnames'], 5) as $bucketname) {
        $eventname = array_rand($document['stats']['eventnames'][$bucketname], 1);
        $newobj['$inc']["stats.eventnames.$bucketname.$eventname"] = 1;
    }

    $updates[] = [ $query, $newobj ];
}

$count = $c->count();
$limit = 10000;

$start = microtime(true);

for ($i = 0; $i < $limit; $i++) {
    $query = $updates[$i % $count][0];
    $newobj = $updates[$i % $count][1];
    $c->update($query, $newobj, ['w' => 0]);
}

$end = microtime(true);

printf("Updated %d documents in %f seconds (avg: %f/document)\n", $limit, ($end - $start), ($end - $start) / $limit);

<?php
// original-insert.php: creates 2,000 documents using the original schema

function create_guid() {
    $chars = md5(uniqid(rand(), true));

    return sprintf(
        '%s-%s-%s-%s-%s',
        substr($chars, 0, 8),
        substr($chars, 8, 4),
        substr($chars, 12, 4),
        substr($chars, 16, 4),
        substr($chars, 20, 12)
    );
}

function create_timeslice() {
    $date = getdate();

    return [
        'y' => $date['year'] % 100,
        'mo' => $date['mon'],
        'd' => $date['mday'],
        'h' => $date['hours'],
        'm' => $date['minutes'],
    ];
}

function create_eventnames($n) {
    static $prefixes = [
        'foo-foo-foo-foo-foo-foo',
        'bar-bar-bar-bar-bar-bar',
        'baz-baz-baz-baz-baz-baz',
    ];

    $object = new \stdClass();

    // All event counters live at the same level of the stats.eventnames object
    for ($i = 0; $i < $n; $i++) {
        $eventname = sprintf('%s-%d', $prefixes[rand(0, 2)], $i);
        $object->$eventname = 0;
    }

    return $object;
}

function create_document() {
    return [
        '_id' => new MongoId(),
        'appver' => 123456,
        'deploytype' => 'production',
        'g' => 123456,
        'guid' => create_guid(),
        'timeslice' => create_timeslice(),
        'random' => uniqid(),
        'stats' => [
            'eventnames' => create_eventnames(20000),
            'events' => 0,
        ],
    ];
}

$m = new MongoClient();
$c = $m->test->original;
$c->drop();

for ($i = 0; $i < 2000; $i++) {
    $c->insert(create_document(), ['w' => 0]);
}

<?php
// original-update.php: runs 10,000 updates against the original schema

$m = new MongoClient();
$c = $m->test->original;

$c->ensureIndex([
    'appver' => 1,
    'deploytype' => 1,
    'g' => 1,
    'guid' => 1,
    'timeslice' => 1,
]);

// Pre-generate the query/newObj pairs so the timed loop only issues updates
$updates = [];

foreach ($c->find() as $document) {
    $query = array_intersect_key($document, [ 'appver' => 1, 'deploytype' => 1, 'g' => 1, 'guid' => 1, 'timeslice' => 1 ]);

    $newobj = [
        '$set' => [ 'random' => uniqid() ],
        '$inc' => [ 'stats.events' => 5 ],
    ];

    // Increment five random event counters by their top-level key
    foreach (array_rand($document['stats']['eventnames'], 5) as $eventname) {
        $newobj['$inc']["stats.eventnames.$eventname"] = 1;
    }

    $updates[] = [ $query, $newobj ];
}

$count = $c->count();
$limit = 10000;

$start = microtime(true);

for ($i = 0; $i < $limit; $i++) {
    $query = $updates[$i % $count][0];
    $newobj = $updates[$i % $count][1];
    $c->update($query, $newobj, ['w' => 0]);
}

$end = microtime(true);

printf("Updated %d documents in %f seconds (avg: %f/document)\n", $limit, ($end - $start), ($end - $start) / $limit);