# CloudDS

2048 requests in sequence - percentiles captured in an HDRHistogram (a capture sketch follows the setup list below).
All CloudDS caching options disabled.

- Cluster size: 7
- Replication factor: 3
- Row size in columns: 10
- Row size/payload: 2.5KB
- Column family size in rows: 4MM
- Cluster configuration: half of those 7 nodes are commodity HW with 8GB RAM, the others blade-class nodes with 4GB RAM. Gigabit interconnect.
- Cluster in active/heavy use by many other production services (not idle).
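For reference, here is a minimal C++ sketch of the capture loop, using a plain sorted vector as a stand-in for the HdrHistogram (the client call is hypothetical; only the percentile reporting mirrors the tables below):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    constexpr size_t kRequests = 2048;
    std::vector<uint64_t> samples;
    samples.reserve(kRequests);

    for (size_t i = 0; i != kRequests; ++i) {
        const auto start = std::chrono::steady_clock::now();
        // issue_get(...);   // hypothetical client call; requests go out one at a time
        const auto end = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    }

    // Report the same percentiles as the tables below (values in microseconds)
    std::sort(samples.begin(), samples.end());
    for (const double p : {1.0, 5.0, 10.0, 20.0, 50.0, 75.0, 90.0, 95.0, 97.0, 99.0}) {
        const auto idx = std::min(samples.size() - 1, size_t(p / 100.0 * samples.size()));
        std::printf("%.0f%% => %llu\n", p, (unsigned long long)samples[idx]);
    }
    return 0;
}
```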
This is for wide columns - that is, each row can have 0+ columns, and when updating a row you can update/add any columns, with no need for RCU (read-copy-update). That in turn means each GET requires (unless the optimizer deems it unnecessary) fetching the relevant column content from all recorded row updates, from files and in-memory maps, in order to compile the final response.

CloudDS supports an alternative storage engine (KV) where the value is opaque data (a blob) and those semantic requirements do not apply; there it is enough to locate the most recent update among all recorded updates - no merge needed. That should probably give a 10-15% performance improvement for this benchmark, but the numbers here are for the wide-columns storage engine.
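A rough sketch of that read-path difference, with illustrative types that are not the actual CloudDS internals: a wide-column GET merges every recorded update of the row (newest version per column wins), while the KV engine only needs the single most recent update:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct ColumnUpdate {
    std::string name;
    std::string value;
    uint64_t ts;  // update timestamp
};

// One recorded row update (from an SSTable or an in-memory map)
using RowUpdate = std::vector<ColumnUpdate>;

// Wide-column GET: merge all updates, keeping the newest version of each column
std::map<std::string, std::string> merge_row(const std::vector<RowUpdate> &updates) {
    std::map<std::string, std::pair<uint64_t, std::string>> newest;
    for (const auto &u : updates)
        for (const auto &c : u)
            if (auto it = newest.find(c.name); it == newest.end() || c.ts > it->second.first)
                newest[c.name] = {c.ts, c.value};

    std::map<std::string, std::string> row;
    for (const auto &[name, v] : newest)
        row[name] = v.second;
    return row;
}

// KV GET: the value is opaque, so the highest-timestamp update wins outright
const std::string *kv_get(const std::vector<std::pair<uint64_t, std::string>> &updates) {
    const std::pair<uint64_t, std::string> *best = nullptr;
    for (const auto &u : updates)
        if (!best || u.first > best->first)
            best = &u;
    return best ? &best->second : nullptr;
}
```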
GET:

Consistency Level: QUORUM

1% => 440
5% => 463
10% => 470
20% => 480
50% => 507
75% => 533
90% => 565
95% => 589
97% => 608
99% => 668
Values are in microseconds. So at the 99th percentile we need, worst case, about 0.7 milliseconds.
---

Same test, this time with Consistency Level: ONE

1% => 202
5% => 206
10% => 209
20% => 214
50% => 238
75% => 251
90% => 270
95% => 279
97% => 295
99% => 382
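The roughly 2x gap between the two runs follows from the replica math: with a replication factor of 3, a QUORUM read has to wait for two replica responses, while ONE returns after the first. A tiny sketch of that arithmetic:

```cpp
#include <cstdio>

// Majority quorum for a given replication factor
constexpr int quorum(int replication_factor) {
    return replication_factor / 2 + 1;
}

int main() {
    static_assert(quorum(3) == 2, "RF=3 -> wait for 2 replicas");
    std::printf("RF=3: ONE waits for 1 replica, QUORUM for %d\n", quorum(3));
    return 0;
}
```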
# CloudDS - PUT

Same setup, number of requests, cluster, etc.

Consistency Level: QUORUM

1% => 461
5% => 482
10% => 499
20% => 518
50% => 562
75% => 608
90% => 658
95% => 695
97% => 727
99% => 781

Consistency Level: ONE

1% => 252
5% => 263
10% => 270
20% => 279
50% => 294
75% => 305
90% => 320
95% => 330
97% => 338
99% => 382
When the planned optimizations are implemented, this should be at least 20-30% more efficient.
# MySQL

Same benchmark for a MySQL query on a table of 1MM rows (keys appropriately set).

Query: `SELECT * FROM table WHERE user = 'markpapadakis'`
1% => 774
5% => 817
10% => 830
20% => 841
50% => 874
75% => 912
90% => 959
95% => 1002
97% => 1043
99% => 1158
Thanks for the comments, Jason :)
Right, it's mostly pulled from kernel page cache, for 4 distinct SSTables.
CloudDS SSTables require at worst two disk seeks: one to locate the value's offset in the SSTable and another to read the value. The first seek is a cache hit 99% of the time (a LUT-backed binary search reduces it to a binary search over no more than 16 elements, so it's pretty fast and contained within a single 4K disk page). Data is read directly; mmap is the default access interface, with pread64()-based access also supported - configurable on a per-CF basis, so you can choose how to spend your memory there.
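A hypothetical illustration of such a LUT-backed index lookup (names and layout are assumptions, not the actual CloudDS on-disk format): a small table keeps every 16th key of the sorted index, so a lookup is one search over the LUT plus a binary search over at most 16 entries:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    uint64_t offset;  // offset of the value inside the SSTable data file
};

struct SSTableIndex {
    std::vector<IndexEntry> entries;  // must be sorted by key
    std::vector<std::string> lut;     // entries[i * 16].key for each bucket

    void build_lut() {
        lut.clear();
        for (size_t i = 0; i < entries.size(); i += 16)
            lut.push_back(entries[i].key);
    }

    // Returns the value offset, or -1 if the key is absent
    int64_t lookup(const std::string &key) const {
        // 1) narrow to a bucket via the LUT (last LUT key <= search key)
        auto b = std::upper_bound(lut.begin(), lut.end(), key);
        if (b == lut.begin()) return -1;
        const size_t bucket = size_t(b - lut.begin()) - 1;

        // 2) binary search over the <= 16 entries of that bucket
        const auto first = entries.begin() + bucket * 16;
        const auto last = entries.begin() + std::min(entries.size(), (bucket + 1) * 16);
        const auto it = std::lower_bound(
            first, last, key,
            [](const IndexEntry &e, const std::string &k) { return e.key < k; });
        return (it != last && it->key == key) ? int64_t(it->offset) : -1;
    }
};
```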
However, please note:
Those numbers are from our active cluster, which is heavily in use - it is not a cluster of idle nodes. Compactions are ongoing, many read/write operations are in flight, etc.
The compaction process is throttled, both at the OS level and at the application level, so in practice it doesn't affect request processing enough to be an issue. If it comes close to becoming one, a heuristic adjusts the throttling scheme accordingly.
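For illustration only - the actual heuristic isn't public - an application-level throttle along these lines could meter compaction I/O from a bytes-per-second budget and tighten it when foreground latency degrades:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

class CompactionThrottle {
    double bytes_per_sec;  // current compaction I/O budget
    std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();
    double available = 0;  // token balance, in bytes

public:
    explicit CompactionThrottle(double rate) : bytes_per_sec(rate) {}

    // Block until `bytes` of compaction I/O fit in the budget
    void acquire(double bytes) {
        for (;;) {
            const auto now = std::chrono::steady_clock::now();
            available += std::chrono::duration<double>(now - last).count() * bytes_per_sec;
            available = std::min(available, bytes_per_sec);  // cap burst at ~1s of budget
            last = now;
            if (available >= bytes) { available -= bytes; return; }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }

    // Heuristic hook: shrink the budget while p99 request latency is degraded
    void observe_p99_us(double p99_us, double target_us) {
        if (p99_us > target_us)
            bytes_per_sec = std::max(bytes_per_sec * 0.5, 1.0 * 1024 * 1024);    // floor 1MB/s
        else
            bytes_per_sec = std::min(bytes_per_sec * 1.1, 256.0 * 1024 * 1024);  // cap 256MB/s
    }
};
```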
Indeed, I am curious about PUT vs GET timings myself, especially since a PUT is, as you said, a simple matter of storing into a concurrent map-like structure (the memory table) and appending to a file (the commit log). I will investigate further.
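A minimal sketch of that PUT path, assuming a simple mutex-guarded map in place of the concurrent map-like structure:

```cpp
#include <fstream>
#include <map>
#include <mutex>
#include <string>

class WritePath {
    std::ofstream commit_log;
    std::map<std::string, std::string> memtable;  // real impl: a concurrent map
    std::mutex mu;

public:
    explicit WritePath(const std::string &log_path)
        : commit_log(log_path, std::ios::binary | std::ios::app) {}

    void put(const std::string &key, const std::string &value) {
        std::lock_guard<std::mutex> lock(mu);
        // 1) durability first: append the mutation to the commit log
        commit_log << key << '\0' << value << '\0';
        commit_log.flush();  // a real impl may batch/fsync via group commit
        // 2) then make it visible: insert into the memtable
        memtable[key] = value;
    }
};
```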
Also, it turns out approximately 70% of that time is I/O request scheduling and task scheduling overhead. As this is improved, overall throughput will go up and latency will drop further.
However, there is not much you can do when you are dealing with the kernel: when a thread is put to sleep, the kernel needs to wake it up, and that takes time. So I think one of the key ways to solve this is to keep as few threads around as possible, to avoid having to put any to sleep (currently the system adjusts between 2 and 64 threads, and it is almost always down to 2 - it doesn't need more for the kind of request volume it processes).
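As a toy illustration of that sizing policy (the adjustment rule here is invented; only the [2, 64] range comes from the text):

```cpp
#include <algorithm>
#include <cstddef>

// Pick a worker count from observed queue depth, clamped to [2, 64],
// so idle workers (which the kernel would have to sleep/wake) are avoided.
size_t desired_workers(size_t queued_tasks, size_t current_workers) {
    size_t target = current_workers;
    if (queued_tasks > current_workers * 4)   // falling behind: grow
        target = current_workers * 2;
    else if (queued_tasks < current_workers)  // mostly idle: shrink
        target = current_workers / 2;
    return std::clamp<size_t>(target, 2, 64);
}
```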
Thanks again :)
Clearly, all the data is in the kernel cache (and mmapped), so I guess what you are paying for is the system call to read the mmap data, followed by any merging of the data (as it comes from separate/multiple sstables?) that is necessary. I don't have any real-world production experience with mysql (nor do I want any 😄), so I can't comment on its performance.
I think these are nice, low metrics you've got, but, tbh, you're reading everything from page cache, which isn't a great measure of a database. How do those numbers look when you have to start fetching data from disk? Especially if you've only got 8GB of RAM, I suspect you have to go to disk a lot - unless you don't have much data. And then, how do things like compaction/repair muck up the I/O contention for reads? If you have a distributed, LSM-based system, surely those things will have some impact on the system and end-user response times.
As a side note, it's interesting that PUTs take more time than GETs, but I suspect it's related to a combo of commit log and memtable insertion.
HTH,
-Jason