@markpapadakis
Last active August 29, 2015 14:09
# CloudDS
2048 requests in sequence - latency percentiles captured in an HdrHistogram
All CloudDS caching options disabled.
Cluster Size: 7
Replication Factor: 3
Row size in Columns: 10
Row size/payload: 2.5 KB
Column Family size in rows: 4MM
Cluster configuration: half of the 7 nodes are commodity HW with 8 GB RAM, the rest blade-class nodes with 4 GB of RAM. Gigabit link for interconnect.
Cluster in active/heavy use by many other production services (not idle).
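For reference, here is a minimal sketch of how the percentile tables below can be produced from raw per-request latencies. HdrHistogram itself uses logarithmic buckets for constant memory; a sorted list is enough to illustrate the idea:

```python
# Sketch only: capture request latencies (microseconds) and report the
# same percentile points as the tables below. HdrHistogram buckets
# values instead of sorting; this nearest-rank version is illustrative.
import random

def percentile(sorted_samples, p):
    """Nearest-rank percentile of an ascending list, p in (0, 100]."""
    idx = max(0, int(round(p / 100.0 * len(sorted_samples))) - 1)
    return sorted_samples[idx]

def report(samples, points=(1, 5, 10, 20, 50, 75, 90, 95, 97, 99)):
    s = sorted(samples)
    return {p: percentile(s, p) for p in points}

if __name__ == "__main__":
    random.seed(42)
    latencies = [random.gauss(500, 40) for _ in range(2048)]
    for p, v in report(latencies).items():
        print(f"{p}% => {v:.0f}")
```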
This is for wide columns - that is, each row can have 0+ columns, and when updating a row you
can update or add any subset of columns, with no RCU needed. In turn, this means
each GET requires (unless the optimizer deems it unnecessary)
fetching the needed column content from all row updates, across
files and in-memory maps, in order to compile the final response.
CloudDS supports an alternative storage engine (KV) where the value is opaque data (a blob)
and those semantics are not needed; there it's enough to just locate
the most recent update among all recorded updates -- no merge required.
That engine would probably give a 10-15% performance improvement for this benchmark,
but the numbers here are for the wide-columns storage engine.
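The difference between the two read paths can be sketched as follows (structures and names are illustrative, not CloudDS internals): the wide-column engine merges per-column updates from every on-disk/in-memory run with newest-timestamp-wins per column, while the KV engine just takes the newest whole-value update:

```python
# Sketch of the two read paths described above (hypothetical structures).
def merge_wide_row(runs):
    """Wide-column GET: runs is a list of {column: (timestamp, value)}
    dicts (one per sstable/memtable run); newest timestamp wins per column."""
    merged = {}
    for run in runs:
        for col, (ts, val) in run.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

def latest_kv(updates):
    """KV GET: updates is a list of (timestamp, blob); only the most
    recent update matters -- no merge."""
    return max(updates)[1]
```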
GET:
Consistency Level: QUORUM
1% => 440
5% => 463
10% => 470
20% => 480
50% => 507
75% => 533
90% => 565
95% => 589
97% => 608
99% => 668
Values are in microseconds. So, at the 99th percentile, worst case, a GET takes ~0.7 milliseconds.
---
Same test, this time with Consistency Level: ONE
1% => 202
5% => 206
10% => 209
20% => 214
50% => 238
75% => 251
90% => 270
95% => 279
97% => 295
99% => 382
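The gap between the two consistency levels follows from what the coordinator waits for. A toy model (not CloudDS code) with replication factor 3: QUORUM returns once the 2nd-fastest of the 3 replicas has answered, ONE returns on the 1st:

```python
# Toy latency model for the QUORUM vs ONE gap above (illustrative only).
def quorum_latency(replica_latencies, rf=3):
    """QUORUM waits for a majority: the (rf//2 + 1)-th fastest reply."""
    needed = rf // 2 + 1          # 2 of 3 for RF=3
    return sorted(replica_latencies)[needed - 1]

def one_latency(replica_latencies):
    """ONE returns as soon as the first replica replies."""
    return min(replica_latencies)
```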
# CloudDS - PUT
Same setup: same number of requests, cluster, etc.
Consistency Level: QUORUM
1% => 461
5% => 482
10% => 499
20% => 518
50% => 562
75% => 608
90% => 658
95% => 695
97% => 727
99% => 781
Consistency Level: ONE
1% => 252
5% => 263
10% => 270
20% => 279
50% => 294
75% => 305
90% => 320
95% => 330
97% => 338
99% => 382
When the planned optimizations are implemented, this should be at least 20-30% more efficient.
# MySQL
Same benchmark for a MySQL query on a table of 1MM rows (keys appropriately set).
Query: SELECT * FROM table WHERE user = 'markpapadakis'
1% => 774
5% => 817
10% => 830
20% => 841
50% => 874
75% => 912
90% => 959
95% => 1002
97% => 1043
99% => 1158
@jasobrown

Clearly, all the data is in the kernel cache (and mmapped), so I guess what you are paying for is the system call to read the mmap data, followed by any merging of the data (as it comes from separate/multiple sstables?) that is necessary. I don't have any real-world production experience with mysql (nor do I want any 😄), so I can't comment on its performance.

I think these are nice, low metrics you've got, but, tbh, you're reading everything from page cache, which isn't a great measure of a database. How do those numbers look when you have to start fetching data from disk? Especially if you've only got 8GB of RAM, I suspect you have to go to disk a lot - unless you don't have much data. And then, how do things like compaction/repair muck up the I/O contention for reads? If you have a distributed, LSM-based system, surely those things will have some impact on the system and end-user response times.

As a side note, it's interesting that PUTs take more time than GETs, but I suspect it's related to a combo of commit-log and memtable insertion.

HTH,

-Jason

@markpapadakis
Author

Thanks for the comments Jason :)

Right, it's mostly pulled from kernel page cache, for 4 distinct SSTables.
CloudDS SSTables require at worst two disk seeks: one to locate the value's offset in the SSTable, and another to read the value. The first seek is a cache hit 99% of the time (a LUT-backed search reduces it to a binary search over no more than 16 elements, so it's fast and contained within a single 4K disk page). Data is read directly (mmap is the default access interface; pread64()-based access is also supported - configurable on a per-CF basis, so you can choose how to spend your memory there).
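The LUT-backed index search can be sketched like this (names and layout are illustrative, not CloudDS's actual structures): a sparse lookup table keyed on every 16th index entry narrows the search to a small window, then a plain binary search runs inside that window:

```python
# Sketch of a LUT-backed index search: the LUT narrows the sorted index
# to a window of at most `window` entries, which a binary search then
# scans. Structures are hypothetical, for illustration only.
import bisect

def build_lut(index_keys, window=16):
    """Record every `window`-th key and its position in the sorted index."""
    return [(index_keys[i], i) for i in range(0, len(index_keys), window)]

def lookup(index, lut, key, window=16):
    # Find the LUT window that may contain the key...
    pos = bisect.bisect_right([k for k, _ in lut], key) - 1
    if pos < 0:
        return None
    start = lut[pos][1]
    # ...then binary-search at most `window` entries inside it.
    chunk = index[start:start + window]
    i = bisect.bisect_left(chunk, key)
    if i < len(chunk) and chunk[i] == key:
        return start + i          # position of the entry in the index
    return None
```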

However, please note:
those numbers come from our active cluster, which is heavily in use - it is not a cluster of idle nodes. Compactions are ongoing, many read/write operations are in flight, etc.

The compaction process is throttled, at both the OS level and the application level, so in practice it doesn't affect request processing enough to be an issue. If it gets close to becoming an issue, a heuristic adjusts the throttling scheme accordingly.
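One common way to do application-level throttling of this kind is a token bucket on compaction I/O; a minimal sketch, assuming that scheme (CloudDS's actual mechanism, including the adaptive heuristic, is not shown):

```python
# Sketch: token-bucket rate limiter for compaction I/O. A compaction
# thread calls throttle(nbytes) before each chunk of work and is put to
# sleep whenever it outruns the configured byte rate.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def throttle(self, nbytes):
        """Block until `nbytes` of compaction I/O is allowed."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes > self.tokens:
            time.sleep((nbytes - self.tokens) / self.rate)
            self.tokens = 0
        else:
            self.tokens -= nbytes
```

An adaptive heuristic like the one described would then raise or lower `rate_bytes_per_sec` based on observed request latencies.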

Indeed, I am curious about PUT vs GET timings myself, especially since a PUT is, as you said, a simple matter of storing in a concurrent map-like structure (the memory table) and appending to a file (the commit log). I will investigate further.
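That PUT path - append to the commit log for durability, then update the in-memory table - can be sketched as below (class and record format are illustrative, not CloudDS's actual code):

```python
# Sketch of a write path: commit-log append first (durability), then
# memtable insert. A lock stands in for the concurrent map-like
# structure mentioned above; record format is illustrative.
import json, threading

class WriteAhead:
    def __init__(self, log_path):
        self.log = open(log_path, "a")
        self.memtable = {}            # key -> {column: value}
        self.lock = threading.Lock()

    def put(self, key, columns):
        record = json.dumps({"k": key, "c": columns})
        with self.lock:
            self.log.write(record + "\n")   # durability first
            self.log.flush()
            self.memtable.setdefault(key, {}).update(columns)
```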

Also, it turns out that approximately 70% of that time is I/O-request scheduling and task-scheduling overhead. As this is improved, overall throughput will go up and latency will drop further.
However, there is not much you can do when you are dealing with the kernel -- when a thread is put to sleep, the kernel needs to wake it up, and that takes time -- so I think one of the key ways to solve this is to keep as few threads around as possible, to avoid having to put any to sleep (currently the system adjusts between 2 and 64 threads, and it's almost always down at 2 -- it doesn't need more for the request volume it processes).

Thanks again:)
