@yuvalif
Last active March 2, 2026 17:40
RGW tcmalloc Profiling

Background

All Ceph daemons use tcmalloc as their memory allocator, for better performance. A recent PR added the ability to get information on how tcmalloc performs inside the RGW. In this project, we will use the profiling information from RGW runs to tune the tcmalloc parameters so that they are better suited to the memory usage patterns of the RGW.

Evaluation Stage

Step 1 - Build Ceph and Run Basic Tests

The first step is to set up a Linux-based development environment; as a minimum you would need a 4-CPU machine with 8G RAM and a 50GB disk. Unless you already have a Linux distro you like, I would recommend choosing from:

  • Fedora (42/43) - my favorite!
  • Ubuntu (24.04 LTS)
  • WSL (Windows Subsystem for Linux), though it would probably take much longer...
  • RHEL9/Centos9
  • Other Linux distros - try at your own risk :-)

Once you have that up and running, clone the Ceph repo from GitHub (https://github.com/ceph/ceph). Make sure that you fetch the code from the above PR so that the RGW has tcmalloc profiling support. If you don't know what GitHub and git are, this is the right time to close these gaps :-) And yes, you should have a GitHub account, so you can later share your work on the project.

To install any missing system dependencies, use:

./install-deps.sh

Note that the first build may take a long time, so the following cmake parameters can be used to minimize the build time. With a fresh ceph clone, use:

./do_cmake.sh -DBOOST_J=$(nproc) -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_MGR_DASHBOARD_FRONTEND=OFF \
  -DWITH_DPDK=OFF -DWITH_SPDK=OFF -DWITH_SEASTAR=OFF -DWITH_CEPHFS=OFF -DWITH_RBD=OFF -DWITH_KRBD=OFF -DWITH_CCACHE=OFF -Gninja

Then invoke the build process (using ninja) from within the build directory (created by do_cmake.sh). Assuming the build completed successfully, you can run the unit tests (see: https://github.com/ceph/ceph#running-unit-tests). Now you are ready to run the ceph processes, as explained here: https://github.com/ceph/ceph#running-a-test-cluster. You would probably also like to check the developer guide (https://docs.ceph.com/docs/master/dev/developer_guide/) and learn more about how to build Ceph and run it locally (https://docs.ceph.com/docs/master/dev/quick_guide/).
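In practice, the build-and-test steps above boil down to something like this (the `build` directory name comes from `do_cmake.sh`; the `-j4` value is an assumption, chosen so linking does not exhaust 8G of RAM):

```shell
cd build
# build everything; cap parallelism so the linker does not run out of memory
ninja -j4
# run the unit tests via ctest (the cmake test driver)
ctest -j4
```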

  • start the vstart cluster:
$ MON=1 OSD=1 MDS=0 MGR=0 RGW=1 ../src/vstart.sh -n -d
  • install the aws CLI tool
  • configure the tool with the access and secret keys shown in the output of the vstart.sh command
  • create a bucket:
$ aws --endpoint-url http://localhost:8000 s3 mb s3://fish
  • create a file, and upload it:
$ head -c 512 </dev/urandom > myfile
$ aws --endpoint-url http://localhost:8000 s3 cp myfile s3://fish
  • list the bucket and make sure the file is there:
$ aws --endpoint-url http://localhost:8000 s3 ls s3://fish
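As a final sanity check on the upload path, you can also download the object back and compare it to the original (file and bucket names taken from the steps above):

```shell
# fetch the object back from the bucket and verify it is byte-identical
aws --endpoint-url http://localhost:8000 s3 cp s3://fish/myfile myfile.copy
cmp myfile myfile.copy && echo "round trip OK"
```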

Step 2 - Profile the RGW Under Load

Install hsbench or s5cmd, or build your own tool (based on the boto3 python library), and try to load the RGW with requests.
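If you want something quick before setting up a benchmark tool, a minimal write-only loop with the aws CLI already generates some load (much slower than hsbench or s5cmd, but dependency-free; the object count and size here are arbitrary):

```shell
# upload 100 copies of a 64K random object into the bucket from the previous step
head -c 65536 </dev/urandom > obj.dat
for i in $(seq 1 100); do
  aws --endpoint-url http://localhost:8000 s3 cp obj.dat "s3://fish/obj-$i"
done
```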

Use the tcmalloc profiling tool to get the different usage patterns:

  • write only
  • combined write + read + delete
  • read only
  • bucket listing. Note that to get interesting results, you would need ~1M objects in the bucket, which may take about an hour to fill
  • any other combination you find interesting

And for different object sizes:

  • small objects: 4K - 256K
  • medium objects: 1M - 4M
  • large objects (with multipart upload): 24M

Feel free to try different combinations of the above, spread across different numbers of buckets.
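Objects in the size classes above can be generated the same way `myfile` was, for example (sizes taken from the list; note that the aws CLI switches to multipart upload automatically once an object exceeds its `multipart_threshold`, 8MB by default):

```shell
# one representative object per size class
head -c $((4 * 1024))         </dev/urandom > small.dat   # 4K
head -c $((1024 * 1024))      </dev/urandom > medium.dat  # 1M
head -c $((24 * 1024 * 1024)) </dev/urandom > large.dat   # 24M, multipart via `aws s3 cp`
```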

Note that when running the ceph cluster with vstart, the performance numbers will be low (due to the OSD/drive speed) and do not give an indication of the actual performance of the RGW. In the real project we will perform this analysis on faster hardware, but the methodology will be similar.
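To collect the tcmalloc numbers during these runs: other Ceph daemons already expose tcmalloc heap commands over the admin interface (e.g. `ceph tell osd.0 heap stats`). Assuming the PR wires the same `heap` commands into the RGW (an assumption — check the PR for the exact command and daemon name), a profiling session from the build directory might look like:

```shell
# daemon name `client.rgw.8000` is the usual vstart RGW name; adjust to your cluster
./bin/ceph daemon client.rgw.8000 heap start_profiler
# ... drive one of the workloads above against the RGW here ...
./bin/ceph daemon client.rgw.8000 heap stats
./bin/ceph daemon client.rgw.8000 heap stop_profiler
```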
