Ceph OSD Journal notes

OSD Journal configuration, upstream:

OSD Journal configuration, RHCS 2 (downstream):

Red Hat Customer Portal:

Summary

Based on resources linked above.

The Ceph OSD journal provides (i.e. why the OSD journal exists and what it does):

  • speed (small random I/O can be quickly written sequentially into the journal)
  • consistency (a full description of the operation - both data and metadata - is written into the journal first)

Facts:

  • Do not host multiple journals on a single HDD (the whole point of the OSD journal is to give the OSD a place to write incoming random I/O requests sequentially, without any seek/random access - with 2 or more journals on a single HDD, this property is lost).
  • Multiple journals can be hosted only on an SSD/NVMe device. The number of OSD journals which can be hosted on a single SSD/NVMe depends on its sequential write limits.
  • When the storage machine doesn't have an SSD for journals (or not enough of them), the most common approach is to host both the journal and the OSD data on a single HDD (a so-called collocated journal).
  • The optimal strategy is to have HDDs for the OSD data and an SSD with the journals, as in the sketch after this list. One needs to pay attention to the limits mentioned above, though.
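A minimal sketch of the HDD-data / SSD-journal layout described above (device names and the journal size are assumptions, not recommendations):

# ceph.conf - size of the journal partitions ceph-disk will create, in MB
[osd]
osd journal size = 5120

# prepare one OSD: data on the HDD /dev/sdb, journal on the shared SSD /dev/sde
# (ceph-disk creates the next free journal partition on /dev/sde by itself)
$ ceph-disk prepare /dev/sdb /dev/sde
$ ceph-disk activate /dev/sdb1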

Debugging and Analysis

Analyse write patterns

A simple trick to analyse the write patterns applied to your Ceph journal.

Assuming your journal device is /dev/sdb1, sampling at a 10-second interval:

$ iostat -dmx /dev/sdb1 10 | awk '/[0-9]/ {print $8}'
16.25

Now convert the average request size (reported in 512-byte sectors) to KiB:

16.25 * 512 / 1024 ≈ 8

And yes, I was sending 8K requests :)
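The same measurement with the sectors-to-KiB conversion done inline; this assumes an older sysstat where column 8 of iostat -dmx is avgrq-sz (the average request size in 512-byte sectors), as in the example above - newer sysstat versions report request sizes in KiB directly:

$ iostat -dmx /dev/sdb1 10 | awk '/sdb1/ {printf "%.1f KiB\n", $8 * 512 / 1024}'
8.1 KiB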

see: https://ceph.com/geen-categorie/ceph-analyse-journal-write-pattern/

Show OSD to Journal Mapping

# ceph-disk list
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sda :
/dev/sda1 other, xfs, mounted on /boot
/dev/sda2 other, LVM2_member
/dev/sdb :
/dev/sdb1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
/dev/sdd :
/dev/sdd1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.21, journal /dev/sde3
/dev/sde :
/dev/sde1 ceph journal, for /dev/sdb1
/dev/sde2 ceph journal, for /dev/sdc1
/dev/sde3 ceph journal, for /dev/sdd1
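The same mapping can also be read from the journal symlink inside each OSD data directory (paths assume the default FileStore layout and cluster name; the output shown is illustrative, matching the example above):

# for osd in /var/lib/ceph/osd/ceph-*; do echo "$(basename $osd): $(readlink -f $osd/journal)"; done
ceph-19: /dev/sde1
ceph-20: /dev/sde2
ceph-21: /dev/sde3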

See:

Creating a Ceph OSD from a designated disk partition

See: http://dachary.org/?p=2548
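A rough sketch of the approach from the linked post (device name, partition number and the exact steps are assumptions): tag the designated partition with the Ceph "OSD data" partition type GUID so ceph-disk and the udev rules recognise it, then prepare and activate it:

# sgdisk --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdc
# ceph-disk prepare /dev/sdc2
# ceph-disk activate /dev/sdc2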

@japplewhite

Another somewhat common use case is to host the journals and OSDs on the same flash or NVMe device, so that not only writes but also reads are fast.
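For that collocated-on-flash layout, a minimal sketch (the device name is an assumption) is to give ceph-disk only the NVMe device; when no separate journal device is specified, it creates both the journal and the data partition on the same device:

# ceph-disk prepare /dev/nvme0n1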
