first, use:
teuthology-lock --brief
to see if you already have a machine locked. if not, use:
teuthology-lock --lock-many 1 --machine-type smithi
to lock a machine.
- in theory "smithi" could be replaced with other machine types (e.g. "mira", "gibba")
- not specifying the OS will increase the chance of locking a machine. you can later reimage the machine with whatever OS you need. e.g. to get a CentOS 9 OS on an already locked machine:
teuthology-reimage -v --os-type centos --os-version 9.stream <hostname>
- reimaging does not complete properly, and you will see a message that looks like this (for RGW test suites):
paramiko.ssh_exception.BadHostKeyException: Host key for server 'smithi191.front.sepia.ceph.com' does not match: got 'AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIiEnmRAHiRzxJ8A8VHbp6Sfj/cZlObX5agO2bSneMsIjVB9gBU+F8yqw+ZMthTf+dL2AuUJ1zqRBifpjSXRuzY=', expected 'AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEOfQy5E9ekjRHzGsi3vO8EdY9oIhotS67hhc/7DEbu5Y44D3wVb9UzeT+mOyxULkTif20vMskwMezi+mNhFgR4='
you can stop the process at that point (e.g. Ctrl-C), as the reimaging was already done.
copy the first hash (the one inside the single quotes after the word "got") into orig.config.yaml
under the "targets" section.
for the above case it would look like this:
targets:
  smithi191.front.sepia.ceph.com: ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIiEnmRAHiRzxJ8A8VHbp6Sfj/cZlObX5agO2bSneMsIjVB9gBU+F8yqw+ZMthTf+dL2AuUJ1zqRBifpjSXRuzY=
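- alternatively (not part of the flow above, just a convenient option), ssh-keyscan prints the host key in the same "hostname key-type key" form used by the targets section:
ssh-keyscan -t ecdsa <hostname>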
- when no reimaging is needed, running the test for the first time will give a similar error. you should stop the test, copy the hash from the error message into orig.config.yaml, and re-run the test
- once you don't need the machine, please unlock it:
teuthology-lock --unlock <hostname>
- for ansible to work, you must be able to ssh to the machine non-interactively. to do that, ssh to the machine:
ssh <hostname>
and select "yes". if you have already sshed to that machine in the past, you will have to delete the old lines referencing this machine from ~/.ssh/known_hosts.
make sure that all relevant lines are deleted by calling ssh <hostname>
again and verifying that there is no interactive step.
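- instead of editing ~/.ssh/known_hosts by hand, one option is ssh-keygen, which removes all entries for the given host (run it for both the short name and the fully qualified name if both appear in the file):
ssh-keygen -R <hostname>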
- make sure there exists a directory called "archive_dir"
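e.g., assuming (as the log paths below do) that it lives in your home directory:
mkdir -p ~/archive_dir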
- the file that controls the execution of the test is orig.config.yaml. this is an example:
targets:
  <hostname>: ecdsa-sha2-nistp256 <hash from reimaging>
archive_path: <full path to archive dir>
verbose: true
interactive-on-error: true
## wait_for_scrub being false makes locked runs go a lot faster
wait_for_scrub: false
owner: scheduled_<user>@teuthology
kernel:
  kdb: true
  sha1: distro
overrides:
  admin_socket:
    branch: <branch name>
  ceph:
    conf:
      client:
        debug rgw: 20
        rgw crypt require ssl: false
        rgw crypt s3 kms backend: testing
        rgw crypt s3 kms encryption keys: testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo= testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=
        rgw d3n l1 datacache persistent path: /tmp/rgw_datacache/
        rgw d3n l1 datacache size: 10737418240
        rgw d3n l1 local datacache enabled: true
        rgw enable ops log: true
        rgw lc debug interval: 10
        rgw torrent flag: true
        setgroup: ceph
        setuser: ceph
      mgr:
        debug mgr: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        bdev async discard: true
        bdev enable discard: true
        bluestore allocator: bitmap
        bluestore block size: 96636764160
        bluestore fsck on mount: true
        debug bluefs: 1/20
        debug bluestore: 1/20
        debug ms: 1
        debug osd: 20
        debug rocksdb: 4/10
        mon osd backfillfull_ratio: 0.85
        mon osd full ratio: 0.9
        mon osd nearfull ratio: 0.8
        osd failsafe full ratio: 0.95
        osd objectstore: bluestore
    flavor: default
    fs: xfs
    log-ignorelist:
    - \(MDS_ALL_DOWN\)
    - \(MDS_UP_LESS_THAN_MAX\)
    - \(PG_AVAILABILITY\)
    - \(PG_DEGRADED\)
    wait-for-scrub: false
  ceph-deploy:
    bluestore: true
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        osd default pool size: 2
      osd:
        bdev async discard: true
        bdev enable discard: true
        bluestore block size: 96636764160
        bluestore fsck on mount: true
        debug bluefs: 1/20
        debug bluestore: 1/20
        debug rocksdb: 4/10
        mon osd backfillfull_ratio: 0.85
        mon osd full ratio: 0.9
        mon osd nearfull ratio: 0.8
        osd failsafe full ratio: 0.95
        osd objectstore: bluestore
  install:
    ceph:
      flavor: default
      sha1: <sha1 of branch on ceph-ci>
  openssl_keys:
    rgw.client.0:
      ca: root
      client: client.0
      embed-key: true
    root:
      client: client.0
      cn: teuthology
      install:
      - client.0
      key-type: rsa:4096
  rgw:
    client.0:
      ssl certificate: rgw.client.0
    compression type: random
    datacache: true
    datacache_path: /tmp/rgw_datacache
    ec-data-pool: false
    frontend: beast
    storage classes: LUKEWARM, FROZEN
  s3tests:
    force-branch: ceph-master
  selinux:
    whitelist:
    - scontext=system_u:system_r:logrotate_t:s0
  thrashosds:
    bdev_inject_crash: 2
    bdev_inject_crash_probability: 0.5
  workunit:
    branch: <branch name>
    sha1: <sha1 of branch on ceph-ci>
roles:
- - mon.a
  - mon.b
  - mgr.x
  - osd.0
  - osd.1
  - client.0
repo: https://github.com/ceph/ceph-ci.git
sha1: <sha1 of branch on ceph-ci>
suite_branch: <branch name of test suite>
suite_relpath: qa
suite_repo: <repo for test suite>
tasks:
- install:
    extra_system_packages:
      deb:
      - s3cmd
      rpm:
      - s3cmd
- ceph: null
- openssl_keys: null
- rgw:
    client.0: null
- <name of test suite>:
    client.0:
      rgw_server: client.0
Notes
- values marked inside angle brackets (<>) should be filled in
- shaman builds take time, and are not needed when only test code is changed. this is why the suite_repo is usually your fork of the ceph repo, and not ceph-ci
- to run the test:
teuthology -v --archive archive_dir orig.config.yaml
- during the setup of the test, the step that takes most of the time is the "ansible" one.
to make sure progress is being made during that step, track the ansible log file:
~/archive_dir/ansible.log
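for example, to follow it while the run is in progress:
tail -f ~/archive_dir/ansible.log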
- the test will stop on error, so that the machine where it runs can be used for debugging
- the test log is printed to the terminal, and also written to
~/archive_dir/teuthology.log
- once debugging is done, hit Ctrl-D to do the cleanup
- due to an issue with the cleanup process, when rerunning the test the machine has to be reimaged, otherwise the following error is likely to happen:
/dev/vg_nvme: already exists in filesystem
see the instructions on "known_hosts" cleanup after reimaging; a sketch of the rerun preparation is below.
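a sketch of preparing for a rerun, using the commands from above (OS type/version as needed):
teuthology-reimage -v --os-type centos --os-version 9.stream <hostname>   # reimage the machine
ssh-keygen -R <hostname>                                                  # drop the stale known_hosts entries
ssh <hostname>                                                            # accept the new host key, answer "yes"
then update the hash under "targets" in orig.config.yaml with the new key, as described above.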
Debugging RGW