Tobias Grieger tbg

#!/bin/bash
# Save this as, say, andy-rebalance-experiment.sh.
# Make it executable once with `chmod +x andy-rebalance-experiment.sh`.
# Then run it with `./andy-rebalance-experiment.sh`.
set -euxo pipefail
export CLUSTER=andy-rebalance
roachprod create "$CLUSTER" -n 4 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
>>> roachprod stop: Wed Apr 3 06:15:02 UTC 2019
PID COMMAND
1 /sbin/init HOME=/ init=/sbin/init NETWORK_SKIP_ENSLAVED= recovery= TERM=linux drop_caps= BOOT_IMAGE=/boot/vmlinuz-4.15.0-1026-gcp PATH=/sbin:/usr/sbin:/bin:/usr/bin PWD=/ rootmnt=/root
2 [kthreadd]
3 [kworker/0:0]
4 [kworker/0:0H]
5 [kworker/u8:0]
6 [mm_percpu_wq]
7 [ksoftirqd/0]
8 [rcu_sched]
digraph G {
"Start" -> "Dead Node"
"Start" -> "Unresponsive Node"
"Dead Node" -> "OOM"
"OOM" -> "heap_profiler"
"OOM" -> "goroutine_dump"
"OOM" -> "dmesg"
"OOM" -> "log messages"
"Dead Node" -> "Fatal error"
cr.store.totalbytes 1
1553683040000000000 8520
1553683050000000000 65054
1553683060000000000 93149
...
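The dump above looks like a header line followed by (timestamp, value) pairs, with timestamps in nanoseconds since the Unix epoch. A minimal Python sketch, assuming exactly that layout (the `parse_samples` helper is hypothetical and not part of any CockroachDB tool), converts the timestamps to UTC and computes the growth between consecutive samples:

```python
from datetime import datetime, timezone

# Sample lines copied from the dump above; the elided "..." tail is not reproduced.
DUMP = """\
cr.store.totalbytes 1
1553683040000000000 8520
1553683050000000000 65054
1553683060000000000 93149
"""


def parse_samples(text):
    """Return (header, [(utc_datetime, value), ...]).

    Assumes the first line is a header and every later line is
    '<nanoseconds since epoch> <integer value>'.
    """
    lines = text.strip().splitlines()
    header = lines[0]
    samples = []
    for line in lines[1:]:
        ts_ns, value = line.split()
        seconds = int(ts_ns) // 1_000_000_000  # nanoseconds to whole seconds
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        samples.append((when, int(value)))
    return header, samples


header, samples = parse_samples(DUMP)
# Difference between each sample and the one before it.
deltas = [b[1] - a[1] for a, b in zip(samples, samples[1:])]

print(header)
for when, value in samples:
    print(when.isoformat(), value)
print("deltas:", deltas)
```

For the three samples shown, the timestamps are 10 seconds apart, so each delta is the change over one 10-second interval.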
07:31:21 cluster.go:252: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1194326-election-after-restart -n 3 --gce-machine-type=n1-standard-4 --local-ssd-no-ext4-barrier
Creating cluster teamcity-1194326-election-after-restart with 3 nodes
teamcity-1194326-election-after-restart: [gce] 12h28m4s remaining
teamcity-1194326-election-after-restart-0001 teamcity-1194326-election-after-restart-0001.us-east1-b.cockroach-ephemeral 10.142.0.80 35.237.52.36
teamcity-1194326-election-after-restart-0002 teamcity-1194326-election-after-restart-0002.us-east1-b.cockroach-ephemeral 10.142.0.44 35.196.152.243
teamcity-1194326-election-after-restart-0003 teamcity-1194326-election-after-restart-0003.us-east1-b.cockroach-ephemeral 10.142.0.23 34.73.18.251
Syncing...
failed to update roachprod.crdb.io DNS: Command: gcloud [--project cockroach-shared dns record-sets import -z roachprod --delete-all-existing --zone-file-format /root/.roachprod/dns.bind932780071]
Output: ERROR: (gcloud
This file has been truncated, but you can view the full file.
(lldb) thread backtrace all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff75e2d1b2 libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff75ee65fe libsystem_pthread.dylib`_pthread_cond_wait + 775
frame #2: 0x0000000004060754 cockroach`runtime.pthread_cond_timedwait_relative_np_trampoline + 20
frame #3: 0x000000000405e180 cockroach`runtime.asmcgocall + 112
frame #4: 0x000000000404e06b cockroach`runtime.pthread_cond_timedwait_relative_np + 59
frame #5: 0x000000000402d49d cockroach`runtime.semasleep + 269
frame #6: 0x000000000400ca5d cockroach`runtime.notetsleep_internal + 269
frame #7: 0x000000000400ccc1 cockroach`runtime.notetsleepg + 97
Recap: n3 had good stats and crashed because it talked to n2, which has bad stats. n4 is still up and has good stats.
n2 is andy-72:6
n3 is andy-72:3
# n2
~/cockroach debug rocksdb --hex query --db=./cockroach
get 0x0169f70150727261736b00
0x0169F70150727261736B00 ==> 0x120408001000180020002800325B6ACFE8BF03083B101C1A50096C5D564CA220891520A39895FDFFFFFFFFFF0128B2E6FBFFFFFFFFFFFF013086E6ECFEFFFFFFFFFF0138B2E6FBFFFFFFFFFFFF01409DB2A8FEFFFFFFFFFF0148B2E6FBFFFFFFFFFFFF0160B7066805
Routine
510: select [0~2 minutes] [Created by internal.go:206 (*internalExecutorImpl).initConnEx({[] [] false})]
Stack
- replica_range_lease.go:911 (*Replica).redirectOnOrAcquireLease.func2({[{824884140400 *} {824636858368 #402} {122128928 #92} {824888890480 *} {824784257152 *} {824835119768 *} {824835119768 *}] [] false})
- replica_range_lease.go:967 (*Replica).redirectOnOrAcquireLease({[{824636858368 #402} {122128928 #92} {824888890480 *} {0 } {0 } {0 } {0 } {0 } {0 } {0 }] [] true})
- replica_read.go:40 (*Replica).executeReadOnlyBatch({[{824636858368 #402} {122128928 #92} {824888890480 *} {1551788088188014000 *} {0 } {4294967297 #4} {1 } {6 } {0 } {824876587776 *}] [] true})
- replica.go:517 (*Replica).sendWithRangeID({[{824636858368 #402} {122128928 #92} {824888890480 *} {6 } {1551788088188014000 *} {0 } {4294967297 #4} {1 } {6 } {0 }] [] true})
- replica.go:462 (*Replica).Send({[{824636858368 #402} {122128928 #92} {824888890432 *} {1551788088188014000 *} {0 } {4294967297 #4} {1 } {6 } {0 } {824876587776
tbg / timeout
Created February 11, 2019 15:05
3 runs so far, 0 failures, over 5s
3 runs so far, 0 failures, over 10s
3 runs so far, 0 failures, over 15s
3 runs so far, 0 failures, over 20s
3 runs so far, 0 failures, over 25s
3 runs so far, 0 failures, over 30s
3 runs so far, 0 failures, over 35s
3 runs so far, 0 failures, over 40s
3 runs so far, 0 failures, over 45s
3 runs so far, 0 failures, over 50s
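In the output above the run count stays at 3 while the elapsed time keeps growing, which suggests that the runs in flight are not completing. A small Python sketch, assuming progress lines of exactly the form shown with the elapsed time in whole seconds (the `stalled_since` helper is hypothetical), reports when the final run count was first and last seen:

```python
import re

# Progress lines copied from the output above.
OUTPUT = """\
3 runs so far, 0 failures, over 5s
3 runs so far, 0 failures, over 10s
3 runs so far, 0 failures, over 15s
3 runs so far, 0 failures, over 20s
3 runs so far, 0 failures, over 25s
3 runs so far, 0 failures, over 30s
3 runs so far, 0 failures, over 35s
3 runs so far, 0 failures, over 40s
3 runs so far, 0 failures, over 45s
3 runs so far, 0 failures, over 50s
"""

# Assumption: elapsed time is printed as whole seconds ("50s"), as in the lines above.
LINE_RE = re.compile(r"^(\d+) runs so far, (\d+) failures, over (\d+)s$")


def stalled_since(text):
    """Return (runs, first_seen_s, last_seen_s) for the final run count.

    A large gap between first_seen_s and last_seen_s means the run
    count has not advanced for that long. Returns None if no progress
    line is found.
    """
    first_seen = {}
    last = None
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a progress line
        runs, elapsed = int(m.group(1)), int(m.group(3))
        first_seen.setdefault(runs, elapsed)  # remember the first sighting only
        last = (runs, elapsed)
    if last is None:
        return None
    runs, elapsed = last
    return runs, first_seen[runs], elapsed


print(stalled_since(OUTPUT))
```

Here the count of 3 is first reported at 5s and is unchanged at 50s, so no run finished in the 45 seconds between those two lines.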
tbg / abrt
Created February 11, 2019 15:05
SIGABRT: abort
PC=0x7fff731101b2 m=0 sigcode=0
goroutine 0 [idle]:
runtime.pthread_cond_wait(0x857f580, 0x857f540, 0x0)
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/sys_darwin.go:302 +0x51
runtime.semasleep(0xffffffffffffffff, 0x857f200)
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/os_darwin.go:63 +0x85
runtime.notesleep(0x857f340)
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/lock_sema.go:167 +0xe3