@dasl-
Created October 4, 2019 17:12

PlannedReparentShard

  1. Starts here: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/vtctl/reparent.go#L98
  2. DemoteMaster: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/wrangler/reparent.go#L423
    1. Set serving = false in vt topo: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/vttablet/tabletmanager/rpc_replication.go#L305
      1. SetServingType https://github.com/dasl-/vitess/blob/master/go/vt/vttablet/tabletserver/tabletserver.go#L470
    2. Set MySQL to read-only (RO)
    3. FLUSH TABLES WITH READ LOCK (why? This waits for long-running SELECTs to finish): https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/mysqlctl/reparent.go#L97-L104
    4. UNLOCK TABLES
    5. SELECT @@GLOBAL.gtid_executed
  3. PromoteSlaveWhenCaughtUp: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/wrangler/reparent.go#L435. Note that the timeout for this is set here: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/wrangler/reparent.go#L428. By default it is 30s: https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/topo/locks.go#L53
    1. Wait for replication to catch up: SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('%s', %v). The timeout is set here: https://github.com/dasl-/vitess/blob/managed-tablet/go/mysql/flavor_mysql.go#L134
    2. SET @@global.read_only = false
    3. Change topo type to master
    4. agent.refreshTablet https://github.com/dasl-/vitess/blob/master/go/vt/vttablet/tabletmanager/rpc_replication.go#L459
      1. agent.updateState https://github.com/dasl-/vitess/blob/master/go/vt/vttablet/tabletmanager/state_change.go#L157
      2. agent.broadcastHealth (let vtgate know we're serving) https://github.com/dasl-/vitess/blob/master/go/vt/vttablet/tabletmanager/state_change.go#L341
  4. Insert row into the reparent journal table on the new master. https://github.com/dasl-/vitess/blob/f23777db959c7bb6be128d091e81f47e04bbc2a8/go/vt/wrangler/reparent.go#L473
  5. Reparent all replicas to the new master. https://github.com/dasl-/vitess/blob/managed-tablet/go/vt/wrangler/reparent.go#L460
    1. Run CHANGE MASTER TO ... query. https://github.com/dasl-/vitess/blob/f23777db959c7bb6be128d091e81f47e04bbc2a8/go/vt/vttablet/tabletmanager/rpc_replication.go#L540
    2. Wait for row inserted into master's reparent journal to replicate. https://github.com/dasl-/vitess/blob/f23777db959c7bb6be128d091e81f47e04bbc2a8/go/vt/vttablet/tabletmanager/rpc_replication.go#L563
  6. UpdateShardFields in topology https://github.com/dasl-/vitess/blob/master/go/vt/wrangler/reparent.go#L502
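
To make the SQL side of steps 2 and 3 easier to follow, here is a minimal Python sketch that only builds the statements listed above, in order. The function names are hypothetical, and the exact read-only statement on the old master is an assumption that mirrors the `SET @@global.read_only = false` issued on the new master; the real logic lives in the linked Go code.

```python
from typing import List


def demote_master_sql() -> List[str]:
    """Statements run on the old master during DemoteMaster (step 2), in order."""
    return [
        # Assumption: the notes only say "Set RO"; this mirrors step 3.2.
        "SET @@global.read_only = true",
        # Waits for long-running selects to finish.
        "FLUSH TABLES WITH READ LOCK",
        "UNLOCK TABLES",
        # The position the new master must reach before promotion.
        "SELECT @@GLOBAL.gtid_executed",
    ]


def promote_replica_sql(gtid_executed: str, timeout_seconds: int) -> List[str]:
    """Statements run on the new master during PromoteSlaveWhenCaughtUp (step 3)."""
    return [
        # Block until the SQL thread has applied everything the old master executed.
        "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('%s', %d)"
        % (gtid_executed, timeout_seconds),
        "SET @@global.read_only = false",
    ]
```

The GTID set returned by the last demote statement is what gets passed as `gtid_executed` to the promote step, which is why the old master has to be demoted first.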

Docs say: -wait_slave_timeout duration time to wait for slaves to catch up in reparenting (default 30s)

waitSlaveTimeout is used for:

  1. Searching for the best new master candidate (the one that has executed the most transactions).
  2. Reparenting all replicas to the new master after the new master has been enabled.

I'd expect it to also be used for timing out PromoteSlaveWhenCaughtUp. A PR to do so has been merged: https://github.com/vitessio/vitess/commit/baa15c2571257ed2bd00cc9f38f0d748eeeaa6d2
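
As a rough illustration of the mismatch, the sketch below maps each phase to the timeout that governs it, before and after that change. The function name, the phase keys, and the `default_rpc_timeout` parameter are hypothetical labels for this note, not Vitess identifiers; the 30s default is the one linked in step 3 above.

```python
from typing import Dict


def phase_timeouts(
    wait_slave_timeout: float,
    default_rpc_timeout: float = 30.0,  # hypothetical name for the 30s default
    patched: bool = False,
) -> Dict[str, float]:
    """Return the timeout (in seconds) applied to each reparent phase."""
    return {
        "find_best_candidate": wait_slave_timeout,
        # Before the change, promotion ignored -wait_slave_timeout.
        "promote_when_caught_up": (
            wait_slave_timeout if patched else default_rpc_timeout
        ),
        "reparent_replicas": wait_slave_timeout,
    }
```

For example, with `-wait_slave_timeout` raised to 120s, promotion would still have been capped at 30s before the change and would get the full 120s after it.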

We had an incident recently using PlannedReparentShard: https://github.etsycorp.com/gist/dleibovic/b666ad2c0d3a1f9ec612f413c4adfa18
