@beekhof
Last active February 11, 2020 07:56

// LeaseSpec: https://github.com/kubernetes/api/blob/master/coordination/v1/types.go#L40

leasePadding = 30 seconds

Unless otherwise specified, when updating the following fields, always set:

  • LeaseDurationSeconds = lifecycle.metal3.io/maintenance + leasePadding,
  • AcquireTime = now,
  • RenewTime = now,
  • HolderIdentity = machine-remediation

A valid lease means: now < LeaseDurationSeconds + AcquireTime
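
The default field values and the validity check above can be sketched in Python. This is only an illustrative sketch, not the controller's code: the Lease dataclass merely mirrors the LeaseSpec field names, new_lease_fields and is_valid are hypothetical helper names, and it assumes the lifecycle.metal3.io/maintenance annotation holds a duration in seconds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

LEASE_PADDING_SECONDS = 30  # leasePadding, as defined above
HOLDER_IDENTITY = "machine-remediation"


@dataclass
class Lease:
    """Hypothetical stand-in that mirrors the LeaseSpec field names."""
    holder_identity: str
    lease_duration_seconds: int
    acquire_time: datetime
    renew_time: datetime
    lease_transitions: int = 1


def new_lease_fields(maintenance_seconds: int, now: datetime) -> Lease:
    # Assumption: the maintenance annotation value is a number of seconds.
    return Lease(
        holder_identity=HOLDER_IDENTITY,
        lease_duration_seconds=maintenance_seconds + LEASE_PADDING_SECONDS,
        acquire_time=now,
        renew_time=now,
    )


def is_valid(lease: Optional[Lease], now: datetime) -> bool:
    # A valid lease means: now < LeaseDurationSeconds + AcquireTime
    if lease is None:
        return False
    expiry = lease.acquire_time + timedelta(seconds=lease.lease_duration_seconds)
    return now < expiry
```

For example, a 600 second maintenance window requested at time T yields LeaseDurationSeconds = 630, and the lease stops being valid at T + 630 seconds.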

Controller, if lifecycle.metal3.io/maintenance exists:

  • if valid lease AND not the current owner:
    • update lifecycle.metal3.io/maintenance-status = waiting
    • exit
  • if no lease:
    • create with HolderIdentity, LeaseTransitions = 1, AcquireTime, LeaseDurationSeconds
    • Set ownerRef to refer to the Node
    • update lifecycle.metal3.io/maintenance-status = new, create
  • if HolderIdentity doesn’t match:
    • update HolderIdentity, LeaseTransitions + 1, AcquireTime, LeaseDurationSeconds
    • update lifecycle.metal3.io/maintenance-status = new, acquired
    • Ensure lease ownerRef refers to the Node
  • if LeaseDurationSeconds == 0:
    • update LeaseTransitions + 1, AcquireTime, LeaseDurationSeconds
    • update lifecycle.metal3.io/maintenance-status = new
  • if LeaseDurationSeconds != lifecycle.metal3.io/maintenance + leasePadding:
    • // lease interval changed
    • update LeaseDurationSeconds, and RenewTime
    • update lifecycle.metal3.io/maintenance-status = updated
  • else if now > (LeaseDurationSeconds + AcquireTime):
    • // The lease records a previous maintenance window
    • // Unlikely but could happen if:
    • // 1. we get API errors preventing the annotation from being deleted prior to the lease expiring
    • // 2. we get API errors preventing LeaseDurationSeconds from being unset and someone recreated the annotation after we deleted it
    • // 3. Reconcile() does not get called within leasePadding of the lease expiring
    • // The second is more likely, so treat as a new request
    • update LeaseDurationSeconds, AcquireTime, and RenewTime
    • update lifecycle.metal3.io/maintenance-status = new, stale
  • if lifecycle.metal3.io/maintenance-status == ended:
    • // Someone re-created the annotation, with the same interval, after we deleted it, but before we unset LeaseDurationSeconds, and before the old lease expired
    • // Unlikely but could happen if we get API errors preventing LeaseDurationSeconds from being unset
    • // Treat as a request for a new lease starting now
    • update LeaseDurationSeconds, AcquireTime, and RenewTime
    • update lifecycle.metal3.io/maintenance-status = new, recreate
  • if (now + leasePadding) > (LeaseDurationSeconds + AcquireTime):
    • delete annotation
    • update lifecycle.metal3.io/maintenance-status = ended
    • if lease time remaining > 0, use a retry loop to uncordon for up to “lease time remaining” seconds
    • if lease time remaining still > 0, use a retry loop to cancel drain for up to “lease time remaining” seconds
    • if lease time remaining still > 0, use a retry loop to set LeaseDurationSeconds = 0 for up to “lease time remaining” seconds
    • exit
  • cordon
  • drain
  • update lifecycle.metal3.io/maintenance-status = active
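
The branch order above can be sketched as a pure function that returns the maintenance-status value of the first matching branch. This is a rough sketch under stated assumptions: classify and the reduced Lease dataclass are hypothetical names, the branches are assumed to be evaluated first-match from top to bottom, and the API updates, cordon and drain themselves are not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

LEASE_PADDING = 30  # seconds
US = "machine-remediation"  # our HolderIdentity


@dataclass
class Lease:
    """Hypothetical, reduced view of the lease fields used by the checks."""
    holder_identity: str
    lease_duration_seconds: int
    acquire_time: datetime


def classify(lease: Optional[Lease], maintenance_seconds: int,
             status: str, now: datetime) -> str:
    """Return the status of the first matching branch (assumed first-match)."""
    if lease is None:
        return "new, create"
    expiry = lease.acquire_time + timedelta(seconds=lease.lease_duration_seconds)
    if lease.holder_identity != US:
        # A valid lease held by someone else means waiting; otherwise acquire it.
        return "waiting" if now < expiry else "new, acquired"
    if lease.lease_duration_seconds == 0:
        return "new"  # the previous lease was fully cleaned up
    if lease.lease_duration_seconds != maintenance_seconds + LEASE_PADDING:
        return "updated"  # lease interval changed
    if now > expiry:
        return "new, stale"  # the lease records a previous maintenance window
    if status == "ended":
        return "new, recreate"  # annotation re-created before cleanup finished
    if now + timedelta(seconds=LEASE_PADDING) > expiry:
        return "ended"  # within leasePadding of expiry: tear down
    return "active"  # cordon and drain
```

Keeping the decision separate from the side effects makes each branch easy to exercise with fixed timestamps.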

Controller, if lifecycle.metal3.io/maintenance does not exist:

  • if valid lease AND we are the current owner:
    • update lifecycle.metal3.io/maintenance-status = ended
    • if lease time remaining > 0, use a retry loop to uncordon for up to “lease time remaining” seconds
    • if lease time remaining still > 0, use a retry loop to cancel drain for up to “lease time remaining” seconds
    • if lease time remaining still > 0, use a retry loop to set LeaseDurationSeconds = 0 for up to “lease time remaining” seconds
    • on any errors: requeue
  • else if LeaseDurationSeconds > 0 and we are the current owner:
    • // We lost the lease prior to tearing down maintenance mode, probably due to API errors
    • // This could result in manual drains/cordons being undone and may not be a good idea to implement
    • // On the other hand, the node will be useless if it remains cordoned
    • update LeaseDurationSeconds = leasePadding, and RenewTime = now
    • if lease time remaining > 0, use a retry loop to uncordon for up to “lease time remaining” seconds
    • if lease time remaining still > 0, use a retry loop to set LeaseDurationSeconds = 0 for up to “lease time remaining” seconds
    • on any errors: requeue
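
The "retry loop for up to lease time remaining seconds" steps used in both controller paths can be sketched as a deadline-bounded retry. This is an illustrative sketch only: retry_until and tear_down are hypothetical names, the one second interval is an arbitrary assumption, and the uncordon, cancel drain and clear duration steps are passed in as callables that return True on success. Note that a loop like this occupies the calling worker until it succeeds or the deadline passes.

```python
import time
from typing import Callable


def retry_until(deadline: float, step: Callable[[], bool],
                interval: float = 1.0,
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call step() until it succeeds or the deadline (lease expiry) passes."""
    while clock() < deadline:
        if step():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(interval, remaining))  # never sleep past the deadline
    return False


def tear_down(deadline: float, uncordon: Callable[[], bool],
              cancel_drain: Callable[[], bool],
              clear_duration: Callable[[], bool], **kw) -> bool:
    # Each step only runs while lease time remains. all() stops at the first
    # failure, and a False result means the caller should requeue.
    return all(retry_until(deadline, step, **kw)
               for step in (uncordon, cancel_drain, clear_duration))
```

The clock and sleep parameters exist only so the loop can be driven by a fake clock when testing.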
@MoserMichael

Controller, if lifecycle.metal3.io/maintenance does not exist:
if valid lease AND we are the current owner:

question: the conditional loops might block the controller loop for a significant period of time, so that other reconcile requests remain queued. Is that correct?

@beekhof
Author

beekhof commented Feb 10, 2020

if valid lease AND not the current owner:
update lifecycle.metal3.io/maintenance-status = waiting
requeue (not exit as written in the gist)

is that correct?

sure. probably most if not all returns from the function should be requeues

@beekhof
Author

beekhof commented Feb 10, 2020

  1. question: leasePadding - is this a constant? If yes, then what is its value?

first line of the file

  1. question: Should leasePadding be added, or should it be aligned to the value of leasePadding (upwards)?

I don't understand the second part of the question

@beekhof
Author

beekhof commented Feb 10, 2020

if HolderIdentity doesn’t match:

question: is that equivalent to "if lease expired AND not the current owner: " ?

effectively. why? are you thinking it duplicates "if valid lease AND not the current owner:" ?

@beekhof
Author

beekhof commented Feb 10, 2020

if LeaseDurationSeconds == 0

question: what does it mean when LeaseDurationSeconds is 0 (or the reference value nil)?
Does that mean we are in the "stale" state? If that is the case, shouldn't we do that when entering the "stale" state?

LeaseDurationSeconds is set to zero when we finish cleaning up at the end of a lease. it's the last thing we do

@beekhof
Author

beekhof commented Feb 10, 2020

else if now > (LeaseDurationSeconds + AcquireTime)

  1. question: should that be
    else if now > (LeaseDurationSeconds + max(RenewTime,AcquireTime))

no
https://github.com/kubernetes/api/blob/master/coordination/v1/types.go#L55

  1. question: do we need to increment the # of LeaseTransitions?

no, only when the holder changes
see: https://github.com/kubernetes/api/blob/master/coordination/v1/types.go#L59

  1. question: "update LeaseDurationSeconds" to what value should it be updated? is it nil or 0 or some other value? AcquireTime and RenewTime set to now?

same as everywhere else, lifecycle.metal3.io/maintenance + leasePadding

@beekhof
Author

beekhof commented Feb 10, 2020

if lifecycle.metal3.io/maintenance-status == ended

  1. question: i thought you told me that you didn't want conditions on the status in the annotation?

yes, i'm not happy about it either - it's this or create a CRD which we are still trying to avoid.
for now I'm telling myself it's ok because it's only relevant if an unlikely series of events occurs

  1. question: to what value should we update LeaseDurationSeconds, AcquireTime, and RenewTime ?

any time you're updating AcquireTime and RenewTime you should set it to now
LeaseDurationSeconds is always lifecycle.metal3.io/maintenance + leasePadding

@MoserMichael

if HolderIdentity doesn’t match:
question: is that equivalent to "if lease expired AND not the current owner: " ?

effectively. why? are you thinking it duplicates "if valid lease AND not the current owner:" ?

Ok, in this case (if HolderIdentity doesn’t match:) applies both to the case where the lease has expired and to the case where it has not. If the lease has not expired and HolderIdentity doesn't match, then how is it possible for us to take ownership of the lease just like that? Don't we have to wait for the lease to expire first? (That's my main problem.)

@MoserMichael

MoserMichael commented Feb 11, 2020

I think you will never get to one of the checks:
if lifecycle.metal3.io/maintenance exists:
...
if lifecycle.metal3.io/maintenance-status == ended
because when you transition to the ended state you also delete the lifecycle.metal3.io/maintenance annotation, so it won't get to the second nested check on a future reconcile call.
