OpsMgr always wants to move the Pod to achieve the desired state, but Kubernetes currently does not offer an API to move a Pod.
So we use a Copy-Delete-Create method to implement the Move operation by calling the Kubernetes API server.
In the Copy-Delete-Create method, before the third step, "Create", we assign the destination host to the new Pod's Pod.Spec.NodeName, so the new Pod is bound directly to the desired host without having to wait for a scheduler to bind it to a node.
We call this method "Binding-on-Creation"; it makes it easier to move a Pod controlled by a ReplicationController/ReplicaSet.
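As an illustration, here is a minimal sketch of the Copy-Delete-Create move with Binding-on-Creation, assuming a recent client-go clientset; this is not the actual movePod tool, and the "-moved" name suffix is just a placeholder to avoid a name collision while the original Pod terminates.

package movepod

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// MovePod copies a Pod, deletes the original, and recreates the copy with
// Spec.NodeName pre-assigned, so it is bound to destNode without a scheduler.
func MovePod(ctx context.Context, client kubernetes.Interface, namespace, name, destNode string) error {
    pods := client.CoreV1().Pods(namespace)

    // Copy: read the current Pod and strip the server-populated fields.
    old, err := pods.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    newPod := old.DeepCopy()
    newPod.ResourceVersion = ""
    newPod.UID = ""
    newPod.Status = corev1.PodStatus{}
    // Placeholder: give the copy a new name so it does not collide with the
    // original Pod while the original is still terminating.
    newPod.Name = old.Name + "-moved"

    // Binding-on-Creation: pre-assign the destination node so the new Pod is
    // bound to it directly, without waiting for any scheduler.
    newPod.Spec.NodeName = destNode

    // Delete the original Pod ...
    if err := pods.Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
        return err
    }
    // ... then Create the copy on the desired node.
    _, err = pods.Create(ctx, newPod, metav1.CreateOptions{})
    return err
}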
For Pods not controlled by a ReplicationController/ReplicaSet, it is easy to see that the Copy-Delete-Create method can move the Pod to the desired node.
For Pods controlled by a ReplicationController/ReplicaSet, when one of them is killed, the ReplicationController/ReplicaSet gets notified and immediately creates a new Pod to keep the desired number of Running replicas. However, the ReplicationController/ReplicaSet also makes sure there are no more than the desired number of Running replicas.
Since we create a new Pod as well, the ControllerManager will decide to delete one of the two Pods: the Pod newly created by the ControllerManager, or the Pod created by our MoveOperation. How do we make sure that the Pod created by our MoveOperation survives?
According to the code of the ReplicationManager, when the ReplicationController decides which Pods to delete, it sorts the controller's Pods by a series of conditions. The first condition checks whether a Pod is assigned a Node or not: a Pod that is not assigned a Node will be deleted first.
// ActivePods type allows custom sorting of pods so a controller can pick the best ones to delete.
type ActivePods []*v1.Pod

func (s ActivePods) Len() int      { return len(s) }
func (s ActivePods) Swap(i, j int) { s[i], s[j] = s[j], s[i] }

func (s ActivePods) Less(i, j int) bool {
    // 1. Unassigned < assigned
    // If only one of the pods is unassigned, the unassigned one is smaller
    if s[i].Spec.NodeName != s[j].Spec.NodeName && (len(s[i].Spec.NodeName) == 0 || len(s[j].Spec.NodeName) == 0) {
        return len(s[i].Spec.NodeName) == 0
    }
    // 2. PodPending < PodUnknown < PodRunning
    m := map[v1.PodPhase]int{v1.PodPending: 0, v1.PodUnknown: 1, v1.PodRunning: 2}
    if m[s[i].Status.Phase] != m[s[j].Status.Phase] {
        return m[s[i].Status.Phase] < m[s[j].Status.Phase]
    }
    // 3. Not ready < ready
    // If only one of the pods is not ready, the not ready one is smaller
    if podutil.IsPodReady(s[i]) != podutil.IsPodReady(s[j]) {
        return !podutil.IsPodReady(s[i])
    }
    // TODO: take availability into account when we push minReadySeconds information from deployment into pods,
    //       see https://github.com/kubernetes/kubernetes/issues/22065
    // 4. Been ready for empty time < less time < more time
    // If both pods are ready, the latest ready one is smaller
    if podutil.IsPodReady(s[i]) && podutil.IsPodReady(s[j]) && !podReadyTime(s[i]).Equal(podReadyTime(s[j])) {
        return afterOrZero(podReadyTime(s[i]), podReadyTime(s[j]))
    }
    // 5. Pods with containers with higher restart counts < lower restart counts
    if maxContainerRestarts(s[i]) != maxContainerRestarts(s[j]) {
        return maxContainerRestarts(s[i]) > maxContainerRestarts(s[j])
    }
    // 6. Empty creation time pods < newer pods < older pods
    if !s[i].CreationTimestamp.Equal(s[j].CreationTimestamp) {
        return afterOrZero(s[i].CreationTimestamp, s[j].CreationTimestamp)
    }
    return false
}
Before the delete operation, we first change the scheduler name of the corresponding ReplicationController/ReplicaSet to a non-existent scheduler. The consequence of this modification is that the Pod created by the ControllerManager will wait for this non-existent scheduler to assign it a node, and of course it will never be assigned one, because no such scheduler is running.
Second, we create a new Pod with the node name already assigned. Then, when the ControllerManager decides which Pod to delete, it will choose the one created by the ControllerManager itself.
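A sketch of this first step, again assuming a recent client-go clientset; the scheduler name "none-exist-scheduler" is an arbitrary placeholder that no running scheduler uses.

package movepod

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// PointRCToMissingScheduler points the ReplicationController's pod template at
// a scheduler that does not exist, so any replacement Pod the controller
// creates stays unassigned and, by rule 1 of ActivePods.Less, sorts first for deletion.
func PointRCToMissingScheduler(ctx context.Context, client kubernetes.Interface, namespace, rcName string) error {
    rcs := client.CoreV1().ReplicationControllers(namespace)
    rc, err := rcs.Get(ctx, rcName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    rc.Spec.Template.Spec.SchedulerName = "none-exist-scheduler" // placeholder name
    _, err = rcs.Update(ctx, rc, metav1.UpdateOptions{})
    return err
}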
This experiment is to prove: if one of the Pods controlled by a ReplicationController is killed, and two Pods are then created for the ReplicationController, the Pod that reaches the Running state earlier will be kept, and the other will be deleted by the ReplicationController.
For convenience of description, we make the following definitions:
podA represents the pod created by the ReplicationController,
podB represents the pod created by OurClient.
1. Build and run a custom scheduler (named slow-xyzscheduler). This scheduler is almost the same as kube-scheduler, except that it sleeps for 30 seconds every time it schedules a pod.
$ go get k8s.io/kubernetes
# modify file k8s.io/kubernetes/plugin/pkg/scheduler/scheduler.go
# by adding one line into the beginning of func (sched *Scheduler) scheduleOne()
# time.Sleep(time.Second * 30)
$ cd k8s.io/kubernetes/plugin/cmd/kube-scheduler/
$ go build
# run it
$ ./kube-scheduler --kubeconfig ./my.kubeconfig --logtostderr --v 3 --scheduler-name slow-xyzscheduler --leader-elect=false
2. Create a ReplicationController, and set the scheduler of its pod template to our "slow-xyzscheduler"; this "slow-xyzscheduler" will schedule the pods created by the ReplicationController.
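For example, a hypothetical version of such a ReplicationController built with client-go; the name move-test-rc, the labels, and the nginx image are placeholders of ours.

package movepod

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// CreateTestRC creates a ReplicationController whose pod template names
// "slow-xyzscheduler" as its scheduler, so the Pods it creates wait for it.
func CreateTestRC(ctx context.Context, client kubernetes.Interface, namespace string) error {
    replicas := int32(2)
    rc := &corev1.ReplicationController{
        ObjectMeta: metav1.ObjectMeta{Name: "move-test-rc"},
        Spec: corev1.ReplicationControllerSpec{
            Replicas: &replicas,
            Selector: map[string]string{"app": "move-test"},
            Template: &corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "move-test"}},
                Spec: corev1.PodSpec{
                    SchedulerName: "slow-xyzscheduler",
                    Containers: []corev1.Container{{
                        Name:  "app",
                        Image: "nginx",
                    }},
                },
            },
        },
    }
    _, err := client.CoreV1().ReplicationControllers(namespace).Create(ctx, rc, metav1.CreateOptions{})
    return err
}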
3. Copy-Delete-Create one of the Pods of the ReplicationController with our tool movePod. We made one modification to the code:
After killing the original Pod, and before creating the copy, we add one line of code:
time.Sleep(time.Second * 10)
This modification makes sure that podA (created by the ReplicationController) is created earlier than podB (created by the tool movePod).
PodA is deleted by the ReplicationController. This confirms that even though podA was created earlier than podB, podB (bound to its node on creation) reaches the Running state earlier while podA is still waiting for the slow scheduler, so podA is the one deleted by the ReplicationController.