@jbw976
Last active August 15, 2018 00:30
Rook #1501 Slack discussion on repro in integration tests
Slack link (which may be archived due to 10k message limit): https://rook-io.slack.com/archives/C764K425D/p1533770804000083
we have kubelet logs: https://jenkins.rook.io/blue/organizations/jenkins/rook%2Frook/detail/PR-2010/3/artifacts
and they will even be useful for debugging the statefulset issue in this build
travisn [4:33 PM]
in the kubelet log:
```22:58:24.839204 14247 desired_state_of_world_populator.go:311] Failed to add volume "rookpvc" (specName: "pvc-875ab62f-9b5e-11e8-b0eb-0af5d80321b6") for pod "875d980e-9b5e-11e8-b0eb-0af5d80321b6" to desiredStateOfWorld. err=failed to get Plugin from volumeSpec for volume "pvc-875ab62f-9b5e-11e8-b0eb-0af5d80321b6" err=no volume plugin matched```
hmm, why isn’t the flex volume plugin found?
the agent log shows it is listening on the socket
travisn [4:46 PM]
in 1.11 there was a change to limit how flex drivers are loaded: https://github.com/kubernetes/kubernetes/pull/58519
seems like it’s causing our issue
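For context on that change, here is a minimal Go sketch of watch-based flexvolume discovery, loosely in the spirit of that PR: the plugin directory is watched with fsnotify and a driver is probed (its `init` command called) as soon as its directory appears. This is not the kubelet's actual code; the path, the probing, and the error handling are simplified, and it assumes the `vendor~driver/driver` binary layout seen in the logs below.
```go
// Minimal sketch of watch-based flexvolume discovery (NOT the kubelet's code).
// A new vendor~driver directory is probed as soon as it appears; with no rate
// limiting, this can race with an agent still writing the driver binary.
package main

import (
	"log"
	"os/exec"
	"path/filepath"
	"strings"

	"github.com/fsnotify/fsnotify"
)

const pluginDir = "/usr/libexec/kubernetes/kubelet-plugins/volume/exec"

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add(pluginDir); err != nil {
		log.Fatal(err)
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Create == 0 {
			continue
		}
		// e.g. "rook.io~block-k8s-ns-system" -> binary "block-k8s-ns-system"
		dir := event.Name
		parts := strings.SplitN(filepath.Base(dir), "~", 2)
		driverBin := filepath.Join(dir, parts[len(parts)-1])

		out, err := exec.Command(driverBin, "init").Output()
		if err != nil {
			// If the binary isn't fully installed yet, this fails with
			// "fork/exec ...: no such file or directory" or empty output
			// ("unexpected end of JSON input"), as in the kubelet log.
			log.Printf("probe of %s failed: %v", dir, err)
			continue
		}
		log.Printf("probed %s: %s", dir, out)
	}
}
```
The point of the sketch is just the timing: a Create event fires as soon as the directory exists, before the binary inside it has to.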
travisn [4:53 PM]
i’m going to try 1.11.2 (instead of 1.11.0), but I don’t see a related fix in the changelog since then
jbw976 [5:18 PM]
that would be an annoying regression if that's the case @travisn. it seems like the intent was efficiency for plugin probing; I wonder if there's a race that is being exacerbated by the rook agent installing its plugin to multiple directories (up to 4: a 2x2 matrix of namespaced/non-namespaced and rook.io/ceph.rook.io).
wonder what the agent logs look like around that time too
and i wonder if we should be dumping out the plugin directory contents (recursively) upon failure too?
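For what it's worth, that recursive dump could be only a few lines; this sketch just walks the flexvolume plugin dir from the kubelet log and prints each entry (how it would be hooked into the integration tests on failure is left open).
```go
// Sketch of the recursive plugin-directory dump suggested above: walk the
// flexvolume exec dir and print every entry with its mode and size.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const pluginDir = "/usr/libexec/kubernetes/kubelet-plugins/volume/exec"

func main() {
	_ = filepath.Walk(pluginDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			return nil // keep walking past unreadable entries
		}
		fmt.Printf("%s  %s  %d bytes\n", path, info.Mode(), info.Size())
		return nil
	})
}
```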
actually, this from the kubelet log looks pretty related, and it's at the same timestamp as when the agent is installing its plugins:
```Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: E0808 22:57:24.631723 14247 driver-call.go:251] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: W0808 22:57:24.631761 14247 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/rook.io~block-k8s-ns-system/block-k8s-ns-system, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/rook.io~block-k8s-ns-system/block-k8s-ns-system: no such file or directory, output: ""
Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: E0808 22:57:24.809539 14247 plugins.go:595] Error dynamically probing plugins: Error creating Flexvolume plugin from directory rook.io~block-k8s-ns-system, skipping. Error: unexpected end of JSON input```
And at the same time, the agent log shows it installing its plugins:
```2018-08-08 22:57:24.497773 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/ceph.rook.io~block-k8s-ns-system/.rook.sock
2018-08-08 22:57:24.574891 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/ceph.rook.io~rook/.rook.sock
2018-08-08 22:57:24.796417 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/rook.io~block-k8s-ns-system/.rook.sock
2018-08-08 22:57:24.862425 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/rook.io~rook/.rook.sock```
there is some race between the agent installing its plugin to the 4 locations and the kubelet performing discovery and calling `Init`... maybe this is being hit because the rate limiter was removed in the PR you linked to? https://github.com/kubernetes/kubernetes/pull/58519/files#r163433782
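One way to narrow that window (a sketch of a common technique, not what the agent currently does) is to make each install atomic: write the driver binary to a temp file inside the target directory, mark it executable, then rename it into place, so a watcher never observes a half-written executable. The source and destination paths below are illustrative.
```go
// Sketch of an atomic flexvolume driver install: copy to a temp file in the
// destination directory, chmod it, then rename into place. rename(2) is atomic
// on the same filesystem, so a directory watcher never sees a partial binary.
// This is an illustration only, not the rook agent's current install code.
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

func installDriver(srcPath, destDir, driverName string) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	tmp, err := os.CreateTemp(destDir, "."+driverName+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleanup on failure; harmless after the rename

	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// Atomic publish: the driver binary appears fully formed or not at all.
	return os.Rename(tmp.Name(), filepath.Join(destDir, driverName))
}

func main() {
	// Destination matches one of the directories from the kubelet log;
	// /tmp/rookflex is a made-up source path for the copied binary.
	err := installDriver("/tmp/rookflex",
		"/usr/libexec/kubernetes/kubelet-plugins/volume/exec/rook.io~block-k8s-ns-system",
		"block-k8s-ns-system")
	if err != nil {
		log.Fatal(err)
	}
}
```
Whether this alone is enough depends on whether the kubelet re-probes after the initial failed `init`, which is exactly the open question here; it mainly removes the "fork/exec ... no such file or directory" flavor of the failure.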
travisn [9:29 PM]
good debugging
what if we stopped copying the driver to the legacy locations? If we get it down to a single driver we shouldn’t see this. For backwards compatibility maybe this means we need a flag that still allows copying to the legacy locations. But by default we wouldn’t need to copy it to multiple locations anymore. (edited)
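A rough sketch of what that flag could look like in the agent (the env var name, driver name, and `copyDriverTo` helper are all made up for illustration, and it assumes rook.io~ is the legacy vendor prefix): always install under the current ceph.rook.io prefix, and only copy to the legacy rook.io paths when the compatibility flag is set.
```go
// Sketch of gating the legacy flexvolume paths behind an opt-in flag.
// AGENT_COPY_LEGACY_FLEXVOLUME_DIRS and copyDriverTo are hypothetical names,
// used only to illustrate the proposal above; they are not existing Rook config.
package main

import (
	"log"
	"os"
	"path/filepath"
	"strconv"
)

const flexRoot = "/usr/libexec/kubernetes/kubelet-plugins/volume/exec"

// copyDriverTo stands in for whatever actually installs the driver binary.
func copyDriverTo(dir string) error {
	log.Printf("installing flexvolume driver into %s", dir)
	return nil
}

func main() {
	driverName := "block-k8s-ns-system" // example driver name from the logs above

	// Always install under the current vendor prefix.
	dirs := []string{filepath.Join(flexRoot, "ceph.rook.io~"+driverName)}

	// Only copy to the legacy rook.io~ location when explicitly requested.
	if legacy, _ := strconv.ParseBool(os.Getenv("AGENT_COPY_LEGACY_FLEXVOLUME_DIRS")); legacy {
		dirs = append(dirs, filepath.Join(flexRoot, "rook.io~"+driverName))
	}

	for _, dir := range dirs {
		if err := copyDriverTo(dir); err != nil {
			log.Fatalf("failed to install driver into %s: %v", dir, err)
		}
	}
}
```
With the flag off, the agent would write to a single directory per driver, which also shrinks the surface for the probing race discussed above.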