Last active
August 15, 2018 00:30
Rook #1501 Slack discussion on repro in integration tests
Slack link (which may be archived due to 10k message limit): https://rook-io.slack.com/archives/C764K425D/p1533770804000083
we have kubelet logs: https://jenkins.rook.io/blue/organizations/jenkins/rook%2Frook/detail/PR-2010/3/artifacts
and it will even be useful to debug the statefulset issue in this build
travisn [4:33 PM]
in the kubelet log:
```22:58:24.839204 14247 desired_state_of_world_populator.go:311] Failed to add volume "rookpvc" (specName: "pvc-875ab62f-9b5e-11e8-b0eb-0af5d80321b6") for pod "875d980e-9b5e-11e8-b0eb-0af5d80321b6" to desiredStateOfWorld. err=failed to get Plugin from volumeSpec for volume "pvc-875ab62f-9b5e-11e8-b0eb-0af5d80321b6" err=no volume plugin matched```
hmm, why isn’t the flex volume plugin found?
the agent log shows it is listening on the socket
travisn [4:46 PM]
in 1.11 there was a change to limit how flex drivers are loaded. https://github.com/kubernetes/kubernetes/pull/58519
seems like it’s causing our issue
travisn [4:53 PM]
i’m going to try 1.11.2 (instead of 1.11.0), but I don’t see a related fix in the changelog since then
jbw976 [5:18 PM]
that would be an annoying regression if that's the case @travisn. it seems like the intent was efficiency for plugin probing, I wonder if there's a race that is being exacerbated by the rook agent installing its plugin to multiple directories (up to 4, a 2x2 matrix of namespaced/non-namespaced and rook.io/ceph.rook.io).
wonder what the agent logs look like around that time too
and i wonder if we should be dumping out the plugin directory contents (recursively) upon failure too?
actually, this looks pretty related from the kubelet log, and it's the same timestamp as when the agent is installing its plugins:
```Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: E0808 22:57:24.631723 14247 driver-call.go:251] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: W0808 22:57:24.631761 14247 driver-call.go:144] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/rook.io~block-k8s-ns-system/block-k8s-ns-system, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/rook.io~block-k8s-ns-system/block-k8s-ns-system: no such file or directory, output: ""
Aug 08 22:57:24 ip-172-31-25-17 kubelet[14247]: E0808 22:57:24.809539 14247 plugins.go:595] Error dynamically probing plugins: Error creating Flexvolume plugin from directory rook.io~block-k8s-ns-system, skipping. Error: unexpected end of JSON input```
See at the same time in the agent log, it is installing plugins:
```2018-08-08 22:57:24.497773 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/ceph.rook.io~block-k8s-ns-system/.rook.sock
2018-08-08 22:57:24.574891 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/ceph.rook.io~rook/.rook.sock
2018-08-08 22:57:24.796417 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/rook.io~block-k8s-ns-system/.rook.sock
2018-08-08 22:57:24.862425 I | flexvolume: Listening on unix socket for Kubernetes volume attach commands: /flexmnt/rook.io~rook/.rook.sock```
there is some race between the agent installing its plugin to the 4 locations and the kubelet performing discovery and calling `Init`...maybe this is being hit because the rate-limiter was removed in the PR you linked to? https://github.com/kubernetes/kubernetes/pull/58519/files#r163433782
travisn [9:29 PM]
good debugging
what if we stopped copying the driver to the legacy locations? If we get it down to a single driver we shouldn’t see this. For backwards compatibility maybe this means we need a flag that still allows copying to the legacy locations. But by default we wouldn’t need to copy it to multiple locations anymore.