@ruo91
Last active March 2, 2022 16:19
OpenShift v4.x - Operator Status Pending

1. Issue

Operators that were already installed are reported with a status of Pending. (Screenshot: Operator Pending)

2. Root cause analysis and resolution

The affected Operators were strimzi-kafka-operator and jaeger-product.

2.1. Strimzi Kafka Operator

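The issue occurs when the PVC size of a Kafka instance is changed.

The customer's environment runs Trident, the CSI driver from the storage vendor NetApp, on OpenShift to provide Storage Classes.
Because CSI (Container Storage Interface) is in use, creating a PVC (Persistent Volume Claim) causes
a PV (Persistent Volume) to be dynamically provisioned, so the user only has to define the storage size in the PVC
and the volume is created and ready to use automatically.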
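
As a rough illustration of the dynamic provisioning described above, the sketch below writes a minimal PVC manifest that names only a Storage Class and a requested size. The claim name `demo-pvc` and the file name are hypothetical; `ontap-sc-delete` and the `ybkim` namespace are the ones that appear in the listings in this note.

```shell
# Write a minimal PVC manifest; only the size and the Storage Class are given.
# "demo-pvc" and "demo-pvc.yaml" are hypothetical names for illustration.
cat > demo-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
  namespace: ybkim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ontap-sc-delete
  resources:
    requests:
      storage: 2Gi
EOF

# Applying it lets the CSI provisioner create and bind a matching PV:
#   oc apply -f demo-pvc.yaml
#   oc get pvc demo-pvc -n ybkim
```

No PV is defined by hand here; the provisioner named in the Storage Class is expected to create one.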

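In this environment, the user initially set the Kafka storage size to 2Gi through the Kafka custom resource and later increased it to 10Gi, as shown below. That change triggered the issue.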

- Kafka Instance: Before

[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      inter.broker.protocol.version: '3.0'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 2Gi
      type: persistent-claim
    version: 3.0.0
  zookeeper:
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 1Gi
      type: persistent-claim

- Kafka Instance: After

[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      inter.broker.protocol.version: '3.0'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 10Gi
      type: persistent-claim
    version: 3.0.0
  zookeeper:
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 1Gi
      type: persistent-claim
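
The only difference between the two listings is `spec.kafka.storage.size`. The note does not say how the user made the change; one plausible way, sketched here with hypothetical helper names, is a JSON merge patch on the Kafka resource.

```shell
# Build the JSON merge patch that changes only spec.kafka.storage.size.
kafka_storage_patch() {  # usage: kafka_storage_patch <size>
  printf '{"spec":{"kafka":{"storage":{"size":"%s"}}}}' "$1"
}

# Apply the patch to a Kafka resource.
resize_kafka_storage() {  # usage: resize_kafka_storage <name> <namespace> <size>
  oc patch kafka "$1" -n "$2" --type merge -p "$(kafka_storage_patch "$3")"
}

# Example matching the listings above:
#   resize_kafka_storage test-ybkim ybkim 10Gi
```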

Once the Kafka storage size is changed to 10Gi as above,
the Strimzi Cluster Operator updates the StatefulSet and restarts the Kafka pods.

The strimzi-cluster-operator pod then prints the log below, and the Operator status stays stuck at Pending.

[root@bastion ~]# oc logs -f strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators
2022-03-02 05:29:28 INFO  ClusterOperator:123 - Triggering periodic reconciliation for namespace *
2022-03-02 05:29:28 INFO  AbstractOperator:226 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Kafka test-ybkim will be checked for creation or modification
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-0 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-1 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-2 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:31 INFO  AbstractOperator:517 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): reconciled
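
When the log is long, it can help to pull out just the PVC names from the resize warnings. The filter below is a small sketch; the function name `unresizable_pvcs` is hypothetical, and the pattern is based only on the warning text shown in the excerpt above.

```shell
# Read Strimzi cluster operator log lines on stdin and print the name of each
# PVC reported as not resizable, one per line, de-duplicated.
unresizable_pvcs() {
  sed -n 's/.*does not support resizing of volumes\. PVC \([^ ]*\) cannot be resized.*/\1/p' | sort -u
}

# Example (pod name taken from the log excerpt above):
#   oc logs strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators | unresizable_pvcs
```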

These warnings appear because the Storage Class does not allow PVC volume expansion.
According to the Trident documentation [1], the allowVolumeExpansion: true option can be set on the Storage Class.
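
Before changing anything, the current setting can be read straight from the Storage Class. This is a minimal sketch; the helper name `sc_allows_expansion` is hypothetical, and it assumes the logged-in user is allowed to read Storage Classes.

```shell
# Print "yes" if the given Storage Class has allowVolumeExpansion set to true,
# otherwise "no" (the jsonpath output is empty when the field was never set).
sc_allows_expansion() {  # usage: sc_allows_expansion <storageclass>
  value=$(oc get storageclass "$1" -o jsonpath='{.allowVolumeExpansion}')
  if [ "$value" = "true" ]; then echo yes; else echo no; fi
}

# Example:
#   sc_allows_expansion ontap-sc-delete
```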

Therefore, adding this option to the existing Storage Class as shown below enables expansion and resolves the issue.

[root@bastion ~]# oc edit storageclass ontap-sc-delete
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-delete
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
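
The same change can also be made without opening an editor. The sketch below patches the field directly and then lists PVC capacities as one way to confirm that the resize went through; both helper names are hypothetical.

```shell
# Enable volume expansion on a Storage Class non-interactively.
enable_volume_expansion() {  # usage: enable_volume_expansion <storageclass>
  oc patch storageclass "$1" -p '{"allowVolumeExpansion": true}'
}

# List each PVC in a namespace with its current capacity.
pvc_capacities() {  # usage: pvc_capacities <namespace>
  oc get pvc -n "$1" -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
}

# Example:
#   enable_volume_expansion ontap-sc-delete
#   pvc_capacities ybkim
```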

- RefURL

[1]: NetApp: Trident Expand Volumes

2.2. Jaeger Operator (Red Hat OpenShift distributed tracing platform)

The jaeger-operator pod, installed for All Namespaces, printed the following log.

[root@bastion ~]# oc logs -f jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
time="2022-03-02T05:21:23Z" level=info msg=Versions arch=amd64 identity=openshift-operators.jaeger-operator jaeger=1.30.0 jaeger-operator=v1.30.0 os=linux version=go1.17.2
I0302 05:21:24.627546       1 request.go:668] Waited for 1.036315321s due to client-side throttling, not priority and fairness, request: GET:https://10.200.0.1:443/apis/storage.k8s.io/v1?timeout=32s
2022-03-02T05:21:29.086Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "0.0.0.0:8383"}
time="2022-03-02T05:21:29Z" level=info msg="Auto-detected the platform" platform=openshift
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of deployments to analyze" error="deployments.apps is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"deployments\" in API group \"apps\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of existing jaeger instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=warning msg="failed to upgrade managed instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=info msg="Updated OAuth Proxy image flag" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:225521bf209fa3d3450b1778eda2ac15bbda786c8c224195e120ebd1bf789b47"

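These errors mean that the ClusterRole and ClusterRoleBinding defined in the CSV (ClusterServiceVersion) object,
which are normally created by the InstallPlan during Operator installation, were not bound to the Service Account
used by jaeger-operator, so its requests were rejected as forbidden.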
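
The missing permissions can be confirmed without reading the operator log by asking the API server on behalf of the Service Account. This is a sketch with a hypothetical helper name; it assumes the logged-in user is allowed to impersonate service accounts (for example a cluster administrator).

```shell
# Ask whether a service account may perform a verb on a resource across all
# namespaces; oc prints "yes" or "no".
sa_can_i() {  # usage: sa_can_i <namespace> <serviceaccount> <verb> <resource>
  oc auth can-i "$3" "$4" --all-namespaces --as="system:serviceaccount:$1:$2"
}

# The two checks that correspond to the forbidden errors in the log:
#   sa_can_i openshift-operators jaeger-operator list deployments.apps
#   sa_can_i openshift-operators jaeger-operator list jaegers.jaegertracing.io
```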

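Normally the RBAC objects defined in the CSV are created automatically and the permissions are granted without any problem.
They can go missing, however, if a user deletes the role by mistake or if something goes wrong during Operator installation.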
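
One way to check whether those objects are missing is to search by the `olm.owner` label, which is the label carried by the manifests in this note. The helper name is hypothetical, and the sketch assumes the OLM-created objects carry the same label.

```shell
# List ClusterRoles and ClusterRoleBindings labelled as owned by a given CSV.
# An empty result suggests that the objects are missing.
csv_cluster_rbac() {  # usage: csv_cluster_rbac <csv-name>
  oc get clusterrole,clusterrolebinding -l "olm.owner=$1" -o name
}

# Example:
#   csv_cluster_rbac jaeger-operator.v1.30.0
```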

This issue is therefore resolved by creating the ClusterRole and ClusterRoleBinding and binding them to the Service Account.

- Create / add the ClusterRole

[root@bastion ~]# vi jaeger-operator-cluster-role.yaml
kind: ClusterRole 
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
rules:
  - verbs:
      - create
    apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
  - verbs:
      - create
    apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - deployments
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - apps
    resources:
      - deployments/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - console.openshift.io
    resources:
      - consolelinks
  - verbs:
      - create
      - get
      - list
      - update
    apiGroups:
      - coordination.k8s.io
    resources:
      - leases
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - configmaps
      - persistentvolumeclaims
      - pods
      - secrets
      - serviceaccounts
      - services
      - services/finalizers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - namespaces
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - ''
    resources:
      - namespaces/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - extensions
    resources:
      - ingresses
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - image.openshift.io
    resources:
      - imagestreams
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers
  - verbs:
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/finalizers
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - kafka.strimzi.io
    resources:
      - kafkas
      - kafkausers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - logging.openshift.io
    resources:
      - elasticsearches
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - monitoring.coreos.com
    resources:
      - servicemonitors
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - route.openshift.io
    resources:
      - routes
[root@bastion ~]# oc create -f jaeger-operator-cluster-role.yaml

- Create / add the ClusterRoleBinding

Bind the ClusterRole to the Service Account used by the Jaeger Operator.

[root@bastion ~]# vi jaeger-operator-cluster-role-binding.yaml
kind: ClusterRoleBinding 
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
subjects:
  - kind: ServiceAccount
    name: jaeger-operator
    namespace: openshift-operators
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: jaeger-operator.v1.30.0
[root@bastion ~]# oc create -f jaeger-operator-cluster-role-binding.yaml

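Once the roles are in place, restart the Jaeger Operator pod. Within a few seconds the Operator picks up
the permissions bound to its Service Account and its status changes to Succeeded, which resolves the issue.

3. Conclusion

Consider every scenario from A to Z, trust no one, get chased by the clock, and the solution comes into view?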
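
The restart and the status check can be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the CSV name `jaeger-operator.v1.30.0` is inferred from the `olm.owner` label in the manifests above rather than stated directly in this note.

```shell
# Print the phase of a ClusterServiceVersion (for example Pending or Succeeded).
csv_phase() {  # usage: csv_phase <csv-name> <namespace>
  oc get csv "$1" -n "$2" -o jsonpath='{.status.phase}'
}

# Poll until the CSV reports Succeeded; give up after <tries> attempts.
wait_for_csv() {  # usage: wait_for_csv <csv-name> <namespace> [tries] [sleep-seconds]
  tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    if [ "$(csv_phase "$1" "$2")" = "Succeeded" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${4:-2}"
  done
  return 1
}

# Example (pod name taken from the log excerpt earlier in this section):
#   oc delete pod jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
#   wait_for_csv jaeger-operator.v1.30.0 openshift-operators
```

Deleting the pod relies on its Deployment to recreate it; the polling loop only reads the CSV and changes nothing.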
