@ruo91
Last active March 9, 2022 11:02
OpenShift v4.x - Operator Unknown / Pending Issue Summary

This note summarizes Operator issues encountered in a customer's OpenShift v4.x environment and how they were resolved.

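1. Operator in Unknown state

Operators for Kafka, Redis, and Nexus had been installed from a Catalog Source built for a disconnected (air-gapped) network. When the Operator configuration was switched to an online environment, the installed CSV (ClusterServiceVersion) versions no longer matched the catalog, and the Operators went into the Unknown state.
(Screenshot: Operator Unknown)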

The issue was resolved by removing the CSV version information from the existing Operator Subscriptions and then restarting OLM (Operator Lifecycle Manager), which manages the Operators' Package Manifest information, and the CVO (Cluster Version Operator).

In OLM's logic, the Unknown state has two broad causes.
The first is a Subscription that names the wrong Catalog Source at install time, or a Catalog Source that has been deleted.

The second is a startingCSV version in the Subscription that does not match the Catalog Source manifest, or that has been removed from it.

The root cause here was therefore that the CSV manifests of the already-installed Operators had been removed from the updated Catalog Source image, while the existing Operator Subscriptions still referenced the old CSV versions.
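
The fix described above (dropping the pinned CSV version from the Subscription and restarting OLM) could be scripted roughly as follows. This is a minimal sketch, not the exact commands used on site: it assumes the pinned version lives in `spec.startingCSV`, that the OLM pods carry the `app=catalog-operator` and `app=olm-operator` labels, and the function names are hypothetical. The CVO restart is left out because these notes do not show how it was performed.

```shell
# Hypothetical helpers; they only wrap standard `oc` subcommands.

# Remove the pinned startingCSV from a Subscription so that OLM resolves the
# version from the current Catalog Source instead.
# $1 = Subscription name, $2 = namespace
# Note: a JSON-patch "remove" fails if spec.startingCSV is not set.
clear_starting_csv() {
  oc patch subscription.operators.coreos.com "$1" -n "$2" --type json \
    -p '[{"op": "remove", "path": "/spec/startingCSV"}]'
}

# Restart the OLM pods by deleting them; their Deployments recreate them.
# Assumption: the pods carry these app labels in this cluster.
restart_olm() {
  oc delete pod -n openshift-operator-lifecycle-manager -l app=catalog-operator
  oc delete pod -n openshift-operator-lifecycle-manager -l app=olm-operator
}
```

A typical call would be `clear_starting_csv <subscription> openshift-operators` followed by `restart_olm`, run from a session logged in as a cluster administrator.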

- Checking the Catalog Source DB

Top: Strimzi Kafka Operator CSV information in the custom Catalog Source built for the disconnected environment.
Bottom: Strimzi Kafka Operator CSV information in the latest online Catalog Source.
(Screenshot: Operator Catalog Source CSV Diff)
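
A similar comparison can be approximated through the API without opening the catalog database. The sketch below is only an illustration under assumptions: the function name is hypothetical, and a PackageManifest exposes only the head CSV of each channel, so it is not a full listing of the catalog DB shown in the screenshot.

```shell
# Hypothetical helper: print the CSV a Subscription is pinned to and has
# installed, next to the channel heads the catalog currently offers.
# $1 = Subscription name, $2 = its namespace, $3 = package name
show_csv_versions() {
  echo "startingCSV:  $(oc get subscription.operators.coreos.com "$1" -n "$2" -o jsonpath='{.spec.startingCSV}')"
  echo "installedCSV: $(oc get subscription.operators.coreos.com "$1" -n "$2" -o jsonpath='{.status.installedCSV}')"
  echo "catalog channel heads (channel, currentCSV):"
  oc get packagemanifest "$3" -n openshift-marketplace \
    -o jsonpath='{range .status.channels[*]}{.name}{"\t"}{.currentCSV}{"\n"}{end}'
}
```

If the `startingCSV` or `installedCSV` value does not appear among the channel heads, the Subscription may be pointing at a version the current catalog no longer offers, which is worth checking further.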

2. Operator in Pending state

This occurred with the Strimzi Kafka Operator and the Red Hat OpenShift distributed tracing platform (Jaeger) Operator.
(Screenshot: Operator Pending)

2.1. Strimzi Kafka Operator

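A Kafka instance used by the DevOps team had its PVC size changed from 2Gi to 10Gi, which triggered the issue.
The change relies on PVC expansion, but volume expansion was disabled in the Storage Class of the NetApp Trident CSI configured on the development server, so the resize could not be applied.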
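
Whether a Storage Class permits expansion can be checked before resizing. The helper below is a small sketch with a hypothetical function name; it reads the standard `allowVolumeExpansion` field, which `jsonpath` prints as empty when the field is unset.

```shell
# Hypothetical helper: report whether a StorageClass permits PVC expansion.
# $1 = StorageClass name
sc_allows_expansion() {
  case "$(oc get storageclass "$1" -o jsonpath='{.allowVolumeExpansion}')" in
    true) echo "$1: volume expansion enabled" ;;
    *)    echo "$1: volume expansion disabled" ;;
  esac
}
```

For the class used in this case, the call would be `sc_allows_expansion ontap-sc-delete`.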

- Kafka Instance: Before

[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
...
......
  kafka:
    storage:
      class: ontap-sc-delete
      size: 2Gi
      type: persistent-claim
    version: 3.0.0
...
......

- Kafka Instance: After

[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
...
......
  kafka:
    storage:
      class: ontap-sc-delete
      size: 10Gi
      type: persistent-claim
    version: 3.0.0
...
......

- Kafka Operator log

[root@bastion ~]# oc logs -f strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators
2022-03-02 05:29:28 INFO  ClusterOperator:123 - Triggering periodic reconciliation for namespace *
2022-03-02 05:29:28 INFO  AbstractOperator:226 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Kafka my-cluster will be checked for creation or modification
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-0 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-1 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN  KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-2 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:31 INFO  AbstractOperator:517 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): reconciled

Following the Trident documentation [1], the allowVolumeExpansion option was added to the Storage Class to enable PVC expansion, which resolved the issue.
[1]: NetApp: Expand Volumes

- Storage Class: ontap-sc-delete

[root@bastion ~]# oc edit storageclass ontap-sc-delete
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-delete
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

- Storage Class: ontap-sc-retain

[root@bastion ~]# oc edit storageclass ontap-sc-retain
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-retain
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: Immediate
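
For scripted use, the same change can be applied without an interactive `oc edit`. This is a sketch with a hypothetical function name; it sets only the `allowVolumeExpansion` field and leaves the rest of the Storage Class untouched.

```shell
# Hypothetical helper: enable volume expansion on an existing StorageClass.
# $1 = StorageClass name
enable_sc_expansion() {
  oc patch storageclass "$1" -p '{"allowVolumeExpansion": true}'
}
```

Both classes from this note could be handled with `for sc in ontap-sc-delete ontap-sc-retain; do enable_sc_expansion "$sc"; done`.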

2.2. Red Hat OpenShift distributed tracing platform (Jaeger) Operator

The Jaeger Operator provides end-to-end tracing of services when the Service Mesh Operator (Istio) is in use.

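During Operator setup, in the InstallPlan step, the ClusterRole and ClusterRoleBinding included in the CSV (ClusterServiceVersion) were not bound to the Service Account used by jaeger-operator, which produced forbidden (permission) errors.

Normally the Role-related objects defined in the CSV are created and bound automatically, so the installation completes without problems.
They can go missing if a user has forcibly deleted the Role or if something went wrong while the Operator was being set up.

The issue was therefore resolved by creating the ClusterRole and ClusterRoleBinding manually and binding them to the Service Account.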

- Jaeger Operator log

[root@bastion ~]# oc logs -f jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
time="2022-03-02T05:21:23Z" level=info msg=Versions arch=amd64 identity=openshift-operators.jaeger-operator jaeger=1.30.0 jaeger-operator=v1.30.0 os=linux version=go1.17.2
I0302 05:21:24.627546       1 request.go:668] Waited for 1.036315321s due to client-side throttling, not priority and fairness, request: GET:https://10.200.0.1:443/apis/storage.k8s.io/v1?timeout=32s
2022-03-02T05:21:29.086Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "0.0.0.0:8383"}
time="2022-03-02T05:21:29Z" level=info msg="Auto-detected the platform" platform=openshift
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of deployments to analyze" error="deployments.apps is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"deployments\" in API group \"apps\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of existing jaeger instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=warning msg="failed to upgrade managed instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=info msg="Updated OAuth Proxy image flag" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:225521bf209fa3d3450b1778eda2ac15bbda786c8c224195e120ebd1bf789b47"
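
The forbidden errors in this log can be confirmed independently of the Operator by asking the API server what the Service Account is allowed to do. The sketch below uses `oc auth can-i` with impersonation; the function name is hypothetical, and the caller needs permission to impersonate Service Accounts (a cluster administrator normally has it).

```shell
# Hypothetical helper: check the two cluster-scoped "list" permissions that
# the log reports as forbidden for the jaeger-operator Service Account.
check_jaeger_rbac() {
  sa="system:serviceaccount:openshift-operators:jaeger-operator"
  for res in deployments.apps jaegers.jaegertracing.io; do
    # `oc auth can-i` prints yes/no and exits non-zero on "no".
    answer="$(oc auth can-i list "$res" --as="$sa" --all-namespaces 2>/dev/null || true)"
    echo "list $res: ${answer:-unknown}"
  done
}
```

Running `check_jaeger_rbac` before and after the RBAC objects are created gives a quick way to see whether the binding took effect.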

- Creating / adding the ClusterRole

[root@bastion ~]# vi jaeger-operator-cluster-role.yaml
kind: ClusterRole 
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
rules:
  - verbs:
      - create
    apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
  - verbs:
      - create
    apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - deployments
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - apps
    resources:
      - deployments/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - console.openshift.io
    resources:
      - consolelinks
  - verbs:
      - create
      - get
      - list
      - update
    apiGroups:
      - coordination.k8s.io
    resources:
      - leases
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - configmaps
      - persistentvolumeclaims
      - pods
      - secrets
      - serviceaccounts
      - services
      - services/finalizers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - namespaces
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - ''
    resources:
      - namespaces/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - extensions
    resources:
      - ingresses
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - image.openshift.io
    resources:
      - imagestreams
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers
  - verbs:
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/finalizers
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - kafka.strimzi.io
    resources:
      - kafkas
      - kafkausers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - logging.openshift.io
    resources:
      - elasticsearches
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - monitoring.coreos.com
    resources:
      - servicemonitors
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - route.openshift.io
    resources:
      - routes
[root@bastion ~]# oc create -f jaeger-operator-cluster-role.yaml

- Creating / adding the ClusterRoleBinding

Bind the ClusterRole to the Service Account used by the Jaeger Operator.

[root@bastion ~]# vi jaeger-operator-cluster-role-binding.yaml
kind: ClusterRoleBinding 
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
subjects:
  - kind: ServiceAccount
    name: jaeger-operator
    namespace: openshift-operators
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: jaeger-operator.v1.30.0
[root@bastion ~]# oc create -f jaeger-operator-cluster-role-binding.yaml

- Restarting the Jaeger Operator Pod

Within a few seconds of the Pod restarting, the Operator picked up the permissions bound to its Service Account and its status changed to Succeeded, which resolved the issue.

[root@bastion ~]# oc delete pod jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
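
The resulting Operator status can also be read from the CSV on the command line. This is a small sketch with a hypothetical function name; the CSV name `jaeger-operator.v1.30.0` is taken from the `olm.owner` label used above.

```shell
# Hypothetical helper: print the phase of a ClusterServiceVersion.
# $1 = CSV name, $2 = namespace
csv_phase() {
  oc get csv "$1" -n "$2" -o jsonpath='{.status.phase}'
}
```

For this case the call would be `csv_phase jaeger-operator.v1.30.0 openshift-operators`.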