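This post summarizes Operator issues encountered in a customer's OpenShift v4.x environment and how they were resolved.
The kafka, redis, and nexus Operators had been installed and used from a Catalog Source built for a disconnected (air-gapped) network. When the Operator configuration was switched to the online environment, the installed CSV (ClusterServiceVersion) versions no longer matched the catalog, which caused the issue.
It was resolved by removing the CSV version information from the existing Operator Subscriptions and restarting the OLM (Operator Lifecycle Manager) and CVO (Cluster Version Operator) services, which manage the Operators' Package Manifest information.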
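The fix described above could be scripted roughly as follows. This is a sketch, not the exact commands that were run on site: the Subscription name is whatever you pass in, and the namespaces and pod labels (`app=olm-operator`, `app=catalog-operator`, `k8s-app=cluster-version-operator`) are assumptions based on a default OpenShift 4 install that should be confirmed with `oc get pods --show-labels` first.

```shell
# Sketch only: names, namespaces, and labels below are assumptions to verify.

# Remove spec.startingCSV from a Subscription so OLM resolves the CSV
# from whatever the current Catalog Source offers.
# The JSON patch fails if the field is not present.
remove_starting_csv() {
  sub_name="$1"
  sub_ns="${2:-openshift-operators}"
  oc patch subscription "$sub_name" -n "$sub_ns" --type=json \
    -p '[{"op": "remove", "path": "/spec/startingCSV"}]'
}

# Restart the OLM and CVO pods by deleting them; their Deployments
# recreate them.
restart_olm_and_cvo() {
  oc delete pod -n openshift-operator-lifecycle-manager -l app=olm-operator
  oc delete pod -n openshift-operator-lifecycle-manager -l app=catalog-operator
  oc delete pod -n openshift-cluster-version -l k8s-app=cluster-version-operator
}
```

A typical call would be `remove_starting_csv <subscription-name>` followed by `restart_olm_and_cvo`, with the Subscription name taken from `oc get subscription -n openshift-operators`.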
In OLM's logic, the Unknown state has two broad causes.
The first is a Subscription that names the wrong Catalog Source at install time, or a Catalog Source that has since been deleted.
The second is a Subscription whose startingCSV version does not match the Catalog Source manifests, or whose CSV has been removed from them.
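The two causes can be checked from the command line. The sketch below is one possible way to do it, under a few assumptions: the function names are made up for illustration, the PackageManifest is looked up in the Catalog Source's namespace, and only the channel heads (`currentCSV`) are compared, so an older CSV that still exists deeper in a channel is also reported as a mismatch.

```shell
# Sketch: report which of the two causes applies to a Subscription.
# Prints "catalog-source-missing", "starting-csv-not-at-channel-head", or "ok".

# usage: sub_field <name> <namespace> <jsonpath>
sub_field() {
  oc get subscription "$1" -n "$2" -o jsonpath="$3"
}

check_subscription() {
  sub_name="$1"
  sub_ns="${2:-openshift-operators}"
  catalog=$(sub_field "$sub_name" "$sub_ns" '{.spec.source}')
  catalog_ns=$(sub_field "$sub_name" "$sub_ns" '{.spec.sourceNamespace}')
  package=$(sub_field "$sub_name" "$sub_ns" '{.spec.name}')
  starting_csv=$(sub_field "$sub_name" "$sub_ns" '{.spec.startingCSV}')

  # Cause 1: the referenced Catalog Source is misnamed or was deleted.
  if ! oc get catalogsource "$catalog" -n "$catalog_ns" >/dev/null 2>&1; then
    echo "catalog-source-missing"
    return 1
  fi

  # No startingCSV pinned: nothing to mismatch.
  if [ -z "$starting_csv" ]; then
    echo "ok"
    return 0
  fi

  # Cause 2: startingCSV is not one of the channel heads in the catalog.
  heads=$(oc get packagemanifest "$package" -n "$catalog_ns" \
    -o jsonpath='{.status.channels[*].currentCSV}')
  for csv in $heads; do
    if [ "$csv" = "$starting_csv" ]; then
      echo "ok"
      return 0
    fi
  done
  echo "starting-csv-not-at-channel-head"
  return 1
}
```

Running `check_subscription <subscription-name>` against each affected Subscription narrows down which of the two cases needs fixing.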
The root cause was therefore that the CSV manifests of the already-installed Operator versions had been removed from the Catalog Source image once it was updated to the latest version, while the previously created Operator Subscriptions still referenced the old CSV version.
Top: Strimzi Kafka Operator CSV information in the Custom Catalog Source built for the disconnected environment
Bottom: Strimzi Kafka Operator CSV information in the latest online Catalog Source
The issues occurred with the Strimzi Kafka Operator and the Red Hat OpenShift distributed tracing platform (Jaeger) Operator.
The Kafka issue was triggered when the DevOps team changed the PVC size of a Kafka instance in use from 2Gi to 10Gi.
In other words, they used the PVC expansion (expand) feature, but the NetApp Trident CSI Storage Class configured on the development server had PVC expansion disabled.
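Whether a Storage Class permits expansion can be checked before resizing. A minimal sketch, assuming the helper name `sc_allows_expansion` (hypothetical) and treating an unset `allowVolumeExpansion` field as disabled:

```shell
# Sketch: succeed (exit 0) only if the StorageClass allows PVC expansion.
# An empty result means the field is unset, which counts as disabled.
sc_allows_expansion() {
  value=$(oc get storageclass "$1" -o jsonpath='{.allowVolumeExpansion}')
  [ "$value" = "true" ]
}
```

For example, `sc_allows_expansion ontap-sc-delete && echo expandable || echo "not expandable"` would have flagged the problem before the Kafka resource was edited.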
[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  ...
  ......
  kafka:
    storage:
      class: ontap-sc-delete
      size: 2Gi
      type: persistent-claim
    version: 3.0.0
  ...
  ......
[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  ...
  ......
  kafka:
    storage:
      class: ontap-sc-delete
      size: 10Gi
      type: persistent-claim
    version: 3.0.0
  ...
  ......
[root@bastion ~]# oc logs -f strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators
2022-03-02 05:29:28 INFO ClusterOperator:123 - Triggering periodic reconciliation for namespace *
2022-03-02 05:29:28 INFO AbstractOperator:226 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Kafka my-cluster will be checked for creation or modification
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-0 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-1 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-2 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:31 INFO AbstractOperator:517 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): reconciled
After consulting the Trident documentation [1], the issue was resolved by adding the allowVolumeExpansion option to the Storage Class so that PVCs can be expanded.
[1]: NetApp: Expand Volumes
[root@bastion ~]# oc edit storageclass ontap-sc-delete
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-delete
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
[root@bastion ~]# oc edit storageclass ontap-sc-retain
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-retain
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: Immediate
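The same change can be applied without opening an editor, which is easier to repeat across several Storage Classes. The following is a sketch rather than what was run here; the function names are hypothetical, and the second helper simply lists requested versus actual PVC capacity so the resize can be confirmed afterwards.

```shell
# Sketch: enable PVC expansion on a StorageClass non-interactively.
enable_sc_expansion() {
  oc patch storageclass "$1" -p '{"allowVolumeExpansion": true}'
}

# List requested vs. actual capacity for every PVC in a namespace.
show_pvc_sizes() {
  oc get pvc -n "$1" \
    -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,ACTUAL:.status.capacity.storage
}
```

For instance, `enable_sc_expansion ontap-sc-delete` followed by `show_pvc_sizes ybkim` after the next reconciliation should show whether the Kafka PVCs have reached the requested size.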
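The Jaeger Operator provides end-to-end tracing of services when the Service Mesh Operator (Istio) is used.
During the InstallPlan step of the Operator installation, the ClusterRole and ClusterRoleBinding included in the CSV (ClusterServiceVersion) were not bound to the Service Account used by jaeger-operator, which produced forbidden permission errors.
Normally the Roles defined in a CSV are created and granted automatically, so the installation completes without problems. They can go missing if a user forcibly deleted the Role or if something went wrong during the Operator setup.
The issue was therefore resolved by creating the ClusterRole and ClusterRoleBinding separately and binding them to the Service Account.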
[root@bastion ~]# oc logs -f jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
time="2022-03-02T05:21:23Z" level=info msg=Versions arch=amd64 identity=openshift-operators.jaeger-operator jaeger=1.30.0 jaeger-operator=v1.30.0 os=linux version=go1.17.2
I0302 05:21:24.627546 1 request.go:668] Waited for 1.036315321s due to client-side throttling, not priority and fairness, request: GET:https://10.200.0.1:443/apis/storage.k8s.io/v1?timeout=32s
2022-03-02T05:21:29.086Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "0.0.0.0:8383"}
time="2022-03-02T05:21:29Z" level=info msg="Auto-detected the platform" platform=openshift
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of deployments to analyze" error="deployments.apps is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"deployments\" in API group \"apps\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of existing jaeger instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=warning msg="failed to upgrade managed instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=info msg="Updated OAuth Proxy image flag" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:225521bf209fa3d3450b1778eda2ac15bbda786c8c224195e120ebd1bf789b47"
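The forbidden errors in the log can be reproduced directly by asking the API server what the Service Account is allowed to do. A small sketch, assuming a hypothetical helper name `sa_can` and a logged-in user with impersonation rights (for example cluster-admin):

```shell
# Sketch: check whether a ServiceAccount may perform a verb on a resource
# across all namespaces. Prints "yes" or "no" as reported by the API server.
# usage: sa_can <sa-namespace> <sa-name> <verb> <resource>
sa_can() {
  sa_ns="$1"
  sa_name="$2"
  verb="$3"
  resource="$4"
  oc auth can-i "$verb" "$resource" --all-namespaces \
    --as="system:serviceaccount:${sa_ns}:${sa_name}"
}
```

For the errors above, `sa_can openshift-operators jaeger-operator list deployments.apps` and `sa_can openshift-operators jaeger-operator list jaegers.jaegertracing.io` are the two checks of interest, both before and after the ClusterRoleBinding is created.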
[root@bastion ~]# vi jaeger-operator-cluster-role.yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
rules:
  - verbs:
      - create
    apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
  - verbs:
      - create
    apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - deployments
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - apps
    resources:
      - deployments/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - console.openshift.io
    resources:
      - consolelinks
  - verbs:
      - create
      - get
      - list
      - update
    apiGroups:
      - coordination.k8s.io
    resources:
      - leases
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - configmaps
      - persistentvolumeclaims
      - pods
      - secrets
      - serviceaccounts
      - services
      - services/finalizers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - namespaces
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - ''
    resources:
      - namespaces/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - extensions
    resources:
      - ingresses
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - image.openshift.io
    resources:
      - imagestreams
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers
  - verbs:
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/finalizers
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - kafka.strimzi.io
    resources:
      - kafkas
      - kafkausers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - logging.openshift.io
    resources:
      - elasticsearches
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - monitoring.coreos.com
    resources:
      - servicemonitors
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - route.openshift.io
    resources:
      - routes
[root@bastion ~]# oc create -f jaeger-operator-cluster-role.yaml
Bind the ClusterRole to the Service Account used by the Jaeger Operator.
[root@bastion ~]# vi jaeger-operator-cluster-role-binding.yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
subjects:
  - kind: ServiceAccount
    name: jaeger-operator
    namespace: openshift-operators
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: jaeger-operator.v1.30.0
[root@bastion ~]# oc create -f jaeger-operator-cluster-role-binding.yaml
After the Pod restarts, the Operator picks up the permissions bound to its Service Account within a few seconds and its status changes to Succeeded, which resolved the issue.
[root@bastion ~]# oc delete pod jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
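Once the Pod has been deleted, the transition to Succeeded can be confirmed by polling the CSV phase. The sketch below is illustrative: the function name, the default of 30 attempts, and the 2-second delay are assumptions, not values taken from this environment.

```shell
# Sketch: poll a CSV until its phase is Succeeded or the attempts run out.
# usage: wait_for_csv_succeeded <csv-name> [namespace] [attempts] [delay-seconds]
wait_for_csv_succeeded() {
  csv_name="$1"
  csv_ns="${2:-openshift-operators}"
  attempts="${3:-30}"
  delay="${4:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(oc get csv "$csv_name" -n "$csv_ns" -o jsonpath='{.status.phase}')
    if [ "$phase" = "Succeeded" ]; then
      echo "Succeeded"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out; last phase: ${phase:-unknown}"
  return 1
}
```

Here that would be `wait_for_csv_succeeded jaeger-operator.v1.30.0`, using the CSV name that appears in the ClusterRole labels.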


