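Previously installed Operators were found to be in the Pending state.
The affected Operators were strimzi-kafka-operator and jaeger-product.
The Strimzi issue occurs when the PVC size of a Kafka instance is changed.
The customer environment integrates the Trident CSI driver from the storage vendor NetApp into OpenShift to provide Storage Class support.
Because CSI (Container Storage Interface) is in use, when a user creates a PVC (Persistent Volume Claim),
the matching PV (Persistent Volume) is dynamically provisioned, so the user only has to define the required storage capacity in the PVC
and the volume is created and ready to use automatically.
In this environment, the user initially set the storage size to 2Gi through the Kafka custom resource, as shown below, and then raised it to 10Gi, which triggered the issue.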
[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      inter.broker.protocol.version: '3.0'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 2Gi
      type: persistent-claim
    version: 3.0.0
  zookeeper:
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 1Gi
      type: persistent-claim
[root@bastion ~]# oc get kafka -o yaml -n ybkim
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: test-ybkim
  namespace: ybkim
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 3
      inter.broker.protocol.version: '3.0'
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 10Gi
      type: persistent-claim
    version: 3.0.0
  zookeeper:
    jvmOptions:
      '-Xms': 4g
      '-Xmx': 4g
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 2
        memory: 8Gi
    storage:
      class: ontap-sc-delete
      size: 1Gi
      type: persistent-claim
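The post does not say how the size was edited, so the following is only a sketch of one way to read and change the broker storage size from the command line. The resource name and namespace come from the manifests above; the helper function names are hypothetical. A JSON merge patch is used because custom resources do not support the default strategic merge patch.

```shell
# Sketch only: helper names are hypothetical; test-ybkim/ybkim come from the manifests above.

# Print the storage size currently requested for the Kafka brokers.
kafka_storage_size() {
  local name="$1" ns="$2"
  oc get kafka "$name" -n "$ns" -o jsonpath='{.spec.kafka.storage.size}'
}

# Request a new broker storage size with a JSON merge patch.
set_kafka_storage_size() {
  local name="$1" ns="$2" size="$3"
  oc patch kafka "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"kafka\":{\"storage\":{\"size\":\"${size}\"}}}}"
}

# Usage:
#   kafka_storage_size test-ybkim ybkim
#   set_kafka_storage_size test-ybkim ybkim 10Gi
```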
Once the Kafka storage size is changed to 10Gi,
the Strimzi Cluster Operator updates the StatefulSet and restarts the Kafka pods.
The strimzi-cluster-operator pod then prints the log shown below, and the Operator status stays stuck at Pending.
[root@bastion ~]# oc logs -f strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators
2022-03-02 05:29:28 INFO ClusterOperator:123 - Triggering periodic reconciliation for namespace *
2022-03-02 05:29:28 INFO AbstractOperator:226 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Kafka test-ybkim will be checked for creation or modification
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-0 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-1 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:30 WARN KafkaAssemblyOperator:2911 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): Storage Class ontap-sc-delete does not support resizing of volumes. PVC data-test-ybkim-kafka-2 cannot be resized. Reconciliation will proceed without reconciling this PVC.
2022-03-02 05:29:31 INFO AbstractOperator:517 - Reconciliation #14186(timer) Kafka(test-ybkim/ybkim): reconciled
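With several brokers, the same warning repeats once per PVC on every reconciliation. As a small convenience, the affected PVC names can be pulled out of the log. This is a sketch that relies only on the warning wording shown above; the function name is hypothetical.

```shell
# Read operator log lines on stdin and print each PVC the operator refused to resize.
# Relies on the "PVC <name> cannot be resized" wording from the warnings above.
unresizable_pvcs() {
  grep -o 'PVC [^ ]* cannot be resized' | awk '{print $2}' | sort -u
}

# Usage (pod name taken from the log command above):
#   oc logs strimzi-cluster-operator-v0.28.0-6c9c45c46-wcfcc -n openshift-operators | unresizable_pvcs
```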
These warnings occur because the Storage Class does not support PVC expansion.
According to the Trident documentation [1], the allowVolumeExpansion: true option can be set on the Storage Class.
Adding this option to the existing Storage Class, as shown below, enables expansion and resolves the issue.
[root@bastion ~]# oc edit storageclass ontap-sc-delete
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ontap-sc-delete
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
[1]: NetApp: Trident Expand Volumes
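The same change can probably be made without opening an editor. The sketch below wraps three standard oc commands to check the flag, set it with a patch, and compare the requested size with the actual capacity of the PVCs afterwards. The function names are hypothetical, and the Storage Class and namespace names are the ones used in this post.

```shell
# Sketch only: function names are hypothetical.

# Print "true" if the StorageClass allows expansion; prints nothing when the field is unset.
sc_allows_expansion() {
  oc get storageclass "$1" -o jsonpath='{.allowVolumeExpansion}'
}

# Set allowVolumeExpansion: true without opening an editor.
enable_volume_expansion() {
  oc patch storageclass "$1" -p '{"allowVolumeExpansion": true}'
}

# List requested size versus actual capacity for every PVC in a namespace.
pvc_sizes() {
  oc get pvc -n "$1" \
    -o custom-columns='NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,CAPACITY:.status.capacity.storage'
}

# Usage:
#   sc_allows_expansion ontap-sc-delete
#   enable_volume_expansion ontap-sc-delete
#   pvc_sizes ybkim
```

If REQUESTED and CAPACITY still differ some time after the next reconciliation, the resize has not completed yet.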
The jaeger-operator pod, installed in All Namespaces mode, logged the following.
[root@bastion ~]# oc logs -f jaeger-operator-5fdcddd4bc-2dxq6 -n openshift-operators
time="2022-03-02T05:21:23Z" level=info msg=Versions arch=amd64 identity=openshift-operators.jaeger-operator jaeger=1.30.0 jaeger-operator=v1.30.0 os=linux version=go1.17.2
I0302 05:21:24.627546 1 request.go:668] Waited for 1.036315321s due to client-side throttling, not priority and fairness, request: GET:https://10.200.0.1:443/apis/storage.k8s.io/v1?timeout=32s
2022-03-02T05:21:29.086Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "0.0.0.0:8383"}
time="2022-03-02T05:21:29Z" level=info msg="Auto-detected the platform" platform=openshift
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'es-provision' flag" es-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="Automatically adjusted the 'kafka-provision' flag" kafka-provision=yes
time="2022-03-02T05:21:29Z" level=info msg="The service account running this operator does not have the role 'system:auth-delegator', consider granting it for additional capabilities"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of deployments to analyze" error="deployments.apps is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"deployments\" in API group \"apps\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=error msg="error getting a list of existing jaeger instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=warning msg="failed to upgrade managed instances" error="jaegers.jaegertracing.io is forbidden: User \"system:serviceaccount:openshift-operators:jaeger-operator\" cannot list resource \"jaegers\" in API group \"jaegertracing.io\" at the cluster scope"
time="2022-03-02T05:21:29Z" level=info msg="Updated OAuth Proxy image flag" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:225521bf209fa3d3450b1778eda2ac15bbda786c8c224195e120ebd1bf789b47"
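These log entries show a forbidden permission error: during the InstallPlan step of the Operator installation,
the ClusterRole and ClusterRoleBinding defined in the CSV (ClusterServiceVersion) object were not bound to the Service Account used by jaeger-operator.
Normally the Role-related settings in the CSV are created automatically and the permissions are granted, so the installation completes without problems.
However, they can go missing if a user deletes the Role by mistake or if something goes wrong during the Operator installation.
The issue is therefore resolved by creating the ClusterRole and ClusterRoleBinding and binding them to the Service Account.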
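The missing permissions can be confirmed directly before anything is recreated. This sketch wraps oc auth can-i with impersonation of the operator's Service Account; the function name is hypothetical, and the caller is assumed to have impersonation rights (for example, cluster-admin).

```shell
# Ask the API server whether a service account may perform a verb on a resource
# across all namespaces. Prints "yes" or "no"; the exit status is non-zero for "no".
sa_can_i() {
  local sa_ns="$1" sa_name="$2" verb="$3" resource="$4"
  oc auth can-i "$verb" "$resource" --all-namespaces \
    --as="system:serviceaccount:${sa_ns}:${sa_name}"
}

# Usage (the two checks that failed in the log above):
#   sa_can_i openshift-operators jaeger-operator list deployments.apps
#   sa_can_i openshift-operators jaeger-operator list jaegers.jaegertracing.io
```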
[root@bastion ~]# vi jaeger-operator-cluster-role.yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
rules:
  - verbs:
      - create
    apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
  - verbs:
      - create
    apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - apps
    resources:
      - deployments
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - apps
    resources:
      - deployments/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - console.openshift.io
    resources:
      - consolelinks
  - verbs:
      - create
      - get
      - list
      - update
    apiGroups:
      - coordination.k8s.io
    resources:
      - leases
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - configmaps
      - persistentvolumeclaims
      - pods
      - secrets
      - serviceaccounts
      - services
      - services/finalizers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - ''
    resources:
      - namespaces
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - ''
    resources:
      - namespaces/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - extensions
    resources:
      - ingresses
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - image.openshift.io
    resources:
      - imagestreams
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers
  - verbs:
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/finalizers
  - verbs:
      - get
      - patch
      - update
    apiGroups:
      - jaegertracing.io
    resources:
      - jaegers/status
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - kafka.strimzi.io
    resources:
      - kafkas
      - kafkausers
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - logging.openshift.io
    resources:
      - elasticsearches
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - monitoring.coreos.com
    resources:
      - servicemonitors
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - clusterrolebindings
  - verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
    apiGroups:
      - route.openshift.io
    resources:
      - routes
[root@bastion ~]# oc create -f jaeger-operator-cluster-role.yaml
Bind the ClusterRole to the Service Account used by the Jaeger Operator.
[root@bastion ~]# vi jaeger-operator-cluster-role-binding.yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jaeger-operator.v1.30.0
  labels:
    olm.owner: jaeger-operator.v1.30.0
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operators
    operators.coreos.com/jaeger-product.openshift-operators: ''
subjects:
  - kind: ServiceAccount
    name: jaeger-operator
    namespace: openshift-operators
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: jaeger-operator.v1.30.0
[root@bastion ~]# oc create -f jaeger-operator-cluster-role-binding.yaml
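After the Role is in place, restart the Jaeger Operator pod. A few seconds later
the operator picks up the permissions bound to its Service Account, the Operator status changes to Succeeded, and the issue is resolved.
Consider every situation from A to Z, trust no one, and when you are pressed for time, the solution comes into view?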
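The restart and the status check could look roughly like this. It is a sketch: the function names are hypothetical, the pod name is the one from the log command earlier in this post (it changes after every restart), and the CSV name is assumed to match the olm.owner label in the manifests.

```shell
# Sketch only: function names are hypothetical.

# Delete the operator pod so that its Deployment recreates it with the new permissions.
restart_operator_pod() {
  local pod="$1" ns="$2"
  oc delete pod "$pod" -n "$ns"
}

# Print the phase of a ClusterServiceVersion, for example Pending or Succeeded.
csv_phase() {
  local csv="$1" ns="$2"
  oc get csv "$csv" -n "$ns" -o jsonpath='{.status.phase}'
}

# Usage:
#   restart_operator_pod jaeger-operator-5fdcddd4bc-2dxq6 openshift-operators
#   csv_phase jaeger-operator.v1.30.0 openshift-operators
```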
