Test Case Manual Log
anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup 
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Disabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Disabled 
  name: master
  platform: {}
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External

anan@think:~/works/openshift-versions/works$ ../421nightly/openshift-install create cluster
INFO Credentials loaded from the "default" profile in file "/home/anan/.aws/credentials" 
INFO Successfully populated MCS CA cert information: root-ca 2035-12-20T13:48:01Z 2025-12-22T13:48:01Z 
INFO Successfully populated MCS TLS cert information: root-ca 2035-12-20T13:48:01Z 2025-12-22T13:48:01Z 
INFO Credentials loaded from the AWS config using "SharedConfigCredentials: /home/anan/.aws/credentials" provider 
INFO Consuming Install Config from target directory 
INFO Adding clusters...                           
INFO Creating infrastructure resources...         
INFO Reconciling IAM roles for control-plane and compute nodes 
INFO Creating IAM role for master                 
INFO Creating IAM role for worker                 
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /home/anan/works/openshift-versions/works/.clusterapi_output/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:40405 --webhook-port=43869 --webhook-cert-dir=/tmp/envtest-serving-certs-3385276705 --kubeconfig=/home/anan/works/openshift-versions/works/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:36761 --webhook-port=43653 --webhook-cert-dir=/tmp/envtest-serving-certs-160783845 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true,TagUnmanagedNetworkResources=false,EKS=false,MachinePool=false --kubeconfig=/home/anan/works/openshift-versions/works/.clusterapi_output/envtest.kubeconfig] 
INFO Creating infra manifests...                  
INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests 
INFO Created manifest *v1beta2.AWSClusterControllerIdentity, namespace= name=default 
I1222 21:49:12.381469 3467894 warning_handler.go:65] "cluster.x-k8s.io/v1beta1 Cluster is deprecated; use cluster.x-k8s.io/v1beta2 Cluster" logger="KubeAPIWarningLogger"
INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=weli-test-s85s4 
INFO Created manifest *v1beta2.AWSCluster, namespace=openshift-cluster-api-guests name=weli-test-s85s4 
INFO Done creating infra manifests                
INFO Creating kubeconfig entry for capi cluster weli-test-s85s4 
INFO Waiting up to 15m0s (until 10:04PM CST) for network infrastructure to become ready... 
INFO Network infrastructure is ready              
INFO Creating Route53 records for control plane load balancer 
INFO Created private Hosted Zone                  
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-bootstrap 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-0 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-1 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-2 
I1222 21:56:18.717475 3467894 warning_handler.go:65] "cluster.x-k8s.io/v1beta1 Machine is deprecated; use cluster.x-k8s.io/v1beta2 Machine" logger="KubeAPIWarningLogger"
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-bootstrap 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-0 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-1 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master-2 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=weli-test-s85s4-bootstrap 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=weli-test-s85s4-master 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=weli-test-s85s4-worker 
INFO Waiting up to 15m0s (until 10:11PM CST) for machines [weli-test-s85s4-bootstrap weli-test-s85s4-master-0 weli-test-s85s4-master-1 weli-test-s85s4-master-2] to provision... 
INFO Control-plane machines are ready             
INFO Cluster API resources have been created. Waiting for cluster to become ready... 
INFO Waiting up to 20m0s (until 10:16PM CST) for the Kubernetes API at https://api.weli-test.qe.devcluster.openshift.com:6443... 
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 100.51.90.116:6443: connect: connection refused 
ERROR Bootstrap failed to complete: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/version": dial tcp 34.232.16.112:6443: connect: connection refused 
ERROR Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane. 
INFO Pulling Cluster API artifacts                
INFO Pulling VM console logs                      
INFO Pulling debug logs from the bootstrap machine 
ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected 
INFO Bootstrap gather logs captured here "log-bundle-20251222221650.tar.gz" 
INFO Shutting down local Cluster API controllers... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: aws infrastructure provider 
INFO Shutting down local Cluster API control plane... 
INFO Local Cluster API system has completed operations 
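
To inspect the failure locally, the gathered bundle named in the log above can be unpacked (a minimal sketch; the bundle extracts into a directory of the same name):

tar -xzf log-bundle-20251222221650.tar.gz
ls log-bundle-20251222221650/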


anan@think:~/works/openshift-versions/works$ head -n 20 install-config.yaml
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Disabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled 
  name: master
  platform: {}
  replicas: 3

INFO Waiting up to 20m0s (until 11:17PM CST) for the Kubernetes API at https://api.weli-test.qe.devcluster.openshift.com:6443... 
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 98.85.31.4:6443: connect: connection refused 
ERROR Bootstrap failed to complete: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/version": dial tcp 52.23.16.72:6443: connect: connection refused 
ERROR Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane. 
INFO Pulling Cluster API artifacts                
INFO Pulling VM console logs                      
INFO Pulling debug logs from the bootstrap machine 
ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected 
INFO Bootstrap gather logs captured here "log-bundle-20251222231737.tar.gz" 
INFO Shutting down local Cluster API controllers... 

weli@tower ~/works/oc-swarm/openshift-progress/works/log-bundle-20251222231737 
❯ pwd
/Users/weli/works/oc-swarm/openshift-progress/works/log-bundle-20251222231737

OCP-23544 Log Analysis Report (Second Test)

Test Scenario

OCP-23544: [ipi-on-aws] [Hyperthreading] Create cluster with hyperthreading disabled with default instance size.

Expected result: cluster creation fails

Test time: 2025-12-22 22:49:44 (UTC+8)

Configuration Verification

✅ Install Config is correct

From rendered-assets/openshift/manifests/cluster-config.yaml:

compute:
- architecture: amd64
  hyperthreading: Disabled  # ✅ hyperthreading disabled on worker nodes (as required)
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled   # ✅ hyperthreading enabled on master nodes (as required)
  name: master
  platform: {}
  replicas: 3

Key Improvements

  • ✅ Worker nodes: hyperthreading: Disabled (matches the test requirement)
  • ✅ ControlPlane nodes: hyperthreading: Enabled (matches the test requirement)
  • ✅ The configuration fully satisfies the OCP-23544 test requirements

✅ Instance Type Configuration

Confirmed from multiple configuration files:

  • Master nodes: m6i.xlarge ✅ (default instance size)
  • Worker nodes: m6i.xlarge ✅ (default instance size)
  • Bootstrap node: m6i.xlarge

All nodes use the default instance size m6i.xlarge, as required by the test.
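
This can be double-checked from the extracted log bundle by searching the rendered assets for instance types (a hedged example; the path is relative to the bundle root):

grep -r "instanceType:" rendered-assets/ | sort | uniq -c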

✅ MachineConfig Configuration

From 99-worker-disable-hyperthreading.yaml:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-disable-hyperthreading
spec:
  kernelArguments:
  - nosmt
  - smt-enabled=off

Key Findings

  • ✅ Worker nodes are correctly configured with the kernel arguments that disable hyperthreading
  • ✅ There is no MachineConfig that disables hyperthreading for master nodes (as required); see the cross-check below
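
A quick cross-check is to list hyperthreading-related MachineConfig manifests in the bundle (a hedged sketch); only the worker manifest is expected to match:

grep -rl "disable-hyperthreading" rendered-assets/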

Cluster Status Analysis

❌ Worker Node Creation Status

From clusterapi/Cluster-openshift-cluster-api-guests-weli-test-87ndj.yaml:

conditions:
- type: WorkersAvailable
  reason: NoWorkers
  status: "True"
- type: WorkerMachinesReady
  reason: NoReplicas
  status: "True"
- type: WorkerMachinesUpToDate
  reason: NoReplicas
  status: "True"

Key Findings

  • No worker nodes were created (NoWorkers, NoReplicas)
  • ✅ Master nodes were created (all 3 masters have corresponding AWSMachine objects)
  • ✅ The bootstrap node was created

MachineSet Status

From 99_openshift-cluster-api_worker-machineset-0.yaml:

status:
  replicas: 0  # ❌ no worker instances were created

All 5 worker MachineSets (us-east-1a through us-east-1d, plus us-east-1f) have replicas set to 0.
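
This can be confirmed across all worker MachineSet manifests in the bundle (a hedged example; the file-name pattern follows the manifest referenced above):

grep -H "replicas:" $(find . -name '99_openshift-cluster-api_worker-machineset-*.yaml')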

AWSMachine Objects

Only the following AWSMachine objects were found in the clusterapi/ directory:

  • AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-bootstrap.yaml
  • AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml
  • AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-1.yaml
  • AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-2.yaml
  • No AWSMachine objects exist for worker nodes

⚠️ Control Plane Status

From the Cluster status:

conditions:
- type: ControlPlaneAvailable
  reason: InternalError
  status: "Unknown"
  message: "Please check controller logs for errors"
- type: ControlPlaneMachinesReady
  reason: NotReady
  status: "False"
  message: "Waiting for Cluster control plane to be initialized"
- type: InfrastructureReady
  reason: Ready
  status: "True"

Key Findings

  • ✅ Infrastructure is ready
  • ⚠️ The Control Plane reports an internal error
  • ⚠️ The Control Plane machines are not ready (waiting for initialization)

Master Node Instance Status

From AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml:

status:
  ready: true
  instancestate: running
  addresses:
  - type: InternalIP
    address: 10.0.53.192
  conditions:
  - type: Ready
    status: "True"
  - type: InstanceReady
    status: "True"

Key Findings

  • ✅ The AWS instances for the master nodes were created and are running
  • ✅ The master node instances report ready: true
  • ⚠️ The Control Plane as a whole is still not ready

⚠️ Timeout Errors

From the serial console logs:

[weli-test-87ndj-master-0-serial.log]
ignition[831]: GET error: Get "https://api-int.weli-test.qe.devcluster.openshift.com:22623/config/master": 
  dial tcp 10.0.108.249:22623: i/o timeout

[weli-test-87ndj-master-2-serial.log]
ignition[781]: GET error: Get "https://api-int.weli-test.qe.devcluster.openshift.com:22623/config/master": 
  dial tcp 10.0.2.4:22623: i/o timeout

Key Findings

  • ⚠️ The master nodes time out while trying to fetch their configuration from the bootstrap node
  • ⚠️ This suggests a problem with the bootstrap node or the network (see the probe example below)
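
If the bootstrap host is reachable (for example over SSH from inside the VPC), the Machine Config Server endpoint from the error above can be probed directly (a hedged sketch):

curl -k -o /dev/null -w '%{http_code}\n' https://api-int.weli-test.qe.devcluster.openshift.com:22623/config/master

A timeout here would point to the same connectivity problem seen in the serial logs.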

CPU Options Analysis

From AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml:

instancetype: m6i.xlarge
cpuoptions:
  confidentialcompute: ""

Key Findings

  • ⚠️ cpuoptions.ThreadsPerCore is not set
  • For the m6i.xlarge instance type, disabling hyperthreading at the AWS level requires ThreadsPerCore: 1
  • The current configuration may only disable hyperthreading at the operating-system level (via kernel arguments), not at the AWS instance level
  • This may be the crux of the issue: worker nodes would need hyperthreading disabled at the AWS instance level, but the configuration does not set it (see the AWS CLI check below)
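
The effective CPU options of the running instances can also be read back from AWS (a hedged example; INFRA_ID is assumed to be set to the infra ID from the bundle, e.g. weli-test-87ndj):

aws ec2 describe-instances --region us-east-1 \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value,InstanceType,CpuOptions.CoreCount,CpuOptions.ThreadsPerCore]' \
  --output table

A ThreadsPerCore value of 2 would mean SMT is still enabled at the AWS level, regardless of the nosmt kernel arguments.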

Failure Cause Analysis

Possible Causes

  1. Why no worker nodes were created

    • The Control Plane never finished initializing, so worker nodes could not be created
    • Or worker node creation failed outright (since there are no worker AWSMachine objects in the logs, creation may never have been attempted)
  2. Control Plane initialization failure

    • The master node instances were created and are running
    • But the Control Plane components never finished initializing
    • Possible reasons:
      • Insufficient resources (masters have hyperthreading enabled, but other resource constraints may still apply)
      • Network problems (suggested by the timeout errors)
      • Configuration problems
  3. Hyperthreading configuration issue

    • Worker nodes would need hyperthreading disabled at the AWS instance level (ThreadsPerCore: 1)
    • But cpuoptions is empty in the current configuration
    • This could lead to:
      • AWS API calls failing when worker nodes are created
      • Or incorrect resource calculations after worker nodes are created

Verification Conclusions

✅ Fully Verified

  1. Configuration correctness

    • ✅ Worker nodes are configured with hyperthreading: Disabled
    • ✅ ControlPlane nodes are configured with hyperthreading: Enabled
    • ✅ The default instance size m6i.xlarge is used
    • ✅ The MachineConfig correctly sets the kernel arguments that disable hyperthreading (worker only)
  2. Cluster creation failure

    • ✅ The cluster did fail to be created (no worker nodes, Control Plane not ready)
    • ✅ This matches the expected result

⚠️ Partially Verified

  1. Worker node creation failure

    • ⚠️ Worker nodes were never created at all (no AWSMachine objects)
    • ⚠️ The actual behavior of worker nodes with hyperthreading disabled could not be verified
    • ⚠️ The workers may have been blocked because the Control Plane never became ready
  2. Hyperthreading configuration at the AWS instance level

    • ⚠️ cpuoptions.ThreadsPerCore is not set
    • ⚠️ Hyperthreading is only disabled at the operating-system level, not at the AWS instance level
    • ⚠️ This may not be the direct cause of the failure, but it could affect resource calculations
  3. Root cause of the failure

    • ⚠️ It cannot be determined whether the failure was caused solely by disabling hyperthreading on the worker nodes
    • ⚠️ The Control Plane also has problems (InternalError)
    • ⚠️ Network or configuration issues may also be involved

Comparison with the First Test

| Item                         | First Test        | Second Test       |
|------------------------------|-------------------|-------------------|
| Worker hyperthreading        | Disabled ✅       | Disabled ✅       |
| ControlPlane hyperthreading  | Disabled ❌       | Enabled ✅        |
| Instance type                | m6i.xlarge ✅     | m6i.xlarge ✅     |
| Worker nodes created         | Not created ❌    | Not created ❌    |
| Control Plane status         | InternalError ⚠️  | InternalError ⚠️  |
| Config meets requirements    | Partially ⚠️      | Fully ✅          |

Key Improvements

  • ✅ The second test's configuration fully satisfies the OCP-23544 requirements
  • ⚠️ But both tests hit the same problem: worker nodes were not created and the Control Plane never became ready

Recommendations

  1. Check the AWS instance configuration

    • Verify whether the worker AWSMachine objects should set cpuoptions.ThreadsPerCore: 1
    • Check whether openshift-install should set this option when creating worker nodes
  2. Collect more detailed logs

    • Check the bootstrap node logs to confirm whether the bootstrap process ran correctly
    • Check the master node logs to find the specific reason the Control Plane components failed to initialize
    • Review the cluster operators' error logs
  3. Verify the resource calculation

    • Calculate the CPU resources actually available after hyperthreading is disabled
    • m6i.xlarge with hyperthreading disabled: 4 vCPUs → 2 physical cores (see the instance-type query after this list)
    • Verify that this still meets the minimum resource requirements for worker nodes
  4. Investigate the network

    • Check connectivity from the master nodes to the bootstrap node
    • Verify that the API endpoints are reachable
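
As referenced in recommendation 3, the default vCPU layout of m6i.xlarge can be queried directly (a hedged example):

aws ec2 describe-instance-types --instance-types m6i.xlarge \
  --query 'InstanceTypes[].VCpuInfo' --output json

The reported DefaultVCpus, DefaultCores, and DefaultThreadsPerCore values are the basis for the "4 vCPUs → 2 physical cores" calculation above.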

Summary

✅ The OCP-23544 configuration requirements were fully verified

  • Worker nodes: hyperthreading: Disabled
  • ControlPlane nodes: hyperthreading: Enabled
  • Default instance size m6i.xlarge was used
  • Cluster creation failed ✅

⚠️ Remaining issues

  • Worker nodes were not created, so worker-specific problems could not be verified
  • The Control Plane also has problems, so it cannot be confirmed that the failure was caused solely by disabling hyperthreading on the workers
  • The hyperthreading configuration at the AWS instance level may be incomplete

Recommendations

  • Further investigate the root cause of the workers not being created
  • Check whether cpuoptions.ThreadsPerCore: 1 needs to be set to disable hyperthreading at the AWS level
  • Collect more detailed error logs to determine the exact cause of the failure


OCP-22168

The command to extract INFRA_ID:

anan@think:~/works/openshift-versions/works$ INFRA_ID_PREFIX="${INFRA_ID_PREFIX:-${CLUSTER_NAME:-}}"

if [[ -n "${INFRA_ID_PREFIX}" ]]; then
  INFRA_ID=$(aws --region "${AWS_REGION}" ec2 describe-vpcs 2>/dev/null | \
    jq -r --arg prefix "${INFRA_ID_PREFIX}" \
    '.Vpcs[]? | 
     select(.Tags != null and (.Tags | type) == "array" and (.Tags | length) > 0) | 
     select(.Tags[]? | select(.Key == "Name" and (.Value | startswith($prefix)))) |
     .Tags[]? | 
     select(.Key != null and (.Key | startswith("kubernetes.io/cluster/"))) | 
     .Key | 
     sub("^kubernetes.io/cluster/"; "")' | \
    head -n 1)
fi
anan@think:~/works/openshift-versions/works$ echo $INFRA_ID
weli-test-569wj
anan@think:~/works/openshift-versions/works$ echo $INFRA_ID_PREFIX
weli-test
anan@think:~/works/openshift-versions/works$ 

anan@think:~/works/openshift-versions/works$ echo "{\"aws\":{\"region\":\"${AWS_REGION}\",\"identifier\":[{\"kubernetes.io/cluster/${INFRA_ID}\":\"owned\"}]}}"
{"aws":{"region":"us-east-1","identifier":[{"kubernetes.io/cluster/weli-test-569wj":"owned"}]}}
anan@think:~/works/openshift-versions/works$ echo "{\"aws\":{\"region\":\"${AWS_REGION}\",\"identifier\":[{\"kubernetes.io/cluster/${INFRA_ID}\":\"owned\"}]}}" | jq
{
  "aws": {
    "region": "us-east-1",
    "identifier": [
      {
        "kubernetes.io/cluster/weli-test-569wj": "owned"
      }
    ]
  }
}
anan@think:~/works/openshift-versions/works$ 

Original metadata.json:

anan@think:~/works/openshift-versions/works$ cat metadata.json | jq
{
  "clusterName": "weli-test",
  "clusterID": "6f84551a-5936-42dc-95f3-a04952f958d2",
  "infraID": "weli-test-569wj",
  "aws": {
    "region": "us-east-1",
    "identifier": [
      {
        "kubernetes.io/cluster/weli-test-569wj": "owned"
      },
      {
        "openshiftClusterID": "6f84551a-5936-42dc-95f3-a04952f958d2"
      },
      {
        "sigs.k8s.io/cluster-api-provider-aws/cluster/weli-test-569wj": "owned"
      }
    ],
    "clusterDomain": "weli-test.qe.devcluster.openshift.com"
  },
  "featureSet": "",
  "customFeatureSet": null
}
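
When metadata.json is still available, the infra ID can be read from it directly and compared with the value recovered from the VPC tags (a minimal sketch):

jq -r '.infraID' metadata.json   # should print weli-test-569wj, matching $INFRA_ID above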


# OCP-22663 - [ipi-on-aws] Pick instance types for machines per region basis

## Test Case Overview

This test case validates that the OpenShift installer correctly selects instance types for AWS machines based on regional availability. The installer uses a priority-based fallback mechanism to select the best available instance type for each region.

## Current Implementation Behavior

The installer uses the following instance type priority list for AMD64 architecture:
1. m6i.xlarge (primary preference)
2. m5.xlarge (fallback)
3. r5.xlarge (fallback)
4. c5.2xlarge (fallback)
5. m5.2xlarge (fallback)
6. c5d.2xlarge (fallback)
7. r5.2xlarge (fallback)

The installer automatically checks instance type availability in the selected region and availability zones, selecting the first available type from the priority list.

## Test Steps

### Test Case 1: Standard Region with m6i Available

Objective: Verify that the installer selects m6i.xlarge when it's available in the region.

Prerequisites:

  • AWS credentials configured
  • Access to a standard AWS region (e.g., us-east-1, us-west-2, ap-northeast-1, eu-west-1)

Steps:

1. Create the Install Config asset:

openshift-install create install-config --dir instance_types1

2. Modify the region field in install-config.yaml:

platform:
  aws:
    region: us-east-1  # or another region where m6i is available

3. Generate the Kubernetes manifests:

openshift-install create manifests --dir instance_types1

Expected Result:

  • The installer should select m6i.xlarge as the instance type
  • Verify the instance type in the generated manifests:
    grep -r instanceType: instance_types1/
  • Expected output should show:
    openshift/99_openshift-cluster-api_master-machines-0.yaml:      instanceType: m6i.xlarge
    

### Test Case 2: Region Where m6i is Not Available

Objective: Verify that the installer falls back to m5.xlarge when m6i is not available in the region.

Prerequisites:

  • AWS credentials configured
  • Access to a region where m6i instance types are not available (e.g., eu-north-1, eu-west-3, us-gov-east-1)

Steps:

1. Create the Install Config asset:

openshift-install create install-config --dir instance_types2

2. Modify the region field in install-config.yaml:

platform:
  aws:
    region: eu-west-3  # Region where m6i may not be available

3. Generate the Kubernetes manifests:

openshift-install create manifests --dir instance_types2

Expected Result:

  • The installer should detect that m6i.xlarge is not available and fall back to m5.xlarge
  • Verify the instance type in the generated manifests:
    grep -r instanceType: instance_types2/
  • Expected output should show:
    openshift/99_openshift-cluster-api_master-machines-0.yaml:      instanceType: m5.xlarge
    

### Test Case 3: Full Cluster Installation Verification

Objective: Verify that the selected instance type works correctly during actual cluster installation.

Prerequisites:

  • AWS credentials configured with sufficient permissions
  • Valid base domain and pull secret

Steps:

1. Use the install config from Test Case 1 or Test Case 2

2. Launch the cluster:

openshift-install create cluster --dir instance_types2

Expected Result:

  • Installation completes successfully
  • Master nodes are created with the expected instance type
  • Verify instance types of running instances:
    # After cluster installation, verify via AWS CLI or console
    aws ec2 describe-instances --filters "Name=tag:Name,Values=*master*" --query 'Reservations[*].Instances[*].[InstanceType,Tags[?Key==`Name`].Value|[0]]' --output table
  • Create a new project and deploy a test application to verify cluster functionality:
    oc new-project test-instance-types
    oc new-app --image=nginx --name=test-app
    oc get pods -w

## Additional Verification

### Verify Instance Type Selection Logic

To understand why a specific instance type was selected, check the installer logs:

# Enable debug logging
export OPENSHIFT_INSTALL_LOG_LEVEL=debug
openshift-install create manifests --dir instance_types1

Look for log messages related to instance type selection and availability checks.
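
For example, the relevant lines can be filtered from the installer's log file in the assets directory (a hedged sketch; .openshift_install.log is where openshift-install writes its debug output):

grep -i "instance type" instance_types1/.openshift_install.log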

### Manual Instance Type Availability Check

You can manually verify instance type availability in a region using AWS CLI:

# Check if m6i.xlarge is available in a specific region
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=m6i.xlarge" \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[*].Location' \
  --output table

# Check if m5.xlarge is available
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=m5.xlarge" \
  --region eu-west-3 \
  --query 'InstanceTypeOfferings[*].Location' \
  --output table

## Notes

1. Instance Type Availability: Instance type availability can vary by region and availability zone. The installer automatically handles this by checking availability and selecting the best option.

2. Regional Overrides: If specific regions require different instance type priorities, they can be configured in pkg/types/aws/defaults/platform.go using the defaultMachineTypes map.

3. Architecture Support: This test case focuses on AMD64 architecture. ARM64 architecture uses different instance types (e.g., m6g.xlarge).

4. Version Compatibility:

  • For OpenShift 4.10 and later: Default instance type is m6i.xlarge, with fallback to m5.xlarge if m6i is not available
  • For OpenShift 4.6 to 4.9: Default instance type was m5.xlarge
  • For OpenShift 4.5 and earlier: Default instance type was m4.xlarge

## Implementation Details

This section explains how the instance type selection logic works in the codebase, including the key components and their interactions.

### 1. Instance Type Defaults Definition

Location: pkg/types/aws/defaults/platform.go

The InstanceTypes() function defines the default priority list of instance types based on architecture and topology:

// InstanceTypes returns a list of instance types, in decreasing priority order
func InstanceTypes(region string, arch types.Architecture, topology configv1.TopologyMode) []string {
    // Check for region-specific overrides first
    if classesForArch, ok := defaultMachineTypes[arch]; ok {
        if classes, ok := classesForArch[region]; ok {
            return classes
        }
    }

    instanceSize := defaultInstanceSizeHighAvailabilityTopology // "xlarge"
    // Single node topology requires larger instance (2xlarge) for 8 cores
    if topology == configv1.SingleReplicaTopologyMode {
        instanceSize = defaultInstanceSizeSingleReplicaTopology // "2xlarge"
    }

    switch arch {
    case types.ArchitectureARM64:
        return []string{
            fmt.Sprintf("m6g.%s", instanceSize),
        }
    default: // AMD64
        return []string{
            fmt.Sprintf("m6i.%s", instanceSize),  // Primary: m6i.xlarge
            fmt.Sprintf("m5.%s", instanceSize),    // Fallback 1: m5.xlarge
            fmt.Sprintf("r5.%s", instanceSize),   // Fallback 2: r5.xlarge
            "c5.2xlarge",                         // Fallback 3
            "m5.2xlarge",                         // Fallback 4
            "c5d.2xlarge",                        // Fallback 5 (Local Zone compatible)
            "r5.2xlarge",                         // Fallback 6
        }
    }
}

Key Points:

  • Returns instance types in priority order (highest to lowest)
  • Supports region-specific overrides via defaultMachineTypes map
  • Adjusts instance size based on topology (HA vs single-node)
  • Different instance types for ARM64 vs AMD64 architectures

### 2. Instance Type Selection Logic

Location: pkg/asset/machines/aws/instance_types.go

The PreferredInstanceType() function selects the best available instance type by checking availability in the specified zones:

// PreferredInstanceType returns a preferred instance type from the list of 
// instance types provided in descending order of preference
func PreferredInstanceType(ctx context.Context, meta *awsconfig.Metadata, 
    types []string, zones []string) (string, error) {
    if len(types) == 0 {
        return "", errors.New("at least one instance type required")
    }

    // Create EC2 client to query instance type availability
    client, err := awsconfig.NewEC2Client(ctx, awsconfig.EndpointOptions{
        Region:    meta.Region,
        Endpoints: meta.Services,
    })
    if err != nil {
        return "", fmt.Errorf("failed to create EC2 client: %w", err)
    }

    // Query AWS to get instance type availability per zone
    found, err := getInstanceTypeZoneInfo(ctx, client, types, zones)
    if err != nil {
        // If query fails, return first type as fallback
        return types[0], err
    }

    // Iterate through types in priority order
    for _, t := range types {
        // Check if this instance type is available in ALL required zones
        if found[t].HasAll(zones...) {
            return t, nil
        }
    }

    // If no type available in all zones, return first type with error
    return types[0], errors.New("no instance type found for the zone constraint")
}

The getInstanceTypeZoneInfo() function queries AWS EC2 API to check instance type availability:

func getInstanceTypeZoneInfo(ctx context.Context, client *ec2.Client, 
    types []string, zones []string) (map[string]sets.Set[string], error) {
    found := map[string]sets.Set[string]{}
    
    // Query AWS EC2 DescribeInstanceTypeOfferings API
    resp, err := client.DescribeInstanceTypeOfferings(ctx, &ec2.DescribeInstanceTypeOfferingsInput{
        Filters: []ec2types.Filter{
            {
                Name:   aws.String("location"),
                Values: zones,  // Filter by availability zones
            },
            {
                Name:   aws.String("instance-type"),
                Values: types,  // Filter by instance types
            },
        },
        LocationType: ec2types.LocationTypeAvailabilityZone,
    })
    if err != nil {
        return found, err
    }

    // Build a map: instance type -> set of available zones
    for _, offering := range resp.InstanceTypeOfferings {
        f, ok := found[string(offering.InstanceType)]
        if !ok {
            f = sets.New[string]()
            found[string(offering.InstanceType)] = f
        }
        f.Insert(aws.ToString(offering.Location))
    }
    return found, nil
}

Key Points:

  • Queries AWS EC2 API to check real-time instance type availability
  • Requires instance type to be available in ALL specified availability zones
  • Returns first available type from priority list
  • Falls back to first type if API query fails

### 3. Master Machine Configuration

Location: pkg/asset/machines/master.go

The master machine configuration integrates the instance type selection logic:

// When instance type is not specified by user
if mpool.InstanceType == "" {
    // Determine topology mode
    topology := configv1.HighlyAvailableTopologyMode
    if pool.Replicas != nil && *pool.Replicas == 1 {
        topology = configv1.SingleReplicaTopologyMode
    }
    
    // Get priority list of instance types
    instanceTypes := awsdefaults.InstanceTypes(
        installConfig.Config.Platform.AWS.Region,
        installConfig.Config.ControlPlane.Architecture,
        topology,
    )
    
    // Select best available instance type
    mpool.InstanceType, err = aws.PreferredInstanceType(
        ctx,
        installConfig.AWS,
        instanceTypes,
        mpool.Zones,
    )
    if err != nil {
        // If selection fails, use first type from list as fallback
        logrus.Warn(errors.Wrap(err, "failed to find default instance type"))
        mpool.InstanceType = instanceTypes[0]
    }
}

// Filter zones if instance type is not available in all default zones
if zoneDefaults {
    mpool.Zones, err = aws.FilterZonesBasedOnInstanceType(
        ctx,
        installConfig.AWS,
        mpool.InstanceType,
        mpool.Zones,
    )
    if err != nil {
        logrus.Warn(errors.Wrap(err, "failed to filter zone list"))
    }
}

Key Points:

  • Only runs when user hasn't specified an instance type
  • Determines topology (HA vs single-node) based on replica count
  • Calls InstanceTypes() to get priority list
  • Calls PreferredInstanceType() to select best available type
  • Filters zones if selected instance type isn't available in all zones

### 4. Machine Manifest Generation

Location: pkg/asset/machines/aws/machines.go

The Machines() function generates Kubernetes Machine manifests with the selected instance type:

// Machines returns a list of machines for a machinepool
func Machines(clusterID string, region string, subnets aws.SubnetsByZone, 
    pool *types.MachinePool, role, userDataSecret string, 
    userTags map[string]string, publicSubnet bool) ([]machineapi.Machine, 
    *machinev1.ControlPlaneMachineSet, error) {
    
    mpool := pool.Platform.AWS
    
    // Create machines for each replica
    for idx := int64(0); idx < total; idx++ {
        zone := mpool.Zones[int(idx)%len(mpool.Zones)]
        subnet, ok := subnets[zone]
        
        // Create provider config with selected instance type
        provider, err := provider(&machineProviderInput{
            clusterID:        clusterID,
            region:           region,
            subnet:           subnet.ID,
            instanceType:     mpool.InstanceType,  // Uses selected instance type
            osImage:          mpool.AMIID,
            zone:             zone,
            role:             role,
            // ... other fields
        })
        
        // Create Machine object
        machine := machineapi.Machine{
            Spec: machineapi.MachineSpec{
                ProviderSpec: machineapi.ProviderSpec{
                    Value: &runtime.RawExtension{Object: provider},
                },
            },
        }
        machines = append(machines, machine)
    }
    
    return machines, controlPlaneMachineSet, nil
}

The provider() function creates the AWS machine provider configuration:

func provider(in *machineProviderInput) (*machineapi.AWSMachineProviderConfig, error) {
    config := &machineapi.AWSMachineProviderConfig{
        TypeMeta: metav1.TypeMeta{
            APIVersion: "machine.openshift.io/v1beta1",
            Kind:       "AWSMachineProviderConfig",
        },
        InstanceType: in.instanceType,  // Set from selected instance type
        // ... other configuration fields
    }
    return config, nil
}

Key Points:

  • Generates Machine manifests for each replica
  • Uses the instance type selected by PreferredInstanceType()
  • Creates AWSMachineProviderConfig with the instance type
  • Distributes machines across availability zones

### Execution Flow Summary

1. User creates install-config → specifies region (and optionally an instance type)
2. Master machine configuration (master.go):
   • If no instance type is specified, calls InstanceTypes() to get the priority list
   • Calls PreferredInstanceType() to select the best available type
3. Instance type selection (instance_types.go):
   • Queries the AWS EC2 API to check availability
   • Returns the first type available in all zones
4. Machine manifest generation (machines.go):
   • Creates Machine objects with the selected instance type
   • Writes the manifests to disk

## Related Code References

  • Instance type defaults: pkg/types/aws/defaults/platform.go
  • Instance type selection logic: pkg/asset/machines/aws/instance_types.go
  • Machine manifest generation: pkg/asset/machines/aws/machines.go
  • Master machine configuration: pkg/asset/machines/master.go


OCP-29648

anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup 
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      amiID: ami-01095d1967818437c
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      amiID: ami-0c1a8e216e46bb60c
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External
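
The REGION and INFRA_ID variables used below can be derived from the installer's metadata.json in the assets directory (a hedged sketch):

REGION=$(jq -r '.aws.region' metadata.json)
INFRA_ID=$(jq -r '.infraID' metadata.json)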

# Check the AMI of the master nodes (should show ami-0c1a8e216e46bb60c)
echo "Master node AMI:"
aws ec2 describe-instances \
  --region "${REGION}" \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
            "Name=tag:Name,Values=*master*" \
            "Name=instance-state-name,Values=running" \
  --output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq

# Check the AMI of the worker nodes (should show ami-01095d1967818437c)
echo "Worker node AMI:"
aws ec2 describe-instances \
  --region "${REGION}" \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
            "Name=tag:Name,Values=*worker*" \
            "Name=instance-state-name,Values=running" \
  --output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq
Master node AMI:
ami-0c1a8e216e46bb60c
Worker node AMI:
ami-01095d1967818437c


OCP-21531

Verify the Pull Secret:

anan@think:~/works/openshift-versions/421nightly$ vi ../auth.json
anan@think:~/works/openshift-versions/421nightly$ oc adm release extract --command openshift-install --from=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804 -a ../auth.json 
anan@think:~/works/openshift-versions/421nightly$ du -h openshift-install 
654M	openshift-install
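
The extracted binary can be sanity-checked before use (a minimal example):

./openshift-install version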

Export variables:

anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804
anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-01095d1967818437c

Using a different installer version to install the cluster:

anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install version
../421rc0/openshift-install 4.21.0-rc.0
built from commit 8f88b34924c2267a2aa446dcdc6ccdd5260f9c45
release image quay.io/openshift-release-dev/ocp-release@sha256:ecde621d6f74aa1af4cd351f8b571ca2a61bbc32826e49cdf1b7fbff07f04ede
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown 
release architecture unknown
default architecture amd64
anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install create cluster
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown 
INFO Credentials loaded from the "default" profile in file "/home/anan/.aws/credentials" 
WARNING Found override for OS Image. Please be warned, this is not advised 
INFO Successfully populated MCS CA cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z 
INFO Successfully populated MCS TLS cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z 
INFO Credentials loaded from the AWS config using "SharedConfigCredentials: /home/anan/.aws/credentials" provider 
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Please be warned, this is not advised 

Check the installed cluster version and the AMI ID that was used:

anan@think:~/works/openshift-versions/work3$ export KUBECONFIG=/home/anan/works/openshift-versions/work3/auth/kubeconfig
anan@think:~/works/openshift-versions/work3$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0.nightly-2025-12-22-170804   True        False         71m     Cluster version is 4.21.0-0.nightly-2025-12-22-170804
$ oc get machineset.machine.openshift.io -n openshift-machine-api -o json | \
jq -r '.items[] | .spec.template.spec.providerSpec.value.ami.id'
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
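
The control-plane Machines are not covered by these MachineSets; their AMI can be checked separately (a hedged example using the standard machine role label):

oc get machines.machine.openshift.io -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master \
  -o json | jq -r '.items[].spec.providerSpec.value.ami.id'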


OCP-22425

Cluster A:

anan@think:~/works/openshift-versions/work3$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ip-10-0-106-174.ec2.internal   Ready    control-plane,master   8h    v1.34.2
ip-10-0-157-14.ec2.internal    Ready    control-plane,master   8h    v1.34.2
ip-10-0-30-65.ec2.internal     Ready    worker                 8h    v1.34.2
ip-10-0-54-54.ec2.internal     Ready    worker                 8h    v1.34.2
ip-10-0-74-122.ec2.internal    Ready    worker                 8h    v1.34.2
ip-10-0-76-206.ec2.internal    Ready    control-plane,master   8h    v1.34.2
anan@think:~/works/openshift-versions/work3$ oc get route -n openshift-authentication
NAME              HOST/PORT                                                    PATH   SERVICES          PORT   TERMINATION            WILDCARD
oauth-openshift   oauth-openshift.apps.weli-test.qe.devcluster.openshift.com          oauth-openshift   6443   passthrough/Redirect   None
anan@think:~/works/openshift-versions/work3$ oc get po -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-6b767844c6-2jztv   2/2     Running   0          8h
apiserver-6b767844c6-g4rck   2/2     Running   0          8h
apiserver-6b767844c6-jzv4z   2/2     Running   0          8h
anan@think:~/works/openshift-versions/work3$ oc rsh -n openshift-apiserver apiserver-6b767844c6-2jztv
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1# 

Cluster B:

anan@think:~/works/openshift-versions/works2$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ip-10-0-122-6.ec2.internal     Ready    control-plane,master   27m   v1.34.2
ip-10-0-134-89.ec2.internal    Ready    control-plane,master   27m   v1.34.2
ip-10-0-141-244.ec2.internal   Ready    worker                 13m   v1.34.2
ip-10-0-31-52.ec2.internal     Ready    worker                 21m   v1.34.2
ip-10-0-67-21.ec2.internal     Ready    control-plane,master   27m   v1.34.2
ip-10-0-96-196.ec2.internal    Ready    worker                 21m   v1.34.2
anan@think:~/works/openshift-versions/works2$ oc get po -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-574bdcd758-j85sh   2/2     Running   0          10m
apiserver-574bdcd758-l98ph   2/2     Running   0          10m
apiserver-574bdcd758-p922j   2/2     Running   0          8m8s
anan@think:~/works/openshift-versions/works2$ oc rsh -n openshift-apiserver apiserver-574bdcd758-j85sh
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1# curl -k https://oauth-openshift.apps.weli-test.qe.devcluster.openshift.com/healthz
oksh-5.1# 
