anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Disabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Disabled
  name: master
  platform: {}
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External
anan@think:~/works/openshift-versions/works$ head -n 20 install-config.yaml
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Disabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

INFO Waiting up to 20m0s (until 11:17PM CST) for the Kubernetes API at https://api.weli-test.qe.devcluster.openshift.com:6443...
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 98.85.31.4:6443: connect: connection refused
ERROR Bootstrap failed to complete: Get "https://api.weli-test.qe.devcluster.openshift.com:6443/version": dial tcp 52.23.16.72:6443: connect: connection refused
ERROR Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.
INFO Pulling Cluster API artifacts
INFO Pulling VM console logs
INFO Pulling debug logs from the bootstrap machine
ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected
INFO Bootstrap gather logs captured here "log-bundle-20251222231737.tar.gz"
INFO Shutting down local Cluster API controllers...

weli@tower ~/works/oc-swarm/openshift-progress/works/log-bundle-20251222231737
❯ pwd
/Users/weli/works/oc-swarm/openshift-progress/works/log-bundle-20251222231737

OCP-23544 Log Analysis Report (Second Test Run)
Test Scenario
OCP-23544: [ipi-on-aws] [Hyperthreading] Create cluster with hyperthreading disabled with default instance size.
Expected result: cluster creation fails
Test time: 2025-12-22 22:49:44 (UTC+8)
Configuration Verification
✅ Install Config is correct
From rendered-assets/openshift/manifests/cluster-config.yaml:
compute:
- architecture: amd64
  hyperthreading: Disabled   # ✅ hyperthreading disabled on worker nodes (as required)
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled    # ✅ hyperthreading enabled on master nodes (as required)
  name: master
  platform: {}
  replicas: 3

Key improvements:
- ✅ Worker nodes: hyperthreading: Disabled (matches the test requirement)
- ✅ ControlPlane nodes: hyperthreading: Enabled (matches the test requirement)
- ✅ The configuration fully satisfies the OCP-23544 test requirements
✅ Instance Type Configuration
Confirmed from multiple configuration files:
- Master nodes: m6i.xlarge ✅ (default instance size)
- Worker nodes: m6i.xlarge ✅ (default instance size)
- Bootstrap node: m6i.xlarge ✅
All nodes use the default instance size m6i.xlarge, which matches the test requirement.
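A quick way to double-check this from the gathered assets is to grep for the instance type across the rendered manifests and Cluster API objects; a rough sketch, run from the log bundle directory (paths as referenced above):

# Look for instance type fields in rendered manifests and Cluster API objects.
grep -rniE 'instancetype' rendered-assets/ clusterapi/ | sort -u
# Per this report, every match should show m6i.xlarge.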
✅ MachineConfig Configuration
From 99-worker-disable-hyperthreading.yaml:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-disable-hyperthreading
spec:
  kernelArguments:
  - nosmt
  - smt-enabled=off

Key findings:
- ✅ The worker nodes are correctly configured with the kernel arguments that disable hyperthreading
- ✅ There is no hyperthreading-disabling MachineConfig for the master nodes (as required)
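If a cluster built from this configuration does come up, the effect of the MachineConfig can be confirmed directly on a worker node; a minimal sketch (the node selection is arbitrary, and the expected smt/control value is an assumption):

# Pick one worker node and inspect its kernel command line and SMT state.
NODE=$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')
oc debug node/"${NODE}" -- chroot /host sh -c 'cat /proc/cmdline; cat /sys/devices/system/cpu/smt/control'
# Expectation (assumption): the cmdline contains "nosmt" and smt/control reports "forceoff".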
Cluster Status Analysis
❌ Worker Node Creation Status
From clusterapi/Cluster-openshift-cluster-api-guests-weli-test-87ndj.yaml:
conditions:
- type: WorkersAvailable
  reason: NoWorkers
  status: "True"
- type: WorkerMachinesReady
  reason: NoReplicas
  status: "True"
- type: WorkerMachinesUpToDate
  reason: NoReplicas
  status: "True"

Key findings:
- ❌ No worker nodes were created (NoWorkers, NoReplicas)
- ✅ The master nodes were created (all three masters have corresponding AWSMachine objects)
- ✅ The bootstrap node was created
MachineSet Status
From 99_openshift-cluster-api_worker-machineset-0.yaml:
status:
  replicas: 0   # ❌ no worker instances were created
All five worker MachineSets (us-east-1a through us-east-1d, plus us-east-1f) have replicas set to 0.
AWSMachine Objects
In the clusterapi/ directory, only the following AWSMachine objects were found:
- ✅ AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-bootstrap.yaml
- ✅ AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml
- ✅ AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-1.yaml
- ✅ AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-2.yaml
- ❌ There are no AWSMachine objects for worker nodes
⚠️ Control Plane Status
From the Cluster status:
conditions:
- type: ControlPlaneAvailable
  reason: InternalError
  status: "Unknown"
  message: "Please check controller logs for errors"
- type: ControlPlaneMachinesReady
  reason: NotReady
  status: "False"
  message: "Waiting for Cluster control plane to be initialized"
- type: InfrastructureReady
  reason: Ready
  status: "True"

Key findings:
- ✅ The infrastructure is ready
- ⚠️ The control plane reports an internal error
- ⚠️ The control plane machines are not ready (waiting for initialization)
Master Node Instance Status
From AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml:
status:
  ready: true
  instancestate: running
  addresses:
  - type: InternalIP
    address: 10.0.53.192
  conditions:
  - type: Ready
    status: "True"
  - type: InstanceReady
    status: "True"

Key findings:
- ✅ The AWS instances for the master nodes were created and are running
- ✅ The master node instances report ready: true
- ⚠️ But the control plane as a whole is not ready
⚠️ Timeout Errors
From the serial console logs:
[weli-test-87ndj-master-0-serial.log]
ignition[831]: GET error: Get "https://api-int.weli-test.qe.devcluster.openshift.com:22623/config/master":
dial tcp 10.0.108.249:22623: i/o timeout
[weli-test-87ndj-master-2-serial.log]
ignition[781]: GET error: Get "https://api-int.weli-test.qe.devcluster.openshift.com:22623/config/master":
dial tcp 10.0.2.4:22623: i/o timeout

Key findings:
- ⚠️ The master nodes timed out while fetching their config from the bootstrap node
- ⚠️ This may indicate a problem with the bootstrap node or with the network
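To dig into the bootstrap side, the standard installer gather plus a direct look at the Machine Config Server port are the usual next steps; a hedged sketch (the bootstrap IP is a placeholder, and the /healthz path on port 22623 is an assumption):

# Re-gather bootstrap logs from the installer assets directory.
openshift-install gather bootstrap --dir .

# Or inspect the bootstrap host directly (requires the SSH key used at install time).
ssh core@<bootstrap-ip> 'journalctl -b -u bootkube.service --no-pager | tail -n 50'
ssh core@<bootstrap-ip> 'curl -ks https://localhost:22623/healthz'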
CPU Options Analysis
From AWSMachine-openshift-cluster-api-guests-weli-test-87ndj-master-0.yaml:
instancetype: m6i.xlarge
cpuoptions:
  confidentialcompute: ""

Key findings:
- ⚠️ cpuoptions.ThreadsPerCore is not set
- For the m6i.xlarge instance type, disabling hyperthreading at the AWS level requires ThreadsPerCore: 1 (see the sketch below)
- The current configuration may only disable hyperthreading at the OS level (via kernel arguments), not at the AWS instance level
- This may be the crux of the issue: the worker nodes would need hyperthreading disabled at the AWS instance level, but the configuration does not set it
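For reference, this is what the corresponding knob looks like at the AWS level; a minimal sketch with the AWS CLI (the instance and AMI IDs are placeholders, and whether the installer itself should be setting this is exactly the open question):

# Inspect the CPU options of an existing instance.
aws ec2 describe-instances --instance-ids <instance-id> \
  --query 'Reservations[].Instances[].CpuOptions' --output json

# Launching with hyperthreading disabled at the AWS level means one thread per core
# (m6i.xlarge defaults to 2 cores x 2 threads).
aws ec2 run-instances --instance-type m6i.xlarge \
  --cpu-options "CoreCount=2,ThreadsPerCore=1" \
  --image-id <ami-id> --dry-run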
Failure Analysis
Possible causes
1. Why no worker nodes were created:
   - The control plane never fully initialized, so worker nodes could not be created
   - Or worker creation failed partway through (but there are no worker AWSMachine objects in the logs, which suggests creation was never even attempted)
2. Control plane initialization failure:
   - The master node instances were created and are running
   - But the control plane components never finished initializing
   - Possible reasons:
     - Insufficient resources (hyperthreading is enabled on the masters, but other resource problems are still possible)
     - Network problems (suggested by the timeout errors)
     - Configuration problems
3. Hyperthreading configuration problem:
   - The worker nodes would need hyperthreading disabled at the AWS instance level (ThreadsPerCore: 1)
   - But cpuoptions is empty in the current configuration
   - This could lead to:
     - AWS API calls failing when the worker nodes are created
     - Or incorrect resource calculations after the workers are created
Verification Conclusions
✅ Fully verified
1. Configuration correctness:
   - ✅ Worker nodes are configured with hyperthreading: Disabled
   - ✅ ControlPlane nodes are configured with hyperthreading: Enabled
   - ✅ The default instance size m6i.xlarge is used
   - ✅ The MachineConfig correctly sets the kernel arguments that disable hyperthreading (workers only)
2. Cluster creation failure:
   - ✅ The cluster did fail to create (no worker nodes, control plane not ready)
   - ✅ This matches the expected result
⚠️ Partially verified
1. Worker node creation failure:
   - ⚠️ The worker nodes were never created at all (no AWSMachine objects)
   - ⚠️ The actual behavior of workers with hyperthreading disabled could not be verified
   - ⚠️ The workers may simply have been blocked because the control plane never became ready
2. Hyperthreading configuration at the AWS instance level:
   - ⚠️ cpuoptions.ThreadsPerCore is not set
   - ⚠️ Hyperthreading is only disabled at the OS level, not at the AWS instance level
   - ⚠️ This may not be the direct cause of the failure, but it could affect resource calculations
3. Root cause of the failure:
   - ⚠️ It cannot be determined whether the failure was caused entirely by disabling hyperthreading on the workers
   - ⚠️ The control plane also has problems (InternalError)
   - ⚠️ There may be network or configuration problems
Comparison with the First Test Run

| Item | First test | Second test |
|---|---|---|
| Worker hyperthreading | Disabled ✅ | Disabled ✅ |
| ControlPlane hyperthreading | Disabled ❌ | Enabled ✅ |
| Instance type | m6i.xlarge ✅ | m6i.xlarge ✅ |
| Worker nodes created | Not created ❌ | Not created ❌ |
| Control Plane status | InternalError | InternalError |
| Configuration meets requirements | Partially | Fully ✅ |

Key takeaways:
- ✅ The configuration of the second test run fully satisfies the OCP-23544 requirements
- ⚠️ But both runs hit the same problem: no worker nodes were created and the control plane never became ready
Recommendations
1. Check the AWS instance configuration:
   - Verify whether the worker AWSMachine objects should set cpuoptions.ThreadsPerCore: 1
   - Check whether openshift-install should set this option when creating worker nodes
2. Look at more detailed logs:
   - Check the bootstrap node logs to confirm whether bootstrap ran normally
   - Check the master node logs for the specific reason the control plane components failed to initialize
   - Review the cluster operators' error logs
3. Verify the resource math (see the sketch after this list):
   - Work out the CPU resources actually available once hyperthreading is disabled
   - m6i.xlarge with hyperthreading disabled: 4 vCPUs → 2 physical cores
   - Verify whether that still meets the minimum worker node requirements
4. Network troubleshooting:
   - Check network connectivity from the master nodes to the bootstrap node
   - Verify that the API endpoints are reachable
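The resource math mentioned in item 3 can be read straight from the instance-type metadata; a small sketch:

# m6i.xlarge defaults: 4 vCPUs = 2 cores x 2 threads; with ThreadsPerCore=1 only 2 vCPUs remain.
aws ec2 describe-instance-types --instance-types m6i.xlarge \
  --query 'InstanceTypes[0].VCpuInfo.{vCPUs:DefaultVCpus,Cores:DefaultCores,ThreadsPerCore:DefaultThreadsPerCore}' \
  --output table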
Summary
✅ The OCP-23544 configuration requirements were fully verified:
- Worker nodes: hyperthreading: Disabled ✅
- ControlPlane nodes: hyperthreading: Enabled ✅
- Default instance size m6i.xlarge ✅
- Cluster creation failed ✅
- No worker nodes were created, so the worker-specific behavior could not be verified
- The control plane also has problems, so it cannot be determined whether the failure was caused entirely by disabling hyperthreading on the workers
- The hyperthreading configuration at the AWS instance level may be incomplete

Recommendations:
- Investigate the root cause of the missing worker nodes further
- Check whether cpuoptions.ThreadsPerCore: 1 needs to be set to disable hyperthreading at the AWS level
- Review more detailed error logs to pin down the exact cause of the failure
OCP-22168
The command to extract INFRA_ID:
anan@think:~/works/openshift-versions/works$ INFRA_ID_PREFIX="${INFRA_ID_PREFIX:-${CLUSTER_NAME:-}}"
if [[ -n "${INFRA_ID_PREFIX}" ]]; then
INFRA_ID=$(aws --region "${AWS_REGION}" ec2 describe-vpcs 2>/dev/null | \
jq -r --arg prefix "${INFRA_ID_PREFIX}" \
'.Vpcs[]? |
select(.Tags != null and (.Tags | type) == "array" and (.Tags | length) > 0) |
select(.Tags[]? | select(.Key == "Name" and (.Value | startswith($prefix)))) |
.Tags[]? |
select(.Key != null and (.Key | startswith("kubernetes.io/cluster/"))) |
.Key |
sub("^kubernetes.io/cluster/"; "")' | \
head -n 1)
fi
anan@think:~/works/openshift-versions/works$ echo $INFRA_ID
weli-test-569wj
anan@think:~/works/openshift-versions/works$ echo $INFRA_ID_PREFIX
weli-test
anan@think:~/works/openshift-versions/works$ echo "{\"aws\":{\"region\":\"${AWS_REGION}\",\"identifier\":[{\"kubernetes.io/cluster/${INFRA_ID}\":\"owned\"}]}}"
{"aws":{"region":"us-east-1","identifier":[{"kubernetes.io/cluster/weli-test-569wj":"owned"}]}}
anan@think:~/works/openshift-versions/works$ echo "{\"aws\":{\"region\":\"${AWS_REGION}\",\"identifier\":[{\"kubernetes.io/cluster/${INFRA_ID}\":\"owned\"}]}}" | jq
{
"aws": {
"region": "us-east-1",
"identifier": [
{
"kubernetes.io/cluster/weli-test-569wj": "owned"
}
]
}
}
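The identifier block constructed above is presumably what gets written into a metadata.json so that openshift-install destroy cluster can find the cluster's tagged resources; a hedged sketch of that step (the file layout and whether this minimal field set is sufficient are assumptions):

# Write a minimal metadata.json into an empty assets directory and destroy by tag.
mkdir -p destroy-dir
cat > destroy-dir/metadata.json <<EOF
{"clusterName":"${CLUSTER_NAME}","infraID":"${INFRA_ID}","aws":{"region":"${AWS_REGION}","identifier":[{"kubernetes.io/cluster/${INFRA_ID}":"owned"}]}}
EOF
openshift-install destroy cluster --dir destroy-dir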
anan@think:~/works/openshift-versions/works$

Original metadata.json:
anan@think:~/works/openshift-versions/works$ cat metadata.json | jq
{
"clusterName": "weli-test",
"clusterID": "6f84551a-5936-42dc-95f3-a04952f958d2",
"infraID": "weli-test-569wj",
"aws": {
"region": "us-east-1",
"identifier": [
{
"kubernetes.io/cluster/weli-test-569wj": "owned"
},
{
"openshiftClusterID": "6f84551a-5936-42dc-95f3-a04952f958d2"
},
{
"sigs.k8s.io/cluster-api-provider-aws/cluster/weli-test-569wj": "owned"
}
],
"clusterDomain": "weli-test.qe.devcluster.openshift.com"
},
"featureSet": "",
"customFeatureSet": null
}

# OCP-22663 - [ipi-on-aws] Pick instance types for machines per region basis
## Test Case Overview
This test case validates that the OpenShift installer correctly selects instance types for AWS machines based on regional availability. The installer uses a priority-based fallback mechanism to select the best available instance type for each region.
## Current Implementation Behavior
The installer uses the following instance type priority list for AMD64 architecture:
1. m6i.xlarge (primary preference)
2. m5.xlarge (fallback)
3. r5.xlarge (fallback)
4. c5.2xlarge (fallback)
5. m5.2xlarge (fallback)
6. c5d.2xlarge (fallback)
7. r5.2xlarge (fallback)
The installer automatically checks instance type availability in the selected region and availability zones, selecting the first available type from the priority list.
## Test Steps
### Test Case 1: Standard Region with m6i Available
Objective: Verify that the installer selects m6i.xlarge when it's available in the region.
Prerequisites:
- AWS credentials configured
- Access to a standard AWS region (e.g., us-east-1, us-west-2, ap-northeast-1, eu-west-1)
Steps:
1. Create the Install Config asset:
   openshift-install create install-config --dir instance_types1
2. Modify the region field in install-config.yaml:
   platform:
     aws:
       region: us-east-1  # or another region where m6i is available
3. Generate the Kubernetes manifests:
   openshift-install create manifests --dir instance_types1

Expected Result:
- The installer should select m6i.xlarge as the instance type
- Verify the instance type in the generated manifests:
  grep -r instanceType: instance_types1/
- Expected output should show:
  openshift/99_openshift-cluster-api_master-machines-0.yaml: instanceType: m6i.xlarge
### Test Case 2: Region Where m6i is Not Available
Objective: Verify that the installer falls back to m5.xlarge when m6i is not available in the region.
Prerequisites:
- AWS credentials configured
- Access to a region where m6i instance types are not available (e.g., eu-north-1, eu-west-3, us-gov-east-1)
Steps:
1. Create the Install Config asset:
   openshift-install create install-config --dir instance_types2
2. Modify the region field in install-config.yaml:
   platform:
     aws:
       region: eu-west-3  # Region where m6i may not be available
3. Generate the Kubernetes manifests:
   openshift-install create manifests --dir instance_types2

Expected Result:
- The installer should detect that m6i.xlarge is not available and fall back to m5.xlarge
- Verify the instance type in the generated manifests:
  grep -r instanceType: instance_types2/
- Expected output should show:
  openshift/99_openshift-cluster-api_master-machines-0.yaml: instanceType: m5.xlarge
### Test Case 3: Full Cluster Installation Verification
Objective: Verify that the selected instance type works correctly during actual cluster installation.
Prerequisites:
- AWS credentials configured with sufficient permissions
- Valid base domain and pull secret
Steps:
1. Use the install config from Test Case 1 or Test Case 2
2. Launch the cluster:
   openshift-install create cluster --dir instance_types2

Expected Result:
- Installation completes successfully
- Master nodes are created with the expected instance type
- Verify instance types of running instances:
  # After cluster installation, verify via AWS CLI or console
  aws ec2 describe-instances --filters "Name=tag:Name,Values=*master*" \
    --query 'Reservations[*].Instances[*].[InstanceType,Tags[?Key==`Name`].Value|[0]]' \
    --output table
- Create a new project and deploy a test application to verify cluster functionality:
  oc new-project test-instance-types
  oc new-app --image=nginx --name=test-app
  oc get pods -w
## Additional Verification
### Verify Instance Type Selection Logic
To understand why a specific instance type was selected, check the installer logs:
# Enable debug logging
export OPENSHIFT_INSTALL_LOG_LEVEL=debug
openshift-install create manifests --dir instance_types1

Look for log messages related to instance type selection and availability checks.
### Manual Instance Type Availability Check
You can manually verify instance type availability in a region using AWS CLI:
# Check if m6i.xlarge is available in a specific region
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters "Name=instance-type,Values=m6i.xlarge" \
--region us-east-1 \
--query 'InstanceTypeOfferings[*].Location' \
--output table
# Check if m5.xlarge is available
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters "Name=instance-type,Values=m5.xlarge" \
--region eu-west-3 \
--query 'InstanceTypeOfferings[*].Location' \
--output table

## Notes
1. Instance Type Availability: Instance type availability can vary by region and availability zone. The installer automatically handles this by checking availability and selecting the best option.
2. Regional Overrides: If specific regions require different instance type priorities, they can be configured in pkg/types/aws/defaults/platform.go using the defaultMachineTypes map.
3. Architecture Support: This test case focuses on AMD64 architecture. ARM64 architecture uses different instance types (e.g., m6g.xlarge).
4. Version Compatibility:
   - For OpenShift 4.10 and later: the default instance type is m6i.xlarge, with a fallback to m5.xlarge if m6i is not available
   - For OpenShift 4.6 to 4.9: the default instance type was m5.xlarge
   - For OpenShift 4.5 and earlier: the default instance type was m4.xlarge
## Implementation Details
This section explains how the instance type selection logic works in the codebase, including the key components and their interactions.
### 1. Instance Type Defaults Definition
Location: pkg/types/aws/defaults/platform.go
The InstanceTypes() function defines the default priority list of instance types based on architecture and topology:
// InstanceTypes returns a list of instance types, in decreasing priority order
func InstanceTypes(region string, arch types.Architecture, topology configv1.TopologyMode) []string {
// Check for region-specific overrides first
if classesForArch, ok := defaultMachineTypes[arch]; ok {
if classes, ok := classesForArch[region]; ok {
return classes
}
}
instanceSize := defaultInstanceSizeHighAvailabilityTopology // "xlarge"
// Single node topology requires larger instance (2xlarge) for 8 cores
if topology == configv1.SingleReplicaTopologyMode {
instanceSize = defaultInstanceSizeSingleReplicaTopology // "2xlarge"
}
switch arch {
case types.ArchitectureARM64:
return []string{
fmt.Sprintf("m6g.%s", instanceSize),
}
default: // AMD64
return []string{
fmt.Sprintf("m6i.%s", instanceSize), // Primary: m6i.xlarge
fmt.Sprintf("m5.%s", instanceSize), // Fallback 1: m5.xlarge
fmt.Sprintf("r5.%s", instanceSize), // Fallback 2: r5.xlarge
"c5.2xlarge", // Fallback 3
"m5.2xlarge", // Fallback 4
"c5d.2xlarge", // Fallback 5 (Local Zone compatible)
"r5.2xlarge", // Fallback 6
}
}
}

Key Points:
- Returns instance types in priority order (highest to lowest)
- Supports region-specific overrides via the defaultMachineTypes map
- Adjusts instance size based on topology (HA vs single-node)
- Different instance types for ARM64 vs AMD64 architectures
### 2. Instance Type Selection Logic
Location: pkg/asset/machines/aws/instance_types.go
The PreferredInstanceType() function selects the best available instance type by checking availability in the specified zones:
// PreferredInstanceType returns a preferred instance type from the list of
// instance types provided in descending order of preference
func PreferredInstanceType(ctx context.Context, meta *awsconfig.Metadata,
types []string, zones []string) (string, error) {
if len(types) == 0 {
return "", errors.New("at least one instance type required")
}
// Create EC2 client to query instance type availability
client, err := awsconfig.NewEC2Client(ctx, awsconfig.EndpointOptions{
Region: meta.Region,
Endpoints: meta.Services,
})
if err != nil {
return "", fmt.Errorf("failed to create EC2 client: %w", err)
}
// Query AWS to get instance type availability per zone
found, err := getInstanceTypeZoneInfo(ctx, client, types, zones)
if err != nil {
// If query fails, return first type as fallback
return types[0], err
}
// Iterate through types in priority order
for _, t := range types {
// Check if this instance type is available in ALL required zones
if found[t].HasAll(zones...) {
return t, nil
}
}
// If no type available in all zones, return first type with error
return types[0], errors.New("no instance type found for the zone constraint")
}

The getInstanceTypeZoneInfo() function queries the AWS EC2 API to check instance type availability:
func getInstanceTypeZoneInfo(ctx context.Context, client *ec2.Client,
types []string, zones []string) (map[string]sets.Set[string], error) {
found := map[string]sets.Set[string]{}
// Query AWS EC2 DescribeInstanceTypeOfferings API
resp, err := client.DescribeInstanceTypeOfferings(ctx, &ec2.DescribeInstanceTypeOfferingsInput{
Filters: []ec2types.Filter{
{
Name: aws.String("location"),
Values: zones, // Filter by availability zones
},
{
Name: aws.String("instance-type"),
Values: types, // Filter by instance types
},
},
LocationType: ec2types.LocationTypeAvailabilityZone,
})
if err != nil {
return found, err
}
// Build a map: instance type -> set of available zones
for _, offering := range resp.InstanceTypeOfferings {
f, ok := found[string(offering.InstanceType)]
if !ok {
f = sets.New[string]()
found[string(offering.InstanceType)] = f
}
f.Insert(aws.ToString(offering.Location))
}
return found, nil
}

Key Points:
- Queries AWS EC2 API to check real-time instance type availability
- Requires instance type to be available in ALL specified availability zones
- Returns first available type from priority list
- Falls back to first type if API query fails
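The same availability check the installer performs can be approximated from the CLI; a rough sketch (the zones and types below are just examples):

# Which of these instance types are offered in which of these zones?
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=m6i.xlarge,m5.xlarge" \
            "Name=location,Values=us-east-1a,us-east-1b,us-east-1c" \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[*].[InstanceType,Location]' \
  --output table
# A type is selected only if it appears for every requested zone.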
### 3. Master Machine Configuration
Location: pkg/asset/machines/master.go
The master machine configuration integrates the instance type selection logic:
// When instance type is not specified by user
if mpool.InstanceType == "" {
// Determine topology mode
topology := configv1.HighlyAvailableTopologyMode
if pool.Replicas != nil && *pool.Replicas == 1 {
topology = configv1.SingleReplicaTopologyMode
}
// Get priority list of instance types
instanceTypes := awsdefaults.InstanceTypes(
installConfig.Config.Platform.AWS.Region,
installConfig.Config.ControlPlane.Architecture,
topology,
)
// Select best available instance type
mpool.InstanceType, err = aws.PreferredInstanceType(
ctx,
installConfig.AWS,
instanceTypes,
mpool.Zones,
)
if err != nil {
// If selection fails, use first type from list as fallback
logrus.Warn(errors.Wrap(err, "failed to find default instance type"))
mpool.InstanceType = instanceTypes[0]
}
}
// Filter zones if instance type is not available in all default zones
if zoneDefaults {
mpool.Zones, err = aws.FilterZonesBasedOnInstanceType(
ctx,
installConfig.AWS,
mpool.InstanceType,
mpool.Zones,
)
if err != nil {
logrus.Warn(errors.Wrap(err, "failed to filter zone list"))
}
}

Key Points:
- Only runs when user hasn't specified an instance type
- Determines topology (HA vs single-node) based on replica count
- Calls InstanceTypes() to get the priority list
- Calls PreferredInstanceType() to select the best available type
- Filters zones if the selected instance type isn't available in all zones
### 4. Machine Manifest Generation
Location: pkg/asset/machines/aws/machines.go
The Machines() function generates Kubernetes Machine manifests with the selected instance type:
// Machines returns a list of machines for a machinepool
func Machines(clusterID string, region string, subnets aws.SubnetsByZone,
pool *types.MachinePool, role, userDataSecret string,
userTags map[string]string, publicSubnet bool) ([]machineapi.Machine,
*machinev1.ControlPlaneMachineSet, error) {
mpool := pool.Platform.AWS
// Create machines for each replica
for idx := int64(0); idx < total; idx++ {
zone := mpool.Zones[int(idx)%len(mpool.Zones)]
subnet, ok := subnets[zone]
// Create provider config with selected instance type
provider, err := provider(&machineProviderInput{
clusterID: clusterID,
region: region,
subnet: subnet.ID,
instanceType: mpool.InstanceType, // Uses selected instance type
osImage: mpool.AMIID,
zone: zone,
role: role,
// ... other fields
})
// Create Machine object
machine := machineapi.Machine{
Spec: machineapi.MachineSpec{
ProviderSpec: machineapi.ProviderSpec{
Value: &runtime.RawExtension{Object: provider},
},
},
}
machines = append(machines, machine)
}
return machines, controlPlaneMachineSet, nil
}

The provider() function creates the AWS machine provider configuration:
func provider(in *machineProviderInput) (*machineapi.AWSMachineProviderConfig, error) {
config := &machineapi.AWSMachineProviderConfig{
TypeMeta: metav1.TypeMeta{
APIVersion: "machine.openshift.io/v1beta1",
Kind: "AWSMachineProviderConfig",
},
InstanceType: in.instanceType, // Set from selected instance type
// ... other configuration fields
}
return config, nil
}

Key Points:
- Generates Machine manifests for each replica
- Uses the instance type selected by PreferredInstanceType()
- Creates AWSMachineProviderConfig with the instance type
- Distributes machines across availability zones
### Execution Flow Summary
1. User creates install-config → Specifies region (and optionally instance type)
2. Master machine configuration (master.go):
   - If the instance type is not specified, calls InstanceTypes() to get the priority list
   - Calls PreferredInstanceType() to select the best available type
3. Instance type selection (instance_types.go):
   - Queries the AWS EC2 API to check availability
   - Returns the first type available in all zones
4. Machine manifest generation (machines.go):
   - Creates Machine objects with the selected instance type
   - Writes manifests to disk
## Related Code References
- Instance type defaults: pkg/types/aws/defaults/platform.go
- Instance type selection logic: pkg/asset/machines/aws/instance_types.go
- Machine manifest generation: pkg/asset/machines/aws/machines.go
- Master machine configuration: pkg/asset/machines/master.go
OCP-29648
anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      amiID: ami-01095d1967818437c
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      amiID: ami-0c1a8e216e46bb60c
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External

# Check the master node AMI (expect ami-0c1a8e216e46bb60c)
echo "Master 节点 AMI:"
aws ec2 describe-instances \
--region "${REGION}" \
--filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
"Name=tag:Name,Values=*master*" \
"Name=instance-state-name,Values=running" \
--output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq
# 查看 worker 节点的 AMI(应该看到 ami-01095d1967818437c)
echo "Worker 节点 AMI:"
aws ec2 describe-instances \
--region "${REGION}" \
--filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
"Name=tag:Name,Values=*worker*" \
"Name=instance-state-name,Values=running" \
--output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq
Master 节点 AMI:
ami-0c1a8e216e46bb60c
Worker 节点 AMI:
ami-01095d1967818437cOCP-21531
Verify the Pull Secret:
anan@think:~/works/openshift-versions/421nightly$ vi ../auth.json
anan@think:~/works/openshift-versions/421nightly$ oc adm release extract --command openshift-install --from=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804 -a ../auth.json
anan@think:~/works/openshift-versions/421nightly$ du -h openshift-install
654M    openshift-install

Export variables:
anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804
anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-01095d1967818437c

Using an installer binary of a different version to install the cluster:
anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install version
../421rc0/openshift-install 4.21.0-rc.0
built from commit 8f88b34924c2267a2aa446dcdc6ccdd5260f9c45
release image quay.io/openshift-release-dev/ocp-release@sha256:ecde621d6f74aa1af4cd351f8b571ca2a61bbc32826e49cdf1b7fbff07f04ede
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown
release architecture unknown
default architecture amd64
anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install create cluster
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown
INFO Credentials loaded from the "default" profile in file "/home/anan/.aws/credentials"
WARNING Found override for OS Image. Please be warned, this is not advised
INFO Successfully populated MCS CA cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z
INFO Successfully populated MCS TLS cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z
INFO Credentials loaded from the AWS config using "SharedConfigCredentials: /home/anan/.aws/credentials" provider
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Please be warned, this is not advised

Check the installed cluster version and the amiID that was used:
anan@think:~/works/openshift-versions/work3$ export KUBECONFIG=/home/anan/works/openshift-versions/work3/auth/kubeconfig
anan@think:~/works/openshift-versions/work3$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version   4.21.0-0.nightly-2025-12-22-170804   True        False         71m     Cluster version is 4.21.0-0.nightly-2025-12-22-170804

$ oc get machineset.machine.openshift.io -n openshift-machine-api -o json | \
jq -r '.items[] | .spec.template.spec.providerSpec.value.ami.id'
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c

OCP-22425
Cluster A:
anan@think:~/works/openshift-versions/work3$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-106-174.ec2.internal Ready control-plane,master 8h v1.34.2
ip-10-0-157-14.ec2.internal Ready control-plane,master 8h v1.34.2
ip-10-0-30-65.ec2.internal Ready worker 8h v1.34.2
ip-10-0-54-54.ec2.internal Ready worker 8h v1.34.2
ip-10-0-74-122.ec2.internal Ready worker 8h v1.34.2
ip-10-0-76-206.ec2.internal    Ready   control-plane,master   8h   v1.34.2

anan@think:~/works/openshift-versions/work3$ oc get route -n openshift-authentication
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
oauth-openshift oauth-openshift.apps.weli-test.qe.devcluster.openshift.com oauth-openshift 6443 passthrough/Redirect None
anan@think:~/works/openshift-versions/work3$ oc get po -n openshift-apiserver
NAME READY STATUS RESTARTS AGE
apiserver-6b767844c6-2jztv 2/2 Running 0 8h
apiserver-6b767844c6-g4rck 2/2 Running 0 8h
apiserver-6b767844c6-jzv4z   2/2     Running   0          8h

anan@think:~/works/openshift-versions/work3$ oc rsh -n openshift-apiserver apiserver-6b767844c6-2jztv
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1#

Cluster B:
anan@think:~/works/openshift-versions/works2$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-122-6.ec2.internal Ready control-plane,master 27m v1.34.2
ip-10-0-134-89.ec2.internal Ready control-plane,master 27m v1.34.2
ip-10-0-141-244.ec2.internal Ready worker 13m v1.34.2
ip-10-0-31-52.ec2.internal Ready worker 21m v1.34.2
ip-10-0-67-21.ec2.internal Ready control-plane,master 27m v1.34.2
ip-10-0-96-196.ec2.internal   Ready   worker                 21m   v1.34.2

anan@think:~/works/openshift-versions/works2$ oc get po -n openshift-apiserver
NAME READY STATUS RESTARTS AGE
apiserver-574bdcd758-j85sh 2/2 Running 0 10m
apiserver-574bdcd758-l98ph 2/2 Running 0 10m
apiserver-574bdcd758-p922j 2/2 Running 0 8m8s
anan@think:~/works/openshift-versions/works2$ oc rsh -n openshift-apiserver apiserver-574bdcd758-j85sh
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1# curl -k https://oauth-openshift.apps.weli-test.qe.devcluster.openshift.com/healthz
ok
sh-5.1#