and renderedVersion 5 I see all of the monitors reset to default.
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184625Z" level=info msg="Reset CPU monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184803Z" level=info msg="Reset disk monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184885Z" level=info msg="Reset memory monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.192619Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.323268Z" level=info msg="Synced device to renderedVersion: 5"
Defaults
- Sampling Interval: 1 minute
This means that each of the following alert conditions is sampled every minute, and the condition must remain true for the entire specified duration before the alert fires. This approach is conservative, but I’ve set it as the default starting point.
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "1h",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "10m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "1h",
Description: "", // use generated description
},
Next, you appear to have defined a new Disk alert rule for renderedVersion 6:
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726749Z" level=info msg="Updating sampling interval from 1m0s to 10m0s"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726891Z" level=info msg="Updated monitor: Disk"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.738331Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.865897Z" level=info msg="Synced device to renderedVersion: 6"
Here you defined a rule for the path /, which could be read-only. You also defined a duration of 1s together with a sampling interval of 10m. We lack validation here: the duration should not be less than the sampling interval.
[
{
"path": "/",
"alertRules": [
{
"duration": "1s",
"severity": "Warning",
"percentage": 1,
"description": "Disk space for application data is >1% full."
},
{
"duration": "1s",
"severity": "Critical",
"percentage": 90,
"description": "Disk space for application data is >90% full."
}
],
"monitorType": "Disk",
"samplingInterval": "10m"
}
]
I will create a bug for this.
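As a rough idea of what the missing validation could look like, here is a minimal sketch. The `alertRule` and `monitorSpec` types and the `validateDurations` function are hypothetical and only mirror the JSON fields above; they are not the agent's actual types or API.

```go
package main

import (
	"fmt"
	"time"
)

// alertRule and monitorSpec are hypothetical types that mirror the JSON
// fields shown above; the real agent types may differ.
type alertRule struct {
	Severity string
	Duration string
}

type monitorSpec struct {
	MonitorType      string
	SamplingInterval string
	AlertRules       []alertRule
}

// validateDurations returns an error for the first alert rule whose
// duration is shorter than the monitor's sampling interval.
func validateDurations(m monitorSpec) error {
	interval, err := time.ParseDuration(m.SamplingInterval)
	if err != nil {
		return fmt.Errorf("invalid samplingInterval %q: %w", m.SamplingInterval, err)
	}
	for _, r := range m.AlertRules {
		d, err := time.ParseDuration(r.Duration)
		if err != nil {
			return fmt.Errorf("invalid duration %q: %w", r.Duration, err)
		}
		if d < interval {
			return fmt.Errorf("%s %s rule: duration %s is shorter than sampling interval %s",
				m.MonitorType, r.Severity, d, interval)
		}
	}
	return nil
}

func main() {
	// The spec from renderedVersion 6: 1s durations with a 10m interval.
	spec := monitorSpec{
		MonitorType:      "Disk",
		SamplingInterval: "10m",
		AlertRules: []alertRule{
			{Severity: "Warning", Duration: "1s"},
			{Severity: "Critical", Duration: "1s"},
		},
	}
	if err := validateDurations(spec); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Rejecting the spec up front seems preferable to silently accepting a rule that can never be evaluated at the requested granularity, though where the check should live (API server or agent) is an open question.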