and renderedVersion 5 I see all of the monitors reset to default.
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184625Z" level=info msg="Reset CPU monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184803Z" level=info msg="Reset disk monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184885Z" level=info msg="Reset memory monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.192619Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.323268Z" level=info msg="Synced device to renderedVersion: 5"
Defaults
- Sampling Interval: 1 minute
This means that each of the following alert conditions is sampled every minute, and they must remain true for the entire specified duration. This approach is conservative, but I’ve set it as the default starting point.
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "1h",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "10m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeCritical,
Percentage: 90,
Duration: "30m",
Description: "", // use generated description
},
{
Severity: v1alpha1.ResourceAlertSeverityTypeWarning,
Percentage: 80,
Duration: "1h",
Description: "", // use generated description
},
Next, you appear to have defined a new Disk alert rule for renderedVersion 6:
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726749Z" level=info msg="Updating sampling interval from 1m0s to 10m0s"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726891Z" level=info msg="Updated monitor: Disk"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.738331Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.865897Z" level=info msg="Synced device to renderedVersion: 6"
Here you defined a rule for the path /, which could be read-only. You also defined a duration of 1s together with a sampling interval of 10m. We lack validation here: the duration should not be less than the sampling interval, since the condition is only sampled once per interval.
[
{
"path": "/",
"alertRules": [
{
"duration": "1s",
"severity": "Warning",
"percentage": 1,
"description": "Disk space for application data is >1% full."
},
{
"duration": "1s",
"severity": "Critical",
"percentage": 90,
"description": "Disk space for application data is >90% full."
}
],
"monitorType": "Disk",
"samplingInterval": "10m"
}
]
I will create a bug for this.
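For reference, the missing check could look roughly like the sketch below. This is not the agent's actual code: `alertRule` and `validateRuleDurations` are hypothetical names, and the struct mirrors only the two fields needed for the comparison. It assumes both values are Go duration strings that `time.ParseDuration` accepts, as in the spec above ("1s", "10m").

```go
package main

import (
	"fmt"
	"time"
)

// alertRule is a hypothetical stand-in for the real alert rule type;
// only the fields needed for this check are included.
type alertRule struct {
	Severity string
	Duration string
}

// validateRuleDurations returns an error for the first rule whose duration
// is shorter than the monitor's sampling interval, or nil if all rules pass.
func validateRuleDurations(samplingInterval string, rules []alertRule) error {
	interval, err := time.ParseDuration(samplingInterval)
	if err != nil {
		return fmt.Errorf("invalid sampling interval %q: %w", samplingInterval, err)
	}
	for _, r := range rules {
		d, err := time.ParseDuration(r.Duration)
		if err != nil {
			return fmt.Errorf("invalid duration %q for %s rule: %w", r.Duration, r.Severity, err)
		}
		if d < interval {
			return fmt.Errorf("%s rule duration %s is shorter than sampling interval %s",
				r.Severity, d, interval)
		}
	}
	return nil
}

func main() {
	// The Disk monitor from renderedVersion 6: 1s durations, 10m sampling.
	rules := []alertRule{
		{Severity: "Warning", Duration: "1s"},
		{Severity: "Critical", Duration: "1s"},
	}
	fmt.Println(validateRuleDurations("10m", rules))

	// One of the default rules: 30m duration against 1m sampling.
	fmt.Println(validateRuleDurations("1m", []alertRule{{Severity: "Critical", Duration: "30m"}}))
}
```

With a check like this, the renderedVersion 6 spec would be rejected at validation time, while the defaults listed earlier would pass.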