@hexfusion
Last active October 5, 2024 16:26

At renderedVersion 5, I see all of the monitors reset to their defaults.

Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184625Z" level=info msg="Reset CPU monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184803Z" level=info msg="Reset disk monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.184885Z" level=info msg="Reset memory monitor alerts"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.192619Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:51:54 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:51:54.323268Z" level=info msg="Synced device to renderedVersion: 5"

Defaults

  • Sampling Interval: 1 minute

This means that each of the following alert conditions is sampled every minute, and a condition must remain true for its entire specified duration before the alert fires. This approach is conservative, but I’ve set it as the default starting point.

CPU

			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeCritical,
				Percentage:  90,
				Duration:    "30m",
				Description: "", // use generated description
			},
			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeWarning,
				Percentage:  80,
				Duration:    "1h",
				Description: "", // use generated description
			},

Disk

			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeCritical,
				Percentage:  90,
				Duration:    "10m",
				Description: "", // use generated description
			},
			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeWarning,
				Percentage:  80,
				Duration:    "30m",
				Description: "", // use generated description
			},

Memory

			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeCritical,
				Percentage:  90,
				Duration:    "30m",
				Description: "", // use generated description
			},
			{
				Severity:    v1alpha1.ResourceAlertSeverityTypeWarning,
				Percentage:  80,
				Duration:    "1h",
				Description: "", // use generated description
			},

Next, you appear to have defined a new Disk alert rule for renderedVersion 6:

Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726749Z" level=info msg="Updating sampling interval from 1m0s to 10m0s"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.726891Z" level=info msg="Updated monitor: Disk"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.738331Z" level=info msg="Spec upgrade complete: clearing rollback spec"
Oct 04 14:55:53 localhost.localdomain flightctl-agent[1075]: time="2024-10-04T14:55:53.865897Z" level=info msg="Synced device to renderedVersion: 6"

Here you defined a rule for the path /, which could be read-only. You also defined a duration of 1s together with a sampling interval of 10m. So we lack some validation here: the duration should not be less than the sampling interval.

[
  {
    "path": "/",
    "alertRules": [
      {
        "duration": "1s",
        "severity": "Warning",
        "percentage": 1,
        "description": "Disk space for application data is >1% full."
      },
      {
        "duration": "1s",
        "severity": "Critical",
        "percentage": 90,
        "description": "Disk space for application data is >90% full."
      }
    ],
    "monitorType": "Disk",
    "samplingInterval": "10m"
  }
]
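The missing check could look roughly like the following Go sketch. The types `alertRule` and `monitorSpec` and the function `validateDurations` are hypothetical and mirror only the JSON fields shown above. They are not the flightctl API types, and the actual fix may differ.

```go
package main

import (
	"fmt"
	"time"
)

// alertRule and monitorSpec mirror only the JSON fields shown above.
// They are illustrative types, not the flightctl API types.
type alertRule struct {
	Severity   string
	Duration   string
	Percentage float64
}

type monitorSpec struct {
	MonitorType      string
	SamplingInterval string
	AlertRules       []alertRule
}

// validateDurations rejects any alert rule whose duration is shorter than
// the monitor's sampling interval. Hypothetical helper for illustration.
func validateDurations(m monitorSpec) error {
	interval, err := time.ParseDuration(m.SamplingInterval)
	if err != nil {
		return fmt.Errorf("invalid samplingInterval %q: %w", m.SamplingInterval, err)
	}
	for _, r := range m.AlertRules {
		d, err := time.ParseDuration(r.Duration)
		if err != nil {
			return fmt.Errorf("invalid duration %q in %s rule: %w", r.Duration, r.Severity, err)
		}
		if d < interval {
			return fmt.Errorf("%s rule: duration %s is shorter than samplingInterval %s",
				r.Severity, d, interval)
		}
	}
	return nil
}

func main() {
	// The Disk monitor from renderedVersion 6.
	spec := monitorSpec{
		MonitorType:      "Disk",
		SamplingInterval: "10m",
		AlertRules: []alertRule{
			{Severity: "Warning", Duration: "1s", Percentage: 1},
			{Severity: "Critical", Duration: "1s", Percentage: 90},
		},
	}
	// Returns a non-nil error because 1s is shorter than 10m.
	fmt.Println(validateDurations(spec))
}
```

Rejecting the spec up front is one option. Another would be to clamp the duration up to the sampling interval and log a warning. Which behavior is preferable is an open design choice.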
@hexfusion

I will create a bug for this.
