Created
January 30, 2026 13:17
# Hybrid Observability Architecture Documentation

This document describes the unified monitoring and logging strategy for a hybrid infrastructure spanning three environments (INT, PREPROD, PRD) and covering on-premises Service Fabric clusters, SQL Servers, an AS400, and Kubernetes.

## 1. High-Level Architecture Diagram

The diagram below shows how telemetry flows from each source through Grafana Alloy collectors to the centralized LGTM (Loki, Grafana, Tempo, Mimir) stack. Tempo is optional in this design and does not appear in the diagram.

```mermaid
graph TD
    subgraph "On-Premises Infrastructure (INT/PREPROD/PRD)"
        subgraph "Windows Node (Service Fabric)"
            SF_App[SF Services]
            SF_Alloy[Alloy Service]
            OS_Metrics[CPU/RAM/Disk/Process]
        end
        subgraph "SQL Server Node"
            SQL_DB[(SQL Server)]
            SQL_Alloy[Alloy Service]
        end
        subgraph "Legacy Layer"
            AS400[AS400 / iSeries]
        end
    end
    subgraph "Kubernetes Cluster"
        K8s_Alloy[Alloy DaemonSet/Deployment]
        JDBC_Exp[Custom JDBC Exporter]
        BB_Exp[Blackbox Exporter]
        subgraph "LGTM Stack"
            Mimir[(Mimir - Metrics)]
            Loki[(Loki - Logs)]
            Grafana[Grafana - Visualization]
        end
    end

    %% Flows
    OS_Metrics -->|Scrape| SF_Alloy
    SF_App -->|Logs| SF_Alloy
    SQL_DB -->|Scrape| SQL_Alloy
    SF_Alloy -->|Push Remote Write| Mimir
    SF_Alloy -->|Push Loki Write| Loki
    SQL_Alloy -->|Push| Mimir
    AS400 -->|JDBC Query| JDBC_Exp
    JDBC_Exp -->|Metrics| K8s_Alloy
    BB_Exp -->|ICMP/HTTP Probes| K8s_Alloy
    K8s_Alloy -->|Push| Mimir
    K8s_Alloy -->|Push| Loki
    Grafana -->|Query| Mimir
    Grafana -->|Query| Loki
```

## 2. Product & Tool Glossary

For readers unfamiliar with the stack, here is a brief description of each core component.

### The Observability Pipeline

- Grafana Alloy: A vendor-neutral telemetry collector and the successor to Grafana Agent. It is built from pluggable components that scrape, transform, and forward metrics, logs, and traces.
- Prometheus: An open-source monitoring system whose text exposition format is the de facto standard for metrics. In this architecture, Alloy scrapes endpoints that expose metrics in the Prometheus format.
- Prometheus Exporters: Small helper programs that translate non-Prometheus data (such as Windows OS statistics or SQL query results) into the Prometheus format so that Alloy can scrape it.
- Blackbox Exporter: Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP to check whether they are reachable from the outside.
- JDBC Exporter: A custom bridge that uses Java Database Connectivity (JDBC) to run queries against legacy databases (here, the AS400) and turn the results into metrics.
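
To make the idea of an outside-in probe concrete, here is a minimal Python sketch of a TCP reachability check. It is not how the Blackbox Exporter is implemented; `tcp_probe` and its parameters are hypothetical names chosen for illustration. The result keys borrow the `probe_success` and `probe_duration_seconds` metric names that the Blackbox Exporter exposes.

```python
import socket
import time
from typing import Dict


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Dict[str, float]:
    """Try to open a TCP connection and report the outcome as gauge-style values."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            success = 1
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        success = 0
    return {
        "probe_success": success,
        "probe_duration_seconds": time.monotonic() - start,
    }
```

A call such as `tcp_probe("127.0.0.1", 1433)` returns `probe_success` of 1 only if something accepts connections on that port, which is the same "is it alive from the outside" question the probes in this architecture answer.
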
### The LGTM Backend Stack

- Grafana: The visualization layer used to build dashboards and alerts.
- Loki: A horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus.
- Mimir: The metrics backend. It provides long-term, horizontally scalable storage for Prometheus data.
- Tempo: A high-volume distributed tracing backend. It is optional here and does not appear in the data flows above.
## 3. How Alloy Works with LGTM (Internal Pipeline)

Alloy acts as the "glue" between your data sources and the LGTM stack. Its configuration is component-based and written in an HCL-inspired syntax, so the pipeline is declared rather than scripted.

### The Alloy Pipeline Diagram

```mermaid
graph LR
    subgraph "Alloy Internal Pipeline"
        Discovery[<b>1. Discovery</b><br/>Find Targets<br/>K8s API / Static Config]
        Collection[<b>2. Collection</b><br/>Scrape Metrics<br/>Tail Logs]
        Processing[<b>3. Processing</b><br/>Relabel / Filter<br/>Add 'env' labels]
        Forwarding[<b>4. Forwarding</b><br/>Batching / Retry]
    end
    subgraph "LGTM Stack (Storage)"
        Mimir[Mimir<br/>Metrics]
        Loki[Loki<br/>Logs]
    end
    Discovery --> Collection
    Collection --> Processing
    Processing --> Forwarding
    Forwarding -->|prometheus.remote_write| Mimir
    Forwarding -->|loki.write| Loki
```

Key Stages Explained:

- Discovery: Alloy identifies what needs to be monitored. In the Kubernetes cluster, it watches the Kubernetes API to find pods. On Windows, it relies on static configuration, for example to locate SQL Server ports.
- Collection: Alloy "pulls" metrics from endpoints (such as the Windows Exporter or Traefik) and "tails" files for logs.
- Processing (Relabeling): This is where Alloy adds the critical env="prd" or workload="sql" labels. It can also drop sensitive data or filter out noise before anything is sent over the network.
- Forwarding: Alloy batches data to reduce network overhead and retries failed sends with exponential backoff. If Mimir or Loki in the Kubernetes cluster is temporarily unreachable, Alloy buffers the data locally (metrics go through a write-ahead log) and resends it once the backend is reachable again. The buffer is finite, so a long outage can still cause data loss.
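
The forwarding stage can be illustrated with a small Python sketch of batching, exponential backoff, and a bounded buffer. This is an illustration of the general technique under simplifying assumptions, not Alloy's actual implementation: the `Forwarder` class, its parameters, and the purely in-memory buffer are all hypothetical, and jitter is omitted to keep the delays easy to follow.

```python
import time
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


class Forwarder:
    """Batches items and retries failed sends with exponential backoff."""

    def __init__(
        self,
        send: Callable[[List[str]], None],
        batch_size: int = 100,
        max_buffered: int = 10_000,
        base_delay: float = 0.5,
        max_delay: float = 30.0,
        max_attempts: int = 5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.send = send
        self.batch_size = batch_size
        # A bounded deque drops the oldest item when a new one arrives at capacity.
        self.buffer: Deque[str] = deque(maxlen=max_buffered)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_attempts = max_attempts
        self.sleep = sleep

    def enqueue(self, item: str) -> None:
        self.buffer.append(item)

    def flush(self) -> bool:
        """Send buffered items batch by batch; return False if a batch gave up."""
        while self.buffer:
            batch = list(islice(self.buffer, self.batch_size))
            if not self._send_with_retry(batch):
                return False  # The batch stays buffered for the next flush.
            for _ in batch:
                self.buffer.popleft()
        return True

    def _send_with_retry(self, batch: List[str]) -> bool:
        delay = self.base_delay
        for attempt in range(self.max_attempts):
            try:
                self.send(batch)
                return True
            except ConnectionError:
                if attempt == self.max_attempts - 1:
                    break
                self.sleep(delay)
                # Double the wait after each failure, up to max_delay.
                delay = min(delay * 2, self.max_delay)
        return False
```

The bounded buffer is the important design choice: it caps memory use during an outage at the cost of dropping the oldest data once the cap is reached.
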
## 4. Detailed Component Roles

### A. Grafana Alloy Deployment Modes

- Service Mode (Windows): Installed as a native Windows Service on the SF and SQL nodes. It focuses on OS-level health and local log files.
- Deployment/DaemonSet Mode (K8s): Runs inside the cluster. It scrapes the Blackbox Exporter probes and the AS400 JDBC bridge.

### B. On-Premises Workloads

- Service Fabric (SF) Cluster:
  - Logs: Alloy tails application log files and reads Windows Event Logs.
  - OS Metrics: Alloy uses prometheus.exporter.windows to capture system-level health (CPU, RAM, disk, processes).
- SQL Servers:
  - Role: Alloy collects database-specific health metrics through its built-in MSSQL exporter component.
- AS400 (Custom Exporter):
  - The Bridge: A container in K8s runs the JDBC Exporter. It connects to the AS400 over the network and queries system utilization data, which Alloy then scrapes.
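
The query-to-metrics bridge can be sketched in a few lines of Python. This is only an illustration of the pattern, not the custom JDBC Exporter itself: an in-memory SQLite database stands in for the AS400 JDBC connection, and the `system_status` table, the `as400_` prefix, and the metric names are hypothetical. Label values are not escaped, which a real exporter would need to handle.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple


def collect(conn, query: str) -> List[Tuple[str, float]]:
    """Run a query that returns (name, value) rows and coerce values to float."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        return [(str(name), float(value)) for name, value in cur.fetchall()]
    finally:
        cur.close()


def render(
    samples: List[Tuple[str, float]],
    prefix: str = "as400_",
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Render samples as gauges in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted((labels or {}).items()))
    lines = []
    for name, value in samples:
        metric = prefix + name
        lines.append(f"# TYPE {metric} gauge")
        if label_str:
            lines.append(f"{metric}{{{label_str}}} {value}")
        else:
            lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in database with made-up utilization figures.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE system_status (name TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO system_status VALUES (?, ?)",
        [("cpu_utilization_percent", 42.5), ("active_jobs", 118)],
    )
    query = "SELECT name, value FROM system_status ORDER BY name"
    print(render(collect(conn, query), labels={"env": "int"}), end="")
```

In a real bridge, the rendered text would be served on an HTTP endpoint for Alloy to scrape at its regular interval.
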
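## 5. Environment Isolation Strategy

To manage the three environments efficiently, the design uses Label-Based Multi-Tenancy: every metric and log line is tagged with an env label at the source, and each environment is stored under its own Mimir/Loki tenant.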

| Environment | Source Label | Backend Destination |
| --- | --- | --- |
| INT | env="int" | Mimir/Loki (Tenant: INT) |
| PREPROD | env="preprod" | Mimir/Loki (Tenant: PREPROD) |
| PRD | env="prd" | Mimir/Loki (Tenant: PRD) |
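
The mapping in the table can be sketched in Python as follows. Mimir and Loki select a tenant from the `X-Scope-OrgID` HTTP header on each request, so the sketch derives that header from the env label. The helper names and the `TENANT_BY_ENV` dictionary are hypothetical, and in practice this routing would be expressed in the Alloy configuration rather than in application code.

```python
from typing import Dict

# Tenant IDs taken from the table above.
TENANT_BY_ENV: Dict[str, str] = {"int": "INT", "preprod": "PREPROD", "prd": "PRD"}


def add_env_label(labels: Dict[str, str], env: str) -> Dict[str, str]:
    """Return a copy of the label set with the env label attached."""
    if env not in TENANT_BY_ENV:
        raise ValueError(f"unknown environment: {env!r}")
    return {**labels, "env": env}


def tenant_headers(labels: Dict[str, str]) -> Dict[str, str]:
    """Pick the tenant header for a push request from the env label."""
    env = labels.get("env")
    if env not in TENANT_BY_ENV:
        raise ValueError("label set has no recognised env label")
    # X-Scope-OrgID is the header Mimir and Loki read to select a tenant.
    return {"X-Scope-OrgID": TENANT_BY_ENV[env]}
```

Rejecting unknown environments early keeps mislabeled data from landing in the wrong tenant.
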
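## 6. Summary of Benefits

- Unified Tooling: One Alloy configuration language covers the Windows nodes and Kubernetes, including the scraping of AS400 metrics through the JDBC Exporter.
- Security: On-premises collectors push data over outbound connections, so no inbound scrape ports have to be opened on those nodes, which minimizes firewall changes on-premises.
- Resilience: Alloy's batching, retry, and buffering logic lets it ride out minor network blips without losing data.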