@enachb
Created February 3, 2026 02:41
Sensemesh Agentic CEL Rule Engine - System Design Document

System Design Document: Agentic CEL Rule Engine

Status: Draft v1.0
Author: Sensemesh Architecture Team
Context: Event-Driven Architecture, gRPC/Protobuf, Go, NATS


1. Executive Summary

This document outlines the architecture for a high-performance event filtering system. The system allows users to define data filtering rules using natural language (e.g., "Alert me if a person is detected outside business hours").

The solution utilizes an LLM as a Transpiler to convert user intent into Common Expression Language (CEL) scripts. These scripts are strictly validated against the Sensemesh Protobuf Schema before being deployed to a high-frequency Data Plane, ensuring safety, type correctness, and microsecond-level evaluation latency.


2. System Architecture

The system follows a "Compiler-Runtime" pattern, separating the slow, interactive rule generation (Control Plane) from the fast, real-time evaluation (Data Plane).

2.0 End-to-End Flow Overview

flowchart LR
    subgraph Control Plane
        A[User Chat] --> B[Chat Service<br/>Gemini LLM]
        B --> C{CEL Validator}
        C -- valid --> D[(DB: CelRule /<br/>MissionRule)]
        C -- invalid --> B
    end

    subgraph Data Plane
        D -- startup/reload --> E[CEL Engine<br/>cached Programs]
        F[NATS Event Stream] --> G[Alert Worker]
        G --> E
        E -- match --> H[Action Dispatcher<br/>SMS / Email / Slack / etc.]
    end

2.0.1 Full Lifecycle: Chat to Alert Delivery

This diagram shows the complete path from a user creating a rule via chat, through to a real-world event triggering an alert delivery.

sequenceDiagram
    participant User
    participant ChatSvc as Chat Service
    participant LLM as Gemini 1.5 Flash
    participant Val as CEL Validator
    participant DB as PostgreSQL
    participant Worker as Alert Worker
    participant NATS as NATS Bus
    participant Camera as Camera / OD Agent
    participant SMS as SMS Gateway

    rect rgb(240, 248, 255)
    Note over User,DB: Phase 1 — Rule Creation (Control Plane)
    User->>ChatSvc: "Text me if a person is seen on the front porch after hours"
    Note over ChatSvc: Extract Principal (tenant_id, facility_ids) from gRPC context
    ChatSvc->>LLM: System prompt (scoped devices, facilities) + TriggerEvent schema + tool defs

    loop Self-Healing Validation
        LLM->>ChatSvc: Tool: create_cel_rule(expression, actions)
        ChatSvc->>Val: Compile expression against TriggerEvent proto
        alt Invalid
            Val-->>ChatSvc: error (bad field / type mismatch)
            ChatSvc->>LLM: FunctionResponse with error
        end
    end

    Val-->>ChatSvc: is_valid=true
    Note over ChatSvc: Server-side: stamp tenant_id + facility_id (UUIDs) from Principal
    ChatSvc->>DB: Save AlertDefinition + CelRule + AlertAction(SMS) (tenant_id/facility_id as UUID)
    ChatSvc->>User: "Done! You'll get a text when a person is detected after hours."
    end

    rect rgb(245, 255, 245)
    Note over Worker,SMS: Phase 2 — Rule Deployment (Data Plane Boot)
    Worker->>DB: ListAlertDefinitions(enabled_only=true)
    DB-->>Worker: AlertDefinitions with CelRules + tenant_id/facility_id (UUIDs)
    Worker->>Worker: Compile user CEL → cache cel.Programs + build tenant guard maps (UUID→string once)
    end

    rect rgb(255, 248, 240)
    Note over Camera,SMS: Phase 3 — Live Event Evaluation (Data Plane Runtime)
    Camera->>NATS: TriggerEvent (person detected, 22:30, hw=porch-cam)
    NATS->>Worker: Deliver event (broad subscription: *.*.*.*.detection.>)
    Worker->>Worker: Go guard: event.customer_id == def.tenant_id && event.facility_id == def.facility_id
    Worker->>Worker: Guard passed (nanoseconds)
    Worker->>Worker: CEL Eval: event.label=="person" && !event.environment.is_business_hours
    Worker->>Worker: Result: true
    Worker->>DB: Create AlertInstance (PENDING)
    Worker->>SMS: Send "Person detected on front porch at 22:30"
    SMS-->>Worker: Delivery confirmed
    Worker->>DB: Update AlertInstance (DELIVERED)
    end

2.1 The Data Contract (The "Source of Truth")

We use the existing Sensemesh protobuf schema to drive the entire system. No new .proto files are required. The schema generates the Go structs, defines the gRPC services, and automates the CEL validation.

There are two evaluation contexts, each backed by an existing proto message:

Context A: Alert Rules — TriggerEvent

Defined in proto/api/alerting.proto. This is the primary event envelope that arrives via NATS and is evaluated against CelRule expressions stored in AlertDefinition.

// proto/api/alerting.proto (lines 168-188)
message TriggerEvent {
  string event_id = 1;
  string region = 2;
  string customer_id = 3;
  string facility_id = 4;
  string hardware_id = 5;
  string data_type = 6;           // "detection", "sensor", etc.
  string stage = 7;
  string label = 8;               // "person", "vehicle", etc.
  int64 occurred_at = 9;          // epoch ms
  google.protobuf.Struct facts = 10;       // flexible JSON payload
  string schema_id = 11;
  string schema_hash = 12;
  string nats_subject = 13;
  google.protobuf.Struct environment = 14; // time, weather, etc.
  map<string, string> tags = 15;
  optional float confidence = 16;          // 0.0-1.0
  optional string severity = 17;           // LOW, MEDIUM, HIGH, CRITICAL
}

Context B: Mission Rules — MissionDetectionPayload

Defined in proto/mission/mission.proto. Used for per-mission CEL rules that evaluate against real-time detection events during active missions.

// proto/mission/mission.proto (lines 146-159)
message MissionDetectionPayload {
  string class_name = 1;          // "person", "vehicle", etc.
  float confidence = 2;           // 0.0-1.0
  LatLngMicro location = 3;      // GPS coordinates
  float bbox_x_min = 4;
  float bbox_y_min = 5;
  float bbox_x_max = 6;
  float bbox_y_max = 7;
  bytes detector_id = 8;          // UUID of detecting device
  string detector_label = 9;      // Human-readable device name
}

2.2 Rule Storage (Already Defined)

Rules and their actions are already modeled in the proto schema:

| Concept | Proto Message | File |
|---|---|---|
| Alert rule | CelRule | alerting.proto:29 |
| Alert rule grouping | AlertDefinition | alerting.proto:12 |
| Alert actions | AlertAction (SMS, Email, Webhook, Slack, Teams, Discord, PagerDuty, Script) | alerting.proto:54 |
| Mission rule | MissionRule | mission_rules.proto:27 |
| Mission actions | MissionRuleAction (SMS, Discord) | mission_rules.proto:20 |
| Alert instance tracking | AlertInstance | alerting.proto:114 |

3. The Control Plane (Rule Definition)

Goal: Allow users to safely create rules via chat.

Challenges:

  • LLMs can hallucinate invalid variable names (e.g., event.clownDetected instead of event.label == "clown").
  • Users must never see or create rules that cross tenant boundaries.
  • Mandatory filters (tenant_id, facility_id) must be enforced server-side, not delegated to the LLM.

3.0 Tenant Context Injection and Isolation

3.0.1 The Trust Boundary

The authenticated user's tenant_id and facility_id(s) are extracted from the gRPC Principal (via security.FromContext(ctx)) — the same mechanism used by the RLS interceptor across all Sensemesh services. These values are never sent to the LLM and never embedded in the LLM-generated CEL expression. Instead, they are enforced through three independent mechanisms:

  1. System prompt scoping — the LLM only sees devices/facilities belonging to the user's tenant (scoped tool results).
  2. Go-level tenant guard — each rule record stores the owning tenant_id and facility_id. At evaluation time, a plain Go if check compares these fields against the incoming event's customer_id and facility_id. This runs before the CEL engine is invoked — zero CEL overhead for tenant isolation.
  3. DB-level RLS — rules are stored with tenant_id/facility_id columns and the RLS interceptor ensures queries only return the user's own rules.
flowchart TD
    A[Authenticated User] -->|gRPC metadata| B[Chat Service]
    B -->|security.FromContext| C{Principal}
    C -->|tenant_id, facility_ids| D[System Prompt Context]
    C -->|tenant_id, facility_ids| E[DB Record<br/>tenant_id + facility_id columns]
    C -->|tenant_id, facility_ids| F[DB Record<br/>RLS interceptor]

    D -->|scoped tool results| G[LLM sees only<br/>user's devices]
    E -->|stamped on rule record| H[Data Plane:<br/>Go if-check runs FIRST<br/>before CEL eval]
    F -->|RLS interceptor| I[Only tenant's rules<br/>loaded into memory]

    subgraph Data Plane Evaluation
        J[Event arrives] --> K{Go if-check:<br/>tenant + facility match?}
        K -- No --> L[REJECT — skip rule]
        K -- Yes --> M{User CEL:<br/>LLM-generated filter}
        M -- true --> N[Alert!]
        M -- false --> O[Skip]
    end

Critical rule: The LLM never controls tenant_id or facility_id. It cannot forge, omit, or override them. The tenant/facility check is a hardcoded Go string comparison — no CEL involved, no expression to tamper with.

3.0.2 System Prompt Context Injection

When the Chat Service starts a Gemini session, it builds a system prompt that includes the user's scoped context. This ensures the LLM's tool calls (e.g., resolve_device) are automatically filtered to the user's tenant and facilities.

func (s *ChatServer) buildSystemPrompt(ctx context.Context) string {
    // The Principal is used only to scope the device lookup. Its raw
    // tenant_id / facility_id values are deliberately kept out of the prompt,
    // matching the trust boundary in 3.0.1.
    // NOTE: the second return value is ignored here; production code should
    // reject requests that carry no Principal.
    p, _ := security.FromContext(ctx)

    // Fetch user's devices (scoped to their tenant + facilities)
    devices := s.deviceMetaSvc.ListDevicesForFacilities(ctx, p.TenantID, p.FacilityIDs)

    return fmt.Sprintf(`You are a rule-creation assistant for Sensemesh.
The user has access to these facilities: %v.

Available devices:
%s

When creating CEL rules:
- Use the resolved device IDs, not display names, in CEL expressions.
- Do NOT include tenant_id or facility_id in the CEL expression; these are enforced automatically.
- Use resolve_device() to look up device UUIDs.
- Use resolve_enum() to look up enum integer values.
- The CEL expression should only contain the user's filtering logic (e.g., object class, confidence, time).
`,
        facilityNames(p.FacilityIDs), // display names only, never raw IDs
        formatDeviceList(devices),
    )
}

What the LLM sees (example):

Available devices:
- "front gate camera" (id: 1321abc1-..., type: camera, facility: Main Office)
- "parking lot sensor" (id: 9f8e7d6c-..., type: sensor, facility: Main Office)
- "warehouse cam" (id: ab12cd34-..., type: camera, facility: Warehouse B)

What the LLM does NOT see: tenant_id bytes, facility_id bytes, or any device from another tenant.

3.0.3 Tenant Guard (Go If-Check, Not CEL)

When the LLM's create_cel_rule tool call passes validation, the Chat Service stamps the authenticated user's tenant_id and facility_id onto the rule record from the Principal. These are plain database columns — not part of the CEL expression. At evaluation time, the Alert Worker performs a Go string comparison against the incoming event before the CEL engine is ever invoked.

Why a Go if-check instead of a guard CEL expression?

A Go equality check on a string field costs a length comparison plus a short memory compare (and one map lookup when a rule covers several facilities): nanoseconds, with no CEL compilation and no CEL evaluation overhead. Since tenant_id and facility_id are always simple equality checks (or a small set membership test), there is no reason to involve the CEL engine at all.

Why not per-tenant NATS subscriptions? With N tenants × M facilities, you would need N×M dedicated subscriptions (and potentially N×M worker instances). The Go guard approach uses a single broad NATS subscription and filters at evaluation time: each guard check costs nanoseconds, so per-event work grows linearly with the number of loaded rules and no subscription has to be created or torn down as tenants come and go.
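To make the reach of a broad pattern concrete, the following is a minimal sketch of NATS-style subject matching, where `*` matches exactly one token and `>` matches one or more trailing tokens. The NATS server performs this matching itself; `subjectMatches` is a hypothetical helper, useful mainly for reasoning about (or unit-testing) which subjects a stored topic_pattern would cover.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a dot-separated subject matches a pattern
// using NATS-style wildcards: "*" matches exactly one token, and ">" (valid
// only as the final token) matches one or more remaining tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" must be last and must consume at least one token.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false // pattern is longer than the subject
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	// No ">" seen: token counts must agree exactly.
	return len(p) == len(s)
}

func main() {
	subject := "us-east.cust1.fac1.cam-01.detection.raw.person"
	fmt.Println(subjectMatches("*.*.*.*.detection.>", subject))   // broad subscription
	fmt.Println(subjectMatches("*.*.*.*.detection.*.*", subject)) // stored topic_pattern
	fmt.Println(subjectMatches("*.*.*.*.sensor.>", subject))      // different data_type
}
```

Both detection patterns cover the example subject regardless of which tenant published it, which is why the tenant guard has to run on every delivered event.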

sequenceDiagram
    participant LLM as Gemini LLM
    participant ChatSvc as Chat Service
    participant Val as CEL Validator
    participant DB as Database

    LLM->>ChatSvc: Tool: create_cel_rule(<br/>expression='event.label == "person" && event.confidence > 0.8',<br/>actions=[{type: SMS, phone: "+1234567890"}])

    ChatSvc->>Val: ValidateCelExpression(user expression)
    Val-->>ChatSvc: is_valid=true

    Note over ChatSvc: Stamp tenant_id + facility_id (UUID bytes) from Principal
    ChatSvc->>DB: Save AlertDefinition + CelRule {<br/>  cel_expression: (LLM-generated),<br/>  tenant_id: UUID (from Principal),<br/>  facility_id: UUID (from Principal)<br/>}

What gets saved to the database:

| Field | DB Type | Value | Source |
|---|---|---|---|
| cel_expression | TEXT | event.label == "person" && event.confidence > 0.8 | LLM output (validated) |
| tenant_id | UUID | a1b2c3d4-... | Server-side from Principal (for RLS + runtime guard) |
| facility_id | UUID | f5e6d7c8-... | Server-side from Principal (for RLS + runtime guard) |
| AlertDefinition.topic_pattern | TEXT | *.*.*.*.detection.*.* | Broad pattern, no tenant in topic |

Note: All IDs (tenant_id, facility_id, rule id, etc.) are stored as UUID in PostgreSQL and represented as []byte (16 bytes) in proto / uuid.UUID in Go. The TriggerEvent.customer_id and TriggerEvent.facility_id fields are string in the proto and are assumed to carry the canonical lowercase hyphenated form (the output of uuid.UUID.String()); an ID in any other encoding would never compare equal. The guard converts the DB bytes to that string form once, at rule load time (see below).

Runtime guard (Go code, not CEL):

// ruleGuard is built once at rule load time. It converts the DB UUID bytes
// into their string representation so the hot-path comparison against
// TriggerEvent string fields is a single string ==, with no allocation.
type ruleGuard struct {
    tenantID    string              // canonical UUID string, computed once
    facilitySet map[string]struct{} // canonical UUID strings; empty = match all facilities
}

// newRuleGuard converts the 16-byte DB UUIDs to strings. uuid.FromBytes
// rejects slices that are not exactly 16 bytes, so a malformed ID surfaces
// as a load-time error rather than a runtime panic.
func newRuleGuard(tenantID []byte, facilityIDs [][]byte) (*ruleGuard, error) {
    tid, err := uuid.FromBytes(tenantID)
    if err != nil {
        return nil, fmt.Errorf("rule guard: tenant_id: %w", err)
    }
    g := &ruleGuard{tenantID: tid.String()}
    if len(facilityIDs) > 0 {
        g.facilitySet = make(map[string]struct{}, len(facilityIDs))
        for _, raw := range facilityIDs {
            fid, err := uuid.FromBytes(raw)
            if err != nil {
                return nil, fmt.Errorf("rule guard: facility_id: %w", err)
            }
            g.facilitySet[fid.String()] = struct{}{}
        }
    }
    return g, nil
}

// allows checks if the event belongs to this rule's tenant and facility.
// Plain Go string comparison: no CEL, no allocation, nanosecond cost.
func (g *ruleGuard) allows(trigger *pb.TriggerEvent) bool {
    if g.tenantID != trigger.CustomerId {
        return false
    }
    if len(g.facilitySet) > 0 {
        if _, ok := g.facilitySet[trigger.FacilityId]; !ok {
            return false
        }
    }
    return true
}
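Because the guard is an exact string comparison, both sides must use the same textual encoding. The sketch below uses only the standard library to show the 16-byte to canonical-string conversion, and how an uppercase ID from an event producer would silently fail the guard unless it is normalized. `canonicalFromBytes` and `normalizeEventID` are hypothetical helpers, and whether producers ever emit non-canonical IDs is an assumption, not something the schema states.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// canonicalFromBytes renders a 16-byte UUID in lowercase 8-4-4-4-12 form.
func canonicalFromBytes(b []byte) (string, error) {
	if len(b) != 16 {
		return "", errors.New("uuid: expected 16 bytes")
	}
	h := hex.EncodeToString(b) // 32 lowercase hex characters
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32], nil
}

// normalizeEventID lowercases and trims an ID taken from an event. It may
// allocate, so it belongs at ingestion time rather than inside the guard.
func normalizeEventID(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

func main() {
	dbTenant := []byte{0xa1, 0xb2, 0xc3, 0xd4, 0x00, 0x01, 0x40, 0x02,
		0x80, 0x03, 0x00, 0x04, 0x00, 0x05, 0x00, 0x06}
	guardID, err := canonicalFromBytes(dbTenant)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	eventID := "A1B2C3D4-0001-4002-8003-000400050006" // same tenant, uppercase

	fmt.Println(guardID)
	fmt.Println(guardID == eventID)                   // raw comparison
	fmt.Println(guardID == normalizeEventID(eventID)) // after normalization
}
```

If normalization is needed at all, doing it once when the event is decoded keeps the guard itself allocation-free.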

Why not embed tenant_id in the LLM-generated CEL? If the LLM were responsible for injecting event.customer_id == "tenant-abc" into every CEL expression, a prompt injection or hallucination could omit it — leaking data across tenants. The tenant guard is server-side Go code — the LLM has no access to it, and it cannot be bypassed by any CEL expression.

Why not a CEL guard expression? A CEL guard would require compiling and evaluating a second CEL program per rule per event. Since tenant isolation is always a simple string match, a Go if is strictly faster — nanoseconds vs. microseconds — and has zero risk of CEL-level bugs or injection.

3.1 The Validation Strategy (CEL Declarations)

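We use cel.Declarations to enforce a strict contract. The LLM's output is compiled (parsed and type-checked) against the Protobuf definition. If the LLM invents a message field that doesn't exist in the proto, compilation fails immediately. Keys inside the google.protobuf.Struct fields (facts, environment) are dynamically typed, so a misspelled key such as event.facts.objcet_class passes compilation and only fails at evaluation time. (Recent cel-go releases deprecate cel.Declarations in favor of cel.Variable; the contract is the same.)

The "Root Object" Pattern: Instead of manually declaring every single field in our Go code, we declare one root variable per context. CEL then resolves every field inside the message from the registered proto descriptor.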

Alert Rule Validation

func ValidateAlertCelRule(script string) (*pb.CelValidationResponse, error) {
    env, err := cel.NewEnv(
        // Register the Proto Definition (The Blueprint)
        cel.Types(&pb.TriggerEvent{}),

        // Declare the Variable (The Instance)
        // CEL auto-discovers: event.event_id, event.region, event.confidence, etc.
        cel.Declarations(
            decls.NewVar("event", decls.NewObjectType("proto.api.TriggerEvent")),
        ),
    )
    if err != nil {
        // A broken environment is a server-side fault, not an invalid rule.
        return nil, fmt.Errorf("build CEL env: %w", err)
    }

    // Compile = parse + type-check against the declared variable.
    ast, issues := env.Compile(script)
    if issues != nil && issues.Err() != nil {
        return &pb.CelValidationResponse{
            IsValid:      false,
            ErrorMessage: issues.Err().Error(),
        }, nil
    }

    refs := extractVariableReferences(ast)

    return &pb.CelValidationResponse{
        IsValid:            true,
        EvaluationSuccess:  true,
        VariableReferences: refs,
    }, nil
}

Mission Rule Validation

func ValidateMissionCelRule(script string) (*pb.ValidateMissionRuleCelResponse, error) {
    env, err := cel.NewEnv(
        cel.Types(&pb.MissionDetectionPayload{}),
        cel.Declarations(
            decls.NewVar("detection",
                decls.NewObjectType("proto.mission.MissionDetectionPayload")),
        ),
    )
    if err != nil {
        // A broken environment is a server-side fault, not an invalid rule.
        return nil, fmt.Errorf("build CEL env: %w", err)
    }

    _, issues := env.Compile(script)
    if issues != nil && issues.Err() != nil {
        return &pb.ValidateMissionRuleCelResponse{
            IsValid:      false,
            ErrorMessage: issues.Err().Error(),
        }, nil
    }

    return &pb.ValidateMissionRuleCelResponse{IsValid: true}, nil
}

3.2 Existing Validation RPCs

These are already defined in the proto schema and should be backed by the validation logic above:

| RPC | Service | Proto File |
|---|---|---|
| ValidateCelExpression(CelValidationRequest) | AlertDefinitionService | alerting.proto:215 |
| TestAlertDefinition(TestAlertDefinitionRequest) | AlertDefinitionService | alerting.proto:216 |
| ValidateCelExpression(ValidateMissionRuleCelRequest) | MissionRuleService | mission_rules.proto:127 |

3.3 The Self-Healing Loop

The chat service (cloud/chat-service/) acts as the LLM orchestrator. When a user requests a rule in natural language:

sequenceDiagram
    participant User
    participant ChatService as Chat Service (Go)
    participant Gemini as Gemini LLM
    participant Validator as CEL Validator
    participant DB as Database

    User->>ChatService: "Alert me when a clown is detected"

    Note over ChatService: Inject Principal context (tenant_id, facility_ids, scoped device list)
    ChatService->>Gemini: System prompt (tenant-scoped) + TriggerEvent schema + tool definitions
    Gemini->>ChatService: Tool call: create_cel_rule("event.objects.contains(\"clown\")")

    Note over ChatService: Validate before persisting

    ChatService->>Validator: ValidateCelExpression("event.objects.contains(\"clown\")")
    Validator-->>ChatService: is_valid=false, error="TriggerEvent has no field 'objects'"

    Note over ChatService: Feed error back to LLM

    ChatService->>Gemini: FunctionResponse: {error: "no field 'objects'", hint: "use event.label"}
    Gemini->>ChatService: Tool call: create_cel_rule("event.label == \"clown\"")

    ChatService->>Validator: ValidateCelExpression("event.label == \"clown\"")
    Validator-->>ChatService: is_valid=true, variable_references=["event.label"]

    Note over ChatService: Validated — safe to persist
    Note over ChatService: Stamp tenant_id + facility_id from Principal onto AlertDefinition

    ChatService->>DB: CreateAlertDefinition (CelRule + tenant_id/facility_id as UUID)
    DB-->>ChatService: AlertDefinition saved

    ChatService->>Gemini: FunctionResponse: {status: "created", rule_id: "uuid-123"}
    Gemini->>ChatService: Text: "Done! I created an alert rule that triggers when a clown is detected."

    ChatService->>User: "Done! I created an alert rule that triggers when a clown is detected."

The CelValidationResponse.variable_references field is returned to the LLM so it can confirm which fields were actually resolved — providing an additional sanity check.

3.4 Entity Resolution: Display Names to IDs

Users refer to devices, facilities, and other entities by their display names (e.g., "front gate camera"), but CEL expressions operate on IDs and integer enum values: string UUIDs for TriggerEvent fields, raw byte UUIDs for bytes fields such as MissionDetectionPayload.detector_id. The Chat Service must resolve these before the LLM emits the final CEL expression.

3.4.1 The Problem

Protobuf bytes fields (like DeviceInfo.device_id and MissionDetectionPayload.detector_id) store UUIDs as raw bytes, not as hex strings. A user saying "front gate camera" must be translated to the device's UUID (1321abc1-2131-23...), as a string literal for TriggerEvent.hardware_id or as a bytes literal for bytes fields, before the CEL expression can reference it. Similarly, a user saying "person" in the context of an ObjectClass enum must be translated to its integer value (1), not left as a string comparison.

3.4.2 Resolution Flow

The Chat Service exposes lookup tools to the LLM. When the user references an entity by name, the LLM calls the lookup tool first, receives the resolved ID, and then uses that ID in the CEL expression it generates.

sequenceDiagram
    participant User
    participant ChatSvc as Chat Service
    participant LLM as Gemini LLM
    participant DeviceMeta as DeviceMetadataService (gRPC)
    participant Val as CEL Validator
    participant DB as Database

    User->>ChatSvc: "Alert me when a person is seen on the front gate camera"

    Note over ChatSvc: Principal extracted: tenant_id=abc, facility_ids=[fac-123]
    ChatSvc->>LLM: System prompt (tenant-scoped device list) + tool definitions

    Note over LLM: Step 1 — Resolve device name to ID
    LLM->>ChatSvc: Tool: resolve_device(name="front gate camera")
    ChatSvc->>DeviceMeta: GetDeviceInfo (search by device_name, scoped to tenant + facility)
    DeviceMeta-->>ChatSvc: DeviceInfo{device_id: 0x1321abc1..., device_name: "front gate camera"}
    ChatSvc->>LLM: FunctionResponse: {device_id: "1321abc1-2131-2300-0000-000000000000", hardware_id: "1321abc1-2131-2300-0000-000000000000"}

    Note over LLM: Step 2 — Resolve enum "person" to integer
    LLM->>ChatSvc: Tool: resolve_enum(enum_type="ObjectClass", value="person")
    ChatSvc->>ChatSvc: Static lookup: PERSON = 1
    ChatSvc->>LLM: FunctionResponse: {enum_name: "PERSON", enum_value: 1}

    Note over LLM: Step 3 — Generate CEL with resolved values
    LLM->>ChatSvc: Tool: create_cel_rule(expression='event.hardware_id == "1321abc1-2131-2300-0000-000000000000" && int(event.facts.object_class) == 1')

    ChatSvc->>Val: ValidateCelExpression(...)
    Val-->>ChatSvc: is_valid=true

    ChatSvc->>DB: Save AlertDefinition + CelRule
    ChatSvc->>User: "Done! You'll get an alert when a person is detected on the front gate camera."

3.4.3 Device Name Resolution

The DeviceMetadataService (proto/model/device_metadata.proto) provides the lookup. The Chat Service wraps this as a Gemini tool:

// Tool definition exposed to the LLM
var resolveDeviceTool = &genai.FunctionDeclaration{
    Name:        "resolve_device",
    Description: "Look up a device's UUID and metadata by its user-friendly display name.",
    Parameters: &genai.Schema{
        Type: genai.TypeObject,
        Properties: map[string]*genai.Schema{
            "name": {Type: genai.TypeString, Description: "The user-assigned name of the device (e.g., 'front gate camera')"},
        },
        Required: []string{"name"},
    },
}

// Execution: calls DeviceMetadataService.GetDeviceInfo via gRPC
func (s *ChatServer) resolveDevice(ctx context.Context, name string) (map[string]interface{}, error) {
    // Search devices by name within the user's tenant/facility scope
    deviceInfo, err := s.deviceMetaSvc.GetDeviceInfoByName(ctx, name)
    if err != nil {
        return nil, fmt.Errorf("device %q not found: %w", name, err)
    }

    // uuid.FromBytes rejects IDs that are not exactly 16 bytes.
    deviceID, err := uuid.FromBytes(deviceInfo.DeviceId)
    if err != nil {
        return nil, fmt.Errorf("device %q: invalid device_id: %w", name, err)
    }
    facilityID, err := uuid.FromBytes(deviceInfo.FacilityId)
    if err != nil {
        return nil, fmt.Errorf("device %q: invalid facility_id: %w", name, err)
    }

    // Return the UUID as a hex string; the LLM embeds this literal in the CEL expression
    return map[string]interface{}{
        "device_id":     deviceID.String(),
        "hardware_id":   deviceID.String(),
        "device_name":   deviceInfo.DeviceName,
        "facility_id":   facilityID.String(),
        "facility_name": deviceInfo.FacilityName,
        "device_type":   deviceInfo.DeviceType,
    }, nil
}

The LLM receives the resolved UUID and uses it as a string literal in the CEL expression. At evaluation time, TriggerEvent.hardware_id is already a string field, so this works directly:

event.hardware_id == "1321abc1-2131-2300-0000-000000000000"

For MissionDetectionPayload.detector_id (which is bytes), the LLM uses the b"" bytes literal syntax in CEL, or the expression compares against detector_label (a string) instead:

// Option A: match by label (simpler, but breaks if device is renamed)
detection.detector_label == "front gate camera"

// Option B: match by resolved UUID bytes (stable, survives renames)
detection.detector_id == b"\x13\x21\xab\xc1\x21\x31\x23\x00..."

The system prompt instructs the LLM to prefer Option B (ID-based matching) for durability.
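Writing sixteen \xNN escapes by hand is error-prone, so Option B is easier to get right if the literal is produced mechanically from the resolved UUID string. The following is a minimal standard-library sketch; `celBytesLiteral` is a hypothetical helper, not an existing Sensemesh or cel-go function.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// celBytesLiteral converts a hyphenated UUID string into a CEL bytes literal
// of the form b"\x13\x21...", suitable for comparing against a bytes field.
func celBytesLiteral(id string) (string, error) {
	raw, err := hex.DecodeString(strings.ReplaceAll(id, "-", ""))
	if err != nil {
		return "", fmt.Errorf("parse uuid %q: %w", id, err)
	}
	if len(raw) != 16 {
		return "", fmt.Errorf("parse uuid %q: got %d bytes, want 16", id, len(raw))
	}
	var sb strings.Builder
	sb.WriteString(`b"`)
	for _, c := range raw {
		fmt.Fprintf(&sb, `\x%02x`, c) // one \xNN escape per byte
	}
	sb.WriteString(`"`)
	return sb.String(), nil
}

func main() {
	lit, err := celBytesLiteral("1321abc1-2131-2300-0000-000000000000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("detection.detector_id == " + lit)
}
```

One way to use this is to have the resolve_device tool return the ready-made literal alongside the string form, so the LLM copies it verbatim instead of transcribing bytes.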

3.4.4 Enum Translation: Integers, Not Strings

Protobuf enums (like ObjectClass) are represented as integers in the wire format and in CEL evaluation. The LLM must emit integer comparisons, not string comparisons.

Why integers over strings:

| Approach | CEL Expression | Notes |
|---|---|---|
| String matching (bad) | event.label == "person" | Only works for TriggerEvent.label (a string field). Fragile: depends on the event producer spelling it exactly right. |
| Integer enum (good) | int(event.facts.object_class) == 1 | Type-safe. Matches the proto enum definition. Cannot drift. |

The Chat Service exposes a static enum lookup tool:

var resolveEnumTool = &genai.FunctionDeclaration{
    Name:        "resolve_enum",
    Description: "Resolve a human-readable enum value to its integer representation in the proto schema.",
    Parameters: &genai.Schema{
        Type: genai.TypeObject,
        Properties: map[string]*genai.Schema{
            "enum_type": {Type: genai.TypeString, Description: "The enum type name (e.g., 'ObjectClass', 'MissionEventType', 'AlertStatus')"},
            "value":     {Type: genai.TypeString, Description: "The human-readable enum value (e.g., 'person', 'car', 'critical')"},
        },
        Required: []string{"enum_type", "value"},
    },
}

The lookup is a static map built from the proto descriptor at startup — no gRPC call needed:

// Built automatically from proto reflection at init time
var enumMaps = map[string]map[string]int32{
    "ObjectClass": {
        "person": 1, "bicycle": 2, "car": 3, "motorbike": 4,
        "bus": 6, "train": 7, "truck": 8, "dog": 17, // ... all 80 classes
    },
    "MissionEventType": {
        "detection": 1, "takeoff": 2, "hover": 3, "return": 4,
    },
    "AlertStatus": {
        "pending": 1, "delivered": 2, "failed": 3,
    },
}

The LLM calls resolve_enum(enum_type="ObjectClass", value="person"), receives {enum_name: "PERSON", enum_value: 1}, and emits:

int(event.facts.object_class) == 1

Note: For TriggerEvent.label (a plain string field), string matching is acceptable and simpler. The enum resolution is specifically for fields that carry protobuf enum types (e.g., ObjectClass in Event.objects[].object_class, MissionEvent.Type, etc.). The system prompt tells the LLM which fields are enums and which are strings.
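The lookup behind resolve_enum could be sketched as follows. This is an assumption-laden illustration: `resolveEnum` is a hypothetical function, the map holds only a subset of the ObjectClass values listed above, and the case-insensitive matching plus the "known values" hint in the error are design suggestions rather than documented behavior.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// enumMaps is a small subset of the static table described above.
var enumMaps = map[string]map[string]int32{
	"ObjectClass": {"person": 1, "bicycle": 2, "car": 3},
}

// resolveEnum maps a human-readable value to its enum integer. On a miss it
// returns an error that lists the valid values, so the LLM can self-correct.
func resolveEnum(enumType, value string) (int32, error) {
	values, ok := enumMaps[enumType]
	if !ok {
		return 0, fmt.Errorf("unknown enum type %q", enumType)
	}
	key := strings.ToLower(strings.TrimSpace(value))
	if n, ok := values[key]; ok {
		return n, nil
	}
	known := make([]string, 0, len(values))
	for k := range values {
		known = append(known, k)
	}
	sort.Strings(known) // deterministic error text
	return 0, fmt.Errorf("unknown %s value %q; known values: %s",
		enumType, value, strings.Join(known, ", "))
}

func main() {
	n, err := resolveEnum("ObjectClass", "Person")
	fmt.Println(n, err)

	_, err = resolveEnum("ObjectClass", "clown")
	fmt.Println(err)
}
```

Returning the list of valid values in the error plays the same role as the validator feedback in 3.3: the model gets enough information to retry with a real enum member.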

3.4.5 Entity Lifecycle: Device Deletion and Rule Cleanup

When a device is deleted via DeviceMetadataService.DeleteDeviceInfo, any CEL rules that reference that device's ID become orphaned — they will never match again (harmless but wasteful) or worse, could match a recycled ID in the future.

sequenceDiagram
    participant Admin
    participant DeviceMeta as DeviceMetadataService
    participant NATS as NATS Bus
    participant RuleCleanup as Rule Cleanup Worker
    participant DB as Database
    participant Owner as Rule Owner (User)

    Admin->>DeviceMeta: DeleteDeviceInfo(id=0x1321abc1...)
    DeviceMeta->>DeviceMeta: Soft-delete device
    DeviceMeta->>NATS: Publish device.deleted {device_id: "1321abc1..."}

    NATS->>RuleCleanup: Receive device.deleted event

    RuleCleanup->>DB: Query CelRules WHERE cel_expression CONTAINS "1321abc1..."
    DB-->>RuleCleanup: [CelRule A, CelRule B]

    alt Rules found
        RuleCleanup->>DB: Disable CelRule A (enabled=false)
        RuleCleanup->>DB: Disable CelRule B (enabled=false)
        RuleCleanup->>Owner: Notify: "2 alert rules disabled — device 'front gate camera' was deleted"
    end

    Note over RuleCleanup: Also check MissionRules
    RuleCleanup->>DB: Query MissionRules WHERE cel_expression CONTAINS "1321abc1..."
    DB-->>RuleCleanup: [MissionRule X]
    RuleCleanup->>DB: Disable MissionRule X (enabled=false)
    RuleCleanup->>Owner: Notify: "Mission rule 'X' disabled — device 'front gate camera' was deleted"

Implementation strategy:

  1. Event-driven cleanup: When DeviceMetadataService.DeleteDeviceInfo is called, it publishes a device.deleted event on NATS. A cleanup worker subscribes to this topic.

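  2. Rule scanning: The worker queries all CelRule and MissionRule records whose cel_expression contains the deleted device's UUID string. This is a simple LIKE '%uuid-hex%' query. Two caveats: a leading-wildcard LIKE cannot use a plain B-tree index, so it scans every rule row; and rules that reference the device through a bytes literal (Option B in 3.4.3) contain the escaped \xNN form rather than the hyphenated string, so the scan must search for both encodings.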

  3. Disable, don't delete: Rules are disabled (enabled=false), not deleted. This preserves audit history and allows the rule owner to review and re-target the rule to a replacement device.

  4. Owner notification: The rule owner is notified (via their preferred alert channel) that their rules were disabled due to device removal.

  5. Same pattern for facilities/tenants: If a facility is deleted, all rules scoped to that facility's devices should be disabled. The cleanup worker can cascade by first looking up all devices in the facility, then scanning rules for each device ID.

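// Rule cleanup on device deletion
func (w *RuleCleanupWorker) OnDeviceDeleted(ctx context.Context, deviceID string) error {
    // Scan alert rules
    alertDefs, err := w.db.ListAlertDefinitions(ctx, &pb.ListAlertDefinitionsRequest{
        HasCelExpression: &deviceID, // LIKE '%deviceID%' on cel_expression
    })
    if err != nil {
        return fmt.Errorf("list alert definitions for device %s: %w", deviceID, err)
    }

    for _, def := range alertDefs.Definitions {
        for _, rule := range def.Rules {
            if !strings.Contains(rule.CelExpression, deviceID) {
                continue
            }
            rule.Enabled = false
            // Assumes UpdateCelRule returns (response, error), gRPC-style.
            if _, err := w.db.UpdateCelRule(ctx, rule); err != nil {
                return fmt.Errorf("disable rule for device %s: %w", deviceID, err)
            }
            // Notify only after the rule has actually been disabled.
            w.notify(rule, "disabled: referenced device was deleted")
        }
    }

    // Scan mission rules
    // (similar pattern using ListMissionRules across all missions)
    return nil
}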
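The strings.Contains check above only finds rules that embed the device ID as a hyphenated string. Rules written with a bytes literal hold the same ID as \xNN escapes. A minimal sketch of a check that covers both encodings follows; `referencesDevice` and `escapedBytesForm` are hypothetical helpers, and the matching is deliberately simple substring logic rather than a CEL parse.

```go
package main

import (
	"fmt"
	"strings"
)

// escapedBytesForm renders a hyphenated UUID as the \xNN sequence that
// appears inside a CEL bytes literal (without the surrounding b"...").
func escapedBytesForm(id string) string {
	h := strings.ToLower(strings.ReplaceAll(id, "-", ""))
	var sb strings.Builder
	for i := 0; i+1 < len(h); i += 2 {
		sb.WriteString(`\x`)
		sb.WriteString(h[i : i+2])
	}
	return sb.String()
}

// referencesDevice reports whether expr mentions the device ID either as a
// string literal or as an escaped bytes literal. Comparison is
// case-insensitive because hex digits may be written in either case.
func referencesDevice(expr, id string) bool {
	if id == "" {
		return false // an empty ID would match every expression
	}
	lower := strings.ToLower(expr)
	return strings.Contains(lower, strings.ToLower(id)) ||
		strings.Contains(lower, escapedBytesForm(id))
}

func main() {
	id := "1321abc1-2131-2300-0000-000000000000"
	alertRule := `event.hardware_id == "1321abc1-2131-2300-0000-000000000000"`
	missionRule := `detection.detector_id == b"\x13\x21\xab\xc1\x21\x31\x23\x00\x00\x00\x00\x00\x00\x00\x00\x00"`
	otherRule := `event.label == "person"`

	fmt.Println(referencesDevice(alertRule, id))
	fmt.Println(referencesDevice(missionRule, id))
	fmt.Println(referencesDevice(otherRule, id))
}
```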

Design decision: We chose disable-over-delete because (a) users may want to re-target rules to a replacement device, (b) AlertInstances reference rule IDs and deleting them would break the audit trail, and (c) disabled rules have zero runtime cost: the worker loads only enabled definitions (enabled_only=true), so they are never compiled or evaluated.


4. The Data Plane (Evaluation)

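Goal: Evaluate thousands of rules per second against live NATS event streams.

Advantage: Because events are Protobuf messages, each NATS payload is unmarshaled once into a TriggerEvent and handed to the CEL engine as-is. CEL reads fields through the proto descriptor, so there is no intermediate conversion to JSON or map types.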

4.1 Alert Pipeline (NATS -> CEL -> Actions)

Events arrive via NATS subjects matching the pattern: <region>.<customer_id>.<facility_id>.<hardware_id>.<data_type>.<stage>.<label>

The AlertService.ProcessTrigger RPC (alerting.proto:231) is the entry point.
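The seven-token subject layout maps directly onto TriggerEvent's routing fields. The sketch below splits a subject into its named parts and rejects malformed ones; `subjectParts` and `parseSubject` are hypothetical names, and whether the worker derives these values from the subject or from the payload is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectParts names the seven tokens of an event subject:
// <region>.<customer_id>.<facility_id>.<hardware_id>.<data_type>.<stage>.<label>
type subjectParts struct {
	Region, CustomerID, FacilityID, HardwareID, DataType, Stage, Label string
}

// parseSubject splits a concrete (wildcard-free) subject into its parts.
func parseSubject(subject string) (subjectParts, error) {
	t := strings.Split(subject, ".")
	if len(t) != 7 {
		return subjectParts{}, fmt.Errorf("subject %q: got %d tokens, want 7", subject, len(t))
	}
	for i, tok := range t {
		if tok == "" {
			return subjectParts{}, fmt.Errorf("subject %q: token %d is empty", subject, i+1)
		}
	}
	return subjectParts{
		Region:     t[0],
		CustomerID: t[1],
		FacilityID: t[2],
		HardwareID: t[3],
		DataType:   t[4],
		Stage:      t[5],
		Label:      t[6],
	}, nil
}

func main() {
	p, err := parseSubject("us-east.cust1.fac1.cam-01.detection.raw.person")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("customer=%s facility=%s hardware=%s type=%s label=%s\n",
		p.CustomerID, p.FacilityID, p.HardwareID, p.DataType, p.Label)
}
```

A parser like this also makes it possible to cross-check that the customer_id in the subject agrees with the one in the payload before the tenant guard trusts either.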

sequenceDiagram
    participant Camera as Camera / Sensor
    participant NATS as NATS Bus
    participant Worker as Alert Worker
    participant CEL as CEL Engine<br/>(cached Programs)
    participant DB as Database
    participant Dispatch as Action Dispatcher

    Note over Worker,CEL: Startup: load & compile rules
    Worker->>DB: ListAlertDefinitions(enabled_only=true)
    DB-->>Worker: AlertDefinitions with CelRules
    Worker->>CEL: CompileRules() → cached cel.Programs

    Note over Camera,Dispatch: Runtime: event stream processing

    Camera->>NATS: Publish TriggerEvent on<br/>us-east.cust1.fac1.cam-01.detection.raw.person
    NATS->>Worker: Deliver TriggerEvent (topic matches AlertDefinition.topic_pattern)

    Worker->>CEL: Eval(event=TriggerEvent) for each matched rule

    alt CEL returns true
        CEL-->>Worker: match=true
        Worker->>DB: Create AlertInstance (rule_id, trigger_data, cel_context)
        Worker->>Dispatch: Execute AlertActions
        Dispatch->>Dispatch: SMS / Email / Webhook / Slack / Discord / PagerDuty
    else CEL returns false
        CEL-->>Worker: match=false
        Note over Worker: Skip — no action
    else CEL error
        CEL-->>Worker: error
        Worker->>DB: Create AlertInstance (status=CEL_ERROR)
    end

4.1.1 Two-Stage Evaluation (Go Guard + CEL Expression)

The Data Plane uses a single broad NATS subscription (e.g., *.*.*.*.detection.>) shared across all tenants. Tenant isolation is enforced at evaluation time through a two-stage pipeline:

  1. Stage 1 — Go tenant guard (plain Go if check): compares the rule's tenant_id and facility_id against the event's customer_id and facility_id. If this fails, the rule is skipped immediately. No CEL involved — nanosecond cost.
  2. Stage 2 — User CEL expression (LLM-generated, validated): contains only the user's semantic filter logic. Only runs if the Go guard passes.
sequenceDiagram
    participant NATS as NATS Bus<br/>(broad subscription)
    participant Worker as Alert Worker
    participant CEL as CEL Engine<br/>(cached Programs)
    participant DB as Database
    participant Dispatch as Action Dispatcher

    NATS->>Worker: TriggerEvent (any tenant)

    loop For each AlertDefinition (all tenants)
        Worker->>Worker: Go guard: def.tenant_id == event.customer_id?
        alt Tenant mismatch
            Note over Worker: SKIP — wrong tenant (nanoseconds)
        else Tenant match
            Worker->>Worker: Go guard: def.facility_id == event.facility_id?
            alt Facility mismatch
                Note over Worker: SKIP — wrong facility (nanoseconds)
            else Facility match
                loop For each CelRule in definition
                    Worker->>CEL: Eval cel_expression against event
                    alt CEL matches
                        CEL-->>Worker: true
                        Worker->>DB: Create AlertInstance
                        Worker->>Dispatch: Execute AlertActions
                    else CEL does not match
                        CEL-->>Worker: false — skip
                    end
                end
            end
        end
    end

This design means:

  • One worker handles all tenants — no per-tenant subscription fan-out.
  • Go guard short-circuits on tenant mismatch in nanoseconds — the CEL engine is never invoked for events from other tenants.
  • Defense in depth: even if the LLM hallucinated a rule that tried to match all events (e.g., true), the Go guard would still reject events from other tenants. The guard is hardcoded Go logic, not an expression the LLM can influence.
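
A minimal sketch of what such a guard could look like is shown here. The `triggerEvent` struct is a stand-in for pb.TriggerEvent, and the sketch assumes tenant and facility IDs have already been normalized to the same string form the event carries. Treating an empty facility list as "deny all" is also an assumption; the real guard may treat it as tenant-wide.

```go
package main

import "fmt"

// triggerEvent is a stand-in for pb.TriggerEvent with only the fields
// the guard needs.
type triggerEvent struct {
	CustomerID string
	FacilityID string
}

// ruleGuard holds the tenant scope of one alert definition.
type ruleGuard struct {
	tenantID   string
	facilities map[string]struct{}
}

// newRuleGuard builds the facility set once, at load time, so the hot
// path is a string comparison plus one map lookup.
func newRuleGuard(tenantID string, facilityIDs []string) ruleGuard {
	set := make(map[string]struct{}, len(facilityIDs))
	for _, id := range facilityIDs {
		set[id] = struct{}{}
	}
	return ruleGuard{tenantID: tenantID, facilities: set}
}

// allows fails closed: a nil event, an empty tenant, a tenant mismatch,
// or an unlisted facility all reject the event.
func (g ruleGuard) allows(ev *triggerEvent) bool {
	if ev == nil || g.tenantID == "" || ev.CustomerID != g.tenantID {
		return false
	}
	_, ok := g.facilities[ev.FacilityID]
	return ok
}

func main() {
	guard := newRuleGuard("cust1", []string{"fac1", "fac2"})
	fmt.Println(guard.allows(&triggerEvent{CustomerID: "cust1", FacilityID: "fac1"})) // true
	fmt.Println(guard.allows(&triggerEvent{CustomerID: "cust2", FacilityID: "fac1"})) // false
}
```

Because the guard never reads the rule's CEL expression, nothing the LLM generates can widen the tenant scope.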
func (e *AlertEngine) ProcessTrigger(trigger *pb.TriggerEvent) {
    // All definitions loaded at startup (scoped by RLS to authorized tenants).
    // The broad NATS subscription delivers events from all tenants.
    for _, def := range e.matchingDefinitions(trigger.NatsSubject) {

        // STAGE 1: Go tenant guard — plain string comparison, no CEL.
        // Short-circuits in nanoseconds if tenant/facility doesn't match.
        guard := e.guards[string(def.Id)]
        if !guard.allows(trigger) {
            continue // Wrong tenant/facility — skip entire definition
        }

        // Wrap the proto pointer for CEL; the message itself is not copied
        // or re-serialized. Only built if we pass the tenant guard.
        input := map[string]interface{}{
            "event": trigger,
        }

        for _, rule := range def.Rules {
            if !rule.Enabled {
                continue
            }

            // Time window check (e.g., "09:00-17:00", "weekdays")
            if rule.TimeWindow != nil && !inTimeWindow(*rule.TimeWindow) {
                continue
            }

            // Min confidence threshold
            if rule.MinConfidence != nil && trigger.Confidence < float32(*rule.MinConfidence)/100 {
                continue
            }

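            // STAGE 2: User CEL expression (LLM-generated, validated)
            // Only reached if the Go guard passed.
            prg, ok := e.programs[string(rule.Id)]
            if !ok {
                // Rule has no compiled program (e.g. added since the last
                // CompileRules run); skip instead of calling Eval on nil.
                continue
            }
            out, _, err := prg.Eval(input)
            if err != nil {
                e.recordInstance(rule, trigger, pb.CEL_ERROR, err.Error())
                continue
            }

            // Validated rules return bool; any other result type is a non-match.
            if matched, isBool := out.Value().(bool); isBool && matched {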
                // Create AlertInstance (alerting.proto:114) with:
                //   rule_id, trigger_event_id, matched_topic, trigger_data,
                //   cel_context, cel_result
                // Then execute AlertActions: SMS, Email, Webhook, Slack, etc.
                e.createAlertInstance(rule, trigger)
                e.executeActions(rule.Actions, trigger)

                if rule.StopOnMatch {
                    break
                }
            }
        }
    }
}
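
The handler above relies on e.matchingDefinitions to pre-filter definitions by subject, but the matching itself is not shown. The sketch below assumes AlertDefinition.topic_pattern uses standard NATS wildcard syntax, where `*` matches exactly one token and `>` matches one or more trailing tokens. `subjectMatches` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a concrete subject matches a pattern
// using NATS wildcard rules: "*" matches one token, ">" matches one or
// more remaining tokens and is only valid as the last pattern token.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// Must be the final pattern token with at least one subject token left.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) || s[i] == "" {
			return false // subject too short, or contains an empty token
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	// Without a trailing ">", token counts must agree exactly.
	return len(p) == len(s)
}

func main() {
	subject := "us-east.cust1.fac1.cam-01.detection.raw.person"
	fmt.Println(subjectMatches("*.*.*.*.detection.>", subject)) // true
	fmt.Println(subjectMatches("*.*.*.*.sensor.>", subject))    // false
}
```

In a worker with many definitions, a linear scan like this may be replaced by an index keyed on data_type, but the matching rule stays the same.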

4.2 Mission Pipeline (Detection -> CEL -> MissionRuleActions)

sequenceDiagram
    participant OD as Object Detector
    participant NATS as NATS Bus
    participant MEngine as Mission Engine
    participant CEL as CEL Engine<br/>(cached Programs)
    participant Dispatch as Action Dispatcher

    OD->>NATS: Publish MissionRuntimeEvent (type=DETECTION)
    NATS->>MEngine: Deliver MissionDetectionPayload for mission-xyz

    MEngine->>CEL: Eval(detection=MissionDetectionPayload) for each MissionRule

    alt Rule matches
        CEL-->>MEngine: match=true
        MEngine->>Dispatch: Execute MissionRuleActions
        Dispatch->>Dispatch: Render message_template → send SMS / Discord
    else No match
        CEL-->>MEngine: match=false
    end
func (e *MissionEngine) EvaluateDetection(detection *pb.MissionDetectionPayload, missionID []byte) {
    input := map[string]interface{}{
        "detection": detection,
    }

    for _, rule := range e.rulesForMission(missionID) {
        if !rule.Enabled || rule.IsDeleted {
            continue
        }

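        // Convert the ID to a string key, matching the alert engine's cache;
        // a []byte cannot be used as a Go map key.
        prg, ok := e.programs[string(rule.Id)]
        if !ok {
            continue // no compiled program for this rule yet
        }
        out, _, err := prg.Eval(input)
        if err != nil {
            log.Printf("[mission-cel] rule %s eval error: %v", rule.Name, err)
            continue
        }

        if matched, isBool := out.Value().(bool); isBool && matched {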
            // Execute MissionRuleActions (SMS, Discord)
            // Render message_template with CEL placeholders
            e.executeActions(rule.Actions, detection)
        }
    }
}

4.3 Program Caching and Tenant Guard Maps

At startup (or when rules change), compile and type-check each user CEL expression, cache the resulting cel.Program keyed by rule ID, and build the Go tenant guard maps:

func (e *AlertEngine) CompileRules(definitions []*pb.AlertDefinition) error {
    env, err := cel.NewEnv(
        cel.Types(&pb.TriggerEvent{}),
        cel.Declarations(
            decls.NewVar("event", decls.NewObjectType("proto.api.TriggerEvent")),
        ),
    )
    if err != nil {
        return fmt.Errorf("CEL environment setup failed: %w", err)
    }

    for _, def := range definitions {
        defID := string(def.Id)

        // Build Go tenant guard from UUID bytes (converted to string once at load time)
        e.guards[defID] = newRuleGuard(def.TenantId, def.FacilityIds)

        // Compile user CEL expressions only
        for _, rule := range def.Rules {
            ast, issues := env.Compile(rule.CelExpression)
            if issues != nil && issues.Err() != nil {
                return fmt.Errorf("rule %s CEL compilation failed: %w", rule.Name, issues.Err())
            }

            prg, err := env.Program(ast)
            if err != nil {
                return fmt.Errorf("rule %s program creation failed: %w", rule.Name, err)
            }

            e.programs[string(rule.Id)] = prg
        }
    }
    return nil
}
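
CompileRules writes into e.programs in place, which is safe at startup but would race with in-flight evaluations if rules are reloaded while the worker is running. One possible approach, sketched below, is to build a fresh map and publish it with a single atomic pointer swap. The `evaluator` interface is a stand-in for cel.Program, `programCache` is a hypothetical type, and the code assumes Go 1.19 or later for atomic.Pointer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// evaluator is a stand-in for cel.Program in this sketch.
type evaluator interface {
	Eval(input map[string]any) (bool, error)
}

// constProgram is a trivial evaluator used only for the demo.
type constProgram bool

func (p constProgram) Eval(map[string]any) (bool, error) { return bool(p), nil }

// programCache publishes an immutable rule-ID -> program map.
// Readers never lock; a reload replaces the whole map at once.
type programCache struct {
	current atomic.Pointer[map[string]evaluator]
}

// swap publishes a fully built map. Callers must not mutate next afterwards.
func (c *programCache) swap(next map[string]evaluator) {
	c.current.Store(&next)
}

// lookup returns the program for ruleID from the currently published map.
func (c *programCache) lookup(ruleID string) (evaluator, bool) {
	m := c.current.Load()
	if m == nil {
		return nil, false // nothing compiled yet
	}
	prg, ok := (*m)[ruleID]
	return prg, ok
}

func main() {
	cache := &programCache{}
	cache.swap(map[string]evaluator{"rule-1": constProgram(true)})

	if prg, ok := cache.lookup("rule-1"); ok {
		matched, _ := prg.Eval(nil)
		fmt.Println("rule-1 matched:", matched)
	}
}
```

A side effect of this design is that a reload which fails partway through compilation never becomes visible, because the new map is only published after every rule has compiled.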

5. CEL Field Reference

5.1 Alert Context (event — TriggerEvent)

| CEL Expression | Type | Description |
|---|---|---|
| event.event_id | string | Unique event identifier |
| event.region | string | Region identifier |
| event.customer_id | string | Customer/tenant identifier |
| event.facility_id | string | Facility identifier |
| event.hardware_id | string | Camera/sensor/device ID |
| event.data_type | string | "detection", "sensor", "image" |
| event.stage | string | Processing stage |
| event.label | string | "person", "vehicle", "motion" |
| event.occurred_at | int64 | Epoch milliseconds |
| event.confidence | float | 0.0–1.0 |
| event.severity | string | "LOW", "MEDIUM", "HIGH", "CRITICAL" |
| event.tags | map<string,string> | Key-value tags |
| event.nats_subject | string | Full NATS subject |
| event.facts | google.protobuf.Struct | Dynamic JSON payload (escape hatch) |
| event.environment | google.protobuf.Struct | Environmental context |
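
Since event.occurred_at is epoch milliseconds, a time-of-day check against the event's own timestamp needs a conversion first. The sketch below assumes a window string in the "09:00-17:00" form used in the rule loop's comments and evaluates it against event time rather than wall-clock time, which may differ from how the existing inTimeWindow helper behaves. `inDailyWindow` is a hypothetical name, and a nil location is treated as UTC.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// minutesOfDay parses "HH:MM" into minutes since midnight.
func minutesOfDay(hhmm string) (int, error) {
	t, err := time.Parse("15:04", hhmm)
	if err != nil {
		return 0, err
	}
	return t.Hour()*60 + t.Minute(), nil
}

// inDailyWindow reports whether occurredAtMs (epoch milliseconds) falls
// inside a "HH:MM-HH:MM" window in loc. The start is inclusive and the
// end exclusive. Windows that wrap past midnight are supported.
func inDailyWindow(occurredAtMs int64, window string, loc *time.Location) (bool, error) {
	if loc == nil {
		loc = time.UTC
	}
	startStr, endStr, ok := strings.Cut(window, "-")
	if !ok {
		return false, fmt.Errorf("window %q is not in HH:MM-HH:MM form", window)
	}
	start, err := minutesOfDay(startStr)
	if err != nil {
		return false, fmt.Errorf("window %q: bad start: %w", window, err)
	}
	end, err := minutesOfDay(endStr)
	if err != nil {
		return false, fmt.Errorf("window %q: bad end: %w", window, err)
	}
	local := time.UnixMilli(occurredAtMs).In(loc)
	now := local.Hour()*60 + local.Minute()
	if start <= end {
		return now >= start && now < end, nil
	}
	// Wrapping window such as "22:00-06:00".
	return now >= start || now < end, nil
}

func main() {
	// 1970-01-01T10:30:00Z expressed as epoch milliseconds.
	inside, err := inDailyWindow(37_800_000, "09:00-17:00", nil)
	fmt.Println(inside, err)
}
```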

5.2 Mission Context (detection — MissionDetectionPayload)

| CEL Expression | Type | Description |
|---|---|---|
| detection.class_name | string | "person", "vehicle", "bicycle" |
| detection.confidence | float | 0.0–1.0 |
| detection.location.lat_e6 | int32 | GPS latitude (microdegrees) |
| detection.location.lng_e6 | int32 | GPS longitude (microdegrees) |
| detection.bbox_x_min | float | Bounding box left |
| detection.bbox_y_min | float | Bounding box top |
| detection.bbox_x_max | float | Bounding box right |
| detection.bbox_y_max | float | Bounding box bottom |
| detection.detector_id | bytes | UUID of detecting device |
| detection.detector_label | string | Human-readable device name |

5.3 Example Rules

| Natural Language | Context | CEL Expression |
|---|---|---|
| Alert on persons with high confidence | Alert | event.label == "person" && event.confidence > 0.8 |
| Critical severity events only | Alert | event.severity == "CRITICAL" |
| Detections from camera hw-001 | Alert | event.data_type == "detection" && event.hardware_id == "hw-001" |
| Events tagged as outdoor (tag key present) | Alert | "outdoor" in event.tags |
| Person detected with >90% confidence | Mission | detection.class_name == "person" && detection.confidence > 0.9 |
| Vehicle near HQ (latitude between 33.7° and 33.8°) | Mission | detection.class_name == "vehicle" && detection.location.lat_e6 > 33700000 && detection.location.lat_e6 < 33800000 |
| Any detection from porch camera | Mission | detection.detector_label == "front porch camera" |
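
Because lat_e6 and lng_e6 are microdegrees (degrees × 1,000,000), a latitude of 33.7° is written as 33700000, and valid latitudes stay within ±90000000. The conversion is easy to get wrong by a factor of ten, so a small helper may be worth having when building geofence rules. `toE6` and `latBandExpr` below are hypothetical helpers for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// toE6 converts decimal degrees to microdegrees, rounding to the nearest
// integer. Any valid latitude or longitude fits comfortably in int32.
func toE6(degrees float64) int32 {
	return int32(math.Round(degrees * 1e6))
}

// latBandExpr builds the latitude part of a mission CEL rule for an
// open interval (minDeg, maxDeg).
func latBandExpr(minDeg, maxDeg float64) string {
	return fmt.Sprintf(
		"detection.location.lat_e6 > %d && detection.location.lat_e6 < %d",
		toE6(minDeg), toE6(maxDeg),
	)
}

func main() {
	fmt.Println(toE6(33.7))              // 33700000
	fmt.Println(latBandExpr(33.7, 33.8)) // latitude band as a CEL fragment
}
```

A full "near HQ" rule would add the matching lng_e6 bounds in the same way.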

6. Why This Architecture?

| Feature | Benefit |
|---|---|
| cel.Declarations + Root Object | Prevents invalid rules from reaching the database. Acts as a strict compiler firewall against LLM hallucinations. |
| Zero Maintenance | Adding fields to TriggerEvent or MissionDetectionPayload in the proto automatically makes them available in CEL; no hand-written Go changes are needed. |
| Proto-First Evaluation | Uses native Go structs and direct memory access instead of slow JSON parsing or map lookups. |
| Sandboxing | CEL is not Turing-complete. Users cannot write infinite loops or access the file system. |
| Existing Proto Schema | No new .proto files required. TriggerEvent, CelRule, MissionRule, and the validation RPCs are all already defined. |
| Two-Tier Rule System | Global alert rules (AlertDefinition.rules) for cross-cutting concerns + per-mission rules (MissionRule) for scoped detection logic. |
| google.protobuf.Struct Escape Hatch | event.facts allows dynamic JSON payloads for data not yet modeled in the proto schema. |
| NATS Subject Matching | AlertDefinition.topic_pattern pre-filters which definitions are even evaluated, reducing unnecessary CEL evaluations. |
| Built-in Actions | SMS, Email, Webhook, Slack, Teams, Discord, PagerDuty already modeled in AlertAction and MissionRuleAction. |
| Go Tenant Guard (No CEL Overhead) | A single broad NATS subscription serves all tenants. Tenant isolation is enforced by a plain Go if check on tenant_id + facility_id, at nanosecond cost, before the CEL engine is ever invoked. Three layers of defense: (1) LLM prompt scoping, (2) Go guard at eval time, (3) DB-level RLS. The LLM never controls tenant_id or facility_id. |

7. Dependencies

The following Go packages are required:

github.com/google/cel-go              # Already in cloud/go.mod
google.golang.org/protobuf            # Already in cloud/go.mod
github.com/SkyDaddyAI/sensemesh/cloud/proto  # Generated proto package

cel-go is already present in the monorepo's cloud/go.mod.
