Status: Draft v1.0
Author: Sensemesh Architecture Team
Context: Event-Driven Architecture, gRPC/Protobuf, Go, NATS
This document outlines the architecture for a high-performance event filtering system. The system allows users to define data filtering rules using natural language (e.g., "Alert me if a person is detected outside business hours").
The solution utilizes an LLM as a Transpiler to convert user intent into Common Expression Language (CEL) scripts. These scripts are strictly validated against the Sensemesh Protobuf Schema before being deployed to a high-frequency Data Plane, ensuring safety, type correctness, and microsecond-level evaluation latency.
The system follows a "Compiler-Runtime" pattern, separating the slow, interactive rule generation (Control Plane) from the fast, real-time evaluation (Data Plane).
flowchart LR
subgraph Control Plane
A[User Chat] --> B[Chat Service<br/>Gemini LLM]
B --> C{CEL Validator}
C -- valid --> D[(DB: CelRule /<br/>MissionRule)]
C -- invalid --> B
end
subgraph Data Plane
D -- startup/reload --> E[CEL Engine<br/>cached Programs]
F[NATS Event Stream] --> G[Alert Worker]
G --> E
E -- match --> H[Action Dispatcher<br/>SMS / Email / Slack / etc.]
end
This diagram shows the complete path from a user creating a rule via chat, through to a real-world event triggering an alert delivery.
sequenceDiagram
participant User
participant ChatSvc as Chat Service
participant LLM as Gemini 1.5 Flash
participant Val as CEL Validator
participant DB as PostgreSQL
participant Worker as Alert Worker
participant NATS as NATS Bus
participant Camera as Camera / OD Agent
participant SMS as SMS Gateway
rect rgb(240, 248, 255)
Note over User,DB: Phase 1 — Rule Creation (Control Plane)
User->>ChatSvc: "Text me if a person is seen on the front porch after hours"
Note over ChatSvc: Extract Principal (tenant_id, facility_ids) from gRPC context
ChatSvc->>LLM: System prompt (scoped devices, facilities) + TriggerEvent schema + tool defs
loop Self-Healing Validation
LLM->>ChatSvc: Tool: create_cel_rule(expression, actions)
ChatSvc->>Val: Compile expression against TriggerEvent proto
alt Invalid
Val-->>ChatSvc: error (bad field / type mismatch)
ChatSvc->>LLM: FunctionResponse with error
end
end
Val-->>ChatSvc: is_valid=true
Note over ChatSvc: Server-side: stamp tenant_id + facility_id (UUIDs) from Principal
ChatSvc->>DB: Save AlertDefinition + CelRule + AlertAction(SMS) (tenant_id/facility_id as UUID)
ChatSvc->>User: "Done! You'll get a text when a person is detected after hours."
end
rect rgb(245, 255, 245)
Note over Worker,SMS: Phase 2 — Rule Deployment (Data Plane Boot)
Worker->>DB: ListAlertDefinitions(enabled_only=true)
DB-->>Worker: AlertDefinitions with CelRules + tenant_id/facility_id (UUIDs)
Worker->>Worker: Compile user CEL → cache cel.Programs + build tenant guard maps (UUID→string once)
end
rect rgb(255, 248, 240)
Note over Camera,SMS: Phase 3 — Live Event Evaluation (Data Plane Runtime)
Camera->>NATS: TriggerEvent (person detected, 22:30, hw=porch-cam)
NATS->>Worker: Deliver event (broad subscription: *.*.*.*.detection.>)
Worker->>Worker: Go guard: event.customer_id == def.tenant_id && event.facility_id == def.facility_id
Worker->>Worker: Guard passed (nanoseconds)
Worker->>Worker: CEL Eval: event.label=="person" && !event.environment.is_business_hours
Worker->>Worker: Result: true
Worker->>DB: Create AlertInstance (PENDING)
Worker->>SMS: Send "Person detected on front porch at 22:30"
SMS-->>Worker: Delivery confirmed
Worker->>DB: Update AlertInstance (DELIVERED)
end
We use the existing Sensemesh protobuf schema to drive the entire system. No new .proto files are required. The schema is the source for the generated Go structs, defines the gRPC services, and drives the CEL validation.
There are two evaluation contexts, each backed by an existing proto message:
Defined in proto/api/alerting.proto. This is the primary event envelope that arrives via NATS and is evaluated against CelRule expressions stored in AlertDefinition.
// proto/api/alerting.proto (lines 168-188)
message TriggerEvent {
string event_id = 1;
string region = 2;
string customer_id = 3;
string facility_id = 4;
string hardware_id = 5;
string data_type = 6; // "detection", "sensor", etc.
string stage = 7;
string label = 8; // "person", "vehicle", etc.
int64 occurred_at = 9; // epoch ms
google.protobuf.Struct facts = 10; // flexible JSON payload
string schema_id = 11;
string schema_hash = 12;
string nats_subject = 13;
google.protobuf.Struct environment = 14; // time, weather, etc.
map<string, string> tags = 15;
optional float confidence = 16; // 0.0-1.0
optional string severity = 17; // LOW, MEDIUM, HIGH, CRITICAL
}

Defined in proto/mission/mission.proto. Used for per-mission CEL rules that evaluate against real-time detection events during active missions.
// proto/mission/mission.proto (lines 146-159)
message MissionDetectionPayload {
string class_name = 1; // "person", "vehicle", etc.
float confidence = 2; // 0.0-1.0
LatLngMicro location = 3; // GPS coordinates
float bbox_x_min = 4;
float bbox_y_min = 5;
float bbox_x_max = 6;
float bbox_y_max = 7;
bytes detector_id = 8; // UUID of detecting device
string detector_label = 9; // Human-readable device name
}

Rules and their actions are already modeled in the proto schema:
| Concept | Proto Message | File |
|---|---|---|
| Alert rule | `CelRule` | `alerting.proto:29` |
| Alert rule grouping | `AlertDefinition` | `alerting.proto:12` |
| Alert actions | `AlertAction` (SMS, Email, Webhook, Slack, Teams, Discord, PagerDuty, Script) | `alerting.proto:54` |
| Mission rule | `MissionRule` | `mission_rules.proto:27` |
| Mission actions | `MissionRuleAction` (SMS, Discord) | `mission_rules.proto:20` |
| Alert instance tracking | `AlertInstance` | `alerting.proto:114` |
Goal: Allow users to safely create rules via chat. Challenges:
- LLMs can hallucinate invalid variable names (e.g., `event.clownDetected` instead of `event.label == "clown"`).
- Users must never see or create rules that cross tenant boundaries.
- Mandatory filters (`tenant_id`, `facility_id`) must be enforced server-side, not delegated to the LLM.
The authenticated user's tenant_id and facility_id(s) are extracted from the gRPC Principal (via security.FromContext(ctx)) — the same mechanism used by the RLS interceptor across all Sensemesh services. These values are never sent to the LLM and never embedded in the LLM-generated CEL expression. Instead, they are enforced through three independent mechanisms:
- System prompt scoping — the LLM only sees devices/facilities belonging to the user's tenant (scoped tool results).
- Go-level tenant guard — each rule record stores the owning `tenant_id` and `facility_id`. At evaluation time, a plain Go `if` check compares these fields against the incoming event's `customer_id` and `facility_id`. This runs before the CEL engine is invoked — zero CEL overhead for tenant isolation.
- DB-level RLS — rules are stored with tenant_id/facility_id columns and the RLS interceptor ensures queries only return the user's own rules.
flowchart TD
A[Authenticated User] -->|gRPC metadata| B[Chat Service]
B -->|security.FromContext| C{Principal}
C -->|tenant_id, facility_ids| D[System Prompt Context]
C -->|tenant_id, facility_ids| E[DB Record<br/>tenant_id + facility_id columns]
C -->|tenant_id, facility_ids| F[DB Record<br/>RLS interceptor]
D -->|scoped tool results| G[LLM sees only<br/>user's devices]
E -->|stamped on rule record| H[Data Plane:<br/>Go if-check runs FIRST<br/>before CEL eval]
F -->|RLS interceptor| I[Only tenant's rules<br/>loaded into memory]
subgraph Data Plane Evaluation
J[Event arrives] --> K{Go if-check:<br/>tenant + facility match?}
K -- No --> L[REJECT — skip rule]
K -- Yes --> M{User CEL:<br/>LLM-generated filter}
M -- true --> N[Alert!]
M -- false --> O[Skip]
end
Critical rule: The LLM never controls tenant_id or facility_id. It cannot forge, omit, or override them. The tenant/facility check is a hardcoded Go string comparison — no CEL involved, no expression to tamper with.
When the Chat Service starts a Gemini session, it builds a system prompt that includes the user's scoped context. This ensures the LLM's tool calls (e.g., resolve_device) are automatically filtered to the user's tenant and facilities.
func (s *ChatServer) buildSystemPrompt(ctx context.Context) string {
p, _ := security.FromContext(ctx)
// Fetch user's devices (scoped to their tenant + facilities)
devices := s.deviceMetaSvc.ListDevicesForFacilities(ctx, p.TenantID, p.FacilityIDs)
return fmt.Sprintf(`You are a rule-creation assistant for Sensemesh.
The user belongs to tenant "%s" and has access to these facilities: %v.
Available devices:
%s
When creating CEL rules:
- Use the resolved device IDs, not display names, in CEL expressions.
- Do NOT include tenant_id or facility_id in the CEL expression — these are enforced automatically.
- Use resolve_device() to look up device UUIDs.
- Use resolve_enum() to look up enum integer values.
- The CEL expression should only contain the user's filtering logic (e.g., object class, confidence, time).
`,
p.TenantID.String(),
facilityNames(p.FacilityIDs),
formatDeviceList(devices),
)
}

What the LLM sees (example):
Available devices:
- "front gate camera" (id: 1321abc1-..., type: camera, facility: Main Office)
- "parking lot sensor" (id: 9f8e7d6c-..., type: sensor, facility: Main Office)
- "warehouse cam" (id: ab12cd34-..., type: camera, facility: Warehouse B)
What the LLM does NOT see: tenant_id bytes, facility_id bytes, or any device from another tenant.
When the LLM's create_cel_rule tool call passes validation, the Chat Service stamps the authenticated user's tenant_id and facility_id onto the rule record from the Principal. These are plain database columns — not part of the CEL expression. At evaluation time, the Alert Worker performs a Go string comparison against the incoming event before the CEL engine is ever invoked.
Why a Go if-check instead of a guard CEL expression?
A Go `if` on two string fields costs a length check plus, at most, a short byte comparison per field — nanosecond cost, no CEL compilation, no CEL evaluation overhead. Since tenant_id and facility_id are always simple equality checks (or a small set membership test), there is no reason to involve the CEL engine at all.
Why not per-tenant NATS subscriptions? With N tenants × M facilities, you would need N×M dedicated subscriptions (and potentially N×M worker instances). The Go guard approach uses a single broad NATS subscription and filters at evaluation time — nanosecond overhead per rule, linear scaling regardless of tenant count.
sequenceDiagram
participant LLM as Gemini LLM
participant ChatSvc as Chat Service
participant Val as CEL Validator
participant DB as Database
LLM->>ChatSvc: Tool: create_cel_rule(<br/>expression='event.label == "person" && event.confidence > 0.8',<br/>actions=[{type: SMS, phone: "+1234567890"}])
ChatSvc->>Val: ValidateCelExpression(user expression)
Val-->>ChatSvc: is_valid=true
Note over ChatSvc: Stamp tenant_id + facility_id (UUID bytes) from Principal
ChatSvc->>DB: Save AlertDefinition + CelRule {<br/> cel_expression: (LLM-generated),<br/> tenant_id: UUID (from Principal),<br/> facility_id: UUID (from Principal)<br/>}
What gets saved to the database:
| Field | DB Type | Value | Source |
|---|---|---|---|
| `cel_expression` | `TEXT` | `event.label == "person" && event.confidence > 0.8` | LLM output (validated) |
| `tenant_id` | `UUID` | `a1b2c3d4-...` | Server-side from Principal (for RLS + runtime guard) |
| `facility_id` | `UUID` | `f5e6d7c8-...` | Server-side from Principal (for RLS + runtime guard) |
| `AlertDefinition.topic_pattern` | `TEXT` | `*.*.*.*.detection.*.*` | Broad pattern — no tenant in topic |
Note: All IDs (`tenant_id`, `facility_id`, rule `id`, etc.) are stored as `UUID` in PostgreSQL and represented as `[]byte` (16 bytes) in proto / `uuid.UUID` in Go. The `TriggerEvent.customer_id` and `TriggerEvent.facility_id` fields are `string` in the proto (hex-encoded UUID). The guard converts between the two representations at rule load time (see below).
Runtime guard (Go code, not CEL):
// ruleGuard is built once at rule load time. It converts the DB UUID bytes
// into their string representation so the hot-path comparison against
// TriggerEvent string fields is a single string ==, with no allocation.
type ruleGuard struct {
tenantID string // uuid.UUID(def.TenantId).String(), computed once
facilitySet map[string]struct{} // uuid.UUID(fid).String() for each facility; empty = match all
}
func newRuleGuard(tenantID []byte, facilityIDs [][]byte) *ruleGuard {
g := &ruleGuard{
tenantID: uuid.UUID(tenantID).String(),
}
if len(facilityIDs) > 0 {
g.facilitySet = make(map[string]struct{}, len(facilityIDs))
for _, fid := range facilityIDs {
g.facilitySet[uuid.UUID(fid).String()] = struct{}{}
}
}
return g
}
// allows checks if the event belongs to this rule's tenant and facility.
// Plain Go string comparison — no CEL, no allocation, nanosecond cost.
func (g *ruleGuard) allows(trigger *pb.TriggerEvent) bool {
if g.tenantID != trigger.CustomerId {
return false
}
if len(g.facilitySet) > 0 {
if _, ok := g.facilitySet[trigger.FacilityId]; !ok {
return false
}
}
return true
}

Why not embed tenant_id in the LLM-generated CEL? If the LLM were responsible for injecting `event.customer_id == "tenant-abc"` into every CEL expression, a prompt injection or hallucination could omit it — leaking data across tenants. The tenant guard is server-side Go code — the LLM has no access to it, and it cannot be bypassed by any CEL expression.

Why not a CEL guard expression? A CEL guard would require compiling and evaluating a second CEL program per rule per event. Since tenant isolation is always a simple string match, a Go `if` is strictly faster — nanoseconds vs. microseconds — and has zero risk of CEL-level bugs or injection.
We use cel.Declarations to enforce a strict contract. The LLM's output is compiled against the Protobuf definition. If the LLM invents a field that doesn't exist in the proto, the compilation fails immediately.
The "Root Object" Pattern: Instead of manually declaring every single field in our Go code, we declare one root variable per context. CEL then uses reflection to automatically "see" every field inside the proto.
func ValidateAlertCelRule(script string) (*pb.CelValidationResponse, error) {
env, _ := cel.NewEnv(
// Register the Proto Definition (The Blueprint)
cel.Types(&pb.TriggerEvent{}),
// Declare the Variable (The Instance)
// CEL auto-discovers: event.event_id, event.region, event.confidence, etc.
cel.Declarations(
decls.NewVar("event", decls.NewObjectType("proto.api.TriggerEvent")),
),
)
ast, issues := env.Compile(script)
if issues != nil && issues.Err() != nil {
return &pb.CelValidationResponse{
IsValid: false,
ErrorMessage: issues.Err().Error(),
}, nil
}
refs := extractVariableReferences(ast)
return &pb.CelValidationResponse{
IsValid: true,
EvaluationSuccess: true,
VariableReferences: refs,
}, nil
}

func ValidateMissionCelRule(script string) (*pb.ValidateMissionRuleCelResponse, error) {
env, _ := cel.NewEnv(
cel.Types(&pb.MissionDetectionPayload{}),
cel.Declarations(
decls.NewVar("detection",
decls.NewObjectType("proto.mission.MissionDetectionPayload")),
),
)
_, issues := env.Compile(script)
if issues != nil && issues.Err() != nil {
return &pb.ValidateMissionRuleCelResponse{
IsValid: false,
ErrorMessage: issues.Err().Error(),
}, nil
}
return &pb.ValidateMissionRuleCelResponse{IsValid: true}, nil
}

These RPCs are already defined in the proto schema and should be backed by the validation logic above:
| RPC | Service | Proto File |
|---|---|---|
| `ValidateCelExpression(CelValidationRequest)` | `AlertDefinitionService` | `alerting.proto:215` |
| `TestAlertDefinition(TestAlertDefinitionRequest)` | `AlertDefinitionService` | `alerting.proto:216` |
| `ValidateCelExpression(ValidateMissionRuleCelRequest)` | `MissionRuleService` | `mission_rules.proto:127` |
The chat service (cloud/chat-service/) acts as the LLM orchestrator. When a user requests a rule in natural language:
sequenceDiagram
participant User
participant ChatService as Chat Service (Go)
participant Gemini as Gemini LLM
participant Validator as CEL Validator
participant DB as Database
User->>ChatService: "Alert me when a clown is detected"
Note over ChatService: Inject Principal context (tenant_id, facility_ids, scoped device list)
ChatService->>Gemini: System prompt (tenant-scoped) + TriggerEvent schema + tool definitions
Gemini->>ChatService: Tool call: create_cel_rule("event.objects.contains(\"clown\")")
Note over ChatService: Validate before persisting
ChatService->>Validator: ValidateCelExpression("event.objects.contains(\"clown\")")
Validator-->>ChatService: is_valid=false, error="TriggerEvent has no field 'objects'"
Note over ChatService: Feed error back to LLM
ChatService->>Gemini: FunctionResponse: {error: "no field 'objects'", hint: "use event.label"}
Gemini->>ChatService: Tool call: create_cel_rule("event.label == \"clown\"")
ChatService->>Validator: ValidateCelExpression("event.label == \"clown\"")
Validator-->>ChatService: is_valid=true, variable_references=["event.label"]
Note over ChatService: Validated — safe to persist
Note over ChatService: Stamp tenant_id + facility_id from Principal onto AlertDefinition
ChatService->>DB: CreateAlertDefinition (CelRule + tenant_id/facility_id as UUID)
DB-->>ChatService: AlertDefinition saved
ChatService->>Gemini: FunctionResponse: {status: "created", rule_id: "uuid-123"}
Gemini->>ChatService: Text: "Done! I created an alert rule that triggers when a clown is detected."
ChatService->>User: "Done! I created an alert rule that triggers when a clown is detected."
The CelValidationResponse.variable_references field is returned to the LLM so it can confirm which fields were actually resolved — providing an additional sanity check.
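A simplified, stdlib-only stand-in for that extraction — the production validator walks the checked CEL AST, but a regex over dotted paths rooted at the declared `event` variable conveys the idea:

```go
package main

import (
	"fmt"
	"regexp"
)

// refPattern matches dotted field paths rooted at the "event" variable,
// e.g. event.label or event.facts.object_class. Illustrative only; a real
// implementation inspects the compiled AST rather than the raw text.
var refPattern = regexp.MustCompile(`\bevent(\.[a-zA-Z_][a-zA-Z0-9_]*)+`)

// extractVariableReferences returns each distinct referenced path in order
// of first appearance.
func extractVariableReferences(expr string) []string {
	seen := map[string]bool{}
	var refs []string
	for _, m := range refPattern.FindAllString(expr, -1) {
		if !seen[m] {
			seen[m] = true
			refs = append(refs, m)
		}
	}
	return refs
}

func main() {
	refs := extractVariableReferences(`event.label == "person" && event.confidence > 0.8`)
	fmt.Println(refs) // [event.label event.confidence]
}
```

Returning these paths lets the LLM confirm the rule touches exactly the fields the user asked about.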
Users refer to devices, facilities, and other entities by their display names (e.g., "front gate camera"), but CEL expressions operate on raw byte UUIDs and integer enum values. The Chat Service must resolve these before the LLM emits the final CEL expression.
Protobuf bytes fields (like device_id, facility_id) store UUIDs as raw bytes, not as hex strings. A user saying "front gate camera" must be translated to the bytes representation of 1321abc1-2131-23... before the CEL expression can reference it. Similarly, a user saying "person" in the context of an ObjectClass enum must be translated to its integer value (1), not left as a string comparison.
The Chat Service exposes lookup tools to the LLM. When the user references an entity by name, the LLM calls the lookup tool first, receives the resolved ID, and then uses that ID in the CEL expression it generates.
sequenceDiagram
participant User
participant ChatSvc as Chat Service
participant LLM as Gemini LLM
participant DeviceMeta as DeviceMetadataService (gRPC)
participant Val as CEL Validator
participant DB as Database
User->>ChatSvc: "Alert me when a person is seen on the front gate camera"
Note over ChatSvc: Principal extracted: tenant_id=abc, facility_ids=[fac-123]
ChatSvc->>LLM: System prompt (tenant-scoped device list) + tool definitions
Note over LLM: Step 1 — Resolve device name to ID
LLM->>ChatSvc: Tool: resolve_device(name="front gate camera")
ChatSvc->>DeviceMeta: GetDeviceInfo (search by device_name, scoped to tenant + facility)
DeviceMeta-->>ChatSvc: DeviceInfo{device_id: 0x1321abc1..., device_name: "front gate camera"}
ChatSvc->>LLM: FunctionResponse: {device_id: "1321abc1-2131-2300-0000-000000000000", hardware_id: "1321abc1-2131-2300-0000-000000000000"}
Note over LLM: Step 2 — Resolve enum "person" to integer
LLM->>ChatSvc: Tool: resolve_enum(enum_type="ObjectClass", value="person")
ChatSvc->>ChatSvc: Static lookup: PERSON = 1
ChatSvc->>LLM: FunctionResponse: {enum_name: "PERSON", enum_value: 1}
Note over LLM: Step 3 — Generate CEL with resolved values
LLM->>ChatSvc: Tool: create_cel_rule(expression='event.hardware_id == "1321abc1-2131-2300-0000-000000000000" && int(event.facts.object_class) == 1')
ChatSvc->>Val: ValidateCelExpression(...)
Val-->>ChatSvc: is_valid=true
ChatSvc->>DB: Save AlertDefinition + CelRule
ChatSvc->>User: "Done! You'll get an alert when a person is detected on the front gate camera."
The DeviceMetadataService (proto/model/device_metadata.proto) provides the lookup. The Chat Service wraps this as a Gemini tool:
// Tool definition exposed to the LLM
var resolveDeviceTool = &genai.FunctionDeclaration{
Name: "resolve_device",
Description: "Look up a device's UUID and metadata by its user-friendly display name.",
Parameters: &genai.Schema{
Type: genai.TypeObject,
Properties: map[string]*genai.Schema{
"name": {Type: genai.TypeString, Description: "The user-assigned name of the device (e.g., 'front gate camera')"},
},
Required: []string{"name"},
},
}
// Execution: calls DeviceMetadataService.GetDeviceInfo via gRPC
func (s *ChatServer) resolveDevice(ctx context.Context, name string) (map[string]interface{}, error) {
	// Search devices by name within the user's tenant/facility scope
	deviceInfo, err := s.deviceMetaSvc.GetDeviceInfoByName(ctx, name)
	if err != nil {
		return nil, fmt.Errorf("device %q not found: %w", name, err)
	}
	// Return the UUID as a hex string — the LLM embeds this literal in the CEL expression
	return map[string]interface{}{
		"device_id":     uuid.UUID(deviceInfo.DeviceId).String(),
		"hardware_id":   uuid.UUID(deviceInfo.DeviceId).String(),
		"device_name":   deviceInfo.DeviceName,
		"facility_id":   uuid.UUID(deviceInfo.FacilityId).String(),
		"facility_name": deviceInfo.FacilityName,
		"device_type":   deviceInfo.DeviceType,
	}, nil
}

The LLM receives the resolved UUID and uses it as a string literal in the CEL expression. At evaluation time, `TriggerEvent.hardware_id` is already a string field, so this works directly:
event.hardware_id == "1321abc1-2131-2300-0000-000000000000"
For MissionDetectionPayload.detector_id (which is bytes), the LLM uses the b"" bytes literal syntax in CEL, or the expression compares against detector_label (a string) instead:
// Option A: match by label (simpler, but breaks if device is renamed)
detection.detector_label == "front gate camera"
// Option B: match by resolved UUID bytes (stable, survives renames)
detection.detector_id == b"\x13\x21\xab\xc1\x21\x31\x23\x00..."
The system prompt instructs the LLM to prefer Option B (ID-based matching) for durability.
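Rendering a resolved UUID as a CEL bytes literal can be sketched as follows (`celBytesLiteral` is a hypothetical helper; escaping every byte as `\xNN` keeps the literal valid CEL regardless of whether the byte is printable):

```go
package main

import "fmt"

// celBytesLiteral renders raw UUID bytes as a CEL bytes literal
// (b"\x13\x21..."), suitable for embedding a resolved detector_id
// into a mission-rule expression.
func celBytesLiteral(b []byte) string {
	out := `b"`
	for _, c := range b {
		out += fmt.Sprintf(`\x%02x`, c)
	}
	return out + `"`
}

func main() {
	fmt.Println(celBytesLiteral([]byte{0x13, 0x21, 0xab, 0xc1}))
	// b"\x13\x21\xab\xc1"
}
```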
Protobuf enums (like ObjectClass) are represented as integers in the wire format and in CEL evaluation. The LLM must emit integer comparisons, not string comparisons.
Why integers over strings:
| Approach | CEL Expression | Problem |
|---|---|---|
| String matching (bad) | `event.label == "person"` | Only works for `TriggerEvent.label` (a string field). Fragile — depends on the event producer spelling it exactly right. |
| Integer enum (good) | `int(event.facts.object_class) == 1` | Type-safe. Matches the proto enum definition. Cannot drift. |
The Chat Service exposes a static enum lookup tool:
var resolveEnumTool = &genai.FunctionDeclaration{
Name: "resolve_enum",
Description: "Resolve a human-readable enum value to its integer representation in the proto schema.",
Parameters: &genai.Schema{
Type: genai.TypeObject,
Properties: map[string]*genai.Schema{
"enum_type": {Type: genai.TypeString, Description: "The enum type name (e.g., 'ObjectClass', 'MissionEventType', 'AlertStatus')"},
"value": {Type: genai.TypeString, Description: "The human-readable enum value (e.g., 'person', 'car', 'critical')"},
},
Required: []string{"enum_type", "value"},
},
}

The lookup is a static map built from the proto descriptor at startup — no gRPC call needed:
// Built automatically from proto reflection at init time
var enumMaps = map[string]map[string]int32{
"ObjectClass": {
"person": 1, "bicycle": 2, "car": 3, "motorbike": 4,
"bus": 6, "train": 7, "truck": 8, "dog": 17, // ... all 80 classes
},
"MissionEventType": {
"detection": 1, "takeoff": 2, "hover": 3, "return": 4,
},
"AlertStatus": {
"pending": 1, "delivered": 2, "failed": 3,
},
}

The LLM calls `resolve_enum(enum_type="ObjectClass", value="person")`, receives `{enum_name: "PERSON", enum_value: 1}`, and emits:
int(event.facts.object_class) == 1
Note: For `TriggerEvent.label` (a plain string field), string matching is acceptable and simpler. The enum resolution is specifically for fields that carry protobuf enum types (e.g., `ObjectClass` in `Event.objects[].object_class`, `MissionEvent.Type`, etc.). The system prompt tells the LLM which fields are enums and which are strings.
When a device is deleted via DeviceMetadataService.DeleteDeviceInfo, any CEL rules that reference that device's ID become orphaned — they will never match again (harmless but wasteful) or worse, could match a recycled ID in the future.
sequenceDiagram
participant Admin
participant DeviceMeta as DeviceMetadataService
participant NATS as NATS Bus
participant RuleCleanup as Rule Cleanup Worker
participant DB as Database
participant Owner as Rule Owner (User)
Admin->>DeviceMeta: DeleteDeviceInfo(id=0x1321abc1...)
DeviceMeta->>DeviceMeta: Soft-delete device
DeviceMeta->>NATS: Publish device.deleted {device_id: "1321abc1..."}
NATS->>RuleCleanup: Receive device.deleted event
RuleCleanup->>DB: Query CelRules WHERE cel_expression CONTAINS "1321abc1..."
DB-->>RuleCleanup: [CelRule A, CelRule B]
alt Rules found
RuleCleanup->>DB: Disable CelRule A (enabled=false)
RuleCleanup->>DB: Disable CelRule B (enabled=false)
RuleCleanup->>Owner: Notify: "2 alert rules disabled — device 'front gate camera' was deleted"
end
Note over RuleCleanup: Also check MissionRules
RuleCleanup->>DB: Query MissionRules WHERE cel_expression CONTAINS "1321abc1..."
DB-->>RuleCleanup: [MissionRule X]
RuleCleanup->>DB: Disable MissionRule X (enabled=false)
RuleCleanup->>Owner: Notify: "Mission rule 'X' disabled — device 'front gate camera' was deleted"
Implementation strategy:

- Event-driven cleanup: When `DeviceMetadataService.DeleteDeviceInfo` is called, it publishes a `device.deleted` event on NATS. A cleanup worker subscribes to this topic.
- Rule scanning: The worker queries all `CelRule` and `MissionRule` records whose `cel_expression` contains the deleted device's UUID string. This is a simple `LIKE '%uuid-hex%'` query.
- Disable, don't delete: Rules are disabled (`enabled=false`), not deleted. This preserves audit history and allows the rule owner to review and re-target the rule to a replacement device.
- Owner notification: The rule owner is notified (via their preferred alert channel) that their rules were disabled due to device removal.
- Same pattern for facilities/tenants: If a facility is deleted, all rules scoped to that facility's devices should be disabled. The cleanup worker can cascade by first looking up all devices in the facility, then scanning rules for each device ID.
// Rule cleanup on device deletion
func (w *RuleCleanupWorker) OnDeviceDeleted(ctx context.Context, deviceID string) error {
	// Scan alert rules
	alertDefs, err := w.db.ListAlertDefinitions(ctx, &pb.ListAlertDefinitionsRequest{
		HasCelExpression: &deviceID, // LIKE '%deviceID%' on cel_expression
	})
	if err != nil {
		return err
	}
	for _, def := range alertDefs.Definitions {
		for _, rule := range def.Rules {
			if strings.Contains(rule.CelExpression, deviceID) {
				rule.Enabled = false
				w.db.UpdateCelRule(ctx, rule)
				w.notify(rule, "disabled: referenced device was deleted")
			}
		}
	}
	// Scan mission rules
	// (similar pattern using ListMissionRules across all missions)
	return nil
}

Design decision: We chose disable-over-delete because (a) users may want to re-target rules to a replacement device, (b) AlertInstances reference rule IDs and deleting them would break the audit trail, and (c) disabled rules have zero runtime cost since the worker skips them before CEL evaluation.
Goal: Evaluate thousands of rules per second against live NATS event streams. Advantage: Because we use Protobufs, we pass the gRPC message directly to the engine with zero serialization overhead.
Events arrive via NATS subjects matching the pattern:
<region>.<customer_id>.<facility_id>.<hardware_id>.<data_type>.<stage>.<label>
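The positional tokens in that subject can be illustrated with a small stdlib-only parser (`parseSubject` is a hypothetical helper; the worker normally reads these fields from the `TriggerEvent` itself rather than re-parsing the subject):

```go
package main

import (
	"fmt"
	"strings"
)

// parseSubject splits a Sensemesh NATS subject into its seven positional
// tokens: region, customer_id, facility_id, hardware_id, data_type, stage, label.
func parseSubject(subject string) (map[string]string, error) {
	parts := strings.Split(subject, ".")
	if len(parts) != 7 {
		return nil, fmt.Errorf("expected 7 tokens, got %d", len(parts))
	}
	keys := []string{"region", "customer_id", "facility_id", "hardware_id", "data_type", "stage", "label"}
	m := make(map[string]string, len(keys))
	for i, k := range keys {
		m[k] = parts[i]
	}
	return m, nil
}

func main() {
	m, _ := parseSubject("us-east.cust1.fac1.cam-01.detection.raw.person")
	fmt.Println(m["customer_id"], m["label"]) // cust1 person
}
```

This also makes the broad subscription pattern concrete: `*.*.*.*.detection.>` wildcards the first four tenant-specific tokens and pins only `data_type`.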
The AlertService.ProcessTrigger RPC (alerting.proto:231) is the entry point.
sequenceDiagram
participant Camera as Camera / Sensor
participant NATS as NATS Bus
participant Worker as Alert Worker
participant CEL as CEL Engine<br/>(cached Programs)
participant DB as Database
participant Dispatch as Action Dispatcher
Note over Worker,CEL: Startup: load & compile rules
Worker->>DB: ListAlertDefinitions(enabled_only=true)
DB-->>Worker: AlertDefinitions with CelRules
Worker->>CEL: CompileRules() → cached cel.Programs
Note over Camera,Dispatch: Runtime: event stream processing
Camera->>NATS: Publish TriggerEvent on<br/>us-east.cust1.fac1.cam-01.detection.raw.person
NATS->>Worker: Deliver TriggerEvent (topic matches AlertDefinition.topic_pattern)
Worker->>CEL: Eval(event=TriggerEvent) for each matched rule
alt CEL returns true
CEL-->>Worker: match=true
Worker->>DB: Create AlertInstance (rule_id, trigger_data, cel_context)
Worker->>Dispatch: Execute AlertActions
Dispatch->>Dispatch: SMS / Email / Webhook / Slack / Discord / PagerDuty
else CEL returns false
CEL-->>Worker: match=false
Note over Worker: Skip — no action
else CEL error
CEL-->>Worker: error
Worker->>DB: Create AlertInstance (status=CEL_ERROR)
end
The Data Plane uses a single broad NATS subscription (e.g., *.*.*.*.detection.>) shared across all tenants. Tenant isolation is enforced at evaluation time through a two-stage pipeline:
- Stage 1 — Go tenant guard (plain Go `if` check): compares the rule's `tenant_id` and `facility_id` against the event's `customer_id` and `facility_id`. If this fails, the rule is skipped immediately. No CEL involved — nanosecond cost.
- Stage 2 — User CEL expression (LLM-generated, validated): contains only the user's semantic filter logic. Only runs if the Go guard passes.
sequenceDiagram
participant NATS as NATS Bus<br/>(broad subscription)
participant Worker as Alert Worker
participant CEL as CEL Engine<br/>(cached Programs)
participant DB as Database
participant Dispatch as Action Dispatcher
NATS->>Worker: TriggerEvent (any tenant)
loop For each AlertDefinition (all tenants)
Worker->>Worker: Go guard: def.tenant_id == event.customer_id?
alt Tenant mismatch
Note over Worker: SKIP — wrong tenant (nanoseconds)
else Tenant match
Worker->>Worker: Go guard: def.facility_id == event.facility_id?
alt Facility mismatch
Note over Worker: SKIP — wrong facility (nanoseconds)
else Facility match
loop For each CelRule in definition
Worker->>CEL: Eval cel_expression against event
alt CEL matches
CEL-->>Worker: true
Worker->>DB: Create AlertInstance
Worker->>Dispatch: Execute AlertActions
else CEL does not match
CEL-->>Worker: false — skip
end
end
end
end
end
This design means:
- One worker handles all tenants — no per-tenant subscription fan-out.
- Go guard short-circuits on tenant mismatch in nanoseconds — the CEL engine is never invoked for events from other tenants.
- Defense in depth: even if the LLM hallucinated a rule that tried to match all events (e.g., `true`), the Go guard would still reject events from other tenants. The guard is hardcoded Go logic, not an expression the LLM can influence.
```go
func (e *AlertEngine) ProcessTrigger(trigger *pb.TriggerEvent) {
	// All definitions loaded at startup (scoped by RLS to authorized tenants).
	// The broad NATS subscription delivers events from all tenants.
	for _, def := range e.matchingDefinitions(trigger.NatsSubject) {
		// STAGE 1: Go tenant guard — plain string comparison, no CEL.
		// Short-circuits in nanoseconds if tenant/facility doesn't match.
		guard := e.guards[string(def.Id)]
		if !guard.allows(trigger) {
			continue // Wrong tenant/facility — skip entire definition
		}

		// Zero-allocation input: pass the proto pointer directly.
		// Only built if we pass the tenant guard.
		input := map[string]interface{}{
			"event": trigger,
		}

		for _, rule := range def.Rules {
			if !rule.Enabled {
				continue
			}
			// Time window check (e.g., "09:00-17:00", "weekdays")
			if rule.TimeWindow != nil && !inTimeWindow(*rule.TimeWindow) {
				continue
			}
			// Min confidence threshold
			if rule.MinConfidence != nil && trigger.Confidence < float32(*rule.MinConfidence)/100 {
				continue
			}

			// STAGE 2: User CEL expression (LLM-generated, validated)
			// Only reached if the Go guard passed.
			out, _, err := e.programs[string(rule.Id)].Eval(input)
			if err != nil {
				e.recordInstance(rule, trigger, pb.CEL_ERROR, err.Error())
				continue
			}
			if out.Value() == true {
				// Create AlertInstance (alerting.proto:114) with:
				//   rule_id, trigger_event_id, matched_topic, trigger_data,
				//   cel_context, cel_result
				// Then execute AlertActions: SMS, Email, Webhook, Slack, etc.
				e.createAlertInstance(rule, trigger)
				e.executeActions(rule.Actions, trigger)
				if rule.StopOnMatch {
					break
				}
			}
		}
	}
}
```

sequenceDiagram
participant OD as Object Detector
participant NATS as NATS Bus
participant MEngine as Mission Engine
participant CEL as CEL Engine<br/>(cached Programs)
participant Dispatch as Action Dispatcher
OD->>NATS: Publish MissionRuntimeEvent (type=DETECTION)
NATS->>MEngine: Deliver MissionDetectionPayload for mission-xyz
MEngine->>CEL: Eval(detection=MissionDetectionPayload) for each MissionRule
alt Rule matches
CEL-->>MEngine: match=true
MEngine->>Dispatch: Execute MissionRuleActions
Dispatch->>Dispatch: Render message_template → send SMS / Discord
else No match
CEL-->>MEngine: match=false
end
```go
func (e *MissionEngine) EvaluateDetection(detection *pb.MissionDetectionPayload, missionID []byte) {
	input := map[string]interface{}{
		"detection": detection,
	}
	for _, rule := range e.rulesForMission(missionID) {
		if !rule.Enabled || rule.IsDeleted {
			continue
		}
		out, _, err := e.programs[rule.Id].Eval(input)
		if err != nil {
			log.Printf("[mission-cel] rule %s eval error: %v", rule.Name, err)
			continue
		}
		if out.Value() == true {
			// Execute MissionRuleActions (SMS, Discord)
			// Render message_template with CEL placeholders
			e.executeActions(rule.Actions, detection)
		}
	}
}
```

At startup (or when rules change), compile user CEL expressions into `cel.Program` ASTs and build the Go tenant guard maps:
```go
func (e *AlertEngine) CompileRules(definitions []*pb.AlertDefinition) error {
	env, err := cel.NewEnv(
		cel.Types(&pb.TriggerEvent{}),
		cel.Declarations(
			decls.NewVar("event", decls.NewObjectType("proto.api.TriggerEvent")),
		),
	)
	if err != nil {
		return fmt.Errorf("CEL environment creation failed: %w", err)
	}
	for _, def := range definitions {
		defID := string(def.Id)
		// Build Go tenant guard from UUID bytes (converted to string once at load time)
		e.guards[defID] = newRuleGuard(def.TenantId, def.FacilityIds)
		// Compile user CEL expressions only
		for _, rule := range def.Rules {
			ast, issues := env.Compile(rule.CelExpression)
			if issues != nil && issues.Err() != nil {
				return fmt.Errorf("rule %s CEL compilation failed: %w", rule.Name, issues.Err())
			}
			prg, err := env.Program(ast)
			if err != nil {
				return fmt.Errorf("rule %s program creation failed: %w", rule.Name, err)
			}
			e.programs[string(rule.Id)] = prg
		}
	}
	return nil
}
```

| CEL Expression | Type | Description |
|---|---|---|
| `event.event_id` | string | Unique event identifier |
| `event.region` | string | Region identifier |
| `event.customer_id` | string | Customer/tenant identifier |
| `event.facility_id` | string | Facility identifier |
| `event.hardware_id` | string | Camera/sensor/device ID |
| `event.data_type` | string | `"detection"`, `"sensor"`, `"image"` |
| `event.stage` | string | Processing stage |
| `event.label` | string | `"person"`, `"vehicle"`, `"motion"` |
| `event.occurred_at` | int64 | Epoch milliseconds |
| `event.confidence` | float | 0.0–1.0 |
| `event.severity` | string | `"LOW"`, `"MEDIUM"`, `"HIGH"`, `"CRITICAL"` |
| `event.tags` | `map<string,string>` | Key-value tags |
| `event.nats_subject` | string | Full NATS subject |
| `event.facts` | `google.protobuf.Struct` | Dynamic JSON payload (escape hatch) |
| `event.environment` | `google.protobuf.Struct` | Environmental context |
| CEL Expression | Type | Description |
|---|---|---|
| `detection.class_name` | string | `"person"`, `"vehicle"`, `"bicycle"` |
| `detection.confidence` | float | 0.0–1.0 |
| `detection.location.lat_e6` | int32 | GPS latitude (microdegrees) |
| `detection.location.lng_e6` | int32 | GPS longitude (microdegrees) |
| `detection.bbox_x_min` | float | Bounding box left |
| `detection.bbox_y_min` | float | Bounding box top |
| `detection.bbox_x_max` | float | Bounding box right |
| `detection.bbox_y_max` | float | Bounding box bottom |
| `detection.detector_id` | bytes | UUID of the detecting device |
| `detection.detector_label` | string | Human-readable device name |
| Natural Language | Context | CEL Expression |
|---|---|---|
| Alert on persons with high confidence | Alert | event.label == "person" && event.confidence > 0.8 |
| Critical severity events only | Alert | event.severity == "CRITICAL" |
| Detections from camera hw-001 | Alert | event.data_type == "detection" && event.hardware_id == "hw-001" |
| Events tagged as outdoor | Alert | "outdoor" in event.tags |
| Person detected with >90% confidence | Mission | detection.class_name == "person" && detection.confidence > 0.9 |
| Vehicle near HQ (within lat/lng range) | Mission | detection.class_name == "vehicle" && detection.location.lat_e6 > 337000000 && detection.location.lat_e6 < 338000000 |
| Any detection from porch camera | Mission | detection.detector_label == "front porch camera" |
| Feature | Benefit |
|---|---|
| `cel.Declarations` + Root Object | Prevents invalid rules from reaching the database. Acts as a strict compiler firewall against LLM hallucinations. |
| Zero Maintenance | Adding fields to `TriggerEvent` or `MissionDetectionPayload` in the proto automatically makes them available in CEL — no Go code changes needed. |
| Proto-First Evaluation | Uses native Go structs and direct memory access instead of slow JSON parsing or map lookups. |
| Sandboxing | CEL is not Turing-complete. Users cannot write infinite loops or access the file system. |
| Existing Proto Schema | No new `.proto` files required. `TriggerEvent`, `CelRule`, `MissionRule`, validation RPCs — all already defined. |
| Two-Tier Rule System | Global alert rules (`AlertDefinition.rules`) for cross-cutting concerns + per-mission rules (`MissionRule`) for scoped detection logic. |
| `google.protobuf.Struct` Escape Hatch | `event.facts` allows dynamic JSON payloads for data not yet modeled in the proto schema. |
| NATS Subject Matching | `AlertDefinition.topic_pattern` pre-filters which definitions are evaluated at all, reducing unnecessary CEL evaluations. |
| Built-in Actions | SMS, Email, Webhook, Slack, Teams, Discord, PagerDuty already modeled in `AlertAction` and `MissionRuleAction`. |
| Go Tenant Guard (No CEL Overhead) | A single broad NATS subscription serves all tenants. Tenant isolation is enforced by a plain Go `if` check on `tenant_id` + `facility_id` — nanosecond cost, run before the CEL engine is ever invoked. Three layers of defense: (1) LLM prompt scoping, (2) Go guard at eval time, (3) DB-level RLS. The LLM never controls `tenant_id` or `facility_id`. |
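As an illustration of the escape hatch, rules can reach into `event.facts` with ordinary field access, since cel-go exposes `google.protobuf.Struct` as dynamic JSON-like values. The `zone` key below is hypothetical, standing in for whatever the ingest pipeline chooses to place in `facts`:

```
// Hypothetical: assumes the ingest pipeline put a "zone" key into event.facts
event.data_type == "sensor" && event.facts.zone == "restricted"
```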
The following Go packages are required:

```
github.com/google/cel-go                     # Already in cloud/go.mod
google.golang.org/protobuf                   # Already in cloud/go.mod
github.com/SkyDaddyAI/sensemesh/cloud/proto  # Generated proto package
```

`cel-go` is already present in the monorepo's `cloud/go.mod`.