@davidlu1001
Created May 21, 2025 04:49
SOP - Immuta Operations

Immuta Operations SOP - BAU / Change Management / Incident Learnings (with Snowflake Integration)


Quick Reference Table

| Task                  | API Endpoint        | Script Example         | Section |
|-----------------------|---------------------|------------------------|---------|
| User Staging          | /api/v2/user/status | stage_users.py         | 3.1     |
| Tag Migration         | /api/v2/tag         | migrate_tags.py        | 2.4     |
| Policy Testing        | /api/v2/policy/test | test_policy.py         | 2.3     |
| User Sync             | /api/v2/user/sync   | sync_users.py          | 2.2     |
| SDD Scan              | /api/v2/sdd/scan    | run_sdd_scan.py        | 2.4     |
| Data Product Creation | /api/v2/dataSource  | create_data_product.py | 2.5     |

1. Introduction & Core Principles

This Standard Operating Procedure (SOP) outlines best practices for Business As Usual (BAU) operations, change management, and incident handling for the Immuta Data Security Platform integrated with Snowflake. It addresses challenges in your environment: restricted MMD access, AD-based authentication, incomplete API documentation, and complex tasks like tag migration across environments.

Core Principles:

  • Reliability First: Ensure uptime and rapid recovery with proactive monitoring.
  • Automation-Driven: Prioritize API-based automation to reduce toil, despite documentation gaps.
  • Security by Default: Enforce least-privilege and auditability in all operations.
  • Cost Efficiency: Optimize Snowflake compute and storage usage.
  • Continuous Learning: Integrate incident learnings to prevent recurrence.
  • User-Centric Design: Streamline Data Marketplace access for seamless user experience.
  • Disciplined Change Management: Test, validate, and log all changes to minimize risks.

2. Key BAU Tasks and Operational Areas

2.1. Environment Health & Configuration Management

Snowflake Warehouse for Immuta

  • Tasks: Configure and monitor a dedicated Snowflake warehouse for Immuta.
  • Incident Learning: Undersized warehouses cause query queuing; unoptimized syncs spike costs (e.g., during SDD scans).
  • Checklist:
    • Use X-Small warehouse with 1-minute auto-suspend and transient tables.
    • Monitor query history (filter by IMMUTA_USER) for performance (query duration >5s triggers alerts).
    • Set Snowflake budget alerts (>10 credits/day).
    • Define SLI: 99% of Immuta queries complete in <5s.
  • Best Practice: Use /api/v2/dataSource/sync to batch metadata updates, scheduling during off-peak hours (2 AM NZST).
  • Reminder: Review warehouse usage biweekly via Snowflake’s cost management dashboard.
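
The 5-second SLI in the checklist above can be computed from exported query durations. The following is a minimal sketch, assuming the durations (in seconds) have already been pulled from Snowflake query history for the Immuta user; the function names are illustrative and not part of any Immuta or Snowflake API.

```python
def latency_sli(durations_s, threshold_s=5.0):
    """Return the fraction of queries that finished in under threshold_s seconds."""
    if not durations_s:
        return 1.0  # no queries in the window: treat the objective as met
    fast = sum(1 for d in durations_s if d < threshold_s)
    return fast / len(durations_s)


def meets_target(durations_s, target=0.99, threshold_s=5.0):
    """True when at least `target` of the queries beat the latency threshold."""
    return latency_sli(durations_s, threshold_s) >= target


if __name__ == "__main__":
    sample = [0.4, 1.2, 6.1, 2.0]  # hypothetical durations in seconds
    print(latency_sli(sample), meets_target(sample))
```

Treating an empty window as compliant is a design choice; a stricter variant could return None so that missing data is surfaced separately.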

Immuta-Snowflake Connections

  • Tasks: Ensure secure, reliable connectivity for metadata and policy enforcement.
  • Incident Learning: Expired AD credentials or network latency (>100ms) halt operations.
  • Checklist:
    • Validate AD-based credentials monthly via /api/v2/connection/test.
    • Apply least-privilege to Immuta’s Snowflake user (e.g., SELECT for metadata, CREATE VIEW for policies).
    • Monitor latency between Immuta and Snowflake (<50ms target).
    • Document zero-downtime credential rotation process.
  • Best Practice: Enable Connections feature for scalable onboarding (contact Immuta support for pre-Feb 2025 tenants).
  • Reminder: Test connections in a sandbox before production changes.

Immuta Platform Health

  • Tasks: Monitor Immuta services, API, and resource utilization.
  • Incident Learning: API failures (e.g., 500 errors) or job queue backlogs (>100 tasks) block operations.
  • Checklist:
    • Monitor logs (/var/log/immuta) and API p99 latency (>500ms triggers alerts).
    • Track CPU/memory/disk for on-prem Immuta instances (alert on >80% utilization).
    • Set health checks via /api/v2/health (99.9% uptime SLI).
    • Alert on job queue length (>100 tasks).
  • Best Practice: Deploy Immuta in high-availability mode across two availability zones.
  • Reminder: Subscribe to Immuta’s Status page for service updates.

2.2. User and Access Management

User Synchronization with AD

  • Tasks: Sync users/groups from AD to Immuta for accurate access control.
  • Incident Learning: Sync failures cause stale attributes, leading to policy misapplication or access denials.
  • Checklist:
    • Monitor sync logs daily via /api/v2/audit (alert on <95% success rate).
    • Audit orphaned accounts quarterly using /api/v2/user.
    • Alert on user/group count changes (>5% deviation).
    • Document manual attribute override process.
  • Best Practice: Automate syncs with /api/v2/user/sync:
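    import logging
    import os

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def sync_users(headers):
        """Trigger an AD user sync and log the reported status."""
        try:
            response = requests.post(f"{IMMUTA_URL}/api/v2/user/sync", headers=headers, timeout=30)
            response.raise_for_status()
            # .get() avoids a KeyError if the response body has no "status" field.
            logging.info("Sync completed: %s", response.json().get("status"))
        except requests.exceptions.HTTPError as e:
            logging.error("Sync failed: %s, Response: %s", e, e.response.text)
            raise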
  • Reminder: Validate AD group mappings weekly.
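
The two alert thresholds in the checklist above (sync success rate below 95%, user/group count deviation above 5%) can be evaluated with a few lines of arithmetic. This is a sketch only: it assumes the counts have already been extracted from the audit and user endpoints, and the function names are hypothetical.

```python
def sync_success_rate(succeeded, attempted):
    """Fraction of sync operations that succeeded (1.0 when nothing ran)."""
    if attempted == 0:
        return 1.0
    return succeeded / attempted


def count_deviation(previous, current):
    """Relative change in user or group count between two sync runs."""
    if previous == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - previous) / previous


def sync_alerts(succeeded, attempted, previous_users, current_users):
    """Return the alert messages raised by the SOP thresholds."""
    alerts = []
    if sync_success_rate(succeeded, attempted) < 0.95:
        alerts.append("sync success rate below 95%")
    if count_deviation(previous_users, current_users) > 0.05:
        alerts.append("user count changed by more than 5%")
    return alerts


if __name__ == "__main__":
    print(sync_alerts(94, 100, 1000, 1060))
```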

Birthright & Privileged Access

  • Tasks: Configure and audit default/admin access.
  • Incident Learning: Stale birthright policies overexpose data; excessive admin privileges increase risks.
  • Checklist:
    • Review birthright policies quarterly via /api/v2/permissions.
    • Log privileged actions in UAM (audit via /api/v2/audit).
    • Implement just-in-time (JIT) access for admins using AD groups.
  • Best Practice: Automate birthright assignments with /api/v2/user/permissions.
  • Reminder: Document privileged roles in ADO.

2.3. Policy Management & Governance

Policy Creation and Optimization

  • Tasks: Build scalable ABAC policies and monitor performance.
  • Incident Learning: Complex policies (>5 conditions) slow Snowflake queries, causing timeouts.
  • Checklist:
    • Write plain-English policies (e.g., “Mask SSN for non-Compliance users”).
    • Test policies in sandboxes with /api/v2/policy/test.
    • Monitor Snowflake Query Profile for Immuta view/UDF bottlenecks (>10% query time).
    • Limit policy conditions to <5 for performance.
  • Best Practice: Use Policy-as-Code with Git/CI-CD pipelines (e.g., GitHub Actions, Azure DevOps).
  • Reminder: Stage users (/api/v2/user/status) before policy changes.

Policy Auditing and Validation

  • Tasks: Ensure policy accuracy and compliance.
  • Incident Learning: Misconfigured policies cause data leaks or over-restrictions.
  • Checklist:
    • Automate policy testing with scripts validating access for sample users.
    • Conduct impact analysis (affected users/tables) before deployment.
    • Assign policy owners for accountability.
  • Best Practice: Use /api/v2/policy/version for rollback; integrate with SIEM for audit logging.
  • Reminder: Log policy changes in UAM.
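
Automated policy testing against sample users can be framed as comparing an expected access matrix with what the platform actually grants. The sketch below assumes a caller-supplied `check_access(user, table)` function (hypothetical; in practice it might wrap a test query run as that persona) and reports every mismatch.

```python
def validate_access(expectations, check_access):
    """Compare expected access with actual access.

    expectations: iterable of (user, table, expected_bool) tuples.
    check_access: callable(user, table) -> bool, supplied by the caller.
    Returns a list of (user, table, expected, actual) for every mismatch.
    """
    failures = []
    for user, table, expected in expectations:
        actual = check_access(user, table)
        if actual != expected:
            failures.append((user, table, expected, actual))
    return failures


if __name__ == "__main__":
    # Hypothetical personas and a stand-in for the real access check.
    granted = {("compliance_analyst", "hr.salaries")}
    matrix = [
        ("compliance_analyst", "hr.salaries", True),
        ("marketing_analyst", "hr.salaries", False),
    ]
    print(validate_access(matrix, lambda u, t: (u, t) in granted))
```

An empty result means every sample user saw exactly the access the policy owner expected, including the negative cases.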

2.4. Sensitive Data Discovery (SDD) and Tagging

SDD Scans and Tag Accuracy

  • Tasks: Automate and validate sensitive data tagging.
  • Incident Learning: False positives/negatives cause compliance risks; heavy scans spike Snowflake costs.
  • Checklist:
    • Run incremental SDD scans weekly via /api/v2/sdd/scan for new data.
    • Validate tags with data owners using /api/v2/tag.
    • Exclude non-sensitive schemas to reduce load.
    • Document tag dispute resolution process.
  • Best Practice: Create custom classifiers for domain-specific data and test in sandboxes.
  • Reminder: Audit tags monthly for compliance.

Tag Migration Across Environments

  • Tasks: Automate tag migration (e.g., Env A to Env B).
  • Incident Learning: Missing id fields or duplicate tags cause migration failures.
  • Checklist:
    • Export tags with /api/v2/tag, strip id fields, preserve hierarchy (e.g., tagA.tagB).
    • Validate tag uniqueness in target environment before import.
    • Test migrations in staging.
  • Best Practice: Use Python script with error handling:
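    import logging

    import requests

    def migrate_tags(source_url, target_url, headers, target_headers=None):
        """Copy tags from source to target, dropping source-specific id fields."""
        # The target environment may need its own token; default to the shared headers.
        target_headers = target_headers or headers
        try:
            source = requests.get(f"{source_url}/api/v2/tag", headers=headers, timeout=30)
            source.raise_for_status()  # fail fast instead of parsing an error body as tags
            cleaned_tags = [{k: v for k, v in tag.items() if k != "id"} for tag in source.json()]
            for tag in cleaned_tags:
                response = requests.post(f"{target_url}/api/v2/tag", headers=target_headers, json=tag, timeout=30)
                response.raise_for_status()
            logging.info("Tag migration completed: %d tags", len(cleaned_tags))
        except requests.exceptions.HTTPError as e:
            logging.error("Migration failed: %s, Response: %s", e, e.response.text)
            raise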
  • Reminder: Document API quirks (e.g., inconsistent id fields) in ADO doc.
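
The checklist steps above (strip id fields, preserve hierarchy, validate uniqueness in the target) can be done as a pure pre-processing step before any POST is sent. This sketch assumes each exported tag is a dict with a dotted `name` such as `tagA.tagB`; that shape is an assumption, so check it against a real export first.

```python
def prepare_tags(source_tags, existing_target_names):
    """Return tags that are safe to import, parents before children.

    Drops the id field, skips names already present in the target,
    and skips duplicates within the export itself.
    """
    seen = set(existing_target_names)
    prepared = []
    # Fewer dots means higher in the hierarchy, so parents sort first.
    for tag in sorted(source_tags, key=lambda t: t["name"].count(".")):
        name = tag["name"]
        if name in seen:
            continue
        seen.add(name)
        prepared.append({k: v for k, v in tag.items() if k != "id"})
    return prepared


if __name__ == "__main__":
    exported = [
        {"id": 7, "name": "PII.Email"},
        {"id": 3, "name": "PII"},
        {"id": 9, "name": "Finance"},
    ]
    print(prepare_tags(exported, existing_target_names={"Finance"}))
```

Running this before the import loop keeps duplicate-tag errors out of the migration and makes the list of tags to be created reviewable in advance.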

2.5. Data Marketplace Operations

  • Tasks: Maintain high-quality data products for user access.
  • Incident Learning: Poor metadata increases support tickets; manual approvals create bottlenecks.
  • Checklist:
    • Audit data product metadata quarterly via /api/v2/dataSource.
    • Enable auto-approvals for pre-governed datasets.
    • Monitor usage via /api/v2/audit (alert on >10 access requests/hour).
    • Define SLIs: <5 access requests/hour, >90% user satisfaction (via monthly surveys).
  • Best Practice: Automate data product creation with /api/v2/dataSource and clear metadata.
  • Reminder: Collect user feedback monthly via surveys or Slack.

2.6. Monitoring, Alerting, and Observability

  • Tasks: Set up proactive monitoring for Immuta and Snowflake.
  • Incident Learning: Vague alerts delay detection (high MTTD); missing metrics prolong diagnosis (high MTTI).
  • Checklist:
    • Alert on:
      • Policy change failures (/api/v2/audit, >1 failure/hour).
      • High denied access attempts (>10/user/hour).
      • Anomalous data access (>100x typical query volume).
      • Snowflake query failures from Immuta views (>5/hour).
      • Immuta resource limits (disk >90%, license usage >95%).
    • Define SLIs: API uptime (99.9%), query latency (<5s), sync success rate (>95%).
    • Create dashboards correlating Immuta API, Snowflake queries, and AD syncs.
  • Best Practice: Integrate UAM with SIEM (e.g., Datadog) for real-time alerts.
  • Reminder: Tune alerts weekly to minimize false positives.
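
The numeric alert thresholds listed above can be kept in one table and evaluated uniformly, which makes weekly tuning a one-line change. This is a sketch: the metric names are hypothetical labels, and the values are assumed to arrive from whatever collector feeds the dashboards.

```python
# Thresholds taken from the checklist; the metric names are illustrative.
THRESHOLDS = {
    "policy_change_failures_per_hour": 1,
    "denied_access_per_user_per_hour": 10,
    "access_volume_ratio_vs_typical": 100,
    "immuta_view_query_failures_per_hour": 5,
    "disk_used_pct": 90,
    "license_used_pct": 95,
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    print(breached({"disk_used_pct": 93, "license_used_pct": 95}))
```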

3. Managing Changes

3.1. Pre-Change Procedures

Sandbox Testing

  • Tasks: Test changes in a production-like staging environment.
  • Incident Learning: Untested changes cause outages or access issues.
  • Checklist:
    • Use anonymized data and user personas for testing.
    • Test negative scenarios (e.g., unauthorized access).
    • Validate API changes (e.g., /api/v2/policy) in sandboxes.
  • Best Practice: Automate testing in CI/CD pipelines (e.g., GitHub Actions, Azure DevOps).
  • Reminder: Log test results in ADO doc.

User Staging Protocol

  • Tasks: Stage users before policy/attribute changes.
  • Incident Learning: Active users during changes trigger query failures or lockouts.
  • Checklist:
    • Stage users via /api/v2/user/status in bulk.
    • Verify status post-change to avoid lockouts.
  • Best Practice: Automate staging:
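    import logging
    import os

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def stage_users(headers, user_ids):
        """Set each user to 'staged'; stops at the first HTTP error."""
        for user_id in user_ids:
            try:
                response = requests.put(f"{IMMUTA_URL}/api/v2/user/{user_id}/status", headers=headers, json={"status": "staged"}, timeout=30)
                response.raise_for_status()
                logging.info("Staged user %s", user_id)
            except requests.exceptions.HTTPError as e:
                logging.error("Staging failed for user %s: %s, Response: %s", user_id, e, e.response.text)
                raise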
  • Reminder: Schedule changes for off-peak hours (2 AM NZST).
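
For bulk staging, it can help to record a per-user outcome instead of aborting on the first error, so the post-change verification step knows exactly which users still need attention. A minimal sketch, assuming a caller-supplied `stage_one(user_id)` callable (hypothetical; it could wrap a single status-update request):

```python
def stage_in_bulk(user_ids, stage_one):
    """Stage every user, collecting failures instead of stopping early.

    stage_one: callable(user_id) that raises on failure.
    Returns (staged_ids, failed) where failed maps user_id -> error text.
    """
    staged, failed = [], {}
    for user_id in user_ids:
        try:
            stage_one(user_id)
            staged.append(user_id)
        except Exception as exc:  # broad on purpose: record and continue
            failed[user_id] = str(exc)
    return staged, failed


if __name__ == "__main__":
    def fake_stage(user_id):
        # Stand-in for the real API call; user 2 simulates a failure.
        if user_id == 2:
            raise RuntimeError("409 Conflict")

    print(stage_in_bulk([1, 2, 3], fake_stage))
```

Whether to continue past a failure or stop immediately depends on the change; for lockout-sensitive changes, stopping early may be the safer default.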

Risk Assessment and Review

  • Tasks: Evaluate impact and require peer reviews.
  • Incident Learning: Unreviewed changes increase error risk.
  • Checklist:
    • Assess blast radius (e.g., users/tables affected).
    • Mandate peer reviews for API-driven changes.
    • Categorize changes: minor (1-hour approval), major (2-day review), emergency (escalation path).
  • Best Practice: Use Git for Policy-as-Code with automated linting.
  • Reminder: Log reviews in a change management system.

3.2. Change Implementation

Controlled Rollout

  • Tasks: Deploy incrementally with versioning.
  • Incident Learning: Big-bang deployments amplify failure impact.
  • Checklist:
    • Use /api/v2/policy/version for all changes.
    • Test high-risk changes on small user groups (canary testing).
    • Log deployment steps in UAM.
  • Best Practice: Automate rollouts via CI/CD with GitHub Actions or Azure DevOps.
  • Reminder: Define rollback triggers (e.g., >5% query failures).
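
A rollback trigger such as the 5% query-failure example above is easiest to enforce when it is expressed as code that the rollout script checks after each canary step. This is a sketch under the assumption that failed and total query counts for the canary window are already available; the function name is illustrative.

```python
def should_roll_back(failed_queries, total_queries, max_failure_rate=0.05):
    """True when the observed failure rate exceeds the rollback trigger."""
    if total_queries == 0:
        return False  # no traffic yet: nothing to judge
    return failed_queries / total_queries > max_failure_rate


if __name__ == "__main__":
    # Hypothetical canary window: 6 failures out of 100 queries.
    print(should_roll_back(6, 100))
```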

Rollback Plan

  • Tasks: Prepare tested rollback procedures.
  • Incident Learning: Untested rollbacks delay recovery (high MTTR).
  • Checklist:
    • Document rollback steps (e.g., /api/v2/policy/rollback).
    • Test rollbacks in staging.
  • Best Practice: Automate rollback scripts for critical changes.
  • Reminder: Validate rollback success post-deployment.

3.3. Post-Change Procedures

Intensive Monitoring

  • Tasks: Monitor system behavior to confirm stability.
  • Incident Learning: Most incidents occur post-change due to undetected errors.
  • Checklist:
    • Monitor API latency, Snowflake query performance, and denied access rates.
    • Validate changes with test queries (e.g., /api/v2/dataSource/test).
    • Set 24-hour bake time for stability.
  • Best Practice: Use automated validation scripts for change outcomes.
  • Reminder: Document observations in ADO doc.

4. Incident Management & Postmortem Learnings

4.1. Incident Response Playbooks

  • Tasks: Maintain playbooks for common incidents.
  • Examples:
    • Policy Lockdown: Roll back via /api/v2/policy/rollback, stage users, notify data owners.
      • Detection: >10 denied accesses/min (/api/v2/audit).
      • Triage: Check policy logs, revert changes.
      • Escalation: SRE lead within 10 minutes.
    • SDD Over-Tagging: Pause scans (/api/v2/sdd/pause), validate tags, update classifiers.
      • Detection: >20% tags flagged as incorrect.
      • Triage: Review scan logs, consult data owners.
      • Escalation: Data governance lead within 30 minutes.
    • Immuta Service Failure: Check /api/v2/health, restart services, escalate to support.
      • Detection: API errors >5/min.
      • Triage: Check logs (/var/log/immuta), restart services.
      • Escalation: Immuta support within 15 minutes.
    • Data Leak: Isolate data source (/api/v2/dataSource/disable), audit logs, notify compliance.
      • Detection: Anomalous access (>100x typical volume).
      • Triage: Review UAM logs, isolate source.
      • Escalation: Compliance team within 5 minutes.
  • Checklist:
    • Define detection, triage, escalation, and communication (Slack template).
    • Conduct quarterly tabletop exercises.
    • Run proactive checks (e.g., pre-deployment policy validation scripts).
  • Best Practice: Automate triage with scripts checking /api/v2/health and /api/v2/audit.
  • Reminder: Update playbooks post-incident.
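
The detection thresholds and escalation windows in the playbooks above can be encoded as data so that an automated triage script selects the matching playbook. The sketch below uses the figures from this section; the metric names are hypothetical and would need to map onto real monitoring outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    name: str
    metric: str            # hypothetical metric label
    threshold: float       # playbook triggers when the metric exceeds this
    escalate_to: str
    escalate_within_min: int


PLAYBOOKS = [
    Playbook("Policy Lockdown", "denied_access_per_min", 10, "SRE lead", 10),
    Playbook("SDD Over-Tagging", "tags_flagged_incorrect_pct", 20, "Data governance lead", 30),
    Playbook("Immuta Service Failure", "api_errors_per_min", 5, "Immuta support", 15),
    Playbook("Data Leak", "access_volume_ratio_vs_typical", 100, "Compliance team", 5),
]


def triggered_playbooks(metrics):
    """Return triggered playbooks, most urgent escalation first."""
    hits = [p for p in PLAYBOOKS if metrics.get(p.metric, 0) > p.threshold]
    return sorted(hits, key=lambda p: p.escalate_within_min)


if __name__ == "__main__":
    for p in triggered_playbooks({"denied_access_per_min": 12, "api_errors_per_min": 7}):
        print(f"{p.name}: escalate to {p.escalate_to} within {p.escalate_within_min} min")
```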

4.2. Blameless Postmortems

  • Tasks: Analyze incidents/near-misses for root causes.
  • Checklist:
    • Document timeline, impact, root causes (5 Whys), and actions.
    • Assign owners/deadlines for corrective measures.
    • Share learnings in ADO doc.
  • Best Practice: Use structured postmortem templates for consistency.
  • Reminder: Review postmortems quarterly.

4.3. Integrating Learnings

  • Tasks: Update SOPs with incident findings.
  • Incident Learning: Unincorporated learnings lead to repeat incidents.
  • Checklist:
    • Maintain “Known Issues” section for API quirks (e.g., tag id issues).
    • Review SOP quarterly for new learnings.
  • Best Practice: Version-control SOP in Git for traceability.
  • Reminder: Share learnings with governance teams.

4.4. Emergency Break-Glass Procedures

  • Tasks: Define restricted emergency access.
  • Incident Learning: Missing procedures delay recovery in outages.
  • Checklist:
    • Restrict break-glass accounts with multi-factor approval.
    • Log actions in UAM.
    • Test procedures in staging.
  • Best Practice: Store credentials in Azure Key Vault.
  • Reminder: Review access annually.

5. Critical Reminders & Best Practices

API Automation

  • Use /api/v2 for all tasks, with robust error handling:
    import logging
    import os
    import time

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def call_api(endpoint, headers, payload=None, method="GET", retries=3):
        """Call an Immuta endpoint, retrying 429 responses with exponential backoff."""
        for attempt in range(retries):
            try:
                response = requests.request(method, f"{IMMUTA_URL}{endpoint}", headers=headers, json=payload, timeout=30)
                response.raise_for_status()
                return response.json()
            except requests.exceptions.HTTPError as e:
                logging.error("API error: %s, Status: %s, Response: %s", e, response.status_code, response.text)
                if response.status_code == 429 and attempt < retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
                    continue
                raise
  • Deploy scripts on a build agent / tooling server (Python/FastAPI) to bypass MMD restrictions.
  • Document quirks (e.g., missing id fields) in ADO doc.

Snowflake Optimization

  • Schedule syncs (/api/v2/dataSource/sync) for off-peak hours (2 AM NZST).
  • Monitor credit usage biweekly; use transient tables to reduce costs.

Security Hardening

  • Enforce least-privilege for Immuta’s Snowflake user and AD groups.
  • Manage encryption keys and secrets through Azure Key Vault integration.

User Training

  • Train users on Data Marketplace navigation and policy interpretation via quarterly sessions.
  • Provide Slack/ticketing support channels.

Documentation

  • Maintain ADO doc for configurations, API quirks, and incident learnings.
  • Update SOP quarterly in Git.

Common Pitfalls

  • Tag Migration Failures: Duplicate tags or missing id fields cause errors. Validate uniqueness and strip id fields.
  • SDD Performance: Over-scanning spikes Snowflake costs. Use incremental scans and exclude non-sensitive schemas.
  • API Rate Limits: Undocumented 429 errors disrupt automation. Implement exponential backoff.

6. API Automation Toolkit

6.1. Overview

This toolkit provides reusable scripts for automating Immuta operations in an MMD environment with AD authentication and Snowflake integration, addressing API documentation gaps.

6.2. Common Automation Scripts

User Staging

  • Purpose: Stage users before policy updates.
  • Script: See Section 3.1 (User Staging Protocol).
  • Best Practice: Run during off-peak hours; validate status post-staging.

Tag Migration

  • Purpose: Migrate tags between environments.
  • Script: See Section 2.4 (Tag Migration Across Environments).
  • Best Practice: Test in staging; validate tag uniqueness.

Policy Testing

  • Purpose: Validate policy logic before deployment.
  • Script:
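    import logging
    import os

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def test_policy(headers, policy_id, test_data):
        """Run a policy test and return the reported result."""
        try:
            response = requests.post(f"{IMMUTA_URL}/api/v2/policy/{policy_id}/test", headers=headers, json=test_data, timeout=30)
            response.raise_for_status()
            result = response.json().get("result")
            # A 2xx status only means the test ran; the caller should inspect the result.
            logging.info("Policy test result: %s", result)
            return result
        except requests.exceptions.HTTPError as e:
            logging.error("Policy test failed: %s, Response: %s", e, e.response.text)
            raise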
  • Best Practice: Use representative user personas and datasets.

User Sync

  • Purpose: Sync AD users with Immuta.
  • Script: See Section 2.2 (User Synchronization with AD).
  • Best Practice: Implement retry logic for transient errors.
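
Retry logic for transient errors can be kept separate from the sync call itself. The following is a generic sketch, not an Immuta-specific mechanism: it retries a zero-argument callable with exponential backoff and takes the sleep function as a parameter so it can be tested without waiting. When wrapping `requests` calls, pass `transient=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, since those do not derive from the built-in exceptions used as defaults here.

```python
import time


def with_retries(func, retries=3, base_delay=1.0,
                 transient=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on transient errors with exponential backoff.

    Re-raises the last error once the retry budget is used up.
    """
    for attempt in range(retries):
        try:
            return func()
        except transient:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_sync():
        # Stand-in for a sync call that fails twice before succeeding.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("temporary network error")
        return "synced"

    print(with_retries(flaky_sync, sleep=lambda s: None))
```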

SDD Scan

  • Purpose: Run incremental SDD scans.
  • Script:
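    import logging
    import os

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def run_sdd_scan(headers, data_source_id):
        """Start an incremental SDD scan for one data source."""
        try:
            response = requests.post(f"{IMMUTA_URL}/api/v2/sdd/scan", headers=headers, json={"dataSourceId": data_source_id, "incremental": True}, timeout=30)
            response.raise_for_status()
            logging.info("SDD scan started for %s", data_source_id)
        except requests.exceptions.HTTPError as e:
            logging.error("SDD scan failed: %s, Response: %s", e, e.response.text)
            raise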
  • Best Practice: Schedule weekly scans for new data.

Data Product Creation

  • Purpose: Create Data Marketplace products.
  • Script:
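    import logging
    import os

    import requests

    # Base URL of the Immuta instance; the variable name and default are placeholders.
    IMMUTA_URL = os.environ.get("IMMUTA_URL", "https://immuta.example.com")

    def create_data_product(headers, data_source_config):
        """Create a data product from a config dict that includes a name."""
        try:
            response = requests.post(f"{IMMUTA_URL}/api/v2/dataSource", headers=headers, json=data_source_config, timeout=30)
            response.raise_for_status()
            logging.info("Data product created: %s", data_source_config.get("name"))
        except requests.exceptions.HTTPError as e:
            logging.error("Data product creation failed: %s, Response: %s", e, e.response.text)
            raise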
  • Best Practice: Include clear metadata (e.g., description, owner).

6.3. Handling API Documentation Gaps

  • Strategy:
    • Test API calls in a sandbox to identify undocumented behaviors.
    • Use JSON schema validation:
      from jsonschema import validate
      schema = {"type": "object", "required": ["name"], "properties": {"name": {"type": "string"}}}
      def validate_response(response_data):
          validate(instance=response_data, schema=schema)
    • Document quirks in ADO doc (e.g., tag id inconsistencies).
  • Best Practice: Cache API responses to handle rate limits (429 errors).
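
Caching read-only responses for a short time reduces the number of calls that can hit a rate limit. The sketch below is a small in-memory time-to-live cache; the class name, the 60-second default, and the injected clock are illustrative choices rather than anything prescribed by Immuta.

```python
import time


class TTLCache:
    """Keep fetched values for ttl_s seconds, then fetch again."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() when stale or missing."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fetch()
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_s=300)
    # The lambda stands in for a real GET against a read-only endpoint.
    tags = cache.get_or_fetch("/api/v2/tag", lambda: [{"name": "PII"}])
    print(tags)
```

Only idempotent GET responses should be cached this way; write operations and health checks need fresh results.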

6.4. Tooling Server Setup for MMD

  • Setup:
    • Deploy a Python/FastAPI server on-premises or Azure.
    • Configure AD authentication with OAuth tokens in Azure Key Vault.
    • Example FastAPI endpoint:
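      from fastapi import FastAPI

      app = FastAPI()

      @app.post("/stage-users")
      def stage_users_endpoint(user_ids: list):
          # Plain `def` lets FastAPI run the blocking requests calls in its threadpool.
          # get_token_from_key_vault() is a placeholder; stage_users() is defined in Section 3.1.
          headers = {"Authorization": f"Bearer {get_token_from_key_vault()}"}
          stage_users(headers, user_ids)
          return {"status": "success"}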
  • Best Practice: Restrict server access to authorized SREs via AD groups.
  • Reminder: Test server connectivity with Immuta and Snowflake.

6.5. CI/CD Integration

  • Purpose: Automate Immuta policy deployment, user staging, and tag migration using CI/CD pipelines to ensure consistent, error-free changes.
  • Pipeline Examples:
    • GitHub Actions:
      name: Immuta Policy Deployment
      on:
        push:
          branches: [main]
      jobs:
        deploy:
          runs-on: ubuntu-latest
          steps:
            - uses: actions/checkout@v3
            - name: Run Policy Tests
              run: python test_policy.py
            - name: Deploy Policy
              run: python deploy_policy.py
              env:
                IMMUTA_TOKEN: ${{ secrets.IMMUTA_TOKEN }}
    • Azure DevOps Pipeline:
      trigger:
        branches:
          include:
            - main
      pool:
        vmImage: 'ubuntu-latest'
      steps:
        - checkout: self
        - task: UsePythonVersion@0
          inputs:
            versionSpec: '3.x'
        - script: |
            python test_policy.py
          displayName: 'Run Policy Tests'
        - script: |
            python deploy_policy.py
          displayName: 'Deploy Policy'
          env:
            IMMUTA_TOKEN: $(IMMUTA_TOKEN)
      • Configuration Notes:
        • Store IMMUTA_TOKEN in Azure DevOps secure variables or link to Azure Key Vault for AD-authenticated access.
        • Use a service connection with AD credentials to access the Immuta API in your MMD-restricted environment.
        • Add a requirements.txt step if dependencies are needed:
          - script: |
              pip install -r requirements.txt
            displayName: 'Install Dependencies'
  • Best Practice:
    • Include linting (e.g., flake8 for Python scripts) and peer review gates in both pipelines.
    • Use separate staging and production pipelines with approval gates for major changes.
    • Log pipeline outputs to Azure Monitor or ADO doc for traceability.
  • Reminder:
    • Secure secrets in Azure Key Vault (for Azure DevOps) or GitHub Secrets (for GitHub Actions).
    • Test pipelines in a sandbox environment before production deployment.
