Bryan Paget bryanpaget

API-First Architecture for OneLake Integration

TL;DR

Our recommendation: mounting OneLake is not worth the effort. The Medium article shows that it is technically possible on a single Linux VM, but that approach is a poor fit for our Kubernetes environment and would force us to rebuild the same kind of fragile, unsupported abstraction layer that caused our past goofys headaches. The supported API approach gives us better security, stability, and sovereignty natively.
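
To make "the supported API approach" concrete, here is a minimal sketch. It assumes the API in question is OneLake's ADLS Gen2-compatible endpoint, reached through the Azure SDK for Python (`azure-identity` and `azure-storage-file-datalake`). The `LakehouseLocation` helper, its field names, and the example workspace and lakehouse names are hypothetical and only illustrate the idea; this is not our production integration.

```python
from dataclasses import dataclass

# Assumption: OneLake is reached through its ADLS Gen2-compatible DFS endpoint.
ONELAKE_ACCOUNT_URL = "https://onelake.dfs.fabric.microsoft.com"


@dataclass(frozen=True)
class LakehouseLocation:
    """Hypothetical helper that names a lakehouse inside a Fabric workspace."""

    workspace: str  # Fabric workspace name, used as the file system name
    lakehouse: str  # lakehouse item name, without the ".Lakehouse" suffix

    def files_path(self, relative: str = "") -> str:
        """Build the path of a folder under the lakehouse's Files area."""
        base = f"{self.lakehouse}.Lakehouse/Files"
        relative = relative.strip("/")
        return f"{base}/{relative}" if relative else base


def list_files(location: LakehouseLocation, relative: str = "") -> list[str]:
    """List paths under a lakehouse folder through the storage API, with no mount."""
    # Imported lazily so the path helper above works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=ONELAKE_ACCOUNT_URL,
        credential=DefaultAzureCredential(),  # picks up whatever identity the pod has
    )
    file_system = service.get_file_system_client(location.workspace)
    return [p.name for p in file_system.get_paths(path=location.files_path(relative))]
```

From a notebook, a call such as `list_files(LakehouseLocation("example-workspace", "example_lakehouse"), "raw")` would return the paths under that folder. Nothing is mounted, so there is no FUSE layer to keep alive inside the pod.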

Executive Summary

We strongly recommend against mounting Microsoft OneLake as a filesystem in our Kubeflow environment. A documented method exists for mounting it on a Linux VM using BlobFuse, but its design and constraints are a poor fit for our dynamic Kubernetes platform. Adapting it for The Zone would force us to build and maintain a complex, unsupported abstraction layer, **precisely the type of wor

Presentation Outline:

30 Minutes total

Spend 2 Minutes

Introduce yourself and thank them for having you.
Mention that Jose is the current team lead, but that you are happy to give this presentation.
Mention that you gave a presentation in the summer and that those slides are still available.
Fall

flowchart TD
    A["intelligent-data-monitor
    (Cron Job)
    Generates synthetic data
    Stores in its own Git repo"] --> B["Synthetic Data Git Repo"]

    B --> C["anomaly-monitoring-dashboard
    (Cron Job)
    Pulls data from repo

Step-by-Step Guide to Install Spark Operator

1. Add Kubeflow Helm Chart Repository

# Add the Kubeflow Spark Operator Helm chart repo (kept under the alias "kubeflow")
helm repo add kubeflow https://kubeflow.github.io/spark-operator
helm repo update

StatCan Data Sovereignty Strategy

Core Recommendation

StatCan must implement a Canadian-controlled data platform as our primary infrastructure for sensitive data, with Microsoft Fabric used only for specific, non-sensitive applications.

Why This Matters

Epic: Implement Lean Data Virtualization with Spark & Colectica

Section 1: Deploy Spark on Kubernetes

Context:
No Spark backend currently exists. Adding one enables scalable federated queries and integrates with the existing Kubeflow/JupyterLab setup.

Todo:

Install Spark Operator in Kubernetes cluster via Helm.

📢 Finding Common Ground

Let's Collaborate on Our Data Science Environment

Dear Zone Friends,

I want to thank everyone for the passionate discussion about our Kubeflow environment. The diverse perspectives shared have highlighted important considerations and helped us refine our approach.

Acknowledging Different Perspectives

We've heard valuable feedback about:

📢 Help Us Shape Our Data Science Environment!

Dear Zone Friends,

We are optimizing our Kubeflow environment to better meet your needs. To create a truly useful base configuration, we need your input on which packages matter most to your daily work.

Current State and Upcoming Changes

Our environment already includes essential statistical packages (tidyverse, pandas, scikit-learn), enterprise tools (ODBC, Kubernetes), and development environments (VSCode, JupyterLab, RStudio).

---
marp: true
theme: default
size: 16:9
paginate: true
header: ""
footer: Statistics Canada | Statistique Canada
---

	Python version: 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0]
	Testing 63 packages for compatibility with Python 3.13.5 (Offline mode)

	Testing numpy...
	✅ NumPy basic functionality test passed
	Testing pandas...
	✅ Pandas basic functionality test passed
	Testing scipy...
	✅ SciPy basic functionality test passed
	Testing matplotlib...
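
A log like the one above could come from a small harness along the following lines. This is a sketch under assumptions, not the script that produced the output: `check_package`, `run_checks`, and the two smoke tests are hypothetical names, and the status markers are plain ASCII rather than the emoji shown in the log.

```python
import importlib
import sys
from typing import Callable, Dict, Optional

SmokeTest = Callable[[object], None]


def check_package(name: str, smoke_test: Optional[SmokeTest] = None) -> bool:
    """Import one package and optionally run a tiny functional check on it."""
    print(f"Testing {name}...")
    try:
        module = importlib.import_module(name)
        if smoke_test is not None:
            smoke_test(module)  # expected to raise if the package misbehaves
    except Exception as exc:  # ImportError, AssertionError, or a package error
        print(f"FAIL {name}: {exc!r}")
        return False
    print(f"OK {name} basic functionality test passed")
    return True


def run_checks(checks: Dict[str, Optional[SmokeTest]]) -> Dict[str, bool]:
    """Run every check and return a package -> passed mapping."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python version: {sys.version}")
    print(f"Testing {len(checks)} packages for compatibility with Python {version}")
    return {name: check_package(name, test) for name, test in checks.items()}


def json_roundtrip(mod) -> None:
    """Smoke test for the standard-library json module."""
    assert mod.loads(mod.dumps({"a": 1})) == {"a": 1}


def numpy_sum(np) -> None:
    """Smoke test for NumPy: 0 + 1 + 2 + 3 should be 6."""
    assert int(np.arange(4).sum()) == 6
```

Calling `run_checks({"json": json_roundtrip, "numpy": numpy_sum})` prints one "Testing ..." line per package followed by its status. A package that is missing from the image is reported as a failure and does not abort the run, which suits an offline image check.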