This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model. We will be using Argo to manage the workflow, Minio for saving model training info, Tensorboard to visualize the training, and Kubeflow to deploy the Tensorflow operator and serve the model.
Before we get started, there are a few requirements.
Azure Container Service (AKS) manages your hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise. It also eliminates the burden of ongoing operations and maintenance by provisioning, upgrading, and scaling resources on demand, without taking your applications offline.
We will use the official CLI to interact with Azure. For more information on how to install it, you can go through the official documentation.
- Make sure you are authenticated into Azure from the CLI
# Start the authentication process
az login
This command will prompt you to authenticate. You can also select which subscription you want to deploy your cluster into using the az account
command, as shown below. You can find more information in the official documentation.
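For example, you can list the subscriptions linked to your account and pick the one to deploy into. This is a minimal sketch; "My Subscription" is a placeholder for your own subscription name or ID:
# List the subscriptions available to your account
az account list -o table
# Select the subscription to deploy into
az account set --subscription "My Subscription"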
- Define some variables to set up your environment
# Location where to deploy resources
export AZ_LOCATION=eastus
export AZ_RG=KubeflowExample
export AZ_AKS_NAME=$AZ_RG
export AZ_AKS_LOCATION=$AZ_LOCATION
export AZ_AKS_VERSION=1.9.1
export AZ_AKS_SKU=Standard_NC6
export AZ_AKS_NODE_NUMBER=3
export AZ_ACR_NAME=${AZ_AKS_NAME}acr
Feel free to modify and enter your own values here. We will use and reference them throughout this tutorial.
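If you change AZ_AKS_SKU or AZ_LOCATION, you may want to confirm that the chosen VM size is actually offered in that region before deploying. A quick way to check with the Azure CLI (a sketch, not part of the original walkthrough) is:
# List VM sizes offered in the selected region and filter for the NC6 GPU SKU
az vm list-sizes -l $AZ_LOCATION -o table | grep -i nc6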
- Create a resource group and deploy an AKS cluster into it
# Create a resource group
az group create -n $AZ_RG -l $AZ_LOCATION
# Deploy the cluster
az aks create -n $AZ_AKS_NAME -g $AZ_RG -l $AZ_AKS_LOCATION -k $AZ_AKS_VERSION -c $AZ_AKS_NODE_NUMBER -s $AZ_AKS_SKU
These two commands create a resource group (az group create)
and deploy an AKS cluster into it (az aks create).
You can find more options for the az aks
command in the official documentation.
Following this setup, your cluster will run 3 GPU nodes on Kubernetes version 1.9.1.
Note: AKS doesn't support RBAC yet, so when deploying third-party components such as Kubeflow or Argo, make sure to follow the instructions that apply to clusters without RBAC.
After a few minutes, the CLI will print a JSON payload with information about your cluster. This means your cluster is deployed and ready to be used.
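If you want to double-check the provisioning state yourself, a simple way (assuming the variables defined above) is:
# Show the cluster and its provisioning state
az aks show -n $AZ_AKS_NAME -g $AZ_RG -o table
az aks show -n $AZ_AKS_NAME -g $AZ_RG --query provisioningState -o tsv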
- kubectl
- ksonnet
- argo
- minio
You can install the kubectl
client either by following the official documentation or from the Azure CLI using the az aks install-cli
command.
To authenticate against your cluster, run the following command:
$ az aks get-credentials -g $AZ_RG -n $AZ_AKS_NAME
Merged "KubeflowExample" as current context in /Users/justroh/.kube/config
To install Ksonnet, you can follow the official documentation.
To install argo, you can follow the official documentation.
Once Argo is installed locally, you just have to run the argo install
command to deploy the controller into your cluster.
$ argo install
Installing Argo v2.0.0 into namespace 'kube-system'
Proceeding with Kubernetes version 1.9.1
Deployment 'workflow-controller' created
Deployment 'argo-ui' created
Service 'argo-ui' created
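To verify that the controller and the UI are running, you can look at the pods in the kube-system namespace. A simple check (the exact pod names will differ on your cluster):
# The workflow-controller and argo-ui pods should be in a Running state
kubectl get pods -n kube-system | grep -E 'workflow-controller|argo-ui'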
Set your own values (such as the access and secret keys) in the Minio manifest (minio.yaml), then deploy it:
kubectl create -f minio.yaml
Check that the Minio service is running and note its external IP, as shown below.
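A minimal sketch of that check, assuming the Service defined in minio.yaml is named minio and is exposed as a LoadBalancer on port 9000; the external IP it reports is the value you will later use for S3_ENDPOINT:
# Wait for an external IP to be assigned to the Minio service
kubectl get svc minio
# Extract just the external IP
kubectl get svc minio -o jsonpath='{.status.loadBalancer.ingress[0].ip}'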
The training code (model.py) and the Dockerfile.model used to containerize it both come from the example repository.
We will build our container and push it to a private registry on Azure. To do so, we will use Azure Container Registry (ACR).
- We will create a new ACR resource using the CLI
az acr create -n $AZ_ACR_NAME -g $AZ_RG --sku Basic --admin-enabled -l $AZ_LOCATION
You can find more information about the ACR CLI commands in the official documentation.
- We will build the container locally first and then push it to our private registry on Azure.
# Build the image locally
DOCKER_BASE_URL=$AZ_ACR_NAME.azurecr.io/kubeflowdemo
docker build . --no-cache -f Dockerfile.model -t ${DOCKER_BASE_URL}/mytfmodel:1.0
# Fetch the generated password of the private registry
AZ_ACR_PASSWORD=$(az acr credential show -n $AZ_ACR_NAME -g $AZ_RG -o tsv --query 'passwords[0].value')
AZ_ACR_FQDN=$AZ_ACR_NAME.azurecr.io
docker login $AZ_ACR_FQDN -u $AZ_ACR_NAME -p $AZ_ACR_PASSWORD
# Push the local image to the private registry
docker push ${DOCKER_BASE_URL}/mytfmodel:1.0
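To confirm the push succeeded, you can list the repositories and tags now present in your registry. For example:
# List repositories in the ACR instance
az acr repository list -n $AZ_ACR_NAME -o table
# List the tags of the pushed image
az acr repository show-tags -n $AZ_ACR_NAME --repository kubeflowdemo/mytfmodel -o table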
- We have to store the ACR credentials in our Kubernetes cluster so that it can pull images from the private registry, as sketched below.
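A minimal sketch of how this can be done with an image pull secret; the secret name acr-creds and the email value are placeholders, and you would then reference the secret from the pod specs (or the default service account) that pull this image:
# Create a docker-registry secret holding the ACR credentials
kubectl create secret docker-registry acr-creds \
  --docker-server=$AZ_ACR_FQDN \
  --docker-username=$AZ_ACR_NAME \
  --docker-password=$AZ_ACR_PASSWORD \
  --docker-email=you@example.com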
With our data and workloads ready, now the cluster must be prepared. We will be deploying the TF Operator, and Argo to help manage our training job.
In the following instructions we will install our required components to a single namespace. For these instructions we will assume the chosen namespace is tfworkflow:
We are using the Tensorflow operator to automate the deployment of our distributed model training, and Argo to create the overall training pipeline. The easiest way to install these components on your Kubernetes cluster is by using Kubeflow's ksonnet prototypes.
NAMESPACE=tfworkflow
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core@1a6fc9d0e19e456b784ba1c23c03ec47648819d0
ks pkg install kubeflow/argo@8d617d68b707d52a5906d38b235e04e540f2fcf7
# Deploy TF Operator and Argo
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks generate argo kubeflow-argo --name=kubeflow-argo --namespace=${NAMESPACE}
ks apply default -c kubeflow-core
ks apply default -c kubeflow-argo
# Switch context for the rest of the example
kubectl config set-context $(kubectl config current-context) --namespace=${NAMESPACE}
cd -
You can check to make sure the components have deployed:
$ kubectl get pods -l name=tf-job-operator
NAME READY STATUS RESTARTS AGE
tf-job-operator-78757955b-2glvj 1/1 Running 0 1m
$ kubectl get pods -l app=workflow-controller
NAME READY STATUS RESTARTS AGE
workflow-controller-7d8f4bc5df-4zltg 1/1 Running 0 1m
$ kubectl get crd
NAME AGE
tfjobs.kubeflow.org 1m
workflows.argoproj.io 1m
$ argo list
NAME STATUS AGE DURATION
For fetching and uploading data, our workflow requires S3 credentials. These credentials will be provided as Kubernetes secrets:
#export S3_ENDPOINT=<IP of the minio service>
export S3_ENDPOINT=52.226.74.27:9000
export MINIO_ENDPOINT_URL=http://${S3_ENDPOINT}
export MINIO_ACCESS_KEY_ID=kubeflowexample
export MINIO_SECRET_ACCESS_KEY=kubeflowexample
export BUCKET_NAME=kubeflowexample
kubectl create secret generic minio-creds --from-literal=minioAccessKeyID=${MINIO_ACCESS_KEY_ID} \
--from-literal=minioSecretAccessKey=${MINIO_SECRET_ACCESS_KEY}
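You can verify that the secret landed in the right place (the kubectl context was switched to ${NAMESPACE} earlier):
# The secret should list the two minio keys in its data section
kubectl describe secret minio-creds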
This is the bulk of the work; let's walk through what is needed:
- Train the model
- Export the model
- Serve the model
Now let's look at how this is represented in our example workflow.
The Argo workflow can be daunting, but basically our steps above break down as follows:
- get-workflow-info: Generate and set variables for consumption in the rest of the pipeline.
- tensorboard: Tensorboard is spawned, configured to watch the S3 URL for the training output.
- train-model: A TFJob is spawned taking in variables such as the number of workers, the path to the datasets, which model container image to use, etc. The model is exported at the end.
- serve-model: Optionally, the model is served.
With our workflow defined, we can now execute it.
First we need to set a few variables in our workflow. Make sure to set your docker registry or remove the IMAGE parameters in order to use our defaults:
# If you followed the ACR steps above, DOCKER_BASE_URL is already set to ${AZ_ACR_FQDN}/kubeflowdemo; otherwise, set it to your own registry here
DOCKER_BASE_URL=$DOCKER_BASE_URL
export S3_DATA_URL=s3://${BUCKET_NAME}/data/mnist/
export S3_TRAIN_BASE_URL=s3://${BUCKET_NAME}/models
export AWS_REGION=us-west-2
export JOB_NAME=myjob-$(uuidgen | cut -c -5 | tr '[:upper:]' '[:lower:]')
export TF_MODEL_IMAGE=${DOCKER_BASE_URL}/mytfmodel:1.0
export TF_WORKER=3
export MODEL_TRAIN_STEPS=200
Next, submit your workflow.
argo submit model-train.yaml -n ${NAMESPACE} \
-p aws-endpoint-url=${MINIO_ENDPOINT_URL} \
-p s3-endpoint=${S3_ENDPOINT} \
-p aws-region=${AWS_REGION} \
-p tf-model-image=${TF_MODEL_IMAGE} \
-p s3-data-url=${S3_DATA_URL} \
-p s3-train-base-url=${S3_TRAIN_BASE_URL} \
-p job-name=${JOB_NAME} \
-p tf-worker=${TF_WORKER} \
-p model-train-steps=${MODEL_TRAIN_STEPS} \
-p namespace=${NAMESPACE} \
-p s3-use-https=false \
-p s3-verify-ssl=false
Your training workflow should now be executing.
You can verify and keep track of your workflow using the argo commands:
$ argo list
NAME STATUS AGE DURATION
tf-workflow-h7hwh Running 1h 1h
$ argo get tf-workflow-h7hwh
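Beyond the argo CLI, you can also watch the underlying resources directly with kubectl. For example (the pod name below is a placeholder for one of the pods listed by the first two commands):
# List the TFJob created by the workflow
kubectl get tfjobs
# List the pods spawned for the workflow and the training job
kubectl get pods
# Follow the logs of an individual step or worker
kubectl logs -f <pod-name>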