This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model. We will be using Argo to manage the workflow, Minio for saving model training info, Tensorboard to visualize the training, and Kubeflow to deploy the Tensorflow operator and serve the model.
Before we get started, there are a few requirements.
Azure Container Service (AKS) manages your hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise. It also eliminates the burden of ongoing operations and maintenance by provisioning, upgrading, and scaling resources on demand, without taking your applications offline.
We will use the official CLI to interact with Azure. For more information on how to install it, you can go through the official documentation.
- Make sure you are authenticated into Azure from the CLI
# Start the authentication process
az login
This command will prompt you to authenticate. You can also select which subscription you want to deploy your cluster into using the az account
command, as shown below. You can find more information in the official documentation.
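For example, you can list the subscriptions linked to your account and pick the one to deploy into. This is a minimal sketch; "My Subscription" is a placeholder for your own subscription name or ID:
# List the subscriptions available to your account
az account list -o table
# Select the subscription to deploy into
az account set --subscription "My Subscription"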
- Define some variables to set up your environment
# Location where to deploy resources
export AZ_LOCATION=eastus
export AZ_RG=KubeflowExample
export AZ_AKS_NAME=$AZ_RG
export AZ_AKS_LOCATION=$AZ_LOCATION
export AZ_AKS_VERSION=1.9.1
export AZ_AKS_SKU=Standard_NC6
export AZ_AKS_NODE_NUMBER=3
export AZ_ACR_NAME=${AZ_AKS_NAME}acr
Feel free to modify and enter your own values here. We will use and reference them throughout this tutorial.
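If you change AZ_AKS_SKU or AZ_LOCATION, you may want to confirm that the chosen VM size is actually offered in that region before deploying. A quick way to check with the Azure CLI (a sketch, not part of the original walkthrough) is:
# List VM sizes offered in the selected region and filter for the NC6 GPU SKU
az vm list-sizes -l $AZ_LOCATION -o table | grep -i nc6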
- Create a resource group and deploy an AKS cluster into it
# Create a resource group
az group create -n $AZ_RG -l $AZ_LOCATION
# Deploy the cluster
az aks create -n $AZ_AKS_NAME -g $AZ_RG -l $AZ_AKS_LOCATION -k $AZ_AKS_VERSION -c $AZ_AKS_NODE_NUMBER -s $AZ_AKS_SKU
These two commands create a resource group (az group create)
and deploy an AKS cluster into it (az aks create).
You can find more options for the az aks
command in the official documentation.
Following this setup, your cluster will run 3 GPU nodes on Kubernetes version 1.9.1.
Note: AKS doesn't support RBAC yet, so when deploying third-party components such as Kubeflow or Argo, make sure to follow the instructions that apply to clusters without RBAC.
After a few minutes, the CLI will print a JSON payload with information about your cluster. This means your cluster is deployed and ready to be used.
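If you want to double-check the provisioning state yourself, a simple way (assuming the variables defined above) is:
# Show the cluster and its provisioning state
az aks show -n $AZ_AKS_NAME -g $AZ_RG -o table
az aks show -n $AZ_AKS_NAME -g $AZ_RG --query provisioningState -o tsv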
- kubectl
- ksonnet
- argo
- minio
You can install the kubectl
client either by following the official documentation or from the Azure CLI using the az aks install-cli
command.
To authenticate against your cluster, run the following command:
$ az aks get-credentials -g $AZ_RG -n $AZ_AKS_NAME
Merged "KubeflowExample" as current context in /Users/justroh/.kube/config
To install Ksonnet, you can follow the official documentation.
To install argo, you can follow the official documentation.
Once Argo is installed locally, you just have to run the argo install
command to deploy the controller into your cluster.
$ argo install
Installing Argo v2.0.0 into namespace 'kube-system'
Proceeding with Kubernetes version 1.9.1
Deployment 'workflow-controller' created
Deployment 'argo-ui' created
Service 'argo-ui' created
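To verify that the controller and the UI are running, you can look at the pods in the kube-system namespace. A simple check (the exact pod names will differ on your cluster):
# The workflow-controller and argo-ui pods should be in a Running state
kubectl get pods -n kube-system | grep -E 'workflow-controller|argo-ui'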
Set your own values (such as the access and secret keys) in the Minio manifest (minio.yaml), then deploy it:
kubectl create -f minio.yaml
Check that the Minio service is running and note its external IP, as shown below.
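A minimal sketch of that check, assuming the Service defined in minio.yaml is named minio and is exposed as a LoadBalancer on port 9000; the external IP it reports is the value you will later use for S3_ENDPOINT:
# Wait for an external IP to be assigned to the Minio service
kubectl get svc minio
# Extract just the external IP
kubectl get svc minio -o jsonpath='{.status.loadBalancer.ingress[0].ip}'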
The training code (model.py) and the Dockerfile.model used to containerize it both come from the example repository.
We will build our container and push it to a private registry on Azure. To do so, we will use Azure Container Registry (ACR).
- We will create a new ACR resource using the CLI
az acr create -n $AZ_ACR_NAME -g $AZ_RG --sku Basic --admin-enabled -l $AZ_LOCATION
You can find more information about the ACR CLI commands in the official documentation.
- We will build the container locally first and then push it to our private registry on Azure.
# Build the image locally
DOCKER_BASE_URL=$AZ_ACR_NAME.azurecr.io/kubeflowdemo
docker build . --no-cache -f Dockerfile.model -t ${DOCKER_BASE_URL}/mytfmodel:1.0
# Fetch the generated password of the private registry
AZ_ACR_PASSWORD=$(az acr credential show -n $AZ_ACR_NAME -g $AZ_RG -o tsv --query 'passwords[0].value')
AZ_ACR_FQDN=$AZ_ACR_NAME.azurecr.io
docker login $AZ_ACR_FQDN -u $AZ_ACR_NAME -p $AZ_ACR_PASSWORD
# Push the local image to the private registry
docker push ${DOCKER_BASE_URL}/mytfmodel:1.0
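To confirm the push succeeded, you can list the repositories and tags now present in your registry. For example:
# List repositories in the ACR instance
az acr repository list -n $AZ_ACR_NAME -o table
# List the tags of the pushed image
az acr repository show-tags -n $AZ_ACR_NAME --repository kubeflowdemo/mytfmodel -o table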
- We have to store the ACR credentials in our Kubernetes cluster so that it can pull images from the private registry, as sketched below.
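A minimal sketch of how this can be done with an image pull secret; the secret name acr-creds and the email value are placeholders, and you would then reference the secret from the pod specs (or the default service account) that pull this image:
# Create a docker-registry secret holding the ACR credentials
kubectl create secret docker-registry acr-creds \
  --docker-server=$AZ_ACR_FQDN \
  --docker-username=$AZ_ACR_NAME \
  --docker-password=$AZ_ACR_PASSWORD \
  --docker-email=you@example.com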
With our data and workloads ready, now the cluster must be prepared. We will be deploying the TF Operator, and Argo to help manage our training job.
In the following instructions we will install our required components to a single namespace. For these instructions we will assume the chosen namespace is tfworkflow:
We are using the Tensorflow operator to automate the deployment of our distributed model training, and Argo to create the overall training pipeline. The easiest way to install these components on your Kubernetes cluster is by using Kubeflow's ksonnet prototypes.
NAMESPACE=tfworkflow
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core@1a6fc9d0e19e456b784ba1c23c03ec47648819d0
ks pkg install kubeflow/argo@8d617d68b707d52a5906d38b235e04e540f2fcf7
# Deploy TF Operator and Argo
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks generate argo kubeflow-argo --name=kubeflow-argo --namespace=${NAMESPACE}
ks apply default -c kubeflow-core
ks apply default -c kubeflow-argo
# Switch context for the rest of the example
kubectl config set-context $(kubectl config current-context) --namespace=${NAMESPACE}
cd -
You can check to make sure the components have deployed:
$ kubectl get pods -l name=tf-job-operator
NAME READY STATUS RESTARTS AGE
tf-job-operator-78757955b-2glvj 1/1 Running 0 1m
$ kubectl get pods -l app=workflow-controller
NAME READY STATUS RESTARTS AGE
workflow-controller-7d8f4bc5df-4zltg 1/1 Running 0 1m
$ kubectl get crd
NAME AGE
tfjobs.kubeflow.org 1m
workflows.argoproj.io 1m
$ argo list
NAME STATUS AGE DURATION
For fetching and uploading data, our workflow requires S3 credentials. These credentials will be provided as Kubernetes secrets:
#export S3_ENDPOINT=<IP of the minio service>
export S3_ENDPOINT=52.226.74.27:9000
export MINIO_ENDPOINT_URL=http://${S3_ENDPOINT}
export MINIO_ACCESS_KEY_ID=kubeflowexample
export MINIO_SECRET_ACCESS_KEY=kubeflowexample
export BUCKET_NAME=kubeflowexample
kubectl create secret generic minio-creds --from-literal=minioAccessKeyID=${MINIO_ACCESS_KEY_ID} \
--from-literal=minioSecretAccessKey=${MINIO_SECRET_ACCESS_KEY}
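You can verify that the secret landed in the right place (the kubectl context was switched to ${NAMESPACE} earlier):
# The secret should list the two minio keys in its data section
kubectl describe secret minio-creds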
This is the bulk of the work; let's walk through what is needed:
- Train the model
- Export the model
- Serve the model
Now let's look at how this is represented in our example workflow.
The Argo workflow can be daunting, but basically our steps above break down as follows:
- get-workflow-info: Generate and set variables for consumption in the rest of the pipeline.
- tensorboard: Tensorboard is spawned, configured to watch the S3 URL for the training output.
- train-model: A TFJob is spawned taking in variables such as the number of workers, the path to the datasets, which model container image to use, etc. The model is exported at the end.
- serve-model: Optionally, the model is served.
With our workflow defined, we can now execute it.
First we need to set a few variables in our workflow. Make sure to set your docker registry or remove the IMAGE parameters in order to use our defaults:
# If you followed the ACR steps above, DOCKER_BASE_URL is already set to ${AZ_ACR_FQDN}/kubeflowdemo; otherwise, set it to your own registry here
DOCKER_BASE_URL=$DOCKER_BASE_URL
export S3_DATA_URL=s3://${BUCKET_NAME}/data/mnist/
export S3_TRAIN_BASE_URL=s3://${BUCKET_NAME}/models
export AWS_REGION=us-west-2
export JOB_NAME=myjob-$(uuidgen | cut -c -5 | tr '[:upper:]' '[:lower:]')
export TF_MODEL_IMAGE=${DOCKER_BASE_URL}/mytfmodel:1.0
export TF_WORKER=3
export MODEL_TRAIN_STEPS=200
Next, submit your workflow.
argo submit model-train.yaml -n ${NAMESPACE} \
-p aws-endpoint-url=${MINIO_ENDPOINT_URL} \
-p s3-endpoint=${S3_ENDPOINT} \
-p aws-region=${AWS_REGION} \
-p tf-model-image=${TF_MODEL_IMAGE} \
-p s3-data-url=${S3_DATA_URL} \
-p s3-train-base-url=${S3_TRAIN_BASE_URL} \
-p job-name=${JOB_NAME} \
-p tf-worker=${TF_WORKER} \
-p model-train-steps=${MODEL_TRAIN_STEPS} \
-p namespace=${NAMESPACE} \
-p s3-use-https=false \
-p s3-verify-ssl=false
Your training workflow should now be executing.
You can verify and keep track of your workflow using the argo commands:
$ argo list
NAME STATUS AGE DURATION
tf-workflow-h7hwh Running 1h 1h
$ argo get tf-workflow-h7hwh
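Beyond the argo CLI, you can also watch the underlying resources directly with kubectl. For example (the pod name below is a placeholder for one of the pods listed by the first two commands):
# List the TFJob created by the workflow
kubectl get tfjobs
# List the pods spawned for the workflow and the training job
kubectl get pods
# Follow the logs of an individual step or worker
kubectl logs -f <pod-name>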