Streamline your machine learning development processes, enhancing efficiency, reliability, and reproducibility in your projects. Develop a comprehensive understanding of CI/CD workflows and YAML syntax, utilizing GitHub Actions (GHA) for automation, training models in a pipeline, versioning datasets with DVC, and performing hyperparameter tuning.
Essential principles of Continuous Integration/Continuous Delivery (CI/CD) and YAML. Software development life cycle and key terms like build, test, and deploy. Continuous Integration, Continuous Delivery, and Continuous Deployment. Significance of CI/CD in machine learning and experimentation.
- SDLC
  - Overview: a systematic approach covering software development from start to finish
  - Workflow: build, test, deploy
- Continuous Integration (CI): the practice of frequently building, testing, and merging code changes into a shared repository
- Continuous Delivery (CD): ensures that code changes can be deployed to production at any time, but requires manual approval
- Continuous Deployment (CD): automatically deploys code changes to production without manual intervention
- CI/CD in Machine Learning enables
  - Data versioning
  - Building models and model versioning
  - Automating experiments
  - Testing
  - Deployment
- Introduction to YAML
  - A data formatting language similar to JSON and XML
  - Indentation is meaningful; tabs are not allowed
  - Mappings, sequences, and scalars are the building blocks of YAML
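A minimal snippet illustrating the three building blocks (the keys and values here are illustrative, not from the course):

```yaml
model:                  # mapping: key-value pairs
  name: random_forest   # scalar (string)
  max_depth: 2          # scalar (integer)
  features:             # sequence of scalars
    - humidity
    - temperature
```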
- Introduction to GitHub Actions
  - CI/CD platform to automate pipelines
  - A pipeline is a sequence of steps that represents the flow of work and data
- GHA Components
  - Event: a specific activity in a repository that triggers a workflow run
  - Workflow: an automated process that runs one or more jobs
    - Triggered automatically by an event
    - Housed in .github/workflows
  - Job: a set of steps
    - Each job is independent
    - Parallel execution is possible
    - Executed on compute machines called runners
  - Steps: individual units of work
    - Executed in order; each step can depend on the previous one
    - Run on the same machine, so data can be shared
  - Action: a GHA platform-specific application
    - Examples: checkout a repo, comment on a PR
Components of GHA: events, actions, jobs, steps, runners, and contexts. Workflows that activate upon events like push and pull requests, and how to tailor runner machines. CI pipelines and the intricacies of the GHA log.
- Intermediate YAML
  - Multiline strings: block scalar format
    - Literal style ( | ) preserves line breaks and indentation
    - Folded style ( > ) folds line breaks into spaces
  - Chomping indicators control the behavior of newlines at the end of the string
    - clip (default): keeps a single newline at the end
    - strip ( - ): removes all newlines at the end
    - keep ( + ): retains all trailing newlines
- Dynamic value injection: expressions allow parsers to dynamically substitute values
  - Usage: environment variables and references to other parts of YAML
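For example, in a GitHub Actions workflow an expression injects an environment variable at runtime (the variable name here is illustrative):

```yaml
env:
  MODEL_NAME: random_forest
steps:
  - name: Report model
    run: echo "Training ${{ env.MODEL_NAME }}"
```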
- Multi-document YAML: multiple documents can live in one file, separated by ---
- Setting a basic CI pipeline
- Anatomy of GitHub Actions workflow
name: CI
on:
  push:
    branches: [ "main" ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Run a multi-line script
        run: |
          echo Hello, world!
          echo Add other actions to build,
          echo test, and deploy your project.
- Running repository code
  - Create a feature branch
  - Add repository code
  - Configure workflow event
  - Create PR and trigger workflow
name: CI
on:
  pull_request:
    branches: [ "main" ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Run Python script
        run: |
          echo hello_world.py
          python hello_world.py
- Environment Variables and Secrets
  - Contexts: access information about predefined variables and data
  - Contexts used in this course
    - github: information about the workflow run
    - env: variables set in the workflow
    - secrets: names and values that are available to the workflow
    - job: information about the current job
    - runner: information about the machine
- Variables store non-sensitive information in plain text, e.g. compiler flags, usernames, file paths
  - Global/local scope is controlled by the level where they are defined
  - Accessed from the env context as ${{ env.ENV_VAR }}
- Secrets store sensitive information in encrypted form, e.g. passwords, API keys
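The variable scoping rules above can be sketched in a workflow fragment (all names are illustrative): a workflow-level variable is visible to every job, while a job-level one is local to that job.

```yaml
env:
  GLOBAL_VAR: visible-to-all-jobs
jobs:
  demo:
    runs-on: ubuntu-latest
    env:
      JOB_VAR: visible-only-in-this-job
    steps:
      - run: echo "${{ env.GLOBAL_VAR }} / ${{ env.JOB_VAR }}"
```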
- Setting secrets
  - GITHUB_TOKEN secret: a built-in secret provided by GitHub Actions
    - Used to perform workflow actions
    - Automatically available in every GitHub Actions workflow
    - Permissions can be tuned to the right degree
name: Greeting on variable day
on: pull_request  # trigger omitted in the original notes; pull_request fits a PR comment
# Global env
env:
  Greeting: Hello
# Grant permissions to write comments in PR
permissions:
  pull-requests: write
jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      # Use GITHUB_TOKEN to authorize
      - name: Comment PR
        uses: thollander/actions-comment-pull-request@v2
        with:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          message: |
            Hello world ! :wave:
Integration of machine learning model training into a GitHub Actions pipeline using the Continuous Machine Learning (CML) GitHub Action. Generate a comprehensive markdown report including model metrics and plots. Data versioning in machine learning by adopting Data Version Control (DVC) to track data changes. Setting up DVC remotes and dataset transfers. DVC pipelines and the dvc.yaml file to orchestrate reproducible model training.
- Dataset: Weather Prediction in Australia
- Data preprocessing
  - Convert categorical features to numerical
  - Replace missing values of features
  - Scale features
- Random Forest Classifier with max_depth = 2, n_estimators = 50
- Standard metrics on test data
- Performance plots
  - Confusion matrix plot
- GitHub Actions Workflow
- Continuous Machine Learning (CML)
  - CI/CD tool for Machine Learning
  - GitHub Actions integration
# Enable setup-cml action to be used later
- uses: iterative/setup-cml@v1
- name: Train model
  run: |
    # Your ML workflow goes here
    pip install -r requirements.txt
    python3 train.py
- name: Write CML report
  run: |
    # Add results and plots to markdown
    cat results.txt >> report.md
    echo "![training graph](./graph.png)" >> report.md
    # Create comment from markdown report
    cml comment create report.md
  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- Versioning datasets with Data Version Control
  - Ensures a historical record of data changes
- DVC: Data Version Control tool
  - Manages data and experiments
  - Similar to Git, but data is stored separately
  - Storage options: SSH, HTTP/HTTPS, local file system
  - AWS, GCP, and Azure object storage
> dvc init
Initialized DVC repository.
You can now commit the changes to git.
> dvc add data.csv
> cat data.csv
> cat data.csv.dvc
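The pointer file created by `dvc add` typically looks like this (the hash and size below are illustrative placeholders, not real output):

```yaml
outs:
  - md5: 22a1a2931c8370d3aeedd7183606fd7f  # illustrative content hash
    size: 1024                             # illustrative size in bytes
    path: data.csv
```

The small `.dvc` file is committed to Git, while the data itself goes to the DVC cache or remote.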
- Interacting with DVC remotes
  - Location for data storage
  - Similar to Git remotes, but for cached data
> dvc remote add myAWSremote s3://mybucket
> dvc remote modify myAWSremote connect_timeout 300
> dvc remote add -d mylocalremote /tmp/dvc
> dvc push -r myAWSremote data.csv
> dvc add /path/to/data/datafile
> git commit /path/to/data/datafile.dvc -m "Dataset updates"
> git push origin main
> dvc push
- DVC Pipelines: a sequence of stages defining the ML workflow and its dependencies
  - Versioning data alone is not very useful
  - Run only what's needed
  - Steps form a Directed Acyclic Graph (DAG)
  - Defined in the dvc.yaml file
  - Similar to the GitHub Actions workflow
    - Focused on ML tasks instead of CI/CD
    - Can be abstracted as a step in GHA
> dvc stage add -n preprocess -d raw_data.csv -d preprocess.py -o processed_data.csv python preprocess.py
> dvc stage add -n train -d train.py -d processed_data.csv -o plots.png -o metrics.txt python train.py
> dvc dag
> dvc repro
> git add dvc.lock && git commit -m "first pipeline repro"
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - raw_data.csv
    outs:
      - processed_data.csv
  train:
    cmd: python train.py
    deps:
      - processed_data.csv
      - train.py
    outs:
      - plots.png
      - metrics.txt
Analysis of model performance and the fine-tuning of hyperparameters. Compare metrics and visualizations across different branches to assess changes in model performance. Hyperparameter tuning using scikit-learn's GridSearchCV. Automation of pull requests using the optimal model configuration.
- Configure DVC YAML file to track metrics across experiments
- Querying and comparing DVC metrics
- Change a hyperparameter and rerun dvc repro
- Setting up a DVC GitHub Action
stages:
  preprocess:
    # ...unchanged from the previous dvc.yaml...
  train:
    # ...cmd and deps unchanged...
    outs:
      - confusion_matrix.png
    metrics:
      - metrics.json:
          cache: false
> dvc metrics show
> dvc metrics diff
> dvc plots show predictions.csv
- Hyperparameter tuning route
  - Branch name: hp_tune/<some-string>
  - Make changes to the search configuration
  - Manually open a PR
  - Force-runs the DVC pipeline: dvc repro -f hp_tune
  - Uses cml pr create to create a new training PR with the best parameters
  - Force-push a commit to the training PR to kick off the model training job
- Separate feature branches for training and hyperparameter tuning
- Hyperparameter tuning job kickoff
- Creating a training PR from a hyperparameter run
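The search step itself can be sketched with scikit-learn's GridSearchCV, as named above. This is a hedged illustration: the synthetic data stands in for the preprocessed weather dataset, and the parameter grid is an assumption, not the course's configuration.

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed weather dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Illustrative grid around the course's max_depth=2, n_estimators=50 baseline
param_grid = {"max_depth": [2, 4], "n_estimators": [10, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Persist the winning configuration so a CML step could open a training PR with it
print(json.dumps(search.best_params_))
```

In the workflow above, the printed JSON would be committed to the `hp_tune/...` branch, after which `cml pr create` opens the training PR.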