Streamline your machine learning development processes, enhancing efficiency, reliability, and reproducibility in your projects. Develop a comprehensive understanding of CI/CD workflows and YAML syntax, utilizing GitHub Actions (GHA) for automation, training models in a pipeline, versioning datasets with DVC, and performing hyperparameter tuning.
Essential principles of Continuous Integration/Continuous Delivery (CI/CD) and YAML. Software development life cycle and key terms like build, test, and deploy. Continuous Integration, Continuous Delivery, and Continuous Deployment. Significance of CI/CD in machine learning and experimentation.
- SDLC
  - Overview: a systematic approach covering software development from start to finish
  - Workflow: build, test, deploy
- Continuous Integration (CI): the practice of frequently building, testing, and merging code changes into a shared repository
- Continuous Delivery (CD): ensures that code changes can be deployed to production at any time, but requires manual approval
- Continuous Deployment (CD): automatically deploys code changes to production without manual intervention
- CI/CD in Machine Learning enables
  - Data versioning
  - Building models and model versioning
  - Automating experiments
  - Testing
  - Deployment
- Introduction to YAML
  - A data formatting language similar to JSON and XML
  - Indentation is meaningful; tabs are not allowed
  - Mappings, sequences, and scalars are the building blocks of YAML
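A minimal snippet illustrating the three building blocks (the keys and values here are illustrative, not from the course):

```yaml
model:                  # mapping: key-value pairs
  name: random_forest   # scalar (string)
  max_depth: 2          # scalar (integer)
  features:             # sequence of scalars
    - humidity
    - temperature
```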
- Introduction to GitHub Actions
  - CI/CD platform to automate pipelines
  - A pipeline is a sequence of steps that represents the flow of work and data
- GHA Components
  - Event: a specific activity in a repository that triggers a workflow run
  - Workflow: an automated process that runs one or more jobs
    - Triggered automatically by an event
    - Housed in .github/workflows
  - Job: a set of steps
    - Each job is independent
    - Parallel execution is possible
    - Executed on compute machines called runners
  - Steps: individual units of work
    - Executed in order; each step can depend on the previous one
    - Run on the same machine, so data can be shared
  - Action: a GHA platform-specific application
    - Examples: checkout a repo, comment on a PR
Components of GHA: events, actions, jobs, steps, runners, and contexts. Workflows that activate upon events like push and pull requests, and how to tailor runner machines. CI pipelines and the intricacies of the GHA log.
- Intermediate YAML
  - Multiline strings: block scalar format
    - Literal style ( | ) preserves line breaks and indentation
    - Folded style ( > ) folds line breaks into spaces
  - Chomping indicators control the behavior of newlines at the end of the string
    - clip (default): keeps a single newline at the end
    - strip ( - ): removes all newlines at the end
    - keep ( + ): retains all trailing newlines
- Dynamic value injection: expressions allow parsers to dynamically substitute values
  - Usage: environment variables and references to other parts of YAML
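For example, in a GitHub Actions workflow an expression injects an environment variable at runtime (the variable name here is illustrative):

```yaml
env:
  MODEL_NAME: random_forest
steps:
  - name: Report model
    run: echo "Training ${{ env.MODEL_NAME }}"
```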
- Multi-document YAML: multiple documents can live in one file, separated by ---
- Setting a basic CI pipeline
- Anatomy of GitHub Actions workflow
name: CI
on:
  push:
    branches: [ "main" ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Run a multi-line script
        run: |
          echo Hello, world!
          echo Add other actions to build,
          echo test, and deploy your project.
- Running repository code
  - Create a feature branch
  - Add repository code
  - Configure workflow event
  - Create PR and trigger workflow
name: CI
on:
  pull_request:
    branches: [ "main" ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Run Python script
        run: |
          echo hello_world.py
          python hello_world.py
- Environment Variables and Secrets
  - Contexts: access information about predefined variables and data
  - Contexts used in this course
    - github: information about the workflow run
    - env: variables set in the workflow
    - secrets: names and values that are available to the workflow
    - job: information about the current job
    - runner: information about the machine
- Variables store non-sensitive information in plain text, e.g. compiler flags, usernames, file paths
  - Global/local scope is controlled by the level where they are defined
  - Accessed from the env context as ${{ env.ENV_VAR }}
- Secrets store sensitive information in encrypted form, e.g. passwords, API keys
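The variable scoping rules above can be sketched in a workflow fragment (all names are illustrative): a workflow-level variable is visible to every job, while a job-level one is local to that job.

```yaml
env:
  GLOBAL_VAR: visible-to-all-jobs
jobs:
  demo:
    runs-on: ubuntu-latest
    env:
      JOB_VAR: visible-only-in-this-job
    steps:
      - run: echo "${{ env.GLOBAL_VAR }} / ${{ env.JOB_VAR }}"
```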
- Setting secrets
  - GITHUB_TOKEN secret: a built-in secret provided by GitHub Actions
    - Used to perform workflow actions
    - Automatically available in every GitHub Actions workflow
    - Permissions can be tuned to the right degree
name: Greeting on variable day
on: pull_request  # trigger omitted in the original notes; pull_request fits a PR comment
# Global env
env:
  Greeting: Hello
# Grant permissions to write comments in PR
permissions:
  pull-requests: write
jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      # Use GITHUB_TOKEN to authorize
      - name: Comment PR
        uses: thollander/actions-comment-pull-request@v2
        with:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          message: |
            Hello world ! :wave:
Integration of machine learning model training into a GitHub Actions pipeline using the Continuous Machine Learning (CML) GitHub Action. Generate a comprehensive markdown report including model metrics and plots. Data versioning in machine learning by adopting Data Version Control (DVC) to track data changes. Setting up DVC remotes and dataset transfers. DVC pipelines and the dvc.yaml file to orchestrate reproducible model training.
- Dataset: Weather Prediction in Australia
- Data preprocessing
  - Convert categorical features to numerical
  - Replace missing values of features
  - Scale features
- Random Forest Classifier with max_depth = 2, n_estimators = 50
- Standard metrics on test data
- Performance plots
  - Confusion matrix plot
- GitHub Actions Workflow
- Continuous Machine Learning (CML)
  - CI/CD tool for Machine Learning
  - GitHub Actions integration
# Enable setup-cml action to be used later
- uses: iterative/setup-cml@v1
- name: Train model
  run: |
    # Your ML workflow goes here
    pip install -r requirements.txt
    python3 train.py
- name: Write CML report
  run: |
    # Add results and plots to markdown
    cat results.txt >> report.md
    echo "![training graph](./graph.png)" >> report.md
    # Create comment from markdown report
    cml comment create report.md
  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- Versioning datasets with Data Version Control
  - Ensures a historical record of data changes
- DVC: Data Version Control tool
  - Manages data and experiments
  - Similar to Git, but data is stored separately
  - Storage options: SSH, HTTP/HTTPS, local file system
  - AWS, GCP, and Azure object storage
> dvc init
Initialized DVC repository.
You can now commit the changes to git.
> dvc add data.csv
> cat data.csv
> cat data.csv.dvc
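The pointer file created by `dvc add` typically looks like this (the hash and size below are illustrative placeholders, not real output):

```yaml
outs:
  - md5: 22a1a2931c8370d3aeedd7183606fd7f  # illustrative content hash
    size: 1024                             # illustrative size in bytes
    path: data.csv
```

The small `.dvc` file is committed to Git, while the data itself goes to the DVC cache or remote.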
- Interacting with DVC remotes
  - Location for data storage
  - Similar to Git remotes, but for cached data
> dvc remote add myAWSremote s3://mybucket
> dvc remote modify myAWSremote connect_timeout 300
> dvc remote add -d mylocalremote /tmp/dvc
> dvc push -r myAWSremote data.csv
> dvc add /path/to/data/datafile
> git commit /path/to/data/datafile.dvc -m "Dataset updates"
> git push origin main
> dvc push
- DVC Pipelines: a sequence of stages defining the ML workflow and its dependencies
  - Versioning data alone is not very useful
  - Run only what's needed
  - Steps form a Directed Acyclic Graph (DAG)
  - Defined in the dvc.yaml file
  - Similar to the GitHub Actions workflow
    - Focused on ML tasks instead of CI/CD
    - Can be abstracted as a step in GHA
> dvc stage add -n preprocess -d raw_data.csv -d preprocess.py -o processed_data.csv python preprocess.py
> dvc stage add -n train -d train.py -d processed_data.csv -o plots.png -o metrics.txt python train.py
> dvc dag
> dvc repro
> git add dvc.lock && git commit -m "first pipeline repro"
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - raw_data.csv
    outs:
      - processed_data.csv
  train:
    cmd: python train.py
    deps:
      - processed_data.csv
      - train.py
    outs:
      - plots.png
      - metrics.txt
Analysis of model performance and the fine-tuning of hyperparameters. Compare metrics and visualizations across different branches to assess changes in model performance. Hyperparameter tuning using scikit-learn's GridSearchCV. Automation of pull requests using the optimal model configuration.
- Configure DVC YAML file to track metrics across experiments
- Querying and comparing DVC metrics
- Change a hyperparameter and rerun dvc repro
- Setting up a DVC GitHub Action
stages:
  preprocess:
    # ...unchanged from the previous dvc.yaml...
  train:
    # ...cmd and deps unchanged...
    outs:
      - confusion_matrix.png
    metrics:
      - metrics.json:
          cache: false
> dvc metrics show
> dvc metrics diff
> dvc plots show predictions.csv
- Hyperparameter tuning route
  - Branch name: hp_tune/<some-string>
  - Make changes to the search configuration
  - Manually open a PR
  - Force-runs the DVC pipeline: dvc repro -f hp_tune
  - Uses cml pr create to create a new training PR with the best parameters
  - Force-push a commit to the training PR to kick off the model training job
- Separate feature branches for training and hyperparameter tuning
- Hyperparameter tuning job kickoff
- Creating a training PR from a hyperparameter run
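The search step itself can be sketched with scikit-learn's GridSearchCV, as named above. This is a hedged illustration: the synthetic data stands in for the preprocessed weather dataset, and the parameter grid is an assumption, not the course's configuration.

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed weather dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Illustrative grid around the course's max_depth=2, n_estimators=50 baseline
param_grid = {"max_depth": [2, 4], "n_estimators": [10, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Persist the winning configuration so a CML step could open a training PR with it
print(json.dumps(search.best_params_))
```

In the workflow above, the printed JSON would be committed to the `hp_tune/...` branch, after which `cml pr create` opens the training PR.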