@johnidm
Created October 15, 2025 16:48

Initialize DVC in a repo for datasets and binary model versions

Step-by-step: Initialize DVC in a repo for datasets and binary model versions

Below is a clean, practical workflow to add DVC to an existing Git repo and start versioning your data and model artifacts (e.g., large/binary files). Adjust remote choices (S3, GCS, Azure, SSH, local) to your environment.

Prerequisites:

  • Git and DVC installed
  • An existing Git repository for your project

Verify:

  • git --version
  • dvc --version
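
Both checks can be folded into one small script that reports every missing tool at once. The sketch below is a minimal POSIX shell helper; the function names have_cmd and check_prereqs are illustrative and not part of Git or DVC.

```shell
#!/bin/sh
# Return 0 if the named command is on PATH, non-zero otherwise.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print one status line per tool; return 1 if any tool is missing.
check_prereqs() {
  prereq_missing=0
  for prereq_tool in "$@"; do
    if have_cmd "$prereq_tool"; then
      echo "ok: $prereq_tool"
    else
      echo "missing: $prereq_tool"
      prereq_missing=1
    fi
  done
  return "$prereq_missing"
}

# Check the two tools this guide relies on.
check_prereqs git dvc || echo "Install the missing tools before continuing."
```

Because check_prereqs returns non-zero when anything is missing, it can also gate a setup script: check_prereqs git dvc || exit 1.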

Initialize DVC in Your Repository

cd your-project-directory
dvc init

What this does:

  • Creates the .dvc/ directory and a .dvcignore file (similar to .gitignore).

Commit DVC Configuration

git add .dvc .dvcignore
git commit -m "Initialize DVC"

Configure Remote Storage

# Pick ONE of the following; -d makes the remote the default.

# Example with S3
dvc remote add -d myremote s3://mybucket/path

# Example with local/network storage
dvc remote add -d myremote /path/to/storage

# Example with Google Drive
dvc remote add -d myremote gdrive://folder-id

Track Your Dataset

dvc add data/dataset.csv
# or for a directory
dvc add data/raw/

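This creates a .dvc pointer file next to the data (e.g., data/dataset.csv.dvc, or data/raw.dvc for a directory) and adds the data path to a .gitignore in the same directory, so Git tracks the small pointer file instead of the data itself.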
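
When scripting the follow-up git add, it helps to derive the two Git-tracked paths from the data path. The sketch below assumes DVC's default layout (pointer file beside the target, .gitignore in the target's directory); dvc_sidecars is a hypothetical helper name, not a DVC command.

```shell
#!/bin/sh
# Print the two files Git should track after `dvc add <target>`:
# the <target>.dvc pointer and the .gitignore in the target's directory.
# Assumes DVC's default layout (no custom output file or location).
dvc_sidecars() {
  sidecar_target=${1%/}    # strip a trailing slash on directory targets
  echo "$sidecar_target.dvc"
  echo "$(dirname "$sidecar_target")/.gitignore"
}

dvc_sidecars data/dataset.csv
# prints:
#   data/dataset.csv.dvc
#   data/.gitignore

dvc_sidecars data/raw/
# prints:
#   data/raw.dvc
#   data/.gitignore
```

A possible use is git add $(dvc_sidecars data/dataset.csv). The unquoted expansion splits on whitespace, so this shortcut only suits paths without spaces.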

Track Your Model Files

dvc add models/model.pkl

Commit DVC Tracking Files

git add data/dataset.csv.dvc models/model.pkl.dvc data/.gitignore models/.gitignore
git commit -m "Track dataset and model with DVC"

Push Data to Remote

dvc push

Commit Remote Configuration

git add .dvc/config
git commit -m "Configure DVC remote storage"
git push

Optional: configure DVC pipeline and reproducibility

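Define a stage with dvc stage add, which writes dvc.yaml, then execute it with dvc repro, which writes dvc.lock (the older dvc run command did both in one step but was removed in DVC 3.0):

dvc stage add -n preprocess -d src/preprocess.py -d data/raw -o data/processed python src/preprocess.py
dvc repro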
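
For reference, the command above records a stage named preprocess in dvc.yaml. The entry should look roughly like the YAML written by the sketch below. The helper name write_preprocess_stage and the scratch file are illustrative; in a real project DVC writes dvc.yaml itself, and the exact ordering of keys and list items may differ between versions.

```shell
#!/bin/sh
# Write a hand-made equivalent of the `preprocess` stage to the given file.
# Illustrative only: DVC normally generates dvc.yaml for you.
write_preprocess_stage() {
  cat > "$1" <<'EOF'
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
    - src/preprocess.py
    - data/raw
    outs:
    - data/processed
EOF
}

# Write to a scratch file so an existing dvc.yaml is never overwritten.
stage_file=$(mktemp)
write_preprocess_stage "$stage_file"
cat "$stage_file"
```

Reading the file this way makes the mapping visible: -n becomes the stage key, each -d becomes an entry under deps, each -o an entry under outs, and the trailing command becomes cmd.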

Add a remote configuration

Initialize DVC (if not done yet)

dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC"

Add your storage (S3) remote

dvc remote add model_ia_spacy_ner s3://mycompany-ia-spacy-ner--models

This will create a section like this in .dvc/config:

['remote "model_ia_spacy_ner"']
    url = s3://mycompany-ia-spacy-ner--models

Set it as the default remote

dvc remote default model_ia_spacy_ner

This adds:

[core]
    remote = model_ia_spacy_ner

Push data to the remote

dvc add models/my_model.pkl
git add models/my_model.pkl.dvc models/.gitignore
git commit -m "Track model file with DVC"

dvc push

Resulting .dvc/config will look like:

[core]
    remote = model_ia_spacy_ner
['remote "model_ia_spacy_ner"']
    url = s3://mycompany-ia-spacy-ner--models
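
To confirm the default remote without opening the file, dvc remote default with no arguments prints the current value. Where DVC is not installed, for example in a lightweight CI check, the same value can be read straight from the file. The sketch below is one such reader; default_remote is a hypothetical helper name, and the parsing assumes the simple layout shown above.

```shell
#!/bin/sh
# Print core.remote from a DVC config file (defaults to .dvc/config).
default_remote() {
  awk '
    /^\[/ { in_core = ($0 == "[core]"); next }   # track the current section
    in_core && $1 == "remote" && $2 == "=" { print $3 }
  ' "${1:-.dvc/config}"
}

# Demo against a scratch copy of the config shown above.
demo_config=$(mktemp)
cat > "$demo_config" <<'EOF'
[core]
    remote = model_ia_spacy_ner
['remote "model_ia_spacy_ner"']
    url = s3://mycompany-ia-spacy-ner--models
EOF
default_remote "$demo_config"   # prints: model_ia_spacy_ner
```

Inside a repository, calling default_remote with no argument reads .dvc/config directly.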

Create a DVC stage (automatically pulls a model from remote storage)

Make sure your repo and remote are configured. To check, run: dvc remote list

Create the DVC stage

You can create the stage entry in dvc.yaml automatically with dvc stage add (recommended):

dvc stage add \
  -n prepare_model_ia_spacy_ner \
  -d .dvc/config \
  -o models/model_ia_spacy_ner/ \
  --always-changed \
  "dvc pull -r model_ia_spacy_ner"

Explanation of each flag:

  • -n → stage name
  • -d → dependency (optional; you can use .dvc/config so changes in remote config retrigger the stage)
  • -o → output directory
  • --always-changed → ensures the stage runs every time (even if DVC thinks nothing changed)

Associate current files with the stage

dvc commit prepare_model_ia_spacy_ner

Commit your changes:

git add dvc.yaml dvc.lock
git commit -m "Add DVC stage to pull spaCy NER model"

Inspecting remotes

  • List remotes: dvc remote list
  • Show config: dvc config --list

Common Workflow Commands

Pull data from remote:

dvc pull

Update tracked files:

dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset"
dvc push
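
The four update steps always run in the same order, so they are easy to wrap in one function that stops at the first failure. This is a minimal sketch; update_tracked is a hypothetical helper name, not a DVC command, and it assumes the pointer file sits next to the data as <path>.dvc.

```shell
#!/bin/sh
# Re-track a changed file, commit the updated pointer, and upload the data.
# Stops at the first failing step. $1 = tracked path, $2 = commit message.
update_tracked() {
  dvc add "$1" &&
  git add "$1.dvc" &&
  git commit -m "$2" &&
  dvc push
}

# Usage (inside a DVC-enabled repo):
#   update_tracked data/dataset.csv "Update dataset"
```

Chaining with && means a failed dvc add never produces a Git commit, and a failed commit never triggers an upload.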

Checkout specific version:

git checkout <commit-hash>
dvc checkout
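
The two commands belong together: git checkout restores the .dvc pointer files for that revision, and dvc checkout then updates the workspace data to match them. A minimal wrapper, with checkout_version as a hypothetical helper name:

```shell
#!/bin/sh
# Switch code and data to the same revision: Git first, then DVC.
# $1 = commit hash, branch, or tag.
checkout_version() {
  git checkout "$1" &&
  dvc checkout
}

# Usage (inside a DVC-enabled repo):
#   checkout_version <commit-hash>
```

If the data for that revision is not in the local cache, dvc checkout cannot restore it, and a dvc pull is needed as well.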