Below is a clean, practical workflow to add DVC to an existing Git repo and start versioning your data and model artifacts (e.g., large/binary files). Adjust remote choices (S3, GCS, Azure, SSH, local) to your environment.
Prerequisites:
- Git installed and repository initialized
- DVC installed with S3/GCS/Azure support as needed (https://dvc.org/doc/install)
Verify:
- git --version
- dvc --version
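The two checks above can be folded into a small guard at the top of a setup script. The following is a minimal POSIX shell sketch; the function name require_cmds is hypothetical and not part of Git or DVC.

```shell
# Hypothetical helper: succeed only if every named command is on PATH.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: stop early when a prerequisite is absent.
# require_cmds git dvc || exit 1
```

This only confirms that the tools are installed; it does not check that DVC was installed with the extras your remote type needs.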
cd your-project-directory
dvc init
What this does:
- Creates the .dvc/ directory and a .dvcignore file (similar to .gitignore).
git add .dvc .dvcignore
git commit -m "Initialize DVC"
# Example with S3
dvc remote add -d myremote s3://mybucket/path
# Example with local/network storage
dvc remote add -d myremote /path/to/storage
# Example with Google Drive
dvc remote add -d myremote gdrive://folder-id
dvc add data/dataset.csv
# or for a directory
dvc add data/raw/
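This creates a .dvc pointer file next to the data (e.g., data/dataset.csv.dvc, or data/raw.dvc for the directory) and adds the data path to a .gitignore in the same directory (e.g., data/.gitignore), so Git tracks the small pointer instead of the data itself.
dvc add models/model.pkl
git add data/dataset.csv.dvc models/model.pkl.dvc data/.gitignore models/.gitignore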
git commit -m "Track dataset and model with DVC"
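Before pushing, it can be worth confirming that each tracked file really is excluded from Git. The sketch below checks this without calling DVC, assuming dvc add writes a "/<name>" entry into the .gitignore beside the pointer file; the helper name pointer_is_ignored is hypothetical.

```shell
# Hypothetical helper: given a .dvc pointer file, check that the data it
# points to is listed in the .gitignore that sits in the same directory.
pointer_is_ignored() {
    pointer=$1
    dir=$(dirname "$pointer")
    name=$(basename "$pointer" .dvc)
    # Assumes the "/<name>" entry format written by `dvc add`.
    grep -qxF "/$name" "$dir/.gitignore" 2>/dev/null
}

# Usage: warn about pointers whose data Git could still pick up.
# for p in data/dataset.csv.dvc models/model.pkl.dvc; do
#     pointer_is_ignored "$p" || echo "not ignored: $p" >&2
# done
```

The check is textual only; git check-ignore is the authoritative test if ignore rules are defined elsewhere in the repository.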
dvc push
Commit Remote Configuration
git add .dvc/config
git commit -m "Configure DVC remote storage"
git push
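Create dvc.yaml by defining a stage with dvc stage add, then run it with dvc repro, which also writes dvc.lock (dvc stage add replaces the older dvc run):
dvc stage add -n preprocess -d src/preprocess.py -d data/raw -o data/processed python src/preprocess.py
dvc repro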
dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC"
dvc remote add model_ia_spacy_ner s3://mycompany-ia-spacy-ner--models
This will create a section like this in .dvc/config:
['remote "model_ia_spacy_ner"']
url = s3://mycompany-ia-spacy-ner--models
dvc remote default model_ia_spacy_ner
This adds:
[core]
remote = model_ia_spacy_ner
dvc add models/my_model.pkl
git add models/my_model.pkl.dvc models/.gitignore
git commit -m "Track model file with DVC"
dvc push
Resulting .dvc/config will look like:
[core]
remote = model_ia_spacy_ner
['remote "model_ia_spacy_ner"']
url = s3://mycompany-ia-spacy-ner--models
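A script that needs to know which remote is the default can read it straight from this file. The following sketch parses the [core] section with awk; the function name dvc_default_remote is hypothetical, and the parsing assumes the simple layout shown above.

```shell
# Hypothetical helper: print the default remote named in a DVC config file.
# Falls back to .dvc/config when no path is given.
dvc_default_remote() {
    awk '
        # Track whether the current section header is exactly [core].
        /^\[/ { in_core = ($0 == "[core]"); next }
        # Inside [core], print the value of the "remote" key and stop.
        in_core && /^[[:space:]]*remote[[:space:]]*=/ {
            sub(/^[^=]*=[[:space:]]*/, "")
            sub(/[[:space:]]+$/, "")
            print
            exit
        }
    ' "${1:-.dvc/config}"
}

# Usage: echo "Default remote: $(dvc_default_remote)"
```

The function prints nothing when no default remote is set, so callers can test for an empty result.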
Make sure your repo and remote are configured; dvc remote list should show the remote.
You can create the stage definition in dvc.yaml automatically with dvc stage add (recommended):
dvc stage add \
-n prepare_model_ia_spacy_ner \
-d .dvc/config \
-o models/model_ia_spacy_ner/ \
--always-changed \
dvc pull -r model_ia_spacy_ner
Explanation of each flag:
- -n → stage name
- -d → dependency (optional; you can use .dvc/config so changes in remote config retrigger the stage)
- -o → output directory
- --always-changed → ensures the stage runs every time (even if DVC thinks nothing changed)
dvc commit prepare_model_ia_spacy_ner
git add dvc.yaml
git commit -m "Add DVC stage to pull intimacoes classifier model"
- List remotes: dvc remote list
- Show config: dvc config --list
Pull data from remote:
dvc pull
Update tracked files:
dvc add data/dataset.csv
git add data/dataset.csv.dvc
git commit -m "Update dataset"
dvc push
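Since the update cycle is always the same four commands, it can be wrapped in a function. This is a sketch only: run and dvc_update are hypothetical names, and the DRY_RUN variable is an illustrative convention that prints each command instead of executing it.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: re-track a changed file, commit its pointer, and upload.
dvc_update() {
    file=$1
    message=${2:-"Update $file"}
    # Each step runs only if the previous one succeeded.
    run dvc add "$file" &&
    run git add "$file.dvc" &&
    run git commit -m "$message" &&
    run dvc push
}

# Usage (preview first, then run for real):
# DRY_RUN=1; dvc_update data/dataset.csv "Update dataset"
# unset DRY_RUN; dvc_update data/dataset.csv "Update dataset"
```

Chaining with && keeps a failed dvc add from producing a commit that points at stale data.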
Checkout specific version:
git checkout <commit-hash>
dvc checkout
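The same idea works for a single file rather than the whole workspace: restore only its pointer from the older commit, then let DVC sync the data. The sketch below assumes the pointer lives at <file>.dvc; the names run and dvc_restore_file, and the DRY_RUN preview switch, are hypothetical.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: bring one tracked file back to its state at <rev>.
dvc_restore_file() {
    rev=$1
    file=$2
    # Restore only the pointer file from that commit...
    run git checkout "$rev" -- "$file.dvc" &&
    # ...then have DVC replace the data so it matches the pointer.
    run dvc checkout "$file.dvc"
}

# Usage: dvc_restore_file <commit-hash> data/dataset.csv
```

If that version of the data is not in the local cache, running dvc pull for the same target first may be necessary.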