Putting Jupyter notebooks in source control is always a good idea. However, notebooks can grow quite large when cells produce large outputs or embedded images. On top of that, re-running a notebook to generate different results creates a large diff for little or no change in the code, further increasing the repo size. Eventually the repo becomes large enough to be unwieldy, or even hits a maximum repo size such as the 2GB limit that some Git servers enforce.
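To get a feel for how much of a notebook's size is output rather than code, here is a rough Python sketch. It assumes the notebook is in the nbformat 4 JSON layout (a top-level `cells` list, with `outputs` on each code cell). The helper names `strip_outputs` and `output_share` and the synthetic `demo` notebook are made up for illustration.

```python
import json


def strip_outputs(nb: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    stripped = json.loads(json.dumps(nb))  # simple deep copy via JSON round-trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def output_share(nb: dict) -> float:
    """Fraction of the serialized notebook size that disappears when outputs are stripped."""
    full = len(json.dumps(nb))
    lean = len(json.dumps(strip_outputs(nb)))
    return (full - lean) / full


# Synthetic one-cell notebook (hypothetical data): a short line of code, a long output.
demo = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('x' * 5000)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["x" * 5000]}
            ],
        }
    ],
}

print(f"{output_share(demo):.0%} of this notebook is output")
```

For a real notebook, load it with `json.load(open(path))` and pass the result to `output_share`. Image outputs are stored as base64 text inside `outputs`, so they are counted the same way.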
The solution is to strip the output from notebooks automatically, so that it is never committed by accident. Git provides two mechanisms for this:
- Pre-commit hooks: Pre-commit hooks must be set up after each clone, and can wipe the output of your notebooks right before a commit. However, this also wipes the output in your working copy, which is undesirable.
- Clean/smudge filters: Custom filters must also be set up after each clone, but they are the better solution: a "clean filter" wipes the output in the staged c