Using shared data space

When working with git on large projects, developers commonly run into one problem: the sheer amount of data stored in each working directory. For example, my busybox folder currently holds 18MB of source code and 20MB in .git. While the code can be compressed to 2.5MB, the git data is already compressed and stays at 20MB.

As soon as you work on several branches at the same time, because you are following different use cases of the same software package, it is useful to keep separate folders with different checkouts of the same project. This avoids switching branches too often, since after each switch you need to recompile everything, which takes time. In this situation the common history becomes a heavy load, both on the work disk and on the backup device. Most git objects are repeated across the folders, because the past history of the package is the same and the local differences between branches are relatively small.

To avoid this duplication, git lets you specify additional object archives, called alternates. Git reads from these archives but never writes to them: commits always happen locally. Alternate locations can be specified in the environment, through GIT_ALTERNATE_OBJECT_DIRECTORIES, or in the file .git/objects/info/alternates within the working directory of the project. The file lists one object directory per line.
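As a minimal sketch of both mechanisms, the script below builds two throwaway repositories in a temporary directory. The names `mirror`, `project`, and `other`, and the demo identity, are assumptions made for illustration and do not come from the original text.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical "mirror" repository holding one commit.
git init -q mirror
git -C mirror -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
commit=$(git -C mirror rev-parse HEAD)

# Empty project repository that borrows objects from the mirror
# through the alternates file (one object directory per line).
git init -q project
mkdir -p project/.git/objects/info
echo "$work/mirror/.git/objects" > project/.git/objects/info/alternates

# The commit is readable from the project, which stores no copy of it.
git -C project cat-file -t "$commit"

# Same effect through the environment, without touching any file.
git init -q other
GIT_ALTERNATE_OBJECT_DIRECTORIES="$work/mirror/.git/objects" \
    git -C other cat-file -t "$commit"
```

Both `cat-file` calls should report the object type of the borrowed commit. An absolute path is used in the alternates file so that the project keeps working if it is moved.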

My personal choice, for my kernel work, is to keep one git archive that only hosts branches I download from the network, and to run git fetch there periodically. Work in progress then lives in git trees that refer to that one as an alternate. I still need to run git fetch in each of the projects, because the status of branches is kept locally, but the fetch run in the secondary repositories finds objects that are already available (in the alternate directory) and does not make another local copy of objects that are part of the upstream package. In fact, in the project-specific directories I always fetch from the other folder on the same computer. This avoids network traffic, and it avoids storing new objects in the wrong place if the upstream branch has grown in the meantime.
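The workflow above can be sketched as follows. This is an illustration under stated assumptions, not the author's actual setup: `upstream` is a local stand-in for the network repository, and `mirror` and `project` are hypothetical names.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in for the network upstream (normally a remote URL).
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "upstream work"

# Local archive that only tracks upstream; "git fetch" is run here periodically.
git clone -q upstream mirror

# Project repository: the archive is both its alternate and its fetch source.
git init -q project
mkdir -p project/.git/objects/info
echo "$work/mirror/.git/objects" > project/.git/objects/info/alternates
git -C project remote add mirror "$work/mirror"
git -C project fetch -q mirror

# Branch state is now known locally, but the objects stay in the archive.
git -C project for-each-ref refs/remotes/mirror
git -C project count-objects -v
```

In the `count-objects -v` report, the `count` and `in-pack` fields describe what the project stores itself, and the `alternate` line names the borrowed object directory.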

By working with alternates you can avoid duplicating quite a lot of data, and the .git folder of each project will weigh only a few megabytes. Moreover, you can choose to back up only .git and ignore the checked-out files, since the checkout can always be extracted from the git repository. Finally, in some cases you can even skip backing up the master local archive, as a copy of upstream can be recovered from the net at any time.
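To illustrate backing up only .git, here is a small sketch that copies the repository data and later rebuilds the checked-out files from it. The directory names and file contents are made up for the example, and a plain `cp -R` stands in for whatever backup tool is actually used.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical project with one committed file.
git init -q project
echo "hello" > project/file.txt
git -C project add file.txt
git -C project -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add file"

# Back up the repository data only, not the checked-out files.
mkdir backup
cp -R project/.git backup/.git

# Restore: recreate the working files from the backed-up repository.
git -C backup reset -q --hard HEAD
cat backup/file.txt
```

If the repository relies on an alternate, the restored copy can only be read while that alternate archive is still present at the path recorded in the alternates file.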

To avoid inadvertently losing some of my work, I also created a new git repository that still uses alternates for upstream data. Into that repository I move all the branches I am not using any more (with a fetch within the local system) before I remove them from my working repositories. (Actually, I now use git remote, but that's an advanced topic.) In this way I keep a local copy of my complete history, without crowding the working place with old branches and without wasting more than a few megs of storage.
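A possible shape for this archiving step is sketched below. The repository names (`mirror`, `project`, `archive.git`) and the branch name `old-topic` are hypothetical, and `git clone --shared` is used only as a shortcut that writes the alternates file for the working repository.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream archive with one commit.
git init -q mirror
git -C mirror -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "upstream"

# Working repository sharing the archive's objects, with a finished branch.
git clone -q --shared mirror project
git -C project checkout -q -b old-topic
git -C project -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "local work"
git -C project checkout -q -   # back to the branch checked out by the clone

# History repository: same alternate, so only local work is stored in it.
git init -q --bare archive.git
mkdir -p archive.git/objects/info
echo "$work/mirror/.git/objects" > archive.git/objects/info/alternates
git -C archive.git fetch -q "$work/project" old-topic:old-topic

# Only after the branch is safely archived, drop it from the working repository.
git -C project branch -q -D old-topic
git -C archive.git log -1 --format=%s old-topic
```

The fetch runs entirely on the local system, and the order matters: the branch is deleted from the working repository only once the archive holds it.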

Url: https://www.linux.it/~rubini/docs/git-en/git-en.html

#git,#versioncontrol,#gitcommands,#gitrepository,#gitconfig,#gitlog