Before getting into Git, let's establish a few ideas common to all version control systems:
Given two versions of a file, we can find the diff(erence) between them, line-by-line. For example, given two versions of this file:
Version A:
    Roses are red
    Violets are green
    And so are you
Version B:
    Roses are red
    Violets are blue
    Sugar is sweet
    And so are you
...we can express the diff from A to B as:
-Violets are green
+Violets are blue
+Sugar is sweet
...where + starts each line added and - starts each line removed. (Note that when a line is modified, we express that as removing the line and adding a replacement line.)
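As a rough illustration, the diff above can be reproduced with Python's standard difflib module. This is only a sketch: the helper name line_diff is made up for this example, and real diff tools (including Git's) use their own algorithms and output formats.

```python
import difflib

version_a = ["Roses are red", "Violets are green", "And so are you"]
version_b = ["Roses are red", "Violets are blue", "Sugar is sweet", "And so are you"]


def line_diff(old, new):
    """Return '-line' / '+line' entries describing how to get from old to new.

    Hypothetical helper for illustration; unchanged lines are omitted.
    """
    changes = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A modified line shows up as a removal followed by an addition.
        if tag in ("replace", "delete"):
            changes.extend("-" + line for line in old[i1:i2])
        if tag in ("replace", "insert"):
            changes.extend("+" + line for line in new[j1:j2])
    return changes


diff_a_to_b = line_diff(version_a, version_b)
print("\n".join(diff_a_to_b))
```

Running this prints the same three lines as the diff shown above.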
Given two versions of a file, we may wish to reconcile their differences into one new version. This can often be done reasonably well by an algorithm if we also have a common ancestor of these two versions (a version of the file from which they both derive). By comparing each of the two versions with the common ancestor, the algorithm can see what changes each version represents (relative to the common ancestor) and then combine those changes into one file. For example:
Version A:
    Violets are blue
    And so are you
Version B:
    Roses are red
    Violets are blue
    Sugar is sweet
    And so are you
Version C (the common ancestor):
    Roses are red
    Violets are blue
    And so are you
Compared to version C, version A has removed the line 'Roses are red'.
Compared to version C, version B has added the line 'Sugar is sweet'.
These changes relative to C do not conflict, and so we can get this automatic merger:
Violets are blue
Sugar is sweet
And so are you
However, in some cases, the changes in the two versions relative to the common ancestor may conflict, in which case a human will have to resolve the differences. In this example, versions A and B edit the same line in different ways, and so the merge tool will inject nasty lines of <<<<<<<<<<<<<<, ================, and >>>>>>>>>>>>> to denote the conflict:
Version A:
    Roses are green
    And so are you
Version B:
    Roses are brown
    And so are you
Version C (the common ancestor):
    Roses are red
    And so are you
<<<<<<<<<< Version A <<<<<<<<<<
Roses are green
================================
Roses are brown
>>>>>>>>>> Version B >>>>>>>>>>
And so are you
As the human user, it is now our responsibility to edit the file into the state we want. This generally means picking which of the two conflicting versions to keep and deleting the unwanted lines, e.g.:
Roses are green
And so are you
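The two merge examples above can be sketched in a few lines of Python. This is a deliberately simplified three-way merge, not the algorithm Git uses: the names edits, overlaps, merge3 and MergeConflict are invented for this sketch, and on a conflict it raises an exception instead of writing conflict markers into the file.

```python
import difflib


class MergeConflict(Exception):
    """Raised when both versions change the same part of the ancestor."""


def edits(base, version):
    """List the changes in version as (start, end, replacement) over base lines."""
    matcher = difflib.SequenceMatcher(a=base, b=version, autojunk=False)
    return [(i1, i2, version[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def overlaps(x, y):
    """Conservative check: same starting line, or intersecting line ranges."""
    return x[0] == y[0] or max(x[0], y[0]) < min(x[1], y[1])


def merge3(base, a, b):
    """Combine the changes of a and b, both taken relative to base."""
    edits_a, edits_b = edits(base, a), edits(base, b)
    for ea in edits_a:
        for eb in edits_b:
            # Identical edits on both sides are fine; differing ones conflict.
            if ea != eb and overlaps(ea, eb):
                raise MergeConflict(f"both versions changed base lines {ea[0]}..{ea[1]}")
    combined = edits_a + [e for e in edits_b if e not in edits_a]
    merged = list(base)
    # Apply from the bottom up so earlier line numbers stay valid.
    for start, end, replacement in sorted(combined, key=lambda e: (e[0], e[1]), reverse=True):
        merged[start:end] = replacement
    return merged


# First example: the changes do not conflict.
clean = merge3(
    base=["Roses are red", "Violets are blue", "And so are you"],
    a=["Violets are blue", "And so are you"],
    b=["Roses are red", "Violets are blue", "Sugar is sweet", "And so are you"],
)
print(clean)

# Second example: both versions edit the same line.
try:
    merge3(
        base=["Roses are red", "And so are you"],
        a=["Roses are green", "And so are you"],
        b=["Roses are brown", "And so are you"],
    )
    conflict_detected = False
except MergeConflict:
    conflict_detected = True
print(conflict_detected)
```

The first call yields the automatic merger shown earlier; the second raises MergeConflict, which corresponds to the point where a human has to step in.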
If we want to merge two versions of a directory and all their contents, again, it's best if we can use a common ancestor to combine two sets of differences. For example, say we have three versions of one directory:
Version A:
    cat.txt
    dog.txt
Version B:
    cat.txt
    dog.txt
    bird.txt
    gorilla.bat
Version C (the common ancestor):
    cat.txt
    dog.txt
    bird.txt
Compared to version C, version A deleted bird.txt.
Compared to version C, version B added gorilla.bat.
So we get this automatic merger:
cat.txt
dog.txt
gorilla.bat
Once we have the merged directory listing, we must also merge together the files common to both versions A and B using their respective common ancestors in C. In this case:
- merge cat.txt of A with cat.txt of B (using common ancestor cat.txt of C)
- merge dog.txt of A with dog.txt of B (using common ancestor dog.txt of C)
Each file merge might produce conflicts which we will have to resolve by hand.
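The directory example above can be sketched with Python sets. This is only a toy model: the function name merge_listing is invented here, and it treats a directory as a flat set of file names, ignoring subdirectories, renames, and file contents.

```python
def merge_listing(base, a, b):
    """Merge two directory listings using their common ancestor (hypothetical helper)."""
    added = (a - base) | (b - base)      # files either version introduced
    deleted = (base - a) | (base - b)    # files either version removed
    return (base | added) - deleted


version_a = {"cat.txt", "dog.txt"}
version_b = {"cat.txt", "dog.txt", "bird.txt", "gorilla.bat"}
version_c = {"cat.txt", "dog.txt", "bird.txt"}  # common ancestor

merged_listing = merge_listing(version_c, version_a, version_b)
print(sorted(merged_listing))

# Files present in A, B, and C still need a content merge, file by file.
needs_file_merge = sorted(version_a & version_b & version_c)
print(needs_file_merge)
```

The first print gives the merged listing (cat.txt, dog.txt, gorilla.bat), and the second lists the two files whose contents must then be merged individually.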
A Git repo (repository) stores a few kinds of things:
- A commit stores a snapshot state of a directory and its content (including subdirectories). Once created, a commit is never modified and (usually) never deleted. A commit can point to one or more other commits which are its parents, versions of the directory state from which this commit derives. For example, if commit A has parent B, then A is a derivative version created from B. If commit A has parents B and C, then A represents the merger of B and C. The first commit created in a repo is usually the only commit with no parent.
- A ref (reference) is simply a named pointer to a commit. Refs come in two kinds: tags and branches. A tag is meant to uniquely identify a particular commit in a fixed way, such as to designate a commit with a particular version number, e.g. 'v3.2alpha', 'v0.13', etc. A branch is meant to denote a commit and all of that commit's ancestors. When you create a new commit A derived from parent B, the branch you currently have checked out is automatically updated to point to A instead of B; any other branches pointing to B stay where they are. Branches should generally point only to 'tips' a.k.a. 'heads' (commits with no children).
- The working directory is not part of the repo but rather where we view and edit the files of our project. The repo itself is usually stored in the working directory under the subdirectory called .git. When a Git repo is hosted on a server, we generally don't need or want it to have a working directory; a Git repo with no working directory is said to be 'bare'.
- We create a commit by first staging changes in the index a.k.a. staging area. For example, if I want the next commit to include modifications I've made to a file in my working directory, I tell the index to add the file's changes. If I want the next commit to remove a file that exists in the previous commit, I tell the index to remove the file. In other words, the index records the changes that the next commit will represent relative to its parent(s).
- Inside the .git repo directory, a text file called config stores various options which control how Git operates upon the repo.
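To make the relationship between commits and refs concrete, here is a small toy model in Python. It is a sketch under heavy assumptions: the names Commit and ToyRepo are invented for illustration, snapshots are plain dictionaries of file name to contents, and nothing here reflects how Git actually stores objects on disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a commit is never modified once created
class Commit:
    message: str
    snapshot: tuple      # immutable stand-in for the directory state
    parents: tuple = ()  # zero, one, or several parent commits


class ToyRepo:
    """Hypothetical in-memory repo: commits plus named pointers to them."""

    def __init__(self, branch="main"):
        self.branches = {branch: None}  # branch name -> commit (movable)
        self.tags = {}                  # tag name -> commit (fixed)
        self.current = branch           # the checked-out branch

    def commit(self, message, files):
        parent = self.branches[self.current]
        parents = (parent,) if parent is not None else ()
        new = Commit(message, tuple(sorted(files.items())), parents)
        # Creating a commit moves the checked-out branch to the new commit.
        self.branches[self.current] = new
        return new


repo = ToyRepo()
first = repo.commit("Add cat", {"cat.txt": "meow"})
repo.tags["v0.1"] = first
second = repo.commit("Add dog", {"cat.txt": "meow", "dog.txt": "woof"})

print(first.parents)                     # the first commit has no parent
print(second.parents[0].message)         # the second derives from the first
print(repo.branches["main"].message)     # the branch followed the new commit
print(repo.tags["v0.1"].message)         # the tag stayed where it was
```

The point of the sketch is the difference in behavior: the branch moves forward as commits are made, while the tag keeps pointing at the commit it was given.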
In centralized version control systems, such as Subversion and CVS, a single repo lives on a central server, and each user has their own working directory on their local machine.
In distributed version control systems, like Git, each user has their own repo, and commits are copied between repos. Typically, users on a project coordinate through a central repo on a server: a user makes new commits on their local machine and then copies them to the repo on the server; to get the work of others, a user copies commits from the repo on the server to their own local repo.
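The copying of commits between repos can be pictured with a toy model in Python. Everything here is an assumption made for illustration: each repo is a dictionary mapping a made-up commit id to the ids of its parents, and copy_commits is a hypothetical helper. Real Git identifies commits by hashes and only transfers the commits needed for the refs being exchanged.

```python
def copy_commits(source, destination):
    """Copy every commit the destination lacks; return the ids copied."""
    copied = [cid for cid in source if cid not in destination]
    for cid in copied:
        destination[cid] = source[cid]
    return copied


# The central repo on the server starts with two commits.
server = {"c1": [], "c2": ["c1"]}

# Alice and Bob each have their own full copy of the repo.
alice = dict(server)
bob = dict(server)

# Alice makes a new commit locally, then copies it to the server.
alice["c3"] = ["c2"]
pushed = copy_commits(alice, server)

# Bob gets Alice's work by copying from the server into his own repo.
fetched = copy_commits(server, bob)

print(pushed, fetched)
```

After both copies, all three repos hold the same set of commits, even though Alice and Bob never talked to each other directly.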
Once you understand how Git stores data, the only hard part in learning to use Git is remembering exactly how the commands affect your repo and working directory. Some commands affect only the working directory, some only the commits, some only the index, and some only the refs. However, several commands affect a mix of these things, and some commands affect multiple repos. Here's a quick rundown of the most essential commands:
- To create a new, empty repo, we use the git init command.
- To copy a repo, we can simply copy the .git directory, but more commonly we use the git clone command, which copies the repo but then also conveniently sets certain configuration options in the new copy.
- We can copy commits from another repo into our own using the git fetch command. If the other repo is on another machine running a Git server, we can fetch over the network.
- To copy commits in the other direction (from our repo to another), we use the git push command. If the other repo is on another machine running a Git server, we can push over the network. (The git push command not only copies commits, it also modifies certain tags and branches.)
- To stage changes in the index, we use the git add and git rm commands.
- To create a commit from the changes we've staged, we use the git commit command.
- To merge another commit into the commit we currently have checked out, we use git merge. If the changes do not conflict, Git normally creates the merge commit for us (or, when our current commit is an ancestor of the other commit, simply moves our branch forward to it). If there are conflicts, no commit is created: the partial result, conflict markers included, is left in the working directory. Once we've fixed the merge conflicts, we stage the changes (using git add) and make the commit (using git commit).
- To set our working directory's state to match the state represented by an existing commit, we use the git checkout command.
- To create, modify, and delete branches, we use git branch.
- The git pull command is similar to git fetch, but it also updates our current branch to include the fetched commits, which in some cases triggers a merge.
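A few of the commands above can be tried out safely in a throwaway directory. The script below is a minimal sketch that drives the git command line from Python; it assumes git is installed and on the PATH (it returns None otherwise), and the helper names git and demo, the file poem.txt, the branch name experiment, and the placeholder identity are all made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command in cwd and return its output (hypothetical helper)."""
    result = subprocess.run(
        # A placeholder identity so that 'git commit' works on any machine.
        ["git", "-c", "user.name=Example", "-c", "user.email=example@example.com", *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def demo():
    if shutil.which("git") is None:
        return None  # git is not installed; nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        git("init", cwd=work)                       # create a new, empty repo
        (work / "poem.txt").write_text("Roses are red\n")
        git("add", "poem.txt", cwd=work)            # stage the new file in the index
        git("commit", "-m", "Add poem", cwd=work)   # create a commit from the index
        git("branch", "experiment", cwd=work)       # create a second branch
        return git("log", "--oneline", cwd=work), git("branch", "--list", cwd=work)


result = demo()
if result is not None:
    log_output, branch_output = result
    print(log_output)
    print(branch_output)
```

The log shows the single commit, and the branch listing shows the new experiment branch alongside the default branch, whose name depends on the local Git configuration.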