Lesson 1: The Basics

The goal for this lesson is to bolster your confidence using git. We won't attempt a comprehensive review, but we will try to establish a more holistic understanding of what it is, what it isn't, and how you can begin to make it work for you.

This is not:

...a reference for commands (use the --help option or look online instead)
...a prescription for how to use git (that's up to you to ultimately decide)
...a high production-value way to learn git
...a replacement for practice and uncovering your own answers

Git vs. Github

Let's be very clear about one thing:

Git is not the same thing as Github.

Git is a program used to version control your software. It is open-sourced and can be called from the command line or integrated into your editor. There are other version control systems (VCS) out there (and they are all designed to do the same thing, though the way they achieve this is different), but git is by and large the most popular. Git borrows many principals from Linux (and, perhaps unsurprisingly, shares a common founder: Linus Torvalds).

Github is a service to host and manage git repositories online. There are many others: Bitbucket, GitLab, Sourceforge, etc. These are largely similar, but they compete on the margin on features like integrations, security, administration, etc. It can even be used in CI/CD (think: Buildkite, Jenkins, etc.) or more complex ways beyond the scope of this discussion. Github was acquired by Microsoft in June 2018.

Git: Objects & Pointers

Git, much like dbt or Airflow, essentially manages a directed acyclical graph (DAG). Whereas for dbt and Airflow, models and tasks are the unit of interest, respectively, for git the unit of interest is the state of a set of files and folders.

Let's look at this last sentence once more:

...unit of interest is the state of a set of files and folders.

These italicized words are concepts the humans can think about in the abstract, but for a computer, it needs this information to be encoded in a consumable way. To make these ideas into something tangible, git models these ideas as object-key pairs (think: a Python dictionary).

`git` object	Description
commit	state - reference to a tree and one or more parent commits
blob	files - content of a file
tree	folders - structured collection of blobs and trees

Git elegantly relates all of these things together by assigning keys to all of these objects. This allows git to interleave objects together without needing to copy the same data over again each time a new commit is made (unless every single file had indeed changed commit-over-commit).

The picture below is a representative example of how commits bring these pieces together under the hood:

Branching

There's probably another git entity you're familiar with: branches (and perhaps, tags, but we won't worry about those here). These may seem like substantive things, but in actuality branches (and tags) are simply pointers to commits! What this looks like on your computer is simply a file named your branch name whose contents are the SHA-1 hash of the commit being pointed to by the branch.

Types of Branches

There are 4 types of branches and their definition depends on largely on where these branches live and how they are associated with another branch(es):

Remote branch: located on remote; e.g., what receives changes in PRs
Remote-tracking branch: located locally; i.e., what receives data from git fetch
Local-tracking branch: located locally; i.e., branch corresponding to its remote-tracking branch that receives changes from git pull
Local branch: located locally; i.e., simply a local branch not associated with a branch in the remote

Additional Resources

Git: A Common Local Workflow

On the data team at Flexport, we tend to use the same pattern of development or workflow. This workflow involves a "monorepo" (e.g., kimono 👘) wherein "prod" is HEAD (pointing to master, which is itself a pointer pointing to the latest commit). As you already know, this means that whenever you want to contribute to a project, you create a branch from [the latest commit] on master and name it something descriptive (e.g. test in the graphic above).

We then go about our jobs, committing as we go. But we're not working in a vacuum! Our remote parent branch (usually master) is also having work added to it (as people approve PRs and then incorporate their work onto remote master). How we reconcile this divergence is what we tackle in the next section, but for now, let's keep out attention on what's really going on as we make changes to our copy of the codebase locally.

The Three "Rails" of Git

Reminder: We're still in the realm of git and the file system on our local machine.

Consider Google Sheets. As soon as you enter data into a cell, that data is cached and immediately sent over the wire to be stored by whatever mechanism Google Drive uses to store data. Boom. Done. This is perfectly fine for most use cases, but imagine trying to recover a past version of a document. Google provides timestamped versions, but good luck trying to find, say days later on an active document, which wise-ass deleted your brilliant idea.

Git takes a different approach (for lots of reasons we won't get into here). What's important to know are the three "rails" involved to get the change you have mind into the repo:

The working tree (also called the "working directory")
The staging area (also called the "staging index", or simply, the "index")
The [local] repo (also called the "commit tree", "HEAD", "history")

More often than not, you really care about going from #1 to #3 -- you've changed some files and now you want to commit those to the repo. This of course is easily achieve in an economy of commands (assuming you're using the command line interface):

Change some files and save these changes to disk
git add --all to stage all changes to all files
git commit -m "Create some files; modify others" to commit your changes

Note: git commit -a -m "[Your commit message here]" condenses the last two steps into one command.

That's pretty easy! But let's slow down for a second and see why the staging area is valuable beyond just getting in the way of us committing directly to our repo.

I want to emphasize that what you see in your terminal as a consequence of moving lines of code between pairs of "rails" are the diffs (or change sets) of the state of your files in the repo. Using the illustration above, the red text you'll see (read: unstaged work) is the area between the working tree and staging area rails, whereas the green text is the diff between the staging area and HEAD. This will become clearer in the demos below.

Demo: Staging is Amazing

The staging area is a good place to put candidate changes as you're working. If the mantra is commit early and often, staging changes allows you to have that internal debate you're having about the changes you have to implement externally on your machine.

More importantly, I will argue that staging is super important for at least two reasons:

Staging lets you operate on "hunks" (or "patches") which should save you time from having to open your editor -- productivity benefit
Staging can prevent the loss of information because once something is staged, it gets tracked by git -- safety benefit

Disclaimer: I should note that this is but one way to use git effectively. Some people have completely different workflows, which speaks to how much flexibility git affords.

Benefit #1: Productivity

Create many_hunks.txt; fill it with contents and save to disk
Stage this file
Make a many, many more edits; you're dissatisfied
Use git add -p many_hunks.txt to choose what you want to keep vs. discard

Benefit #2: Safety

Create doomed.txt; give it some contents and save to disk
Stage this file as you watch what happens to the contents of .git/
Force delete this file (git rm -f doomed.txt). 😱
Recover this file to your working tree

Warning: This is merely a safety net! Git will "garbage collect" these orphaned blobs after a while, so commit or use git stash (out of scope for this discussion) to make more lasting changes!

The Different `diff`s

The git diff command gives you insight into the differences between states of your file(s). In it's simplest form, you can pass in any two "commit-ish" objects as parameters to see the difference in state of your files. The following graphic captures many of the useful applications of this command in the contexts you'll likely care most about:

Note: git diff --cached is the same thing as git diff --staged.

Got Lost? Git Log!

git log is one of the most powerful commands for introspecting your git database (read: repo). This is but a sampling of things this command can help answer while you're knee deep in your work:

Where is branch X pointing?
Am I in line with remote master?
Where did I add this line of code to our code base?
When did person Y merge their PR?
...and much, much more!

Demo: Querying with `git log`

Show vanilla git log vs. variations
Show all commits by a given person
Show all potty-mouth commit messages and associated commits
Show, roughly, how many times I made adjustments at Graham's request
Show the "pickaxe" operation (-S, -G) flag

Note: Aliases are your friends! They will let you abstract away the specific options and parameters so you can say focused.

Another useful introspection command is git blame, but we'll save that for another talk or Lesson 2.

Additional Resources

Merging

No man is an island entire of itself; every man is a piece of the continent, a part of the main... - John Donne

For better (or for worse, depending on who ask around here), your work does not exist in a silo. As such, you will need to sync what's happening around you with your body of work (vice versa, and when the time comes). For now, we will focus on only one way to achieve this synchronicity: merging (namely, pairwise fast-forward and recursive merging).

The "happy path" is straightforward, say in the context of kimono 👘:

You make some changes to the codebase as others are merging their PRs into master
You merge those changes into your local feature_branch (and deal with conflicts as they arise)
When ready, you push to remote feature_branch and open PR.

This really boils down to:

💡 Merge others' changes frequently as you work to mitigate the extent of messy merge conflicts and address any conflicts as they arise.

This will also lend itself to communicating with other people who are touching the same files you are working on! Given the surface area of our project and the structure of our team, if you follow this advice, you should be running into a merge conflicts pretty infrequently (even rarely). More on how to deal with these conflicts (should they arise), in the next section.

What's Happening in a Merge?

In short, you are coalescing your changes with the state of the repo at the time of merging. How this is accomplished really depends on the specific situation you're in.

A fast-forward merge occurs when the target branch (e.g., master) can be reached directly from your child branch. The pointer for the target branch will "fast-forward" to the pointer of your current branch, and HEAD will point to both of these branches (upon checkout).

A recursive merge is needed when there is not a direct pathway between the two (or more) branches being merged. Two branches can be coerced to a recursive merge by passing no-ff. A merge commit is created and this commit will have two parents.

There are a few more, but I will leave that, as book publishers are fond of saying, as an exercise for the reader. 🐙

Merge Conflicts

Merge conflicts occur when competing changes are made to the same line of a file, or when one person edits a file and another person deletes the same file.

I'm not going to expound too much here (I've linked a useful guide below), but simulating and working through your own merge conflicts is, your correspondent's opinion, the best way to learn how to cope with these [infrequent] occurrences.

Additional Resources

Github: Your Code, Hosted

So you've been working and committing things locally on your branch periodically until you're done and ready to merge your work into remote master. As you've probably surmised by now, you can't simply git push origin master directly -- and for good reason!

Instead, we employ a pretty common collaborate workflow pattern to incorporate distributed changes into one, mainline branch (i.e., master): pull requests.

Pull Requests (PRs)

A pull request (PR) is simply a collection of bells and whistles to make collaborating much more facile, transparent, and measurable. Take a look around the next time you open a PR. What are the main features that you see?

A PR is a "pull" request because we are requesting that our changes get "pulled" into the upstream (whether that be another branch in the existing repo or in another repo altogether). There's "forum"-like conversation thread, different ways to view the proposed changes, and mechanisms to enable or disable certain actions. Hence, a PR is the real-life manifestation of the conversation and decisions made in your head if you were working alone and trying to integrate changes from a feature branch into your [own] project's mainline branch. (What's that? You don't want Graham's voice in your head?)

Merging PRs?

Often, you'll actually be presented the option of how you'd like your recently approved PR to be integrated into prod/remote master:

First, some digging with our handy git log command shows some interesting results. If you turn the graph option on, you can also visualize how various changes were assimilated.

Additional Resources

About Pull Request Merges

Resources

The actual reference materials used for today's talk is listed below:

git - the simple guide
A Visual Git Reference
LinkedIn Learning (via Okta)
O'Reilly (via Okta)

Each subsection has a more a targeted list of resources should you feel the desire to dive deeper.

There is a ton of material out there that is of the very highest quality. A meta-skill to develop is knowing how to ask the right question so you can get the right answer with the right level of detail at the right time -- and working with git is a great way to hone this skill!

Per48edjes/git_gym__Lesson_1.md

Lesson 1: The Basics

Git vs. Github

Git: Objects & Pointers

Branching

Types of Branches

Additional Resources

Git: A Common Local Workflow

The Three "Rails" of Git

Demo: Staging is Amazing

Benefit #1: Productivity

Benefit #2: Safety

The Different diffs

Got Lost? Git Log!

Demo: Querying with git log

Additional Resources

Merging

What's Happening in a Merge?

Merge Conflicts

Additional Resources

Github: Your Code, Hosted

Pull Requests (PRs)

Merging PRs?

Additional Resources

Resources

The Different `diff`s

Demo: Querying with `git log`