Commit Hygiene: a treatise on the Whys and Hows

Introduction

Commits (or changesets/patches) are the effective results of a developer's coding efforts. They describe a delta, or difference in contents that all other developers using a given codebase should also apply to their copies of the codebase, keeping it moving forward towards a common goal.

It would seem apparent that there exist "better" and "worse" ways of creating, describing, and otherwise talking about these individual changes, but actually nailing down a consensus definition of "good" or "best" is often tricky. What follows here is one experienced developer's opinion of why it matters and what "good" means, along with some links to back that up.

0. Terminology

repo: short for "repository", this is a folder that contains a .git/ directory
- that directory is used by git to track the past and current state of the repository/folder
- the past state, or history, of the folder describes a series of repeatable modifications
commit: one of the repeatable modifications that make up a repo's history; each commit is uniquely identified by a hash value
- each commit usually has a single parent commit (merge commits are different, but ignore that for now), so the history of the repo can be accurately represented as a singly linked list where each commit is a node
branch: when 2 or more commits claim the same parent commit, you have yourself a branching situation
- there is traditionally a canonical 'main' branch which tracks the "real" history
local vs remote: when you clone a repository, you create (locally) a copy of the repo/root folder and its history.
- after cloning, there now "exist" THREE functional repositories:
  1. The first, your local, is implicitly updated as you work on it.
  2. The second, your local-copy-of-remote, is NOT updated until you "push" your changes (or "fetch" new ones from the server).
  3. The third, your remote-copy-of-local, is what lives on the server to track your branch.
HEAD: in a git context, HEAD means $most_recent_commit, i.e. the commit you currently have checked out (this is not a real variable name, it's just a helpful mental model). git checkout main could also be phrased "update $most_recent_commit to be equal to local_repo.branches.main.last()".
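To make these terms concrete, here is a minimal sketch that builds a throwaway repo and pokes at it. The folder, file name, and identity values are all made up for illustration; the git commands themselves are standard.

```shell
set -e
repo=$(mktemp -d)                        # throwaway folder for the example
cd "$repo"
git init -q                              # creates the .git/ directory
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity, local to this repo
git config user.email "dev@example.com"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "two" >> notes.txt
git commit -q -am "Second commit"

head_hash=$(git rev-parse HEAD)          # hash of the commit HEAD currently names
parent_hash=$(git rev-parse HEAD~1)      # its single parent
main_hash=$(git rev-parse main)          # the last commit on the main branch

echo "HEAD:   $head_hash"
echo "parent: $parent_hash"
echo "main:   $main_hash"
git log --format='%h (parent: %p) %s'    # walk the linked list, newest first
```

Since main is the checked-out branch here, HEAD and main resolve to the same hash, and following the parent pointer takes you one node back in the list.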

1. How to use branches

The GitHub flow is how most modern development goes, as it handles CI well:
1. main is always deployable
2. branch from main to work on a new feature
3. commit early and often to your branch during development
4. prepare your code for merge to main / open a pull request. (this is the important bit, I'll expand below)
5. merge to main only after review sign-off
6. deploy from main immediately after merge (delete named branch)

To expand: You may have noticed that I recommend "commit early and often"; but the overall theme of this treatise is "clean commits" - what gives?

The answer is that both approaches have their place, and can be used together quite harmoniously! The key is to recognize and utilize the difference between step 3 (active development) and step 4 (prep for merge). During development you commit 'early and often', with quick 'disposable' messages. When adding to the mainline branch, you want the commits to be concise and correctly formatted.
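As a rough sketch, the six steps map onto git commands something like this. The branch name new_feature is hypothetical, and the local merge is only a stand-in for clicking "Merge" on a reviewed pull request; step 4 is left as a comment because it is the subject of the rest of this treatise.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial deployable state"          # step 1: main is deployable

git checkout -q -b new_feature                       # step 2: branch from main
echo "half done" >> app.txt
git commit -q -am "wip"                              # step 3: early and often
echo "done" >> app.txt
git commit -q -am "wip: finish it"

# step 4: clean up the commits and open a pull request (see section 3)

git checkout -q main
git merge -q --no-ff -m "Merge new_feature" new_feature   # step 5: stand-in for merging the PR
git branch -q -d new_feature                         # step 6: delete the named branch
git log --oneline --graph
```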

2. Small, patchy changes

There are two main ideas I'd like you to take away from this treatise, and they both relate to the platonic ideal (at least, in my opinion) of "what to aim for when describing code changes you've made":

- smaller commits are better
- make patchy changes

Smaller commits are better than bigger ones

This is surprisingly not obvious to a lot of people, so it's worth calling out. As a general rule, a smaller, more focused commit will tend to be "better" than a larger, more complex one.

What do I mean by better? Okay, that's fair - I mean:

- it'll be easier to isolate any bad behavior that results from applying this change
- the commit message/change metadata can be far more focused and effective
- the code reviewer will never ask themselves "why is this file included in the diff?"

For even more reasons why smaller, more focused commits are better, see How focused commits make you a better coder.

At the extreme end of this - if your description of a given change includes the word "and", think about whether you could split that diff into multiple, more focused changes. Is that possibly too extreme? It could be! But the other extreme of "here's everything the ticket asked for including frontend changes, backend changes, and tests in a single commit" is probably not extreme enough. Where you land on this "how extreme" spectrum is something you should definitely talk about/agree upon with your fellow coders.

Let's pause for a moment so I can clarify something - when I say the other extreme of "here's everything the ticket asked for", I am not saying that it's inappropriate to MERGE all of those changes simultaneously, I'm saying it's less-than-ideal if you have the opportunity to DESCRIBE the changes individually and choose not to do so.

The analogy I like to use is that of integration and unit testing. Integration tests, like a "pull request" or "patch set", act upon a functional grouping of code. Unit tests, on the other hand, allow you to test each individual component of a functional grouping of code in isolation, much as small, focused, idealized commits allow you to refer to (-slash- revert, cherry-pick, bisect, changelog publish, etc.) each individual component of a changeset in isolation.
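Here is a small sketch of the "isolate bad behavior" payoff using git bisect. Everything in the toy repo is invented for illustration: five tiny commits, with the third one sneaking the word BROKEN into app.txt, and a grep standing in for a real test suite.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"

# Five small commits; the third one introduces the "bug".
for i in 1 2 3 4 5; do
  if [ "$i" -eq 3 ]; then
    echo "BROKEN" >> app.txt
  else
    echo "line $i" >> app.txt
  fi
  git add app.txt
  git commit -q -m "change $i"
done

first=$(git rev-list --max-parents=0 HEAD)       # the root commit, known to be good
git bisect start HEAD "$first" > /dev/null       # HEAD is bad, $first is good
# The check exits 0 (good) while BROKEN is absent and 1 (bad) once it appears.
git bisect run sh -c '! grep -q BROKEN app.txt' > /dev/null
culprit=$(git rev-parse refs/bisect/bad)         # first bad commit found by bisect
git bisect reset > /dev/null 2>&1                # return to main

git log -1 --format='first bad commit: %h %s' "$culprit"
```

Because each commit holds one small change, the commit that bisect lands on is the whole explanation; with one giant commit, bisect could only tell you "somewhere in here".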

Patchy changes

Consider large, long-lived software projects in the wild. When someone wants to modify the Linux kernel, do they open a PR by saying "I did a cool thing locally, then clicked 'Merge' and got this cool diff to give you!"? I mean, they could, but I doubt it would go well.

Here's the thing - no one cares about the 'state' of your local instance of the codebase, it's too much effort to sync all the changes that everyone makes all the time. What people do care about is how you propose to modify the codebase - what are the repeatable changes to make?

Instead of the above, here's a better example thought process: "I did a cool thing locally, then stopped to consider the effective delta between my branch and main. Here's a patch that describes how everyone can make these changes to their local repo too!"

Doing a thing once (making a change on your machine) is way less cool than being able to do it EVERYWHERE.
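One way to see the "repeatable change" idea in action is git format-patch plus git am, sketched below. The two folders (author and teammate), the file name, and the identities are all hypothetical; the point is only that the patch file carries the change, not the state of anyone's machine.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/author"
cd "$work/author"
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Author"            # placeholder identity
git config user.email "author@example.com"
echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"

# A teammate's copy of the codebase, taken before the new work exists.
git clone -q "$work/author" "$work/teammate"

# The author makes a change on a branch and describes it as a patch.
git checkout -q -b feature
echo "hello, world" > greeting.txt
git commit -q -am "Greet the whole world"
git format-patch --stdout main > "$work/feature.patch"   # every commit in main..HEAD

# The teammate applies the same change to their own copy.
cd "$work/teammate"
git config user.name "Teammate"          # placeholder identity
git config user.email "teammate@example.com"
git am -q "$work/feature.patch"
git log -1 --format='%an: %s'
```

The applied commit keeps the original author and message, which is exactly the metadata a focused commit is supposed to carry along.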

Semver

Both of these ideas come together in the semver spec. For many open source projects, a proposed patch will, once accepted, contain a keyword in the new commit message indicating "how big" the change is:

- Small bugfixes increment the final digit (PATCH version): 1.0.1 -> 1.0.2
- Medium-sized new features increment the middle digit (MINOR version): 1.0.1 -> 1.1.0
- Large, backwards-incompatible changes increment the first digit (MAJOR version): 1.0.1 -> 2.0.0

Once a patch is accepted, automated tooling can kick off a new build and publish a new version of the library - consumers are now aware how much testing they need to do for an upgrade, precisely because of how targeted the metadata around the patch is able to be.
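The arithmetic behind those three bumps is simple enough to sketch as a shell function. The name bump_version and its patch/minor/major argument are my own invention for illustration; how a given project maps commit-message keywords onto those levels depends on its tooling and is not shown here.

```shell
# bump_version CURRENT LEVEL, where LEVEL is patch, minor, or major (hypothetical helper)
bump_version() {
  # Split "MAJOR.MINOR.PATCH" on the dots.
  IFS=. read -r major minor patch <<EOF
$1
EOF
  case "$2" in
    major) major=$((major + 1)); minor=0; patch=0 ;;   # breaking change resets the rest
    minor) minor=$((minor + 1)); patch=0 ;;            # new feature resets PATCH
    patch) patch=$((patch + 1)) ;;                     # bugfix
    *) echo "unknown level: $2" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}

bump_version 1.0.1 patch   # prints 1.0.2
bump_version 1.0.1 minor   # prints 1.1.0
bump_version 1.0.1 major   # prints 2.0.0
```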

3. Implementation

Okay, small commits are great - how do I make them? Assume I'm not a platonically-ideal developer whose early-and-often commits lined up 1:1 with the patches I want to propose...

This is where a lot of "git literature" starts to get hairy, and it's because it's hard to prescribe a "right" methodology when there are 230934 perfectly viable ways to do, well, just about anything. And some of those "ways to do it" are scary! But I ask you this - what would you think of a professional carpenter who had never used a band saw because there was always a less dangerous way? Similarly, you, a professional software developer, should have knowledge and ability with tools that have the potential to be dangerous.

It's okay to make mistakes!

The biggest thing I try to tell everyone who's nervous about this comes from the wonderful think-like-a-git guide: creating a branch before you try a merge or a rebase is like saving your game before you battle the boss. Local branches are free, so git checkout -b efischer/tmp_20220312 and BAM!, now I can refer to both "the history of my codebase as it exists right now" (the previous branch I was working on) and "the ideal history of main after my changes are merged" (this new branch, during and after the "crazy git things").

At that point if something goes wrong - who cares?! The changes we made can still be referenced, so just check out a different temp branch and try again! You're learning!
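Here is a tiny sketch of the save-your-game idea. The repo and its three commits are invented, and the "experiment" is a deliberately destructive reset standing in for whatever crazy git thing you wanted to try.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
for n in 1 2 3; do
  echo "change $n" >> work.txt
  git add work.txt
  git commit -q -m "Change $n"
done

git checkout -q -b efischer/tmp_20220312   # save point: experiment here, not on main
git reset -q --hard HEAD~2                 # the experiment goes badly: two commits "lost"
git rev-list --count HEAD                  # prints 1

git checkout -q main                       # the original branch never moved
git branch -q -D efischer/tmp_20220312     # throw the failed attempt away
git rev-list --count HEAD                  # prints 3
```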

merge --squash

Here's an example of my "usual" development flow, which uses a tmp branch and merge --squash:

git checkout -b efischer/tmp; git fetch; git reset --hard origin/main
- switch to temp branch, set it to equal the most recent origin/main branch
git merge --squash efischer/awesome_feature
- take ALL the changes from my feature branch and stage them on the current branch as a single change (merge --squash stages the result but does not create a commit)
git reset
- unstage everything, but leave the file changes in the working folder
git add X; git commit -m "..."
- use git add --patch if you want to split more granularly than by file name
- repeat until every change is in a "focused" commit
- tmp branch now contains proposed patch set
git checkout efischer/awesome_feature; git reset --hard efischer/tmp
- go back to feature branch; set it equal to the newly fixed-up tmp branch
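Here is a runnable sketch of that flow in a throwaway repo, under a few assumptions: there is no remote, so local main stands in for origin/main and the git fetch step is skipped; the file names and commit messages are made up; and because git merge --squash stages the combined changes without creating a commit, the sketch unstages them with a plain git reset before re-committing piece by piece.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
echo "base" > README.txt
git add README.txt
git commit -q -m "Initial commit"

# A messy feature branch: two disposable commits touching two unrelated files.
git checkout -q -b efischer/awesome_feature
echo "api" > backend.txt
echo "ui" > frontend.txt
git add backend.txt frontend.txt
git commit -q -m "wip"
echo "api v2" > backend.txt
git commit -q -am "wip 2"

# Temp branch starting at main (stand-in for origin/main).
git checkout -q -b efischer/tmp main
git merge -q --squash efischer/awesome_feature   # stages everything, commits nothing
git reset -q                                     # unstage; changes stay in the working folder

# Re-commit the same changes as focused commits.
git add backend.txt
git commit -q -m "Add backend API"
git add frontend.txt
git commit -q -m "Add frontend UI"

# Point the feature branch at the cleaned-up history.
git checkout -q efischer/awesome_feature
old_tip=$(git rev-parse HEAD)                    # remember the messy tip for comparison
git reset -q --hard efischer/tmp

git log --format=%s main..HEAD                   # the proposed patch set, newest first
git diff --quiet "$old_tip" HEAD && echo "same final content, cleaner history"
```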

Proper rebasing

Another one to mention is git rebase -i. The full details of this powerful command are far beyond the scope of this document (seriously, check out think like a git above if you're interested though), but it's worth noting that IMO interactive rebases are THE way to modify changes you've already split.

Say you broke your work up into commits A, B, C, and D, but then noticed a problem with B's logic and made a fixup commit E to address it. rebase -i is exactly what you'll be looking for before pushing that all up to create a Pull Request.
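As a sketch of that A, B, C, D, E scenario: one way to do it (my choice here, not the only one) is to record E with git commit --fixup and let git rebase -i --autosquash fold it into B. Setting GIT_SEQUENCE_EDITOR=true accepts the generated todo list as-is, so the example runs without opening an editor; in real life you would normally look the list over. File and branch names are made up.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"

git checkout -q -b feature
for name in A B C D; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "Add $name"
  if [ "$name" = "B" ]; then
    commit_b=$(git rev-parse HEAD)       # remember B so the fixup can target it
  fi
done

# Commit E: a fix for B's logic, recorded as "fixup! Add B".
echo "B, fixed" > B.txt
git commit -q -a --fixup "$commit_b"

# Fold E into B; the todo list becomes: pick A, pick B, fixup E, pick C, pick D.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

git log --format=%s main..HEAD           # four commits again, with the fix inside B
```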

Be careful editing published history

It's also worth calling out as we delve into history-editing that many of these operations will leave you unable to simply git push; you'll need to force-push. This is because the remote can only fast-forward along the linked list, and the node it currently points at no longer exists in your rewritten history.

If you're going to be sharing changes with coworkers BEFORE they get merged, then a) be super careful and b) ask yourselves if this is really the ideal branching strategy -slash- division of work. Ideally, changes are reviewed, then merged before sharing.
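To see the rejection and the force-push for yourself, here is a sketch that uses a local bare repository as a stand-in for the server. The paths and branch name are hypothetical. I've used --force-with-lease rather than a bare --force; that is my suggestion rather than something this document prescribes, and it refuses to overwrite the remote branch if it has moved since you last fetched it.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"    # pretend server
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
git remote add origin "$work/origin.git"
echo "base" > README.txt
git add README.txt
git commit -q -m "Initial commit"
git push -q -u origin main

git checkout -q -b efischer/awesome_feature
echo "draft" > feature.txt
git add feature.txt
git commit -q -m "wip"
git push -q -u origin efischer/awesome_feature   # the messy commit is now published

git commit -q --amend -m "Add feature draft"     # rewrite the published commit

# A plain push is rejected: the remote cannot fast-forward to the new commit.
if git push -q origin efischer/awesome_feature 2>/dev/null; then
  plain_push=accepted
else
  plain_push=rejected
fi
echo "plain push: $plain_push"

git push -q --force-with-lease origin efischer/awesome_feature
```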

rebase --onto

If you do get into this scenario though, rebase --onto is magic. There's a top-tier StackOverflow answer that covers it in depth; the way I usually remember it is:

git rebase --onto <updated_base_branch.last()> <commits_to_abandon.last()>
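Here is a sketch of one scenario where that shape applies, with invented branch names: feature_b was built on top of a coworker's feature_a, and then feature_a landed on main as a single squashed commit. In the template above, main plays the role of updated_base_branch and the old feature_a tip is the last of the commits to abandon.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"
echo "base" > README.txt
git add README.txt
git commit -q -m "Initial commit"

git checkout -q -b feature_a
echo "a" > a.txt
git add a.txt
git commit -q -m "wip: feature A"

git checkout -q -b feature_b             # built on top of the unmerged feature_a
echo "b" > b.txt
git add b.txt
git commit -q -m "Add feature B"

# feature A lands on main as a brand-new, squashed commit.
git checkout -q main
git merge -q --squash feature_a
git commit -q -m "Add feature A"

# Replay only the commits after the old feature_a tip onto the updated main.
git checkout -q feature_b
git rebase -q --onto main feature_a

git log --format=%s main..HEAD           # prints only: Add feature B
```

Without --onto, a plain rebase onto main would also try to replay the old "wip: feature A" commit, which is exactly the history you wanted to leave behind.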

4. Good Commit Messages

This is one I'm just gonna point you to other references on; I have nothing to add to this definitional post. Click through for more details!

It's even possible (and strongly recommended/required for semver-style projects) to set up linters that will hook into git-provided hooks and tell you if your commit messages are legit at the time of writing them!
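As a rough illustration of the hook idea, here is a hand-rolled commit-msg hook. The two rules it enforces (no subject starting with "wip", and a subject of at most 72 characters) are assumptions I picked for the example, not rules taken from the post above, and a real project would more likely install an existing linter than write its own.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any git version
git config user.name "Example Dev"       # placeholder identity
git config user.email "dev@example.com"

# git runs .git/hooks/commit-msg with the path of the message file as $1;
# a non-zero exit aborts the commit.
mkdir -p .git/hooks
cat > .git/hooks/commit-msg <<'HOOK'
#!/bin/sh
subject=$(head -n 1 "$1")
case $subject in
  wip*|WIP*)
    echo "commit-msg: subject looks like a disposable WIP message" >&2
    exit 1 ;;
esac
if [ "${#subject}" -gt 72 ]; then
  echo "commit-msg: subject is longer than 72 characters" >&2
  exit 1
fi
HOOK
chmod +x .git/hooks/commit-msg

echo "one" > a.txt
git add a.txt
if git commit -q -m "wip" 2>/dev/null; then
  wip_commit=accepted
else
  wip_commit=rejected
fi
echo "wip commit: $wip_commit"

git commit -q -m "Add a.txt with the first entry"    # passes both checks
```

Hooks in .git/hooks are local to each clone, so a shared setup needs some extra tooling to distribute them.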

Thanks for reading, go and code up some commits!

efischer19/On Clean Commits.md