@Wirone
Last active August 21, 2025 07:56
Linearize Git history while preserving authorship (more or less) from merged branches

This gist shows how to linearize a complex Git history that contains many merge commits, so that it looks as if a "squash on merge" policy had been used from the beginning of the repository's history, while rewriting commit authorship in a standardised way.

The script checks whether each commit is a merge commit and, if so, modifies its author and parents:

  • the merge commit becomes a "regular" commit with only 1 parent
  • authorship for the ex-merge commit is taken from the last commit of the merged branch (the 2nd parent, which is discarded). This assumes that a single person worked on the branch, so it may produce misleading blame information if several people authored commits on it: all changes from the branch are attributed to the author of its last commit. In our case this was needed because merge commits are automated and authored by a bot user, so without this little hack blame would show the bot as the author of every line.
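Because the single-author assumption decides whether blame stays meaningful, it may be worth listing the merged branches that violate it before rewriting anything. The sketch below is one possible way to do that and is not part of the original scripts: all function names are hypothetical, it assumes two-parent merges, and the Git-calling functions assume they are run inside the repository.

```python
import subprocess


def branch_authors(merge_sha):
    """Authors of commits reachable from the merge's 2nd parent but not from its 1st."""
    out = subprocess.check_output(
        ["git", "log", "--format=%an <%ae>", f"{merge_sha}^1..{merge_sha}^2"],
        text=True,
    )
    return [line for line in out.splitlines() if line]


def mixed_author_merges(authors_by_merge):
    """Keep only the merges whose branch has more than one distinct author."""
    return {
        sha: sorted(set(authors))
        for sha, authors in authors_by_merge.items()
        if len(set(authors)) > 1
    }


def scan_repository():
    """Run the check on every merge commit reachable from any ref (needs a Git repo)."""
    merges = subprocess.check_output(
        ["git", "rev-list", "--merges", "--all"], text=True
    ).split()
    return mixed_author_merges({sha: branch_authors(sha) for sha in merges})
```

Calling `scan_repository()` returns a dict of merge SHAs mapped to the distinct authors found on the merged branch. Each entry is a merge where the rewritten authorship would hide at least one contributor.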

The main idea of this script is to reduce the size of the repository's commit graph, which leads to faster clones (and other Git operations).

Usage: place both files (linearize_callback.py and linearize_git_history.sh) in the root of your repository and run ./linearize_git_history.sh (don't forget chmod +x).

Note

Requires git-filter-repo
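A quick pre-flight check can confirm the requirement before any history is touched. This is a minimal sketch, and `check_prerequisites` is a hypothetical helper name. It assumes `git-filter-repo` is installed on `PATH` (it could also live in Git's exec path, which this check does not cover) and that the interpreter running the check is the same one `git filter-repo` will use.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which prerequisites of the linearization script are available."""
    return {
        # The `git` executable itself.
        "git": shutil.which("git") is not None,
        # The `git filter-repo` subcommand, looked up here as `git-filter-repo` on PATH.
        "git-filter-repo": shutil.which("git-filter-repo") is not None,
        # The callback file does `import git_filter_repo`, so the module must be importable.
        "git_filter_repo module": importlib.util.find_spec("git_filter_repo") is not None,
    }


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```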

Warning

DESTRUCTIVE OPERATION! It rewrites Git history. Use at your own risk, and only if you know what you're doing.
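Given that the rewrite cannot be undone in place, one cautious option is to take a mirror clone first. The sketch below is only a suggestion and not part of the original scripts: the helper names and the default backup path `../repo-backup.git` are hypothetical.

```python
import subprocess
from pathlib import Path


def backup_command(repo_dir, backup_dir):
    """Build the `git clone --mirror` command that copies all refs to backup_dir."""
    return ["git", "clone", "--mirror", str(repo_dir), str(backup_dir)]


def make_backup(repo_dir=".", backup_dir="../repo-backup.git"):
    """Create a mirror clone of repo_dir; refuses to overwrite an existing backup."""
    if Path(backup_dir).exists():
        raise FileExistsError(f"{backup_dir} already exists")
    subprocess.run(backup_command(repo_dir, backup_dir), check=True)
```

Running `make_backup()` from the repository root before the linearization leaves an untouched copy of every branch and tag next to the working repository.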

Consider an example repository with the following tree:

*   a7eed66b - Merge D - Greg (HEAD -> main)
|\
| * 98331fc7 - D - Jack
|/
*   d1202c56 - Merge C - Greg
|\
| * 59acc515 - C plus - Jill
| * 5bf839a3 - C - Jack
|/
*   2ea8ca3f - Merge B - John
|\
| * ab229bb9 - B plus - Greg
| * 8c1275ea - B - Greg
|/
* 36c97e8b - Initial commit - Greg

The script should achieve the following:

  • commit 2ea8ca3f “Merge B” takes authorship from commit ab229bb9 (overrides John with Greg)
  • commit d1202c56 “Merge C” takes authorship from commit 59acc515 (overrides Greg with Jill)
  • commit a7eed66b “Merge D” takes authorship from commit 98331fc7 (overrides Greg with Jack)

Ultimately, the resulting repository should look like this:

* ???????? - Merge D - Jack (HEAD -> main)
* ???????? - Merge C - Jill
* ???????? - Merge B - Greg
* ???????? - Initial commit - Greg
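The transformation on the example history can be reproduced without touching Git at all. The following is a toy simulation under stated assumptions, not the script itself: the `Commit` dataclass and the `linearize` function are hypothetical, and the parent links are read off the example tree above.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    message: str
    author: str
    parents: list  # parent SHAs, first parent first


# The example history, keyed by abbreviated SHA.
history = {
    "36c97e8b": Commit("Initial commit", "Greg", []),
    "8c1275ea": Commit("B", "Greg", ["36c97e8b"]),
    "ab229bb9": Commit("B plus", "Greg", ["8c1275ea"]),
    "2ea8ca3f": Commit("Merge B", "John", ["36c97e8b", "ab229bb9"]),
    "5bf839a3": Commit("C", "Jack", ["2ea8ca3f"]),
    "59acc515": Commit("C plus", "Jill", ["5bf839a3"]),
    "d1202c56": Commit("Merge C", "Greg", ["2ea8ca3f", "59acc515"]),
    "98331fc7": Commit("D", "Jack", ["d1202c56"]),
    "a7eed66b": Commit("Merge D", "Greg", ["d1202c56", "98331fc7"]),
}


def linearize(history, head):
    """Return [(message, author), ...] from head back to the root, first parents only."""
    result = []
    sha = head
    while sha is not None:
        commit = history[sha]
        author = commit.author
        if len(commit.parents) > 1:
            # Merge commit: take authorship from the tip of the merged branch (2nd parent).
            author = history[commit.parents[1]].author
        result.append((commit.message, author))
        # Follow only the first parent; the merged branch's own commits drop out.
        sha = commit.parents[0] if commit.parents else None
    return result


for message, author in linearize(history, "a7eed66b"):
    print(f"* {message} - {author}")
```

The loop prints the four remaining commits with Jack, Jill, Greg and Greg as authors, matching the target history. Note that Jack's commit "C" disappears behind Jill's name, which is the blame caveat mentioned earlier.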
# linearize_callback.py
import git_filter_repo as fr
import subprocess
import sys

# We need this because `git-filter-repo` does not provide parents' SHAs, only int IDs.
# The list is built from `git log`; the order of the commits seems to match what GFR does.
commit_id_to_hash_map = []


def load_map():
    """Load all commit SHAs, oldest first; the lookup below assumes ID n sits at index n-1."""
    global commit_id_to_hash_map
    try:
        sha_list = subprocess.check_output(
            ['git', 'log', '--all', '--reverse', '--pretty=format:%H'],
            universal_newlines=True
        ).strip()
        commit_id_to_hash_map = sha_list.split()
    except subprocess.CalledProcessError:
        print("Loading ID-SHA map failed", flush=True)
        sys.exit(1)


def process_commit(commit):
    # Check if the commit is a merge commit (has more than 1 parent)
    if len(commit.parents) > 1:
        # Get the original hash of the second parent (the merged branch's last commit) using the pre-generated map
        merged_parent_original_hash = commit_id_to_hash_map[commit.parents[1] - 1]

        # Get the author of the second parent using a subprocess call (use `[###]` as a separator
        # between name and email so we're 100% sure we get proper authorship data later)
        try:
            author_info = subprocess.check_output(
                ['git', 'show', '--no-patch', '--format=%an[###]%ae', merged_parent_original_hash],
                universal_newlines=True
            ).strip().split('[###]')
        except subprocess.CalledProcessError:
            # Abort rather than silently keep the merge author (e.g. the bot user)
            print(f"Error: could not get author for commit {merged_parent_original_hash}. Aborting.", flush=True)
            sys.exit(1)

        print(f"Rewriting {commit.original_id} with author '{author_info[0]} <{author_info[1]}>' from {merged_parent_original_hash}")

        # Set the author of the new, single-parent commit to the author of the last commit from the merged branch
        commit.author_name = author_info[0].encode()
        commit.author_email = author_info[1].encode()

        # Linearize the history by keeping only the first parent
        commit.parents = [commit.parents[0]]


load_map()
#!/bin/sh
# linearize_git_history.sh
git filter-repo --commit-callback '
import sys
import os
sys.path.append(os.getcwd())
import linearize_callback
return linearize_callback.process_commit(commit)
' --force
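After the rewrite finishes, it can be reassuring to confirm that no merge commits are left. The check below is a small sketch rather than part of the original scripts: the helper names are hypothetical, and it assumes `git` is on `PATH`.

```python
import subprocess


def count_merges(rev_list_output):
    """Parse the output of `git rev-list --merges --count`."""
    return int(rev_list_output.strip())


def remaining_merges(repo_dir="."):
    """Number of merge commits still reachable from any ref in repo_dir."""
    out = subprocess.check_output(
        ["git", "-C", repo_dir, "rev-list", "--merges", "--count", "--all"],
        text=True,
    )
    return count_merges(out)
```

A result of `0` from `remaining_merges()` means every former merge commit now has a single parent. Any other number points at merges the callback did not rewrite.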