This gist details the following:
- Converting a Subversion (SVN) repository into a Git repository
  - Retrieving a list of SVN commit usernames
  - Matching SVN usernames to email addresses
  - Migrating to Git using the git svn clone command
- Purging the resultant Git repository of large files
An SVN commit records only the committer's username. Git records more detail: at the very least, a Git commit author needs both a name and an email address. Since the email address is not available in SVN, it has to be matched to each username manually.
The first step is therefore to list every username recorded by SVN. Run the following command inside an SVN working copy; it writes the usernames to a file called authors.txt:
svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors.txt
Each line of authors.txt has the following format:
jwilkins = jwilkins <jwilkins>
Each line needs to be edited by hand so that it carries the author's full name and email address, like this:
jwilkins = John Albin Wilkins <[email protected]>
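If the names and addresses already exist in some exported list, the editing can be scripted. The following is a minimal sketch, not part of the original workflow: the function name `fill_authors` and the lookup file `users.tsv` (one `username<TAB>Full Name<TAB>email` entry per line) are assumptions made for illustration. Usernames missing from the lookup file are left unchanged so they can still be fixed by hand.

```shell
# Hypothetical helper: rewrite placeholder entries in authors.txt using a
# tab-separated lookup file. Both the helper and the lookup format are
# assumptions for this sketch.
fill_authors() {
  # $1: authors.txt as generated above ("user = user <user>")
  # $2: lookup file with "user<TAB>Full Name<TAB>email" per line
  awk -F '\t' '
    # First file on the command line: remember name and email per username.
    FILENAME == ARGV[1] { name[$1] = $2; email[$1] = $3; next }
    {
      # Second file: the username is everything before the first " = ".
      user = $0
      sub(/ = .*$/, "", user)
      if (user in name) print user " = " name[user] " <" email[user] ">"
      else              print $0   # unknown user: keep the placeholder line
    }
  ' "$2" "$1"
}
```

A possible invocation would be `fill_authors authors.txt users.tsv > authors-filled.txt`, after which authors-filled.txt is the file to pass to `--authors-file`.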
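Create a folder to hold the Git clone, change into it, and run the following, replacing <svn_repo> with the URL of the SVN repository. The --stdlayout option assumes the repository uses the standard trunk/branches/tags layout: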
git svn clone --stdlayout --authors-file=path/to/authors.txt <svn_repo>
This last step may take some time, but it should result in a Git repo.
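Once the clone finishes, it can be worth checking that no commit still carries a placeholder author. The sketch below assumes that a placeholder entry shows up as an "email" without an `@` sign, which is what the generated `user = user <user>` lines produce; the function name `list_unmatched_authors` is made up for this example.

```shell
# Hypothetical helper: read "Name <email>" lines on stdin and print each
# distinct line whose email part contains no "@".
list_unmatched_authors() {
  sort -u | awk '{ email = $NF; if (email !~ /@/) print }'
}
```

Inside the new Git repo it could be fed with `git log --all --format='%an <%ae>' | list_unmatched_authors`. Empty output suggests that every author was matched.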
## Find And Purge Large Files From Git History
Git hosting services are stricter than SVN about large files; GitHub, for example, rejects pushes containing files larger than 100 MB. To migrate an SVN repository to Git, one may therefore need to purge such files from the Git history.
Go to the newly created Git repo and run the following:
git rev-list --objects --all | sort -k 2 > allfileshas.txt;git gc && git verify-pack -v .git/objects/pack/pack-*.idx | grep -E "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt
This will result in two files:
- allfileshas.txt - a list of all object SHAs in the Git repo, each followed by its path
- bigobjects.txt - a list of the blob SHAs in the pack files, sorted by size from largest to smallest
To combine these two files into a list of file names sorted by size in descending order:
for SHA in `cut -f 1 -d\ < bigobjects.txt`; do echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print$1,$3,$7}' >> bigtosmall.txt; done
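NOTE: The above loop greps allfileshas.txt once per object, so it can take a very long time on a large repository. Because bigobjects.txt is already sorted from largest to smallest, the largest files are written first; after about 2 minutes you can stop the loop with Ctrl-C and work with what has been written so far.
The resulting file, bigtosmall.txt, lists one object per line (SHA, size in bytes, and file name), sorted from largest to smallest.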
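The loop above rescans allfileshas.txt for every object, which is why it is slow. One possible alternative is a single awk pass that loads allfileshas.txt into memory and then walks bigobjects.txt in order. This is a sketch rather than the original author's method; the function name `join_big_objects` is an assumption. Unlike the loop, it keeps paths containing spaces intact.

```shell
# Hypothetical helper: join bigobjects.txt with allfileshas.txt in one pass.
join_big_objects() {
  # $1: bigobjects.txt  (SHA, type, size, size-in-pack, offset)
  # $2: allfileshas.txt (SHA followed by the path)
  awk '
    # First file on the command line (allfileshas.txt): map SHA -> path.
    FILENAME == ARGV[1] {
      sha = $1
      sub(/^[^ ]+ ?/, "")   # strip the SHA, keep the rest of the line
      path[sha] = $0
      next
    }
    # Second file (bigobjects.txt): print SHA, size and path in input order.
    ($1 in path) { print $1, $3, path[$1] }
  ' "$2" "$1"
}
```

It could be run as `join_big_objects bigobjects.txt allfileshas.txt > bigtosmall.txt`. The output keeps the order of bigobjects.txt, so it is still sorted from largest to smallest.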
Select the files (or even a whole directory) from bigtosmall.txt that you want purged. Then run the following for each one, substituting MY-BIG-DIRECTORY-OR-FILE with the directory or file that is to be purged:
git filter-branch -f --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch MY-BIG-DIRECTORY-OR-FILE' --tag-name-filter cat -- --all
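Each run of the command above rewrites the whole history, so purging many paths one at a time is slow. Since `git rm` accepts several paths at once, the index filter can be built from a list instead. The sketch below assumes a hand-made file, here called purge-list.txt, with one path per line; that file name and the function name `build_index_filter` are hypothetical. `git filter-branch` evaluates the filter string in a shell, so each path is single-quoted to survive spaces and quotes.

```shell
# Hypothetical helper: print a "git rm" command covering every path listed
# in the given file (one path per line, blank lines ignored).
build_index_filter() {
  printf 'git rm -rf --cached --ignore-unmatch'
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # Wrap the path in single quotes, escaping embedded single quotes.
    printf " '%s'" "$(printf '%s' "$path" | sed "s/'/'\\\\''/g")"
  done < "$1"
  printf '\n'
}
```

A single pass could then look like `git filter-branch -f --prune-empty --index-filter "$(build_index_filter purge-list.txt)" --tag-name-filter cat -- --all`.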