Cleaning big git repositories

I like to set up semi-automatic git repositories for every analysis I work on. Because they are semi-automatic (I don’t add files one by one but instead exclude the files I don’t want to track), I sometimes add a huge file by accident. And because I run those semi-automatic commits from crontab, the huge file gets buried deep in the commit history. If you happen to be lazy like me, it’s worth checking the sizes of your git repos from time to time (or just look in /var/mail/username for git commit errors about repo size). This is the easiest way to scan and clean repos I have found so far.

Don’t waste your time with git filter-branch (with --tree-filter or --index-filter) and just go straight for BFG Repo-Cleaner.

Install BFG Repo-Cleaner:

# Get sbt (the Scala build tool, needed to build bfg)
$ cd ~/tools/
$ wget https://github.com/sbt/sbt/releases/download/v1.5.5/sbt-1.5.5.zip -O sbt-1.5.5.zip
$ unzip sbt-1.5.5.zip
$ mv sbt sbt-1.5.5
$ cd sbt-1.5.5/
$ ln -s "$(pwd)/bin/sbt" ~/bin/sbt # Link the sbt launcher into your bin

# Install bfg; follow the instructions https://github.com/rtyley/bfg-repo-cleaner/blob/master/BUILD.md
$ cd ~/tools/
$ wget https://github.com/rtyley/bfg-repo-cleaner/archive/refs/tags/v1.14.0.tar.gz -O bfg-repo-cleaner-1.14.0.tar.gz
$ tar xvzf bfg-repo-cleaner-1.14.0.tar.gz
$ cd bfg-repo-cleaner-1.14.0/
$ sbt # <- start the sbt console
sbt: bfg/assembly # <- download dependencies, run the tests, build the jar

# Create a wrapper script in your bin directory:
# Paste this (without the ">" symbol) into ~/bin/bfg
$ nano ~/bin/bfg
#   > #!/bin/bash
#   > java -jar /home/yourusername/tools/bfg-repo-cleaner-1.14.0/bfg/target/bfg-1.14.0-unknown.jar "$@"
$ chmod +x ~/bin/bfg # Make the wrapper executable

# Test all is working as it should
$ bfg --help

You will also need a git object size scanner. The easiest option is either to use the main idea from here or the method from the comment by kenorb (that’s what I am using).
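
For readers without a scanner at hand, here is a minimal sketch of one way to list the largest blobs in a repository's history. It is not the script from the comment mentioned above; the function name `git_big_files` and its default of 10 entries are assumptions made for illustration. It relies only on `git rev-list` and `git cat-file --batch-check`.

```shell
# Hypothetical helper, not the exact script from the linked comment.
# Prints the N largest blobs in the whole history as "<bytes> <path>".
git_big_files() {
    local count="${1:-10}"
    # 1. list every object reachable from any ref together with its path
    # 2. ask git for the type and size of each object
    # 3. keep blobs only, sort by size (descending), keep the top N
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -k1,1nr |
        head -n "$count"
}
```

Run inside a repository, `git_big_files 20` prints the 20 largest blobs. Saving the same pipeline as an executable script named `git-big-files` somewhere on your PATH should make it available as `git big-files`.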

Once you have BFG and the size scanner and you know they work, you can start cleaning. Be extremely careful because the following steps remove the objects/directories from the git repo completely. Once they are gone, they cannot be recovered.
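
Since the deletion cannot be undone, it may be worth copying the whole repository aside first. The following is only a sketch: the function name `backup_repo` and the `.backup-<timestamp>` suffix are assumptions for illustration, not part of BFG or git.

```shell
# Hypothetical helper: copy a repository (working tree and .git) to a
# timestamped sibling directory before rewriting history.
backup_repo() {
    local repo="${1:-.}"
    local src dest
    # Resolve the repository to an absolute path; fail if it does not exist
    src="$(cd "$repo" && pwd)" || return 1
    dest="${src}.backup-$(date +%Y%m%d-%H%M%S)"
    # -a preserves permissions, timestamps and symlinks
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}
```

Calling `backup_repo ~/projects/directory-with-huge-git-repo` prints the path of the copy, which can be deleted once the cleaned repository has been checked.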

# Find big .gits
$ cd ~/
$ find . -type d -name ".git" -prune -exec du -sh {} \;

# Go to the big repo and check which files are big
$ cd ~/projects/directory-with-huge-git-repo/
$ git big-files

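# Check the big files (or directories), add them to .gitignore so they are not added again in the future,
# stop tracking them (by default BFG does not touch files in the latest commit), and commit the changes
$ nano .gitignore
$ git rm -r --cached path/to/big-file # placeholder path; the file stays on disk
$ git add .
$ git commit -m "Updated .gitignore before cleaning with BFG"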

# Repack blobs for faster cleaning
$ git gc --prune=now --aggressive
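
With many repositories, the unsorted `du -sh` listing from the first step can be hard to scan. A possible variant that sorts the `.git` directories by size is sketched below; the function name `biggest_git_dirs` and its defaults are assumptions for illustration.

```shell
# Hypothetical helper: list .git directories under a root, largest first.
# Sizes are printed in kilobytes (du -sk) so that sort -n can order them.
biggest_git_dirs() {
    local root="${1:-.}"
    local count="${2:-10}"
    find "$root" -type d -name ".git" -prune -exec du -sk {} + |
        sort -k1,1nr |
        head -n "$count"
}
```

For example, `biggest_git_dirs ~ 5` shows the five largest `.git` directories under the home directory.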

And now you are ready for the BFG cleaning. Again, be extremely careful!

# Remove files bigger than x megabytes
$ bfg --strip-blobs-bigger-than 10M .git
# or remove files by name (quote the pattern so the shell does not expand it)
$ bfg --delete-files 'test.*.tab' .git
# or folders
$ bfg --delete-folders '*bckp' .git

# Clean and finalize the deletion
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

# Add changes, commit and force-push to master
$ git add .
$ git commit -m "Commit after cleaning with BFG"
$ git push -f origin master

# Confirm the repo is nice and small
$ du -sh .git
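
Besides the directory size, one can also check that no oversized blob is left anywhere in the history. This is a sketch under the assumption that a byte limit is a good enough criterion; the function name `no_blobs_larger_than` is made up for illustration. Note that files still present in the latest commit are protected by BFG by default and would show up here.

```shell
# Hypothetical check: succeed only if no blob reachable from any ref is
# larger than the given number of bytes; otherwise print the offenders.
no_blobs_larger_than() {
    local limit="$1"
    local offenders
    offenders=$(git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk -v limit="$limit" '$1 == "blob" && $2 > limit')
    if [ -n "$offenders" ]; then
        printf '%s\n' "$offenders"
        return 1
    fi
}
```

To match the 10M threshold used above, call it as `no_blobs_larger_than $((10 * 1024 * 1024))`.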
