Extracting a sub project into a new repository (and how mso-dumper got its new home).


A short while ago I extracted our mso-dumper project from LibreOffice’s build repository into a brand new repository created just for it. The new repository was to be located in libreoffice/contrib/mso-dumper.

Originally, this project started out as a simple subdirectory of a much larger parent repository. But because it grew so much, and because its scope is not entirely in line with that of the parent repository, I decided it was best to move it into a repository of its own. Now, it’s easy to transfer a subset of files from one repository to another if you don’t mind losing their history, but I wanted to preserve the history of those files across the transition.

It turns out that there is a way to do this with git. Kendy suggested that I look into git filter-branch, so I did. After a few hours of research and trial and error (and some bash script writing which was later thrown away), I came to realize that all of this can be achieved in the following simple steps.


First, clone the whole build repository which contains the sub project to be extracted:

git clone path/to/libo/build mso-dumper-temp

Once done, cd into that cloned repository, and run

git filter-branch --subdirectory-filter scratch/mso-dumper/ -- --all

which removes all files from the git history except for those under the scratch/mso-dumper directory, and moves the files in that directory up to the top level. You may also want to run

git remote rm origin

to prevent accidentally pushing the rewritten history to the remote origin during these steps. Anyway, once the filtering is done, remove all tags with

git tag | xargs git tag -d

That takes care of the history. Now you have only the files you want to keep, they are sitting happily at the top level like they should, all of their commit records are preserved, and you don’t have any old tags you don’t need for the new repository.
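To rehearse these steps before touching the real repository, here is a minimal sketch that runs them against a throwaway repository. Everything specific in it is an assumption made for illustration: the names parent, scratch/tool and extracted are hypothetical stand-ins for the build repository, the sub project and the temporary clone, and FILTER_BRANCH_SQUELCH_WARNING only matters on newer git versions, which print a warning and pause before running filter-branch.

```shell
#!/bin/sh
# Sketch: rehearse the extraction on a throwaway repository.
# All repository and directory names below are hypothetical.
set -eu

# Identity for the demo commits, so the script does not depend on global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git versions warn and pause before filter-branch; this skips the pause.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"

# A stand-in for the large parent repository: one sub project, one unrelated file.
git init -q parent
cd parent
mkdir -p scratch/tool other
echo one > scratch/tool/a.txt
echo x > other/b.txt
git add .
git commit -q -m "add tool and unrelated file"
echo two >> scratch/tool/a.txt
git commit -q -am "update tool"
echo y >> other/b.txt
git commit -q -am "update unrelated file"
git tag v1
cd ..

# The steps from the post, applied to the stand-in.
git clone -q parent extracted
cd extracted
git remote rm origin
git filter-branch --subdirectory-filter scratch/tool/ -- --all
git tag | xargs git tag -d

# Inspect what is left: commits, files at the top level, remaining tags.
commit_count=$(git rev-list --count HEAD)
top_level=$(git ls-files)
tag_count=$(git tag | wc -l | tr -d ' ')
echo "commits: $commit_count, files: $top_level, tags: $tag_count"
```

In this rehearsal only the two commits that touched scratch/tool survive, a.txt ends up at the top level, and no tags remain.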

This is not over yet, though. At this point, the git repo still stores the objects of the removed files; git filter-branch keeps a backup of the original refs under refs/original/, so the old objects are all still reachable. In fact, the size of the .git directory of this new repo was more than twice the size of the .git directory of the original build repo! To prune this unnecessary data and shrink the repository, run

git clone file:///path/to/mso-dumper-temp mso-dumper

to clone it once more into another local repo, leaving the unnecessary blobs behind. Note that I used the file:///… style path, as opposed to the usual /path/to/foo style path. With a file:///… URL, git goes through its regular transfer mechanism and copies only the objects that are reachable from the cloned refs, so the objects of the removed files are not cloned and the object store shrinks significantly (in my case the clone was faster too). With a regular /path/to/foo style path, git hard-links all the object files, so the size stays the same.
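The difference between the two path styles is easy to check on a throwaway repository. The sketch below is only an illustration under stated assumptions: the repository names (src, by-path, by-url) and the total_objects helper are made up for the demo, and it counts objects rather than measuring disk usage.

```shell
#!/bin/sh
# Sketch: compare a plain-path clone with a file:// clone of the same repo.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: loose plus packed object count of the repo at $1.
total_objects() {
    git -C "$1" count-objects -v |
        awk '$1 == "count:" || $1 == "in-pack:" { n += $2 } END { print n }'
}

work=$(mktemp -d)
cd "$work"

# Source repo with one commit we keep and one we abandon.
git init -q src
cd src
echo keep > keep.txt
git add keep.txt
git commit -q -m "file that stays"
echo big > big.bin
git add big.bin
git commit -q -m "file that goes away"
# Drop the second commit; its objects stay in .git but no branch or tag points at them.
git reset -q --hard HEAD~1
cd ..

# Plain path: the object store is hard-linked (or copied) as a whole.
git clone -q src by-path
# file:// URL: only objects reachable from the cloned refs are transferred.
git clone -q "file://$work/src" by-url

path_objects=$(total_objects by-path)
url_objects=$(total_objects by-url)
echo "plain path clone: $path_objects objects"
echo "file:// clone:    $url_objects objects"
```

The file:// clone ends up with only the commit, tree and blob of the surviving commit, while the plain-path clone still carries the abandoned objects.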

After the second clone, the size of my .git directory shrank from 280MB to 384KB! So it does make a big difference. Now all that’s left to do is to push this repository to the new remote location. Easy, huh? :-)
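One detail to watch when pushing: after the second clone, origin points at the temporary repository, not at the new home. A minimal sketch of repointing it and pushing follows, using a local bare repository as a stand-in for the real remote. The names source and new-home.git are hypothetical, and the real remote URL would go where the bare repository's path is used here.

```shell
#!/bin/sh
# Sketch: repoint origin at the new remote and push the current branch.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the temporary filtered repository.
git init -q source
(cd source && echo hi > a.txt && git add a.txt && git commit -q -m "initial")

# Stand-in for the compacted repository produced by the second clone.
git clone -q "file://$work/source" mso-dumper

# Stand-in for the new, empty remote (hypothetical location).
git init -q --bare new-home.git

cd mso-dumper
branch=$(git symbolic-ref --short HEAD)
# origin still points at the temporary repository; repoint it before pushing.
git remote set-url origin "$work/new-home.git"
git push -q origin "$branch"

# Check that the commit arrived and that origin now names the new home.
pushed=$(git -C "$work/new-home.git" rev-list --count "$branch")
origin_url=$(git remote get-url origin)
echo "pushed $pushed commit(s) to $origin_url"
```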

But there was a gotcha…

This method apparently does not preserve the whole history of the relocated files if the sub-directory itself had been renamed. The mso-dumper directory was renamed from its original name sc-xlsutil in order to accommodate the ppt dumper that Thorsten wrote. Since the filter only looks at the path you give it, git filter-branch --subdirectory-filter did not preserve the history from before the directory rename. That was just a minor issue, though, and not something I was too concerned about for this particular transition.
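The effect of a directory rename can be reproduced on a small scale. The following sketch is an illustration only, with hypothetical directory names (old-name, new-name) standing in for the renamed sub project: it counts the commits that touched either name in the parent, then counts what is left after filtering on the new name.

```shell
#!/bin/sh
# Sketch: show that --subdirectory-filter drops history from before a rename.
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git versions warn and pause before filter-branch; this skips the pause.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"

# Parent repo where the sub project directory is renamed midway.
git init -q parent
cd parent
mkdir old-name
echo one > old-name/a.txt
git add .
git commit -q -m "add under old name"
git mv old-name new-name
git commit -q -m "rename directory"
echo two >> new-name/a.txt
git commit -q -am "edit under new name"

# Commits in the parent that touched the sub project under either name.
before=$(git rev-list --count HEAD -- old-name new-name)
cd ..

git clone -q parent extracted
cd extracted
git remote rm origin
git filter-branch --subdirectory-filter new-name -- --all

# Commits that survive the filter; the one made under the old name is gone.
after=$(git rev-list --count HEAD)
echo "commits touching the sub project: $before, commits after filtering: $after"
```

Running the first rev-list command with both directory names against the real parent repository, before extracting, is a cheap way to see how much history predates the rename.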