Extracting a sub project into a new repository (and how mso-dumper got its new home).

Background

Just a short while ago I worked on extracting our mso-dumper project from LibreOffice’s build repository, into a brand new repository created just for this. The new repository was to be located in libreoffice/contrib/mso-dumper.

Originally, this project started out just as a simple sub directory of a much larger parent repository. But because it grew so much, and because its scope is not entirely in line with that of the parent repository, I decided it was best to move this project into a repository of its own. Now, it’s easy to transfer a subset of files from one repository to another if you don’t mind losing its history, but I wanted to preserve the history of those files even after the transition.

It turns out that there is a way to do this with git. Kendy suggested that I look into git filter-branch, so I did. After a few hours of research and trial and error (and some bash script writing that was later thrown away), I came to realize that all of this can be achieved in the following simple steps.

Steps

First, clone the whole build repository that contains the sub project to be extracted:

git clone path/to/libo/build mso-dumper-temp

Once done, cd into that cloned repository, and run

git filter-branch --subdirectory-filter scratch/mso-dumper/ -- --all

which will remove all files from the git history except for those under the scratch/mso-dumper directory, and re-locate those files under that directory into the top-level directory. You may also want to run

git remote rm origin

to prevent accidental pushing of this to the remote origin during these steps. Anyway, once the filtering is done, remove all tags by

git tag | xargs git tag -d

That takes care of the history rewrite. Now you have only the files you want to keep, they are sitting happily at the top level like they should, all of their commit records are preserved, and you don’t have any old tags you don’t need for the new repository.
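To try these steps without touching a real repository, here is a minimal sketch that builds a throwaway repository and runs the same commands on it. The directory names (parent, other), the file names and the tag v1 are made up for the demo; only the git commands in the second half come from the steps above.

```shell
#!/usr/bin/env bash
set -e

# Everything happens in a scratch directory; all names below are made up.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Identity for the demo commits, so no global git config is needed.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Newer git versions warn and pause before filter-branch runs; skip that.
export FILTER_BRANCH_SQUELCH_WARNING=1

# A stand-in for the big parent repository.
git init -q parent
cd parent
mkdir -p scratch/mso-dumper other
echo "dumper v1" > scratch/mso-dumper/dump.py
echo "unrelated" > other/readme.txt
git add .
git commit -q -m "initial import"
echo "dumper v2" > scratch/mso-dumper/dump.py
git commit -q -am "improve the dumper"
git tag v1
cd ..

# The steps from the text: clone, filter, drop the remote, drop the tags.
git clone -q parent mso-dumper-temp
cd mso-dumper-temp
git filter-branch --subdirectory-filter scratch/mso-dumper/ -- --all
git remote rm origin
git tag | xargs git tag -d

# dump.py should now sit at the top level with its commits intact.
ls
git log --oneline
```

If the filter worked as described, the listing shows only dump.py, and the log still shows both commits that touched it.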

We are not quite done yet, though. At this point, this git repo still stores the objects of the removed files. In fact, the .git directory of this new repo was more than twice the size of the .git directory of the original build repo! To prune this unnecessary data completely and shrink the repository, run

git clone file:///path/to/mso-dumper-temp mso-dumper

to clone this into yet another local repo and strip all the unnecessary blobs. Note that I used the file:///… style path, as opposed to the usual /path/to/foo style path. When you clone a local repo with a file:///… style path, git does not copy the objects of the removed files, which reduces the size of the object store significantly (and the clone is faster too). With a regular /path/to/foo style path, git hard-links all the object files, so the size stays the same.

After the second cloning, the size of my .git directory shrank from 280MB to 384k! So it does make a big difference. Now all that’s left to do is to push this repository to the new remote location. Easy huh? :-)
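The difference between the two path styles can be checked on a throwaway repository. This is a rough sketch with made-up names (src, plain-copy, url-copy, big.bin, has_blob). Instead of running a full filter-branch, it simply stores a blob that no commit refers to, as a stand-in for the leftover objects, then clones both ways and checks whether that blob came along.

```shell
#!/usr/bin/env bash
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q src
cd src
echo "kept" > kept.txt
git add kept.txt
git commit -q -m "a file that stays in the history"

# Store a blob that no commit refers to; a stand-in for leftover objects.
echo "leftover data" > big.bin
leftover=$(git hash-object -w big.bin)
rm big.bin
cd ..

# Clone once with a plain path and once with a file:// URL.
git clone -q "$demo_dir/src" plain-copy
git clone -q "file://$demo_dir/src" url-copy

# Hypothetical helper: report whether a clone contains the leftover blob.
has_blob() {
    if git -C "$demo_dir/$1" cat-file -e "$leftover" 2>/dev/null; then
        echo present
    else
        echo gone
    fi
}

echo "plain-copy: leftover blob is $(has_blob plain-copy)"
echo "url-copy:   leftover blob is $(has_blob url-copy)"
```

The plain-path clone copies (or hard-links) the whole object directory, so the unreferenced blob travels with it, while the file:// clone only receives the objects reachable from the branches and tags.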

But there was a gotcha….

There was one caveat, however. This method apparently does not preserve the whole history of the relocated files if the parent sub-directory had been renamed. The mso-dumper directory was renamed from its original name sc-xlsutil in order to accommodate the ppt dumper that Thorsten wrote. Unfortunately git filter-branch --subdirectory-filter did not preserve the history before the directory rename occurred, but that was just a minor issue, and something I was not too concerned about for this particular transition.
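The caveat is easy to reproduce on a toy repository. The sketch below is only an illustration with made-up file contents and commit messages: it creates a directory under its old name, renames it, keeps working under the new name, and then compares the number of commits before and after filtering on the new name.

```shell
#!/usr/bin/env bash
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

git init -q parent
cd parent

# Commit 1: the project starts life under its old directory name.
mkdir sc-xlsutil
echo "v1" > sc-xlsutil/dump.py
git add .
git commit -q -m "add dumper under the old name"

# Commit 2: the directory is renamed.
git mv sc-xlsutil mso-dumper
git commit -q -m "rename sc-xlsutil to mso-dumper"

# Commit 3: work continues under the new name.
echo "v2" > mso-dumper/dump.py
git commit -q -am "update dumper"

before=$(git rev-list --count HEAD)
git filter-branch --subdirectory-filter mso-dumper/ -- --all > /dev/null
after=$(git rev-list --count HEAD)

echo "commits before filtering: $before"
echo "commits after filtering:  $after"
```

The commit made while the directory still had its old name never touched the mso-dumper path, so the filter drops it and the second count comes out lower than the first.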

Working with a branch using git-new-workdir

Introduction

The git package contains a script named git-new-workdir, which allows you to work in a branch in a separate directory on the file system. This differs from cloning a repository in two ways: git-new-workdir doesn’t duplicate the git history from the original repository but shares it instead, and when you commit something to the branch, that commit goes directly into the history of the original repository without an explicit push. On top of that, creating a new branch work directory is almost instantaneous. It’s fast, and it’s efficient. It’s an absolute time saver for those of us who work on many branches at any given moment, and it does all this without bloating the disk space.
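The shared-history behaviour is easy to see on a throwaway repository. Because git-new-workdir may not be installed everywhere, the sketch below uses git worktree as a stand-in; that is a different, built-in command in newer git releases that gives a similar shared-history work directory, not the script discussed here. All directory and branch names are made up.

```shell
#!/usr/bin/env bash
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q original
cd original
echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# Create a second work directory on a new branch.
# (git worktree here is a stand-in for git-new-workdir.)
git worktree add -b feature ../feature-dir > /dev/null 2>&1

# Commit in the second directory...
cd ../feature-dir
echo "more" >> file.txt
git commit -q -am "commit made in the branch work directory"

# ...and look at the branch from the original repository, with no push
# in between.
cd ../original
git log --oneline feature
```

The commit made in feature-dir shows up in the log of the original repository right away, which is the same effect the paragraph above describes for git-new-workdir.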

As wonderful as this script can be, not all distros package this script with their git package. If your distro doesn’t package it, you can always download the source packages of git and find the script there, under the contrib directory. Also, if you have the build repository of libreoffice cloned, you can find it in bin/git-new-workdir too.

Now, I’m going to talk about how I make use of this script to work on the 3.3 branch of LibreOffice.

Creating a branch work directory

If you’ve followed this page to build the master branch of libreoffice, then you should have in your clone of the build repository a directory named clone. Under this directory are your local clones of the 19 repositories comprising the whole libreoffice source tree. If you are like me, you have followed the above page and built your libreoffice build in the rawbuild directory.

The next step is to create a separate directory just for the 3.3 branch, named libreoffice-3-3, and set things up so that you can build it normally as you did in rawbuild. I’ve written the following bash script (named create-branch-build.sh) to do this in one single step.

#!/usr/bin/env bash
 
GIT_NEW_WORKDIR=~/bin/git-new-workdir
REPOS=clone
 
print_help() {
    # Quoted so that the square brackets are not treated as glob patterns.
    echo "Usage: $1 [bootstrap dir] [dest dir] [branch name]"
}
 
die() {
    echo "$1" >&2
    exit 1
}
 
BOOTSTRAP_DIR="$1"
DEST_DIR="$2"
BRANCH="$3"
 
if [ "$BOOTSTRAP_DIR" = "" ]; then
    echo bootstrap repo is missing.
    print_help $0
    exit 1
fi
 
if [ "$DEST_DIR" = "" ]; then
    echo destination directory is missing.
    print_help $0
    exit 1
fi
 
if [ "$BRANCH" = "" ]; then
    echo branch name is missing.
    print_help $0
    exit 1
fi
 
if [ -e "$DEST_DIR/$BRANCH" ]; then
    die "$DEST_DIR/$BRANCH already exists."
fi
 
# Clone bootstrap first.
$GIT_NEW_WORKDIR "$BOOTSTRAP_DIR" "$DEST_DIR/$BRANCH" "$BRANCH" || die "failed to clone bootstrap repo."
 
# Next, check out the branch in each repository.
echo "creating directory $DEST_DIR/$BRANCH/$REPOS"
mkdir -p "$DEST_DIR/$BRANCH/$REPOS" || die "failed to create $DEST_DIR/$BRANCH/$REPOS"
for repo_path in "$BOOTSTRAP_DIR/clone"/*; do
    if [ ! -d "$repo_path" ]; then
        # we only care about directories.
        continue
    fi
    repo=$(basename "$repo_path")
    echo "===== $repo ====="
    "$GIT_NEW_WORKDIR" "$repo_path" "$DEST_DIR/$BRANCH/$REPOS/$repo" "$BRANCH"
done
 
# Set symbolic links to the root directory.
cd "$DEST_DIR/$BRANCH"
for repo_path in "$REPOS"/*; do
    if [ ! -d "$repo_path" ]; then
        # skip if not directory.
        continue
    fi
    # Link every entry of this repository into the root directory.
    ln -s -t . "$repo_path"/*
done

The only thing you need to do before running this script is to set the GIT_NEW_WORKDIR variable to point to the location of the git-new-workdir script on your file system.

With this script in place, you can simply

cd ..  # move out of the build directory
create-branch-build.sh ./build/clone . libreoffice-3-3

and you now have a new directory named libreoffice-3-3 (same as the branch name), where all modules and top-level files are properly symlinked to their original locations, while the actual repo branches are under the clone directory (the value of REPOS in the script). All you have left to do is to start building. :-)

Note that there is no need to manually create a local branch named libreoffice-3-3 that tracks the remote libreoffice-3-3 branch in the original repository before running this script; git-new-workdir takes care of that for you provided that the remote branch of the same name exists.
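The symlink step at the end of the script can be tried in isolation, without git. The sketch below uses made-up repository and module names (calc, writer, sc, chart2, sw), and it uses the plain "ln -s target ." form rather than the -t option of GNU ln, so it is an approximation of what the script does rather than a copy of it.

```shell
#!/usr/bin/env bash
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"

# Two pretend repositories, each holding a module or two.
mkdir -p clone/calc/sc clone/calc/chart2 clone/writer/sw
touch clone/writer/README.txt

# Link every entry of every repository into the top-level directory.
for repo_path in clone/*/; do
    for entry in "$repo_path"*; do
        ln -s "$entry" .
    done
done

ls -l
```

After the loop, sc, chart2, sw and README.txt all appear at the top level as symlinks pointing into clone, which is the layout the build expects.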

Updating the branch work directory

In general, when you are in a branch work directory (I call it this because it sounds about right), updating the branch from its counterpart in the remote repo consists of two steps. First, run git fetch in the original repository to fetch the latest history. Then move back to the branch work directory and run git pull -r.

But doing this manually in all the 19 repositories can be very tedious. So I wrote another script (named g.sh) to ease this pain a little.

#!/usr/bin/env bash
 
REPOS=clone
 
die() {
    echo "$1" >&2
    exit 1
}
 
if [ ! -d "$REPOS" ]; then
    die "$REPOS directory not found in cwd."
fi
 
echo ===== main repository =====
git "$@"
 
for repo_path in "$REPOS"/*; do
    if [ ! -d "$repo_path" ]; then
        # Not a directory.  Skip it.
        continue
    fi
    echo "===== $(basename "$repo_path") ====="
    # Run the command in a subshell so the cwd is restored afterwards.
    (cd "$repo_path" && git "$@")
done

With this, updating the branch build directory takes a single command:

g.sh pull -r

That’s all there is to it.

A few more words…

As with any method in life, this one has limitations. If you build libreoffice the old-fashioned way, by applying patches on top of the raw source tree, this method doesn’t help you; you would still need to clone the repo and manually switch to the branch in the cloned repo.

But if you build, hack and debug in rawbuild almost exclusively (like me), then this method will help you save time and disk space. You can also adopt this method for any feature branches, as long as all the 19 repos (20 if you count l10n repo) have the same branch name. So, it’s worth a look! :-)

Thank you, ladies and gentlemen.

P.S. I’ve updated the scripts to adapt to the new bootstrap-based build scheme.

Git on Windows

I guess I don’t really have to tell the world about this, since if you type the title of this blog post in Google it will come back as the top hit. But it’s still worth mentioning msysgit, a pretty darn good git client on Windows. It’s small, it’s efficient, and it’s git. :-) You could of course use git in cygwin, but git in cygwin feels a little “heavy” and is by no means small, since you have to get the whole cygwin environment just to use git. So, if you don’t already have cygwin and want to use git on Windows, msysgit is a pretty good choice. It comes with a minimal bash shell, and while I’m happy to see ssh included with its shell, I was a little disappointed that they left out rsync. But that’s just one minor downside.

For me, msysgit is my git client of choice on Windows, especially in a virtual machine setting where the disk space is tight. On a build machine, though, I still use git in cygwin since I already have to use cygwin to build OOo.

Cognitive Dissonance

There are certain words I can never type right the first time. One is ‘formula’. My finger always wants to type it as ‘formular’, thinking that there is a missing ‘r’ at the end. Another one is ‘cvs’. I always end up typing ‘svn’, then backspacing three times to retype ‘cvs’. However, I have no problem typing ‘git’ (I’m not kidding!).