Extracting a sub project into a new repository (and how mso-dumper got its new home).

Background

Just a short while ago I worked on extracting our mso-dumper project from LibreOffice’s build repository, into a brand new repository created just for this. The new repository was to be located in libreoffice/contrib/mso-dumper.

Originally, this project started out just as a simple sub directory of a much larger parent repository. But because it grew so much, and because its scope is not entirely in line with that of the parent repository, I decided it was best to move this project into a repository of its own. Now, it’s easy to transfer a subset of files from one repository to another if you don’t mind losing its history, but I wanted to preserve the history of those files even after the transition.

It turns out that there is a way to do this with git. Kendy suggested that I look into git filter-branch, so I did. After a few hours of researching and trials & errors (and some bash script writing which was later thrown away), I’ve come to realize that all of this can be achieved in the following simple steps.

Steps

First, clone the whole build repository which contains the sub project to be extracted

git clone path/to/libo/build mso-dumper-temp

Once done, cd into that cloned repository, and run

git filter-branch --subdirectory-filter scratch/mso-dumper/ -- --all

which will remove all files from the git history except for those under the scratch/mso-dumper directory, and re-locate those files under that directory into the top-level directory. You may also want to run

git remote rm origin

to prevent accidental pushing of this to the remote origin during these steps. Anyway, once the filtering is done, remove all tags by

git tag | xargs git tag -d

And that’s all. Now, you have only the files you want to keep, they are sitting happily at the top level like they should, all of their commit records are preserved, and you don’t have any old tags you don’t need for the new repository.

This is not over yet. At this point, this git repo still stores the objects of the removed files. In fact, the size of the .git directory of this new repo was more than twice the size of the .git directory of the original build repo! To completely prune this unnecessary info in order to shrink the size of the repository, run

git clone file:///path/to/mso-dumper-temp mso-dumper

to further clone this into another repo locally to strip all the unnecessary blob. Note that I used the file:///… style file path, as opposed to the usual /path/to/foo style file path. When using the file:///… style path to clone a local repo, git will not clone the objects of the removed files, thereby reducing the size of the objects significantly (and clone is faster too). Using the regular /path/to/foo style path, git will hard-link all the object files, so the size will stay the same.

After the second cloning, the size of my .git directory shrank from 280MB to 384k! So it does make a big difference. Now all that’s left to do is to push this repository to the new remote location. Easy huh? :-)

But there was a gotcha….

There was one caveat, however. This method apparently does not preserve the whole history of the relocated files if the parent sub-directory had been renamed. The mso-dumper directory was renamed from its original name sc-xlsutil in order to accommodate the ppt dumper that Thorsten wrote. Unfortunately git filter-branch --subdirectory-filter did not preserve the history before the directory rename occurred, but that was just a minor issue, and something I was not too concerned about for this particular transition.

mso-dumper now packaged in OBS

I’m happy to announce that the mso-dumper tool is now packaged in the openSUSE build service under my home repository. This tool is written in Python, and allows you to dump the contents of MS Office documents stored in the BIFF-structured binary file format in a more human readable fashion. It is an indispensable tool when dealing with importing from and/or exporting to the Office documents. Right now, only Excel and Power Point formats are supported.

This package provides two new commands xls-dump and ppt-dump. If you wish to dump the content of an Excel document, all you have to do is

xls-dump ./path/to/mydoc.xls

and it dumps its content to standard output. What the output looks like depends on what’s stored with the document, but it will look something like this:

...
0085h: =============================================================
0085h: BOUNDSHEET - Sheet Information (0085h)
0085h:   size = 14
0085h: -------------------------------------------------------------
0085h: B4 09 00 00 00 00 06 00 53 68 65 65 74 31 
0085h: -------------------------------------------------------------
0085h: BOF position in this stream: 2484
0085h: sheet name: Sheet1
0085h: hidden state: visible
0085h: sheet type: worksheet or dialog sheet

008Ch: =============================================================
008Ch: COUNTRY - Default Country and WIN.INI Country (008Ch)
008Ch:   size = 4
008Ch: -------------------------------------------------------------
008Ch: 01 00 01 00 

00EBh: =============================================================
00EBh: MSODRAWINGGROUP - Microsoft Office Drawing Group (00EBh)
00EBh:   size = 90
00EBh: -------------------------------------------------------------
00EBh: 0F 00 00 F0 52 00 00 00 00 00 06 F0 18 00 00 00 
00EBh: 02 04 00 00 02 00 00 00 02 00 00 00 01 00 00 00 
00EBh: 01 00 00 00 02 00 00 00 33 00 0B F0 12 00 00 00 
00EBh: BF 00 08 00 08 00 81 01 09 00 00 08 C0 01 40 00 
00EBh: 00 08 40 00 1E F1 10 00 00 00 0D 00 00 08 0C 00 
00EBh: 00 08 17 00 00 08 F7 00 00 10 

00FCh: =============================================================
00FCh: SST - Shared String Table (00FCh)
00FCh:   size = 8
00FCh: -------------------------------------------------------------
00FCh: 00 00 00 00 00 00 00 00 
00FCh: -------------------------------------------------------------
00FCh: total number of references: 0
00FCh: total number of unique strings: 0
...

I have originally written this tool to deal with the Excel import and export part of Calc’s development, and continue to develop it further. Thorsten Behrens has later joined forces and added support for the Power Point format. Right now, I’m working on adding an XML output format option to make it easier to compare outputs, which is important for regression testing.