Working with a branch using git-new-workdir

Introduction

Git package contains a script named git-new-workdir, which allows you to work in a branch in a separate directory on the file system. This differs from cloning a repository in that git-new-workdir doesn’t duplicate the git history from the original repository and shares it instead, and that when you commit something to the branch that commit goes directly into the history of the original repository without explicitly pushing to the original repository. On top of that, creating a new branch work directory happens very much instantly. It’s fast, and it’s efficient. It’s an absolute time saver for those of us who work on many branches at any given moment without bloating the disk space.

As wonderful as this script can be, not all distros package this script with their git package. If your distro doesn’t package it, you can always download the source packages of git and find the script there, under the contrib directory. Also, if you have the build repository of libreoffice cloned, you can find it in bin/git-new-workdir too.

Now, I’m going to talk about how I make use of this script to work on the 3.3 branch of LibreOffice.

Creating a branch work directory

If you’ve followed this page to build the master branch of libreoffice, then you should have in your clone of the build repository a directory named clone. Under this directory are your local clones of the 19 repositories comprising the whole libreoffice source tree. If you are like me, you have followed the above page and built your libreoffice build in the rawbuild directory.

The next step is to create a separate directory just for the 3.3 branch which named libreoffice-3-3 and set things up so that you can build it normally as you did in the rawbuild. I’ve written the following bash script (named create-branch-build.sh) to do this in one single step.

#!/usr/bin/env bash
 
GIT_NEW_WORKDIR=~/bin/git-new-workdir
REPOS=clone
 
print_help() {
    echo Usage: $1 [bootstrap dir] [dest dir] [branch name]
}
 
die() {
    echo $1
    exit 1
}
 
BOOTSTRAP_DIR="$1"
DEST_DIR="$2"
BRANCH="$3"
 
if [ "$BOOTSTRAP_DIR" = "" ]; then
    echo bootstrap repo is missing.
    print_help $0
    exit 1
fi
 
if [ "$DEST_DIR" = "" ]; then
    echo destination directory is missing.
    print_help $0
    exit 1
fi
 
if [ "$BRANCH" = "" ]; then
    echo branch name is missing.
    print_help $0
    exit 1
fi
 
if [ -e "$DEST_DIR/$BRANCH" ]; then
    die "$DEST_DIR/$BRANCH already exists."
fi
 
# Clone bootstrap first.
$GIT_NEW_WORKDIR "$BOOTSTRAP_DIR" "$DEST_DIR/$BRANCH" "$BRANCH" || die "failed to clone bootstrap repo."
 
# First, check out the branches.
echo "creating directory $DEST_DIR/$BRANCH/$REPOS"
mkdir -p "$DEST_DIR/$BRANCH/$REPOS" || die "failed to create $DEST_DIR/$BRANCH/$REPOS"
for repo in `ls "$BOOTSTRAP_DIR/clone"`; do
    repo_path="$BOOTSTRAP_DIR/clone/$repo"
    if [ ! -d $repo_path ]; then
        # we only care about directories.
        continue
    fi
    echo ===== $repo =====
    $GIT_NEW_WORKDIR $repo_path "$DEST_DIR/$BRANCH/$REPOS/$repo" $BRANCH
done
 
# Set symbolic links to the root directory.
cd "$DEST_DIR/$BRANCH"
for repo in `ls $REPOS`; do
    repo_path=$REPOS/$repo
    if [ ! -d $repo_path ]; then
        # skip if not directory.
        continue
    fi
    ln -s -t . $repo_path/*
done

The only thing you need to do before running this script is to set the GIT_NEW_WORKDIR variable to point to the location of the git-new-workdir script on your file system.

With this script in place, you can simply

cd ..  # move out of the build directory
create-branch-build.sh ./build/clone . libreoffice-3-3

and you now have a new directory named libreoffice-3-3 (same as the branch name), where all modules and top-level files are properly symlinked to their original locations, while the actual repo branches are under the _repos directory. All you have left to do is to start building. :-)

Note that there is no need to manually create a local branch named libreoffice-3-3 that tracks the remote libreoffice-3-3 branch in the original repository before running this script; git-new-workdir takes care of that for you provided that the remote branch of the same name exists.

Updating the branch work directory

In general, when you are in a branch work directory (I call it this because it sounds about right), updating the branch from the branch in the remote repo consists of two steps. First, fetch the latest history in the original repository by git fetch, move back to the branch work directory and run git pull -r.

But doing this manually in all the 19 repositories can be very tedious. So I wrote another script (named g.sh) to ease this pain a little.

#!/usr/bin/env bash
 
REPOS=clone
 
die() {
    echo $1
    exit 1
}
 
if [ ! -d $REPOS ]; then
    die "$REPOS directory not found in cwd."
fi
 
echo ===== main repository =====
git $@
 
for repo in `ls $REPOS`; do
    echo ===== $repo =====
    repo_path=$REPOS/$repo
    if [ ! -d $repo_path ]; then
        # Not a directory.  Skip it.
        continue
    fi
    pushd . > /dev/null
    cd $repo_path
    git $@
    popd > /dev/null
done

With this, updating the branch build directory is done:

g.sh pull -r

That’s all there is to it.

A few more words…

As with any methods in life, this method has limitations. If you build libreoffice with the old-fashioned way of applying patches on top of the raw source tree, this method doesn’t help you; you would still need to clone the repo, and manually switch to the branch in the cloned repo.

But if you build, hack and debug in rawbuild almost exclusively (like me), then this method will help you save time and disk space. You can also adopt this method for any feature branches, as long as all the 19 repos (20 if you count l10n repo) have the same branch name. So, it’s worth a look! :-)

Thank you, ladies and gentlemen.

P.S. I’ve updated the scripts to adopt to the new bootstrap based build scheme.

Japanese language mailing lists now available

Florian (whose blog I can’t find at the moment so I’ll link his twitter account) was kind enough to set up three mailing lists dedicated for the Japanese language speakers in the LibreOffice project. So, those of you who have been patiently waiting for this moment, feel free to subscribe them. I’ll see you guys there. :-)

Meanwhile, Yosuke Kato has made similar announcement about the new mailing lists here (in Japanese).

Key binding compatibility options (take 2)

This post is a sequel to this previous post, so refer to that post for the detail of what I’ve been working on.

Anyway, I have settled with the following Compatibility option page:

calc-compat-option

which should be just adequate for what it needs to do without being too annoying.

Also, just for the matter of documenting its behavior, the following chart shows what actions are associated with what key bindings for the two key binding types (Default and OpenOffice.org legacy):

Key Binding Default OpenOffice.org legacy
Backspace delete contents delete
Delete delete delete contents
Ctrl-D fill down data select
Shift-Ctrl-D data select

where the actions are

  • delete contents – launch the delete contents dialog.
  • delete – immediately delete the cell content, without the dialog.
  • fill down – fill cell content downward within selection.
  • data select – launch the selection list dialog.

Note that all the other key bindings are left untouched. Also, the list of key bindings that can get reset by this functionality may grow in future releases.

mdds 0.3.1

I’m happy to announce the release of version 0.3.1 of the Multi-Dimensional Data Structure (mdds). This is a bug fix release, and contains no major changes from the previous version (0.3.0). The highlights of this release are:

  • use of autoconf tool, to allow you to run ./configure && sudo make install to install the library, and
  • drop the requirement for C++0x support, by using equivalent features from the boost library which mdds already depends on.

When using this library without the C++0x support, however, you need to define the MDDS_HASH_CONTAINER_BOOST compiler macro in order for the mdds library to use boost’s hash_containers, instead of the ones from C++0x. Similarly, you can define the MDDS_HASH_CONTAINER_STLPORT to force mdds to use stlport’s hash containers instead.

I will briefly explain the incompatible support of hash containers various libraries. Originally, the stlport library supported two hash containers, hash_map and hash_set, in the std namespace which can be used as hashed replacements of std::map and std::set, respectively. In C++0x, however, these two containers have been renamed to unordered_map and unordered_set which are still in the std namespace. Boost also provides unordered_map and unordered_set, but they are in the boost namespace. The change that this release contains should hopefully be useful when dealing with this incompatible hash container situation in various libraries.

This release contains patches from David Tardon and Phillip Thomas, who fixed various bits of the Makefile script. Phillip also helped me fix the rpm spec file. Thanks a lot!

Key binding compatibility options

As I posted on the libreoffice development list, I’m currently working on adding a new option page in the Options dialog, to provide a quick way to switch key bindings between LibreOffice’s default and OpenOffice.org’s for Calc. For the most part, the default key bindings are identical between LibreOffice and OpenOffice.org as far as Calc’s concerned, but there are some differences, which are enough to annoy those users who are accustomed to the old key bindings from OOo Calc.

The option page will look something like this:
calc-key-binding-compat

but it will not look exactly like this since this is just my first viewable version with no design effort put into it at all. So, hopefully we can morph this into something a little more usable and a little more bearable.

Right now, we only deal with key binding compatibility; however, we may expand this page to incorporate other compatibility options in future versions of LibreOffice.

Stay tuned!

Strace equivalent for Windows

While searching for a debug tool equivalent of strace that runs on Windows, I’ve come across Process Monitor. This appears to work well for me. For anyone in search of strace equivalent on Windows, give this one a try.

I’ve also tried StraceNT, but this one was not very reliable as it tends to crash the traced process almost every time.

An equally useful tool for debugging on Windows is DebugView, which captures output from the OutputDebugString calls.

Ixion – threaded formula calculation library

I spent my entire last week on my personal project, by taking advantage of Novell’s HackWeek. Officially, HackWeek took place two weeks ago, but because I had to travel that week I postponed mine till the following week.

Ixion is the project I worked on as part of my HackWeek. This project is an experimental effort to develop a stand-alone library that supports parallel computation of formula expressions using threads. I’d been working on this on and off in my spare time, but when the opportunity came along to spend one week of my paid time on any project of my choice (personal or otherwise), I didn’t hesitate to pick Ixion.

Overview

So, what’s Ixion? Ixion aims to provide a library for calculating the results of formula expressions stored in multiple named targets, or “cells”. The cells can be referenced from each other, and the library takes care of resolving their dependencies automatically upon calculation. The caller can run the calculation routine either in a single-threaded mode, or a multi-threaded mode. The library also supports re-calculation where the contents of one or more cells have been modified since the last calculation, and a partial calculation of only the affected cells gets performed. It is written entirely in C++, and makes extensive use of the boost library to achieve portability across different platforms. It has currently been tested to build on Linux and Windows.

The goal is to eventually bring this library up to the level where it can serve as a full-featured calculation engine for spreadsheet applications. But right now, this project remains as an experimental, proof-of-concept project to help me understand what is required to build a threaded calculation engine capable of performing all sorts of tasks required in a typical spreadsheet app.

I consider this project a library project; however, building this project only creates a single stand-alone console application at the moment. I plan to separate it into a shared library and a front-end executable in the future, to allow external apps to dynamically link to it.

How it works

Building this project creates an executable called ixion-parser. Running it with a -h option displays the following help content:

Usage: ixion-parser [options] FILE1 FILE2 ...
 
The FILE must contain the definitions of cells according to the cell definition rule.
 
Allowed options:
  -h [ --help ]         print this help.
  -t [ --thread ] arg   specify the number of threads to use for calculation.  
                        Note that the number specified by this option 
                        corresponds with the number of calculation threads i.e.
                        those child threads that perform cell interpretations. 
                        The main thread does not perform any calculations; 
                        instead, it creates a new child thread to manage the 
                        calculation threads, the number of which is specified 
                        by the arg.  Therefore, the total number of threads 
                        used by this program will be arg + 2.

The parser expects one or more cell definition files as arguments. A cell definition file may look like this:

%mode init
A1=1
A2=A1+10
A3=A2+A1*30
A4=(10+20)*A2
A5=A1-A2+A3*A4
A6=A1+A3
A7=A7
A8=10/0
A9=A8
%calc
%mode result
A1=1
A2=11
A3=41
A4=330
A5=13520
A6=42
A7=#REF!
A8=#DIV/0!
A9=#DIV/0!
%check
%mode edit
A6=A1+A2
%recalc
%mode result
A1=1
A2=11
A3=41
A4=330
A5=13520
A6=12
A7=#REF!
A8=#DIV/0!
A9=#DIV/0!
%check
%mode edit
A1=10
%recalc

I hope the format of the cell definition rule is straightforward. The definitions are read from top down. I used the so-called A1 notation to name target cells, but it doesn’t have to be that way. You can use any naming scheme to name target cells as long as the lexer recognizes them as names. Also, the format supports a command construct; a line beginning with a ‘%’ is considered a command. Several commands are currently available. For instance the mode command lets you switch input modes. The parser currently supports three input modes:

  • init – initialize cells with specified contents.
  • result – pick up expected results for cells, for verification.
  • edit – modify cell contents.

In addition to the mode command, the following commands are also supported:

  • calc – perform full calculation, by resetting the cached results of all involved cells.
  • recalc – perform partial re-calculation of modified cells and cells that reference modified cells, either directly or indirectly.
  • check – verify the calculation results.

Given all this, let’s see what happens when you run the parser with the above cell definition file.

./ixion-parser -t 4 test/01-simple-arithmetic.txt 
Using 4 threads
Number of CPUS: 4
---------------------------------------------------------
parsing test/01-simple-arithmetic.txt
---------------------------------------------------------
A1: 1
A1: result = 1
---------------------------------------------------------
A2: A1+10
A2: result = 11
---------------------------------------------------------
A3: A2+A1*30
A3: result = 41
---------------------------------------------------------
A4: (10+20)*A2
A4: result = 330
---------------------------------------------------------
A5: A1-A2+A3*A4
A5: result = 13520
---------------------------------------------------------
A8: 10/0
result = #DIV/0!
---------------------------------------------------------
A6: A1+A3
A6: result = 42
---------------------------------------------------------
A9: 
result = #DIV/0!
---------------------------------------------------------
A7: result = #REF!
---------------------------------------------------------
checking results
---------------------------------------------------------
A2 : 11
A8 : #DIV/0!
A3 : 41
A9 : #DIV/0!
A4 : 330
A5 : 13520
A6 : 42
A7 : #REF!
A1 : 1
---------------------------------------------------------
recalculating
---------------------------------------------------------
A6: A1+A2
A6: result = 12
---------------------------------------------------------
checking results
---------------------------------------------------------
A2 : 11
A8 : #DIV/0!
A3 : 41
A9 : #DIV/0!
A4 : 330
A5 : 13520
A6 : 12
A7 : #REF!
A1 : 1
---------------------------------------------------------
recalculating
---------------------------------------------------------
A1: 10
A1: result = 10
---------------------------------------------------------
A2: A1+10
A2: result = 20
---------------------------------------------------------
A3: A2+A1*30
A3: result = 320
---------------------------------------------------------
A4: (10+20)*A2
A4: result = 600
---------------------------------------------------------
A5: A1-A2+A3*A4
A5: result = 191990
---------------------------------------------------------
A6: A1+A2
A6: result = 30
---------------------------------------------------------
(duration: 0.00113601 sec)
---------------------------------------------------------

Notice that at the beginning of the output, it displays the number of threads being used, and the number of “CPU”s it detected. Here, the “CPU” may refer to the number of physical CPUs, the number of cores, or the number of hyper-threading units. I’m well aware that I need to use a different term for this other than “CPU”, but anyways… The number of child threads to use to perform calculation can be specified at run-time via -t option. When running without the -t option, the parser will run in a single-threaded mode. Now, let me go over what the above output means.

The first calculation performed is a full calculation. Since no cells have been calculated yet, we need to calculate results for all defined cells. This is followed by a verification of the initial calculation. After this, we modify cell A6, and perform partial re-calculation. Since no other cells depend on the result of cell A6, the re-calc only calculates A6.

Now, the third calculation is also a partial re-calculation following the modification of cell A1. This time, because several other cells do depend on the result of A1, those cells also need to be re-calculated. The end result is that cells A1, A2, A3, A4, A5 and A6 all get re-calculated.

Under the hood

Cell dependency resolution

cell-dependency-graph
There are several technical aspects of the implementation of this library I’d like to cover. The first is cell dependency resolution. I use a well-known algorithm called topological sort to sort cells in order of dependency so that cells can be calculated one by one without being blocked by the calculation of precedent cells. Topological sort is typically used to manage scheduling of tasks that are inter-dependent with each other, and it was a perfect one to use to resolve cell dependencies. This algorithm is a by-product of depth first search of directed acyclic graph (DAG), and is well-documented. A quick google search should give you tons of pseudo code examples of this algorithm. This algorithm work well both for full calculation and partial re-calculation routines.

Managing threaded calculation

The heart of this project is to implement parallel evaluation of formula expressions, which has been my number 1 goal from the get-go. This is also the reason why I put my focus on designing the threaded calculation engine as my initial goal before I start putting my focus into other areas. Programming with threads was also very new to me, so I took extra care to ensure that I understand what I’m doing, and I’m designing it correctly. Also, designing a framework that uses multiple threads can easily get out-of-hand and out-of-control when it’s overdone. So, I made an extra effort to limit the area where multiple threads are used while keeping the rest of the code single-threaded, in order to keep the code simple and maintainable.

As I soon realized, even knowing the basics of programming with threads, you are not immune to falling into many pitfalls that may arise during the actual designing and debugging of concurrent code. You have to go extra miles ensuring that access to thread-global data are synchronized, and that one thread waits for another thread in case threads must be executed in certain order. These things may sound like common sense and probably are in every thread programming text book, but in reality they are very easy to overlook, especially to those who have not had substantial exposure to concurrency before. Parallelism seemed that un-orthodox to conventional minds like myself. Having said all that, once you go through enough pain dealing with concurrency, it does become less painful after a while. Your mind can simply adjust to “thinking in parallel”.

Back to the topic. I’ve picked the following pattern to manage threaded calculation.

thread-design

First, the main thread creates a new thread whose job is to manage cell queues, that is, receiving queues from the main thread and assigning them to idle threads to perform calculation. It is also responsible for keeping track of which threads are idle and ready to take on a cell assignment. Let’s call this thread a queue manager thread. When the queue manager thread is created, it spawns a specified number of child threads, and waits until they are all ready. These child threads are the ones that perform cell calculation, and we call them calc threads.

Each calc thread registers itself as an idle thread upon creation, then sleeps until the queue manager thread assigns it a cell to calculate and signals it to wake up. Once awake, it calculates the cell, registers itself as an idle thread once again and goes back to sleep. This cycle continues until the queue manager thread sends a termination request to it, after which it breaks out of the cycle and reaches the end of its execution path to terminate.

The role of the queue manager thread is to receive cell calculation requests from the main thread and pass them on to idle calc threads. It keeps doing it until it receives a termination request from the main thread. Once receiving the termination request from the main thread, it sends all the remaining cells in queue to the calc threads to finish up, then sends termination requests to the calc threads and wait until all of them terminate.

Thanks to the cells being sorted in topological order, the process of putting a cell in queue and having a calc thread perform calculation is entirely asynchronous. The only exception is that when referencing another cell during calculation, the result of that referenced cell may not be available at the time of the value query due to concurrency. In such cases, the calculating thread needs to block its execution until the result of the referenced cell becomes available. When running in a single-threaded mode, on the other hand, the result of a referenced cell is guaranteed to be available as long as cells are calculated in topological order and contain no circular references.

What I accomplished during HackWeek

During HackWeek, I was able to accomplish quite a few things. Before the HackWeek, the threaded calculation framework was not even there; the parser was only able to reliably perform calculation in a single-threaded mode. I had some test code to design the threaded queue management framework, but that code had yet to be integrated into the main formula interpreter code. A lot of work was still needed, but thanks to having an entire week devoted for this, I was able to

  • port the test threaded queue manager framework code into the formula interpreter code,
  • adopt the circular dependency detection code for the new threaded calculation framework,
  • test the new framework to squeeze lots of kinks out,
  • put some performance optimization in the cell definition parser and the formula lexer code,
  • implement result verification framework, and
  • implement partial re-calculation.

Had I had to do all this in my spare time alone, it would have easily taken months. So, I’m very thankful for the event, and I look forward to having another opportunity like this in hopefully not-so-distant future.

What lies ahead

So, what lies ahead for Ixion? You may ask. There are quite a few things to get done. Let me start first by saying that, this library is far from providing all the features that a typical spreadsheet application needs. So, there are still lots of work needed to make it even usable. Moreover, I’m not even sure whether this library will become usable enough for real-world spreadsheet use, or it will simply end up being just another interesting proof-of-concept. My hope is of course to see this library evolve into maturity, but at the same time I’m also aware that it would be hard to advance this project with only my scarce spare time to spend in.

With that said, here are some outstanding issues that I plan on addressing as time permits.

  • Add support for literal strings, and support textural formula results in addition to numerical results.
  • Add support for empty cells. Empty cells are those cells that are not defined in the model definition file but can still be referenced. Currently, referencing a cell that is not defined causes a reference error.
  • Add support for cell ranges. This implies that I need to make cell instances addressable by 3-dimensional coordinates rather than by pointer values.
  • Split the code into two parts: a shared library and an executable.
  • Use autoconf to make the build process configurable.
  • Make the expression parser feature-complete.
  • Implement more functions. Currently only MAX and MIN are implemented.
  • Support for localized numbers.
  • Lots and lots more.

Conclusion

This concludes my HackWeek report. Thank you very much, ladies and gentlemen.

mdds 0.3.0 released!

I’m happy to announce the release of version 0.3.0 of Multi-Dimensional Data Structure, or mdds for short. This is a C++ library, and is a collection of various data structures designed to efficiently store and query multi-dimensional data for various filtering criteria. Different structures are optimized for different query needs.

This library is a source-code only library. It’s designed to be header-only meaning that the user program does not need to link to any additional shared library in order to use these data structures. The data structures are all available as C++ templates.

I have briefly touched on the motivation behind starting this project when I previously talked about my effort to increase Calc’s row limit, the effort of which eventually led to the creation of Flat segment tree, one of the structures included in this library.

What this release contains

This version contains the following four data structures:

  • Segment tree
  • Flat segment tree
  • Rectangle set
  • Point quad tree

I will not go into the details of each of these data structures in this post, but I will probably do a follow-up post explaining the strengths of each structure, and details of how they are structured internally, how the query works etc. The project home page provides a brief (yes, very brief!) overview of these data structures for the curious-minded.

Installation

I’ve created a package for the SUSE family of Linux distributions, via openSUSE Build Service (OBS). There is also a link to 1-Click Install from the project home page.

For non-SUSE users, please install it directly from the generic source package available from here, but first I must apologize for not providing the familiar configure, make, make install installation option yet. That’s one of the things I’d like to work on in future releases. For now, please install the library manually by

tar xvf mdds_0.3.0.tar.bz2
cd mdds_0.3.0
sudo cp -R inc/mdds /usr/local/include/

How to use these data structures

I’ve provided some example source code under the example directory which should serve as a quick tutorial on how to use these structures. For more details of the API, please refer to their respective header file. As time permits, I will work on providing documentation for this library.

Lastly…

Comments and feedback are very much appreciated! Thank you very much, ladies and gentlemen. :-)

Fetching tarball before the build

I just rebased one of my upstream child work spaces to milestone 77 (DEV300_m77), and noticed my build breaking all over the place due to missing patches for external libraries. After some digging, here is the step I was missing.

./fetch_tarballs.sh ooo.lst

which apparently downloads a whole bunch of external source packages from http://hg.services.openoffice.org/binaries that the build depends on, and places them into their respective location within the source tree. After this was done, my build proceeded normally.