LibreOffice Conference 2011

So, it was a real pleasure to be a part of the very first LibreOffice conference, held in Paris, France. Some of the faces and names were familiar from the old OOo conferences, but the atmosphere was very different from that of the OOo conferences in the past. I have been to the 2007 Barcelona conference and the 2009 Orvieto one, and I have to say, while there were some rough edges, this is by far my favorite OOo/LibO conference to date.

The only regret I have is that, because I had another international trip (to South Korea) only a week prior to the conference, I felt pretty much exhausted most of the time I was there. But I think I managed to chat with most of the people I needed to chat with during this once-a-year event. I intentionally tried not to hack too much during this conference, mainly because of my travel fatigue, but also because I felt it was more important to see people and talk to them to have a good feel for each other. Working from home, I sometimes miss the human interaction that people who work in the office probably take for granted, so this conference was a perfect place to fulfill that need, to make me feel human again. ;-) (Actually, I tried to code a bit during the conference, but my brain wasn’t cooperating at all, so I decided it probably wasn’t a good idea.)

Anyway, it was good to see and chat with Markus Mohrhard (moggi), a very active Calc hacker who’s been instrumental in Calc’s filter test development in recent days. We discussed various topics in Calc development since we work together in that code.

It was also good to finally meet Laurent Godard, whom I’ve known for many years from the OOo days but had never met face-to-face.

And Valek Filippov, who happens to be in the same timezone as I. There aren’t many of us left in this LibreOffice circle, unfortunately. I tried to persuade him into this wonderful world of hacking, but so far he’s successfully fended off my attack.

It was also nice to chat with Michael Meeks at length, to clarify the new Calc cell storage structure that he and I discussed previously. Now the concept is very much clear, waiting to be coded.

And of course, to the countless other hackers I had a beer with during the conference week: it was a real pleasure.

Now, I’ve got some homework to do based on my interaction with various people during the conference. I will list the items one by one to use as a reminder.

  • Two Calc bugs from Valek. Both are related to this 1C program that pretty much everyone in Russia uses. I’ve already added them to my 3.5 TODO list, so it’s just a matter of finding time to tackle them unless something tricky comes out.
  • Some documentation on how to use the ixion library. Since there was some interest in using ixion to support formula calculations in other applications, I should probably start producing documentation on ixion, both on how to build it and on how to use it. I should also create a package for it while I’m at it.
  • Support for a temporary cell buffer in the orcus library, to allow converting cell values before passing them to the client code. In some cases we can’t simply push the cell value as-is but have to convert it first. Typical examples are two consecutive double quotes standing for a literal quote in CSV, as well as encoded characters (e.g. &amp;) in XML/HTML. This will unfortunately cost us a bit for the allocation of the buffer and copying of the char array, but fortunately we don’t need to do this for all cells.
  • And lots and lots more.
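The CSV case in the orcus item above is easy to picture with a tiny shell sketch. This is not orcus code (orcus is a C++ library), and unescape_csv_field is a made-up helper; it only illustrates why the converted value has to live in a separate buffer: once the doubled quotes are collapsed, the result is no longer a verbatim slice of the input.

```shell
# Hypothetical helper, for illustration only: turn one raw CSV field into
# the value a client would expect to receive.
unescape_csv_field() {
    case $1 in
        '"'*'"')
            # Quoted field: drop the enclosing quotes, then collapse each
            # pair of double quotes into a single literal quote.
            printf '%s\n' "$1" | sed -e 's/^"//' -e 's/"$//' -e 's/""/"/g'
            ;;
        *)
            # Unquoted field: it can be passed through as-is.
            printf '%s\n' "$1"
            ;;
    esac
}
```

Only the quoted branch needs the extra copy, which matches the point that the buffer isn’t needed for all cells.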

All in all, I was glad to be a part of this successful conference. The atmosphere was very much all inclusive and personal, exactly how an open source conference should be.

Slides for my talk

In case someone wants to get a hold of the slides for my talk during the LibreOffice conference, they are available here (also in PDF).

I will write something up about the conference in more detail at a later time. For now, I’ll take some time off to recover from the several trips I made in the past few weeks, across 3 different timezones that are 17 hours apart in total.

So, see you guys later.

Import performance boost with form controls

This is another performance win.

I’ve just pushed changes to the master branch to improve the import performance of binary Excel documents containing tons of form controls. The test document I used had upward of 500 form controls which, prior to the change, Calc would take at least several minutes to load. I don’t know the exact amount of time it took to open the document because each time I tried to open it, I had to kill the app after I became too impatient to wait.

Long story short, the same document now opens in under 6 seconds on my machine.

The poor performance in this particular case consisted of several different bottlenecks. They are

  • inefficient algorithm in registering event listeners for VBA events,
  • inefficient algorithm in querying the code name from the parent application,
  • unnecessary VBA event registration for form controls, and
  • sending unnecessary notifications to property value change listeners during import for each and every property value insertion.

Registering event listeners for VBA events

When each control is inserted, we register several VBA events for it in order to handle events from the VBA code. For each event, we would register by passing the target and listener pair to the handler that handles event notification. As it turned out, however, each time that happens, the handler has to introspect the type of the target because it is passed as UNO’s Any object. While each instance of that may take only a fraction of a second to complete, it adds up when it gets called literally millions of times, not to mention the fact that the target remains the same for the 12 or so listeners that are being registered for each control.

To solve this, I added a new method to register multiple event listeners for an identical target in a single call, to avoid repeated and unnecessary introspection of the target type. This alone has resulted in reducing the load time significantly (66% load-time reduction with my test document). However, this was still not enough with a larger number of controls since, as the number of controls grew, the load time would increase almost quadratically.

Querying the code name from the parent application

Another issue was the algorithm responsible for looking up the “code name” of the VBA module that the control belongs to. The code name is the name associated with each VBA module that Excel creates for each sheet. The name of the module does not necessarily equal the name of the sheet, and is unique to each sheet. The old algorithm would go through all existing form control instances in order to find a match, then backtrack the sheet it is on in order to determine the correct code name. But because it had to iterate through all existing controls, as the number of the controls grew, so would the time it takes to find a match.

Since the code name is the same for every control on a given sheet, there was no reason to check every single control. So I added a new method to get the code name directly from the parent container of the controls. Since we only create one container per sheet at most, this makes the code name lookup independent of the number of controls, and results in quasi-constant time lookup since the number of sheets doesn’t grow during the import.

Unnecessary VBA event registration for form controls

There are two types of controls that Excel supports. One is the older form controls that you can insert via the Forms toolbar, while the other is the newer OLE controls that you can insert via the Control Toolbox toolbar. As luck would have it, Excel doesn’t support bindings to VBA with the form controls, so it is not necessary to register events for these guys when we import them (as Noel told me). Turning off event registration for form control import cut down the load time significantly. Many thanks to Noel for giving me a patch to turn this off for form controls.

Property value change listeners

Even after all these performance bottlenecks were squashed, the load time still didn’t feel as fast as it should be. So, I did another round of profiling. It indicated that, every time we set a new property value to a control via XPropertySet, we would notify all property value change listeners to allow them to react to the change or veto the change, and this happened unconditionally for every single property value insertion for every single control.

Since the likelihood of having to veto or change other property values based on a new property value insertion during file import is close to nil if not zero, I added a new API to temporarily turn off this notification. This shaved the last few seconds off the overall load time, bringing it down to 6 seconds in total. The notification is turned back on after the loading is complete.

Future consideration

There are several opportunities for future work. For one thing, the code name lookup also applies to the VBA event support in Writer. But because I wasn’t aware of how Writer organizes form controls, I didn’t touch its lookup algorithm. So, if the same inefficiency applies to Writer (which I’m not sure it does), then there may be a way to improve performance in that area.

Another area to consider is reducing the number of VBA events to register. As Noel told me, we currently register 12 or so events unconditionally for all controls imported from Excel documents. But technically we only have to register events that are actually needed. So, if we can find a way to determine which events we need to register, either by parsing the VBA code or by some other means, we can reduce the number of VBA event registrations during the import.

This is all I can think of at the moment. Thank you ladies and gentlemen.

New option to specify initial number of sheets

This just landed on the master branch.

Starting with LibreOffice 3.5, you can now specify the initial number of sheets that new documents will have. Previously, this was hard-coded to be 3 sheets in all cases no matter what. While this didn’t seem to bother a whole lot of people, judging by how few bug reports we’d received on this, it did bother some users enough that one of them decided to code up a patch to make it happen. Now, without further ado, let’s take a look at the new option page:

new-doc-options

where you can change the number of worksheets in new documents; the change takes effect the next time you create a new document.

Last but not least, the name of the person who made this all happen is Albert Thuswaldner. Please give kudos to him for his excellent work. :-)

New document status image in the status bar

I’ve just checked in the new icon set for the document status indicator from Paulo José. Here is a side-by-side screenshot of what the new icons look like.

statusbar-new-status-icon

The above is what it looks like when the document is unmodified. It’s a bit faded with a translucency effect, which is intentional. The one below is when the document is modified. The new images look very refined and are more in line with the application icon that we use for LibreOffice. Good work Paulo! :-)

Now, he has created another icon to show immediately after the document is saved, before it becomes the faded icon again after a few seconds. But that effect has yet to be implemented. If you are interested in taking on this task, drop us a note. It’s listed on the Easy Hacks page.

And let’s not forget to say that 3.4 will have these two brand-new icons.

Named range as data source in DataPilot table

I hinted in my previous post that you can now use a named range as the data source of a DataPilot table, but that you couldn’t yet create a new DataPilot table with a named range as the source.

Well, now you can.

I tried to come up with a clever way to add this functionality, but ended up with just another radio button in the existing source selection dialog (the dialog that pops up when you select Data – DataPilot – Start without an existing DataPilot table).

Here is a screenshot of the new dialog as evidence:
calc-dp-named-range-source

This functionality is currently available on the master branch of LibreOffice. For those of you who can build LibreOffice directly from the repository, go check it out!

For those of you who would rather wait for a released version, this will be available in 3.4 – the next minor release. Refer to this page for a more detailed release plan of the upcoming versions of LibreOffice.

FOSDEM 2011 slide & latest updates

I’ve just uploaded the slides for my talk during FOSDEM 2011 here. It was very nice to be able to talk about our somewhat ambitious plan to bring LibreOffice Calc to the next level. Also, I regret that I haven’t been able to blog about what’s been going on lately; lots of time has been spent on writing and reviewing code, fixing bugs and integrating patches, and sadly little time is left for writing blog posts.

Having said all that, let me talk about a few things that are new on the master branch (since I’m already in the writing mode).

The first one is the new move/copy sheet dialog

new-copy-move-sheet-dialog

which is based on the design suggestion from Christoph Noack and coded by Joost Eekhoorn. The idea is to provide a quick way to rename a copied sheet, and also to make the layout more ergonomic and more appropriate to modern HIG. There are still some minor issues that we have yet to work out, but this is a step in the right direction.

The second one is related to DataPilot. In fact, there are two new enhancements that have landed on master with regard to DataPilot.

The first enhancement is the support for an unlimited number of fields. Previously, DataPilot could only support up to 8 fields in each dimension (page, column, row and data). But now you can define as many fields in each dimension as you desire, provided that you have enough memory and CPU cycles to handle the extra load.

calc-dp-unlimited-fields

The second DataPilot enhancement is the support for a named range as the data source. Now, you can use a named range as the data source of a DataPilot table, instead of a raw range reference. This has the advantage that, when your source range grows, you can simply update the named range and refresh the DataPilot table.

calc-dp-named-range-source

However, I have not yet added a way to create a new DataPilot table with a named range as data source. I will work on that sometime soon, hopefully in time for our 3.4 release.

Other than that, I’ve fixed quite a number of bugs and added performance enhancements particularly with regard to external reference handling. Still, there are lots of other tasks I need to do on master before we hit the 3.4 release. Stay tuned for more updates.

New LibreOffice build eye-candy

This is cool.

When you build LibreOffice straight from the master repository, and you build it in the GNOME environment, you’ll get a nice little systray thingie with up-to-date build status information.

libo-build-zenity

And this is what you get when your build happens to fail.

libo-build-zenity-failed

When you are lucky enough to have a successful build, here is what you see.

libo-build-zenity-success

I don’t know who added this, but it sure is a nice one. :-)

Update: this is the result of the fine work done by Luke Dixon.

Working with a branch using git-new-workdir

Introduction

The git package contains a script named git-new-workdir, which allows you to work on a branch in a separate directory on the file system. This differs from cloning a repository in that git-new-workdir shares the git history with the original repository instead of duplicating it, and that when you commit something to the branch, the commit goes directly into the history of the original repository without an explicit push. On top of that, creating a new branch work directory is pretty much instantaneous. It’s fast, and it’s efficient. For those of us who work on many branches at any given moment, it’s an absolute time saver that doesn’t bloat the disk space.

As wonderful as this script can be, not all distros package this script with their git package. If your distro doesn’t package it, you can always download the source package of git and find the script there, under the contrib directory (contrib/workdir/git-new-workdir). Also, if you have the build repository of libreoffice cloned, you can find it in bin/git-new-workdir too.

Now, I’m going to talk about how I make use of this script to work on the 3.3 branch of LibreOffice.

Creating a branch work directory

If you’ve followed this page to build the master branch of libreoffice, then you should have in your clone of the build repository a directory named clone. Under this directory are your local clones of the 19 repositories comprising the whole libreoffice source tree. If you are like me, you have followed the above page and built your libreoffice build in the rawbuild directory.

The next step is to create a separate directory just for the 3.3 branch, named libreoffice-3-3, and set things up so that you can build it normally as you did in rawbuild. I’ve written the following bash script (named create-branch-build.sh) to do this in one single step.

#!/usr/bin/env bash

# Location of the git-new-workdir script; adjust this to match your system.
GIT_NEW_WORKDIR=~/bin/git-new-workdir
# Name of the directory that holds the individual repositories.
REPOS=clone

print_help() {
    echo "Usage: $1 [bootstrap dir] [dest dir] [branch name]"
}

die() {
    echo "$1"
    exit 1
}

BOOTSTRAP_DIR="$1"
DEST_DIR="$2"
BRANCH="$3"

if [ -z "$BOOTSTRAP_DIR" ]; then
    echo "bootstrap repo is missing."
    print_help "$0"
    exit 1
fi

if [ -z "$DEST_DIR" ]; then
    echo "destination directory is missing."
    print_help "$0"
    exit 1
fi

if [ -z "$BRANCH" ]; then
    echo "branch name is missing."
    print_help "$0"
    exit 1
fi

if [ -e "$DEST_DIR/$BRANCH" ]; then
    die "$DEST_DIR/$BRANCH already exists."
fi

# Create the work directory for the bootstrap repo first.
"$GIT_NEW_WORKDIR" "$BOOTSTRAP_DIR" "$DEST_DIR/$BRANCH" "$BRANCH" || die "failed to clone bootstrap repo."

# Then create a work directory for each repository, checking out the branch.
echo "creating directory $DEST_DIR/$BRANCH/$REPOS"
mkdir -p "$DEST_DIR/$BRANCH/$REPOS" || die "failed to create $DEST_DIR/$BRANCH/$REPOS"
for repo_path in "$BOOTSTRAP_DIR/$REPOS"/*; do
    if [ ! -d "$repo_path" ]; then
        # we only care about directories.
        continue
    fi
    repo=$(basename "$repo_path")
    echo "===== $repo ====="
    "$GIT_NEW_WORKDIR" "$repo_path" "$DEST_DIR/$BRANCH/$REPOS/$repo" "$BRANCH"
done

# Set symbolic links to the root directory.
cd "$DEST_DIR/$BRANCH" || die "failed to enter $DEST_DIR/$BRANCH"
for repo_path in "$REPOS"/*; do
    if [ ! -d "$repo_path" ]; then
        # skip if not directory.
        continue
    fi
    # Note: the -t option requires GNU ln.
    ln -s -t . "$repo_path"/*
done

The only thing you need to do before running this script is to set the GIT_NEW_WORKDIR variable to point to the location of the git-new-workdir script on your file system.

With this script in place, you can simply

cd ..  # move out of the build directory
create-branch-build.sh ./build . libreoffice-3-3

and you now have a new directory named libreoffice-3-3 (same as the branch name), where all modules and top-level files are properly symlinked to their original locations, while the actual repo branches are under the clone directory. All you have left to do is to start building. :-)
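To make the symlinking step a bit more concrete, here is a minimal sketch of what it does, written as a standalone function. The name link_repo_contents is made up for illustration, and it uses the portable form of ln rather than the GNU-only -t option; it is a sketch of the idea, not a replacement for the script.

```shell
# Hypothetical helper: symlink every top-level entry of each repository
# found under <repos_dir> into the current directory.
link_repo_contents() {
    repos_dir=$1
    for repo_path in "$repos_dir"/*; do
        # Only directories are treated as repositories.
        [ -d "$repo_path" ] || continue
        for entry in "$repo_path"/*; do
            # An unmatched glob stays literal; skip it.
            [ -e "$entry" ] || continue
            # The link target is relative to the current directory,
            # which is also where the link itself is created.
            ln -s "$entry" .
        done
    done
}
```

Run from the root of the branch work directory as link_repo_contents clone, this leaves each module visible at the top level while the real files stay under clone. Dot files such as .git are not matched by the glob, so they are left alone.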

Note that there is no need to manually create a local branch named libreoffice-3-3 that tracks the remote libreoffice-3-3 branch in the original repository before running this script; git-new-workdir takes care of that for you provided that the remote branch of the same name exists.

Updating the branch work directory

In general, when you are in a branch work directory (I call it this because it sounds about right), updating the branch from the branch in the remote repo consists of two steps. First, fetch the latest history in the original repository with git fetch; then move back to the branch work directory and run git pull -r.
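For a single repository, those two steps can be sketched as a small function. The function name update_workdir and the GIT variable are both made up here; GIT is only a hook that lets you dry-run the function with a stand-in command instead of the real git.

```shell
# Hypothetical helper: update one branch work directory.
#   $1 - path to the original repository
#   $2 - path to the branch work directory created by git-new-workdir
update_workdir() {
    orig_repo=$1
    workdir=$2
    # Step 1: fetch the latest history in the original repository.
    (cd "$orig_repo" && "${GIT:-git}" fetch) || return 1
    # Step 2: rebase the branch in the work directory onto what was fetched.
    (cd "$workdir" && "${GIT:-git}" pull -r) || return 1
}
```

For instance, GIT=echo update_workdir ./build ./libreoffice-3-3 just prints the two git commands that would be run, in order, without touching either repository.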

But doing this manually in all the 19 repositories can be very tedious. So I wrote another script (named g.sh) to ease this pain a little.

#!/usr/bin/env bash

# Name of the directory that holds the individual repositories.
REPOS=clone

die() {
    echo "$1"
    exit 1
}

if [ ! -d "$REPOS" ]; then
    die "$REPOS directory not found in cwd."
fi

echo "===== main repository ====="
git "$@"

for repo_path in "$REPOS"/*; do
    if [ ! -d "$repo_path" ]; then
        # Not a directory.  Skip it.
        continue
    fi
    echo "===== $(basename "$repo_path") ====="
    # Run the same git command inside the repository; the subshell
    # restores the current directory afterward.
    (cd "$repo_path" && git "$@")
done

With this, updating the branch build directory is as simple as:

g.sh pull -r

That’s all there is to it.
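One detail worth keeping in mind with a wrapper like g.sh is how the arguments get forwarded to git, since it matters as soon as an argument contains a space (a commit message, say). Here is a minimal sketch of the difference between $@ and "$@"; the three function names are made up for the demonstration.

```shell
# Print how many arguments were received.
count_args() {
    echo "$#"
}

# Forward the arguments unquoted: each one is split again on whitespace.
forward_unquoted() {
    count_args $@
}

# Forward the arguments quoted: each one is passed through intact.
forward_quoted() {
    count_args "$@"
}
```

Calling forward_unquoted commit -m "fix a bug" reports 5 arguments, while forward_quoted commit -m "fix a bug" reports 3, which is what git would need to see.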

A few more words…

As with any method in life, this method has limitations. If you build libreoffice the old-fashioned way of applying patches on top of the raw source tree, this method doesn’t help you; you would still need to clone the repo, and manually switch to the branch in the cloned repo.

But if you build, hack and debug in rawbuild almost exclusively (like me), then this method will help you save time and disk space. You can also adopt this method for any feature branches, as long as all the 19 repos (20 if you count l10n repo) have the same branch name. So, it’s worth a look! :-)

Thank you, ladies and gentlemen.

P.S. I’ve updated the scripts to adapt to the new bootstrap-based build scheme.

Japanese language mailing lists now available

Florian (whose blog I can’t find at the moment so I’ll link his twitter account) was kind enough to set up three mailing lists dedicated to the Japanese language speakers in the LibreOffice project. So, those of you who have been patiently waiting for this moment, feel free to subscribe to them. I’ll see you guys there. :-)

Meanwhile, Yosuke Kato has made a similar announcement about the new mailing lists here (in Japanese).