Performance benchmark on mdds R-tree

I’d like to share the results of the quick benchmark tests I’ve done to measure the performance of the R-tree implementation included in the mdds library since 1.4.0.

Brief overview on R-tree

R-tree is a data structure designed for optimal query performance on spatial data. It is especially well suited when you need to store a large number of spatial objects in a single store and need to perform point- or range-based queries. The version of R-tree implemented in mdds is a variant known as R*-tree, which differs from the original R-tree in that it occasionally forces re-insertion of stored objects when inserting a new object would cause the target node to exceed its capacity. The original R-tree would simply split the node unconditionally in such cases. The reason behind R*-tree’s choice of re-insertion is that re-insertion would result in the tree being more balanced than simply splitting the node without re-insertion. The downside of such re-insertion is that it would severely affect the worst case performance of object insertion; however, it is claimed that in most real world use cases, the worst case performance would rarely be hit.

That being said, the insertion performance of R-tree is still not very optimal especially when you need to insert a large number of objects up-front, and unfortunately this is a very common scenario in many applications. To mitigate this, the mdds implementation includes a bulk loader that is suitable for mass-insertion of objects at tree initialization time.

What is measured in this benchmark

What I measured in this benchmark are the following:

  • bulk-loading of objects at tree initialization,
  • the size() method call, and
  • the average query performance.

I have written a specially-crafted benchmark program to measure these three categories, and you can find its source code here. The size() method is included here because in a way it represents the worst case query scenario since what it does is visit every single leaf node in the entire tree and count the number of stored objects.

The mdds implementation of R-tree supports arbitrary dimension sizes, but in this test, the dimension size was set to 2, for storing 2-dimensional objects.

Benchmark test design

Here is how I designed my benchmark tests.

First, I decided to use map data which I obtained from OpenStreetMap (OSM) for regions large enough to contain the number of objects in the millions. Since OSM does not allow you to specify a very large export region from its web interface, I went to the Geofabrik download server to download the region data. For this benchmark test, I used the region data for North Carolina, California, and Japan’s Chubu region. The latitude and longitude were used as the dimensions for the objects.

All data were in the OSM XML format, and I used the XML parser from the orcus project to parse the input data and build the input objects.

Since the map objects are not necessarily of rectangular shape, and not necessarily perfectly aligned with the latitude and longitude axes, the test program would compute the bounding box for each map object that is aligned with both axes before inserting it into R-tree.

To prevent the XML parsing portion of the test to affect the measurement of the bulk loading performance, the map object data gathered from the input XML file were first stored in a temporary store, and then bulk-loaded into R-tree afterward.

To measure the query performance, the region was evenly split into 40 x 40 sub-regions, and a point query was performed at each point of intersection that neighbors 4 sub-regions. Put it another way, a total of 1521 queries were performed at equally-spaced intervals throughout the region, and the average query time was calculated.

Note that what I refer to as a point query here is a type of query that retrieves all stored objects that intersects with a specified point. R-tree also allows you to perform area queries where you specify a 2D area and retrieve all objects that overlap with the area. But in this benchmark testing, only point queries were performed.

For each region data, I ran the tests five times and calculated the average value for each test category.

It is worth mentioning that the machine I used to run the benchmark tests is a 7-year old desktop machine with Intel Xeon E5630, with 4 cores and 8 native threads running Ubuntu LTS 1804. It is definitely not the fastest machine by today’s standard. You may want to keep this in mind when reviewing the benchmark results.

Benchmark results

Without further ado, these are the actual numbers from my benchmark tests.

The Shapes column shows the numbers of map objects included in the source region data. When comparing the number of shapes against the bulk-loading times, you can see that the bulk-loading time scales almost linearly with the number of shapes:

You can also see a similar trend in the size query time against the number of shapes:

The point query search performance, on the other hand, does not appear to show any correlation with the number of shapes in the tree:

This makes sense since the structure of R-tree allows you to only search in the area of interest regardless of how many shapes are stored in the entire tree. I’m also pleasantly surprised with the speed of the query; each query only takes 5-6 microseconds on this outdated machine!

Conclusion

I must say that I am overall very pleased with the performance of R-tree. I can already envision various use cases where R-tree will be immensely useful. One area I’m particularly interested in is spreadsheet application’s formula dependency tracking mechanism which involves tracing through chained dependency targets to broadcast cell value changes. Since the spreadsheet organizes its data in terms of row and column positions which is 2-dimensional, and many queries it performs can be considered spatial in nature, R-tree can potentially be useful for speeding things up in many areas of the application.

Orcus 0.11.0

I’m very pleased to announce that version 0.11.0 of the orcus library is officially out in the wild! You can download the latest source package from the project’s home page.

Lots of changes went into this release, but the two that I would highlight most are the inclusions of JSON and YAML parsers and their associated tools and interfaces. This release adds two new command-line tools: orcus-json and orcus-yaml. The orcus-json tool optionally handles JSON references to external files when the --resolves-refs option is given, though currently it only supports resolving external files that are on the local file system and only when the paths are relative to the referencing file.

I’ve also written an API documentation on the JSON interface in case someone wants to give it a try. Though the documentation on orcus is always work-in-progress, I’d like to spend more time to make the documentation in a more complete state.

On the import filter front, Markus Mohrhard has been making improvements to the ODS import filter especially in the area of styles import. Oh BTW, he is also proposing to mentor a GSOC project on this front under the LibreOffice project. So if you are interested, go and take a look!

That’s all I have at the moment. Thank you, ladies and gentlemen.

Orcus 0.7.1 is out

After more than a year since the release of 0.7.0, I’m once again very happy to announce that the version 0.7.1 of the Orcus library is now available for download. You can download the package from the project’s download page.

This is a maintenance release. It primarily includes bug fixes and build fixes since the 0.7.0 release with no new features. That said, the most notable aspect of this release is that it is buildable with the version 0.9.0 of the Ixion library which was just released a week ago. So, if you are trying to package and distribute the newly-released Ixion library but are unable to do so because of Orcus not being buildable with it, you might be interested in this release.

The next major upgrade will be 0.9.0 whose development is on-going on the current master branch. Hopefully that release won’t be too far away.

SUSE Hack Week

Last week was SUSE’s Hack Week – an event my employer does periodically to allow us – hard working engineers – to go wild with our wildest ideas and execute them in one week. Just like what I did at my last Hack Week event, I decided to work on integration of Orcus library into LibreOffice once again, to pick up on what I’d left off from my previous integration work.

Integration bits

Prior to Hack Week, orcus was already partially integrated; it was used to provide the backend functionality for Calc’s XML Source feature, and experimental support for Gnumeric file import. The XML Source side was pretty well integrated, but the normal file import side was only partially integrated. Lots of essential pieces were still missing, the largest of which were

  • support for multiple filters from a single external filter provider source (such as orcus),
  • progress indicator in the status bar, and
  • proper type detection by analyzing file content rather than its extension (which we call “deep detection”).

In short, I was able to complete the first two pieces during Hack Week, while the last item still has yet to be worked on. Aside from this, there are still more minor pieces missing, but perhaps I can work on the remaining bits during the next Hack Week.

Enabling orcus in your build

If you have a recent enough build from the master branch of the LibreOffice repository, you can enable imports via orcus library by

  1. checking the Enable experimental features box in the Options dialog, and
  2. setting the environment variable LIBO_USE_ORCUS to YES before launching Calc.

This will overwrite the stock import filters for ODS, XLSX and CSV. At present, orcus only performs file extension based detection rather than content based one, so be mindful of this when you try this on your machine. To go back to the current import filters, simply disable experimental features, or unset the environment variable.

Note that I’ve added this bits to showcase a preview of what orcus can potentially do as a future import filter framework. As such, never use this in production if you want stable file loading experience, or don’t file bugs against this. We are not ready for that yet. Orcus filters are still missing lots and lots of features.

Also note that, while in theory you could enable orcus with the Windows build, the performance of orcus on Windows may not be that impressive; in fact, in some cases slower than the current filters. That is because orcus relies on strtod and strtol system calls to convert string numbers into numeric values, and their implementation depend on the platform. And the performance of MSVC’s strtod implementation is known to be suboptimal compared to those of gcc and clang on Linux. I’m very much aware of this, and will work on addressing this at a later time.

Performance comparison

This is perhaps the most interesting part. I wanted to do a quick performance comparison and see how this orcus filter stands up against the current filter. Given the orcus filter is still only capable of importing raw cell values and not any other features or properties (not even cell formats), I’ve used this test file which only consists of raw text and numeric values in a 8-by-300000 range, to measure the load times that are as fair and representative as I could make them. Here is the result on my machine running openSUSE 11.4:

xlsx-load-times

The current filter, which has undergone its set of performance optimizations on raw cell values, still spends upwards of 50 seconds. Given that it used to take minutes to load this file, it’s still an improvement.

The orcus filter, on the other hand, combined with the heavily optimized load handler in Calc core that I put in place during Hack Week, can load the same file in 4.5 seconds. I would say that is pretty impressive.

I also measured the load time on the same file using Excel 2007, on the same machine running on top of wine, and the result was 7.5 seconds. While running an Windows app via wine emulation layer may incur some performance cost, this page suggests that it should not be noticeable, if any. And my own experience of running various versions of Excel via wine backs up that argument. So this number should be fairly representative of Excel’s native performance on the same hardware.

Considering that my ultimate goal with orcus is to beat Excel on performance on loading its own files (or at least not be slower than Excel), I would say we are making good progress toward that goal.

That’s all for today. Thank you, ladies and gentlemen.

Orcus integration into LibreOffice


Warning: WP_Syntax::substituteToken(): Argument #1 ($match) must be passed by reference, value given in /homepages/7/d105059906/htdocs/wp.kohei.us/wp-content/plugins/wp-syntax/wp-syntax.php on line 383

Last week was SUSE Hack Week, where we SUSE engineers were encouraged to be creative and work on whatever project that we had been dying to work on.

Given this opportunity, I decided to try integrating my orcus library project into LibreOffice proper to see how much improvement we could make in the performance of loading spreadsheet documents.

I’ll leave the detailed description and goal of orcus project for another blog post, but in short, orcus is an independent library designed to process spreadsheet documents, and is also designed to be useable from an application that would like to use it to load documents. It’s currently still work in progress, and is not even in alpha quality. So, I intentionally don’t release orcus library packages on an official basis.

Integration work

The main difficulty with integrating orcus into LibreOffice proper was dealing with the very intricate loading process that LibreOffice uses for all existing filters. It first goes through an elaborate type detection process, which loads the content of the file into memory in order for the type detection code to parse it. Once the correct type is determined, LibreOffice then instantiates correct frame loader and start the actual loading process. I’ve explained all of this in detail in this blog post of mine.

Orcus, on the other hand, only needs a file path, and it does the rest. And it pushes data to the call back functions provided by the client code as it parses the file. It was this difference in overall loading process that made the integration of orcus into LibreOffice all the more challenging. And even though the hack week itself lasted only one week, I had spent months prior to it just to study the type detection code and other auxiliary code that makes up the whole file loading process in order to come up with an elegant way to add hook for orcus.

Long story short, I was able to come up with a way to hook orcus such that LibreOffice relinquishes all its file loading to the orcus library, and only handles callbacks. To make this work, I first packaged orcus into an installable rpm package using the openSUSE build service, locally installed that package, then added –with-system-orcus configure option to allow LibreOffice to find the library. The entire change needed to add hook is condensed into this commit.

Using CSV filter as an experiment

As an initial experiment, I replaced the current csv import filter with one from orcus, just to see how this overall process works. The results are very encouraging.

With a very large csv file I created via this python script:

#!/usr/bin/env python
 
import sys
 
for i in xrange(0, 65536):
    for j in xrange(1, 101):
        val = i * 1.0 / j
        sys.stdout.write("%g,"%val)
    sys.stdout.write("end\n")

the current filter spends roughly 27 seconds to load this file, which is not too bad given the sheer size of the file (~50Mb). The orcus filter, on the other hand, spends only 11 seconds to load the same file.

However, the orcus filter code path still skips a number of steps that need to be performed if it were to be used in the production build, such as

  • drawing progress bar in the status bar area,
  • calculating row heights for rows that include multi-line cell contents, and
  • probably something else I forget to mention here.

Given some of these can be quite expensive, the above numbers may not be fully comparable. Despite that, these initial numbers show a great promise on the performance improvement that may result from using the orcus library.

Future work

First of all, we will not switch to the orcus csv filter anytime soon. Although I’d like to see that happen at some point in the future, there are still lots of missing pieces in the orcus csv filter that prevent us from using it in the production build. My plan with orcus is therefore limited to addition of new filters, and my immediate plan is to develop new XML import and export filters using orcus, and integrate it into LibreOffice. This should also provide a stepping stone for any additional filters that may come up later, as well as replacing some of the existing filters as the need arises.

That’s all for now. Thanks for reading!

LibreOffice Conference 2011

So, it was a real pleasure to be a part of the very first LibreOffice conference held in Paris, France. Some of the faces and names were familiar from the old OOo conferences, but the atmosphere of the conference was very different from the OOo ones in the past. I have been to the 2007 Barcelona conference and the 2009 Orvieto one, and I have to say, while there were some rough-edges, this is by-far my favorite OOo/LibO conference to date.

The only regret I have is that, because I had another international trip (to South Korea) only a week prior to the conference, I felt pretty much exhausted most of the time I was there. But I think I managed to chat with most of the people I needed to chat with during this once-a-year event. I intentionally tried not to hack too much during this conference, mainly because of my travel fatigue, but also because I felt it was more important to see people and talk to them to have a good feel for each other. Working from home, I sometimes miss the human interaction that people who work in the office probably take for granted, so this conference was a perfect place to fulfill that need, to make me feel human again. ;-) (Actually I tried to code a bit during the conference, but apparently my brain wasn’t cooperating at all I decided it probably wasn’t a good idea).

Anyway, it was good to see and chat with Markus Mohrhard (moggi), a very active Calc hacker who’s been instrumental in Calc’s filter test development in recent days. We discussed on various topics on Calc development since we work together in that code.

Also, Laurent Godard, whom I’ve known many years from the OOo days, but never met face-to-face.

And Valek Filippov, who happens to be in the same timezone as I. There aren’t many of us left in this LibreOffice circle, unfortunately. I tried to persuade him into this wonderful world of hacking, but so far he’s successfully fended off my attack.

It was also nice to chat with Michael Meeks at length, to clarify the new Calc cell storage structure that he and I discussed previously. Now the concept is very much clear, waiting to be coded.

Of course, many other countless hackers I’ve had beer with during the conference week, it was a real pleasure.

Now, I got some homework to do based on my interaction with various people during the conference. I will list them up item by item to use as a reminder.

  • Two Calc bugs from Valek. Both are related to this 1C program that pretty much everyone in Russia uses. I’ve already added them to my 3.5 TODO list, so it’s just a matter of finding time to tackle them unless something tricky comes out.
  • Some documentation on how to use the ixion library. Since there were some interests on using ixion to support formula calculations in other applications, I should probably start working on producing documentation on ixion, both on how to build it, and how to use it. I should also create a package for it while I’m at it.
  • Support for temporary cell buffer in the orcus library, to allow converting cell values before passing them to the client code. In some cases we can’t simply push the cell value as-is but convert it first before passing it to the client code. Typical examples are double quotes as a literal quote in CSV, as well as encoded characters (e.g. &) in XML/HTML. This will unfortunately cost us a bit for the allocation of the buffer and copying of the char array, but fortunately we don’t need to do this for all cells.
  • And lots and lots more.

All in all, I was glad to be a part of this successful conference. The atmosphere was very much all inclusive and personal, exactly how an open source conference should be.