Performance benchmark on mdds R-tree

I’d like to share the results of the quick benchmark tests I’ve done to measure the performance of the R-tree implementation included in the mdds library since 1.4.0.

Brief overview on R-tree

R-tree is a data structure designed for optimal query performance on spatial data. It is especially well suited when you need to store a large number of spatial objects in a single store and need to perform point- or range-based queries. The version of R-tree implemented in mdds is a variant known as R*-tree, which differs from the original R-tree in that it occasionally forces re-insertion of stored objects when inserting a new object would cause the target node to exceed its capacity. The original R-tree would simply split the node unconditionally in such cases. The reason behind R*-tree’s choice of re-insertion is that re-insertion would result in the tree being more balanced than simply splitting the node without re-insertion. The downside of such re-insertion is that it would severely affect the worst case performance of object insertion; however, it is claimed that in most real world use cases, the worst case performance would rarely be hit.

That being said, the insertion performance of R-tree is still not very optimal especially when you need to insert a large number of objects up-front, and unfortunately this is a very common scenario in many applications. To mitigate this, the mdds implementation includes a bulk loader that is suitable for mass-insertion of objects at tree initialization time.

What is measured in this benchmark

What I measured in this benchmark are the following:

  • bulk-loading of objects at tree initialization,
  • the size() method call, and
  • the average query performance.

I have written a specially-crafted benchmark program to measure these three categories, and you can find its source code here. The size() method is included here because in a way it represents the worst case query scenario since what it does is visit every single leaf node in the entire tree and count the number of stored objects.

The mdds implementation of R-tree supports arbitrary dimension sizes, but in this test, the dimension size was set to 2, for storing 2-dimensional objects.

Benchmark test design

Here is how I designed my benchmark tests.

First, I decided to use map data which I obtained from OpenStreetMap (OSM) for regions large enough to contain the number of objects in the millions. Since OSM does not allow you to specify a very large export region from its web interface, I went to the Geofabrik download server to download the region data. For this benchmark test, I used the region data for North Carolina, California, and Japan’s Chubu region. The latitude and longitude were used as the dimensions for the objects.

All data were in the OSM XML format, and I used the XML parser from the orcus project to parse the input data and build the input objects.

Since the map objects are not necessarily of rectangular shape, and not necessarily perfectly aligned with the latitude and longitude axes, the test program would compute the bounding box for each map object that is aligned with both axes before inserting it into R-tree.

To prevent the XML parsing portion of the test to affect the measurement of the bulk loading performance, the map object data gathered from the input XML file were first stored in a temporary store, and then bulk-loaded into R-tree afterward.

To measure the query performance, the region was evenly split into 40 x 40 sub-regions, and a point query was performed at each point of intersection that neighbors 4 sub-regions. Put it another way, a total of 1521 queries were performed at equally-spaced intervals throughout the region, and the average query time was calculated.

Note that what I refer to as a point query here is a type of query that retrieves all stored objects that intersects with a specified point. R-tree also allows you to perform area queries where you specify a 2D area and retrieve all objects that overlap with the area. But in this benchmark testing, only point queries were performed.

For each region data, I ran the tests five times and calculated the average value for each test category.

It is worth mentioning that the machine I used to run the benchmark tests is a 7-year old desktop machine with Intel Xeon E5630, with 4 cores and 8 native threads running Ubuntu LTS 1804. It is definitely not the fastest machine by today’s standard. You may want to keep this in mind when reviewing the benchmark results.

Benchmark results

Without further ado, these are the actual numbers from my benchmark tests.

The Shapes column shows the numbers of map objects included in the source region data. When comparing the number of shapes against the bulk-loading times, you can see that the bulk-loading time scales almost linearly with the number of shapes:

You can also see a similar trend in the size query time against the number of shapes:

The point query search performance, on the other hand, does not appear to show any correlation with the number of shapes in the tree:

This makes sense since the structure of R-tree allows you to only search in the area of interest regardless of how many shapes are stored in the entire tree. I’m also pleasantly surprised with the speed of the query; each query only takes 5-6 microseconds on this outdated machine!

Conclusion

I must say that I am overall very pleased with the performance of R-tree. I can already envision various use cases where R-tree will be immensely useful. One area I’m particularly interested in is spreadsheet application’s formula dependency tracking mechanism which involves tracing through chained dependency targets to broadcast cell value changes. Since the spreadsheet organizes its data in terms of row and column positions which is 2-dimensional, and many queries it performs can be considered spatial in nature, R-tree can potentially be useful for speeding things up in many areas of the application.

LibreOffice Development Talk at Triangle C++ Developer’s Group

It was a pleasure to have been given an opportunity to talk about LibreOffice development the other day at the Triangle C++ Developer’s Group. Looking back, what we went through was a mixture of hardship, accomplishments, and learning experience intertwined in such a unique fashion. It was great to be able to talk about it and hopefully it was entertaining enough to those of you who decided to show up to my talk.

Here is a link to the slides I used during my talk.

Thanks again, everyone!

Edit: Here is a PDF version of my slides for those of you who don’t have a program that can open odp files.

Orcus 0.11.0

I’m very pleased to announce that version 0.11.0 of the orcus library is officially out in the wild! You can download the latest source package from the project’s home page.

Lots of changes went into this release, but the two that I would highlight most are the inclusions of JSON and YAML parsers and their associated tools and interfaces. This release adds two new command-line tools: orcus-json and orcus-yaml. The orcus-json tool optionally handles JSON references to external files when the --resolves-refs option is given, though currently it only supports resolving external files that are on the local file system and only when the paths are relative to the referencing file.

I’ve also written an API documentation on the JSON interface in case someone wants to give it a try. Though the documentation on orcus is always work-in-progress, I’d like to spend more time to make the documentation in a more complete state.

On the import filter front, Markus Mohrhard has been making improvements to the ODS import filter especially in the area of styles import. Oh BTW, he is also proposing to mentor a GSOC project on this front under the LibreOffice project. So if you are interested, go and take a look!

That’s all I have at the moment. Thank you, ladies and gentlemen.

Ixion 0.11.0

Version 0.11.0 of the Ixion library has been just released. You can download it from the project’s home page.

Here is the full list of changes since 0.9.1.

  • C++11 is a hard requirement.
  • implement R1C1 formula name resolver.
  • remove boost dependency from the public headers (except for boost::thread).
  • fix incorrect life-cycle management of pooled string instances.
  • make it buildable on OSX.
  • other general code cleanups.
  • python
    • correctly catch and translate general_error into python’s, for Document.append_sheet() method.
    • make python module build configurable.
    • add ixion.column_label() to convert numeric column indices into column labels. A1 and R1C1 are supported.

mdds 1.1.0

I’m pleased to announce the availability of mdds 1.1.0. As always, the source package can be downloaded from the project’s home page.

This release includes the addition of 2 new data structures – trie_map and packed_trie_map, significant performance improvement on sorted_string_map, general bug fixes on some of the existing data structures, enhancement on multi_type_matrix, and support for user-defined event handlers for multi_type_vector.

Huge thanks to Markus Mohrhard for sorted_string_map’s performance improvement as well as the bug fixes and the enhancement on multi_type_matrix’s walk() method.

In addition, thanks to David Tardon, we now use automake as our build system which will simplify the process of package generation and integrity check among other things.

Here is the full list of changes since version 1.0.0:

  • all
    • switched our build system to using automake.
  • packed_trie_map (new)
    • new data structure that implements a trie also known as a prefix tree. This implementation requires all key values be known at construction time, after which its content is considered immutable. Internally it packs all its nodes in a single contiguous array for space and lookup efficiencies.
  • trie_map (new)
    • new data structure that implements a trie. It works similar to packed_trie_map except that this version is mutable.
  • multi_type_matrix
    • added a variant of walk() that takes the upper-left and lower-right corners to allow walking through a subset of the original matrix.
  • multi_type_vector
    • fixed incorrect return values of the increment and decrement operators of in-block iterators. They would previously return a value_type pointer which did not conform to the behaviors of STL iterators.
    • added support for custom event handlers for element block acquisitions and releases.
  • flat_segment_tree
    • fixed incorrect return values of the increment and decrement operators of its leaf-node iterators as in multi_type_vector’s fix.
  • sorted_string_map
    • significantly improved the performance of its find() method by switching from using linear search to using binary search. The improvement is especially visible with a large number of elements.

Documentation

I’ve also added Doxygen documentation for this library for those who are more used to the Doxygen style comprehensive code documentation. The official API documentation has also received some love in the code examples for multi_type_vector. I plan on adding more code examples to the documentation as time permits.

LibreOffice mini-Conference 2016 in Osaka

Night view in Osaka, overlooking the Metropolitan Expressway.
Night view in Osaka, overlooking the Metropolitan Expressway.

Keynote

First off, let me just say that it was such an honor and pleasure to have had the opportunity to present a keynote at the LibreOffice mini-Conference in Osaka. It was a bit surreal to be given such an opportunity almost one year after my involvement with LibreOffice as a paid full-time engineer ended, but I’m grateful that I can still give some tales that some people find interesting. I must admit that I haven’t been that active since I left Collabora in terms of the number of git commits to the LibreOffice core repository, but that doesn’t mean that my passion for that project has faded. In reality it is far from it.

There were a lot of topics I could potentially have covered for my keynote, but I chose to talk about the 5-year history of the project, simply because I felt that we all deserved to give ourselves a lot of praises for numerous great things we’ve achieved in this five years time, which not many of us do simply because we are all very humble beings and always too eager to keep moving forward. I felt that, sometimes, we do need to stop for a moment, look back and reflect on what we’ve done, and enjoy the fruits of our labors.

Osaka

Though I had visited Kyoto once before, this was actually my first time in Osaka. Access from the Kansai International Airport (KIX) into the city was pretty straightforward. The venue was located on the 23th floor of Grand Front Osaka North Building Tower B (right outside the north entrance of JR Osaka Station), on the premises of GMO DigiRock who kindly sponsored the space for the event.

Osaka Station north entrance.
Osaka Station north entrance.

Conference

The conference took place on Saturday January 9th of 2016. The conference program consisted of my keynote, followed by four regular-length talks (30 minutes each), five lightning talks (5 minutes each), and round-table discussions at the end. Topics of the talks included: potential use of LibreOffice in high school IT textbooks, real-world experiences of large-scale migration from MS Office to LibreOffice, LibreOffice API how-tos, and to LibreOffice with NVDA the open source screen reader.

After the round-table discussions, we had some social event with beer and pizza before we concluded the event. Overall, 48 participants showed up for the conference.

Conference venue.
Conference venue.

Videos of the conference talks are made available on YouTube thanks to the effort of the LibreOffice Japanese Language Team.

Slides for my keynote are available here.

Hackfest

We also organized a hackfest on the following day at JUSO Coworking. A total of 20 plus people showed up for the hackfest, to work on things like translating the UI strings to Japanese, authoring event-related articles, and of course hacking on LibreOffice. I myself worked on implementing simple event callbacks in the mdds library, which, by the way, was just completed and merged to the master branch today.

Many folks hard at work during hackfest.
Many folks hard at work during hackfest.

Conclusion

It was great to see so many faces, new and old, many of whom traveled long distance to attend the conference. I was fortunate enough to be able to travel all the way from North Carolina across the Pacific, and it was well worth the hassle of a jet lag.

Last but not least, be sure to check out the article (in Japanese) Naruhiko Ogasawara has written up on the conference. The article goes in-depth with my keynote, and is very well written.

Other Pictures

I’ve taken quite a bit of pictures of the conference as well as of the city of Osaka in general. Jump over to this Facebook album I made of this event if you are interested.

mdds 1.0.0

A new version of mdds is out, and this time, we’ve decided to bump up the version to 1.0.0. As always, you can download it from the project’s main page.

Here is the highlight of this release.

First off, C++11 is now a hard requirement starting with this release. It’s been four years since the C++11 standard was finalized. It’s about time we made this a new baseline.

Secondly, we now have an official API documentation. It’s programatically generated from the source code documentation via Doxygen, Sphinx and Breathe. Huge thanks to the contributors of the aforementioned projects. You guys make publishing API documentation such a breathe (no pun intended).

This release has finally dropped mixed_type_matrix which has been deprecated for quite some time now in favor of multi_type_matrix.

The multi_type_vector data structure has received some performance optimization thanks to patches from William Bonnet.

Aside from that, there is one important bug fix in sorted_string_map, to fix false positives due to incorrect key matching.

API versioning

One thing I need to note with this release is the introduction of API versioning. Starting with this release, we’ll use API versions to flag any API-incompatible releases. Going forward, anytime we introduce an API-incompatible change, we’ll use the version of that release as the new API version. The API version will only contain major and minor components i.e. API versions can be 1.0, 1.2, 2.1 etc. but never 1.0.6, for instance. That also implies that we will never introduce API-incompatible changes in the micro releases.

The API version will be a part of the package name. For example, this release will have a package name of mdds-1.0 so that, when using tools like pkg-config to query for compiler/linker flags, you’ll need to query for mdds-1.0 instead of simply mdds. The package name will stay that way until we have another release with an API-incompatible change.

mdds 0.12.1

I’m happy to announce that mdds 0.12.1 is now out. You can download it from the project’s README page.

There are primarily two major changes from the previous release of 0.12.0 as explained below.

multi_type_vector

One is that multi_type_vector now has a new static method advance_position to increment or decrement the logical position of a position_type object by an arbitrary distance.

static position_type advance_position(const position_type& pos, int steps);

The implementation of this method has been contributed by Markus Mohrhard.

flat_segment_tree

Another major change in this release is with flat_segment_tree. Previously, flat_segment_tree had an unintentional constraint that the value_type must be of numeric type. In this release, that constraint has been officially lifted so that the user of this data structure can now store values of arbitrary types with this data structure. The credit goes to David Tardon for adding this nice improvement.

Other than that, there are no other changes from 0.12.0.

mdds on GitLab

Incidentally, the mdds project now has a new home at gitlab.com. The new URL for the project page is now

https://gitlab.com/mdds/mdds

If you need to include a project URL, be sure to use the new one.

Thank you, ladies and gentlemen!

Ixion 0.9.1 and its move to GitLab

Today I have two announcements to make.

First, the version 0.9.1 of the Ixion library is now available. You can download the 0.9.1 source package from the project’s main page.

This is purely a maintenance release to address portability problems in the python bindings as well as other minor build and packaging issues. Many thanks to David Tardon for single-handedly addressing these issues.

Now, here is the second announcement. We are officially moving the project’s home from the previous Gitorious one to the GitLab’s, following the announcement of the acquisition of Gitorious by GitLab and the imminent shutdown of the Gitorious hosting site resulted from the acquisition. The new official URL for the Ixion project will be https://gitlab.com/ixion/ixion. If you need to include an URL to the project, please use the new one from this point forward.

Thank you, ladies and gentlemen.

Orcus 0.7.1 is out

After more than a year since the release of 0.7.0, I’m once again very happy to announce that the version 0.7.1 of the Orcus library is now available for download. You can download the package from the project’s download page.

This is a maintenance release. It primarily includes bug fixes and build fixes since the 0.7.0 release with no new features. That said, the most notable aspect of this release is that it is buildable with the version 0.9.0 of the Ixion library which was just released a week ago. So, if you are trying to package and distribute the newly-released Ixion library but are unable to do so because of Orcus not being buildable with it, you might be interested in this release.

The next major upgrade will be 0.9.0 whose development is on-going on the current master branch. Hopefully that release won’t be too far away.