Benchmark results on mdds multi_type_vector

In this post, I’m going to share the results of some benchmark testing I have done on multi_type_vector, which is included in the mdds library. The benchmark was done to measure the impact of the change I made recently to improve the performance on block searches, which will affect a major part of its functionality.

Background

One of the data structures included in mdds, called multi_type_vector, stores values of different types in a single logical vector. LibreOffice Calc is one primary user of this. Calc uses this structure as its cell value store, and each instance of this value store represents a single column instance.

Internally, multi_type_vector creates multiple element blocks which are in turn stored in its parent array (primary array). This primary array maps a logical position of a value to the actual block instance that stores it. Up to version 1.5.0, this mapping process involved a linear search that always starts from the first block of the primary array. This was because each element block, though it stores the size of the block, does not store its logical position. So the only way to find the right element block that intersects the logical position of a value is to scan from the first block then keep accumulating the sizes of the encountered blocks.
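The linear lookup described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the actual mdds code: toy_block and find_block_linear are hypothetical names, and a block is reduced to nothing but its size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an element block: only the size is stored,
// not the logical position (as described for 1.5.0 and earlier).
struct toy_block
{
    std::size_t size;
};

// Return the index of the block that contains logical position `pos`,
// scanning from block 0 and accumulating block sizes along the way.
std::size_t find_block_linear(const std::vector<toy_block>& blocks, std::size_t pos)
{
    std::size_t start = 0; // logical position where the current block begins
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        if (pos < start + blocks[i].size)
            return i;
        start += blocks[i].size;
    }

    // Reaching this point means `pos` lies past the last block.
    assert(false && "logical position is out of range");
    return blocks.size();
}
```

With block sizes 3, 2 and 5, position 4 resolves to block 1 after visiting block 0 first. The cost of each lookup grows with the number of blocks that precede the target.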

The reason for not storing the logical positions of the blocks was to avoid having to update them after shifting the blocks after value insertion, which is quite common when editing spreadsheet documents.

Of course, sometimes one has to perform repeated searches to access a number of element values across a number of element blocks, in which case, always starting the search from the first block, or block 0, in every single search can be prohibitively expensive, especially when the vector is heavily fragmented.

To alleviate this, multi_type_vector provides the concept of position hints, which allows the caller to start the search from block N where N > 0. Most of multi_type_vector’s methods return a position hint which can be used for the next search operation. This allows the caller to chain all necessary search operations so that the primary array is scanned only once for the entire sequence of search operations. The only prerequisite is that access to the elements occurs in perfect ascending order. For the most part, this approach worked quite well.
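To make the idea of a position hint concrete, here is a minimal sketch on a toy model. The names toy_hint and find_block_hinted are hypothetical and the real multi_type_vector interface looks different. The point is only that each search resumes where the previous one stopped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hint: the block where the previous search ended, plus the
// logical position at which that block starts.
struct toy_hint
{
    std::size_t block_index = 0;
    std::size_t block_start = 0;
};

// Find the block containing `pos`, resuming the scan from `hint`.
// `sizes` holds the size of each block in order.  The hint must not point
// past `pos`, which is why accesses have to be made in ascending order.
toy_hint find_block_hinted(const std::vector<std::size_t>& sizes, toy_hint hint, std::size_t pos)
{
    assert(hint.block_start <= pos);

    while (hint.block_index < sizes.size() &&
           pos >= hint.block_start + sizes[hint.block_index])
    {
        hint.block_start += sizes[hint.block_index];
        ++hint.block_index;
    }

    // block_index == sizes.size() means `pos` is out of range.
    return hint;
}
```

Feeding the returned hint into the next call means that a whole sequence of ascending lookups walks the block array only once in total.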

The downside of this is that there are times you need to access multiple element positions and you cannot always arrange your access pattern to take advantage of the position hints. This is especially the case during the multi-threaded formula cell execution routine, which Calc introduced some versions ago. This has motivated us to switch to an alternative lookup algorithm, and binary search was the obvious replacement.

Switch from linear search to binary search

The challenge in switching from linear search to binary search was to refactor multi_type_vector’s implementation to store the logical positions of the element blocks and update them in real time as the vector gets modified. The good news is that, as of this writing, all necessary changes have been done, and the current master branch fully implements binary-search-based block position lookup in all of its operations.
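Once every block carries its own logical position, the lookup can be expressed with a standard binary search. The following is a rough sketch of that idea rather than the code in the master branch. positioned_block and find_block_binary are hypothetical names, and the blocks are assumed to be contiguous and to start at position 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block that stores its logical start position as well as its size.
struct positioned_block
{
    std::size_t position;
    std::size_t size;
};

// Return the index of the block containing `pos` using binary search.
// The caller is expected to pass a position that lies within the vector.
std::size_t find_block_binary(const std::vector<positioned_block>& blocks, std::size_t pos)
{
    assert(!blocks.empty() && blocks.front().position == 0);

    // Find the first block whose start position is greater than `pos`...
    auto it = std::upper_bound(
        blocks.begin(), blocks.end(), pos,
        [](std::size_t p, const positioned_block& b) { return p < b.position; });

    // ...so the block just before it is the one that contains `pos`.
    return static_cast<std::size_t>(it - blocks.begin()) - 1;
}
```

The lookup cost becomes logarithmic in the number of blocks, at the price of having to keep every stored position up to date whenever blocks are shifted.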

Benchmarks

To get a better idea on how this change will affect the performance profile of multi_type_vector, I ran some benchmarks, using both mdds version 1.5.0 – the latest stable release that still uses linear search, and mdds version 1.5.99 – the current development branch which will eventually become the stable 1.6.0 release. The benchmark tested the following three scenarios:

  1. set() that modifies the block layout of the primary array. This test sets a new value to an empty vector at positions that monotonically increase by 2, until it reaches the end of the vector.
  2. set() that updates the value of the last logical element of the vector. The update happens without modifying the block layout of the primary array. Like the first test, this one also measures the performance of the block position lookup, but since the block count does not change, it is expected that the block position lookup comprises the bulk of its operation.
  3. insert() that inserts a new element block at the logical mid-point of the vector and shifts all the elements that occur below the point of insertion. The primary array of the vector is made to be already heavily fragmented prior to the insertion. This test involves both block position lookup as well as shifting of the element blocks. Since the new multi_type_vector implementation will update the positions of element blocks whose logical positions have changed, this test is designed to measure the cost of this extra operation, which was not performed in 1.5.0.

In each of these scenarios, the code executed the target method N times, where N was specified to be 10,000, 50,000, or 100,000. Each test was run twice, once with position hints and once without them. Each individual run was then repeated five times and the average duration was computed. In this post, I will only include the results for N = 100,000 in the interest of space.
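For readers who want to reproduce this kind of measurement, a timing helper along the following lines would do. It is a generic sketch and not the harness used in the perf-test repository; average_duration is a hypothetical name, and the default of five repeats simply mirrors the description above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` `repeats` times and return the average duration in seconds.
double average_duration(const std::function<void()>& work, int repeats = 5)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;

    double total = 0.0;
    for (int i = 0; i < repeats; ++i)
    {
        auto start = clock::now();
        work();
        std::chrono::duration<double> elapsed = clock::now() - start;
        total += elapsed.count();
    }

    return total / repeats;
}
```

The callable passed in would wrap one full run of a scenario, for instance N calls to the method under test on a freshly prepared vector.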

All binaries used in this benchmark were built with a release configuration i.e. on Linux, gcc with -O3 -DNDEBUG flags was used to build the binaries, and on Windows, MSVC (Visual Studio 2017) with /MD /O2 /Ob2 /DNDEBUG flags was used.

All of the source code used in this benchmark is available in the mdds perf-test repository hosted on GitLab.

The benchmarks were performed on machines running either Linux (Ubuntu 18.04 LTS) or Windows with a variety of CPUs with varying numbers of native threads. The following table summarizes all test environments used in this benchmark:

It is very important to note that, because of the disparity in OS environments, compilers and compiler flags, one should NOT compare the absolute values of the timing data to draw any conclusions about the CPUs’ performance relative to each other.

Results

Scenario 1: set value at monotonically increasing positions

This scenario tests a set of operations that consists of first seeking the position of a block that intersects with the logical position, then setting a new value to that block, which causes that block to split and a new value block to be inserted at the point of the split. The test repeats this process 100,000 times, and in each iteration the block search distance progressively increases as the total number of blocks increases. In Calc’s context, scenarios like this are very common, especially during file load.

Without further ado, here are the results:

You can easily see that the binary search (1.5.99) achieves nearly the same performance as the linear search with position hints in 1.5.0. Although not very visible in these figures due to the scale of the y-axes, position hints are still beneficial and do provide small but consistent timing reduction in 1.5.99.

Scenario 2: set at last position

The nature of what this scenario tests is very similar to that of the previous scenario, but the cost of the block position lookup is much more emphasized while the cost of the block creation is eliminated. Although the average durations in 1.5.0 without position hints are consistently higher than their equivalent values from the previous scenario across all environments, the overall trends do remain similar.

Scenario 3: insert and shift

This last scenario was included primarily to test the cost of updating the stored block positions after the blocks get shifted, as well as to quantify how much increase this overhead would cause relative to 1.5.0. In terms of Calc use case, this operation roughly corresponds with inserting new rows and shifting of existing non-empty rows downward after the insertion.

Without further ado, here are the results:

These results do indicate that, when compared to the average performance of 1.5.0 with position hints, the same operation can be 4 to 6 times more expensive in 1.5.99. Without position hints, the new implementation is more expensive to a much lesser degree. Since the scenario tested herein is largely bottlenecked by the block position updates, use of position hints seems to only provide marginal benefit.

Adding parallelism

Faced with this dilemma of increased overhead, I did some research to see if there is a way to reduce the overhead. The suspect code in question is in fact a very simple loop, and all it does is add a constant value to a known number of blocks:

template<typename _CellBlockFunc, typename _EventFunc>
void multi_type_vector<_CellBlockFunc, _EventFunc>::adjust_block_positions(size_type block_index, size_type delta)
{
    size_type n = m_blocks.size();
 
    if (block_index >= n)
        return;
 
    for (; block_index < n; ++block_index)
        m_blocks[block_index].m_position += delta;
}

Since the individual block positions can be updated entirely independent of each other, I decided it would be worthwhile to experiment with the following two types of parallelization techniques. One is loop unrolling, the other is OpenMP. I found these two techniques attractive for this particular case, for they both require very minimal code change.

Adding support for OpenMP was rather easy, since all one has to do is to add a #pragma line immediately above the loop you intend to parallelize, and add an appropriate OpenMP flag to the compiler when building the code.

Adding support for loop unrolling took a little fiddling around, but eventually I was able to make the necessary change without breaking any existing unit test cases. After some quick experimentation, I settled with updating 8 elements per iteration.

After these changes were done, the above original code turned into this:

template<typename _CellBlockFunc, typename _EventFunc>
void multi_type_vector<_CellBlockFunc, _EventFunc>::adjust_block_positions(int64_t start_block_index, size_type delta)
{
    int64_t n = m_blocks.size();
 
    if (start_block_index >= n)
        return;
 
#ifdef MDDS_LOOP_UNROLLING
    // Ensure that the section length is divisible by 8.
    int64_t len = n - start_block_index;
    int64_t rem = len % 8;
    len -= rem;
    len += start_block_index;
    #pragma omp parallel for
    for (int64_t i = start_block_index; i < len; i += 8)
    {
        m_blocks[i].m_position += delta;
        m_blocks[i+1].m_position += delta;
        m_blocks[i+2].m_position += delta;
        m_blocks[i+3].m_position += delta;
        m_blocks[i+4].m_position += delta;
        m_blocks[i+5].m_position += delta;
        m_blocks[i+6].m_position += delta;
        m_blocks[i+7].m_position += delta;
    }
 
    rem += len;
    for (int64_t i = len; i < rem; ++i)
        m_blocks[i].m_position += delta;
#else
    #pragma omp parallel for
    for (int64_t i = start_block_index; i < n; ++i)
        m_blocks[i].m_position += delta;
#endif
}

I have made the loop-unrolling variant of this method a compile-time option and kept the original method intact to allow on-going comparison. The OpenMP part didn’t need any special pre-processing since it can be turned on and off via compiler flag with no impact to the code itself. I needed to switch the loop counter from the original size_type (which is a typedef to size_t) to int64_t so that the code can be built with OpenMP enabled on Windows, using MSVC. Apparently the Microsoft Visual C++ compiler requires the loop counter to be a signed integer for the code to even build with OpenMP enabled.
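The index arithmetic of the unrolled loop is easy to get wrong, so it can help to try it out on a plain array first. The sketch below is a simplified stand-in that operates on a std::vector of positions instead of m_blocks and leaves OpenMP out. shift_positions_unrolled is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Add `delta` to every element from index `start` onward, 8 elements per
// iteration, then handle the remaining (fewer than 8) elements one by one.
void shift_positions_unrolled(std::vector<std::int64_t>& positions, std::int64_t start, std::int64_t delta)
{
    assert(start >= 0);

    const std::int64_t n = static_cast<std::int64_t>(positions.size());
    if (start >= n)
        return;

    // Split the range into a part whose length is divisible by 8 and a tail.
    const std::int64_t rem = (n - start) % 8;
    const std::int64_t unrolled_end = n - rem;

    for (std::int64_t i = start; i < unrolled_end; i += 8)
    {
        positions[i] += delta;
        positions[i + 1] += delta;
        positions[i + 2] += delta;
        positions[i + 3] += delta;
        positions[i + 4] += delta;
        positions[i + 5] += delta;
        positions[i + 6] += delta;
        positions[i + 7] += delta;
    }

    for (std::int64_t i = unrolled_end; i < n; ++i)
        positions[i] += delta;
}
```

For a vector of 21 elements and a start index of 3, the unrolled part covers indices 3 through 18 in two iterations and the tail loop handles indices 19 and 20.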

With these changes in, I wrote a separate test code just to benchmark the insert-and-shift scenario with all permutations of loop-unrolling and OpenMP. The number of threads to use for OpenMP was not specified during the test, which would cause OpenMP to automatically use all available native threads.

With all of this out of the way, let’s look at the results:

Here, LU and OMP stand for loop unrolling and OpenMP, respectively. The results from each machine consist of four groups each having two timing values, one with 1.5.0 and one with 1.5.99. Since 1.5.0 uses neither loop unrolling nor OpenMP, its results show no variance between the groups, which is expected. The numbers for 1.5.99 are generally much higher than those of 1.5.0, but the use of OpenMP brings the numbers down considerably. Although how much OpenMP reduced the average duration varies from machine to machine, the number of available native threads likely plays some role. The reduction by OpenMP on Core i5 6300U (which comes with 4 native threads) is approximately 30%, the number on Ryzen 7 1700X (with 16 native threads) is about 70%, and the number on Core i7 4790 (with 8 native threads) is about 50%. The relationship between the native thread count and the rate of reduction somewhat follows a linear trend, though the numbers on Xeon E5-2697 v4, which comes with 32 native threads, deviate from this trend.

The effect of loop unrolling, on the other hand, is visible only to a much lesser degree; in all but two cases it has resulted in a reduction of 1 to 7 percent. The only exceptions are the Ryzen 7 without OpenMP, which showed an increase of nearly 16%, and the Xeon E5630 with OpenMP, which showed a slight increase of 0.1%.

The 16% increase with the Ryzen 7 environment may well be an outlier, since the other test in the same environment (with OpenMP enabled) did result in a reduction of 7% – the highest of all tested groups.

Interpreting the results

Hopefully the results presented in this post are interesting and provide insight into the nature of the change in multi_type_vector in the upcoming 1.6.0 release. But what does this all mean, especially in the context of LibreOffice Calc? These are my personal thoughts.

  • From my own observation of having seen numerous bug reports and/or performance issues from various users of Calc, I can confidently say that the vast majority of cases involve reading and updating cell values without shifting of cells, either during file load, or during executions of features that involve massive amounts of cell I/O’s. Since those cases are primarily bottlenecked by block position search, the new implementation will bring a massive win especially in places where use of position hints was not practical. That being said, the performance of block search will likely see no noticeable improvements even after switching to the new implementation when the code already uses position hints with the old implementation.

  • While the increased overhead in block shifting, which is associated with insertion or deletion of rows in Calc, is certainly a concern, it may not be a huge issue in day-to-day usage of Calc. It is worth pointing out that what the benchmark measures is repeated insertions and shifting of highly fragmented blocks, which translates to repeated insertions or deletions of rows in a Calc document where the column values consist of uniformly alternating types. In normal Calc usage, it is more likely that the user would insert or delete rows as one discrete operation, rather than a series of thousands of repeated row insertions or deletions. I am highly optimistic that Calc can absorb this extra overhead without its users noticing.

  • Even if Calc encounters a very unlikely situation where this increased overhead becomes visible at the UI level, enabling OpenMP, assuming that’s practical, would help lessen the impact of this overhead. The benefit of OpenMP grows as the number of native CPU threads increases.

What’s next?

I may invest some time looking into potential use of GPU offloading to see if that would further speed up the block position update operations. The benefit of loop unrolling was not as great as I had hoped, but this may be highly CPU and compiler dependent. I will likely continue to dig deeper into this and keep on experimenting.

Performance benchmark on mdds R-tree

I’d like to share the results of the quick benchmark tests I’ve done to measure the performance of the R-tree implementation included in the mdds library since 1.4.0.

Brief overview on R-tree

R-tree is a data structure designed for optimal query performance on spatial data. It is especially well suited when you need to store a large number of spatial objects in a single store and need to perform point- or range-based queries. The version of R-tree implemented in mdds is a variant known as R*-tree, which differs from the original R-tree in that it occasionally forces re-insertion of stored objects when inserting a new object would cause the target node to exceed its capacity. The original R-tree would simply split the node unconditionally in such cases. The reason behind R*-tree’s choice of re-insertion is that re-insertion would result in the tree being more balanced than simply splitting the node without re-insertion. The downside of such re-insertion is that it would severely affect the worst case performance of object insertion; however, it is claimed that in most real world use cases, the worst case performance would rarely be hit.

That being said, the insertion performance of R-tree is still not very optimal especially when you need to insert a large number of objects up-front, and unfortunately this is a very common scenario in many applications. To mitigate this, the mdds implementation includes a bulk loader that is suitable for mass-insertion of objects at tree initialization time.

What is measured in this benchmark

What I measured in this benchmark are the following:

  • bulk-loading of objects at tree initialization,
  • the size() method call, and
  • the average query performance.

I have written a specially-crafted benchmark program to measure these three categories, and you can find its source code here. The size() method is included here because in a way it represents the worst case query scenario since what it does is visit every single leaf node in the entire tree and count the number of stored objects.

The mdds implementation of R-tree supports arbitrary dimension sizes, but in this test, the dimension size was set to 2, for storing 2-dimensional objects.

Benchmark test design

Here is how I designed my benchmark tests.

First, I decided to use map data which I obtained from OpenStreetMap (OSM) for regions large enough to contain the number of objects in the millions. Since OSM does not allow you to specify a very large export region from its web interface, I went to the Geofabrik download server to download the region data. For this benchmark test, I used the region data for North Carolina, California, and Japan’s Chubu region. The latitude and longitude were used as the dimensions for the objects.

All data were in the OSM XML format, and I used the XML parser from the orcus project to parse the input data and build the input objects.

Since the map objects are not necessarily of rectangular shape, and not necessarily perfectly aligned with the latitude and longitude axes, the test program would compute the bounding box for each map object that is aligned with both axes before inserting it into R-tree.
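Computing an axis-aligned bounding box is a matter of taking the minimum and maximum of each coordinate over all points of the object. Here is a minimal sketch. The types geo_point and bounding_box and the function compute_bounding_box are hypothetical and are not taken from the benchmark program.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical 2D point with longitude and latitude as its two dimensions.
struct geo_point
{
    double lon;
    double lat;
};

// Axis-aligned box described by its minimum and maximum corners.
struct bounding_box
{
    geo_point min;
    geo_point max;
};

// Compute the smallest axis-aligned box that encloses all outline points.
bounding_box compute_bounding_box(const std::vector<geo_point>& outline)
{
    assert(!outline.empty());

    bounding_box box{outline.front(), outline.front()};
    for (const geo_point& p : outline)
    {
        box.min.lon = std::min(box.min.lon, p.lon);
        box.min.lat = std::min(box.min.lat, p.lat);
        box.max.lon = std::max(box.max.lon, p.lon);
        box.max.lat = std::max(box.max.lat, p.lat);
    }

    return box;
}
```

The resulting box is what would be handed to the tree as the extent of the object, regardless of the object’s actual shape.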

To prevent the XML parsing portion of the test from affecting the measurement of the bulk loading performance, the map object data gathered from the input XML file were first stored in a temporary store, and then bulk-loaded into R-tree afterward.

To measure the query performance, the region was evenly split into 40 x 40 sub-regions, and a point query was performed at each point of intersection that neighbors 4 sub-regions. Put another way, a total of 39 x 39 = 1521 queries were performed at equally-spaced intervals throughout the region, and the average query time was calculated.
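The count of 1521 follows from the fact that a 40 x 40 split has 39 interior grid lines along each axis. The sketch below generates such a set of query points. make_query_points is a hypothetical helper and not part of the benchmark source.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Generate the interior intersection points of a region that is split
// evenly into `divisions` x `divisions` sub-regions.
std::vector<std::pair<double, double>> make_query_points(
    double min_x, double min_y, double max_x, double max_y, int divisions)
{
    assert(divisions > 1);

    const double dx = (max_x - min_x) / divisions;
    const double dy = (max_y - min_y) / divisions;

    std::vector<std::pair<double, double>> points;

    // Skip index 0 and index `divisions`: points on the outer edge of the
    // region touch fewer than 4 sub-regions.
    for (int i = 1; i < divisions; ++i)
        for (int j = 1; j < divisions; ++j)
            points.emplace_back(min_x + i * dx, min_y + j * dy);

    return points;
}
```

Calling it with 40 divisions yields 39 x 39 = 1521 points, matching the number of queries quoted above.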

Note that what I refer to as a point query here is a type of query that retrieves all stored objects that intersect with a specified point. R-tree also allows you to perform area queries where you specify a 2D area and retrieve all objects that overlap with the area. But in this benchmark testing, only point queries were performed.

For each region data, I ran the tests five times and calculated the average value for each test category.

It is worth mentioning that the machine I used to run the benchmark tests is a 7-year-old desktop machine with Intel Xeon E5630, with 4 cores and 8 native threads running Ubuntu 18.04 LTS. It is definitely not the fastest machine by today’s standards. You may want to keep this in mind when reviewing the benchmark results.

Benchmark results

Without further ado, these are the actual numbers from my benchmark tests.

The Shapes column shows the numbers of map objects included in the source region data. When comparing the number of shapes against the bulk-loading times, you can see that the bulk-loading time scales almost linearly with the number of shapes:

You can also see a similar trend in the size query time against the number of shapes:

The point query search performance, on the other hand, does not appear to show any correlation with the number of shapes in the tree:

This makes sense since the structure of R-tree allows you to only search in the area of interest regardless of how many shapes are stored in the entire tree. I’m also pleasantly surprised with the speed of the query; each query only takes 5-6 microseconds on this outdated machine!

Conclusion

I must say that I am overall very pleased with the performance of R-tree. I can already envision various use cases where R-tree will be immensely useful. One area I’m particularly interested in is a spreadsheet application’s formula dependency tracking mechanism, which involves tracing through chained dependency targets to broadcast cell value changes. Since a spreadsheet organizes its data in terms of row and column positions, which is 2-dimensional, and many queries it performs can be considered spatial in nature, R-tree can potentially be useful for speeding things up in many areas of the application.

mdds 1.1.0

I’m pleased to announce the availability of mdds 1.1.0. As always, the source package can be downloaded from the project’s home page.

This release includes the addition of 2 new data structures – trie_map and packed_trie_map, significant performance improvement on sorted_string_map, general bug fixes on some of the existing data structures, enhancement on multi_type_matrix, and support for user-defined event handlers for multi_type_vector.

Huge thanks to Markus Mohrhard for sorted_string_map’s performance improvement as well as the bug fixes and the enhancement on multi_type_matrix’s walk() method.

In addition, thanks to David Tardon, we now use automake as our build system which will simplify the process of package generation and integrity check among other things.

Here is the full list of changes since version 1.0.0:

  • all
    • switched our build system to using automake.
  • packed_trie_map (new)
    • new data structure that implements a trie also known as a prefix tree. This implementation requires all key values be known at construction time, after which its content is considered immutable. Internally it packs all its nodes in a single contiguous array for space and lookup efficiencies.
  • trie_map (new)
    • new data structure that implements a trie. It works similar to packed_trie_map except that this version is mutable.
  • multi_type_matrix
    • added a variant of walk() that takes the upper-left and lower-right corners to allow walking through a subset of the original matrix.
  • multi_type_vector
    • fixed incorrect return values of the increment and decrement operators of in-block iterators. They would previously return a value_type pointer which did not conform to the behaviors of STL iterators.
    • added support for custom event handlers for element block acquisitions and releases.
  • flat_segment_tree
    • fixed incorrect return values of the increment and decrement operators of its leaf-node iterators as in multi_type_vector’s fix.
  • sorted_string_map
    • significantly improved the performance of its find() method by switching from using linear search to using binary search. The improvement is especially visible with a large number of elements.
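To illustrate the kind of change made in sorted_string_map, here is a small sketch of a binary-search lookup over a key-sorted array. It is not the library’s implementation. toy_entry and find_entry are hypothetical names, and the entries are assumed to be sorted by key beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical key-value entry; the table of entries must be sorted by key.
struct toy_entry
{
    const char* key;
    int value;
};

// Look up `key` in the sorted range [first, last) using binary search.
// Returns a pointer to the matching entry, or nullptr when there is none.
const toy_entry* find_entry(const toy_entry* first, const toy_entry* last, const char* key)
{
    assert(std::is_sorted(first, last,
        [](const toy_entry& a, const toy_entry& b) { return std::strcmp(a.key, b.key) < 0; }));

    // First entry whose key is not less than the search key.
    const toy_entry* it = std::lower_bound(first, last, key,
        [](const toy_entry& e, const char* k) { return std::strcmp(e.key, k) < 0; });

    return (it != last && std::strcmp(it->key, key) == 0) ? it : nullptr;
}
```

With a linear search the number of string comparisons grows in proportion to the number of entries, whereas here it grows only logarithmically, which is why the gain shows up mostly with large tables.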

Documentation

I’ve also added Doxygen documentation for this library for those who are more used to the Doxygen style comprehensive code documentation. The official API documentation has also received some love in the code examples for multi_type_vector. I plan on adding more code examples to the documentation as time permits.

mdds 1.0.0

A new version of mdds is out, and this time, we’ve decided to bump up the version to 1.0.0. As always, you can download it from the project’s main page.

Here is the highlight of this release.

First off, C++11 is now a hard requirement starting with this release. It’s been four years since the C++11 standard was finalized. It’s about time we made this a new baseline.

Secondly, we now have official API documentation. It’s programmatically generated from the source code documentation via Doxygen, Sphinx and Breathe. Huge thanks to the contributors of the aforementioned projects. You guys make publishing API documentation such a breathe (no pun intended).

This release has finally dropped mixed_type_matrix which has been deprecated for quite some time now in favor of multi_type_matrix.

The multi_type_vector data structure has received some performance optimization thanks to patches from William Bonnet.

Aside from that, there is one important bug fix in sorted_string_map, to fix false positives due to incorrect key matching.

API versioning

One thing I need to note with this release is the introduction of API versioning. Starting with this release, we’ll use API versions to flag any API-incompatible releases. Going forward, anytime we introduce an API-incompatible change, we’ll use the version of that release as the new API version. The API version will only contain major and minor components i.e. API versions can be 1.0, 1.2, 2.1 etc. but never 1.0.6, for instance. That also implies that we will never introduce API-incompatible changes in the micro releases.

The API version will be a part of the package name. For example, this release will have a package name of mdds-1.0 so that, when using tools like pkg-config to query for compiler/linker flags, you’ll need to query for mdds-1.0 instead of simply mdds. The package name will stay that way until we have another release with an API-incompatible change.

mdds 0.12.1

I’m happy to announce that mdds 0.12.1 is now out. You can download it from the project’s README page.

There are primarily two major changes from the previous release of 0.12.0 as explained below.

multi_type_vector

One is that multi_type_vector now has a new static method advance_position to increment or decrement the logical position of a position_type object by an arbitrary distance.

static position_type advance_position(const position_type& pos, int steps);

The implementation of this method has been contributed by Markus Mohrhard.

flat_segment_tree

Another major change in this release is with flat_segment_tree. Previously, flat_segment_tree had an unintentional constraint that the value_type must be of numeric type. In this release, that constraint has been officially lifted so that the user of this data structure can now store values of arbitrary types with this data structure. The credit goes to David Tardon for adding this nice improvement.

Other than that, there are no other changes from 0.12.0.

mdds on GitLab

Incidentally, the mdds project now has a new home at gitlab.com. The new URL for the project page is:

https://gitlab.com/mdds/mdds

If you need to include a project URL, be sure to use the new one.

Thank you, ladies and gentlemen!

mdds 0.12.0 is now out

I’m happy to announce that mdds 0.12.0 is now out. You can download it from the project’s download page:

https://code.google.com/p/multidimalgorithm/wiki/Downloads

The highlight of this release is mostly with the segment_tree data structure, where its value type previously only supported pointer types. Markus Mohrhard worked on removing that constraint from segment_tree so that you can now store values of arbitrary types just like you would expect from a template container.

Aside from that, there are some minor bug and build fixes. Users of the previous versions are encouraged to update to this version.

mdds 0.7.1 released

Ok. I’m actually a bit late on this announcement since 10 days have already passed since the actual release of 0.7.1. Anyhow, I will hereby announce that version 0.7.1 of Multi-Dimensional Data Structure (mdds) is out, which contains several critical bug fixes over the previous 0.7.0 release. You can download the source package from here:

http://multidimalgorithm.googlecode.com/files/mdds_0.7.1.tar.bz2

0.7.1 fixes several bugs in the set_empty() method of multi_type_vector. In the previous versions, the set_empty() method would fail to merge two adjacent empty blocks into a single block, which violated the basic requirement of multi_type_vector that it never allows two adjacent blocks of identical type. This caused other parts of multi_type_vector to fail as a result.
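The invariant in question can be pictured with a toy model in which each block is reduced to a type tag and a size. The following sketch restores the invariant by merging neighbors of the same type. It is only an illustration and not the actual fix, and toy_type, typed_block and merge_adjacent_blocks are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block type tags.
enum class toy_type { empty, numeric, string };

// Hypothetical block reduced to its type and the number of elements it holds.
struct typed_block
{
    toy_type type;
    std::size_t size;
};

// Merge neighboring blocks of the same type so that no two adjacent
// blocks share a type.
void merge_adjacent_blocks(std::vector<typed_block>& blocks)
{
    std::vector<typed_block> merged;
    for (const typed_block& b : blocks)
    {
        assert(b.size > 0);

        if (!merged.empty() && merged.back().type == b.type)
            merged.back().size += b.size; // extend the previous block
        else
            merged.push_back(b);
    }

    blocks.swap(merged);
}
```

Two adjacent empty blocks of sizes 3 and 1, for example, collapse into a single empty block of size 4.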

There are no API-incompatible changes since version 0.7.0. I highly recommend you update to 0.7.1 if you make heavy use of multi_type_vector and still use any versions older than 0.7.1.

mdds 0.7.0 released

I’m once again very happy to announce that version 0.7.0 of Multi-Dimensional Data Structure (mdds) is released and is available at the link below:

http://multidimalgorithm.googlecode.com/files/mdds_0.7.0.tar.bz2

All changes that went into this version since 0.6.1 are related to multi_type_vector. The highlights of the changes are:

  1. setter methods (set, set_empty, insert, and insert_empty) now return an iterator that references the block where the values are set or inserted,
  2. each of the above-referenced methods now has a variant that takes a position hint iterator for faster insertion, and
  3. several critical bug fixes.

There are no API-incompatible changes since 0.6.1. If you currently use version 0.6.1 and use multi_type_vector, you should upgrade to 0.7.0 as it contains several important bug fixes.

mdds 0.6.1 released

I’m once again very happy to announce that version 0.6.1 of Multi-Dimensional Data Structure (mdds) is released and is available at the link below:

http://multidimalgorithm.googlecode.com/files/mdds_0.6.1.tar.bz2

This is purely a bug fix release, and contains no new functionality since 0.6.0.

This release fixes a bug in the iterator implementation of flat_segment_tree. Prior to this release, the iterator would treat the position immediately before the end position to be the end position, which would result in incorrectly skipping the last data position during iteration. This release contains a fix for that bug.

It also contains fixes for various build errors and compiler warnings.

Many thanks to David Tardon, Stephan Bergmann, Tomáš Chvátal, and Markus Mohrhard for having submitted patches since the release of 0.6.0.

mdds::multi_type_matrix performance consideration

In my previous post, I explained the basic concept of multi_type_vector – one of the two new data structures added to mdds in the 0.6.0 release. In this post, I’d like to explain a bit more about multi_type_matrix – the other new structure added in the aforementioned release. It is also important to note that the addition of multi_type_matrix deprecates mixed_type_matrix, which is subject to removal in future releases.

Basics

In short, multi_type_matrix is a matrix data structure designed to allow storage of four different element types: numeric value (double), boolean value (bool), empty value, and string value. The string value type can be either std::string, or one provided by the user. Internally, multi_type_matrix is just a wrapper to multi_type_vector, which does most of the hard work. All multi_type_matrix does is to translate logical element positions in 2-dimensional space into one-dimensional positions, and pass them onto the vector. Using multi_type_vector has many advantages over the previous matrix class mixed_type_matrix both in terms of ease of use and performance.
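The translation from a 2-dimensional position to a one-dimensional one is simple index arithmetic. The sketch below assumes a column-major layout, which is an assumption made for illustration and is not stated in this post. matrix_size and to_vector_position are likewise hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical matrix dimensions.
struct matrix_size
{
    std::size_t rows;
    std::size_t cols;
};

// Translate a (row, col) position into a position in the underlying
// one-dimensional vector.  A column-major layout is assumed here.
std::size_t to_vector_position(matrix_size size, std::size_t row, std::size_t col)
{
    assert(row < size.rows && col < size.cols);
    return size.rows * col + row;
}
```

In a 5-by-5 matrix, the element at row 0 of column 1 would map to position 5 under this assumption, and the last element to position 24.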

One benefit of using multi_type_vector as its backend storage is that we no longer have to differentiate between densely populated and sparsely populated matrices. In mixed_type_matrix, the user would have to manually specify which backend type to use when creating an instance, and once created, it wasn’t possible to switch from one to the other unless you copied it wholesale. In multi_type_matrix, on the other hand, the user no longer has to specify the density type since the new storage is optimized for either density type.

Another benefit is the reduced storage cost and improved latency in memory access especially when accessing a sequence of element values at once. This is inherent in the use of multi_type_vector which I explained in detail in my previous post. I will expand on the storage cost of multi_type_matrix in the next section.

Storage cost

The new multi_type_matrix structure provides better storage efficiency in most cases. I’ll illustrate this by using the two opposite extreme density cases.

First, let’s assume we have a 5-by-5 matrix that’s fully populated with numeric values. The following picture illustrates how the element values of such numeric matrix are stored.

In mixed_type_matrix with its filled-storage backend, the element values are either 1) stored in heap-allocated element objects whose pointers are stored in a separate array (middle right), or 2) stored directly in a one-dimensional array (lower right). Those initialized with empty elements employ the first variant, whereas those initialized with zero elements employ the second variant. The rationale behind using these two different storage schemes was the assumption that, in a matrix initialized with empty elements, most elements likely remain empty throughout its lifetime, whereas a matrix initialized with zero elements likely gets numeric values assigned to most of the elements for subsequent computations.

Also, each element in mixed_type_matrix stores its type as an enum value. Let’s assume that the size of a pointer is 8 bytes (the world is moving toward 64-bit systems these days), that of a double is 8 bytes, and that of an enum is 4 bytes. The total storage cost of a 5-by-5 matrix will be 8 x 25 + (8 + 4) x 25 = 500 bytes for empty-initialized matrix, and (8 + 4) x 25 = 300 bytes for zero-initialized matrix.

In contrast, multi_type_matrix (upper right) stores the same data using a single array of doubles, whose memory address is stored in a separate block array. This block array also stores the type of each block (int) and its size (size_t). Since we only have one numeric block, it only stores one int value, one size_t value, and one pointer value for the whole block. With that, the total storage cost of a 5-by-5 matrix will be 8 x 25 + 4 + 8 + 8 = 220 bytes. Suffice it to say that it’s less than half the storage cost of empty-initialized mixed_type_matrix, and roughly 27% less than that of zero-initialized mixed_type_matrix.
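For those who like to double-check arithmetic, the dense-case figures can be reproduced with a small sketch. The byte sizes are the ones assumed in the text for a 64-bit system, not values measured from a real build, and the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Sizes assumed in the text (64-bit system), not measured values.
const std::size_t ptr_size = 8;     // pointer
const std::size_t double_size = 8;  // double
const std::size_t enum_size = 4;    // element type enum
const std::size_t int_size = 4;     // block type
const std::size_t size_t_size = 8;  // block size

// mixed_type_matrix, filled storage, empty-initialized: one pointer per
// position plus one heap-allocated element (value + type) per position.
inline std::size_t mixed_filled_empty_cost(std::size_t n)
{
    return ptr_size * n + (double_size + enum_size) * n;
}

// mixed_type_matrix, filled storage, zero-initialized: elements are
// stored directly in the array.
inline std::size_t mixed_filled_zero_cost(std::size_t n)
{
    return (double_size + enum_size) * n;
}

// multi_type_matrix holding one numeric block: the raw double array plus
// one block entry (type, size, pointer to the array).
inline std::size_t multi_single_block_cost(std::size_t n)
{
    return double_size * n + int_size + size_t_size + ptr_size;
}

// Reproduces the figures quoted for the fully populated 5-by-5 matrix.
inline void check_dense_example()
{
    const std::size_t n = 5 * 5;
    assert(mixed_filled_empty_cost(n) == 500);
    assert(mixed_filled_zero_cost(n) == 300);
    assert(multi_single_block_cost(n) == 220);
}
```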

Now let’s take a look at the other end of the density spectrum. Say, we have a very sparsely-populated 5-by-5 matrix, and only the top-left and bottom-right elements are non-empty like the following illustration shows:

In mixed_type_matrix with its sparse-storage backend (lower right), the element values are stored in heap-allocated element objects which are in turn stored in nested balanced binary trees. The space requirement of the sparse-storage backend varies depending on how the elements are spread out, but in this particular example, it takes one 5-node tree, one 2-node tree, four single-node trees, and six element instances. Let’s assume that each node in each of these trees stores 3 pointers (pointer to the left node, pointer to the right node and pointer to the value), which makes up 24 bytes of storage per node. Multiplying that by 11 makes 24 x 11 = 264 bytes of storage. With each element instance requiring 12 bytes of storage, the total storage cost comes to 24 x 11 + 12 x 6 = 336 bytes.

In multi_type_matrix (upper right), the primary array stores three element blocks each of which makes up 20 bytes of storage (one pointer, one size_t and one int). Combine that with one 2-element array (16 bytes) and one 4-element array (32 bytes), and the total storage comes to 20 x 3 + 8 x (2 + 4) = 108 bytes. This clearly shows that, even in this extremely sparse density case, multi_type_matrix provides better storage efficiency than mixed_type_matrix.
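The sparse-case totals can be checked the same way. This sketch plugs in the counts that appear in the two formulas above (11 tree nodes and 6 element instances on one side, 3 blocks and 2 + 4 doubles on the other); the sizes are again the assumed 64-bit ones, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sizes assumed in the text for a 64-bit system.
const std::size_t ptr_size = 8;
const std::size_t double_size = 8;
const std::size_t int_size = 4;
const std::size_t size_t_size = 8;
const std::size_t element_size = 12; // double value + enum type

// Sparse mixed_type_matrix: tree nodes of three pointers each, plus the
// heap-allocated element instances they point to.
inline std::size_t mixed_sparse_cost(std::size_t nodes, std::size_t elements)
{
    return 3 * ptr_size * nodes + element_size * elements;
}

// multi_type_matrix: one entry per block in the primary array (pointer,
// size and type), plus the doubles held in the numeric data arrays.
inline std::size_t multi_blocks_cost(std::size_t blocks, std::size_t numeric_values)
{
    return (ptr_size + size_t_size + int_size) * blocks + double_size * numeric_values;
}

// Reproduces the two totals quoted for the sparse 5-by-5 example.
inline void check_sparse_example()
{
    assert(mixed_sparse_cost(11, 6) == 336);
    assert(multi_blocks_cost(3, 2 + 4) == 108);
}
```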

I hope these two examples are evidence enough that multi_type_matrix provides reasonable efficiency in either densely populated or sparsely populated matrices. The fact that one storage can handle either extreme also gives us more flexibility in that, even when a matrix object starts out sparsely populated then later becomes completely filled, there is no need to manually switch the storage structure as was necessary with mixed_type_matrix.

Run-time performance

Better storage efficiency with multi_type_matrix over mixed_type_matrix is one thing, but what’s equally important is how well it performs at run time. Unfortunately, the actual run-time performance largely depends on how it is used, and while it should provide good overall performance if used in ways that take advantage of its structure, it may perform poorly if used incorrectly.

In this section, I will provide performance comparisons between multi_type_matrix and mixed_type_matrix in several different scenarios, with the actual source code used to measure their performance. All performance comparisons are done in terms of total elapsed time in seconds required to perform each task. All elapsed times were measured in CPU time, and all benchmark code was compiled on openSUSE 12.1 64-bit using gcc 4.6.2 with the -Os compiler flag.
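For reference, one common way to take a CPU-time measurement in standard C++ is std::clock(). The helper below is a minimal sketch of that approach; measure_cpu_seconds is a hypothetical name, and this is not necessarily the exact timing harness used to produce the numbers in this post.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: runs a task once and returns the CPU time it
// consumed, in seconds, as reported by std::clock().
template<typename Func>
double measure_cpu_seconds(Func task)
{
    const std::clock_t start = std::clock();
    // std::clock() returns -1 when processor time is not available.
    assert(start != static_cast<std::clock_t>(-1));

    task();

    const std::clock_t end = std::clock();
    return static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

Any callable taking no arguments can be passed in, so each benchmark block can be wrapped in a small function object or lambda and timed in isolation.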

For the sake of brevity and consistency, the following typedef’s are used throughout the performance test code.

typedef mdds::mixed_type_matrix<std::string, bool>            mixed_mx_type;
typedef mdds::multi_type_matrix<mdds::mtm::std_string_trait>  multi_mx_type;

Instantiation

The first scenario is the instantiation of matrix objects. In this test, six matrix object instantiation scenarios are measured. In each scenario, a matrix object of 20000 rows by 8000 columns is instantiated, and the time it takes for the object to get fully instantiated is measured.

The first three scenarios instantiate a matrix object with zero element values. The first scenario instantiates mixed_type_matrix with filled storage backend, with all elements initialized to zero.

mixed_mx_type mx(20000, 8000, mdds::matrix_density_filled_zero);

Internally, this allocates a one-dimensional array and fills it with zero element instances.

The second case is just like the first one, the only difference being that it uses sparse storage backend.

mixed_mx_type mx(20000, 8000, mdds::matrix_density_sparse_zero);

With the sparse storage backend, all this does is allocate just one element instance to use as zero, and set the internal size value to the specified size. No allocation for the storage of any other elements occurs at this point. Thus, instantiating a mixed_type_matrix with sparse storage is a fairly cheap, constant-time process.

The third scenario instantiates multi_type_matrix with all elements initialized to zero.

multi_mx_type mx(20000, 8000, 0.0);

This internally allocates one numerical block containing a one-dimensional array of length 20000 x 8000 = 160 million, and fills it with 0.0 values. This process is very similar to that of the first scenario except that, unlike the first one, the array stores the element values only, without the extra individual element types.

The next three scenarios instantiate a matrix object with all empty elements. Other than that, they are identical to the first three.

The first scenario is mixed_type_matrix with filled storage.

mixed_mx_type mx(20000, 8000, mdds::matrix_density_filled_empty);

Unlike the zero element counterpart, this version allocates one empty element instance and a one-dimensional array that stores all identical pointer values pointing to the empty element instance.

The second one is mixed_type_matrix with sparse storage.

mixed_mx_type mx(20000, 8000, mdds::matrix_density_sparse_empty);

And the third one is multi_type_matrix initialized with all empty elements.

multi_mx_type mx(20000, 8000);

This is also very similar to the initialization with all zero elements, except that it creates one empty element block which doesn’t have memory allocated for a data array. As such, this process is cheaper than the zero element counterpart because there is no overhead from creating an extra data array.

Here are the results:

The most expensive one turns out to be the zero-initialized mixed_type_matrix, which allocates an array with 160 million zero element objects upon construction. What follows is a tie between the empty-initialized mixed_type_matrix and the zero-initialized multi_type_matrix. Both structures allocate an array with 160 million primitive values (one with pointer values and one with double values). The sparse mixed_type_matrix ones are very cheap to instantiate since all they need is to set their internal size without additional storage allocation. The empty multi_type_matrix is also cheap for the same reason. The last three types can be instantiated in constant time regardless of the logical size of the matrix.

Assigning values to elements

The next test is assigning numeric values to elements inside a matrix. For the remainder of the tests, I will only measure the zero-initialized mixed_type_matrix since the empty-initialized one is not optimized to be filled with a large number of non-empty elements.

We measure six different scenarios in this test. One is for mixed_type_matrix, and the rest are all for multi_type_matrix, as multi_type_matrix supports several different ways to assign values. In contrast, mixed_type_matrix only supports one way to assign values.

The first scenario involves assigning values to elements in mixed_type_matrix. Values are assigned individually inside nested for loops.

size_t row_size = 10000, col_size = 1000;
 
mixed_mx_type mx(row_size, col_size, mdds::matrix_density_filled_zero);
 
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        mx.set(row, col, val);
        val += 0.00001; // different value for each element
    }
}

The second scenario is almost identical to the first one, except that it’s multi_type_matrix initialized with empty elements.

size_t row_size = 10000, col_size = 1000;
 
multi_mx_type mx(row_size, col_size);
 
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        mx.set(row, col, val);
        val += 0.00001; // different value for each element
    }
}

Because the matrix is initialized with just one empty block with no data array allocated, the very first value assignment allocates a data array for just one element, and all the subsequent assignments keep growing the data array by one element at a time. Each value assignment therefore runs the risk of reallocating the data array, since it internally relies on std::vector’s capacity growth policy, which in most STL implementations doubles the capacity on every reallocation.
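To get a feel for how often such reallocations happen, here is a small sketch that grows a plain std::vector one element at a time and counts the capacity changes. It only models the growth policy of std::vector itself, not the block handling inside multi_type_vector, and count_reallocations is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends n values one at a time and counts how often the vector's
// capacity changes, i.e. how often its storage gets reallocated.
inline std::size_t count_reallocations(std::size_t n)
{
    std::vector<double> values;
    std::size_t reallocations = 0;
    std::size_t last_capacity = values.capacity();

    for (std::size_t i = 0; i < n; ++i)
    {
        values.push_back(0.0);
        if (values.capacity() != last_capacity)
        {
            ++reallocations;
            last_capacity = values.capacity();
        }
    }

    assert(values.size() == n);
    return reallocations;
}
```

The exact count depends on the STL implementation's growth factor, but with geometric growth it stays far below the number of appended elements. Each reallocation still copies the whole array, which is the cost that pre-sizing the array avoids.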

The third scenario is identical to the previous one. The only difference is that the matrix is initialized with zero elements.

size_t row_size = 10000, col_size = 1000;
 
multi_mx_type mx(row_size, col_size, 0.0);
 
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        mx.set(row, col, val);
        val += 0.00001; // different value for each element
    }
}

This seemingly subtle change makes a huge difference. Because the matrix is already initialized with a data array of the full matrix size, none of the subsequent assignments reallocate the array. This cuts the repetitive reallocation overhead significantly.

The next case involves multi_type_matrix initialized with empty elements. The values are first stored into an extra array, then the whole array gets assigned to the matrix in one call.

size_t row_size = 10000, col_size = 1000;
 
multi_mx_type mx(row_size, col_size);
 
// Prepare a value array first.
std::vector<double> vals;
vals.reserve(row_size*col_size);
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        vals.push_back(val);
        val += 0.00001;
    }
}
 
// Assign the whole element values in one step.
mx.set(0, 0, vals.begin(), vals.end());

An operation like this is something that mixed_type_matrix doesn’t support. What the set() method on the last line does is to assign the values to all elements in the matrix in one single call; it starts from the top-left (0,0) element position and keeps wrapping values into the subsequent columns until it reaches the last element in the last column.
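The wrapping behavior can be made concrete with a small sketch that computes which element receives the value at a given offset in the input array. wrapped_position is a hypothetical helper written for this illustration and is not part of mdds; it simply assumes that values run down a column and then continue at the top of the next one, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the (row, col) position that receives the
// value at the given offset in the input array, when assignment starts at
// (start_row, start_col) and wraps down each column into the next one.
inline std::pair<std::size_t, std::size_t> wrapped_position(
    std::size_t rows, std::size_t start_row, std::size_t start_col, std::size_t offset)
{
    assert(rows > 0 && start_row < rows);
    const std::size_t linear = start_col * rows + start_row + offset;
    return std::make_pair(linear % rows, linear / rows);
}
```

In a matrix with 3 rows, a value at offset 3 of an array assigned from (0,0) lands at row 0 of column 1.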

Generally speaking, with multi_type_matrix, assigning a large number of values in this fashion is significantly faster than assigning them individually, and even with the overhead of the initial data array creation, it is normally faster than individual value assignments. In this test, we measure the time it takes to set values with and without the initial data array creation.

The last scenario is identical to the previous one, but the only difference is the initial element values being zero instead of being empty.

size_t row_size = 10000, col_size = 1000;
 
multi_mx_type mx(row_size, col_size, 0.0);
 
// Prepare a value array first.
std::vector<double> vals;
vals.reserve(row_size*col_size);
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        vals.push_back(val);
        val += 0.00001;
    }
}
 
// Assign the whole element values in one step.
mx.set(0, 0, vals.begin(), vals.end());

The only significant thing this code does differently from the last one is that it assigns values to an existing numeric data array, whereas the code in the previous scenario allocates a new array before assigning values. In practice, this should not make any significant difference performance-wise.

Now, let’s take a look at the results.

The top orange bar is the only result from mixed_type_matrix, and the rest of the blue bars are from multi_type_matrix, using different assignment techniques.

The top three bars are the results from the individual value assignments inside a loop (hence the label “loop”). The first thing that jumps out of this chart is that individually assigning values to an empty-initialized multi_type_matrix is prohibitively expensive, so it should be done with extra caution (if you really have to do it). When the matrix is initialized with zero elements, however, it performs reasonably well, though it’s still slightly slower than the mixed_type_matrix case.

The bottom four bars are the results from the array assignments to multi_type_matrix: one pair for the matrix initialized with empty elements and one pair for the matrix initialized with zero elements, each measured with and without the initial data array creation. The difference between the two initialization cases is very minor and barely noticeable in real life.

Performance of an array assignment is roughly on par with that of mixed_type_matrix’s if you include the cost of the extra array creation. But if you take away that overhead, that is, if the data array is already present and doesn’t need to be created prior to the assignment, the array assignment becomes nearly 3 times faster than mixed_type_matrix’s individual value assignment.

Adding all numeric elements

The next benchmark test consists of fetching all numerical values from a matrix and adding them all together. This requires accessing the stored elements inside a matrix after it has been fully populated.

With mixed_type_matrix, the following two ways of accessing element values are tested: 1) access via individual get_numeric() calls, and 2) access via const_iterator. With multi_type_matrix, the tested access methods are: 1) access via individual get_numeric() calls, and 2) access via the walk() method, which walks all element blocks sequentially and calls back a caller-provided function object on each element block pass.

In each of the above testing scenarios, two different element distribution types are tested: one that consists of all numeric elements (homogeneous matrix), and one that consists of a mixture of numeric and empty elements (heterogeneous matrix). In the tests with heterogeneous matrices, one out of every three columns is set empty while the remainder of the columns are filled with numeric elements. The size of a matrix object is fixed to 10000 rows by 1000 columns in each tested scenario.

The first case involves populating a mixed_type_matrix instance with all numeric elements (homogeneous matrix), then reading all values to calculate their sum.

size_t row_size = 10000, col_size = 1000;
 
mixed_mx_type mx(row_size, col_size, mdds::matrix_density_filled_zero);
 
// Populate the matrix with all numeric values.
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        mx.set(row, col, val);
        val += 0.00001;
    }
}
 
// Sum all numeric values.
double sum = 0.0;
for (size_t row = 0; row < row_size; ++row)
    for (size_t col = 0; col < col_size; ++col)
        sum += mx.get_numeric(row, col);

The test only measures the second nested for loops where the values are read and added. The first block where the matrix is populated is excluded from the measurement.

In the heterogeneous matrix variant, only the first block is different:

// Populate the matrix with numeric and empty values.
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        if ((col % 3) == 0)
        {
            mx.set_empty(row, col);
        }
        else
        {
            mx.set(row, col, val);
            val += 0.00001;
        }
    }
}

while the second block remains intact. Note that the get_numeric() method returns 0.0 when the element type is empty (this is true with both mixed_type_matrix and multi_type_matrix), so calling this method on empty elements has no effect on the total sum of all numeric values.

When measuring the performance of element access via iterator, the second block is replaced with the following code:

// Sum all numeric values via iterator.
double sum = 0.0;
mixed_mx_type::const_iterator it = mx.begin(), it_end = mx.end();
for (; it != it_end; ++it)
{
    if (it->m_type == mdds::element_numeric)
        sum += it->m_numeric;
}

Four separate tests are performed with multi_type_matrix. The first variant consists of a homogeneous matrix with all numeric values, where the element values are read and added via manual loop.

size_t row_size = 10000, col_size = 1000;
 
multi_mx_type mx(row_size, col_size, 0.0);
 
// Populate the matrix with all numeric values.
double val = 0.0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        mx.set(row, col, val);
        val += 0.00001;
    }
}
 
// Sum all numeric values.
double sum = 0.0;
for (size_t row = 0; row < row_size; ++row)
    for (size_t col = 0; col < col_size; ++col)
        sum += mx.get_numeric(row, col);

This code is identical to the very first scenario with mixed_type_matrix, the only difference being that it uses multi_type_matrix initialized with zero elements.

In the heterogeneous matrix variant, the first block is replaced with the following:

multi_mx_type mx(row_size, col_size); // initialize with empty elements.
double val = 0.0;
std::vector<double> vals;
vals.reserve(row_size);
for (size_t col = 0; col < col_size; ++col)
{
    if ((col % 3) == 0)
        // Leave this column empty.
        continue;
 
    vals.clear();
    for (size_t row = 0; row < row_size; ++row)
    {
        vals.push_back(val);
        val += 0.00001;
    }
 
    mx.set(0, col, vals.begin(), vals.end());
}

which essentially fills the matrix with numeric values except for every 3rd column being left empty. It’s important to note that, because a heterogeneous multi_type_matrix instance consists of multiple element blocks, making every 3rd column empty creates over 300 element blocks in a matrix that consists of 1000 columns. This severely affects the performance of element block lookup, especially for elements that are not positioned in the first few blocks.

The walk() method was added to multi_type_matrix precisely to alleviate this sort of poor lookup performance in such heavily partitioned matrices. This allows the caller to walk through all element blocks sequentially, thereby removing the need to restart the search in every element access. The last tested scenario measures the performance of this walk() method by replacing the second block with:

sum_all_values func;
mx.walk(func);

where the sum_all_values function object is defined as:

class sum_all_values : public std::unary_function<multi_mx_type::element_block_node_type, void>
{
    double m_sum;
public:
    sum_all_values() : m_sum(0.0) {}
 
    void operator() (const multi_mx_type::element_block_node_type& blk)
    {
        if (!blk.data)
            // Skip the empty blocks.
            return;
 
        if (mdds::mtv::get_block_type(*blk.data) != mdds::mtv::element_type_numeric)
            // Block is not of numeric type.  Skip it.
            return;
 
        using mdds::mtv::numeric_element_block;
        // Access individual elements in this block, and add them up.
        numeric_element_block::const_iterator it = numeric_element_block::begin(*blk.data);
        numeric_element_block::const_iterator it_end = numeric_element_block::end(*blk.data);
        for (; it != it_end; ++it)
            m_sum += *it;
    }
 
    double get() const { return m_sum; }
};

Without further ado, here are the results:

It is somewhat surprising that mixed_type_matrix shows poorer performance with iterator access as opposed to access via get_numeric(). There is no noticeable difference between the homogeneous and heterogeneous matrix scenarios with mixed_type_matrix, which makes sense given how mixed_type_matrix stores its element values.

On the multi_type_matrix front, element access via individual get_numeric() calls turns out to be very slow, which is expected. This poor performance is especially visible with the heterogeneous matrix consisting of over 300 element blocks. Access via the walk() method, on the other hand, shows much better performance, and is in fact the fastest amongst all tested scenarios. Access via walk() is faster with the heterogeneous matrix, which is likely attributable to the fact that the empty element blocks are skipped, reducing the total number of element values to read.
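The gap between the two access methods is easier to see with a toy model of the block search. The sketch below is a simplified stand-in rather than the multi_type_vector implementation: each block knows only its size, every individual lookup scans from the first block while accumulating sizes, and a walk visits each block exactly once. All names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a block entry in the primary array: it knows
// its size but not its logical start position.
struct block
{
    std::size_t size;
};

// Finds the block containing logical position pos by scanning from the
// first block and accumulating sizes. Returns the number of blocks visited.
inline std::size_t blocks_visited_for_lookup(const std::vector<block>& blocks, std::size_t pos)
{
    std::size_t start = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        if (pos < start + blocks[i].size)
            return i + 1;
        start += blocks[i].size;
    }
    assert(false && "position out of range");
    return blocks.size();
}

// Total block visits when every position is looked up individually.
inline std::size_t visits_with_individual_lookups(const std::vector<block>& blocks)
{
    std::size_t total = 0, pos = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i)
        for (std::size_t j = 0; j < blocks[i].size; ++j, ++pos)
            total += blocks_visited_for_lookup(blocks, pos);
    return total;
}

// Total block visits for a single sequential walk: each block once.
inline std::size_t visits_with_walk(const std::vector<block>& blocks)
{
    return blocks.size();
}
```

For three blocks of sizes 2, 3 and 1, individual lookups visit 11 blocks in total while a walk visits 3. The difference grows quickly as the number of blocks and the block sizes increase.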

Counting all numeric elements

In this test, we measure the time it takes to count the total number of numeric elements stored in a matrix. As with the previous test, we use both homogeneous and heterogeneous 10000 by 1000 matrix objects initialized in exactly the same manner. In this test, however, we don’t measure the individual element access performance of multi_type_matrix since we all know by now that doing so would result in very poor performance.

With mixed_type_matrix, we measure counting both via individual element access and via iterators. I will not show the code to initialize the element values here since that remains unchanged from the previous test. The code that does the counting is as follows:

// Count all numeric elements.
long count = 0;
for (size_t row = 0; row < row_size; ++row)
{
    for (size_t col = 0; col < col_size; ++col)
    {
        if (mx.get_type(row, col) == mdds::element_numeric)
            ++count;
    }
}

It is pretty straightforward and hopefully needs no explanation. Likewise, the code that does the counting via iterator is as follows:

// Count all numeric elements via iterator.
long count = 0;
mixed_mx_type::const_iterator it = mx.begin(), it_end = mx.end();
for (; it != it_end; ++it)
{
    if (it->m_type == mdds::element_numeric)
        ++count;
}

Again, pretty straightforward code.

Now, testing this scenario with multi_type_matrix is interesting because it can take advantage of multi_type_matrix’s block-based element value storage. Because the elements are partitioned into multiple blocks, and each block stores its size separately from the data array, we can simply tally the sizes of all numeric element blocks to get the total count without even touching the individual elements stored in the blocks. This algorithm scales with the number of element blocks, which is far smaller than the number of elements in most average use cases.

With that in mind, the code to count numeric elements becomes:

count_all_values func;
mx.walk(func);

where the count_all_values function object is defined as:

class count_all_values : public std::unary_function<multi_mx_type::element_block_node_type, void>
{
    long m_count;
public:
    count_all_values() : m_count(0) {}
    void operator() (const multi_mx_type::element_block_node_type& blk)
    {
        if (!blk.data)
            // Empty block.
            return;
 
        if (mdds::mtv::get_block_type(*blk.data) != mdds::mtv::element_type_numeric)
            // Block is not numeric.
            return;
 
        m_count += blk.size; // Just use the separate block size.
    }
 
    long get() const { return m_count; }
};

With mixed_type_matrix, you are forced to parse all elements in order to count elements of a certain type, regardless of which type of elements to count. This algorithm scales with the number of elements, a much worse proposition than scaling with the number of element blocks.

Now that the code has been presented, let’s move on to the results:
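To illustrate the scaling difference without pulling in the library, here is a self-contained sketch that models element blocks as plain (type, size) pairs. The types and function names are invented for this example, and the builder assumes a column-major layout in which adjacent columns of the same type merge into one block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum block_type { block_empty, block_numeric };

// Simplified stand-in for an element block: a type tag and a size.
struct block
{
    block_type type;
    std::size_t size;
};

// Builds a block layout resembling the heterogeneous test matrix: every
// third column empty, the rest numeric, adjacent same-type columns merged.
inline std::vector<block> make_heterogeneous_blocks(std::size_t rows, std::size_t cols)
{
    assert(rows > 0 && cols > 0);
    std::vector<block> blocks;
    for (std::size_t col = 0; col < cols; ++col)
    {
        const block_type type = (col % 3 == 0) ? block_empty : block_numeric;
        if (!blocks.empty() && blocks.back().type == type)
            blocks.back().size += rows;
        else
            blocks.push_back(block{type, rows});
    }
    return blocks;
}

// Counts numeric elements by adding up the sizes of the numeric blocks.
// The loop runs once per block, never once per element.
inline std::size_t count_numeric(const std::vector<block>& blocks)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i)
    {
        if (blocks[i].type == block_numeric)
            count += blocks[i].size;
    }
    return count;
}
```

For a 10000 by 1000 layout built this way, count_numeric() arrives at 6,660,000 numeric elements after a loop over the blocks only, whereas an element-by-element count would have to inspect all 10 million positions.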

The performance of mixed_type_matrix, in both the manual loop and the iterator cases, is comparable to that of the previous test. What’s remarkable is the performance of multi_type_matrix via its walk() method; the numbers are so small that they don’t even register in the chart! As I mentioned previously, the storage structure of multi_type_matrix turns the problem of counting elements into a problem of counting element blocks, thereby significantly reducing the scale factor with respect to the number of elements in most average use cases.

Initializing matrix with identical values

Here is another scenario where you can take advantage of multi_type_matrix over mixed_type_matrix. Say, you want to instantiate a new matrix and assign 12.3 to all of its elements. With mixed_type_matrix, the only way you can achieve that is to assign that value to each element in a loop after it’s been constructed. So you would write code like this:

size_t row_size = 10000, col_size = 2000;
mixed_mx_type mx(row_size, col_size, mdds::matrix_density_filled_zero);
 
for (size_t row = 0; row < row_size; ++row)
    for (size_t col = 0; col < col_size; ++col)
        mx.set(row, col, 12.3);

With multi_type_matrix, you can achieve the same result by simply passing an initial value to the constructor, and that value gets assigned to all its elements upon construction. So, instead of assigning it to every element individually, you can simply write:

multi_mx_type mx(row_size, col_size, 12.3);

Just for the sake of comparison, I’ll add two more cases for multi_type_matrix. The first one involves instantiation with a numeric block of zeros, and individually assigning the value to the elements afterward, like so:

multi_mx_type mx(row_size, col_size, 0.0);
 
for (size_t row = 0; row < row_size; ++row)
    for (size_t col = 0; col < col_size; ++col)
        mx.set(row, col, 12.3);

which is algorithmically similar to the mixed_type_matrix case.

Now, the second one involves instantiating a matrix with empty elements, creating an array with the same element count initialized with the desired initial value, then assigning that array to the matrix in one go.

multi_mx_type mx(row_size, col_size);
 
std::vector<double> vals(row_size*col_size, 12.3);
mx.set(0, 0, vals.begin(), vals.end());

The results are:

The performance of assigning the initial value to individual elements is comparable between mixed_type_matrix and multi_type_matrix, though it is also the slowest of all. Creating an array of initial values and assigning it to the matrix takes less than half the time of individual assignment even with the overhead of creating the extra array upfront. Passing an initial value to the constructor is the fastest of all; it only takes roughly 1/8th of the time required for the individual assignment, and 1/3rd of the array assignment.

Conclusion

I hope I have presented enough evidence to convince you that multi_type_matrix offers overall better performance than mixed_type_matrix in a wide variety of use cases. Its structure is much simpler than that of mixed_type_matrix in that it only uses one element storage backend as opposed to three in mixed_type_matrix. This not only reduces the cost of maintenance but also improves the predictability of the container behavior from the user’s point of view. The fact that you don’t have to clone a matrix just to transfer it into another storage backend should make it a lot simpler to use this new matrix container.

Having said this, you should also be aware of the fact that, in order to take full advantage of multi_type_matrix to achieve good run-time performance, you need to

  • try to limit single value assignments and prefer using value array assignment,
  • construct the matrix with a proper initial value, which also determines the type of the initial element block and in turn affects the performance of subsequent value assignments, and
  • use the walk() method when iterating through all elements in the matrix.

That’s all, ladies and gentlemen.