Here is another performance improvement that just landed on master.
It was brought to our attention that the performance of saving documents to ODF spreadsheet format had been degrading quite noticeably. This was especially true when the document contained lots of what we call rich text cells. Rich text cells are those cells that contain text with mixed format spans, or text that consists of multiple lines. These cells are handled differently from simple strings internally, and have slightly more overhead than the simple string counterparts. Because of this, saving a document full of such texts was always slower than saving one with just numbers and simple strings.
However, even with this unavoidable overhead, the performance of saving rich text cells was clearly going in the wrong direction. Therefore it was time to act.
Long story short, after many days of code reading and writing, I brought it to a state where I can share some numbers.
Measuring export performance
I measured the performance of exporting rich text cells in the following steps.
1. Create a new spreadsheet document.
2. In cell A1, type three lines of ‘libreoffice’. Press Ctrl-Enter to start a new line within the same cell.
3. Copy A1, select A1:N1000, and paste to replicate the content of A1 across every cell in the range.
4. Save the document as an ODF spreadsheet and measure how long the save takes.
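For anyone who would rather generate the test data than type it, the steps above can be approximated with a small script. This is only a sketch and not how the measurement was actually done: the names `build_grid` and `to_csv` are made up for illustration, and it assumes that importing the resulting CSV keeps the embedded line breaks so that each cell ends up as a multi-line cell.

```python
import csv
import io

ROWS, COLS = 1000, 14  # A1:N1000 spans 14 columns (A..N) and 1000 rows
CELL_TEXT = "\n".join(["libreoffice"] * 3)  # three lines inside one cell


def build_grid(rows=ROWS, cols=COLS, text=CELL_TEXT):
    """Return a rows x cols grid in which every cell holds the same multi-line text."""
    return [[text] * cols for _ in range(rows)]


def to_csv(grid):
    """Serialize the grid as CSV; the csv module quotes fields with embedded newlines."""
    buf = io.StringIO()
    csv.writer(buf).writerows(grid)
    return buf.getvalue()


grid = build_grid()
print(len(grid) * len(grid[0]))  # 14000 multi-line cells in total
```

Writing `to_csv(grid)` to a file gives something that can be imported and then saved as ODF for timing. The save itself still has to be timed separately.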
I performed the above measurement with 3.5, 3.6, 4.0, 4.1, and the latest master (slated to become 4.2) builds, and these are the numbers.
It is clear from this chart that the performance started to suffer first in version 3.6, then gradually worsened over 4.0 and 4.1. The good news is that we have managed to bring the number back down in the master build, even lower than that of 3.5 which I used as the point of reference. Not just slightly lower, but much, much lower.
I don’t know about you, but I’m quite happy with this result.
This week I have finally finished implementing a true shared formula framework in Calc core, which allows Calc to share token array instances between adjacent formula cells if they contain an identical set of formula tokens. Since one of the major benefits of sharing formula token arrays is a reduced memory footprint, I decided to measure the trend in Calc’s memory usage from 4.0 all the way up to the latest master, to see how much impact this shared formula work has had on Calc’s overall memory footprint.
Here is the test document I used to measure Calc’s memory usage.
This ODF spreadsheet document contains 100000 rows of cells in 4 columns of which 399999 are formula cells. Column A contains a series of integers that grow linearly down the column. Here, only the first cell (A1) is a numeric cell while the rest are all formula cells that reference their respective immediate upper cell. Cells in Column B all reference their immediate left in Column A, cells in Column C all reference their immediate left in Column B, and so on. References used in this document are all relative references; no absolute references are used.
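The layout described above can be modelled in a few lines to check the cell counts. This is a sketch under assumptions: the post only says that Column A grows linearly and that each cell references the one above it, so the exact formula `=A1+1` is a guess, and `cell_content` is a hypothetical helper rather than anything from Calc.

```python
ROWS, COLS = 100_000, 4
LETTERS = "ABCD"


def cell_content(row, col):
    """Return the content of a cell; row and col are zero-based.

    Formula cells are returned as strings starting with '='.
    """
    if col == 0:
        if row == 0:
            return 1  # A1 is the only plain numeric cell
        # Assumed formula: the cell directly above plus one.
        # With a zero-based row, the 1-based row number of the cell above is `row`.
        return f"=A{row}+1"
    # Columns B..D reference the cell immediately to the left, same row.
    return f"={LETTERS[col - 1]}{row + 1}"


formula_cells = sum(
    1
    for col in range(COLS)
    for row in range(ROWS)
    if isinstance(cell_content(row, col), str)
)
print(formula_cells)  # 399999, i.e. 4 x 100000 minus the numeric A1
```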
I’ve tested a total of 4 builds. One is the 4.0.1 build packaged for openSUSE 11.4 (x64) from the openSUSE repository, one is the 4.0.6 build built from the 4.0 branch, one is the 4.1.1 build built from the 4.1 branch, and the last one is the latest from the master branch. With the exception of the packaged 4.0.1 build, all builds were built locally on my machine running openSUSE 11.4 (x64). On the master build, I also tested memory usage both with and without shared formulas.
In each tested build, the memory usage was measured by directly opening the test document from the command line and recording the virtual memory usage in GNOME system monitor. After the document was loaded, I allowed for the virtual memory reading to stabilize by waiting several seconds before recording the number. The results are presented graphically in the following chart.
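The numbers in this post were read off GNOME system monitor. As an alternative on Linux, the same kind of figure can be read from `/proc/<pid>/status`. The sketch below is not the method used here: it assumes that the kernel's `VmSize` field corresponds to the virtual memory column in the system monitor, and the function names are made up for illustration.

```python
import re
from pathlib import Path


def parse_vmsize_kib(status_text):
    """Extract VmSize (reported by the kernel in kB) from /proc/<pid>/status text.

    Returns None if the field is absent.
    """
    match = re.search(r"^VmSize:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def vmsize_mib(pid):
    """Read the virtual memory size of a running process, in MiB (Linux only)."""
    status = Path(f"/proc/{pid}/status").read_text()
    kib = parse_vmsize_kib(status)
    return None if kib is None else kib / 1024
```

As with the manual procedure, it makes sense to call `vmsize_mib` only after the document has finished loading and the reading has had a few seconds to settle.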
The following table shows the actual numbers recorded.
[Table: recorded virtual memory usage for 4.0.1 (packaged by openSUSE), master (no shared formula), and master (shared formula)]
Additionally, I measured the number of token array instances created by the two master builds (one with shared formulas and one without). The build without shared formulas created 399999 token array instances (exactly 4 x 100000 – 1) upon file load, whereas the build with shared formulas created only 4. This likely accounts for the 78.3 MiB difference in virtual memory usage between the two builds.
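To see where the 399999 and 4 come from, here is a toy model of the idea. It is not Calc's implementation. It assumes that references are compared in relative form (as row and column offsets), that sharing happens between vertically adjacent cells in the same column, and that Column A uses a formula of the form "cell above plus one". All names are hypothetical.

```python
from itertools import groupby

ROWS, COLS = 100_000, 4


def relative_tokens(row, col):
    """Return a cell's formula tokens with references stored as (row, col) offsets.

    Returns None for the single non-formula cell (A1). Indices are zero-based.
    """
    if col == 0:
        if row == 0:
            return None
        # Assumed Column A formula: the cell one row up, plus 1.
        return (("ref", -1, 0), ("op", "+"), ("num", 1))
    # Columns B..D: the cell one column to the left, same row.
    return (("ref", 0, -1),)


def count_token_arrays(shared):
    """Count token array instances needed for the whole test document."""
    total = 0
    for col in range(COLS):
        column = [relative_tokens(row, col) for row in range(ROWS)]
        if shared:
            # One array per run of adjacent cells with identical tokens.
            total += sum(1 for key, _ in groupby(column) if key is not None)
        else:
            # One array per formula cell.
            total += sum(1 for tokens in column if tokens is not None)
    return total


print(count_token_arrays(shared=False))  # 399999
print(count_token_arrays(shared=True))   # 4
```

Because every reference in the test document is relative, each column collapses into a single run of identical tokens, which is why the shared case needs only one array per column.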
Effect of cell storage rework
One thing worth noting here is that, even without shared formulas, the numbers clearly show a steady decline of Calc’s memory usage from 4.0 to 4.1, and to the current master. While we can’t clearly infer from these numbers alone what caused the memory usage to shrink, I can say with reasonable confidence that the cell storage rework we did during the same period is a significant factor in such memory footprint shrinkage. I won’t go into the details of the cell storage rework here; I’ll reserve that topic for another blog post.
Oh by the way, I have absolutely no idea why the 4.0.1 build packaged from the openSUSE repository shows such high memory usage. To me this looks more like an anomaly, indicative of earlier memory leaks that we later fixed, a different custom allocator used only by the distro-packaged version that favors large up-front memory allocation, or something else I haven’t thought of. Either way, I’m not counting this as something that resulted from any of the improvements we made in Calc core.