I’m very pleased to announce that version 0.11.0 of the orcus library is officially out in the wild! You can download the latest source package from the project’s home page.
Lots of changes went into this release, but the two that I would highlight most are the inclusion of JSON and YAML parsers and their associated tools and interfaces. This release adds two new command-line tools: orcus-json and orcus-yaml. The orcus-json tool optionally handles JSON references to external files when the --resolve-refs option is given, though currently it only supports resolving external files that are on the local file system and only when the paths are relative to the referencing file.
I’ve also written API documentation for the JSON interface in case someone wants to give it a try. The documentation on orcus is always a work in progress, and I’d like to spend more time bringing it to a more complete state.
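To make the "relative to the referencing file" restriction concrete, here is a minimal Python sketch of that kind of resolution. It is not orcus code and does not use the orcus API. It assumes a reference is written as an object of the form {"$ref": "some/relative/path.json"}, and the function names are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_json_with_refs(path, _active=()):
    """Load a JSON file and inline {"$ref": "<relative path>"} objects.

    Assumption for this sketch: a reference is a one-key object whose
    "$ref" value is a path relative to the file that contains it.
    """
    path = Path(path).resolve()
    if path in _active:
        raise ValueError(f"circular reference via {path}")
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    # Paths inside this file are resolved against this file's own directory.
    return _resolve(doc, path.parent, _active + (path,))


def _resolve(node, base_dir, active):
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and len(node) == 1:
            if Path(ref).is_absolute():
                raise ValueError("only relative paths are handled in this sketch")
            return load_json_with_refs(base_dir / ref, active)
        return {k: _resolve(v, base_dir, active) for k, v in node.items()}
    if isinstance(node, list):
        return [_resolve(v, base_dir, active) for v in node]
    return node


def demo():
    """Build two small files in a temporary directory and resolve them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "parts").mkdir()
        (root / "parts" / "child.json").write_text(
            '{"value": 42}', encoding="utf-8")
        (root / "main.json").write_text(
            '{"name": "main", "child": {"$ref": "parts/child.json"}}',
            encoding="utf-8")
        return load_json_with_refs(root / "main.json")


if __name__ == "__main__":
    print(demo())
```

Because each path is resolved against the directory of the file that contains the reference, a nested file can refer to its own neighbours without knowing where the top-level document lives.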
Here is another performance improvement that just landed on master.
It was brought to our attention that the performance of saving documents to ODF spreadsheet format had been degrading quite noticeably. This was especially true when the document contained lots of what we call rich text cells. Rich text cells are those cells that contain text with mixed format spans, or text that consists of multiple lines. These cells are handled differently from simple strings internally, and have slightly more overhead than the simple string counterparts. Because of this, saving a document full of such texts was always slower than saving one with just numbers and simple strings.
However, even with this unavoidable overhead, the performance of saving rich text cells was clearly going in the wrong direction. Therefore it was time to act.
Long story short, after many days of code reading and writing, I brought it to a state where I can share some numbers.
Measuring export performance
I measured the performance of exporting rich text cells in the following steps.
Create a new spreadsheet document.
Type three lines of ‘libreoffice’ in cell A1. Here, you can hit Ctrl-Enter to move to the next line within the same cell.
Copy A1, select A1:N1000 and paste, to replicate the content of A1 to all cells in the range.
Save the document as an ODF spreadsheet document, and measure how long the save takes.
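The steps above are carried out by hand in the application. If the save step were scripted instead, the timing part could be sketched roughly as follows in Python. The fake_save function is only a stand-in workload invented for this example; it does not call LibreOffice. The 14,000 cell count is simply the size of the A1:N1000 range (14 columns by 1000 rows).

```python
import time


def time_call(fn, *args, repeat=3):
    """Run fn(*args) `repeat` times; return (best, all_timings) in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings), timings


def fake_save(n_cells):
    """Stand-in workload (hypothetical): build three lines of text per cell."""
    return "".join("libreoffice\n" * 3 for _ in range(n_cells))


if __name__ == "__main__":
    # A1:N1000 covers 14 columns x 1000 rows = 14,000 cells.
    best, all_timings = time_call(fake_save, 14 * 1000)
    print(f"best of {len(all_timings)} runs: {best:.4f} s")
```

Taking the best of several runs is one common way to reduce noise from other activity on the machine; an average or median would work as well.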
I performed the above measurement with 3.5, 3.6, 4.0, 4.1, and the latest master (slated to become 4.2) builds, and these are the numbers.
It is clear from this chart that the performance started to suffer first in version 3.6, then gradually worsened over 4.0 and 4.1. The good news is that we have managed to bring the number back down in the master build, even lower than that of 3.5 which I used as the point of reference. Not just slightly lower, but much, much lower.
I don’t know about you, but I’m quite happy with this result.
I have great news to share with you. Calc’s ODS import filter in 3.5 should be substantially faster when you have documents with a large number of named ranges. Read on if you want to know more details.
Laurent Godard, Markus Mohrhard, and I have been working pretty hard in the past month to bring the performance of the ODS import filter to a reasonable level, especially with documents containing a large number of named ranges.
Here is the background. Laurent uses LibreOffice as a platform for his professional extension, which makes heavy use of named ranges. It programmatically generates ODS documents and inserts hundreds or thousands of named ranges as intermediary storage to further process the data. The problem, however, was that our import performance with that kind of document was so suboptimal that this process was taking a prohibitively long time. In order for his extension to perform optimally, our ODS import filter needed to be optimized, and optimized heavily.
During the Paris conference, we got our heads together in order to come up with a strategy to make that happen. Laurent was more than willing to participate in this effort, and in the end, he did a substantial amount of work profiling, analyzing code, coming up with an optimization strategy and putting it all together. Markus and I provided mentorship, code pointers, as well as occasional coding to accelerate this effort.
Our hope was to make it all happen in time for our first 3.5 release. And I’m very happy to say that we made it.
Since we are talking about performance, it won’t be complete without the actual numbers. So here goes.
Test document 1
Here is the first test document global500.ods. It contains 500 sheets, 12,500 global named ranges, and 12,500 formulas that reference them.
On my development machine, the last stable release 3.4.4 takes 14 seconds to open this document. While 14 seconds may not seem that slow, keep in mind that this machine is somewhat unfairly fast, tailored for abusive developer use, so the real world performance is likely much less impressive (you can probably multiply that number by 3 to get a rough idea of the real world performance). Anyhow, using the latest master branch on the same machine, this document opens in roughly two and a half seconds. That’s roughly an 86% reduction in import time.
Test document 2
Here is the second, somewhat larger document global1000.ods. This document contains 1000 sheets, 25,000 named ranges and 25,000 formulas that reference them.
According to my benchmark performed under the same conditions as the first document, 3.4.4 opens this document in 50 seconds, whereas 3.5.0 opens it in under 5 seconds. That’s about a 90% reduction in import time. Pretty impressive!
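For anyone who wants to redo the arithmetic, the reduction is just (before - after) / before. Here is a small Python sketch using the rounded timings quoted in the text (14 s and about 2.5 s for the first document, 50 s and just under 5 s for the second). Since those timings are rounded, the computed percentages can differ by a few points from figures based on the unrounded measurements. The function name is made up for this example.

```python
def reduction_percent(before_s, after_s):
    """Percentage reduction in time going from before_s to after_s."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (before_s - after_s) / before_s * 100.0


if __name__ == "__main__":
    # Rounded timings taken from the text; the real measurements
    # were presumably more precise.
    for name, before_s, after_s in [
        ("global500.ods", 14.0, 2.5),
        ("global1000.ods", 50.0, 5.0),
    ]:
        pct = reduction_percent(before_s, after_s)
        print(f"{name}: {before_s} s -> {after_s} s, {pct:.1f}% less time")
```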
Real power of open source
This story shows another aspect of this remarkable achievement worth mentioning. If you use an open source product such as LibreOffice in your business, and if it doesn’t perform the way you need it to, you can actually join the project as a developer and coordinate the effort with the upstream developers to make it happen. And depending on the nature of the change you want to see happen, it can happen very quickly as this story demonstrates.
I wanted to emphasize this because, while more and more businesses and institutions are embracing open source software, many of them tend to focus too much on the cost-saving aspect of it, thereby developing the wrong mindset that that’s what open source is all about. It isn’t. The real power of using open source software in your deployment is that it gives you the ability to join and contribute to the project to influence the direction of its development. That gives you real flexibility in planning, and is, in my opinion, the best way to harness the power of using open source software. The monetary cost-saving side of the benefit comes as a side effect but should be thought of only as an added bonus, not the primary reason for deploying open source software.
I guess it’s all over the news right now, that the latest service pack (SP2) for MS Office 2007 will enable Office to import and export ODF natively. This blog piece by Doug Mahugh touches on the word processor part of their ODF support. I haven’t yet tried it myself, but judging by Doug’s blog article it looks pretty impressive.
But, being more of a spreadsheet person, I’m personally more interested in how it fares in Excel to Calc interoperability. Since their ODF support is on ODF 1.1, which predates the on-going OpenFormula specification work, I’d be interested to see how compatible the formulas are. Technically speaking, as of ODF 1.1, interpreting formula expressions was pretty much application-specific, so I would not be surprised even if they are not compatible at all. But we’ll just have to see.
Either way, I find this news very encouraging. This is undoubtedly a big step toward the proliferation of ODF as a practical document exchange format.