Tag Archives: orcus

Orcus 0.11.0

Share Button

I’m very pleased to announce that version 0.11.0 of the orcus library is officially out in the wild! You can download the latest source package from the project’s home page.

Lots of changes went into this release, but the two that I would highlight most are the inclusions of JSON and YAML parsers and their associated tools and interfaces. This release adds two new command-line tools: orcus-json and orcus-yaml. The orcus-json tool optionally handles JSON references to external files when the --resolves-refs option is given, though currently it only supports resolving external files that are on the local file system and only when the paths are relative to the referencing file.

I’ve also written an API documentation on the JSON interface in case someone wants to give it a try. Though the documentation on orcus is always work-in-progress, I’d like to spend more time to make the documentation in a more complete state.

On the import filter front, Markus Mohrhard has been making improvements to the ODS import filter especially in the area of styles import. Oh BTW, he is also proposing to mentor a GSOC project on this front under the LibreOffice project. So if you are interested, go and take a look!

That’s all I have at the moment. Thank you, ladies and gentlemen.

Orcus 0.7.1 is out

Share Button

After more than a year since the release of 0.7.0, I’m once again very happy to announce that the version 0.7.1 of the Orcus library is now available for download. You can download the package from the project’s download page.

This is a maintenance release. It primarily includes bug fixes and build fixes since the 0.7.0 release with no new features. That said, the most notable aspect of this release is that it is buildable with the version 0.9.0 of the Ixion library which was just released a week ago. So, if you are trying to package and distribute the newly-released Ixion library but are unable to do so because of Orcus not being buildable with it, you might be interested in this release.

The next major upgrade will be 0.9.0 whose development is on-going on the current master branch. Hopefully that release won’t be too far away.

SUSE Hack Week

Share Button

Last week was SUSE’s Hack Week – an event my employer does periodically to allow us – hard working engineers – to go wild with our wildest ideas and execute them in one week. Just like what I did at my last Hack Week event, I decided to work on integration of Orcus library into LibreOffice once again, to pick up on what I’d left off from my previous integration work.

Integration bits

Prior to Hack Week, orcus was already partially integrated; it was used to provide the backend functionality for Calc’s XML Source feature, and experimental support for Gnumeric file import. The XML Source side was pretty well integrated, but the normal file import side was only partially integrated. Lots of essential pieces were still missing, the largest of which were

  • support for multiple filters from a single external filter provider source (such as orcus),
  • progress indicator in the status bar, and
  • proper type detection by analyzing file content rather than its extension (which we call “deep detection”).

In short, I was able to complete the first two pieces during Hack Week, while the last item still has yet to be worked on. Aside from this, there are still more minor pieces missing, but perhaps I can work on the remaining bits during the next Hack Week.

Enabling orcus in your build

If you have a recent enough build from the master branch of the LibreOffice repository, you can enable imports via orcus library by

  1. checking the Enable experimental features box in the Options dialog, and
  2. setting the environment variable LIBO_USE_ORCUS to YES before launching Calc.

This will overwrite the stock import filters for ODS, XLSX and CSV. At present, orcus only performs file extension based detection rather than content based one, so be mindful of this when you try this on your machine. To go back to the current import filters, simply disable experimental features, or unset the environment variable.

Note that I’ve added this bits to showcase a preview of what orcus can potentially do as a future import filter framework. As such, never use this in production if you want stable file loading experience, or don’t file bugs against this. We are not ready for that yet. Orcus filters are still missing lots and lots of features.

Also note that, while in theory you could enable orcus with the Windows build, the performance of orcus on Windows may not be that impressive; in fact, in some cases slower than the current filters. That is because orcus relies on strtod and strtol system calls to convert string numbers into numeric values, and their implementation depend on the platform. And the performance of MSVC’s strtod implementation is known to be suboptimal compared to those of gcc and clang on Linux. I’m very much aware of this, and will work on addressing this at a later time.

Performance comparison

This is perhaps the most interesting part. I wanted to do a quick performance comparison and see how this orcus filter stands up against the current filter. Given the orcus filter is still only capable of importing raw cell values and not any other features or properties (not even cell formats), I’ve used this test file which only consists of raw text and numeric values in a 8-by-300000 range, to measure the load times that are as fair and representative as I could make them. Here is the result on my machine running openSUSE 11.4:

xlsx-load-times

The current filter, which has undergone its set of performance optimizations on raw cell values, still spends upwards of 50 seconds. Given that it used to take minutes to load this file, it’s still an improvement.

The orcus filter, on the other hand, combined with the heavily optimized load handler in Calc core that I put in place during Hack Week, can load the same file in 4.5 seconds. I would say that is pretty impressive.

I also measured the load time on the same file using Excel 2007, on the same machine running on top of wine, and the result was 7.5 seconds. While running an Windows app via wine emulation layer may incur some performance cost, this page suggests that it should not be noticeable, if any. And my own experience of running various versions of Excel via wine backs up that argument. So this number should be fairly representative of Excel’s native performance on the same hardware.

Considering that my ultimate goal with orcus is to beat Excel on performance on loading its own files (or at least not be slower than Excel), I would say we are making good progress toward that goal.

That’s all for today. Thank you, ladies and gentlemen.

Orcus integration into LibreOffice

Share Button

Last week was SUSE Hack Week, where we SUSE engineers were encouraged to be creative and work on whatever project that we had been dying to work on.

Given this opportunity, I decided to try integrating my orcus library project into LibreOffice proper to see how much improvement we could make in the performance of loading spreadsheet documents.

I’ll leave the detailed description and goal of orcus project for another blog post, but in short, orcus is an independent library designed to process spreadsheet documents, and is also designed to be useable from an application that would like to use it to load documents. It’s currently still work in progress, and is not even in alpha quality. So, I intentionally don’t release orcus library packages on an official basis.

Integration work

The main difficulty with integrating orcus into LibreOffice proper was dealing with the very intricate loading process that LibreOffice uses for all existing filters. It first goes through an elaborate type detection process, which loads the content of the file into memory in order for the type detection code to parse it. Once the correct type is determined, LibreOffice then instantiates correct frame loader and start the actual loading process. I’ve explained all of this in detail in this blog post of mine.

Orcus, on the other hand, only needs a file path, and it does the rest. And it pushes data to the call back functions provided by the client code as it parses the file. It was this difference in overall loading process that made the integration of orcus into LibreOffice all the more challenging. And even though the hack week itself lasted only one week, I had spent months prior to it just to study the type detection code and other auxiliary code that makes up the whole file loading process in order to come up with an elegant way to add hook for orcus.

Long story short, I was able to come up with a way to hook orcus such that LibreOffice relinquishes all its file loading to the orcus library, and only handles callbacks. To make this work, I first packaged orcus into an installable rpm package using the openSUSE build service, locally installed that package, then added –with-system-orcus configure option to allow LibreOffice to find the library. The entire change needed to add hook is condensed into this commit.

Using CSV filter as an experiment

As an initial experiment, I replaced the current csv import filter with one from orcus, just to see how this overall process works. The results are very encouraging.

With a very large csv file I created via this python script:

#!/usr/bin/env python
 
import sys
 
for i in xrange(0, 65536):
    for j in xrange(1, 101):
        val = i * 1.0 / j
        sys.stdout.write("%g,"%val)
    sys.stdout.write("end\n")

the current filter spends roughly 27 seconds to load this file, which is not too bad given the sheer size of the file (~50Mb). The orcus filter, on the other hand, spends only 11 seconds to load the same file.

However, the orcus filter code path still skips a number of steps that need to be performed if it were to be used in the production build, such as

  • drawing progress bar in the status bar area,
  • calculating row heights for rows that include multi-line cell contents, and
  • probably something else I forget to mention here.

Given some of these can be quite expensive, the above numbers may not be fully comparable. Despite that, these initial numbers show a great promise on the performance improvement that may result from using the orcus library.

Future work

First of all, we will not switch to the orcus csv filter anytime soon. Although I’d like to see that happen at some point in the future, there are still lots of missing pieces in the orcus csv filter that prevent us from using it in the production build. My plan with orcus is therefore limited to addition of new filters, and my immediate plan is to develop new XML import and export filters using orcus, and integrate it into LibreOffice. This should also provide a stepping stone for any additional filters that may come up later, as well as replacing some of the existing filters as the need arises.

That’s all for now. Thanks for reading!

LibreOffice Conference 2011

Share Button

So, it was a real pleasure to be a part of the very first LibreOffice conference held in Paris, France. Some of the faces and names were familiar from the old OOo conferences, but the atmosphere of the conference was very different from the OOo ones in the past. I have been to the 2007 Barcelona conference and the 2009 Orvieto one, and I have to say, while there were some rough-edges, this is by-far my favorite OOo/LibO conference to date.

The only regret I have is that, because I had another international trip (to South Korea) only a week prior to the conference, I felt pretty much exhausted most of the time I was there. But I think I managed to chat with most of the people I needed to chat with during this once-a-year event. I intentionally tried not to hack too much during this conference, mainly because of my travel fatigue, but also because I felt it was more important to see people and talk to them to have a good feel for each other. Working from home, I sometimes miss the human interaction that people who work in the office probably take for granted, so this conference was a perfect place to fulfill that need, to make me feel human again. ;-) (Actually I tried to code a bit during the conference, but apparently my brain wasn’t cooperating at all I decided it probably wasn’t a good idea).

Anyway, it was good to see and chat with Markus Mohrhard (moggi), a very active Calc hacker who’s been instrumental in Calc’s filter test development in recent days. We discussed on various topics on Calc development since we work together in that code.

Also, Laurent Godard, whom I’ve known many years from the OOo days, but never met face-to-face.

And Valek Filippov, who happens to be in the same timezone as I. There aren’t many of us left in this LibreOffice circle, unfortunately. I tried to persuade him into this wonderful world of hacking, but so far he’s successfully fended off my attack.

It was also nice to chat with Michael Meeks at length, to clarify the new Calc cell storage structure that he and I discussed previously. Now the concept is very much clear, waiting to be coded.

Of course, many other countless hackers I’ve had beer with during the conference week, it was a real pleasure.

Now, I got some homework to do based on my interaction with various people during the conference. I will list them up item by item to use as a reminder.

  • Two Calc bugs from Valek. Both are related to this 1C program that pretty much everyone in Russia uses. I’ve already added them to my 3.5 TODO list, so it’s just a matter of finding time to tackle them unless something tricky comes out.
  • Some documentation on how to use the ixion library. Since there were some interests on using ixion to support formula calculations in other applications, I should probably start working on producing documentation on ixion, both on how to build it, and how to use it. I should also create a package for it while I’m at it.
  • Support for temporary cell buffer in the orcus library, to allow converting cell values before passing them to the client code. In some cases we can’t simply push the cell value as-is but convert it first before passing it to the client code. Typical examples are double quotes as a literal quote in CSV, as well as encoded characters (e.g. &) in XML/HTML. This will unfortunately cost us a bit for the allocation of the buffer and copying of the char array, but fortunately we don’t need to do this for all cells.
  • And lots and lots more.

All in all, I was glad to be a part of this successful conference. The atmosphere was very much all inclusive and personal, exactly how an open source conference should be.