Orcus integration into LibreOffice

Last week was SUSE Hack Week, where we SUSE engineers were encouraged to be creative and work on whatever project that we had been dying to work on.

Given this opportunity, I decided to try integrating my orcus library project into LibreOffice proper to see how much improvement we could make in the performance of loading spreadsheet documents.

I’ll leave the detailed description and goal of orcus project for another blog post, but in short, orcus is an independent library designed to process spreadsheet documents, and is also designed to be useable from an application that would like to use it to load documents. It’s currently still work in progress, and is not even in alpha quality. So, I intentionally don’t release orcus library packages on an official basis.

Integration work

The main difficulty with integrating orcus into LibreOffice proper was dealing with the very intricate loading process that LibreOffice uses for all existing filters. It first goes through an elaborate type detection process, which loads the content of the file into memory in order for the type detection code to parse it. Once the correct type is determined, LibreOffice then instantiates correct frame loader and start the actual loading process. I’ve explained all of this in detail in this blog post of mine.

Orcus, on the other hand, only needs a file path, and it does the rest. And it pushes data to the call back functions provided by the client code as it parses the file. It was this difference in overall loading process that made the integration of orcus into LibreOffice all the more challenging. And even though the hack week itself lasted only one week, I had spent months prior to it just to study the type detection code and other auxiliary code that makes up the whole file loading process in order to come up with an elegant way to add hook for orcus.

Long story short, I was able to come up with a way to hook orcus such that LibreOffice relinquishes all its file loading to the orcus library, and only handles callbacks. To make this work, I first packaged orcus into an installable rpm package using the openSUSE build service, locally installed that package, then added –with-system-orcus configure option to allow LibreOffice to find the library. The entire change needed to add hook is condensed into this commit.

Using CSV filter as an experiment

As an initial experiment, I replaced the current csv import filter with one from orcus, just to see how this overall process works. The results are very encouraging.

With a very large csv file I created via this python script:

#!/usr/bin/env python
 
import sys
 
for i in xrange(0, 65536):
    for j in xrange(1, 101):
        val = i * 1.0 / j
        sys.stdout.write("%g,"%val)
    sys.stdout.write("end\n")

the current filter spends roughly 27 seconds to load this file, which is not too bad given the sheer size of the file (~50Mb). The orcus filter, on the other hand, spends only 11 seconds to load the same file.

However, the orcus filter code path still skips a number of steps that need to be performed if it were to be used in the production build, such as

  • drawing progress bar in the status bar area,
  • calculating row heights for rows that include multi-line cell contents, and
  • probably something else I forget to mention here.

Given some of these can be quite expensive, the above numbers may not be fully comparable. Despite that, these initial numbers show a great promise on the performance improvement that may result from using the orcus library.

Future work

First of all, we will not switch to the orcus csv filter anytime soon. Although I’d like to see that happen at some point in the future, there are still lots of missing pieces in the orcus csv filter that prevent us from using it in the production build. My plan with orcus is therefore limited to addition of new filters, and my immediate plan is to develop new XML import and export filters using orcus, and integrate it into LibreOffice. This should also provide a stepping stone for any additional filters that may come up later, as well as replacing some of the existing filters as the need arises.

That’s all for now. Thanks for reading!

CSV import enhancement

As I mentioned in my previous post, I had introduced two new HTML import options to control how numbers in cells are detected & converted during import. I had also hinted at the end of the post that I might add the same import options to the CSV import dialog. I’m writing this post to inform you guys that that’s exactly what I just did.

Here is the new CSV import options dialog:

which highlights the changes I’ve made. At the top of the dialog is the new Language list box, to select what language to use for the import. If it’s set to Default, it uses the language that OOo uses globally. Similar to the HTML import, this option affects how the numbers are parsed, based on the decimal and thousands separators for the selected language, as well as how the special numbers are detected & converted. I’ll talk more on the special number detection later.

In addition to the language option, located in the middle of the dialog are two check boxes to further control how cell values are to be interpreted.

  • Option Quoted field as text, when set, always imports quoted cell values as text, even if they are numbers. This option existed before this change, but was used only for the separator-based CSV imports. It is now used for the fixed width imports as well.
  • Option Detect special numbers, similar to the one in the HTML import option, controls whether or not to detect specially-formatted numbers, such as dates, scientific notations etc. When this option is set, Calc will try to detect special numbers and convert them into appropriate format. When this option is not set, Calc will only convert the simple decimal numbers. By default, this option is not set.

Anyway, that’s all there is to it. Hopefully this will solve, or at least make it easier to handle importing of CSV documents, which was previously hampered by Calc’s aggressive date detection and lack of support for alternative number separators. I doubt that this will make your CSV import experience a perfect one, but it will hopefully make it a much better one.

It’s text when it’s quoted, stupid!

Ok. This is another one I just checked into ooo-build trunk. It’s actually a very minor enhancement, but some people may find this useful.

What I did is to modify Calc’s csv import filter to provide the users an option to always import quoted fields (or cells) as text. Here is a screenshot of the dialog with this new option:

Hopefully what it does is self-evident. If not, here is an example.

Let’s say you have a csv file of the following content:

"0-0" "1-0" "2-0" "3-0" "4-0" "5-0"
"0-1" "1-1" "2-1" "3-1" "4-1" "5-1"

When you import this file into Calc, here is what you get with the current version:

As you can see, most of the fields got converted to dates. You get the same results even if those individual cells are not quoted, because Calc’s current csv import filter doesn’t make any distinction between quoted and unquoted fields in terms of data types.

When you import this file using the new option, you get this instead:

Now, all quoted fields are imported as text, with no surprise conversion. Plain and simple. :-)