Update on border lines

Just a quick update to my last post on getting Calc’s border line situation sorted out.

As of last post, the border lines were pretty in good shape as far as printing to paper, but it was still less than satisfactory when rendered on screen. Lines looked generally fatter and the dashes line were unevenly positioned. I had some ideas that I wanted to try out in order to make the border lines look prettier on screen. So I went ahead and spent a few extra days to give that a try, and I’m happy to report that that effort paid off.

To recap, this is what the border lines looked like as of last Friday.

screen-calc-after

and this is what they look like now:

screen-calc-followup

The lines are skinnier, which in my opinion make them look slicker, and the dashes lines are now evenly spaced and look much better.

The art of drawing border lines

I spent this past week on investigating a collection of various problems surrounding how Calc draws cell borders. The problem is very hard to define and can become very subjective depending on who you talk to. Having said that, if you ever imported an Excel document that makes elaborate use of cell borders into Calc, you may often have seen that the borders were printed somewhat differently than what you would have expected.

Here is an example. This is a very small test Excel document that I made that contains all cell border types that Excel supports. When you open this document in Excel and print it on paper, here is what you get.

excel-print

When you open this document in Calc and print it, you probably get something like this:

calc-print-before

You’ll immediately notice that some of the lines (hair, dashed and double lines to be precise) are not printed at all! Not only that, thin, medium and thick lines are a little skinner than those of Excel’s, the dotted line is barely visible, the medium dashed line looks a lot different, and the rest of the dashed lines all became solid lines.

Therefore, it was time for action.

Results

I’ll spare you the details, but the good news is that after spending a week in various parts of the code base, I’ve been able to fix most of the major issues. Here is what Calc prints now using the build from the latest master branch:

calc-print-after

There are still some minor discrepancies from Excel’s borders, such as the double line being a bit too thinner, the dotted line being not as dense as Excel’s etc. But I consider this a step in the right direction. The dashed and medium dashed lines look much better to my eye, and the thicknesses of these lines are more comparable to Excel’s.

The dash-dot and dash-dot-dot lines still become solid lines since we don’t yet support those line types, but that can be worked on at a later time.

So, this is all good, right?

Not quite. One of the reasons why the cell borders became such a big issue was that we previously focused too much on getting them to display correctly on screen. Unfortunately, the resolution of a typical PC monitor is not high enough to accurately depict the content of your document, so what you see on screen is a pixelized approximation of the actual content. When printing to a paper, on the other hand, the content gets depicted much more accurately simply because you get much higher resolution when printing.

I’ll give you a side-by-side comparison of how the content of the same document gets displayed in Excel (2010), Calc 4.2 (before my change), and Calc master (with my change) all at 100% zoom level.

First up is Excel:

screen-excel

The lines all look correct, unsurprisingly. One thing to note is that when displaying Excel approximates a hairline with a very thin, densely dotted line to differentiate it from a thin line both of which are one pixel high. But make no mistake; hairline by definition is a solid line. This is just a trick Excel employs in order to make the hairline look thinner than the thin line counterpart.

Then comes Calc as of 4.2 (before my change):

screen-calc-before

The hairline became a finely-dashed line both on display and in internal representation. Aside from that, both dashed and medium dashed lines look a bit too far apart. Also, the double line looks very much single. In terms of the line thicknesses, however, they do look very much comparable to Excel’s. Let me also remind you that Excel’s dash-dot and dash-dot-dot lines currently become solid lines in Calc because we don’t support these line types yet.

Now here is what Calc displays after my change:

screen-calc-after

The hair line is a solid line since we don’t use the same hair line trick that Excel uses. The dotted and dashes lines look much denser and in my opinion look better. The double line is now really double. The line thicknesses, however, are a bit off even though they are internally more comparable to Excel’s (as you saw in the printout above). This is due to the loss of precision during rasterization of the border lines, and for some reason they get fatter. We previosly tried to “fix” this by making the lines thinner internally, but that was a wrong approach since that also made the lines thinner even when printed, which was not a good thing. So, for now, this is a compromise we’ll have to live with.

But is there really nothing we can do about this? Well, we could try to apply some correction to make the lines look thinner on screen, and on screen only. I have some ideas how we may be able to achieve that, and I might give that a try during my next visit.

That, and we should also support those missing dash-dot, and dash-dot-dot line types at some point.

SUSE Hack Week

Last week was SUSE’s Hack Week – an event my employer does periodically to allow us – hard working engineers – to go wild with our wildest ideas and execute them in one week. Just like what I did at my last Hack Week event, I decided to work on integration of Orcus library into LibreOffice once again, to pick up on what I’d left off from my previous integration work.

Integration bits

Prior to Hack Week, orcus was already partially integrated; it was used to provide the backend functionality for Calc’s XML Source feature, and experimental support for Gnumeric file import. The XML Source side was pretty well integrated, but the normal file import side was only partially integrated. Lots of essential pieces were still missing, the largest of which were

  • support for multiple filters from a single external filter provider source (such as orcus),
  • progress indicator in the status bar, and
  • proper type detection by analyzing file content rather than its extension (which we call “deep detection”).

In short, I was able to complete the first two pieces during Hack Week, while the last item still has yet to be worked on. Aside from this, there are still more minor pieces missing, but perhaps I can work on the remaining bits during the next Hack Week.

Enabling orcus in your build

If you have a recent enough build from the master branch of the LibreOffice repository, you can enable imports via orcus library by

  1. checking the Enable experimental features box in the Options dialog, and
  2. setting the environment variable LIBO_USE_ORCUS to YES before launching Calc.

This will overwrite the stock import filters for ODS, XLSX and CSV. At present, orcus only performs file extension based detection rather than content based one, so be mindful of this when you try this on your machine. To go back to the current import filters, simply disable experimental features, or unset the environment variable.

Note that I’ve added this bits to showcase a preview of what orcus can potentially do as a future import filter framework. As such, never use this in production if you want stable file loading experience, or don’t file bugs against this. We are not ready for that yet. Orcus filters are still missing lots and lots of features.

Also note that, while in theory you could enable orcus with the Windows build, the performance of orcus on Windows may not be that impressive; in fact, in some cases slower than the current filters. That is because orcus relies on strtod and strtol system calls to convert string numbers into numeric values, and their implementation depend on the platform. And the performance of MSVC’s strtod implementation is known to be suboptimal compared to those of gcc and clang on Linux. I’m very much aware of this, and will work on addressing this at a later time.

Performance comparison

This is perhaps the most interesting part. I wanted to do a quick performance comparison and see how this orcus filter stands up against the current filter. Given the orcus filter is still only capable of importing raw cell values and not any other features or properties (not even cell formats), I’ve used this test file which only consists of raw text and numeric values in a 8-by-300000 range, to measure the load times that are as fair and representative as I could make them. Here is the result on my machine running openSUSE 11.4:

xlsx-load-times

The current filter, which has undergone its set of performance optimizations on raw cell values, still spends upwards of 50 seconds. Given that it used to take minutes to load this file, it’s still an improvement.

The orcus filter, on the other hand, combined with the heavily optimized load handler in Calc core that I put in place during Hack Week, can load the same file in 4.5 seconds. I would say that is pretty impressive.

I also measured the load time on the same file using Excel 2007, on the same machine running on top of wine, and the result was 7.5 seconds. While running an Windows app via wine emulation layer may incur some performance cost, this page suggests that it should not be noticeable, if any. And my own experience of running various versions of Excel via wine backs up that argument. So this number should be fairly representative of Excel’s native performance on the same hardware.

Considering that my ultimate goal with orcus is to beat Excel on performance on loading its own files (or at least not be slower than Excel), I would say we are making good progress toward that goal.

That’s all for today. Thank you, ladies and gentlemen.

mso-dumper now packaged in OBS

I’m happy to announce that the mso-dumper tool is now packaged in the openSUSE build service under my home repository. This tool is written in Python, and allows you to dump the contents of MS Office documents stored in the BIFF-structured binary file format in a more human readable fashion. It is an indispensable tool when dealing with importing from and/or exporting to the Office documents. Right now, only Excel and Power Point formats are supported.

This package provides two new commands xls-dump and ppt-dump. If you wish to dump the content of an Excel document, all you have to do is

xls-dump ./path/to/mydoc.xls

and it dumps its content to standard output. What the output looks like depends on what’s stored with the document, but it will look something like this:

...
0085h: =============================================================
0085h: BOUNDSHEET - Sheet Information (0085h)
0085h:   size = 14
0085h: -------------------------------------------------------------
0085h: B4 09 00 00 00 00 06 00 53 68 65 65 74 31 
0085h: -------------------------------------------------------------
0085h: BOF position in this stream: 2484
0085h: sheet name: Sheet1
0085h: hidden state: visible
0085h: sheet type: worksheet or dialog sheet

008Ch: =============================================================
008Ch: COUNTRY - Default Country and WIN.INI Country (008Ch)
008Ch:   size = 4
008Ch: -------------------------------------------------------------
008Ch: 01 00 01 00 

00EBh: =============================================================
00EBh: MSODRAWINGGROUP - Microsoft Office Drawing Group (00EBh)
00EBh:   size = 90
00EBh: -------------------------------------------------------------
00EBh: 0F 00 00 F0 52 00 00 00 00 00 06 F0 18 00 00 00 
00EBh: 02 04 00 00 02 00 00 00 02 00 00 00 01 00 00 00 
00EBh: 01 00 00 00 02 00 00 00 33 00 0B F0 12 00 00 00 
00EBh: BF 00 08 00 08 00 81 01 09 00 00 08 C0 01 40 00 
00EBh: 00 08 40 00 1E F1 10 00 00 00 0D 00 00 08 0C 00 
00EBh: 00 08 17 00 00 08 F7 00 00 10 

00FCh: =============================================================
00FCh: SST - Shared String Table (00FCh)
00FCh:   size = 8
00FCh: -------------------------------------------------------------
00FCh: 00 00 00 00 00 00 00 00 
00FCh: -------------------------------------------------------------
00FCh: total number of references: 0
00FCh: total number of unique strings: 0
...

I have originally written this tool to deal with the Excel import and export part of Calc’s development, and continue to develop it further. Thorsten Behrens has later joined forces and added support for the Power Point format. Right now, I’m working on adding an XML output format option to make it easier to compare outputs, which is important for regression testing.

ODF Support in MS Office 2007

I guess it’s all over the news right now, that the latest service pack (SP2) for MS Office 2007 will enable Office to import and export ODF natively. This blog piece by Doug Mahugh touches on the word processor part of their ODF support. I haven’t yet tried it myself, but judging by Doug’s blog article it looks pretty impressive.

But, being more of a spreadsheet person, I’m personally more interested in how it fares in Excel to Calc interoperability. Since their ODF support is on ODF 1.1, which predates the on-going OpenFormula specification work, I’d be interested to see how compatible the formulas are. Technically speaking, as of ODF 1.1, interpreting formula expressions was pretty much application-specific, so I would not be surprised even if they are not compatible at all. But we’ll just have to see.

Either way, I find this news very encouraging. This is undoubtedly a big step toward proliferation of ODF as a practical document exchange format.

Encrypting an Excel document with password from Calc

The feature of exporting an Excel document encrypted with a password has landed! Now you can save a spreadsheet document as an Excel document with password protection and open it in Excel with that password. The proof is in the following screenshot (the colored emphasis is mine):

Saving a Calc document as Excel with password protection (with encryption)

This feature is now available in the go-oo version of OOo. This also means that, when you edit & save an existing Excel document that has already been encrypted, it will get saved encrypted. Previously when you did this, Calc would save it un-encrypted and you would lose your password protection, which was not good and was a deal breaker for a certain segment of users.

The upstream effort of this feature is underway in the scsheetprotection02 CWS. That CWS also contains another enhancement for sheet and document structure protection (sounds similar but totally unrelated to this document encryption feature), and as soon as I take care of upstreaming that enhancement, I can push the whole CWS for upstream integration. The bad news is that, because that requires a change in the ODF file format & UI change, it will probably take some time before it can be integrated upstream. But I’m making slow but steady progress there, so I’ll keep you guys updated.

Meanwhile, please test the go-oo version for this encryption feature to see if there is any document that Calc fails to encrypt properly (that means Excel can’t open it), and report any bugs to us. I’ve done my own testing, but it’s never as good as many other users testing it. So, thanks!

P.S. Actually this feature has been available for at least two months in the 3.0 branch, but I wanted to see this bug fixed before writing a blog about this so that I could take a pretty screenshot with all texts displayed.

New sheet protection dialog

I’ve just finished designing a new dialog for Calc’s sheet protection functionality to allow optional sheet protection options. This was actually my first time designing a dialog from scratch instead of modifying an existing one, so I had to dig around and figure out how to add a dialog. It turns out that it is actually very simple once you know what to do. After several hours of creative designing process, I’ve come up with something I can show to people. So here it is:

sheet protection dialog screenshot

One thing to note: obviously this dialog is inspired by the similar functionality offered by Excel, and Excel provides many more options for sheet protection than just the two I’m showing here. The reason I only have two at the moment is because I’ve only implemented support for those two options in Calc core. When we support more options in the core, we can easily add them to the dialog.

This work is on-going in scsheetprotection02 CWS. Aside from the new dialog and sheet protection options, this CWS contains my other work on the binary Excel export encryption as well as sheet and document password interoperability between Excel and Calc. I’m trying to wrap this up, so hopefully I can come up with something that people can try out soon.

Excel workbook and worksheet protection

I didn’t mean to keep this a secret, but I’ve been working on two tasks since December to further enhance Calc’s interoperability with Excel. The first one is to support correct import of an Excel document that includes password-protected worksheets, and the second one is to support exporting of a Calc document into an encrypted Excel format with password protection.

Workbook and worksheet protection with password

Currently, when Calc imports an Excel document with password-protected sheet, it retains the protected state but discards the password. What I’m trying to do is to retain the password if the sheet is password-protected and also export the sheet password when that document is exported back into Excel.

The difficulty of retaining a sheet password from Excel all lies in finding out the correct password hashing function that Excel uses, then the rest will be easy. This is because when you type in your password to protect a worksheet, Excel doesn’t store the actual password text, but immediately hashes it and stores that hash value in the document instead of the original text (BTW, Calc does the same). So, if you don’t know the correct hashing algorithm that generates an identical hash value given the same password, all bets are off.

As I covered in my previous blog post, however, this has already been taken care of. On top of that, Excel uses the same hashing function for both worksheet and workbook protections, so we can take care of both types of protections with just this one hashing function mentioned in my previous blog.

There is one important thing to note here. This worksheet-level protection is primarily for preventing accidental editing of protected worksheet during run-time, and this still does not prevent the content of such “protected” worksheet from being modified by directly altering the content of the file itself. This is because the content of the protected worksheet is not encrypted inside the file, so this level of password “protection” can be easily circumvented.

Workbook-level protection with password

Now, Excel also offers a workbook-level protection with password that actually involves encryption of document’s content, and that’s the second task I’ve been working on. There are two instances where Excel encrypts the workbook content:

  1. when you protect workbook’s structure (you can either protect the workbook structure, windows or both for workbook-level protection) or
  2. when you set a password from the Save As dialog (click on Tools near the upper-right corner and select General Options).

Currently, Calc can open an encrypted Excel document, but cannot save it back with the content still encrypted. So, when you open such a document and save it, it loses its content encryption (though a warning dialog is displayed before it writes the content back to the file unencrypted). What I’m trying to do is to allow the user to save the content of an already encrypted document back encrypted, or save a new Calc document to an encrypted Excel document with password protection.

This work is still on-going, but I’m pretty confident that the hardest part is already over. What makes this task really hard is the fact that Excel doesn’t encrypt the entire file stream, but encrypts only the record contents and leaves the record headers unencrypted. Also, certain records, and even parts of certain records, are not encrypted. If the entire file stream was encrypted, it would be relatively simpler to put an encryption layer on top of the normal record stream. But because some parts of the stream are not encrypted, it is necessary to check every single record that we write to make sure that we are encrypting only the right parts of the stream, or else Excel will not load the document at all. But like I said, the worst is over. I can now correctly generate an encrypted Excel file with various content types: even those documents with form controls, charts, cell comments, autoshapes, and OLE objects.

One interesting trivia. The encryption algorithm that Excel uses to encrypt a document requires a password, but Excel allows document encryption without a password. So, what does Excel do if no password is given? It uses the default password VelvetSweatshop. :-)

Anyway, both of these tasks are on-going in the scsheetprotection01 CWS. The original IZ issues are i60305 and i84766. There is still some work ahead before I can get this CWS integrated, but I’m making progress. Thank you, ladies and gentlemen.

Excel sheet protection password hash

When you protect either your workbook or one of your worksheets with a password in Excel, Excel internally generates a 16-bit hash of your password and stores it instead of the original password text. The hashing algorithm used for that was previously unknown, but thanks to the infamous Office Open XML specification it is now documented for the world to see (take a look at Part 4, Section 3.3.1.81 – Sheet Protection Options for the details). Thankfully, the algorithm is identical for all recent versions of Excel including XP, 2003 and 2007, so you can simply reuse the documented algorithm for the older versions of Excel too.

But alas! the documented algorithm is incorrect; it does not produce correct hash values. Being determined to find out the correct algorithm, however, I started to analyze the hashes that the documented algorithm produces, and compare them with the real hash values that Excel generates, in order to decipher the correct algorithm.

In the end, the documented algorithm was, although not accurate, pretty close enough that I was able to make a few changes and derive the algorithm that generates correct values. The following code:

#include <stdio.h>
 
using namespace std;
 
typedef unsigned char sal_uInt8;
typedef unsigned short sal_uInt16;
 
sal_uInt16 getPasswordHash(const char* szPassword)
{
    sal_uInt16 cchPassword = strlen(szPassword);
    sal_uInt16 wPasswordHash = 0;
    if (!cchPassword)
        return wPasswordHash;
 
    const char* pch = &szPassword[cchPassword];
    while (pch-- != szPassword)
    {
        wPasswordHash = ((wPasswordHash >> 14) & 0x01) | 
                        ((wPasswordHash << 1) & 0x7fff);
        wPasswordHash ^= *pch;
    }
 
    wPasswordHash = ((wPasswordHash >> 14) & 0x01) | 
                    ((wPasswordHash << 1) & 0x7fff);
 
    wPasswordHash ^= (0x8000 | ('N' << 8) | 'K');
    wPasswordHash ^= cchPassword;
 
    return wPasswordHash;
}
 
int main (int argc, char** argv)
{
    if (argc < 2)
        exit(1);
 
    printf("input password = %s\n", argv[1]);
    sal_uInt16 hash = getPasswordHash(argv[1]);
    printf("hash = %4.4X\n", hash);
 
    return 0;
}

produces the right hash value from an arbitrary password. One caveat: this algorithm takes an 8-bit char array, so if the input value consists of 16-bit unicode characters, it needs to be first converted into 8-bit character array. The conversion algorithm is also documented in the OOXML specification. I have not tested it yet, but I hope that algorithm is correct. ;-)

Importing Excel 2007 files

The Excel 2007 filter for Calc is still on its way, but perhaps now is a good time to show the progress of this new Excel 2007 import filter work by Daniel Rentz and myself.

Here is a screenshot of a file created by Excel 2007 (left), and one for the same file opened in Calc (right).

Excel 2007 screenshot Excel 2007 file opened in Calc

There are still a lot to be done, however. Formula import is still to be completed, which blocks other features that rely on the formula parser. Charts, text boxes, and other graphic objects are still not imported yet. There is also a performance issue of a large xlsx file import, which needs to be addressed at some point.

But all in all, things are coming along very nicely.