Posts Tagged ‘filter’

mso-dumper now packaged in OBS

March 20th, 2010

I’m happy to announce that the mso-dumper tool is now packaged in the openSUSE build service under my home repository. This tool is written in Python, and allows you to dump the contents of MS Office documents stored in the BIFF-structured binary file format in a more human readable fashion. It is an indispensable tool when dealing with importing from and/or exporting to the Office documents. Right now, only Excel and Power Point formats are supported.

This package provides two new commands xls-dump and ppt-dump. If you wish to dump the content of an Excel document, all you have to do is

xls-dump ./path/to/mydoc.xls

and it dumps its content to standard output. What the output looks like depends on what’s stored with the document, but it will look something like this:

...
0085h: =============================================================
0085h: BOUNDSHEET - Sheet Information (0085h)
0085h:   size = 14
0085h: -------------------------------------------------------------
0085h: B4 09 00 00 00 00 06 00 53 68 65 65 74 31
0085h: -------------------------------------------------------------
0085h: BOF position in this stream: 2484
0085h: sheet name: Sheet1
0085h: hidden state: visible
0085h: sheet type: worksheet or dialog sheet

008Ch: =============================================================
008Ch: COUNTRY - Default Country and WIN.INI Country (008Ch)
008Ch:   size = 4
008Ch: -------------------------------------------------------------
008Ch: 01 00 01 00 

00EBh: =============================================================
00EBh: MSODRAWINGGROUP - Microsoft Office Drawing Group (00EBh)
00EBh:   size = 90
00EBh: -------------------------------------------------------------
00EBh: 0F 00 00 F0 52 00 00 00 00 00 06 F0 18 00 00 00
00EBh: 02 04 00 00 02 00 00 00 02 00 00 00 01 00 00 00
00EBh: 01 00 00 00 02 00 00 00 33 00 0B F0 12 00 00 00
00EBh: BF 00 08 00 08 00 81 01 09 00 00 08 C0 01 40 00
00EBh: 00 08 40 00 1E F1 10 00 00 00 0D 00 00 08 0C 00
00EBh: 00 08 17 00 00 08 F7 00 00 10 

00FCh: =============================================================
00FCh: SST - Shared String Table (00FCh)
00FCh:   size = 8
00FCh: -------------------------------------------------------------
00FCh: 00 00 00 00 00 00 00 00
00FCh: -------------------------------------------------------------
00FCh: total number of references: 0
00FCh: total number of unique strings: 0
...

I have originally written this tool to deal with the Excel import and export part of Calc’s development, and continue to develop it further. Thorsten Behrens has later joined forces and added support for the Power Point format. Right now, I’m working on adding an XML output format option to make it easier to compare outputs, which is important for regression testing.

DBF import performance

December 11th, 2009

dbf-import-perfHere is another performance win! Importing dbf files into Calc is now quicker by 80%. You will probably notice the difference especially when importing a large dbf file. The test document I used had roughly 24000 rows, and importing that took 57 seconds on my machine. Having 24000 rows in a database file (or even in a spreadsheet file) is very common by today’s standard, so this wasn’t good at all.

I had done quite a bit of performance work over the years, but this one was somewhat difficult to tackle. The bottlenecks were fragmented all over the place which required different solutions to different areas. Roughly speaking, the following are the areas I tackled to reduce the total import time for dbf files (module name in parentheses):

  • speedup in parsing of dbf file content (connectivity)
  • disabled property change notification during dbf import (dbaccess)
  • more efficient string interning and unicode conversion (sal)
  • reduction in column array re-allocation during import (sc)
  • removal of unnecessary column and row size adjustments post-import (sc)

With all of this, the file that originally took 57 seconds to load now loads in 12 seconds on the same hardware, which roughly translates to 80% reduction of the total import time!

This itself is pretty impressive; however, I was hoping to get it at least under 10 seconds since Excel can load the same file less than 5 seconds on the same hardware, even through wine emulation (!). But that’s probably for a future project. For now, I’m content with what I’ve done.

CSV import enhancement

June 9th, 2009

As I mentioned in my previous post, I had introduced two new HTML import options to control how numbers in cells are detected & converted during import. I had also hinted at the end of the post that I might add the same import options to the CSV import dialog. I’m writing this post to inform you guys that that’s exactly what I just did.

Here is the new CSV import options dialog:

which highlights the changes I’ve made. At the top of the dialog is the new Language list box, to select what language to use for the import. If it’s set to Default, it uses the language that OOo uses globally. Similar to the HTML import, this option affects how the numbers are parsed, based on the decimal and thousands separators for the selected language, as well as how the special numbers are detected & converted. I’ll talk more on the special number detection later.

In addition to the language option, located in the middle of the dialog are two check boxes to further control how cell values are to be interpreted.

  • Option Quoted field as text, when set, always imports quoted cell values as text, even if they are numbers. This option existed before this change, but was used only for the separator-based CSV imports. It is now used for the fixed width imports as well.
  • Option Detect special numbers, similar to the one in the HTML import option, controls whether or not to detect specially-formatted numbers, such as dates, scientific notations etc. When this option is set, Calc will try to detect special numbers and convert them into appropriate format. When this option is not set, Calc will only convert the simple decimal numbers. By default, this option is not set.

Anyway, that’s all there is to it. Hopefully this will solve, or at least make it easier to handle importing of CSV documents, which was previously hampered by Calc’s aggressive date detection and lack of support for alternative number separators. I doubt that this will make your CSV import experience a perfect one, but it will hopefully make it a much better one.

HTML import option

May 21st, 2009

What’s new?

I just checked in to the Go-OO master branch an enhancement to Calc’s HTML table import feature. With this enhancement, you can select a custom language to use when you import an HTML table, which affects how the numbers are interpreted during the import.

What’s this good for?

Each language adheres to a different set of conventions, such as date formats, thousands and decimal separators, and other special number formats. Previously, Calc simply used the system’s language setting when importing an HTML in order to decide whether a cell is a nubmer or a text, and if it’s a number, what numbering format to apply. Although you could change the system language in the global options dialog, that was not always convenient especially when all you need to do is quickly import an HTML table with different language conventions, and you don’t want to bother changing the global language setting (and changing it back after the import is done). That’s where this enhancement will come in.

What can you expect with this change?

Well, you’ll get the following dialog

when you are importing an HTML table, to either select the system’s language (Automatic), or a custom language from the list of available languages. Just remember that, if you don’t know or don’t care about what language to choose, select Automatic and move on. If you do care, then select Custom and pick a language from the list.

Special number detection (especially the dates)

There is an additional option below, to toggle automatic number format detection. With this option checked, Calc will try to automatically detect special number formats, such as dates, scientific notations etc. If unchecked, Calc will only detect the simplest numbers i.e. numbers that only consist of digits, decimal separators and thousands separators (and a sign if there is one), while all the other numbers are imported as text. By default, this option is NOT checked, which means Calc will only detect the simple numbers.

BTW, I’m thinking of adding a similar option to the csv import, where automatic date conversion has been quite problematic in the past and driving lots of users crazy. Adding this option also to the csv import would IMO make sense.

Well, this is it. It’s actually a pretty minor enhancement, but I hope someone will find it useful. :-)

It’s text when it’s quoted, stupid!

December 18th, 2008

Ok. This is another one I just checked into ooo-build trunk. It’s actually a very minor enhancement, but some people may find this useful.

What I did is to modify Calc’s csv import filter to provide the users an option to always import quoted fields (or cells) as text. Here is a screenshot of the dialog with this new option:

Hopefully what it does is self-evident. If not, here is an example.

Let’s say you have a csv file of the following content:

"0-0" "1-0" "2-0" "3-0" "4-0" "5-0"
"0-1" "1-1" "2-1" "3-1" "4-1" "5-1"

When you import this file into Calc, here is what you get with the current version:

As you can see, most of the fields got converted to dates. You get the same results even if those individual cells are not quoted, because Calc’s current csv import filter doesn’t make any distinction between quoted and unquoted fields in terms of data types.

When you import this file using the new option, you get this instead:

Now, all quoted fields are imported as text, with no surprise conversion. Plain and simple. :-)

Importing Excel 2007 files

June 20th, 2007

The Excel 2007 filter for Calc is still on its way, but perhaps now is a good time to show the progress of this new Excel 2007 import filter work by Daniel Rentz and myself.

Here is a screenshot of a file created by Excel 2007 (left), and one for the same file opened in Calc (right).

Excel 2007 screenshot Excel 2007 file opened in Calc

There are still a lot to be done, however. Formula import is still to be completed, which blocks other features that rely on the formula parser. Charts, text boxes, and other graphic objects are still not imported yet. There is also a performance issue of a large xlsx file import, which needs to be addressed at some point.

But all in all, things are coming along very nicely.

How to (pretend to) write an export filter

May 8th, 2007

It turns out that pretending to write an export filter, at least adding a new entry to the Export dialog, is quite easy. In fact, you don’t even have to write a single line of code. Here is what to do.

Suppose you do your own build, and you have installed the OO.o that you have built. Now, go back to your build tree, and change directory into the following location

filter/source/config/fragments

and add the following two new files relative to this location:

./filters/calc_Kohei_SDF_Filter.xcu
./filters/calc_Kohei_SDF_ui.xcu

You can name your files anyway you want, of course. ;-) Anyway, put the following XML fragments into these files:

<!-- calc_Kohei_SDF_Filter.xcu -->
<node oor:name="calc_Kohei_SDF_Filter" oor:op="replace">
  <prop oor:name="Flags"><value>EXPORT ALIEN 3RDPARTYFILTER</value></prop>
  <prop oor:name="UIComponent"/>
  <prop oor:name="FilterService"><value>com.sun.star.comp.packages.KoheiSuperDuperFileExporter</value></prop>
  <prop oor:name="UserData"/>
  <prop oor:name="FileFormatVersion"/>
  <prop oor:name="Type"><value>Kohei_SDF</value></prop>
  <prop oor:name="TemplateName"/>
  <prop oor:name="DocumentService"><value>com.sun.star.sheet.SpreadsheetDocument</value></prop>
</node>
 
<!-- calc_Kohei_SDF_Filter_ui.xcu -->
<node oor:name="calc_Kohei_SDF_Filter">
  <prop oor:name="UIName"><value xml:lang="x-default">Kohei Super Duper File Format</value>
    <value xml:lang="en-US">Kohei Super Duper File Format</value>
    <value xml:lang="de">Kohei Super Duper File Format</value>
  </prop>
</node>

Likewise, create another file:

./types/Kohei_SDF.xcu

with the following content

<!-- Kohei_SDF.xcu -->
<node oor:name="Kohei_SDF" oor:op="replace" >
  <prop oor:name="DetectService"/>
  <prop oor:name="URLPattern"/>
  <prop oor:name="Extensions"><value>koheisdf</value></prop>
  <prop oor:name="MediaType"/>
  <prop oor:name="Preferred"><value>false</value></prop>
  <prop oor:name="PreferredFilter"><value>calc_Kohei_SDF_Filter</value></prop>
  <prop oor:name="UIName"><value xml:lang="x-default">Kohei Super Duper File Format</value></prop>
  <prop oor:name="ClipboardFormat"><value>doctype:Workbook</value></prop>
</node>

Once these new files are in place, add these files to fcfg_calc.mk so that the build process can find them. To add, open fcfg_calc.mk and add Kohei_SDF to the end of T4_CALC, calc_Kohei_SDF_Filter to F4_CALC, and calc_Kohei_SDF_Filter_ui to F4_UI_CALC. Save the file and rebuild the module. This should rebuild the following configuration files (build done on Linux):

./unxlngi6.pro/misc/filters/modulepacks/fcfg_calc_types.xcu
./unxlngi6.pro/misc/filters/modulepacks/fcfg_calc_filters.xcu
./unxlngi6.pro/bin/fcfg_langpack_en-US.zip

One note: the language pack zip package should contain the file named Filter.xcu with the new UI string you just put in. If you don’t see that, remove the whole unxlngi6.pro directory and build the module again.

Now it’s time to update your installation. You need to update the following files:

<install_dir>/share/registry/modules/org/openoffice/TypeDetection/Filter/fcfg_calc_filters.xcu
<install_dir>/share/registry/modules/org/openoffice/TypeDetection/Types/fcfg_calc_types.xcu

with the new ones you just rebuilt. Next, unpack the langpack zip file and extract Filter.xcu. Place this file in

<install_dir>/share/registry/res/en-US/org/openoffice/TypeDetection/Filter.xcu

to replace the old one.

Ok so far? There is one more thing you need to do to complete the process. Since these configuration files are cached, in order for the updated configuration files to take effect, the cached data must be removed. The cached data is in the user configuration directory, so you need to locate and delete the following directory:

rm -rf <user_config_dir>/user/registry/cache

That’s it! Now, fire up Calc and launch the Export dialog. You see the new file format entry you just put in. :-)

Export dialog with new export filter entry

Just try not to export your file using this new filter for real, because that will utterly fail. ;-)