DBF import performance

Here is another performance win! Importing dbf files into Calc now takes roughly 80% less time. You will probably notice the difference especially when importing a large dbf file. The test document I used had roughly 24,000 rows, and importing it took 57 seconds on my machine. Having 24,000 rows in a database file (or even in a spreadsheet file) is very common by today's standards, so this wasn't good at all.

I have done quite a bit of performance work over the years, but this one was somewhat difficult to tackle. The bottlenecks were fragmented all over the place, which required different solutions in different areas. Roughly speaking, the following are the areas I tackled to reduce the total import time for dbf files (module name in parentheses):

  • speedup in parsing of dbf file content (connectivity)
  • disabled property change notifications during dbf import (dbaccess; see the first sketch after this list)
  • more efficient string interning and Unicode conversion (sal; second sketch)
  • reduction in column array re-allocation during import (sc; third sketch)
  • removal of unnecessary column and row size adjustments post-import (sc)
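
To give a flavor of the notification change, here is a minimal, hypothetical sketch; the BulkImportGuard class and its callbacks are invented for illustration and are not the actual dbaccess API. The idea is simply to stop broadcasting a change event for every imported row, and to send one notification when the bulk operation finishes.

    #include <functional>
    #include <utility>

    // Hypothetical RAII guard (not the real dbaccess API): silence per-row
    // property change notifications for the duration of a bulk import, then
    // fire a single combined notification when the import finishes.
    class BulkImportGuard
    {
        std::function<void(bool)> maSetSilent; // toggles notification delivery
        std::function<void()>     maNotifyAll; // one catch-all notification

    public:
        BulkImportGuard(std::function<void(bool)> aSetSilent,
                        std::function<void()> aNotifyAll)
            : maSetSilent(std::move(aSetSilent))
            , maNotifyAll(std::move(aNotifyAll))
        {
            maSetSilent(true); // stop broadcasting individual changes
        }

        ~BulkImportGuard()
        {
            maSetSilent(false); // re-enable notifications...
            maNotifyAll();      // ...and send one "everything changed" event
        }
    };

Constructed once at the top of the import loop, a guard like this collapses tens of thousands of per-row notifications into a single one at the end.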
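
The string interning and conversion work can be sketched along these lines (again hypothetical; the StringPool class is not the actual sal code): convert each distinct byte sequence to Unicode only once, and hand out a shared copy for every repeated occurrence.

    #include <memory>
    #include <string>
    #include <unordered_map>

    // Hypothetical string pool: maps the raw (encoded) bytes of a dbf field
    // to a shared Unicode string, so that repeated values are converted and
    // allocated only once.
    class StringPool
    {
        std::unordered_map<std::string, std::shared_ptr<std::u16string>> maPool;

    public:
        std::shared_ptr<const std::u16string> intern(const std::string& rBytes)
        {
            auto it = maPool.find(rBytes);
            if (it != maPool.end())
                return it->second; // cache hit: no conversion, no allocation

            // Cache miss: convert once (a real importer would run the proper
            // charset conversion here; this plain widening is a placeholder)
            // and remember the shared instance.
            auto pStr = std::make_shared<std::u16string>(rBytes.begin(), rBytes.end());
            maPool.emplace(rBytes, pStr);
            return pStr;
        }
    };

This pays off because dbf columns tend to repeat the same string values many times: a column full of repeated category names then costs one conversion and one allocation instead of thousands.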
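
Finally, a hedged sketch of the re-allocation reduction: the dbf header states the row count up front, so each column's storage can be grown once before the row loop instead of being re-allocated over and over as rows are appended. The Column type and importRows function are made up for this example.

    #include <cstddef>
    #include <vector>

    // Hypothetical column store: one growable array of cell values per column.
    struct Column
    {
        std::vector<double> aValues;
    };

    void importRows(std::vector<Column>& rColumns, std::size_t nRowCount)
    {
        // The dbf header tells us the row count in advance, so allocate each
        // column's array once instead of letting push_back re-allocate (and
        // copy) the array repeatedly as it grows.
        for (Column& rCol : rColumns)
            rCol.aValues.reserve(nRowCount);

        for (std::size_t nRow = 0; nRow < nRowCount; ++nRow)
            for (Column& rCol : rColumns)
                rCol.aValues.push_back(0.0); // placeholder: real cell value here
    }

With the reserve() call in place, each column pays for a single allocation instead of a series of grow-and-copy cycles.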

With all of this, the file that originally took 57 seconds to load now loads in 12 seconds on the same hardware, which roughly translates to an 80% reduction in the total import time!

This by itself is pretty impressive; however, I was hoping to get it at least under 10 seconds, since Excel can load the same file in less than 5 seconds on the same hardware, even when run through Wine (!). But that's probably a future project. For now, I'm content with what I've done.

7 thoughts on “DBF import performance”

  1. Terrific!
    How hard would it be to modify the Calc import module to serve as the Base import module?

Some Base tutorials and forums suggest importing CSV files by importing them into Calc and then copying and pasting into Base! Not bad for small tables, but what about tables with more than 64,000 rows?

    http://sheepdogguides.com/fdb/fdb1imp1.htm
    http://www.oooforum.org/forum/viewtopic.phtml?t=25434

    or raw SQL
    http://user.services.openoffice.org/en/forum/viewtopic.php?f=83&t=23260

    Some likely issues:
    1. The modified import module would have to be callable from Base (is there a separate import module in Calc?).
2. The modified import module would have to output to HyperSQL (HSQLDB) or whatever database Base is using.
    http://www.hsqldb.org/
    a. Database interface
    http://wiki.services.openoffice.org/wiki/HSQLDB:Tips_and_Tricks
    http://wiki.services.openoffice.org/wiki/HSQLDB_Integration
    b. Would have to guess at field types, especially numeric (integer or real?).
    3. The modified import module would have to be able to process more than 64,000 rows.

    I assume this is non-trivial or it would have been done years ago.

    Thanks,
    Jim Callahan
    Orlando, FL

    1. Hi Jim,

As I also said in my reply to your email, having Base support CSV files as an editable data source will not be trivial, to say the least. But I'm not yet all that familiar with Base, so you are best advised to ask about this on the dev@dba.openoffice.org mailing list.

      Good luck!

  2. Hi Jody, good to hear from you. :-)

Unfortunately the file is confidential, but it's just a regular, ordinary-looking dbf file with 24,000 rows spanning columns A through CN. About half of the records are string records. Any file that resembles that will do, I think.

I just opened the same file in Gnumeric on my machine, and Gnumeric can open it in ~11 seconds. So it does slightly better than Calc even after all my optimizations.

Comments are closed.