SUSE Hack Week

Last week was SUSE’s Hack Week – an event my employer holds periodically to let us hard-working engineers pursue our wildest ideas and execute them in one week. Just like at my last Hack Week, I decided to work on integrating the Orcus library into LibreOffice once again, picking up where I’d left off in my previous integration work.

Integration bits

Prior to Hack Week, orcus was already partially integrated; it provided the backend for Calc’s XML Source feature, as well as experimental support for Gnumeric file import. The XML Source side was pretty well integrated, but the normal file import side was only partially so. Lots of essential pieces were still missing, the largest of which were:

  • support for multiple filters from a single external filter provider source (such as orcus),
  • progress indicator in the status bar, and
  • proper type detection by analyzing file content rather than its extension (which we call “deep detection”).

In short, I was able to complete the first two pieces during Hack Week, while the last item has yet to be worked on. Aside from these, a few more minor pieces are still missing, but perhaps I can work on the remaining bits during the next Hack Week.

Enabling orcus in your build

If you have a recent enough build from the master branch of the LibreOffice repository, you can enable imports via the orcus library by

  1. checking the Enable experimental features box in the Options dialog, and
  2. setting the environment variable LIBO_USE_ORCUS to YES before launching Calc.

This will override the stock import filters for ODS, XLSX and CSV. At present, orcus only performs file-extension-based detection rather than content-based detection, so be mindful of this when you try it on your machine. To go back to the current import filters, simply disable experimental features, or unset the environment variable.

Note that I’ve added these bits to showcase a preview of what orcus can potentially do as a future import filter framework. As such, don’t use this in production if you want a stable file loading experience, and please don’t file bugs against it. We are not ready for that yet. The orcus filters are still missing lots and lots of features.

Also note that, while in theory you could enable orcus in a Windows build, the performance of orcus on Windows may not be that impressive; in some cases it may even be slower than the current filters. That is because orcus relies on the strtod and strtol C library functions to convert string numbers into numeric values, and their implementations depend on the platform. The performance of MSVC’s strtod implementation is known to be suboptimal compared to those of gcc and clang on Linux. I’m very much aware of this, and will work on addressing it at a later time.

Performance comparison

This is perhaps the most interesting part. I wanted to do a quick performance comparison and see how the orcus filter stands up against the current filter. Given that the orcus filter is still only capable of importing raw cell values and not any other features or properties (not even cell formats), I used this test file, which consists solely of raw text and numeric values in an 8-by-300000 range, to keep the comparison as fair and representative as I could make it.
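
To give a sense of what such a file looks like, here is a sketch of how a comparable workbook could be generated. This is not the actual test file I used; the script assumes the third-party xlsxwriter Python package, and the cell contents are made up for illustration.

#!/usr/bin/env python
# Sketch: generate a workbook comparable to the test file described
# above -- raw text and numeric values in an 8-by-300000 range.
# Not the actual test file used for the measurements below.
import xlsxwriter

wb = xlsxwriter.Workbook("test.xlsx")
ws = wb.add_worksheet()
for row in range(300000):
    for col in range(8):
        if col % 2:
            ws.write_number(row, col, row * 1.0 / (col + 1))  # numeric cell
        else:
            ws.write_string(row, col, "R%dC%d" % (row, col))  # text cell
wb.close()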

Here is the result on my machine running openSUSE 11.4:

[Chart: xlsx load times]

The current filter, which has undergone its share of performance optimizations for raw cell values, still takes upwards of 50 seconds. Given that it used to take minutes to load this file, that is still an improvement.

The orcus filter, on the other hand, combined with the heavily optimized load handler in Calc core that I put in place during Hack Week, can load the same file in 4.5 seconds. I would say that is pretty impressive.

I also measured the load time for the same file in Excel 2007, on the same machine running on top of wine, and the result was 7.5 seconds. While running a Windows app via the wine compatibility layer may incur some performance cost, this page suggests that any such cost should barely be noticeable, and my own experience of running various versions of Excel via wine backs that up. So this number should be fairly representative of Excel’s native performance on the same hardware.

Considering that my ultimate goal with orcus is to beat Excel at loading its own files (or at least not be slower than Excel), I would say we are making good progress toward that goal.

That’s all for today. Thank you, ladies and gentlemen.

Orcus integration into LibreOffice

Last week was SUSE Hack Week, where we SUSE engineers were encouraged to be creative and work on whatever project we had been dying to work on.

Given this opportunity, I decided to try integrating my orcus library project into LibreOffice proper, to see how much we could improve the performance of loading spreadsheet documents.

I’ll leave the detailed description and goals of the orcus project for another blog post, but in short, orcus is an independent library designed to process spreadsheet documents, and designed to be usable from any application that would like to use it to load documents. It’s still a work in progress, and not even of alpha quality, which is why I intentionally don’t release orcus library packages on an official basis.

Integration work

The main difficulty with integrating orcus into LibreOffice proper was dealing with the very intricate loading process that LibreOffice uses for all existing filters. It first goes through an elaborate type detection process, which loads the content of the file into memory so that the type detection code can parse it. Once the correct type is determined, LibreOffice instantiates the correct frame loader and starts the actual loading process. I’ve explained all of this in detail in this blog post of mine.

Orcus, on the other hand, only needs a file path, and it does the rest; it pushes data to the callback functions provided by the client code as it parses the file. It was this difference in the overall loading process that made the integration of orcus into LibreOffice all the more challenging. And even though the hack week itself lasted only one week, I had spent months prior to it studying the type detection code and the other auxiliary code that makes up the whole file loading process, in order to come up with an elegant way to add a hook for orcus.
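
To illustrate the difference, here is a minimal sketch of the push model in Python. All names here are hypothetical; the real orcus interface is C++, and this only mirrors the idea of the parser driving client callbacks rather than the client pulling data.

# Hypothetical push-parser sketch: the parser owns the whole loading
# process, and the client only receives data through callbacks.
class ClientHandler(object):
    def start_sheet(self, name):
        print("sheet: %s" % name)

    def cell(self, row, col, value):
        print("cell (%d, %d) = %r" % (row, col, value))

def load_file(path, handler):
    # Parse the file and push every cell to the client as we go.
    with open(path) as f:
        handler.start_sheet("Sheet1")
        for row, line in enumerate(f):
            for col, token in enumerate(line.rstrip("\n").split(",")):
                handler.cell(row, col, token)

# load_file("test.csv", ClientHandler())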

Long story short, I was able to come up with a way to hook in orcus such that LibreOffice relinquishes all of its file loading to the orcus library and only handles callbacks. To make this work, I first packaged orcus into an installable rpm package using the openSUSE build service, locally installed that package, then added the --with-system-orcus configure option to allow LibreOffice to find the library. The entire change needed to add the hook is condensed into this commit.

Using CSV filter as an experiment

As an initial experiment, I replaced the current csv import filter with one from orcus, just to see how this overall process works. The results are very encouraging.

With a very large csv file I created via this python script:

#!/usr/bin/env python
# Write a large CSV file to stdout: 65536 rows, each with 100
# numeric values followed by a literal "end" column.
# (Python 2; replace xrange with range for Python 3.)

import sys

for i in xrange(0, 65536):
    for j in xrange(1, 101):
        val = i * 1.0 / j
        sys.stdout.write("%g,"%val)
    sys.stdout.write("end\n")

the current filter takes roughly 27 seconds to load this file, which is not too bad given its sheer size (~50 MB). The orcus filter, on the other hand, takes only 11 seconds to load the same file.

However, the orcus filter code path still skips a number of steps that would need to be performed if it were used in a production build, such as:

  • drawing the progress bar in the status bar area,
  • calculating row heights for rows that include multi-line cell contents, and
  • probably other things I’m forgetting here.

Given that some of these can be quite expensive, the above numbers may not be fully comparable. Despite that, these initial numbers show great promise for the performance improvement that may result from using the orcus library.

Future work

First of all, we will not switch to the orcus csv filter anytime soon. Although I’d like to see that happen at some point in the future, there are still lots of missing pieces in the orcus csv filter that prevent us from using it in a production build. My plan with orcus is therefore limited to the addition of new filters; my immediate plan is to develop new XML import and export filters using orcus and integrate them into LibreOffice. This should also provide a stepping stone for any additional filters that may come later, as well as for replacing some of the existing filters as the need arises.

That’s all for now. Thanks for reading!

Ixion – threaded formula calculation library

I spent my entire last week on a personal project, taking advantage of Novell’s HackWeek. Officially, HackWeek took place two weeks ago, but because I had to travel that week, I postponed mine until the following week.

Ixion is the project I worked on as part of my HackWeek. This project is an experimental effort to develop a stand-alone library that supports parallel computation of formula expressions using threads. I’d been working on this on and off in my spare time, but when the opportunity came along to spend one week of my paid time on any project of my choice (personal or otherwise), I didn’t hesitate to pick Ixion.

Overview

So, what’s Ixion? Ixion aims to provide a library for calculating the results of formula expressions stored in multiple named targets, or “cells”. The cells can reference each other, and the library takes care of resolving their dependencies automatically upon calculation. The caller can run the calculation routine in either single-threaded or multi-threaded mode. The library also supports re-calculation: when the contents of one or more cells have been modified since the last calculation, a partial calculation of only the affected cells is performed. It is written entirely in C++, and makes extensive use of the boost libraries to achieve portability across platforms. It currently builds on Linux and Windows.

The goal is to eventually bring this library up to the level where it can serve as a full-featured calculation engine for spreadsheet applications. But right now, it remains an experimental, proof-of-concept project to help me understand what is required to build a threaded calculation engine capable of performing all sorts of tasks required in a typical spreadsheet app.

I consider this a library project; however, building it currently creates only a single stand-alone console application. I plan to separate it into a shared library and a front-end executable in the future, to allow external apps to dynamically link to it.

How it works

Building this project creates an executable called ixion-parser. Running it with a -h option displays the following help content:

Usage: ixion-parser [options] FILE1 FILE2 ...
 
The FILE must contain the definitions of cells according to the cell definition rule.
 
Allowed options:
  -h [ --help ]         print this help.
  -t [ --thread ] arg   specify the number of threads to use for calculation.  
                        Note that the number specified by this option 
                        corresponds with the number of calculation threads i.e.
                        those child threads that perform cell interpretations. 
                        The main thread does not perform any calculations; 
                        instead, it creates a new child thread to manage the 
                        calculation threads, the number of which is specified 
                        by the arg.  Therefore, the total number of threads 
                        used by this program will be arg + 2.

The parser expects one or more cell definition files as arguments. A cell definition file may look like this:

%mode init
A1=1
A2=A1+10
A3=A2+A1*30
A4=(10+20)*A2
A5=A1-A2+A3*A4
A6=A1+A3
A7=A7
A8=10/0
A9=A8
%calc
%mode result
A1=1
A2=11
A3=41
A4=330
A5=13520
A6=42
A7=#REF!
A8=#DIV/0!
A9=#DIV/0!
%check
%mode edit
A6=A1+A2
%recalc
%mode result
A1=1
A2=11
A3=41
A4=330
A5=13520
A6=12
A7=#REF!
A8=#DIV/0!
A9=#DIV/0!
%check
%mode edit
A1=10
%recalc

I hope the format of the cell definition rule is straightforward. The definitions are read from top to bottom. I used the so-called A1 notation to name target cells, but it doesn’t have to be that way; you can use any naming scheme as long as the lexer recognizes the names. The format also supports a command construct: a line beginning with a ‘%’ is considered a command. Several commands are currently available. For instance, the mode command lets you switch input modes. The parser currently supports three input modes:

  • init – initialize cells with specified contents.
  • result – pick up expected results for cells, for verification.
  • edit – modify cell contents.

In addition to the mode command, the following commands are also supported (a minimal sketch of a reader for this format follows the list):

  • calc – perform full calculation, by resetting the cached results of all involved cells.
  • recalc – perform partial re-calculation of modified cells and cells that reference modified cells, either directly or indirectly.
  • check – verify the calculation results.
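
Here is that reader sketch in Python. It mirrors only the file format described above, not the actual ixion parser; the command handling is reduced to a placeholder print.

# Minimal reader sketch for the cell definition format: '%' lines are
# commands, everything else is NAME=EXPRESSION under the current mode.
def read_model(path):
    mode = None
    model = {"init": {}, "result": {}, "edit": {}}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("%"):
                words = line[1:].split()
                if words[0] == "mode":
                    mode = words[1]  # switch input mode
                else:
                    # 'calc', 'recalc' and 'check' would trigger the
                    # corresponding actions in the real parser.
                    print("command: %s" % words[0])
            else:
                name, expr = line.split("=", 1)
                model[mode][name] = expr
    return model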

Given all this, let’s see what happens when you run the parser with the above cell definition file.

./ixion-parser -t 4 test/01-simple-arithmetic.txt 
Using 4 threads
Number of CPUS: 4
---------------------------------------------------------
parsing test/01-simple-arithmetic.txt
---------------------------------------------------------
A1: 1
A1: result = 1
---------------------------------------------------------
A2: A1+10
A2: result = 11
---------------------------------------------------------
A3: A2+A1*30
A3: result = 41
---------------------------------------------------------
A4: (10+20)*A2
A4: result = 330
---------------------------------------------------------
A5: A1-A2+A3*A4
A5: result = 13520
---------------------------------------------------------
A8: 10/0
result = #DIV/0!
---------------------------------------------------------
A6: A1+A3
A6: result = 42
---------------------------------------------------------
A9: 
result = #DIV/0!
---------------------------------------------------------
A7: result = #REF!
---------------------------------------------------------
checking results
---------------------------------------------------------
A2 : 11
A8 : #DIV/0!
A3 : 41
A9 : #DIV/0!
A4 : 330
A5 : 13520
A6 : 42
A7 : #REF!
A1 : 1
---------------------------------------------------------
recalculating
---------------------------------------------------------
A6: A1+A2
A6: result = 12
---------------------------------------------------------
checking results
---------------------------------------------------------
A2 : 11
A8 : #DIV/0!
A3 : 41
A9 : #DIV/0!
A4 : 330
A5 : 13520
A6 : 12
A7 : #REF!
A1 : 1
---------------------------------------------------------
recalculating
---------------------------------------------------------
A1: 10
A1: result = 10
---------------------------------------------------------
A2: A1+10
A2: result = 20
---------------------------------------------------------
A3: A2+A1*30
A3: result = 320
---------------------------------------------------------
A4: (10+20)*A2
A4: result = 600
---------------------------------------------------------
A5: A1-A2+A3*A4
A5: result = 191990
---------------------------------------------------------
A6: A1+A2
A6: result = 30
---------------------------------------------------------
(duration: 0.00113601 sec)
---------------------------------------------------------

Notice that at the beginning of the output, the parser displays the number of threads being used, and the number of “CPU”s it detected. Here, “CPU” may refer to the number of physical CPUs, the number of cores, or the number of hyper-threading units. I’m well aware that I need a different term for this, but anyway… The number of child threads used to perform calculation can be specified at run-time via the -t option; without it, the parser runs in single-threaded mode. Now, let me go over what the above output means.

The first calculation performed is a full calculation. Since no cells have been calculated yet, we need to calculate results for all defined cells. This is followed by a verification of the initial calculation. After this, we modify cell A6, and perform partial re-calculation. Since no other cells depend on the result of cell A6, the re-calc only calculates A6.

Now, the third calculation is also a partial re-calculation following the modification of cell A1. This time, because several other cells do depend on the result of A1, those cells also need to be re-calculated. The end result is that cells A1, A2, A3, A4, A5 and A6 all get re-calculated.

Under the hood

Cell dependency resolution

[Diagram: cell dependency graph]
There are several technical aspects of the implementation of this library I’d like to cover. The first is cell dependency resolution. I use a well-known algorithm called topological sort to order cells by dependency, so that each cell can be calculated without being blocked by the calculation of its precedent cells. Topological sort is typically used to schedule inter-dependent tasks, and it was a perfect fit for resolving cell dependencies. The algorithm is a by-product of a depth-first search of a directed acyclic graph (DAG), and is well-documented; a quick Google search should give you tons of pseudo-code examples. It works well for both the full calculation and partial re-calculation routines.
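
Here is a minimal sketch of the idea in Python: a DFS over the dependency DAG emits each cell after all of its precedents, and a visit stack catches circular references. This illustrates the algorithm only; it is not Ixion’s actual code.

# Topological sort of cells via DFS postorder on the dependency DAG.
def topo_sort(cells, precedents_of):
    """cells: iterable of cell names; precedents_of: name -> set of
    names that must be calculated first.  Returns calculation order."""
    order, visited, in_stack = [], set(), set()

    def visit(cell):
        if cell in in_stack:
            raise ValueError("circular reference involving %s" % cell)
        if cell in visited:
            return
        in_stack.add(cell)
        for dep in precedents_of.get(cell, ()):
            visit(dep)
        in_stack.discard(cell)
        visited.add(cell)
        order.append(cell)  # postorder: all precedents already emitted

    for cell in cells:
        visit(cell)
    return order

# topo_sort(["A5"], {"A5": {"A1", "A2"}, "A2": {"A1"}})
# -> ["A1", "A2", "A5"]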

Managing threaded calculation

The heart of this project is to implement parallel evaluation of formula expressions, which has been my number one goal from the get-go. This is also why I focused on designing the threaded calculation engine before turning to other areas. Programming with threads was also very new to me, so I took extra care to ensure that I understood what I was doing and was designing it correctly. A framework that uses multiple threads can also easily get out of hand when overdone, so I made an extra effort to limit the area where multiple threads are used while keeping the rest of the code single-threaded, in order to keep the code simple and maintainable.

As I soon realized, even knowing the basics of programming with threads, you are not immune to the many pitfalls that arise during the actual design and debugging of concurrent code. You have to go the extra mile to ensure that access to shared data is synchronized, and that one thread waits for another when threads must execute in a certain order. These things may sound like common sense and are probably in every thread programming textbook, but in reality they are very easy to overlook, especially for those without substantial prior exposure to concurrency. That is how unorthodox parallelism seemed to a conventional mind like mine. Having said all that, once you go through enough pain dealing with concurrency, it does become less painful after a while; your mind simply adjusts to “thinking in parallel”.

Back to the topic. I’ve picked the following pattern to manage threaded calculation.

[Diagram: thread design]

First, the main thread creates a new thread whose job is to manage the cell queue: it receives cells queued by the main thread and assigns them to idle threads for calculation. It is also responsible for keeping track of which threads are idle and ready to take on a cell assignment. Let’s call this thread the queue manager thread. When the queue manager thread is created, it spawns a specified number of child threads, and waits until they are all ready. These child threads are the ones that perform cell calculation, and we call them calc threads.

Each calc thread registers itself as an idle thread upon creation, then sleeps until the queue manager thread assigns it a cell to calculate and signals it to wake up. Once awake, it calculates the cell, registers itself as an idle thread once again and goes back to sleep. This cycle continues until the queue manager thread sends a termination request to it, after which it breaks out of the cycle and reaches the end of its execution path to terminate.

The role of the queue manager thread is to receive cell calculation requests from the main thread and pass them on to idle calc threads, which it keeps doing until it receives a termination request. Upon receiving the termination request, it sends all the remaining cells in the queue to the calc threads to finish up, then sends termination requests to the calc threads and waits until all of them terminate.

Thanks to the cells being sorted in topological order, the process of putting a cell in the queue and having a calc thread calculate it is entirely asynchronous. The only exception is that, when referencing another cell during calculation, the result of that referenced cell may not yet be available at the time of the value query due to concurrency. In such cases, the calculating thread blocks until the result of the referenced cell becomes available. When running in single-threaded mode, on the other hand, the result of a referenced cell is guaranteed to be available, as long as cells are calculated in topological order and contain no circular references.
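
Here is a rough sketch of this pattern in Python. Ixion itself is written in C++; this only illustrates the control flow, and Python’s Queue lets idle calc threads pull work directly, which stands in for the idle-thread bookkeeping described above.

# Queue manager / calc thread pattern, simplified.
import threading, queue

def calc_cell(name):
    print("calculating %s" % name)  # stand-in for formula interpretation

def queue_manager(inbox, n_calc_threads):
    def calc_thread():
        while True:
            cell = inbox.get()  # sleep until a cell (or sentinel) arrives
            if cell is None:
                break           # termination request
            calc_cell(cell)

    workers = [threading.Thread(target=calc_thread)
               for _ in range(n_calc_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

inbox = queue.Queue()
manager = threading.Thread(target=queue_manager, args=(inbox, 4))
manager.start()
for cell in ["A1", "A2", "A3", "A4", "A5"]:  # already in topological order
    inbox.put(cell)
for _ in range(4):
    inbox.put(None)  # one termination sentinel per calc thread
manager.join()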

What I accomplished during HackWeek

During HackWeek, I was able to accomplish quite a few things. Before HackWeek, the threaded calculation framework was not even there; the parser could only reliably perform calculation in single-threaded mode. I had some test code for designing the threaded queue management framework, but it had yet to be integrated into the main formula interpreter code. A lot of work was still needed, but thanks to having an entire week devoted to this, I was able to

  • port the test threaded queue manager framework code into the formula interpreter code,
  • adapt the circular dependency detection code to the new threaded calculation framework,
  • test the new framework to squeeze lots of kinks out,
  • apply some performance optimizations to the cell definition parser and the formula lexer code,
  • implement the result verification framework, and
  • implement partial re-calculation.

Had I had to do all this in my spare time alone, it would easily have taken months. So, I’m very thankful for the event, and I look forward to another opportunity like this in the hopefully not-so-distant future.

What lies ahead

So, what lies ahead for Ixion, you may ask? There are quite a few things to get done. Let me start by saying that this library is far from providing all the features that a typical spreadsheet application needs, so there is still lots of work needed to make it even usable. Moreover, I’m not even sure whether this library will become usable enough for real-world spreadsheet use, or whether it will simply end up being just another interesting proof of concept. My hope is of course to see it evolve into maturity, but at the same time I’m aware that it would be hard to advance this project with only my scarce spare time to spend on it.

With that said, here are some outstanding issues that I plan on addressing as time permits.

  • Add support for literal strings, and support textual formula results in addition to numerical results.
  • Add support for empty cells. Empty cells are those cells that are not defined in the model definition file but can still be referenced. Currently, referencing a cell that is not defined causes a reference error.
  • Add support for cell ranges. This implies that I need to make cell instances addressable by 3-dimensional coordinates rather than by pointer values.
  • Split the code into two parts: a shared library and an executable.
  • Use autoconf to make the build process configurable.
  • Make the expression parser feature-complete.
  • Implement more functions. Currently only MAX and MIN are implemented.
  • Support for localized numbers.
  • Lots and lots more.

Conclusion

This concludes my HackWeek report. Thank you very much, ladies and gentlemen.

HackWeek – Minor polish

As some of us have already blogged, last week was a Hack Week inside Novell, where we Novell engineers were allowed to work on whatever project we pleased. Given the opportunity, I decided to do some UI polish work for OOo that I had always wanted to do but could not due to other priorities. These are the results of my Hack Week effort.

First, I wanted to implement animated borders to outline copied ranges. Currently, copied ranges are outlined with static solid borders, but it was not always obvious to users what those borders were for. Excel and Gnumeric, for instance, use animated dashed borders, which depict copied ranges more intuitively than static borders do. Long story short, we now have animated dashed borders in Calc as well.

It’s not obvious in the above screenshot since it’s a static image, but trust me, it does animate. ;-) I consider this a natural extension of the previous work that Jon Pryor did for pasting on the ENTER key.

The second thing I did was brush up the document-modified status window to display a disk image indicating whether the document has been modified. Previously, OOo displayed ‘*’ when the current document was modified, and nothing when it was not. I wanted to make it a little fancier so that it would catch more of the users’ attention. Anyway, here is the result.


This is what the status bar looks like when the document is modified. The image I used here is basically a reduced version of the save icon from the Tango icon theme. However, I am not an artist, and I don’t consider this image final, so it is still subject to change without notice.


This is what the status bar looks like when the document is not modified. Basically a black & white version of the document-modified image, with some translucency applied.

That’s all the work I did during Hack Week. I couldn’t spend as much time on it as I would have liked, since I still had to take care of other tasks even during Hack Week, but hopefully you guys like what I did.

Hack Week: Day 5 (Friday) – The last day

Well, today was the last day of Hack Week, but unfortunately I wasn’t able to squish the remaining 20% of unconverted resource files as I had planned yesterday; I squished only 5%, which brings my conversion success rate from yesterday’s 80% to 85%. I’m pretty happy with this result, however, considering that some of the resource files I tried to convert are not even dialog resource files.

Here is what I did today:

  • Fixed incorrect expansion of preprocessing macros. The script just didn’t do the right thing when performing recursive macro expansion. This time I really got it right, but it consumed the majority of today’s hacking time. :-(
  • Reworked my expression evaluation code to fully support reverse Polish notation (RPN); a sketch of the approach follows this list. The absence of this feature caused a parse failure on some files, because the position and size of some widgets are given as a mathematical expression (e.g. (24 + 10)/2) instead of a single number. I got the RPN parser to work, but then I realized that I could have just used Python’s builtin eval function to evaluate a whole expression in one step. Well, duh! I learned how to code an RPN builder and evaluate an expression with it, though, which was a fun exercise.
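
For the curious, here is a minimal sketch of that approach: convert the token stream to RPN with the shunting-yard algorithm, then fold the RPN with a stack. This illustrates the technique only; it is not the converter’s actual code.

# Evaluate an expression such as "(24 + 10)/2" via RPN.
import operator

OPS = {"+": (1, operator.add), "-": (1, operator.sub),
       "*": (2, operator.mul), "/": (2, operator.truediv)}

def to_rpn(tokens):
    out, stack = [], []
    for tok in tokens:
        if tok in OPS:
            while stack and stack[-1] in OPS and OPS[stack[-1]][0] >= OPS[tok][0]:
                out.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()  # discard the '('
        else:
            out.append(float(tok))
    while stack:
        out.append(stack.pop())
    return out

def eval_rpn(rpn):
    stack = []
    for tok in rpn:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok][1](a, b))
        else:
            stack.append(tok)
    return stack[0]

# eval_rpn(to_rpn(["(", "24", "+", "10", ")", "/", "2"]))  -> 17.0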

So, this concludes this week’s Novell Hack Week event. It was certainly fun, although I couldn’t do everything I wanted to do. I’ll be back to my normal hacking activities next Monday.

Hack Week: Day 4 (Thursday) – The joy of preprocessing macros (not!)

Well, I didn’t have any huge achievements yesterday – day 4 of Novell’s Hack Week. But here is a list of things I did to improve the robustness of the converter script.

  • Added support to (semi-)correctly parse preprocessing macros: ones that take no arguments, ones that do, and ones that include other macros recursively.
  • Added support to parse header files, without which many preprocessing macros would be left undefined, thus causing a parse failure.
  • Added arithmetic support, again in the preprocessing macros.
  • Fixed numerous bugs that were uncovered while working on the preprocessing macro support, and rewrote some of the algorithms to make them work better.

My conclusion? Preprocessing macros are evil! Since macros are expanded before the source file is parsed, they follow their own syntax rules, different from those of the host language. Simple expansion is rather easy, but once macros start taking arguments and recursively using other macros (or both), things become a bit tricky. Anyway, the worst is over, I hope…
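
For illustration, here is a sketch of the easy case: recursive expansion of argument-less macros. Macros with arguments need real tokenization on top of this; the code below is a simplified illustration, not the converter’s implementation, and the macro names in the example are made up.

# Recursively expand argument-less macros in a line of source text.
import re

def expand(text, macros, depth=0):
    if depth > 32:
        raise RuntimeError("macro recursion too deep (circular definition?)")

    def repl(m):
        name = m.group(0)
        if name in macros:
            # A macro body may itself contain macros, hence the recursion.
            return expand(macros[name], macros, depth + 1)
        return name

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", repl, text)

# expand("Size = MAP_APPFONT ( HALF_W , ROW_H ) ;",
#        {"HALF_W": "FULL_W / 2", "FULL_W": "256", "ROW_H": "14"})
# -> "Size = MAP_APPFONT ( 256 / 2 , 14 ) ;"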

With this improvement, I can now correctly convert 80% of all of the src files we have in our OO.o source tree. Hopefully I can squish the remaining 20% today.

To recap (for those who missed my previous Hack Week posts), I am writing a .src to .xml converter script to migrate the existing dialog resource files (which are statically designed) to a new xml format that carries layout information. The new xml files will be used as a starting point for re-designing all our existing dialogs for the new dialog layout engine in development.

Hack Week: .src converter to convert ~700 .src files

So, after some discussion with Ricardo, I have decided to take on the task of writing a converter script to convert ~700 .src files into xml files, which will be used as a starting point for re-designing each and every dialog for the new dialog layout engine. I was initially thinking about working on the dialog editor, but it sounds like Ricardo has that under control, so better not to mess with it. :-)

Writing a converter script of course involves parsing a source file in order to generate output. Typically there are two ways to go about this:

  1. Parse the source file partially for just the information you need using a flat search, and ignore the rest, or
  2. Parse the source file fully according to the syntax of the language, using a lexer-parser pattern.

The advantage of the first method is simplicity; it’s pretty easy to set up a simple regexp-based parser and start parsing. The disadvantage is that, as the parsing need grows and you need to pick up more and more information, the parser code becomes complex and full of special-case handling, and eventually requires a total re-write. Good luck extending such code as the need grows even further.

The second method, while it takes a little upfront effort, is extensible once the framework is set up, and the code usually ends up better structured, with only minimal special-case handling if designed correctly. This method is also well-suited to parsing a token-based language, where whitespace and line-break characters are only syntactic sugar and do not affect the semantics. For example, C/C++ and Java are token-based, while Python is not. Since the syntax of the src files is very similar to that of C, I’ve decided to use the second method for this task.
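
As a minimal illustration of the second method, here is a toy lexer-parser pair in Python for a C-like resource syntax. It handles only flat "Type Name { key = value ; };" blocks, far less than the real script, but it shows how the lexer discards whitespace and the parser consumes tokens according to grammar rules.

# Toy lexer-parser for a C-like resource syntax.  Illustration only.
import re

def lex(text):
    # Whitespace is dropped here -- it is only syntactic sugar.
    token_re = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|[{};=]')
    return token_re.findall(text)

def parse_block(tokens, pos):
    node = {"type": tokens[pos], "name": tokens[pos + 1], "props": {}}
    pos += 3  # skip type, name and '{'
    while tokens[pos] != "}":
        key = tokens[pos]
        assert tokens[pos + 1] == "="
        node["props"][key] = tokens[pos + 2]
        pos += 4  # skip key, '=', value and ';'
    return node, pos + 2  # skip '}' and ';'

src = "PushButton BTN_ADD { TabStop = TRUE ; DefButton = TRUE ; } ;"
node, _ = parse_block(lex(src), 0)
# node -> {'type': 'PushButton', 'name': 'BTN_ADD',
#          'props': {'TabStop': 'TRUE', 'DefButton': 'TRUE'}}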

I spent yesterday and today writing this converter script from scratch (in Python), and I’ve come to a point where it parses a large number of src files and correctly generates their xml output files. Here is one example case.

The source file:

/*************************************************************************
 *
 *  OpenOffice.org - a multi-platform office productivity suite
 *
 *  $RCSfile: crnrdlg.src,v $
 *
 *  $Revision: 1.44 $
 *
 *  last change: $Author: ihi $ $Date: 2007/04/19 16:36:48 $
 *
 *  The Contents of this file are made available subject to
 *  the terms of GNU Lesser General Public License Version 2.1.
 *
 *
 *    GNU Lesser General Public License Version 2.1
 *    =============================================
 *    Copyright 2005 by Sun Microsystems, Inc.
 *    901 San Antonio Road, Palo Alto, CA 94303, USA
 *
 *    This library is free software; you can redistribute it and/or
 *    modify it under the terms of the GNU Lesser General Public
 *    License version 2.1, as published by the Free Software Foundation.
 *
 *    This library is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *    Lesser General Public License for more details.
 *
 *    You should have received a copy of the GNU Lesser General Public
 *    License along with this library; if not, write to the Free Software
 *    Foundation, Inc., 59 Temple Place, Suite 330, Boston,
 *    MA  02111-1307  USA
 *
 ************************************************************************/
#include "crnrdlg.hrc"
ModelessDialog RID_SCDLG_COLROWNAMERANGES
{
    OutputSize = TRUE ;
    Hide = TRUE ;
    SVLook = TRUE ;
    Size = MAP_APPFONT ( 256 , 181 ) ;
    HelpId = HID_COLROWNAMERANGES ;
    Moveable = TRUE ;
     // Closeable = TRUE;   // This dialog has a Cancel button!
    FixedLine FL_ASSIGN
    {
        Pos = MAP_APPFONT ( 6 , 3 ) ;
        Size = MAP_APPFONT ( 188 , 8 ) ;
        Text [ en-US ] = "Range" ;
    };
    ListBox LB_RANGE
    {
        Pos = MAP_APPFONT ( 12 , 14 ) ;
        Size = MAP_APPFONT ( 179 , 85 ) ;
        TabStop = TRUE ;
        VScroll = TRUE ;
        Border = TRUE ;
    };
    Edit ED_AREA
    {
        Border = TRUE ;
        Pos = MAP_APPFONT ( 12 , 105 ) ;
        Size = MAP_APPFONT ( 165 , 12 ) ;
        TabStop = TRUE ;
    };
    ImageButton RB_AREA
    {
        Pos = MAP_APPFONT ( 179 , 104 ) ;
        Size = MAP_APPFONT ( 13 , 15 ) ;
        TabStop = FALSE ;
        QuickHelpText [ en-US ] = "Shrink" ;
    };
    RadioButton BTN_COLHEAD
    {
        Pos = MAP_APPFONT ( 20 , 121 ) ;
        Size = MAP_APPFONT ( 171 , 10 ) ;
        TabStop = TRUE ;
        Text [ en-US ] = "Contains ~column labels" ;
    };
    RadioButton BTN_ROWHEAD
    {
        Pos = MAP_APPFONT ( 20 , 135 ) ;
        Size = MAP_APPFONT ( 171 , 10 ) ;
        TabStop = TRUE ;
        Text [ en-US ] = "Contains ~row labels" ;
    };
    FixedText FT_DATA_LABEL
    {
        Pos = MAP_APPFONT ( 12 , 151 ) ;
        Size = MAP_APPFONT ( 179 , 8 ) ;
        Text [ en-US ] = "For ~data range" ;
    };
    Edit ED_DATA
    {
        Border = TRUE ;
        Pos = MAP_APPFONT ( 12 , 162 ) ;
        Size = MAP_APPFONT ( 165 , 12 ) ;
        TabStop = TRUE ;
    };
    ImageButton RB_DATA
    {
        Pos = MAP_APPFONT ( 179 , 161 ) ;
        Size = MAP_APPFONT ( 13 , 15 ) ;
        TabStop = FALSE ;
        QuickHelpText [ en-US ] = "Shrink" ;
    };
    OKButton BTN_OK
    {
        Pos = MAP_APPFONT ( 200 , 6 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    CancelButton BTN_CANCEL
    {
        Pos = MAP_APPFONT ( 200 , 23 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    PushButton BTN_ADD
    {
        Pos = MAP_APPFONT ( 200 , 104 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        Text [ en-US ] = "~Add" ;
        TabStop = TRUE ;
        DefButton = TRUE ;
    };
    PushButton BTN_REMOVE
    {
        Pos = MAP_APPFONT ( 200 , 122 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        Text [ en-US ] = "~Delete" ;
        TabStop = TRUE ;
    };
    HelpButton BTN_HELP
    {
        Pos = MAP_APPFONT ( 200 , 43 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    Text [ en-US ] = "Define Label Range" ;
};

and here is the output after the conversion:

<modeless-dialog height="181" help-id="HID_COLROWNAMERANGES" hide="true" moveable="true" output-size="true" sv-look="true" text="Define Label Range" width="256" xmlns="http://openoffice.org/2007/layout" xmlns:cnt="http://openoffice.org/2007/layout/container">
    <vbox>
        <fixed-line id="FL_ASSIGN" height="8" text="Range" width="188" x="6" y="3"/>
        <ok-button id="BTN_OK" height="14" tab-stop="true" width="50" x="200" y="6"/>
        <list-box id="LB_RANGE" border="true" height="85" tab-stop="true" vscroll="true" width="179" x="12" y="14"/>
        <cancel-button id="BTN_CANCEL" height="14" tab-stop="true" width="50" x="200" y="23"/>
        <help-button id="BTN_HELP" height="14" tab-stop="true" width="50" x="200" y="43"/>
        <hbox>
            <image-button id="RB_AREA" height="15" quick-help-text="Shrink" tab-stop="false" width="13" x="179" y="104"/>
            <push-button id="BTN_ADD" def-button="true" height="14" tab-stop="true" text="~Add" width="50" x="200" y="104"/>
        </hbox>
        <edit id="ED_AREA" border="true" height="12" tab-stop="true" width="165" x="12" y="105"/>
        <radio-button id="BTN_COLHEAD" height="10" tab-stop="true" text="Contains ~column labels" width="171" x="20" y="121"/>
        <push-button id="BTN_REMOVE" height="14" tab-stop="true" text="~Delete" width="50" x="200" y="122"/>
        <radio-button id="BTN_ROWHEAD" height="10" tab-stop="true" text="Contains ~row labels" width="171" x="20" y="135"/>
        <fixed-text id="FT_DATA_LABEL" height="8" text="For ~data range" width="179" x="12" y="151"/>
        <image-button id="RB_DATA" height="15" quick-help-text="Shrink" tab-stop="false" width="13" x="179" y="161"/>
        <edit id="ED_DATA" border="true" height="12" tab-stop="true" width="165" x="12" y="162"/>
    </vbox>
</modeless-dialog>

These are the steps I take to convert each file. First, the source file is read character by character and tokenized by the lexer class; this is where comments (both multi-line and single-line) get stripped out and the preprocessing macros are defined. The tokens are then passed to the parser class to build a syntax tree (preprocessor macros are expanded here), which is then converted into an intermediate XML tree with names translated and some attribute types converted properly, such as the position and the size, which are originally given in MAP_APPFONT( a, b ) format. Some unnecessary information is also discarded at this stage.

Once that’s done, the script further translates the intermediate XML tree into another XML tree that has layout elements. The X and Y positions of each widget are used to lay the widgets out properly, wrapping them in <vbox> and <hbox> elements as needed. The tree is then dumped into a stream of text, which is what you see above.
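
The row-grouping step can be sketched like this: widgets sharing the same Y position form one row (a future <hbox>), and the rows, stacked by Y, form the enclosing <vbox>. This is a simplified, hypothetical version of that logic, not the converter’s actual code.

# Group widgets into rows by their Y position.
def group_into_rows(widgets):
    """widgets: list of (id, x, y) tuples.  Returns rows from top to
    bottom, each row ordered left to right."""
    rows = []
    for wid, x, y in sorted(widgets, key=lambda w: (w[2], w[1])):
        if rows and rows[-1][0] == y:
            rows[-1][1].append(wid)  # same Y -> same <hbox> row
        else:
            rows.append((y, [wid]))  # new row in the <vbox>
    return [row for _, row in rows]

widgets = [("RB_AREA", 179, 104), ("BTN_ADD", 200, 104), ("ED_AREA", 12, 105)]
print(group_into_rows(widgets))
# -> [['RB_AREA', 'BTN_ADD'], ['ED_AREA']]; the two-widget row becomes
#    the <hbox> you can see in the output above.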

Unfortunately this task is not done yet. As it turns out, some src files require the inclusion of header files in order to be parsed correctly, which means I need to honor those #include "foo.hrc" directives; right now, they are ignored. On top of that, there may also be cases where #ifdef directives need to be interpreted correctly, but so far ignoring them has not caused any side effects.

I’m sure there are other problems I’ll encounter as I parse more src files, but I’d say the end is near. :-)

Hack Week: Helping make OO.o’s dialog resizable

So, this is day one of Novell’s Hack Week. This week, we Novell hackers are allowed to work on whatever project we like, and I chose to work on making VCL dialogs resizable.

Michael Meeks already did the groundwork, and all I’m trying to do is what I can in one week to expand on his work. This is also one of the on-going GSoC tasks, so I’m co-ordinating with the student who’s been assigned to work on it (his name is Ricardo Cruz) so that we won’t step on each other’s toes.

Here is what I did today. I added wrapper code for a list box control so that I can actually use it in my resizable dialog and add items to it. Here are some screenshots.

[Screenshot: OO.o resizable dialog demo (small)]

[Screenshot: OO.o resizable dialog demo (large)]

I posted two shots of the same dialog at different sizes, just to show that it’s resizable. Pretty cool, huh? :-)

Oh, BTW, since I’m away from my normal business this week, I won’t be working on the OOXML filter. I’ll be back on my regular schedule next Monday.