Hack Week: .src converter to convert ~700 .src files

So, after some discussion with Ricardo, I have decided to take on the task of writing a converter script to convert ~700 .src files into xml files, which will be used as a starting point for re-designing each and every dialog for the new dialog layout engine. I was initially thinking about working on the dialog editor, but it sounds like Ricardo has it under control, so I'd better not mess with that. :-)

Writing a converter script of course involves parsing a source file in order to generate output. Typically there are two ways to go about this.

  1. Parse the source file partially for just the information you need using a flat search, and ignore the rest, or
  2. Parse the source file fully according to the syntax of the language, using a lexer-parser pattern.

The advantage of the first method is simplicity; it’s pretty easy to set up a simple regexp-based parser and start parsing. The disadvantage is that, as the parsing needs grow and you have to pick up more and more information, the parser code becomes complex and full of special-case handling, and eventually requires a total rewrite. Good luck extending such code as the needs grow even further.
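To make the trade-off concrete, here is a minimal sketch of the first method. It is not part of the actual converter; the `SIZE_RE` pattern and the `find_sizes` helper are made up for illustration. It picks out every `Size` entry with one regexp and ignores everything else.

```python
import re

# Hypothetical flat-search pattern: matches "Size = MAP_APPFONT ( w , h )"
# anywhere in the text, regardless of which block it belongs to.
SIZE_RE = re.compile(r"Size\s*=\s*MAP_APPFONT\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def find_sizes(text):
    """Return every (width, height) pair found in the text, in order."""
    return [(int(w), int(h)) for w, h in SIZE_RE.findall(text)]


sample = """
ModelessDialog RID_SCDLG_COLROWNAMERANGES
{
    Size = MAP_APPFONT ( 256 , 181 ) ;
    FixedLine FL_ASSIGN
    {
        Size = MAP_APPFONT ( 188 , 8 ) ;
    };
};
"""

print(find_sizes(sample))
```

The search finds both sizes, but it has no idea that one belongs to the dialog and the other to the nested FixedLine. Recovering that relationship is exactly the kind of special-case handling that piles up with this approach.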

The second method, while it takes a little upfront effort, is extensible once the framework is set up, and the code usually ends up better structured, with only minimal special-case handling if designed correctly. This method is also well-suited for parsing a token-based language, where whitespace and line-break characters only separate tokens and do not affect the semantics. For example, C/C++ and Java are token-based, while Python is not. Since the syntax of the src files is very similar to that of C, I’ve decided to use the second method for this task.
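As a small illustration of what "token-based" means, the sketch below tokenizes the same statement written with two different whitespace layouts and gets identical tokens. The `tokenize` function and its rules (quoted strings, words made of letters, digits, `_` and `-`, and single punctuation characters) are assumptions made for this example, not the converter's real lexer.

```python
WORD_EXTRA = "_-"  # assumed: lets identifiers like RID_FOO and en-US stay whole


def is_word_char(ch):
    return ch.isalnum() or ch in WORD_EXTRA


def tokenize(text):
    """Split text into string, word and punctuation tokens, dropping whitespace."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch == '"':
            # Simplification: no escaped quotes; raises ValueError if unterminated.
            end = text.index('"', i + 1)
            tokens.append(text[i:end + 1])
            i = end + 1
        elif is_word_char(ch):
            start = i
            while i < n and is_word_char(text[i]):
                i += 1
            tokens.append(text[start:i])
        else:
            tokens.append(ch)  # single punctuation character
            i += 1
    return tokens


spaced = 'Text [ en-US ] = "Define Label Range" ;'
compact = 'Text[en-US]=\n"Define Label Range";'
print(tokenize(spaced) == tokenize(compact))
```

Once the text is reduced to tokens like this, the parser never has to care about indentation or line breaks.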

I spent yesterday and today writing this converter script from scratch (in Python), and I’ve come to a point where it parses a large number of src files and correctly generates their xml output files. Here is one example case.

The source file:

/*************************************************************************
 *
 *  OpenOffice.org - a multi-platform office productivity suite
 *
 *  $RCSfile: crnrdlg.src,v $
 *
 *  $Revision: 1.44 $
 *
 *  last change: $Author: ihi $ $Date: 2007/04/19 16:36:48 $
 *
 *  The Contents of this file are made available subject to
 *  the terms of GNU Lesser General Public License Version 2.1.
 *
 *
 *    GNU Lesser General Public License Version 2.1
 *    =============================================
 *    Copyright 2005 by Sun Microsystems, Inc.
 *    901 San Antonio Road, Palo Alto, CA 94303, USA
 *
 *    This library is free software; you can redistribute it and/or
 *    modify it under the terms of the GNU Lesser General Public
 *    License version 2.1, as published by the Free Software Foundation.
 *
 *    This library is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *    Lesser General Public License for more details.
 *
 *    You should have received a copy of the GNU Lesser General Public
 *    License along with this library; if not, write to the Free Software
 *    Foundation, Inc., 59 Temple Place, Suite 330, Boston,
 *    MA  02111-1307  USA
 *
 ************************************************************************/
#include "crnrdlg.hrc"
ModelessDialog RID_SCDLG_COLROWNAMERANGES
{
    OutputSize = TRUE ;
    Hide = TRUE ;
    SVLook = TRUE ;
    Size = MAP_APPFONT ( 256 , 181 ) ;
    HelpId = HID_COLROWNAMERANGES ;
    Moveable = TRUE ;
     // Closeable = TRUE;   // Dieser Dialog hat einen Cancel-Button !
    FixedLine FL_ASSIGN
    {
        Pos = MAP_APPFONT ( 6 , 3 ) ;
        Size = MAP_APPFONT ( 188 , 8 ) ;
        Text [ en-US ] = "Range" ;
    };
    ListBox LB_RANGE
    {
        Pos = MAP_APPFONT ( 12 , 14 ) ;
        Size = MAP_APPFONT ( 179 , 85 ) ;
        TabStop = TRUE ;
        VScroll = TRUE ;
        Border = TRUE ;
    };
    Edit ED_AREA
    {
        Border = TRUE ;
        Pos = MAP_APPFONT ( 12 , 105 ) ;
        Size = MAP_APPFONT ( 165 , 12 ) ;
        TabStop = TRUE ;
    };
    ImageButton RB_AREA
    {
        Pos = MAP_APPFONT ( 179 , 104 ) ;
        Size = MAP_APPFONT ( 13 , 15 ) ;
        TabStop = FALSE ;
        QuickHelpText [ en-US ] = "Shrink" ;
    };
    RadioButton BTN_COLHEAD
    {
        Pos = MAP_APPFONT ( 20 , 121 ) ;
        Size = MAP_APPFONT ( 171 , 10 ) ;
        TabStop = TRUE ;
        Text [ en-US ] = "Contains ~column labels" ;
    };
    RadioButton BTN_ROWHEAD
    {
        Pos = MAP_APPFONT ( 20 , 135 ) ;
        Size = MAP_APPFONT ( 171 , 10 ) ;
        TabStop = TRUE ;
        Text [ en-US ] = "Contains ~row labels" ;
    };
    FixedText FT_DATA_LABEL
    {
        Pos = MAP_APPFONT ( 12 , 151 ) ;
        Size = MAP_APPFONT ( 179 , 8 ) ;
        Text [ en-US ] = "For ~data range" ;
    };
    Edit ED_DATA
    {
        Border = TRUE ;
        Pos = MAP_APPFONT ( 12 , 162 ) ;
        Size = MAP_APPFONT ( 165 , 12 ) ;
        TabStop = TRUE ;
    };
    ImageButton RB_DATA
    {
        Pos = MAP_APPFONT ( 179 , 161 ) ;
        Size = MAP_APPFONT ( 13 , 15 ) ;
        TabStop = FALSE ;
        QuickHelpText [ en-US ] = "Shrink" ;
    };
    OKButton BTN_OK
    {
        Pos = MAP_APPFONT ( 200 , 6 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    CancelButton BTN_CANCEL
    {
        Pos = MAP_APPFONT ( 200 , 23 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    PushButton BTN_ADD
    {
        Pos = MAP_APPFONT ( 200 , 104 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        Text [ en-US ] = "~Add" ;
        TabStop = TRUE ;
        DefButton = TRUE ;
    };
    PushButton BTN_REMOVE
    {
        Pos = MAP_APPFONT ( 200 , 122 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        Text [ en-US ] = "~Delete" ;
        TabStop = TRUE ;
    };
    HelpButton BTN_HELP
    {
        Pos = MAP_APPFONT ( 200 , 43 ) ;
        Size = MAP_APPFONT ( 50 , 14 ) ;
        TabStop = TRUE ;
    };
    Text [ en-US ] = "Define Label Range" ;
};

and here is the output after the conversion:

<modeless-dialog height="181" help-id="HID_COLROWNAMERANGES" hide="true" moveable="true" output-size="true" sv-look="true" text="Define Label Range" width="256" xmlns="http://openoffice.org/2007/layout" xmlns:cnt="http://openoffice.org/2007/layout/container">
    <vbox>
        <fixed-line id="FL_ASSIGN" height="8" text="Range" width="188" x="6" y="3"/>
        <ok-button id="BTN_OK" height="14" tab-stop="true" width="50" x="200" y="6"/>
        <list-box id="LB_RANGE" border="true" height="85" tab-stop="true" vscroll="true" width="179" x="12" y="14"/>
        <cancel-button id="BTN_CANCEL" height="14" tab-stop="true" width="50" x="200" y="23"/>
        <help-button id="BTN_HELP" height="14" tab-stop="true" width="50" x="200" y="43"/>
        <hbox>
            <image-button id="RB_AREA" height="15" quick-help-text="Shrink" tab-stop="false" width="13" x="179" y="104"/>
            <push-button id="BTN_ADD" def-button="true" height="14" tab-stop="true" text="~Add" width="50" x="200" y="104"/>
        </hbox>
        <edit id="ED_AREA" border="true" height="12" tab-stop="true" width="165" x="12" y="105"/>
        <radio-button id="BTN_COLHEAD" height="10" tab-stop="true" text="Contains ~column labels" width="171" x="20" y="121"/>
        <push-button id="BTN_REMOVE" height="14" tab-stop="true" text="~Delete" width="50" x="200" y="122"/>
        <radio-button id="BTN_ROWHEAD" height="10" tab-stop="true" text="Contains ~row labels" width="171" x="20" y="135"/>
        <fixed-text id="FT_DATA_LABEL" height="8" text="For ~data range" width="179" x="12" y="151"/>
        <image-button id="RB_DATA" height="15" quick-help-text="Shrink" tab-stop="false" width="13" x="179" y="161"/>
        <edit id="ED_DATA" border="true" height="12" tab-stop="true" width="165" x="12" y="162"/>
    </vbox>
</modeless-dialog>

These are the steps I take to convert each file. First, the source file is read character by character and tokenized by the lexer class; this is where the comments (both multi-line and single-line) get stripped out and the preprocessor macros are defined. The tokens are then passed to the parser class to build a syntax tree (preprocessor macros are expanded here), which is then converted into an intermediate XML tree with names translated and some attribute types converted properly, such as the position and the size, which are originally given in MAP_APPFONT( a, b ) format. Also, some unnecessary information is discarded at this stage.
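The pipeline above can be sketched roughly as follows. This is a simplified stand-in, not the real converter: it tokenizes with regular expressions instead of reading character by character, it simply drops preprocessor lines rather than defining and expanding macros, and it derives XML names from a naive CamelCase rule (which would give `svlook` rather than `sv-look`) where the real script presumably uses a proper translation. It also puts an `id` on every element, including the dialog. The names `lex`, `parse_block`, `to_xml_name` and `to_element` are all invented for this sketch.

```python
import re
import xml.etree.ElementTree as ET

TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_][A-Za-z0-9_-]*|\d+|\S')


def lex(text):
    """Strip comments and preprocessor lines, then split into tokens."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)       # /* ... */
    text = re.sub(r"//[^\n]*", "", text)                    # // ... (ignores strings)
    text = re.sub(r"^[ \t]*#.*$", "", text, flags=re.M)     # #include, #define, ...
    return TOKEN_RE.findall(text)


def parse_block(tokens, pos):
    """Parse 'Type Name { ... } ;' at tokens[pos]; return (node, next_pos)."""
    node = {"type": tokens[pos], "name": tokens[pos + 1], "attrs": {}, "children": []}
    pos += 2
    assert tokens[pos] == "{", "expected '{' after block name"
    pos += 1
    while tokens[pos] != "}":
        if tokens[pos + 2] == "{":                 # nested block: Type Name {
            child, pos = parse_block(tokens, pos)
            node["children"].append(child)
        else:                                      # attribute: Key [lang] = value ;
            key = tokens[pos]
            pos += 1
            if tokens[pos] == "[":                 # skip a tag such as [ en-US ]
                pos = tokens.index("]", pos) + 1
            pos += 1                               # skip '='
            end = tokens.index(";", pos)
            node["attrs"][key] = tokens[pos:end]
            pos = end + 1
    pos += 1                                       # skip '}'
    if pos < len(tokens) and tokens[pos] == ";":
        pos += 1
    return node, pos


def to_xml_name(name):
    """Naive name translation: ModelessDialog -> modeless-dialog."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "-", name).lower()


def to_element(node):
    """Convert a syntax-tree node into an intermediate XML element."""
    elem = ET.Element(to_xml_name(node["type"]), {"id": node["name"]})
    for key, value in node["attrs"].items():
        if value[0] == "MAP_APPFONT":              # tokens: MAP_APPFONT ( a , b )
            first, second = ("x", "y") if key == "Pos" else ("width", "height")
            elem.set(first, value[2])
            elem.set(second, value[4])
        elif value[0] in ("TRUE", "FALSE"):
            elem.set(to_xml_name(key), value[0].lower())
        else:
            elem.set(to_xml_name(key), value[0].strip('"'))
    for child in node["children"]:
        elem.append(to_element(child))
    return elem


source = '''
/* header comment */
#include "crnrdlg.hrc"
ModelessDialog RID_SCDLG_COLROWNAMERANGES
{
    Size = MAP_APPFONT ( 256 , 181 ) ;
    // Closeable = TRUE;
    FixedLine FL_ASSIGN
    {
        Pos = MAP_APPFONT ( 6 , 3 ) ;
        Size = MAP_APPFONT ( 188 , 8 ) ;
        Text [ en-US ] = "Range" ;
    };
};
'''

tree, _ = parse_block(lex(source), 0)
root = to_element(tree)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the lexer, the parser and the tree conversion as separate steps is what makes the second method extensible: supporting a new attribute type only touches `to_element`, and the tokenizing rules stay untouched.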

Once that’s done, it further translates the intermediate XML tree into another XML tree that has layout elements. The X and Y positions of each widget are used to lay out the widgets properly by wrapping them with <vbox> and <hbox> elements as needed. The tree is then dumped into a stream of text, which is what you see above.
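One simple rule that is consistent with the sample output above is: sort the widgets by Y and then X, put widgets that share the same Y into an <hbox>, and stack everything inside one <vbox>. The sketch below implements only that inferred rule; the real script may well do something more elaborate. The `layout` function and the generic `widget` element name are placeholders for this example.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def layout(widgets):
    """Build a <vbox> from (id, x, y) tuples, wrapping same-Y widgets in <hbox>."""
    vbox = ET.Element("vbox")
    ordered = sorted(widgets, key=lambda w: (w[2], w[1]))   # by y, then x
    for _, row in groupby(ordered, key=lambda w: w[2]):     # one group per y
        row = list(row)
        parent = ET.SubElement(vbox, "hbox") if len(row) > 1 else vbox
        for wid, x, y in row:
            ET.SubElement(parent, "widget", id=wid, x=str(x), y=str(y))
    return vbox


widgets = [
    ("BTN_ADD", 200, 104),
    ("ED_AREA", 12, 105),
    ("RB_AREA", 179, 104),
    ("BTN_HELP", 200, 43),
]
vbox = layout(widgets)
print(ET.tostring(vbox, encoding="unicode"))
```

With these four widgets, RB_AREA and BTN_ADD share y=104 and end up side by side in an <hbox>, just as they do in the converted dialog above. Note that this rule requires an exact Y match, so ED_AREA at y=105 lands on its own row even though it visually sits on the same line.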

Unfortunately this task is not done yet. As it turns out, some src files require the inclusion of header files in order to be parsed correctly, which means I need to honor those #include "foo.hrc" directives. Right now, they are ignored. On top of that, there may also be cases where the #ifdef directives need to be interpreted correctly, but so far ignoring them has not caused any side effects.
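For what it's worth, honoring the include directives could start out as a plain text substitution done before lexing. The following is only a sketch of that idea, not something the converter does today: `expand_includes` and the `read_file` callback are hypothetical, the header content is made up, and no include search path or #ifdef handling is attempted.

```python
import re

INCLUDE_RE = re.compile(r'^[ \t]*#include[ \t]+"([^"]+)"[ \t]*$', re.M)


def expand_includes(text, read_file, seen=None):
    """Replace each #include "name" line with the (recursively expanded) file text.

    read_file is any callable mapping a file name to its content. Each file is
    inserted at most once, which also protects against circular includes.
    """
    seen = set() if seen is None else seen

    def replace(match):
        name = match.group(1)
        if name in seen:
            return ""
        seen.add(name)
        return expand_includes(read_file(name), read_file, seen)

    return INCLUDE_RE.sub(replace, text)


# In-memory stand-in for the file system, with invented header content.
files = {"foo.hrc": "#define BTN_OK 1\n"}
result = expand_includes('#include "foo.hrc"\nOKButton BTN_OK\n', files.__getitem__)
print(result)
```

The expanded text could then go through the lexer as usual, which is where the #define lines would be picked up.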

I’m sure there are other problems I’ll encounter as I parse more src files, but I’d say the end is near. :-)