STL container performance on data insertion

I just ran a quick analysis on the performance of various STL containers on simple data insertion. The result was not exactly what I expected so I’d like to share it with you.

The test consisted of sequential insertions of 50,000,000 (50 million) unique pointer values into various STL containers, using either push_back or insert, depending on which method the container supports. I ran the test on openSUSE 11.2 with g++ 4.4.1, using the compiler options -std=c++0x -Os -g. The -std=c++0x flag is necessary in order to use std::unordered_set.

Anyway, here is the result I observed:

stl-perf

I was fully aware that the set containers are slower than list and vector on insertion, since set’s internal structure is more elaborate than that of list or vector, and this test confirms it. However, I was not aware of such a wide gap between list and vector. Also, the difference between the unreserved and reserved vector was not as wide as I would have expected. (For the sake of completeness, a reserved vector is an instance of vector whose internal array is pre-allocated in order to avoid re-allocations.) My belief has always been that reserving a vector in advance improves performance on data insertion, which it does, but I was expecting a wider gap between the two. So, the result I see here is a bit unexpected.

In case you want to re-run this test on your own environment, here is the code I used to measure the containers’ performance:

#include <vector>
#include <unordered_set>
#include <set>
#include <list>
 
#include <stdio.h>
#include <string>
#include <sys/time.h>
 
using namespace std;
 
namespace {
 
class StackPrinter
{
public:
    explicit StackPrinter(const char* msg) :
        msMsg(msg)
    {
        fprintf(stdout, "%s: --begin\n", msMsg.c_str());
        mfStartTime = getTime();
    }
 
    ~StackPrinter()
    {
        double fEndTime = getTime();
        fprintf(stdout, "%s: --end (duration: %g sec)\n", msMsg.c_str(), (fEndTime-mfStartTime));
    }
 
    void printTime(int line) const
    {
        double fEndTime = getTime();
        fprintf(stdout, "%s: --(%d) (duration: %g sec)\n", msMsg.c_str(), line, (fEndTime-mfStartTime));
    }
 
private:
    double getTime() const
    {
        timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }
 
    ::std::string msMsg;
    double mfStartTime;
};
 
}
 
int main()
{
    size_t store_size = 50000000;
    {
        StackPrinter __stack_printer__("vector non-reserved");
        string* ptr = 0x00000000;
        vector<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("vector reserved");
        string* ptr = 0x00000000;
        vector<void*> store;
        store.reserve(store_size);
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("list");
        string* ptr = 0x00000000;
        list<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("set");
        string* ptr = 0x00000000;
        set<void*> store;   
        for (size_t i = 0; i < store_size; ++i)
            store.insert(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("unordered set");
        string* ptr = 0x00000000;
        unordered_set<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.insert(ptr++);
    }
}

The book has arrived!

the-unix-book

This is the 2nd edition of the classic Advanced Programming in the UNIX Environment book. I’m a happy owner of the 1st edition, and ever since the 2nd edition was published I’ve had the urge to go ahead and order a copy. I had been fighting off that urge, but last week I finally gave up and placed an order.

The 1st edition truly opened my eyes to the power of UNIX programming. As with the 1st edition, I am looking forward to discovering what this book offers, and how the UNIX system (most notably Linux) has evolved since the 1st edition was published.

mso-dumper now packaged in OBS

I’m happy to announce that the mso-dumper tool is now packaged in the openSUSE build service under my home repository. This tool is written in Python, and allows you to dump the contents of MS Office documents stored in the BIFF-structured binary file format in a more human-readable fashion. It is an indispensable tool when working on import from and export to these Office documents. Right now, only the Excel and PowerPoint formats are supported.

This package provides two new commands xls-dump and ppt-dump. If you wish to dump the content of an Excel document, all you have to do is

xls-dump ./path/to/mydoc.xls

and it dumps its content to standard output. What the output looks like depends on what’s stored with the document, but it will look something like this:

...
0085h: =============================================================
0085h: BOUNDSHEET - Sheet Information (0085h)
0085h:   size = 14
0085h: -------------------------------------------------------------
0085h: B4 09 00 00 00 00 06 00 53 68 65 65 74 31 
0085h: -------------------------------------------------------------
0085h: BOF position in this stream: 2484
0085h: sheet name: Sheet1
0085h: hidden state: visible
0085h: sheet type: worksheet or dialog sheet

008Ch: =============================================================
008Ch: COUNTRY - Default Country and WIN.INI Country (008Ch)
008Ch:   size = 4
008Ch: -------------------------------------------------------------
008Ch: 01 00 01 00 

00EBh: =============================================================
00EBh: MSODRAWINGGROUP - Microsoft Office Drawing Group (00EBh)
00EBh:   size = 90
00EBh: -------------------------------------------------------------
00EBh: 0F 00 00 F0 52 00 00 00 00 00 06 F0 18 00 00 00 
00EBh: 02 04 00 00 02 00 00 00 02 00 00 00 01 00 00 00 
00EBh: 01 00 00 00 02 00 00 00 33 00 0B F0 12 00 00 00 
00EBh: BF 00 08 00 08 00 81 01 09 00 00 08 C0 01 40 00 
00EBh: 00 08 40 00 1E F1 10 00 00 00 0D 00 00 08 0C 00 
00EBh: 00 08 17 00 00 08 F7 00 00 10 

00FCh: =============================================================
00FCh: SST - Shared String Table (00FCh)
00FCh:   size = 8
00FCh: -------------------------------------------------------------
00FCh: 00 00 00 00 00 00 00 00 
00FCh: -------------------------------------------------------------
00FCh: total number of references: 0
00FCh: total number of unique strings: 0
...

I originally wrote this tool to deal with the Excel import and export part of Calc’s development, and I continue to develop it further. Thorsten Behrens later joined forces and added support for the PowerPoint format. Right now, I’m working on adding an XML output format option to make it easier to compare outputs, which is important for regression testing.

Increasing Calc’s row limit to 1 million

Introduction

With the child work space (CWS) koheirowlimitperf being marked ready for QA, I believe this is a good time to talk about the change that the CWS will bring once it gets integrated.

The role of this CWS is to upstream various pieces of performance optimization from Go-OO, that arose from the increase of the row limit from 65536 (64 thousand rows) to 1048576 (1 million rows). However, the upstream build will not see the increase of the row limit itself yet, as the upstream developers still consider that move premature due to two outstanding issues that are show stoppers for them. I’ll talk more about those issues later.

What this CWS does change is the storage of row attributes to 1) improve performance of querying the attributes, and to 2) make extra information available that can be used to make the algorithm of various bits of operations more efficient. The CWS also makes several other changes in order to improve performance in general, though not related to the change in the row attribute storage.

Limitation of the old attribute storage

Before I talk about how the row attributes are stored in the new storage, I’d like to talk about the limitation of the old attribute storage, and why it was not adequate once the row limit was raised to 1 million rows. Also, in this article I’ll only talk about row attribute storage, but the same argument applies to column attribute storage as well.

The old attribute container was designed to store several different attributes altogether, namely,

  • hidden state,
  • filtered state,
  • automatic page break position,
  • manual page break position, and
  • whether or not a row has a manual height.

They were stored per range, not per individual row or column, so that if a range of rows had an identical set of attribute values over the entire range, that attribute set would be stored as a single record. Searching for the value of an attribute for an arbitrary row was performed linearly from the first record, since the core of the container was simply an array.
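To illustrate the problem, here is a minimal sketch of what such a range-based storage with linear lookup might look like. The type and member names are invented for illustration and are not Calc’s actual code.

#include <vector>
#include <cstddef>

// One record covers a consecutive range of rows sharing the same attribute set.
struct RowAttrEntry
{
    size_t nEndRow;        // last row covered by this record
    bool   bHidden;
    bool   bFiltered;
    bool   bManualHeight;
};

class RowAttrArray
{
public:
    explicit RowAttrArray(size_t nMaxRow)
    {
        // Initially a single record covers the entire sheet.
        RowAttrEntry aEntry;
        aEntry.nEndRow = nMaxRow;
        aEntry.bHidden = false;
        aEntry.bFiltered = false;
        aEntry.bManualHeight = false;
        maEntries.push_back(aEntry);
    }

    // Lookup walks the records from the front -- its cost grows with the
    // number of partitions, which is the scalability problem described above.
    const RowAttrEntry& find(size_t nRow) const
    {
        for (size_t i = 0; i < maEntries.size(); ++i)
            if (nRow <= maEntries[i].nEndRow)
                return maEntries[i];
        return maEntries.back();
    }

private:
    std::vector<RowAttrEntry> maEntries;
};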

There were primarily two problems with this storage scheme that made the container non-scalable. First, the attributes stored together in this container had different distribution patterns, which caused over-partitioning of the container and unnecessarily slowed down the queries of all stored attributes.

For instance, the hidden and filtered attributes are distributed in a very similar manner, but the manual height attribute is not necessarily distributed in a manner similar to these attributes. Because of this, storing it together with the hidden and filtered attributes unnecessarily increased the partition count, which in turn slowed down the query speed of all three attributes.

Even more problematic were the automatic page break attributes; because the automatic page breaks always need to be set for the entire sheet, increasing the row limit significantly raised the partition count. On top of that, the page breaks themselves are actually single-row attributes; it made little sense to store them in a container that was range-based.

This over-partitioning problem led to the second problem; when the container was over-partitioned, querying for an attribute value became very slow due to the linear search algorithm used in the query, whose cost scales with the number of partitions. Because row attributes are used extensively in many areas of Calc’s operations, and oftentimes in loops, the degradation of their lookup performance caused all sorts of interesting performance problems when the row limit was raised to 1 million.

New way of storing row attributes

Separation of row attributes

The first step in speeding up storage and lookup of row attributes is to separate them into their own containers, so that the storage of one attribute does not affect the storage of another. It was natural to use a range-based container to store the hidden, filtered and manual height attributes, since these attributes typically span many consecutive rows. The page break positions, on the other hand, should be stored as point values rather than range values, since they rarely occur in ranges and are always set on individual rows.

I picked STL’s std::set container to store the automatic and manual page break positions (they are stored separately in two set containers). That alone significantly sped up sheet pagination, which previously suffered due to the poor storage performance of the old container. Later on I improved the pagination performance even further by modifying the pagination algorithm itself, but more on that later.
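Here is a minimal sketch of the idea, with illustrative row numbers; SCROW stands in for Calc’s row index type, and the rest of the names are made up for this example.

#include <set>

typedef long SCROW; // stand-in for Calc's row index type

int main()
{
    std::set<SCROW> aManualBreaks;

    aManualBreaks.insert(50);   // manual page break above row 50
    aManualBreaks.insert(120);  // and another one above row 120

    // Point query: is there a break at row 120?  O(log n), no linear scan.
    bool bHasBreak = aManualBreaks.count(120) > 0;

    // First break at or after row 60 -- handy when paginating down the sheet.
    std::set<SCROW>::const_iterator it = aManualBreaks.lower_bound(60);

    return (bHasBreak && it != aManualBreaks.end()) ? 0 : 1;
}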

For the hidden and filtered attributes, I picked a data structure that I call the flat segment tree, which I designed and implemented specifically for this purpose. Row heights are also stored in the flat segment tree.

Flat segment tree

I named this data structure “flat segment tree” because it is a modified version of a data structure known as the segment tree, and unlike the original segment tree, which supports storage of overlapping ranges, my version only stores non-overlapping ranges, hence the name “flat”. The structure of the flat segment tree largely resembles that of the original segment tree; it consists of a balanced binary tree whose leaf nodes store the values while its non-leaf nodes store auxiliary data used only for querying purposes. The leaf nodes are doubly-linked, allowing quick access to their neighboring nodes. Since ranges never overlap with each other in this data structure, one leaf node represents both the end of a preceding range and the start of the range that follows. Last but not least, this data structure is a template, and allows you to specify the types of both key and value.

flat-segment-tree-lookup

There are three advantages of using this data structure: 1) compactness of storage, since only the range boundaries are stored as nodes, 2) reasonably fast lookup thanks to its tree structure, and 3) a single query of a stored value also returns the lower and upper boundary positions of that range with no additional overhead. The last point is very important, and I will explain why in the next section.

As an additional rule, the flat segment tree guarantees that the values of adjacent ranges are always different. There is no exception to this rule, so you can take advantage of this when you use this structure in your code.
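Here is a minimal usage sketch of storing row hidden flags in a flat segment tree. It is modeled on the present-day mdds::flat_segment_tree API; the namespace, header and exact signatures of the version hosted at the URL below may differ.

#include <mdds/flat_segment_tree.hpp>

typedef mdds::flat_segment_tree<long, bool> hidden_rows_type;

int main()
{
    // Valid key range is [0, 1048576); every row is visible by default.
    hidden_rows_type aHidden(0, 1048576, false);

    // Hide rows 10 through 19 (the end key is exclusive).
    aHidden.insert_front(10, 20, true);

    // Query row 15: the value comes back together with the boundaries of the
    // range that contains it, at no extra cost.
    bool bHidden = false;
    long nStart = -1, nEnd = -1;
    aHidden.search(15, bHidden, &nStart, &nEnd);
    // bHidden == true, nStart == 10, nEnd == 20

    return bHidden ? 0 : 1;
}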

Also, please do keep in mind that, while the lookup of a value is reasonably fast, it is not without overhead. So, you are discouraged from performing, say, a lookup for every single row when iterating through a series of rows in a loop. Instead, make judicious use of the range boundary info to skip ahead in such situations.

This data structure is distributed independently of the OOo code base, licensed under MIT/X11. You can find the source code at http://code.google.com/p/multidimalgorithm/. That project includes data structures other than the flat segment tree; however, only the flat segment tree is currently usable by 3rd party programs; the implementations of the other structures are still in an experimental stage, and need to be properly templatized before becoming usable in general. Even the flat segment tree is largely undocumented. This is intentional, since the API is still not entirely frozen and is subject to change in future versions. You have been warned.

Loop count reduction

Aside from the aforementioned improvement associated with the row attribute storage, I also worked on improving various algorithms used throughout Calc’s core, by taking advantage of one feature of the new data structure.
loop-for-each-row
As I mentioned earlier, the flat segment tree returns the lower and upper boundary positions of the range as part of a normal value query. You can make use of this extra piece of information to significantly reduce the number of iterations in an algorithm that loops through a wide row range. Put another way, since you already know the attribute value associated with that range, and you also know the start and end positions of that range, you don’t need to query the value for every single row position within that range, thus reducing the number of iterations in the loop. And the reduction of the loop count means a reduction of the time required to complete that operation, resulting in better performance.
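Here is an illustrative sketch of that loop reduction, counting visible rows over a row range. The container API is again modeled on the present-day mdds flat_segment_tree; the function itself is made up for this example and is not Calc’s actual code.

#include <mdds/flat_segment_tree.hpp>
#include <algorithm>

typedef mdds::flat_segment_tree<long, bool> hidden_rows_type;

long countVisibleRows(const hidden_rows_type& rHidden, long nRow1, long nRow2)
{
    long nCount = 0;
    long nRow = nRow1;
    while (nRow <= nRow2)
    {
        bool bHidden = false;
        long nStart = 0, nEnd = 0;
        rHidden.search(nRow, bHidden, &nStart, &nEnd);

        // All rows in [nStart, nEnd) share the same value, so handle the whole
        // range in one step and jump to the next range instead of advancing
        // one row at a time.
        long nLast = std::min(nEnd - 1, nRow2);
        if (!bHidden)
            nCount += nLast - nRow + 1;

        nRow = nLast + 1;
    }
    return nCount;
}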

That’s the gist of the performance improvement work I did in various parts of Calc, though there were slight variations depending on which part of Calc’s code I worked on. In summary, the following areas received significant performance improvement:

  • Sheet pagination, which consisted of loops in which numerous calls are made to query rows’ hidden states and height values.
  • Print preview, mostly due to the improvement of the pagination performance.
  • Calculation of drawing objects’ vertical positions.

I’m sure there are other areas where the performance still needs improvement. As this is an on-going effort, we will work on resolving any other outstanding issues as we discover them.

Other related work

In addition to the re-work of the row attribute storage and the performance improvement involving the row attribute queries, I’ve also made other changes to improve performance and ensure that Calc’s basic usability is not sacrificed.

Removal of redundant pagination

Prior to my row limit increase work, Calc would re-calculate page break positions again and again, even when nothing that would alter those positions (changing row heights, filtering rows, inserting manual page breaks, and so on) had happened to the document. Because the pagination operation became much more expensive after the row limit increase, I decided to remove this redundancy so that re-pagination is done only when necessary. This change made an especially huge impact on print preview performance, since (for whatever reason) Calc was performing full pagination every time you moved the mouse cursor within the preview pane, even when the movement was only by one pixel! Removal of such redundant re-pagination has brought sanity back to the print preview experience.

Efficient zoom level calculation

The row limit increase also degraded the performance of calculating the correct zoom level to fit the document within a specified page size. Calc does this when you ask for your document to “fit within n pages wide and m pages tall” or “fit to n pages in total”. The root cause was again the degraded pagination performance. This time, however, I could not use the trick of “performing pagination only once”, because we do need to perform full pagination repeatedly at different zoom levels in order to find the correct zoom level.

The solution I employed was to reduce the number of re-paginations by using the bisection method to arrive at the correct zoom level. The old code worked like this:

  1. Initialize the zoom level to 100%, and perform full pagination.
  2. If that doesn’t fit the required page size, decrement the zoom level by 1%, and perform full pagination once again.
  3. If that doesn’t fit, decrement the zoom level by 1%, and try again.
  4. Continue this until the correct zoom level is reached.

Of course, if the correct zoom level is far below the initial value of 100%, this algorithm is not very efficient. If the desired zoom level is 35%, for example, Calc would need to perform full pagination 66 times. Switching to the bisection method reduced the number of full paginations roughly down to the neighborhood of 5 or 6. At the time I worked on this, each full pagination took about 1 second, so the reduction of the pagination count from 66 to 5 roughly translated to reducing the zoom level calculation from 1 minute to 5 seconds. Suffice it to say that this made a big difference.
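For the curious, here is a rough sketch of the bisection idea. fitsInPages is a hypothetical predicate standing in for “perform full pagination at this zoom level and check whether the result fits”; the real Calc code is of course structured differently.

#include <functional>

int findBestZoomLevel(const std::function<bool(int)>& fitsInPages)
{
    int nLow = 1, nHigh = 100; // zoom levels in percent

    // Invariant: nLow is known to fit (assuming 1% always does); nHigh may not.
    while (nLow < nHigh)
    {
        int nMid = (nLow + nHigh + 1) / 2;
        if (fitsInPages(nMid))
            nLow = nMid;        // fits: try a larger zoom level
        else
            nHigh = nMid - 1;   // too big: zoom out further
    }
    return nLow; // largest zoom level that still fits, found in about 7 paginations
}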

Even better news is that the performance of this operation is much faster today, thanks to the improvement I made in the pagination performance in general.

Calculation of autofill marker position

autofill
When making a selection, Calc puts a little square at the lower-right corner of the selection. That’s called the autofill marker, and it’s there to let you drag the selection to fill values.

Interestingly, calculating its position (especially its vertical position) turned out to be a very slow operation when the marker was positioned close to the bottom of the sheet. Worse, Calc calculated its position even when it was outside the visible area. The slowdown was especially apparent when selecting columns, because selecting columns always places the autofill marker at the last row of the sheet, and increasing the row limit made that process sluggish. The solution was to simply detect whether the autofill marker is outside the visible area, and if it is, skip calculating its position (since there is no point calculating a position we don’t need to display). That brought the speed of column selection back to normal.
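The fix boils down to a simple visibility check before doing the expensive work. Below is a tiny sketch of that idea; the types and function names are made up for illustration and are not Calc’s actual API.

// Rows currently visible in the view.
struct RowRange
{
    long nStartRow;
    long nEndRow;
};

static bool isRowVisible(long nRow, const RowRange& rVisible)
{
    return rVisible.nStartRow <= nRow && nRow <= rVisible.nEndRow;
}

void updateAutoFillMarker(long nMarkerRow, const RowRange& rVisible)
{
    if (!isRowVisible(nMarkerRow, rVisible))
        return; // marker is off-screen: skip the expensive position calculation

    // ... calculate and draw the marker position in pixels ...
}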

However, the sluggishness of making a selection can still manifest itself under the right (wrong?) conditions even with this change. We still need to speed up the calculation of the marker’s vertical position by improving the calculation algorithm itself.

Show stoppers for the upstream build

When we met during last year’s OOoCon in Orvieto, I sat down with Niklas Nebel and Eike Rathke, Sun’s Calc co-leads, and briefly discussed the possibility of increasing the row limit in the upstream version of Calc. During our discussion, I was told that, in addition to the general performance issues, most of which I’ve already resolved, we will need to resolve at least two more outstanding issues before they can set the row limit to 1 million in the upstream build.

First, we need to improve the performance of the formula calculation and the value change propagation mechanism (that we call “broadcasting”). The existing implementation is still tuned for the grid size of 65536 rows; we need to re-tune that for 1 million rows and to ensure that the performance will not suffer after the row limit increase.

cell-note-misplaced

Second, we need to resolve the incorrect positioning of drawing objects at higher row positions. This one is somewhat tricky, since the drawing objects are drawn entirely independently of the sheet grid, and the coarse resolution of the drawing layer causes the vertical position of a drawing object to deviate from its intended position. Generally speaking, the higher the row position, the more deviation results. With a maximum of 65536 rows, however, this was not such a big issue, since the amount of deviation was barely noticeable even at the highest row position. But because the problem becomes much more noticeable with 1 million rows, it needs to be addressed somehow.

Going forward…

Going forward, I will continue to hunt for the remaining performance issues, and squash them one by one. The major ones should all be resolved by now, so what remains should be corner case issues, performance-wise. As for the two outstanding issues I mentioned in the previous section, we will have to take a good look at them at some point. Whether or not they are really show stoppers is somewhat a matter of personal viewpoint, but they are real issues that need resolution, no matter their perceived severity.

Also, as of this writing, the manual row size attribute is still stored in the old, array-based container. It will probably make sense to migrate that to the flat segment tree, so that we can eliminate the old container once and for all, and have a fresh start with the new container. Having said that, doing so would require another round of refactoring of non-trivial scale, so it should be conducted with care and proper testing.

The ODS export filter still needs re-work. Currently, all row attributes, which are now stored separately, are temporarily merged back into the old array-based container before exporting the document to ODS. The reason is that the ODS export filter code still expects the partitioning behavior of the old container during the export of row styles. In order to fully embrace the new storage of row attributes, that code needs to be adjusted to work with the new storage scheme. Again, this will require a non-trivial amount of code change, and thus should be conducted with care.

Calculation of the vertical position of various objects, such as the autofill marker, can still use some algorithmic improvement. We can make it more efficient by taking advantage of the flat segment tree, in a way similar to how the pagination algorithm was made more efficient.

Conclusion

This concludes my write-up on the current status of Calc’s row limit increase work. I hope I’ve made it clear that work is underway toward making that happen without degrading Calc’s basic usability. As a matter of fact, the row limit has already been increased to 1 million in some variants of OOo, such as Go-OO. I believe we’ll be able to increase the row limit in the upstream version in the not-so-distant future as long as we keep working at the remaining issues.

That’s all I have to say for now. Thank you very much, ladies and gentlemen.

Inserting current date and time in one step

current-date-time
Here is another simple feature that may come in handy.

With the change I just checked into ooo-build master, you can now insert the current date or time with just one keystroke. By default, Ctrl+; (semicolon) is bound to the current date, while Ctrl+Shift+; is bound to the current time. But these key bindings are configurable in case you don’t like the defaults.

Two more enhancements are in

Today, I’d like to talk about two minor enhancements I just checked in to ooo-build master. They are not exactly earth-shattering, but they are still worth mentioning and may be interesting to some users.

Insert new sheet tab

insert-sheet-shot

Here is the first enhancement. In Calc, you’ll see a new tab at the right end of the sheet tabs, to allow quick insertion of new sheets. Each time you click this tab, a new sheet gets inserted to the right end. The sheet names are automatically assigned.

Previously, inserting a new sheet had to be done by opening the Insert Sheet dialog, selecting the position of the new sheet, specifying how many new sheets to insert, and so on. But if you always append a single sheet at the right end and don’t care to name the new sheet (or prefer to name it after it is inserted), this enhancement will save you a few clicks. Implementing this was actually not that hard, since I was able to re-use existing code for most of its functionality. I personally wanted to give it a little more visual appeal, but that will be a future project.

Anyway, I hope some of you will find this useful.

English function names in non-English locale

The second enhancement is related to cell functions. If you use a localized version of OOo, you probably know that the function names are localized. But there have been quite a few requests to support English function names even when the UI is localized. This is where this enhancement comes in.

First, there is now an additional check box in the Formula options page:
english-func-option
By default, the check box is off, which means the localized function names are used. Checking this check box will swap localized function names with the English ones across the board. You can of course uncheck it to go back to the localized function names.

For example, in the French locale, the function that calculates the sum of a cell range is called SOMME, but when the English function name option is enabled, it becomes SUM, as you can see in the following screenshot:
english-func-displayed

This change takes effect in all of the following areas:

  • formula input and display,
  • function wizard, and
  • formula tips.

As always, please test this thoroughly, and report any bugs. Thanks!

Distributed text justification

What’s new?

Here is something I’ve been working on for the past few weeks. Since I just checked in the first version of this feature into ooo-build master, it’s probably a good time for me to talk about this.

This feature introduces a new justification option for cell text known as “distributed justification”, where the left and right edges of the text are aligned with the left and right edges of the bounding box by adjusting the spacing between characters (inter-character spacing), rather than the spacing between words (inter-word spacing), across the entire width of the bounding box. This type of distributed text justification makes little sense for Latin-based languages such as English, French and German, but makes a big difference for Asian languages such as Japanese. The reason normal justification doesn’t work for Asian languages is that those languages don’t put spaces between individual words, while normal justification relies on the presence of spaces at word boundaries. This is where distributed justification comes into play.

This distributed justification method is commonly known as 均等割り付け in Japanese, and is said to be one of the blockers when attempting to migrate users away from Excel to Calc.

Horizontal justification

First and foremost, I’d like to cover the horizontal justification. The following screenshot shows the difference between the three horizontal alignment modes:

calc-text-hor-align

As you can see, in the normal left-aligned text, the right edges of the lines are not aligned. When the text is justified, the right edges of the lines are now aligned by adjusting the inter-character spacing, except for the last line, which remains left-aligned. When the text is distributed, even the right edge of the last line becomes aligned with the right edge of the bounding box by equally distributing the characters on that line.

To allow this new justification type, I added a new justification type Distributed to the existing Cell Formatting dialog.

calc-text-align-dlg

For the vertical alignment setting, I’ve added two new options Justified and Distributed, to support justification in the vertical direction.

Justifying Asian text mixed with Latin text

While working on this feature, I decided to also tweak the normal justification algorithm to make it work slightly better for Asian text mixed with Latin text such as English. As I mentioned earlier, distributed justification is not really ideal for Latin text. But with society becoming more and more global, we are seeing more and more Asian text intermixed with Latin text, and vice versa. And correctly justifying text with mixed script types requires using a different justification method for each script type. After a bit of trial and error, I think I got it right. You can see the result in the following screenshot:

mixed-script-justification

The English portion of the text is justified by inter-word justification, whereas the Japanese portion is justified by inter-character justification. The spaces between the English and Japanese text portions are also slightly adjusted in this scheme.

Vertical justification

Now, let’s move on to the vertical justification. When you justify a text in the vertical direction, that is, in the direction perpendicular to the direction of text flow, the spacing between the lines gets adjusted so that the top and bottom lines get aligned with their respective edges of the bounding box, like so:
vertically-justified
The top cell shows text with default justification, while the bottom cell shows text with vertical justification.

The Cell Format dialog itself provides both Justified and Distributed options for the vertical justification setting, but they do exactly the same thing for horizontally-flowing text. For vertically-flowing text, on the other hand, they do different things, but more on that in the next section.

Justifying vertically flowing text

Now, you can also justify text even when the text is flowing vertically. There are three ways you can make the text flow vertically. You can either

  1. rotate 90 degrees to the right (bottom-to-top),
  2. rotate 90 degrees to the left (top-to-bottom), or
  3. switch to Asian layout mode, which flows text in the top-to-bottom, right-to-left direction.

In these modes, the Justified and Distributed vertical justification options do have different effects. The following screenshot demonstrates different vertical alignment settings in three different vertical flow modes.

vertially-flowing-paragraph

As an added bonus…

The code responsible for the text layout, where I made my modification to support this feature, is actually shared between Calc, Draw and Impress. Calc uses it to render complex cell text, while Draw and Impress use it for their text box objects. This means that any improvement I make in this area will automatically become available to all three applications. All that needs to be done is to adjust the UI in each app and add hooks in their respective import/export filters. Whether or not I’ll work on that during this cycle is another question. Having said that, I’d like to eventually get that done, and I’d like to do it sooner rather than later. But we’ll see how that goes.

But even without making the extra code change in the Draw/Impress code, my change so far was enough to fix this bug which I didn’t even know existed. :-)

Lastly…

As of this writing, I’m not entirely done with this feature yet. I still have to cover some corner cases, and I still need to fix some bugs which I unfortunately discovered while taking screenshots for this post. So, stay tuned for further fine-tuning!

Git on Windows

I guess I don’t really have to tell the world about this, since if you type the title of this blog post in Google it will come back as the top hit. But it’s still worth mentioning msysgit, a pretty darn good git client on Windows. It’s small, it’s efficient, and it’s git. :-) You could of course use git in cygwin, but git in cygwin feels a little “heavy” and by no means small, since you have to get the whole cygwin environment to even use git. So, if you don’t already have cygwin, and want to use git on Windows, msysgit is a pretty good choice. It comes with a minimal bash shell, and while I’m happy to see ssh included with its shell, I was a little disappointed that they left out rsync. But that’s just one minor downside.

For me, msysgit is my git client of choice on Windows, especially in a virtual machine setting where the disk space is tight. On a build machine, though, I still use git in cygwin since I already have to use cygwin to build OOo.

Allergic reaction to Bananas?

Today I went to see my dentist to get my routine teeth cleaning done. In their office, I was asked to fill out a medical history form since my current form was 4 years old. On this form, you are asked to answer questions such as “have you ever had a heart attack?”, “are you taking any medications?”, that sort of stuff. Nothing unusual, right? However, one question caught my eye, and I couldn’t believe what I was asked to answer.

Do you have any allergic reaction to Bananas?

Yes, the word Bananas was capitalized for some reason. I asked my dentist right away for clarification (while trying to hold back my laughter), but she was not exactly sure what the question was supposed to mean. She even said she couldn’t believe that question was on the form! ;-)

But the story didn’t end there. Later, she asked another dentist for her opinion. Although they had a pretty lengthy discussion going back and forth, she too was not able to come up with a reasonable explanation for the significance of the question.

Does anyone out there with enough medical knowledge know why they need to ask that question, and how it is relevant to dentistry?

P.S. A quick google search has come up with this explanation.

Setting break point where an exception is thrown

Caolan told me today that when debugging with gdb, you can actually set a break point right before an exception is thrown.

You can do

gdb ./soffice.bin
(gdb) catch throw
(gdb) run

and gdb breaks at every location where an exception is raised. Or, you can set a normal break point, run catch throw and cont, and gdb will break at the next exception throw event. This technique helps when an exception gets caught somewhere at a higher level in the call stack and you are trying to find out where exactly it is thrown. Such a task, without this technique, would be very time-consuming, tedious, boring, and at times frustrating, especially when you’ve spent hours and still don’t have the location of the thrown exception.

Similarly, you can also break where an exception is caught, with the catch catch command, or you can catch a whole set of other events with this construct.

The only drawback of this catch event construct is that it breaks at every single exception raised or caught, which, inside OOo’s codebase, can happen quite frequently in some places. Nonetheless, this is a very useful technique to add to your debugging arsenal.