This week I finally finished implementing a true shared formula framework in Calc core, which allows Calc to share token array instances between adjacent formula cells if they contain an identical set of formula tokens. Since one of the major benefits of sharing formula token arrays is a reduced memory footprint, I decided to measure the trend in Calc’s memory usage from 4.0 all the way up to the latest master, to see how much impact this shared formula work has made on Calc’s overall memory footprint.
Here is the test document I used to measure Calc’s memory usage.
This ODF spreadsheet document contains 100000 rows of cells in 4 columns, of which 399999 are formula cells. Column A contains a series of integers that grow linearly down the column. Here, only the first cell (A1) is a numeric cell while the rest are all formula cells that reference the cell immediately above them. Cells in Column B all reference the cell immediately to their left in Column A, cells in Column C all reference the cell immediately to their left in Column B, and so on. References used in this document are all relative references; no absolute references are used.
I’ve tested a total of 4 builds. One is the 4.0.1 build packaged for openSUSE 11.4 (x64) from the openSUSE repository, one is the 4.0.6 build built from the 4.0 branch, one is the 4.1.1 build built from the 4.1 branch, and the last one is the latest from the master branch. With the exception of the packaged 4.0.1 build, all builds are built locally on my machine running openSUSE 11.4 (x64). Also on the master build, I’ve tested memory usage both with and without shared formulas.
In each tested build, the memory usage was measured by directly opening the test document from the command line and recording the virtual memory usage in GNOME system monitor. After the document was loaded, I allowed for the virtual memory reading to stabilize by waiting several seconds before recording the number. The results are presented graphically in the following chart.
The following table shows the actual numbers recorded.
4.0.1 (packaged by openSUSE)
master (no shared formula)
master (shared formula)
Additionally, I’ve also measured the number of token array instances between the two master builds (one with shared formula and one without), and the build without shared formula created 399999 token array instances (exactly 4 x 100000 – 1) upon file load, whereas the build with shared formula created only 4 token array instances. This likely accounts for the difference of 78.3 MiB in virtual memory usage between the two builds.
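To make the arithmetic behind these counts concrete, here is a minimal Python sketch. It models each formula only by the relative offset it refers to (an assumption for illustration; Calc’s real token arrays hold more than that), starts a new token array whenever a cell’s tokens differ from those of the cell above it in the same column, and divides the 78.3 MiB difference by the number of instances saved. The per-instance figure is a back-of-the-envelope estimate derived from virtual memory readings, not a measurement of the actual token array size.

```python
ROWS = 100000
COLS = 4

def relative_reference(row, col):
    """Return the (row offset, column offset) a cell's formula refers to,
    or None for the one plain numeric cell (A1)."""
    if col == 0:
        return None if row == 0 else (-1, 0)  # column A: the cell above
    return (0, -1)                            # columns B-D: the cell to the left

formula_cells = 0
shared_groups = 0
for col in range(COLS):
    previous = None
    for row in range(ROWS):
        ref = relative_reference(row, col)
        if ref is None:
            previous = None
            continue
        formula_cells += 1
        if ref != previous:
            shared_groups += 1  # tokens differ from the cell above: new array
        previous = ref

# Token array instances that sharing avoids creating.
saved = formula_cells - shared_groups
# Rough estimate only: 78.3 MiB is a difference in virtual memory readings.
bytes_per_instance = 78.3 * 1024 * 1024 / saved

print(formula_cells, shared_groups, round(bytes_per_instance))
```

Under these assumptions the sketch arrives at the same 399999 and 4 reported above, and suggests a saving on the order of a couple of hundred bytes per formula cell.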
Effect of cell storage rework
One thing worth noting here is that, even without shared formulas, the numbers clearly show a steady decline of Calc’s memory usage from 4.0 to 4.1, and to the current master. While we can’t clearly infer from these numbers alone what caused the memory usage to shrink, I can say with reasonable confidence that the cell storage rework we did during the same period is a significant factor in such memory footprint shrinkage. I won’t go into the details of the cell storage rework here; I’ll reserve that topic for another blog post.
Oh by the way, I have absolutely no idea why the 4.0.1 build packaged from the openSUSE repository shows such high memory usage. To me this looks more like an anomaly, indicative of earlier memory leaks that we later fixed, a different custom allocator that only the distro-packaged version uses and that favors large up-front memory allocation, or anything else I haven’t thought of. Either way, I’m not counting this as something that resulted from any of the improvements we made in Calc core.
Normally I don’t travel to Japan just to visit OSC, mainly because of the distance; being located on the East Coast of the United States, it’s a big hassle to fly to Japan, not to mention the cost. Despite this, I wanted to visit this particular OSC primarily for two reasons.
The LibreOffice Japanese team had organized a separate track just for LibreOffice related talks, and I wanted to come and see face-to-face the people who are involved in our project in Japan in various capacities, and learn the latest on what’s going on in the Japanese community.
There was one difficulty, however. Because I only had one week to arrange the travel (I got the email only a week before the scheduled ceremony date) I could not guarantee my arrival until the very last minute. Luckily everything went smoothly and I was able to book my flight and reserve my hotel despite the short notice.
This is actually my second time coming to this event. My first visit was in 2010. I was planning my trip to Tokyo to attend a different, work-related meeting. Then I learned about OSC Tokyo 2010, which was scheduled only one day after the meeting was scheduled to end, so I decided to extend my stay in Tokyo for just one more day to visit OSC. OSC 2010 was also held at Meisei University, so at least I didn’t have to research how to get to the conference venue this time.
Once on campus, there were signs all around the place that would take you to the building where the conference was held. Outside the venue, the campus was pretty quiet, and I didn’t see very many students.
No conference is complete without booths. Various projects set up booths to greet the visitors, to distribute fliers and CDs/DVDs, and to inform them of what’s new in the projects. Volunteers from the LibreOffice Japanese team manned our booth throughout the conference. We distributed version 4.0 feature fliers, installer CDs, T-shirts, stickers and flags.
Also present was the openSUSE project booth. Fuminobu Takeyama was single-handedly manning the booth when I dropped by on Friday. He is a volunteer in the openSUSE project who also manages several packages for Japanese locales. We briefly talked about some issues with Japanese input method in LibreOffice, and how some folks work around it by forcing the GTK VCL backend even if LibreOffice is launched in the KDE environment (because the input method code in the GTK VCL backend is more reliable than that in the KDE VCL). He said he is very much hoping to someday find time to look into LibreOffice code, to solve various Japanese-related issues that are still outstanding in the latest release.
OSS Contributor’s Award
The ceremony for the OSS Contributor’s Awards was held on Friday evening. The OSS Contributor’s Awards are given to
“those who have created or managed an influential development project and to developers who have played an important role in a global project or those who have contributed to the promotion of related activities.” (quoted from this slide)
The candidates are nominated publicly, and the winners are selected by the Awards Committee. They select four winners and nine incentive award winners each year, and I was fortunate enough to have been selected as one of the four award winners this year.
The ceremony was held in a separate, moderately-sized lecture room right next to the booth areas, and was very well attended. Out of four winners, two of us were present to receive the awards: Tetsuo Handa and myself. We each gave a brief 10-minute talk afterward, outlining our current activities and our future plans.
Handa-san is a well-known Linux kernel hacker and he is leading the development of a kernel security module known as TOMOYO Linux. We briefly chatted after the ceremony, and he hinted that he may get a chance to hack on LibreOffice in the distant future (and I encouraged him!). So, let’s keep his name in the back of our minds, and hope we can see him in our project someday. ;-)
You can find two press articles on this here and here. The official announcement from the OSS Forum is here.
I spent the second day of the conference mostly in the LibreOffice mini-Conference track. According to Naruhiko-san, this is our first ever track dedicated to LibreOffice (and hopefully won’t be the last) held in Japan. We were able to rent a pretty large lecture room for the whole day to host this mini-Conference. Despite the large size, the room was moderately attended.
The first talk was by Miyoshi Ohmori, and his talk was about the company-wide migration from OpenOffice.org to LibreOffice at NTT Comware. In his talk, he shared the challenges he faced during the migration and ways to solve them.
Next up was a talk by Shinji Enoki covering new features in LibreOffice 4.0. He covered all aspects of new features in 4.0, from Firefox Personas support, to Calc’s import filter performance improvement, and everything in-between. His talk was followed by Naruhiko Ogasawara, who shared his experience with his trip to the 2nd LibreOffice Conference in Berlin, how he decided to join the LibreOffice community, and how he decided to submit a paper for the conference and eventually travel there. During his talk, Ogasawara-san played the video message from Italo that was created specifically for the Japanese audience.
If you thought Enoki-san and Ogasawara-san looked familiar, it was because they came to the Berlin conference to co-present a talk on the topic of the non-English locale communities. The slide for their talk during the Berlin conference is found here. Enoki-san later traveled to Prague with me and the rest of SUSE’ers, to meet with Petr Mladek to learn more about the current QA activities. (Petr couldn’t make it to Berlin due to illness). Anyway, back to the mini-Conf…
The last talk before the lunch break was by Masaki Tamakoshi. In his talk, he presented a good extension to use to add AutoCAD-like functionality to Draw to make Draw easier and more familiar to use for former (or current) AutoCAD users. He also talked about how to convert AutoCAD’s proprietary dwg files to make them loadable into Draw, and how to create playable animation files from Impress slides, using external tools.
After the lunch break, Jun Meguro kickstarted the afternoon session with his talk on how to make effective use of Draw to create professional posters. His organization – City of Aizuwakamatsu – is in fact one of the first organizations in Japan that made a large scale adoption of OpenOffice.org when such a move was still not very common, and instantly became the poster child of OpenOffice.org adoption. They had later moved on to LibreOffice, and Meguro-san continues to contribute to the LibreOffice project as a member of the Japanese language team.
In his talk, he emphasized the usefulness of Draw, an application that may not have received the attention and praise it deserves, and how Draw can be used to create professional posters and fliers without purchasing expensive and proprietary alternatives. He also hinted during his talk that, these days, they can send ODF documents to other local government offices without first converting them to MS Office or PDF formats. This was first revealed when he accidentally sent off a native Draw document (odg) without converting it to PDF, and later received a phone call from the recipient of the document to discuss the details of the drawing! Although this is an isolated incident, an anecdote like this may suggest that the actual rate of ODF adoption may well be higher than we may have expected.
In the next talk, Masahisa Kamataki talked about how to make use of FLOSS office suites such as LibreOffice, combined with non-FLOSS but free-as-in-beer cloud services such as SkyDrive and Google Drive, to reduce operation costs. He mentioned that all of this was made possible thanks to the international standard ODF, which many major cloud services also support these days. He also demonstrated the level of ODF compatibility between these cloud services.
Next up was Ikuya Awashiro. He talked about the specifics of the LibreOffice Japanese localization effort. As someone who coordinates the Japanese translation of LibreOffice UI strings, he knows the ins and outs of LibreOffice translation, which he covered extensively in his talk. He also talked about the detailed history of the translation in this code base, dating back to the old OpenOffice.org days, and how he learned what not to do in order to successfully coordinate the current community-based translation effort in our project.
I should also mention that, of all the presenters we had during this track (including myself), he was the only presenter who used the Impress Remote feature!
Makoto Takizawa concluded the afternoon session with his ODF PlugFest talk which also happened to be the very last talk in the whole LibreOffice track.
He started off his talk with the basics of ODF, including its standardization history, and went on to talk about various ODF-supporting applications and how each of these apps fares on interoperability tests. During his talk he noted that, although in theory the use of ODF ensures seamless interoperability between different supporting applications, in reality there are still some nasty corner cases where different ODF producers interpret ODF differently.
Toward the end of his talk, he performed a live ODF spreadsheet scenario test using Calligra, Gnumeric, SkyDrive and LibreOffice, to test in real life the level of ODF conformance in these spreadsheet applications. In this particular scenario, Calligra, Gnumeric and SkyDrive actually scored higher than LibreOffice. He concluded his talk by pointing out the importance of the ODF user community assessing the conformance level of each ODF-supporting application, and actively giving feedback to the developer community to improve ODF interoperability between the supporting applications.
Lastly, while I was not officially on the list of speakers in this track, I managed to squeeze in my talk during the lunch break, to briefly talk about various random development topics. Please refer to my earlier post to get a hold of the slides for my talk. Unfortunately I had to cut it short to give people enough time to eat lunch, but it sort of worked out since I didn’t have much time to prepare my talk to begin with! ;-)
All in all, I believe this was quite a successful LibreOffice track. We were able to see each other face-to-face, which is not very easy to do given how widespread we are geographically. That is true even for those inside Japan, and more so for me. It was unfortunate that Takeshi Abe couldn’t make it for this event. Perhaps we should plan another conference during OSC Okinawa so that we get to see him again.
This was actually my very first time participating in OSC Japan as a speaker and mingling with so many people from various sectors of the Japanese market. I spoke to quite a lot of people in various capacities during the conference, and I was pleasantly surprised with the level of interest that they have toward LibreOffice. Various local governments are aggressively considering a switch to LibreOffice, with Aizuwakamatsu City and JA Fukuoka leading the way. Though the uptake of LibreOffice among Japanese corporations is still slow, Sumitomo Electric has recently announced their adoption of LibreOffice, so others who are still hesitating to switch may eventually follow suit. I also chatted with someone from a local school district working very hard to realize a district-wide adoption of LibreOffice, which suggests that people in the education sector also see value in adopting LibreOffice.
On the other side of the fence, however, we have yet to attract a healthy dose of developers toward LibreOffice from the Japanese developer community. It is my impression that Japan has a sizable Linux kernel developer community, and in fact, many of the participants at OSC Tokyo were kernel hackers. So, whatever reason they may have for not participating in LibreOffice development, it’s not because of a lack of talent and expertise; they are there, contributing to other projects. At the same time, I also saw lots of interest in hacking on LibreOffice from various people. So, the interest is there; what they need is just a means and a justification to work on LibreOffice.
While chatting with Ogawa-san from Ashisuto, who provides paid support for LibreOffice, it became apparent to me that we are not very far from seeing companies emerge that are very eager to find developers to work on LibreOffice. It is therefore my hope that, by increasing the level of LibreOffice adoption amongst users, the level of interest in participating in the development of LibreOffice among support vendors will increase proportionally as a result. And my own impression from participating in OSC Tokyo fills me with optimism in this regard.
This is saved as a hybrid PDF; you can view it in your regular PDF viewer (such as Evince and Adobe PDF viewer), or you can open it in Impress to edit it as a normal Impress document. Use this one in case you need it as a pure Impress document.
And the second one is for the talk I did during the LibreOffice mini-Conference on Saturday.
Like the 1st one, this one is also a hybrid PDF. The regular odp version is available here.
I will write more about OSC Tokyo and especially about the LibreOffice mini-Conference in a separate blog post. Stay tuned.
Last week was SUSE Hack Week, where we SUSE engineers were encouraged to be creative and work on whatever project that we had been dying to work on.
Given this opportunity, I decided to try integrating my orcus library project into LibreOffice proper to see how much improvement we could make in the performance of loading spreadsheet documents.
I’ll leave the detailed description and goals of the orcus project for another blog post, but in short, orcus is an independent library designed to process spreadsheet documents, and is also designed to be usable from an application that would like to use it to load documents. It’s currently still a work in progress, and is not even of alpha quality. So, I intentionally don’t release orcus library packages on an official basis.
The main difficulty with integrating orcus into LibreOffice proper was dealing with the very intricate loading process that LibreOffice uses for all existing filters. It first goes through an elaborate type detection process, which loads the content of the file into memory in order for the type detection code to parse it. Once the correct type is determined, LibreOffice then instantiates the correct frame loader and starts the actual loading process. I’ve explained all of this in detail in this blog post of mine.
Orcus, on the other hand, only needs a file path, and it does the rest. It pushes data to the callback functions provided by the client code as it parses the file. It was this difference in the overall loading process that made the integration of orcus into LibreOffice all the more challenging. And even though the hack week itself lasted only one week, I had spent months prior to it just studying the type detection code and other auxiliary code that makes up the whole file loading process, in order to come up with an elegant way to add a hook for orcus.
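The push-style loading described above can be sketched roughly as follows. This is not orcus’s actual interface: `SheetBuilder`, `set_cell` and `load_csv` are hypothetical names made up for illustration, and Python’s standard csv module stands in for the parser. The point is only the division of labor: the loader is given a file path and drives everything, while the client code does nothing but receive callbacks.

```python
import csv
import os
import tempfile

class SheetBuilder:
    """Hypothetical client-side handler; it only receives callbacks."""
    def __init__(self):
        self.cells = {}

    def set_cell(self, row, col, text):
        self.cells[(row, col)] = text

def load_csv(path, handler):
    """Parse the file at `path` and push every cell to `handler`."""
    with open(path, newline="") as stream:
        for row, fields in enumerate(csv.reader(stream)):
            for col, text in enumerate(fields):
                handler.set_cell(row, col, text)

# Usage: write a tiny csv file, then let the loader drive the handler.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n4,5,6\n")

builder = SheetBuilder()
load_csv(tmp.name, builder)
os.remove(tmp.name)

print(builder.cells[(1, 2)])  # prints 6
```

Note that the client never opens or inspects the file itself, which is exactly what makes this model hard to fit into a loading process that wants to detect the file type first and hand over an already-opened stream.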
Long story short, I was able to come up with a way to hook orcus such that LibreOffice relinquishes all its file loading to the orcus library, and only handles callbacks. To make this work, I first packaged orcus into an installable rpm package using the openSUSE build service, locally installed that package, then added a --with-system-orcus configure option to allow LibreOffice to find the library. The entire change needed to add the hook is condensed into this commit.
Using CSV filter as an experiment
As an initial experiment, I replaced the current csv import filter with one from orcus, just to see how this overall process works. The results are very encouraging.
With a very large csv file I created via this python script:
#!/usr/bin/env python

import sys

for i in xrange(0, 65536):
    for j in xrange(1, 101):
        val = i * 1.0 / j
the current filter takes roughly 27 seconds to load this file, which is not too bad given the sheer size of the file (~50 MB). The orcus filter, on the other hand, takes only 11 seconds to load the same file.
However, the orcus filter code path still skips a number of steps that need to be performed if it were to be used in the production build, such as
drawing progress bar in the status bar area,
calculating row heights for rows that include multi-line cell contents, and
probably something else I forget to mention here.
Given that some of these can be quite expensive, the above numbers may not be fully comparable. Despite that, these initial numbers show great promise for the performance improvement that may result from using the orcus library.
First of all, we will not switch to the orcus csv filter anytime soon. Although I’d like to see that happen at some point in the future, there are still lots of missing pieces in the orcus csv filter that prevent us from using it in the production build. My plan with orcus is therefore limited to the addition of new filters, and my immediate plan is to develop new XML import and export filters using orcus, and integrate them into LibreOffice. This should also provide a stepping stone for any additional filters that may come up later, as well as for replacing some of the existing filters as the need arises.
As my previous post just mentioned, mdds 0.6.0 is finally released which contains two new data structures: multi_type_vector and multi_type_matrix. I’d like to explain a little more about multi_type_vector in this post because, of all the data structures I’ve added to mdds over the course of its project life, I firmly believe this structure deserves some explanation.
What motivated multi_type_vector
The initial idea for this structure came from a discussion I had with Michael Meeks over two years ago in Nuremberg, Germany. Back then, he was dumping his idea on me about how to optimize cell storage in LibreOffice Calc, and his idea was that, instead of storing cell values wrapped around cell objects allocated on the heap and storing them in a column array, we store raw cell values directly in an array without the cell object wrappers. This way, if you have a column filled with numbers from top down, those values are guaranteed to be placed in a contiguous region in memory space which is more likely to be in the same memory page unless their size exceeds the memory page size. By contrast, if you store cell values wrapped inside cell objects that are allocated on the heap, those values are most likely scattered all around the memory space and probably located in many different memory pages.
Now, one of the most common operations that typical spreadsheet users do is to operate on numbers in cells. It could be summing up their totals, calculating their average, determining their minimum and maximum values and so on and so forth. To make these operations happen, the program first needs to fetch all the cell values before it can work on them.
Assume that these values are stored inside cell objects which are located in hundreds of memory pages. The mere action of fetching the cell values alone requires touching all of these memory pages, which causes the CPU to fetch them from the main memory in order to access them. Worse, if some of those pages are not resident in physical memory but have been swapped out to disk, accessing them causes a page fault, which further degrades performance since that particular memory page must be swapped in from disk. In contrast, if the values are all located in a single memory page (or in just several pages instead of hundreds), the CPU needs to fetch only once or a handful of times, depending on the size of the data being fetched.
Moreover, most CPUs these days come equipped with caches that keep recently fetched memory close to the processor in order to speed up subsequent access to it. Because of this, keeping all your data in the same page reduces the chance of the CPU having to fetch it from the main memory (or, in the worst case, from disk), which is slower than fetching it from the caches.
Let’s visualize this idea for a moment. The current cell storage looks like this:
As you can see, cells are scattered in different pages. To access them all, you need to load all of these pages that contain the requested cell objects.
Compare that with the following illustration:
where all requested cell values are stored in a single array that’s located in a single page. I hope it’s obvious by now which one actually fetches data faster.
Calc currently employs the former storage model, and our hope is to make Calc’s storage model more efficient both space- and time-wise by switching to the latter model.
Applying this to the design
One difficulty with applying this concept to column storage is that a column in a typical spreadsheet application allows you to store values of different types. Cells containing a bunch of test scores may have in the same column a title cell at the top that stores the text “Score”. Likewise, those test scores may be followed by an empty cell followed by a bunch of formula cells containing formula expressions summing, averaging, or counting the test scores. Since one array can only hold values of identical type, this requires us to use a separate array for each segment of identical cell type.
With that, the column storage structure becomes somewhat like this:
An empty cell segment doesn’t store any value array, but it does store its size which is necessary to calculate the logical position of the next non-empty element.
This is the basic design of the multi_type_vector structure. It stores values of each identical type in a single, secondary value array while the primary column array stores the memory locations of all secondary value arrays. It’s important to point out that, while I used the spreadsheet use case as an example to explain the basic idea of the structure, the structure itself can be used in other, much broader use cases, and is not specific to spreadsheet applications.
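As a rough illustration of this layout, here is a toy Python model. It is not the mdds implementation or its API: `build_blocks` and the `Formula` tuple are made-up names, and Python lists stand in for the contiguous value arrays. It groups a column into runs of identical type, keeps the raw values of each run in one array, and records nothing but a size for empty runs.

```python
from collections import namedtuple

# Stand-in for a formula cell; purely illustrative.
Formula = namedtuple("Formula", "expression")

def build_blocks(column):
    """Group a column of cell values into [type name, payload] blocks.

    Non-empty blocks carry a list of raw values; empty blocks carry only
    their size, which is enough to work out where the next block starts.
    """
    blocks = []
    for value in column:
        kind = "empty" if value is None else type(value).__name__
        if blocks and blocks[-1][0] == kind:
            if kind == "empty":
                blocks[-1][1] += 1            # just grow the recorded size
            else:
                blocks[-1][1].append(value)   # extend the value array
        else:
            blocks.append([kind, 1 if kind == "empty" else [value]])
    return blocks

column = ["Score", 82.0, 95.5, 71.0, None,
          Formula("SUM(A2:A4)"), Formula("AVERAGE(A2:A4)")]

for kind, payload in build_blocks(column):
    print(kind, payload)
```

For the column above this yields four blocks: one string block for the title, one block of three numbers, one empty block of size 1, and one block of two formulas.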
In the next section, I will talk about the challenges I have faced while implementing this structure. But first, one terminology note: from now on I will use the term “element block” (or simply “block”) to refer to what was referred to as “secondary value array” up to this point. I use this name in my implementation code too, so using this name makes it easier for me to explain things.
The basic design of multi_type_vector is not that complicated and was not very challenging to understand and implement. What was more challenging was to handle cases where a value, or a series of values, are inserted over a block or blocks of different types. There are a variety of ways to insert new values into this container, and sometimes the new values overlap the existing blocks of different types, or overlap a part of an existing block of the same type and a part of a block of a different type, and so on and so forth. Because the basic design of the container requires that the type of every element block differs from its neighbors’, some data insertions may cause the container to need to re-organize its element block structure. This posed quite a challenge since multi_type_vector supports the following methods of modifications:
set a single value to overwrite an existing one if any (set() method, 2-parameter variant),
set a sequence of values to overwrite existing values if any (set() method, 3-parameter variant),
insert a sequence of values and shift those existing values that occur below the insertion position (insert() method),
set a segment of existing values empty (set_empty() method), and
insert a sequence of empty values and shift those existing values that occur below the insertion position (insert_empty() method),
and each of these scenarios requires a different strategy for element block re-organization. Non-overwriting data insertion scenarios (insert() and insert_empty()) were somewhat easier to handle than the overwriting data insertion scenarios (set() and set_empty()), as the latter required more branching and significantly more code to cover all cases.
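To give a feel for the kind of re-organization involved, here is a toy Python sketch of the simplest case only: overwriting a single value. It is not how mdds implements set(); `set_value` and `kind_of` are hypothetical names, every block (including an empty one) holds a plain Python list for simplicity, and the final merge pass scans all blocks where a real implementation would only need to look at the immediate neighbors.

```python
def kind_of(value):
    """Name of the block type a value belongs to."""
    return "empty" if value is None else type(value).__name__

def set_value(blocks, pos, value):
    """Overwrite the element at logical position `pos`, then restore the
    rule that no two neighboring blocks share a type."""
    if pos < 0:
        raise IndexError(pos)

    # Locate the block that holds the position.
    start = 0
    for index, (kind, values) in enumerate(blocks):
        if pos < start + len(values):
            break
        start += len(values)
    else:
        raise IndexError(pos)

    offset = pos - start
    if kind == kind_of(value):
        values[offset] = value            # same type: overwrite in place
        return

    # Different type: split the block into up to three pieces.
    pieces = [(kind, values[:offset]),
              (kind_of(value), [value]),
              (kind, values[offset + 1:])]
    blocks[index:index + 1] = [[k, v] for k, v in pieces if v]

    # Merge neighbors that now have the same type.
    merged = []
    for k, v in blocks:
        if merged and merged[-1][0] == k:
            merged[-1][1].extend(v)
        else:
            merged.append([k, v])
    blocks[:] = merged

blocks = [["float", [1.0, 2.0, 3.0]], ["str", ["a"]]]
set_value(blocks, 1, "x")   # splits the float block: 2 blocks become 4
print(len(blocks), blocks)
set_value(blocks, 2, "y")   # the last float goes away and 3 str blocks merge
print(len(blocks), blocks)
```

Even in this stripped-down form a single overwrite can split one block into three or fuse three blocks into one, which hints at why the multi-value set() and set_empty() cases need so much more branching.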
This challenge was further exacerbated by the additional requirement to support a “managed” element block that stores pointers to objects whose life cycle is managed by the block. I decided to add this one for convenience reasons, to allow transitioning the current cell storage model into the new storage model in several phases rather than doing it in one big-bang change. During the transition phase, we will likely convert the number and string cells into raw value element blocks, while keeping more complex cell structures such as formula cells still wrapped in their current form. This means that, during the transition, we will have element blocks storing pointers to heap-allocated formula cell objects scattered across memory space. Eventually these formula objects need to be stored in a contiguous memory space, but that will have to wait until after the transition phase.
Supported data types
Template containers are supposed to work with any custom types, and multi_type_vector is no exception. But unlike most standard template containers which normally have one primary data type (and perhaps another one for associative containers), multi_type_vector allows storage of unspecified numbers of data types.
By default, multi_type_vector supports the following data types: bool, short, unsigned short, int, unsigned int, long, unsigned long, double, and std::string. If these data types are all you need when using multi_type_vector, then you won’t have to do anything extra, and just instantiate the template instance by
mtv_type data(10); // set initial size to 10.
// insert values.
But if you need to store other types of data, you’ll need to do a little more work. Let’s say you have this class type:
and you want to store instances of this class in multi_type_vector. I’ll skip the actual definition of this class, but let’s assume that the basic stuff such as default and copy constructors, equality operator etc are all implemented and working properly.
First, you need to define a unique numeric ID for your custom type. Each element type must be associated with a numeric ID. The IDs for standard data types are defined as follows:
The value of element_type_user_start defines the starting number of all custom type IDs. IDs for the standard types all come before this value. If you only want to define one custom type ID, then just using that value will be sufficient. If you need another ID, just add 1 to it and use it for that type. As long as each ID is unique, it doesn’t really matter what their actual values are.
Next, you need to choose the block type. There are 3 block types to choose from:
The last 2 are relevant only when you need a managing pointer element block to store heap objects. Right now, let’s just use the default element block for your custom type.
Note that these callback functions are called from within multi_type_vector via unqualified calls, so it’s essential that they are in the same namespace as the custom data type in order to satisfy C++’s argument-dependent lookup rule.
So far so good. The last step that you need to do is to define a structure of element block functions. This is also boilerplate, and for a single custom type case, you can define something like this:
This is quite a bit of code, I know. I should definitely work on making it a bit simpler to use with a lot less typing in future versions of mdds. Anyway, with this in place, we can finally define the multi_type_vector type:
With all these bits in place, you can finally start using this container:
data.set(0, foo); // Insert a custom data element.
data.set(1, 12.3); // You can still use the standard data types.
That’s all I will talk about custom data types for now. I hope this gives you a glimpse of how this container works in general.
Since this is the very first incarnation of multi_type_vector, I have no doubt this still has a lot of issues to be worked out. One immediate issue that comes to mind is the performance of element position lookup. Given a logical position of the element, the container first has to locate the right element block that stores the specified element, but this lookup always happens from the first element block. So, if you are doing a continuous lookup of millions of elements in a loop, the overall lookup speed can be quite slow since each lookup starts from the first block. Speeding up this operation is certainly a task to be worked on in the near future. Meanwhile, the user of this container can resort to using the iterators to iterate through the element blocks and their member elements.
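The cost of that lookup, and one possible way around it, can be sketched in Python as follows. This is only an illustration of the general idea, not what mdds does or plans to do: the function names and block sizes are made up, and the binary-search variant assumes the table of block start positions is kept up to date whenever blocks are split, merged or resized.

```python
from bisect import bisect_right
from itertools import accumulate

block_sizes = [1, 3, 2, 4]   # sizes of four element blocks (made-up data)

def find_block_linear(sizes, pos):
    """Walk from the first block every time: O(number of blocks)."""
    start = 0
    for index, size in enumerate(sizes):
        if pos < start + size:
            return index, pos - start
        start += size
    raise IndexError(pos)

# Precompute the logical start position of every block once.
block_starts = [0] + list(accumulate(block_sizes))[:-1]   # [0, 1, 4, 6]

def find_block_bisect(starts, sizes, pos):
    """Binary-search the start positions: O(log number of blocks)."""
    if not 0 <= pos < starts[-1] + sizes[-1]:
        raise IndexError(pos)
    index = bisect_right(starts, pos) - 1
    return index, pos - starts[index]

# Both return (block index, offset within that block).
print(find_block_linear(block_sizes, 5))
print(find_block_bisect(block_starts, block_sizes, 5))
```

Both calls locate logical position 5 at offset 1 of the third block. The trade-off is that the start-position table has to be maintained on every structural change, whereas the linear walk needs no extra bookkeeping.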
Another issue is the verbosity of the element block function structure required for custom element blocks. This can be worked out by providing templatized structures per number of custom data types. This one is probably easier to solve, and I should look into that soon.
I just had an opportunity to spend some time reading and analyzing what actually takes place when you do a mundane thing like opening a file. If you are a user, you wouldn’t think much when opening a new document. You select the file, click Open, and you expect that file to be open. If you are a coder, however, and especially if you are a coder who has spent some time either looking through or trying to debug this code, I bet that this is one of the most horrifying places to work in even in this code base. It certainly is for me.
Anyway, since I’m a diagram-oriented person, I’ve decided to sketch a very rough diagram of what happens when you open a file, from the moment we receive a dispatch request with the URL of the document, to the point where we pass that call to the appropriate filter code. Here is the result.
Now, this is a cleaned-up version. The actual code contains lots more branch points and quite a few “temporary” hacks (here the term “temporary” is used very loosely), which undoubtedly will confuse you even more. But I believe this diagram illustrates a very rough overview of how we determine the format type of the document, how the “right” (“right” 95% of the time) filter gets picked, and where to look in case something doesn’t work as expected…. Hopefully.
I have great news to share with you. Calc’s ODS import filter in 3.5 should be substantially faster when you have documents with a large number of named ranges. Read on if you want to know more details.
Laurent Godard, Markus Mohrhard, and I have been working pretty hard in the past month to bring the performance of the ODS import filter to a reasonable level, especially with documents containing a large number of named ranges.
Here is the background. Laurent uses LibreOffice as a platform for his professional extension, which makes heavy use of named ranges. It programmatically generates ODS documents and inserts hundreds or thousands of named ranges as intermediary storage to further process the data. The problem was, however, that our import performance with that kind of document was so suboptimal that this process was taking a prohibitively long time. In order for his extension to perform optimally, our ODS import filter needed to be optimized, and optimized heavily.
During the Paris conference, we got our heads together in order to come up with a strategy to make that happen. Laurent was more than willing to participate in this effort, and in the end, he did a substantial amount of work profiling, analyzing code, coming up with an optimization strategy and putting it all together. Markus and I provided mentorship, code pointers, as well as occasional coding to accelerate this effort.
Our hope was to make it all happen in time for our first 3.5 release. And I’m very happy to say that we made it.
Since we are talking about performance, it won’t be complete without the actual numbers. So here goes.
Test document 1
Here is the first test document global500.ods. It contains 500 sheets, 12,500 global named ranges, and 12,500 formulas that reference them.
On my development machine, the last stable release 3.4.4 takes 14 seconds to open this document. While 14 seconds may not seem that slow, keep in mind that this machine is somewhat unfairly fast, tailored for abusive developer use, so the real world performance is likely much less impressive (you can probably multiply that number by 3 to get a rough idea of the real world performance). Anyhow, using the latest master branch on the same machine, this document opens in roughly 2 and a half seconds. That’s roughly an 82% reduction in import time.
Test document 2
Here is the second, somewhat larger document global1000.ods. This document contains 1000 sheets, 25,000 named ranges and 25,000 formulas that reference them.
According to my benchmark performed under the same conditions as the first document, 3.4.4 opens this document in 50 seconds, whereas 3.5.0 opens it in under 5 seconds. That’s about a 90% reduction in import time. Pretty impressive!
Real power of open source
This story shows another aspect of this remarkable achievement worth mentioning. If you use an open source product such as LibreOffice in your business, and if it doesn’t perform the way you need it to, you can actually join the project as a developer and coordinate the effort with the upstream developers to make it happen. And depending on the nature of the change you want to see happen, it can happen very quickly as this story demonstrates.
I wanted to emphasize this because, while more and more businesses and institutions are embracing open source software, many of them tend to focus too much on the cost-saving aspect of it, thereby developing the wrong mindset that that’s what open source is all about. It isn’t. The real power of using open source software in your deployment is that it gives you the ability to join and contribute to the project to influence the direction of its development. That gives you real flexibility in planning, and is in my opinion the best way to harness the power of using open source software. The monetary cost-saving side of the benefit comes as a side effect but should be thought of only as an added bonus, not the primary reason for deploying open source software.
I’m happy to announce that I’ve managed to squeeze this new feature in just in time for the 3.5 code freeze.
As I’ve mentioned briefly in G+, I’ve been working on brushing up the age-old autofilter popup window in the past few weeks. I have no idea how old the old one is, but it’s been there for as long as I remember. In case anyone needs a reminder as to what the old one looks like, here it is.
It’s functional, yet very basic. While this has served us for many years since the last century, it was also clear that the world has since moved on, and people have started craving modern looks and eye candy even in office productivity applications. Clearly, it was time for a change.
In contrast to the old, here is how the new one looks:
I don’t know about you, but I really like the new one better. :-)
Aside from updating the aged look of the old popup, I was also motivated to introduce the new popup for its ability to allow selection of multiple values from the selection list.
You may think that this new popup looks somewhat familiar. That’s because the same popup is also used as the pivot table (formerly data pilot) field member selection popup. I’ve touched on this previously on my blog, and you’ll probably notice the similarity when comparing the screenshot of the new popup with the screenshot of the pivot table popup included in that post.
Internally these two use the same code. In fact, when I developed that feature for the pivot table, I intentionally designed it to be re-usable, precisely so that I could use it for the autofilter popup at a later time.
So, the hard part of implementing the new popup had already been finished. All I had to do was to put the autofilter functionality into the popup and launch it instead of the ugly old one, which is precisely what I did to bring the new popup into reality. I also had to refactor the code that performs the filtering to allow multi-value matching, which was, while invisible to the users, not a trivial task.
The work is not totally done yet. As of this writing, the xlsx filter has not been fully adapted to take advantage of the new multi-selection capability, but that’s my next task, and I expect that to be done in time for 3.5.
Also, the menu still looks very basic, and contains only the same set of options that the old popup had. This was done deliberately in order for us to ship it in time for 3.5, by avoiding the rather expensive process of re-designing the menu part of the popup. But I expect we will work on the re-design post-3.5, to make it even better and more usable. Note that the new popup is fully capable of doing sub menus, which gives us all sorts of possibilities.
Anyhow, that’s all I have to say about this at the moment. I hope you guys will enjoy the new and shiny autofilter popup! :-)
Notes for testing
As with any new feature, this one needs lots of testing. I’ve written new unit tests to cover some parts of it, but unit tests can’t cover all corners of use cases (especially those involving UI interactions), and manual testing from real users is always appreciated. Some of the affected areas I can think of are:
Built-in functions MATCH, LOOKUP, HLOOKUP and VLOOKUP that use the core filtering code which I’ve heavily refactored.
Import and export of the existing filtering rules, with ods, xls, and xlsx.
Filtering with pivot tables, which shares parts of the filtering code that has been refactored.
Standard and advanced filter dialogs
So, watch out for the next daily build that includes this feature!
So, it was a real pleasure to be a part of the very first LibreOffice conference held in Paris, France. Some of the faces and names were familiar from the old OOo conferences, but the atmosphere of the conference was very different from the OOo ones in the past. I have been to the 2007 Barcelona conference and the 2009 Orvieto one, and I have to say, while there were some rough edges, this is by far my favorite OOo/LibO conference to date.
The only regret I have is that, because I had another international trip (to South Korea) only a week prior to the conference, I felt pretty much exhausted most of the time I was there. But I think I managed to chat with most of the people I needed to chat with during this once-a-year event. I intentionally tried not to hack too much during this conference, mainly because of my travel fatigue, but also because I felt it was more important to see people and talk to them to have a good feel for each other. Working from home, I sometimes miss the human interaction that people who work in the office probably take for granted, so this conference was a perfect place to fulfill that need, to make me feel human again. ;-) (Actually I tried to code a bit during the conference, but apparently my brain wasn’t cooperating at all, so I decided it probably wasn’t a good idea).
Anyway, it was good to see and chat with Markus Mohrhard (moggi), a very active Calc hacker who’s been instrumental in Calc’s filter test development in recent days. We discussed various topics in Calc development since we work together in that code.
Also, Laurent Godard, whom I’ve known for many years from the OOo days but had never met face-to-face.
And Valek Filippov, who happens to be in the same timezone as I. There aren’t many of us left in this LibreOffice circle, unfortunately. I tried to persuade him into this wonderful world of hacking, but so far he’s successfully fended off my attack.
It was also nice to chat with Michael Meeks at length, to clarify the new Calc cell storage structure that he and I discussed previously. Now the concept is very much clear, waiting to be coded.
And of course, to the countless other hackers I had beer with during the conference week: it was a real pleasure.
Now, I’ve got some homework to do based on my interaction with various people during the conference. I will list them item by item to use as a reminder.
Two Calc bugs from Valek. Both are related to this 1C program that pretty much everyone in Russia uses. I’ve already added them to my 3.5 TODO list, so it’s just a matter of finding time to tackle them unless something tricky comes out.
Some documentation on how to use the ixion library. Since there was some interest in using ixion to support formula calculations in other applications, I should probably start working on producing documentation on ixion, both on how to build it, and how to use it. I should also create a package for it while I’m at it.
Support for a temporary cell buffer in the orcus library, to allow converting cell values before passing them to the client code. In some cases we can’t simply push the cell value as-is but must convert it first before passing it to the client code. Typical examples are double quotes as a literal quote in CSV, as well as encoded characters (e.g. &amp;) in XML/HTML. This will unfortunately cost us a bit for the allocation of the buffer and copying of the char array, but fortunately we don’t need to do this for all cells.
And lots and lots more.
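To illustrate the temporary cell buffer item in the list above, here is a minimal sketch of the CSV case, assuming the parser has already isolated the text between the enclosing double quotes. This is not orcus code; cell_value and unescape_quoted_cell are hypothetical names, and the buffer is a plain std::string owned by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of reading one quoted CSV cell: either a view into the original
// stream (no copy), or a converted copy held in the temporary buffer.
struct cell_value
{
    const char* p;  // start of the cell text
    std::size_t n;  // length of the cell text
    bool buffered;  // true when p points into the temporary buffer
};

// p and n describe the content between the enclosing double quotes.
// The buffer is only written to when the cell contains an escaped quote,
// i.e. two consecutive double quotes standing for one literal quote.
inline cell_value unescape_quoted_cell(const char* p, std::size_t n, std::string& buffer)
{
    assert(p != nullptr || n == 0);

    bool has_escape = false;
    for (std::size_t i = 0; i + 1 < n; ++i)
    {
        if (p[i] == '"' && p[i + 1] == '"')
        {
            has_escape = true;
            break;
        }
    }

    if (!has_escape)
        return cell_value{p, n, false}; // common case: pass the value through as-is

    buffer.clear();
    for (std::size_t i = 0; i < n; ++i)
    {
        buffer.push_back(p[i]);
        if (p[i] == '"' && i + 1 < n && p[i + 1] == '"')
            ++i; // skip the second quote of the escaped pair
    }
    return cell_value{buffer.data(), buffer.size(), true};
}
```

Only cells that actually contain an escaped quote pay for the allocation and the copy; everything else is handed to the client code as a pointer into the original stream.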
All in all, I was glad to be a part of this successful conference. The atmosphere was very much all inclusive and personal, exactly how an open source conference should be.
In case someone wants to get a hold of the slides for my talk during the LibreOffice conference, they are available here (also in PDF).
I will write something up about the conference in more detail at a later time. For now, I’ll take some time off to recover from the several trips I made in the past few weeks, across 3 different timezones that are 17 hours apart in total.