March 2010 – Roundtrip to Shanghai via Tokyo

STL container performance on data insertion

March 31, 2010August 7, 2021 Kohei Yoshida5 Comments

I just ran a quick analysis on the performance of various STL containers on simple data insertion. The result was not exactly what I expected so I’d like to share it with you.

What was performed was sequential insertions of 50,000,000 (50 million) unique pointer values into various STL containers, either by push_back or insert, depending on which method is supported by the container. I ran the test on openSUSE 11.2, with g++ 4.4.1, with the compiler options of -std=c++0x -Os -g. The -std=c++0x flag is necessary in order to use std::unordered_set.

Anyway, here is the result I observed:

stl-perf

I was fully aware of the set containers being slower than list and vector on insertion, due to the internal structure of set being more elaborate than those of list or vector, and this test confirms my knowledge. However, I was not aware of such wide gap between list and vector. Also, the difference between unreserved and reserved vector was not as wide as I would have expected. (For the sake of completeness, a reserved vector is an instance of vector whose internal array size is pre-allocated in advance in order to avoid re-allocation.) My belief has always been that reserving vector in advance improves performance on data insertion, which it does, but I was expecting a wider gap between the two. So, the result I see here is a bit unexpected.

In case you want to re-run this test on your own environment, here is the code I used to measure the containers’ performance:

#include <vector>
#include <unordered_set>
#include <set>
#include <list>
 
#include <stdio.h>
#include <string>
#include <sys/time.h>
 
using namespace std;
 
namespace {
 
class StackPrinter
{
public:
    explicit StackPrinter(const char* msg) :
        msMsg(msg)
    {
        fprintf(stdout, "%s: --begin\n", msMsg.c_str());
        mfStartTime = getTime();
    }
 
    ~StackPrinter()
    {
        double fEndTime = getTime();
        fprintf(stdout, "%s: --end (duration: %g sec)\n", msMsg.c_str(), (fEndTime-mfStartTime));
    }
 
    void printTime(int line) const
    {
        double fEndTime = getTime();
        fprintf(stdout, "%s: --(%d) (duration: %g sec)\n", msMsg.c_str(), line, (fEndTime-mfStartTime));
    }
 
private:
    double getTime() const
    {
        timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }
 
    ::std::string msMsg;
    double mfStartTime;
};
 
}
 
int main()
{
    size_t store_size = 50000000;
    {
        StackPrinter __stack_printer__("vector non-reserved");
        string* ptr = 0x00000000;
        vector<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("vector reserved");
        string* ptr = 0x00000000;
        vector<void*> store;
        store.reserve(store_size);
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("list");
        string* ptr = 0x00000000;
        list<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.push_back(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("set");
        string* ptr = 0x00000000;
        set<void*> store;   
        for (size_t i = 0; i < store_size; ++i)
            store.insert(ptr++);
    }
 
    {
        StackPrinter __stack_printer__("unordered set");
        string* ptr = 0x00000000;
        unordered_set<void*> store;
        for (size_t i = 0; i < store_size; ++i)
            store.insert(ptr++);
    }
}

#include <vector> #include <unordered_set> #include <set> #include <list> #include <stdio.h> #include <string> #include <sys/time.h> using namespace std; namespace { class StackPrinter { public: explicit StackPrinter(const char* msg) : msMsg(msg) { fprintf(stdout, "%s: --begin\n", msMsg.c_str()); mfStartTime = getTime(); } ~StackPrinter() { double fEndTime = getTime(); fprintf(stdout, "%s: --end (duration: %g sec)\n", msMsg.c_str(), (fEndTime-mfStartTime)); } void printTime(int line) const { double fEndTime = getTime(); fprintf(stdout, "%s: --(%d) (duration: %g sec)\n", msMsg.c_str(), line, (fEndTime-mfStartTime)); } private: double getTime() const { timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec + tv.tv_usec / 1000000.0; } ::std::string msMsg; double mfStartTime; }; } int main() { size_t store_size = 50000000; { StackPrinter __stack_printer__("vector non-reserved"); string* ptr = 0x00000000; vector<void*> store; for (size_t i = 0; i < store_size; ++i) store.push_back(ptr++); } { StackPrinter __stack_printer__("vector reserved"); string* ptr = 0x00000000; vector<void*> store; store.reserve(store_size); for (size_t i = 0; i < store_size; ++i) store.push_back(ptr++); } { StackPrinter __stack_printer__("list"); string* ptr = 0x00000000; list<void*> store; for (size_t i = 0; i < store_size; ++i) store.push_back(ptr++); } { StackPrinter __stack_printer__("set"); string* ptr = 0x00000000; set<void*> store; for (size_t i = 0; i < store_size; ++i) store.insert(ptr++); } { StackPrinter __stack_printer__("unordered set"); string* ptr = 0x00000000; unordered_set<void*> store; for (size_t i = 0; i < store_size; ++i) store.insert(ptr++); } }

The book has arrived!

March 22, 2010August 7, 2021 Kohei Yoshida

This is the 2nd edition of the classic Advanced Programming in the UNIX Environment book. I’m a happy owner of the 1st edition, and ever since the 2nd edition was published I had the urge to go ahead and order a copy. I had been fighting off that urge, but last week I had finally given up fighting and decided to place an order.

The 1st edition truly opened up my eyes on the power of UNIX programming. Like the 1st edition, I am looking forward to discovering what this book offers, and how the UNIX system (most notably Linux) has evolved since the 1st edition was published.

mso-dumper now packaged in OBS

March 20, 2010August 7, 2021 Kohei Yoshida2 Comments

I’m happy to announce that the mso-dumper tool is now packaged in the openSUSE build service under my home repository. This tool is written in Python, and allows you to dump the contents of MS Office documents stored in the BIFF-structured binary file format in a more human readable fashion. It is an indispensable tool when dealing with importing from and/or exporting to the Office documents. Right now, only Excel and Power Point formats are supported.

This package provides two new commands xls-dump and ppt-dump. If you wish to dump the content of an Excel document, all you have to do is

xls-dump ./path/to/mydoc.xls

and it dumps its content to standard output. What the output looks like depends on what’s stored with the document, but it will look something like this:

...
0085h: =============================================================
0085h: BOUNDSHEET - Sheet Information (0085h)
0085h:   size = 14
0085h: -------------------------------------------------------------
0085h: B4 09 00 00 00 00 06 00 53 68 65 65 74 31 
0085h: -------------------------------------------------------------
0085h: BOF position in this stream: 2484
0085h: sheet name: Sheet1
0085h: hidden state: visible
0085h: sheet type: worksheet or dialog sheet

008Ch: =============================================================
008Ch: COUNTRY - Default Country and WIN.INI Country (008Ch)
008Ch:   size = 4
008Ch: -------------------------------------------------------------
008Ch: 01 00 01 00 

00EBh: =============================================================
00EBh: MSODRAWINGGROUP - Microsoft Office Drawing Group (00EBh)
00EBh:   size = 90
00EBh: -------------------------------------------------------------
00EBh: 0F 00 00 F0 52 00 00 00 00 00 06 F0 18 00 00 00 
00EBh: 02 04 00 00 02 00 00 00 02 00 00 00 01 00 00 00 
00EBh: 01 00 00 00 02 00 00 00 33 00 0B F0 12 00 00 00 
00EBh: BF 00 08 00 08 00 81 01 09 00 00 08 C0 01 40 00 
00EBh: 00 08 40 00 1E F1 10 00 00 00 0D 00 00 08 0C 00 
00EBh: 00 08 17 00 00 08 F7 00 00 10 

00FCh: =============================================================
00FCh: SST - Shared String Table (00FCh)
00FCh:   size = 8
00FCh: -------------------------------------------------------------
00FCh: 00 00 00 00 00 00 00 00 
00FCh: -------------------------------------------------------------
00FCh: total number of references: 0
00FCh: total number of unique strings: 0
...

I have originally written this tool to deal with the Excel import and export part of Calc’s development, and continue to develop it further. Thorsten Behrens has later joined forces and added support for the Power Point format. Right now, I’m working on adding an XML output format option to make it easier to compare outputs, which is important for regression testing.