Use a user-defined custom document class¶
In this section we will demonstrate how you can use orcus to populate your own custom document model by implementing your own set of interface classes and passing it to the orcus import filter. The first example code shown below is the absolute minimum that you need to implement in order for the orcus filter to function properly:
#include <orcus/spreadsheet/import_interface.hpp>
#include <orcus/orcus_ods.hpp>
#include <iostream>
using namespace std;
using namespace orcus::spreadsheet;
using orcus::orcus_ods;
class my_empty_import_factory : public iface::import_factory
{
public:
virtual iface::import_sheet* append_sheet(
sheet_t sheet_index, const char* sheet_name, size_t sheet_name_length) override
{
cout << "append_sheet: sheet index: " << sheet_index
<< "; sheet name: " << string(sheet_name, sheet_name_length)
<< endl;
return nullptr;
}
virtual iface::import_sheet* get_sheet(
const char* sheet_name, size_t sheet_name_length) override
{
cout << "get_sheet: sheet name: "
<< string(sheet_name, sheet_name_length) << endl;
return nullptr;
}
virtual iface::import_sheet* get_sheet(sheet_t sheet_index) override
{
cout << "get_sheet: sheet index: " << sheet_index << endl;
return nullptr;
}
virtual void finalize() override {}
};
int main()
{
my_empty_import_factory factory;
orcus_ods loader(&factory);
loader.read_file("/path/to/multi-sheets.ods");
return EXIT_SUCCESS;
}
Just like the example we used in the previous section, we are also loading a
document saved in the Open Document Spreadsheet format via
orcus_ods
. The document being loaded is named
multi-sheets.ods, and contains three sheets which are are named ‘1st
Sheet’, ‘2nd Sheet’, and ‘3rd Sheet’ in this exact order. When you
compile and execute the above code, you should get the following output:
append_sheet: sheet index: 0; sheet name: 1st Sheet
append_sheet: sheet index: 1; sheet name: 2nd Sheet
append_sheet: sheet index: 2; sheet name: 3rd Sheet
One primary role the import factory plays is to provide the orcus import
filter with the ability to create and insert a new sheet to the document. As
illustrated in the above code, it also provides access to existing sheets by
its name or its position. Every import factory implementation must be a
derived class of the orcus::spreadsheet::iface::import_factory
interface base class. At a minimum, it must implement
- the
append_sheet()
method which inserts a new sheet and return access to it, - two variants of the
get_sheet()
method which returns access to an existing sheet, and - the
finalize()
method which gets called exactly once at the very end of the import, to give the implementation a chance to perform post-import tasks.
in order for the code to be buildable. Now, since all of the sheet accessor methods return null pointers in this code, the import filter has no way of populating the sheet data. To actually receive the sheet data from the import filter, you must have these methods return valid pointers to sheet accessors. The next example shows how that can be done.
Implement sheet interface¶
In this section we will expand on the code in the previous section to implement the sheet accessor interface, in order to receive cell values in each individual sheet. In this example, we will define a structure to hold a cell value, and store them in a 2-dimensional array for each sheet. First, let’s define the cell value structure:
enum class cell_value_type { empty, numeric, string };
struct cell_value
{
cell_value_type type;
union
{
size_t index;
double f;
};
cell_value() : type(cell_value_type::empty) {}
};
As we will be handling only three cell types i.e. empty, numeric, or string
cell type, this structure will work just fine. Next, we’ll define a sheet
class called my_sheet
that stores the cell values in a 2-dimensional
array, and implements all required interfaces as a child class of
import_sheet
.
At a minimum, the sheet accessor class must implement the following virtual
methods to satisfy the interface requirements of
import_sheet
.
set_auto()
- This is a setter method for a cell whose type is undetermined. The implementor must determine the value type of this cell, from the raw string value of the cell. This method is used when loading a CSV document, for instance.set_string()
- This is a setter method for a cell that stores a string value. All cell string values are expectd to be pooled for the entire document, and this method only receives a string index into a centrally-managed string table. The document model is expected to implement a central string table that can translate an index into its actual string value.set_value()
- This is a setter method for a cell that stores a numeric value.set_bool()
- This is a setter method for a cell that stores a boolean value. Note that not all format types use this method, as some formats store boolean values as numeric values.set_date_time()
- This is a setter method for a cell that stores a date time value. As with boolean value type, some format types may not use this method as they store date time values as numeric values, typically as days since epoch.set_format()
- This is a setter method for applying cell formats. Just like the string values, cell format properties are expected to be stored in a document-wide cell format properties table, and this method only receives an index into the table.get_sheet_size()
- This method is expected to return the dimension of the sheet which the loader may need in some operations.
For now, we’ll only implement
set_string()
,
set_value()
, and
get_sheet_size()
, and
leave the rest empty.
Here is the actual code for class my_sheet
:
class my_sheet : public iface::import_sheet
{
cell_value m_cells[100][1000];
range_size_t m_sheet_size;
sheet_t m_sheet_index;
public:
my_sheet(sheet_t sheet_index) :
m_sheet_index(sheet_index)
{
m_sheet_size.rows = 1000;
m_sheet_size.columns = 100;
}
virtual void set_string(row_t row, col_t col, size_t sindex) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col << "): string index = " << sindex << endl;
m_cells[col][row].type = cell_value_type::string;
m_cells[col][row].index = sindex;
}
virtual void set_value(row_t row, col_t col, double value) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col << "): value = " << value << endl;
m_cells[col][row].type = cell_value_type::numeric;
m_cells[col][row].f = value;
}
virtual range_size_t get_sheet_size() const override
{
return m_sheet_size;
}
// We don't implement these methods for now.
virtual void set_auto(row_t row, col_t col, const char* p, size_t n) override {}
virtual void set_bool(row_t row, col_t col, bool value) override {}
virtual void set_date_time(row_t row, col_t col, int year, int month, int day, int hour, int minute, double second) override {}
virtual void set_format(row_t row, col_t col, size_t xf_index) override {}
virtual void set_format(row_t row_start, col_t col_start, row_t row_end, col_t col_end, size_t xf_index) override {}
};
Note that this class receives its sheet index value from the caller upon instantiation. A sheet index is a 0-based value and represents its position within the sheet collection.
Finally, we will modify the my_import_factory
class to store and manage a
collection of my_sheet
instances and to return the pointer value to a
correct sheet accessor instance as needed.
class my_import_factory : public iface::import_factory
{
std::vector<std::unique_ptr<my_sheet>> m_sheets;
public:
virtual iface::import_sheet* append_sheet(
sheet_t sheet_index, const char* sheet_name, size_t sheet_name_length) override
{
m_sheets.push_back(std::make_unique<my_sheet>(m_sheets.size()));
return m_sheets.back().get();
}
virtual iface::import_sheet* get_sheet(
const char* sheet_name, size_t sheet_name_length) override
{
// TODO : implement this.
return nullptr;
}
virtual iface::import_sheet* get_sheet(sheet_t sheet_index) override
{
sheet_t sheet_count = m_sheets.size();
return sheet_index < sheet_count ? m_sheets[sheet_index].get() : nullptr;
}
virtual void finalize() override {}
};
Let’s put it all together and run this code:
#include <orcus/spreadsheet/import_interface.hpp>
#include <orcus/orcus_ods.hpp>
#include <iostream>
#include <memory>
using namespace std;
using namespace orcus::spreadsheet;
using orcus::orcus_ods;
enum class cell_value_type { empty, numeric, string };
struct cell_value
{
cell_value_type type;
union
{
size_t index;
double f;
};
cell_value() : type(cell_value_type::empty) {}
};
class my_sheet : public iface::import_sheet
{
cell_value m_cells[100][1000];
range_size_t m_sheet_size;
sheet_t m_sheet_index;
public:
my_sheet(sheet_t sheet_index) :
m_sheet_index(sheet_index)
{
m_sheet_size.rows = 1000;
m_sheet_size.columns = 100;
}
virtual void set_string(row_t row, col_t col, size_t sindex) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col << "): string index = " << sindex << endl;
m_cells[col][row].type = cell_value_type::string;
m_cells[col][row].index = sindex;
}
virtual void set_value(row_t row, col_t col, double value) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col << "): value = " << value << endl;
m_cells[col][row].type = cell_value_type::numeric;
m_cells[col][row].f = value;
}
virtual range_size_t get_sheet_size() const override
{
return m_sheet_size;
}
// We don't implement these methods for now.
virtual void set_auto(row_t row, col_t col, const char* p, size_t n) override {}
virtual void set_bool(row_t row, col_t col, bool value) override {}
virtual void set_date_time(row_t row, col_t col, int year, int month, int day, int hour, int minute, double second) override {}
virtual void set_format(row_t row, col_t col, size_t xf_index) override {}
virtual void set_format(row_t row_start, col_t col_start, row_t row_end, col_t col_end, size_t xf_index) override {}
};
class my_import_factory : public iface::import_factory
{
std::vector<std::unique_ptr<my_sheet>> m_sheets;
public:
virtual iface::import_sheet* append_sheet(
sheet_t sheet_index, const char* sheet_name, size_t sheet_name_length) override
{
m_sheets.push_back(std::make_unique<my_sheet>(m_sheets.size()));
return m_sheets.back().get();
}
virtual iface::import_sheet* get_sheet(
const char* sheet_name, size_t sheet_name_length) override
{
// TODO : implement this.
return nullptr;
}
virtual iface::import_sheet* get_sheet(sheet_t sheet_index) override
{
sheet_t sheet_count = m_sheets.size();
return sheet_index < sheet_count ? m_sheets[sheet_index].get() : nullptr;
}
virtual void finalize() override {}
};
int main()
{
my_import_factory factory;
orcus_ods loader(&factory);
loader.read_file("/path/to/multi-sheets.ods");
return EXIT_SUCCESS;
}
We’ll be loading the same document we loaded in the previous example, but this time we will receive its cell values. Let’s go through each sheet one at a time.
Data on the first sheet looks like this:
It consists of 4 columns, with each column having a header row followed by exactly ten rows of data. The first and forth columns contain numeric data, while the second and third columns contain string data.
When you run the above code to load this sheet, you’ll get the following output:
(sheet: 0; row: 0; col: 0): string index = 0
(sheet: 0; row: 0; col: 1): string index = 0
(sheet: 0; row: 0; col: 2): string index = 0
(sheet: 0; row: 0; col: 3): string index = 0
(sheet: 0; row: 1; col: 0): value = 1
(sheet: 0; row: 1; col: 1): string index = 0
(sheet: 0; row: 1; col: 2): string index = 0
(sheet: 0; row: 1; col: 3): value = 35
(sheet: 0; row: 2; col: 0): value = 2
(sheet: 0; row: 2; col: 1): string index = 0
(sheet: 0; row: 2; col: 2): string index = 0
(sheet: 0; row: 2; col: 3): value = 56
(sheet: 0; row: 3; col: 0): value = 3
(sheet: 0; row: 3; col: 1): string index = 0
(sheet: 0; row: 3; col: 2): string index = 0
(sheet: 0; row: 3; col: 3): value = 6
(sheet: 0; row: 4; col: 0): value = 4
(sheet: 0; row: 4; col: 1): string index = 0
(sheet: 0; row: 4; col: 2): string index = 0
(sheet: 0; row: 4; col: 3): value = 65
(sheet: 0; row: 5; col: 0): value = 5
(sheet: 0; row: 5; col: 1): string index = 0
(sheet: 0; row: 5; col: 2): string index = 0
(sheet: 0; row: 5; col: 3): value = 88
(sheet: 0; row: 6; col: 0): value = 6
(sheet: 0; row: 6; col: 1): string index = 0
(sheet: 0; row: 6; col: 2): string index = 0
(sheet: 0; row: 6; col: 3): value = 90
(sheet: 0; row: 7; col: 0): value = 7
(sheet: 0; row: 7; col: 1): string index = 0
(sheet: 0; row: 7; col: 2): string index = 0
(sheet: 0; row: 7; col: 3): value = 80
(sheet: 0; row: 8; col: 0): value = 8
(sheet: 0; row: 8; col: 1): string index = 0
(sheet: 0; row: 8; col: 2): string index = 0
(sheet: 0; row: 8; col: 3): value = 66
(sheet: 0; row: 9; col: 0): value = 9
(sheet: 0; row: 9; col: 1): string index = 0
(sheet: 0; row: 9; col: 2): string index = 0
(sheet: 0; row: 9; col: 3): value = 14
(sheet: 0; row: 10; col: 0): value = 10
(sheet: 0; row: 10; col: 1): string index = 0
(sheet: 0; row: 10; col: 2): string index = 0
(sheet: 0; row: 10; col: 3): value = 23
There is a couple of things worth pointing out. First, the cell data
flows left to right first then top to bottom second. Second, for this
particular sheet and for this particular format, implementing just the
two setter methods, namely
set_string()
and
set_value()
are
enough to receive all cell values. However, we are getting a string
index value of 0 for all string cells. This is because orcus expects
the backend document model to implement the shared strings interface
which is responsible for providing correct string indices to the import
filter, and we have not yet implemented one. Let’s fix that.
Implement formula interface¶
In this section we will extend the code from the previous section in order to
receive and process formula cell values from the sheet. We will need to make
quite a few changes. Let’s go over this one thing at a time. First, we are
adding a new cell value type formula
:
enum class cell_value_type { empty, numeric, string, formula }; // adding a formula type here
which should not come as a surprise.
We are not making any change to the cell_value
struct itself, but we are
re-using its index
member for a formula cell value such that, if the cell
stores a formula, the index will refer to its actual formula data which will
be stored in a separate data store, much like how strings are stored
externally and referenced by their indices in the cell_value
instances.
We are also adding a branch-new class called cell_grid
, to add an extra
layer over the raw cell value array:
class cell_grid
{
cell_value m_cells[100][1000];
public:
cell_value& operator()(row_t row, col_t col)
{
return m_cells[col][row];
}
};
Each sheet instance will own one instance of cell_grid
, and the formula
interface class instance will hold a reference to it and use it to insert
formula cell values into it. The same sheet instance will also hold a formula
value store, and pass its reference to the formula interface class.
The formula interface class must implement the following methods:
set_position()
set_formula()
set_shared_formula_index()
set_result_string()
set_result_value()
set_result_empty()
set_result_bool()
commit()
Depending on the type of a formula cell, and depending on the format of the
document, some methods may not be called. The
set_position()
method
always gets called regardless of the formula cell type, to specify the
position of the formula cell. The
set_formula()
gets
called for a formula cell that does not share its formula expression with any
other formula cells, or a formula cell that shares its formula expression with
a group of other formuls cells and is the primary cell of that group. If it’s
the primary cell of a grouped formula cells, the
set_shared_formula_index()
method also gets called to receive the identifier value of that group. All
formula cells belonging to the same group receives the same identifier value
via
set_shared_formula_index()
,
but only the primary cell of a group receives the formula expression string
via set_formula()
. The
rest of the methods -
set_result_string()
,
set_result_value()
,
set_result_empty()
and
set_result_bool()
- are
called to deliver the cached formula cell value when applicable.
The commit()
method gets
called at the very end to let the implementation commit the formula cell data
to the backend document store.
Without further ado, here is the formula interface implementation that we will use:
class my_formula : public iface::import_formula
{
sheet_t m_sheet_index;
cell_grid& m_cells;
std::vector<formula>& m_formula_store;
row_t m_row;
col_t m_col;
formula m_formula;
public:
my_formula(sheet_t sheet, cell_grid& cells, std::vector<formula>& formulas) :
m_sheet_index(sheet),
m_cells(cells),
m_formula_store(formulas),
m_row(0),
m_col(0) {}
virtual void set_position(row_t row, col_t col) override
{
m_row = row;
m_col = col;
}
virtual void set_formula(formula_grammar_t grammar, const char* p, size_t n) override
{
m_formula.expression = std::string(p, n);
m_formula.grammar = grammar;
}
virtual void set_shared_formula_index(size_t index) override {}
virtual void set_result_string(size_t sindex) override {}
virtual void set_result_value(double value) override {}
virtual void set_result_empty() override {}
virtual void set_result_bool(bool value) override {}
virtual void commit() override
{
cout << "(sheet: " << m_sheet_index << "; row: " << m_row << "; col: " << m_col << "): formula = "
<< m_formula.expression << " (" << m_formula.grammar << ")" << endl;
size_t index = m_formula_store.size();
m_cells(m_row, m_col).type = cell_value_type::formula;
m_cells(m_row, m_col).index = index;
m_formula_store.push_back(std::move(m_formula));
}
};
Note that since we are loading a OpenDocument Spereadsheet file (.ods) which
does not support shared formulas, we do not need to handle the
set_shared_formula_index()
method. Likewise, we are leaving the set_result_*
methods unhandled for
now.
This interface class also stores references to cell_grid
and
std::vector<formula>
instances, both of which are passed from the parent
sheet instance.
We also need to make a few changes to the sheet interface class to provide a formula interface and add a formula value store:
class my_sheet : public iface::import_sheet
{
cell_grid m_cells;
std::vector<formula> m_formula_store;
my_formula m_formula_iface;
range_size_t m_sheet_size;
sheet_t m_sheet_index;
const ss_type& m_string_pool;
public:
my_sheet(sheet_t sheet_index, const ss_type& string_pool) :
m_formula_iface(sheet_index, m_cells, m_formula_store),
m_sheet_index(sheet_index),
m_string_pool(string_pool)
{
m_sheet_size.rows = 1000;
m_sheet_size.columns = 100;
}
virtual void set_string(row_t row, col_t col, size_t sindex) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col
<< "): string index = " << sindex << " (" << m_string_pool[sindex] << ")" << endl;
m_cells(row, col).type = cell_value_type::string;
m_cells(row, col).index = sindex;
}
virtual void set_value(row_t row, col_t col, double value) override
{
cout << "(sheet: " << m_sheet_index << "; row: " << row << "; col: " << col
<< "): value = " << value << endl;
m_cells(row, col).type = cell_value_type::numeric;
m_cells(row, col).f = value;
}
virtual range_size_t get_sheet_size() const override
{
return m_sheet_size;
}
// We don't implement these methods for now.
virtual void set_auto(row_t row, col_t col, const char* p, size_t n) override {}
virtual void set_bool(row_t row, col_t col, bool value) override {}
virtual void set_date_time(row_t row, col_t col, int year, int month, int day, int hour, int minute, double second) override {}
virtual void set_format(row_t row, col_t col, size_t xf_index) override {}
virtual void set_format(
row_t row_start, col_t col_start, row_t row_end, col_t col_end, size_t xf_index) override {}
virtual iface::import_formula* get_formula() override
{
return &m_formula_iface;
}
};
We’ve added the
get_formula()
method which
returns a pointer to the my_formula
class instance defined above. The
rest of the code is unchanged.
Now let’s see what happens when loading the same sheet from the previous section:
(sheet: 1; row: 0; col: 0): string index = 44 (X)
(sheet: 1; row: 0; col: 1): string index = 45 (Y)
(sheet: 1; row: 0; col: 2): string index = 46 (X + Y)
(sheet: 1; row: 1; col: 0): value = 18
(sheet: 1; row: 1; col: 1): value = 79
(sheet: 1; row: 2; col: 0): value = 48
(sheet: 1; row: 2; col: 1): value = 55
(sheet: 1; row: 3; col: 0): value = 99
(sheet: 1; row: 3; col: 1): value = 35
(sheet: 1; row: 4; col: 0): value = 41
(sheet: 1; row: 4; col: 1): value = 69
(sheet: 1; row: 5; col: 0): value = 5
(sheet: 1; row: 5; col: 1): value = 18
(sheet: 1; row: 6; col: 0): value = 46
(sheet: 1; row: 6; col: 1): value = 69
(sheet: 1; row: 7; col: 0): value = 36
(sheet: 1; row: 7; col: 1): value = 67
(sheet: 1; row: 8; col: 0): value = 78
(sheet: 1; row: 8; col: 1): value = 2
(sheet: 1; row: 1; col: 2): formula = [.A2]+[.B2] (ods)
(sheet: 1; row: 2; col: 2): formula = [.A3]+[.B3] (ods)
(sheet: 1; row: 3; col: 2): formula = [.A4]+[.B4] (ods)
(sheet: 1; row: 4; col: 2): formula = [.A5]+[.B5] (ods)
(sheet: 1; row: 5; col: 2): formula = [.A6]+[.B6] (ods)
(sheet: 1; row: 6; col: 2): formula = [.A7]+[.B7] (ods)
(sheet: 1; row: 7; col: 2): formula = [.A8]+[.B8] (ods)
(sheet: 1; row: 8; col: 2): formula = [.A9]+[.B9] (ods)
Looks like we are getting the formula cell values this time around.
One thing to note is that the formula expression strings you see here follow the syntax rules of OpenFormula specification, which is the formula syntax referenced by the OpenDocument Spreadsheet format.
Implement more interfaces¶
This section has covered only a part of the available spreadsheet interfaces you can implement in your code. Refer to the Spreadsheet Interface section to see the complete list of interfaces.