File IO

Basics

All of Calc’s file IO operations, that is, saving, loading, importing and exporting are handled by ScDocShell class, which has methods such as Save, SaveAs, Load, LoadFrom, ConvertTo, and ConvertFrom to handle file IO’s. The last two functions – ConvertTo and ConvertFrom are for exporting to and importing from non-native file formats, and saving to and loading from the Excel file format are handled by these functions.

Excel file format documentation

Daniel Rentz wrote a very detailed documentation on Excel file format which you can download here. Another good source of Excel file format documentation may be this site.

Exporting to Excel file

ScDocShell’s ConvertTo method calls ScExportExcel5 when the request filter name is of one of Excel variants. Despite the 5 in the function name, it also handles all of Excel variants that Calc currently can export. Within that function, a local instance of either ExportBiff5 or ExportBiff8 is used to write to their respective version of Excel file format. Here, the term BIFF5 corresponds to Excel 5.0, while BIFF8 covers all variants that are later than (and including) Excel 97.

An instance of struct XclExpRootData, used to hold all internal document data dumped from the ScDocument instance, gets passed as a reference to either ExportBiff5 or ExportBiff8, which then calls its own Write method to write the data to output stream. As a matter of fact, ExportBiff8 is a derived child class of ExportBiff5, and does not have its own Write method. So regardless of which class gets used, ExportBiff5’s Write method gets called in the end.

Struct XclExpRootData is a child class of XclRootData, and most of the initialization of its member instances occurs in XclRootData. As XclExpRoot holds reference to the XclExpRootData instance, it is accessible to all of the child classes of XclExpRoot (ExcDocument, ExcTable, XclExpPageSettings and ExportBiff5 are all child classes of XclExpRoot).

Class ScDocument has ScExtDocOptions, which holds additional settings useful for filters. During the initialization of XclExpRootData, a copy of ScDocument’s ScExtDocOptions gets created and remains accessible via XclExpRootData throughout the export operation.

Class ExcDocument represents Excel’s document model to be exported. ExportBiff5 manages the lifecycle of ExcDocument instance. ExportBiff5’s Write method serves as the topmost “main” method of the entire export operation, which can be summarized in the following two steps:

  1. read the internal document model and translate it into Excel’s model via ExcDocument::ReadDoc, and
  2. dump it into the stream via ExcDocument::Write.

Inside ExcDocument::ReadDoc, an instance of ExcTable gets created for each Calc sheet that need to be exported, and is added to the list of tables on the ExcDocument instance. Another instance of ExcTable is used for the header section of the exported document. Each record that needs to be exported has to be its own class as a subclass of XclExpRecordBase, and is to be added to the list via ExcTable’s Add method. XclExpRecordBase has one virtual method called Save that gets called to write the actual bytes to the export file stream.

There is even a further child class of XclExpRecordBase called XclExpRecord, which also serves as a base class to get its virtual method WriteBody overwritten by the derived class. XclExpRecord’s own Save method takes care of writing the record header (the opcode and the record size), calculate the record size, and calls WriteBody of the derived class to write the data bytes to the stream. This class verifies the size of the record after it’s been written, and in case it’s different from the originally reported size, it seeks back to the record header position and overwrites the record value with the correct size.

Escher (Office Drawing Layer)

Microsoft Office 97 Drawing Layer is what’s known as the Escher format. All versions later than and including Excel 97 support Escher drawing format. Class XclEscher handles all Escher objects during export.

Importing Excel file

Dumping file stream

Sometimes it’s necessary to dump the raw content of the file stream in order to discover undocumented data, or data that Calc does not yet import. One good way to dump an arbitrary Excel file is to use the xls_dump utility I wrote, which reads a binary stream from an Excel file and dumps a structured series of record blocks to stdout. To get xls_dump, check out the go-oo source tree from the gnome.org svn repository:

svn co svn://svn.gnome.org/svn/ooo-build/trunk ooo-build-trunk

change directory into ooo-build-trunk/scratch/sc-xlsutil, and run ./xls_dump.py with the file path passed to the script as an argument.

Converting Excel model to Calc model

Cell formula

Excel’s cell formula record contains an array of formula tokens in the RPN order. Method ExcelToSc8::Convert parses and converts this Excel token array into Calc’s.

Chart objects

When importing an Excel file with chart objects and/or sheets, Calc’s importer stores their data into a buffer, and converts them to Calc’s chart models after the parsing of the file is complete. ImportExcel::PostDocLoad handles the post model conversion of all sorts of Excel models including the chart objects. The actual conversion of chart model takes place in method XclImpChChart::Convert.

Class XclImpChChart appears to act as the central class to manage a single chart object; it stores instances of other chart record classes and query them during the final conversion of the chart object.

OLE objects

XclImpDffManager contains the core implementation of Escher stream import. XclImpDrawObjBase serves as the base class for Excel’s draw objects, which is sub-classed by XclImpChartObj, XclImpDrawingObj, and XclImpOleObj.

XclImpObjectManager stores all drawing and OLE objects and additional data related to those objects. Its method ConvertObjects is perhaps a good place to start digging when debugging loading of OLE objects during import. Note that chart objects are also treated as OLE objects.

Excel chart objects are converted to OO.o chart objects in XclImpChartObj::CreateSdrObject, where a new instance of the object is created and its pointer is returned to the caller. Because of the complete re-implementation of OO.o’s chart module (dubbed ‘chart2′), all interactions with the chart code is now done via UNO API. The new chart engine expects certain interfaces to be provided by the application, and in Calc they are provided by ScChart2DataProvider, ScChart2DataSource, ScChart2LabeledDataSequence, ScChart2DataSequence, and ScChart2EmptyDataSequence which implements necessary UNO interfaces to define certain chart behaviors.

Workbook encryption

When an Excel document is saved with a workbook password, the contents of certain types of records are encrypted. Such encrypted records need to be decrypted on a per-record basis.

In Calc’s importer code, class XclImpDecrypter serves as the base class for the two derived decrypter classes XclImpBiff5Decrypter and XclImpBiff8Decrypter. As their names indicate, XclImpBiff5Decrypter decrypts BIFF5 encryption while XclImpBiff8Decrypter does BIFF8. A child class of XclImpDecrypter is expected to overwrite three private pure virtual functions: OnClone, OnUpdate, and OnRead, that are called from their respective public non-virtual functions: Clone, Update, and Read of the base class.

The actual codec implementations are found in the svx module. XclImpBiff5Decrypter uses the codec class named MSCodec_XorXLS95 to decode the encrypted streams, whereas XclImpBiff8Decrypter uses MSCodec_Std97 for the same purpose.

There are two helper classes that provide static functions to deal with Excel’s encryption: XclCryptoHelper and XclImpDecryptHelper. Class XclCryptoHelper, however, only provides method that returns Excel’s default password VelvetSweatshop which is hard-coded and is identical for both BIFF5 and BIFF8. Likewise, class XclImpDecryptHelper only has one method ReadFilepass which reads the file password from the imported document.

Certain records must be read without decryption even when the encryption flag is on. The following records must be read raw:

For MS Office 97/2000 compatible encryption, three parameters are needed to successfully decrypt the encrypted records: the unique document ID, salt, and the salt hash, all of which are read from the file being imported (See lclReadFilepass8_Standard). The document ID is later used to generate a 16-byte MD5 hash together with the actual password string that the user provided upon opening the document.

Document Protection

Excel offers two options upon protecting the document (from Microsoft Excel Help):

  • protect the structure of a workbook so that worksheets in the workbook can’t be moved, deleted, hidden, unhidden, or renamed, and new worksheets can’t be inserted
  • protect windows so that they are the same size and position each time the workbook is opened.

Interestingly, when the workbook structure is protected with password, Excel encrypts the content of the file using the 97/2000 compatible encryption with the default password (’VelvetSweatshop’), but it does not when only the workbook windows are protected even with a password.

The record IDs for windows protection, document structure protection and the password are 0×19, 0×12 and 0×13, respectively, and they all occur in the “workbook globals prefetch” section of the binary stream in this order. There is no explicit document protection flag stored in the file; instead, when either the structure or windows protection is on, the document protection gets automatically turned on. This is reflected in the fact that Excel requires at least one document protection option to be set in order to turn on the document protection at run-time.

MS Office Encryption

MSCodec_Std97

Object Construction

Class MSCodec_Std97 is a wrapper class in the svx module that implements MS Office 97/2000 compatible encryption. All of its low levels calls made inside this class are implemented in the sal module and are prefixed rtl_.

When an instance of MSCodec_Std97 is constructed, a cipher instance is created by calling rtl_cipher_create with ARCFOUR (RC4) cipher algorithm, and also an MD5 hash algorithm by calling rtl_digest_create. The ARCFOUR algorithm is used only for stream mode.

Internally, the ARCFOUR cipher instance is a void pointer to the cipherARCFOUR_Impl structure type, which holds the cipher structure (Cipher_Impl) and the context structure (ContextARCFOUR_Impl). At creation time, only the cipher structure is initialized (see rtl_cipher_createARCFOUR). The Cipher_Impl instance holds four function pointers to perform init, encode, decode and delete actions, and for the ARCFOUR cipher algorithm type, the rtl_cipher_initARCFOUR, rtl_cipher_encodeARCFOUR, rtl_cipher_decodeARCFOUR and rtl_cipher_destroyARCFOUR functions are used for those purposes, respectively.

Likewise, the MD5 hash algorithm instance is also a void pointer (typedef to rtlDigest) pointing to the DigestMD5_Impl structure type, which in turn holds the digest structure and the context structure (Digest_Impl and DigestContextMD5, respectively).

MD5 Key Initialization

MSCodec_Std97::InitKey takes two parameters – password data and unique ID (usually document ID) – and constructs key data – an array of 64 unsigned 8-bit integers – to use it to eventually create a 128-bit MD5 hash value at the end of the method. The key data is discarded at the end of the method while the hash value is stored with the class instance as a data member (m_pDigestValue).

It performs the following:

  1. First, the key data array gets filled with zeros.
  2. The password data, which are 16 16-bit unsigned integers, are transformed into 32 8-bit unsigned integers to occupy the first 32 elements of the key data.
  3. 0×80 is assigned to the 33th element.
  4. the number of letters in the password (in this case 16) gets shifted to the left by 4 bits then the lower 8 bits are taken and are stored into the 57th element of the key data as an 8-bit unsigned int.
  5. get the raw digest of the key data, and put that into the key data itself (TODO: find out what happens here).
  6. update the digest value with the key data and the unique value, which in this case is the document ID. Note that this does not alter the key data itself. This process gets repeated 16 times.
  7. 0×80 is assigned to the 17th element, and all elements above the 17th element are zeroed.
  8. 0×80 and 0×0A are assgined to the 57th and 58th elements, respectively.
  9. update the digest value with those elements of the key data greater than and including the 17th element via rtl_digest_updateMD5 call.
  10. get the final 128-bit (16-byte) digest value from the context of the digest structure instance, and store it into m_pDigestValue.

It’s probably important to understand the two functions from sal: rtl_digest_updateMD5 and rtl_digest_rawMD5 since they are used a lot in the InitKey method.

rtl_digest_updateMD5 modifies the context of the MD5 structure instance from the key data, where the context structure has the following 6 parameters: H, L, A, B, C and D, one data buffer (16 32-bit unsigned integers) and one member storing the data length (see DigestContextMD5). It does not modify the digest data structure (Digest_Impl) itself. (TODO: find out exactly what it does).

rtl_digest_rawMD5 modifies the key data and resets the context of the MD5 structure instance to the initial state. It fills the key data buffer with the A, B, C and D parameters of the MD5 context in this exact order such that the byte order of each parameter gets reversed as the value is assigned to the buffer on a per-byte basis. When this function returns, the buffer should have the first 16 elements filled with the current raw MD5 value.

Cipher Initialization

MSCodec_Std97::InitCipher takes one 32-bit unsigned integer parameter (the counter value), and initializes the cipher data for decoding. It first constructs a 512-bit block (key data) and use it to initialize the cipher data. The counter value that is passed as the parameter is embedded into the key data (as explained later).

It performs the following steps to construct the key data:

  1. fill the key data with 64 zero values (64 8-bit unsigned integers for 512 bits).
  2. copy the first 40-bits of the digest value (generated in MSCodec_Std97::InitKey) to the key data buffer.
  3. put the full 4 bytes of the counter value to the 6th to 9th element from the LSB to the MSB (the counter value is considered 32-bit little endian).
  4. assign 0×80 to the 10th element.
  5. assign 0×48 to the 57th element.
  6. generate a raw MD5 hash of the key data, and store it back into the first 16 bytes of the key data itself.

This key data is then passed to rtl_cipher_init function to initialize the cipher structure instance for decoding. See rtl_cipher_initARCFOUR and rtl_cipherARCFOUR_init_Impl on how the cipher instance is initialized. Note that only the first 16 bytes of the key data are used to initialize the cipher instance, by passing 16 as the second parameter of the cipher initialization function.

In rtl_cipherARCFOUR_init_Impl, there parameters get initialized: X, Y and S. S is a permutation of all 256 possibles bytes, while X and Y are two index pointers. The permutation S is initialized with a variable-length key data, which in this case is the MD5 hash generated above.

Key Verification

MSCodec_Std97::VerifyKey is called to verify whether or not the password that the user just typed in is correct. It takes two parameters – salt and salt hash, both of which need to be provided by the caller. These two values should be available in the encrypted Excel file.

The following steps are performed:

  1. initialize the cipher instance by calling InitCipher.
  2. decode the 16-byte salt into a 64-byte-length buffer.
  3. assign 0×80 to the 17th element of the buffer, and fill all the elements above the 17th with zero bytes.
  4. assign 0×80 to the 57th element.
  5. generate an 16-byte MD5 hash from the 64-byte buffer and store it.
  6. decode the salt hash (passed as the 2nd parameter to this method) into the 64-byte buffer.
  7. finally, compare the first 16 bytes of the decoded salt hash and the MD5 hash generated above. If they match, the password is correct, if not, then it’s the wrong password.

Resources

Comments are closed.