Recently, developers here at Sinara have built a number of applications that have involved dynamically generating large numbers of Excel spreadsheets from database tables. In some cases, the spreadsheets had simple layouts, with just raw numbers in cells, whilst others had more sophisticated formatting and charts. In addition to the core functionality, the performance of the applications was an important requirement, as well as the need to avoid installing Excel itself on the target machines. To achieve these goals, we made use of the Microsoft Office OpenXML SDK for .NET, which is a library that makes it easier (though not trivial) to generate Office documents in the relatively newer Office XML formats (DOCX, XLSX, etc). This SDK, supplied as a free download from Microsoft, provides a complete set of .NET classes that represent the various XML elements that make up the file formats. It allows consuming applications to generate new documents and also to load existing documents and inspect or modify their content. Documents generated using the OpenXML SDK can be opened in Office 2007 and later, and also by Office 2003 if the file converter extension is installed.
The SDK does not support generating documents in the older binary formats (DOC, XLS, etc). These binary file formats are still proprietary and their specifications were kept undisclosed for many years. This often meant the only way to generate an Office document was to use OLE Automation (usually through VBA) to actually run the program in the background and dynamically build up the document. This, of course, was not ideal for server applications, as it meant that Office had to be installed on the server machine. Some component vendors did succeed in reverse engineering parts of the file formats and making Office document generation tools available. Eventually, the file format specifications were made public, but these were not complete and some reverse engineering work by vendors remained necessary.
The modern Office file formats, which have superseded the binary formats, are based upon a combination of standard XML and zip compression technologies. Each Office document is actually a zip file that contains a number of XML files (and perhaps image or other binary files) that represent the content of the document. One can easily verify this by changing the file extension of any Office document to ‘.zip’ and opening it using Windows Explorer or a compression utility.
The fact that the modern Office file formats are XML based, rather than binary, makes it significantly easier to generate documents that can be opened in these applications. However, the XML is still very complex and is not straightforward to construct. Ideally, a developer would typically use a library that lets them focus on the high-level document layout and content, and abstracts away the underlying file format. The OpenXML SDK does not do this–it is therefore important to be aware that this SDK is not a high-level document generation library. It is designed around OpenXML elements, not cells, formulas or graphs–it is actually a thin wrapper around the XML elements that make up the file format. This means that it is still necessary to gain some understanding of how, for example, charts or cells are specified within the XML and how they are linked to each other using unique identifiers. However, developers do not have to worry about the lower level details of XML tag generation, formatting, zip compression, etc.
The SDK does includes a tool (named the Productivity Tool) that can make a developer’s life much easier by automatically generating C# code that uses the SDK classes to in turn generate any specified document. For example, if you have a spreadsheet template built in Excel with a certain layout, colours, logos etc, you can use the Productivity Tool to generate C# code that, when run, will generate the very same spreadsheet as its output. It is then possible to include that code into your program and customise it to add in the dynamic portions. It also makes it easier for the developer to learn exactly how certain features, e.g. spreadsheet graphs or images, are specified in the file format.
The design pattern that we have used at Sinara is to use the Productivity Tool to generate code based on a spreadsheet template. This code is never manually edited and is placed into a ‘designer’ file, e.g. “My Spreadsheet.Designer.cs”. We then create a custom file, “MySpreadsheet.cs”, which contains manually written code that customises whatever has been previously constructed in memory using the generated code. The advantage to this is that if the template needs to be changed, all that is required is to regenerate the template code using the Productivity Tool and replace the designer file.
If the template itself needs to be loaded dynamically, because the end users may wish to change it from time to time, then an application can use the OpenXML SDK to load the spreadsheet template from a file and create a copy of it in memory. It can then add or modify the various XML elements as required. These can then be written back to disk as a new Excel file.
As one of our requirements was to generate large numbers of spreadsheets in a relatively small amount of time, it was important that any library we used had good performance characteristics. The standard alternative to the OpenXML SDK would have been to use the Office Interop libraries provided by Microsoft to directly launch Excel ‘in the background’ and make use of its own object model to construct the spreadsheet and save it to disk. This would actually have been a far easier solution to develop, as the object model exposed by Excel is much more high-level than its corresponding file format. However, this would have imposed a significant performance penalty and would not be an efficient way to generate large numbers of spreadsheets. Of course, it would also require Office to be installed on the target machine.
Overall, although this SDK does have a considerable learning curve, it provides a good solution for both client and server-side applications that require to generate Office documents without relying on Office itself to be installed.