Automating the conversion of Microsoft Office Documents to PDF/A

A central service to convert Microsoft Office documents to PDF or PDF/A has obvious advantages. The conversion runs on an enterprise-wide platform with well-defined software versions and conversion process configurations. This ensures consistent quality and makes it unnecessary to deploy and operate client-based software. The price for this, however, is that the central service must automate the native applications, such as Microsoft Word, which are designed for interactive use, not for server operation.

If I had to build such a service, my first naive approach would be to let the service perform the following automation steps: run the application and call the "Open file" and "Save as PDF/A" functions. Unfortunately, it is not that easy. Why not?

Applications such as Microsoft Word, Excel and PowerPoint are designed for interactive use. They can only run as a single instance in a user session. A service, however, must be able to convert documents in parallel to make optimal use of the computer's resources. In addition, most applications notify the user with pop-up dialog boxes and similar user interface features. If this happens within the context of a service, the application blocks the process because there is no user to press the OK or Cancel button. Furthermore, interactive applications are not robust enough to process thousands of documents. They become unstable after a while and need to be terminated. Finally, the PDF/A documents they produce sometimes do not conform to the standard or are of inferior visual quality.

For these reasons, a service that automates the conversion of Microsoft Office documents to PDF/A must do much more than the naive approach described above. The most important tasks are:
  • Run the application in multiple "worker sessions", one instance per session. This allows conversions to be executed in parallel.
  • Automate the conversion process by controlling the application through an API, and run a "robot" to operate the application's user interface (press OK buttons, read messages and act on them, etc.).
  • Monitor the health of the applications and restart them when necessary.
  • Use the best available means to create a PDF file and convert it into a PDF/A file in a post-processing step. Some applications have a built-in "Save as PDF" function, others can print to a virtual printer driver, and some can produce a file format (XPS, PostScript, etc.) that can be converted to PDF/A.
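
To make the first and third tasks more concrete, here is a minimal Python sketch of a worker pool with a watchdog. It is an illustration under assumptions, not the implementation of any particular product: the function names convert_with_watchdog and convert_all are made up for this example, and each "command" is only a stand-in for whatever actually launches and drives the application inside a worker session. The UI robot is not shown.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def convert_with_watchdog(command, timeout_s=60.0, max_attempts=2):
    """Run one conversion command; kill and retry it if it hangs or fails.

    `command` is a hypothetical stand-in for the real application launch.
    Returns True on success, False once all attempts are used up.
    """
    for _ in range(max_attempts):
        try:
            # subprocess.run kills the child when the timeout expires, which
            # stands in for "terminate the blocked application".
            result = subprocess.run(
                command,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            continue  # hung (for example on a dialog box): try a fresh instance
        if result.returncode == 0:
            return True
    return False


def convert_all(commands, workers=4, timeout_s=60.0):
    """Run conversion commands in parallel, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: convert_with_watchdog(c, timeout_s), commands))


if __name__ == "__main__":
    # Stand-in jobs: one that finishes at once and one that never returns in time.
    ok = [sys.executable, "-c", "pass"]
    hang = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(convert_all([ok, hang], timeout_s=2.0))
```

The timeout plays the role of the health monitor here: a job that does not finish in time is treated as blocked, its process is killed, and a fresh one is started. A real service would also need to clean up any helper processes the application leaves behind.
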
Conversion is not the only function such a service can provide. Once the service has hold of a document, it can offer some additional, very useful post-processing steps. Here are some of them:
  • Merging documents
  • Applying digital signatures
  • Embedding XML data in electronic invoice documents (ZUGFeRD standard)
  • Embedding XMP metadata
  • Stamping
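
As a small illustration of the XMP item, the following Python sketch builds the part of an XMP packet that identifies a file as PDF/A, using the pdfaid:part and pdfaid:conformance properties. The function name build_pdfa_xmp is made up for this example, and which metadata a real service writes is an assumption on my part. The sketch only produces the XML; the xpacket wrapper and the step of embedding the result in the PDF file (which needs a PDF library) are left out.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP and by the PDF/A identification schema.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
}


def build_pdfa_xmp(part, conformance):
    """Return XMP metadata (as a string) declaring a PDF/A part and level."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)  # keep readable prefixes in the output

    xmpmeta = ET.Element(f"{{{NS['x']}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS['rdf']}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{NS['rdf']}}}Description",
        {f"{{{NS['rdf']}}}about": ""},
    )
    ET.SubElement(description, f"{{{NS['pdfaid']}}}part").text = str(part)
    ET.SubElement(description, f"{{{NS['pdfaid']}}}conformance").text = conformance
    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    # Example: identification for PDF/A-1b.
    print(build_pdfa_xmp(1, "B"))
```
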
If you found this article useful, please let me know and post a comment.