Automating the conversion of Microsoft Office Documents to PDF/A

A central service that converts Microsoft Office documents to PDF or PDF/A has obvious advantages. The conversion runs on an enterprise-wide platform with well-defined software versions and conversion configurations. This ensures consistent quality and makes it unnecessary to deploy and operate client-based software. The price, however, is that the central service must automate native applications such as Microsoft Word, which are designed for interactive use, not for server operation.

If I had to build such a service, my first, naive approach would be to let it perform the following automation steps: run the application, then call its "open file" and "Save as PDF/A" functions. Unfortunately, it is not that easy. Why not?

Applications such as Microsoft Word, Excel and PowerPoint are designed for interactive use. They can only run as a single instance in a user session. A service, however, must be able to convert documents in parallel to make optimal use of the computer's resources. In addition, most applications notify the user through pop-up dialog boxes and similar user interface features. If this happens within the context of a service, the application blocks the process because there is no user to press the OK or Cancel button. Furthermore, interactive applications are not robust enough to process thousands of documents: they become unstable after a while and need to be terminated. Finally, the PDF/A document they produce sometimes does not conform to the standard or is of inferior visual quality.

For these reasons, a service that automates the conversion of Microsoft Office documents to PDF/A must do much more than the naive approach described above. The most important tasks are:
  • Run the application in multiple instances, each in its own "worker session". This allows conversions to be executed in parallel.
  • Automate the conversion process by controlling the application through an API and by running a "robot" that operates the application's user interface (pressing OK buttons, reading messages and acting on them, etc.).
  • Monitor the health of the applications and restart them when necessary.
  • Use the best available means to create a PDF file and convert it to PDF/A in a post-processing step. Some applications have a built-in "Save as PDF" function, others can print to a virtual printer driver, and some can produce a file format (XPS, PostScript, etc.) that can be converted to PDF/A.
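The first and third tasks (parallel workers and health monitoring) can be illustrated with a minimal Python sketch. It is not how any particular product works. It assumes that each conversion is started as a separate process by some command-line launcher, which is a placeholder here, and the names `Result`, `run_job` and `convert_all` are hypothetical. A job that hangs, for example because of an unanswered dialog box, is killed when the timeout expires and is optionally retried. The user interface "robot" and the PDF/A post-processing are out of scope.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Result:
    source: str   # the document that was submitted
    ok: bool      # True if the converter exited with code 0
    detail: str   # converter output on success, failure reason otherwise


def run_job(command, source, timeout_s=60.0, retries=1):
    """Run one conversion in its own process; kill and retry it if it hangs."""
    detail = "not started"
    for attempt in range(1, retries + 2):
        try:
            proc = subprocess.run(
                command + [source],   # hypothetical launcher plus the file name
                capture_output=True,
                text=True,
                timeout=timeout_s,    # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            detail = f"timed out after {timeout_s}s (attempt {attempt})"
            continue
        if proc.returncode == 0:
            return Result(source, True, proc.stdout.strip())
        detail = f"exit code {proc.returncode} (attempt {attempt})"
    return Result(source, False, detail)


def convert_all(command, sources, workers=4, timeout_s=60.0, retries=1):
    """Convert several documents in parallel, one worker process per job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = pool.map(lambda s: run_job(command, s, timeout_s, retries), sources)
        return list(jobs)
```

Here `command` would be the argument list that starts one converter instance, and each `Result` tells the caller whether a document succeeded or why it failed. Starting a fresh process per job is the simplest way to get isolation. A real service would more likely keep long-lived worker sessions and restart them only when they misbehave.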
Conversion is not the only function such a service can provide. Once the service has a document in hand, it can offer additional, very useful post-processing steps. Here are some of them:
  • Merging documents
  • Applying digital signatures
  • Embedding XML data in electronic invoice documents (ZUGFeRD standard)
  • Embedding XMP metadata
  • Stamping
If you found this article useful, please let me know and post a comment.

15 comments :

  1. Nice.

    If you are building a server application you can choose your weapons:-)

    Better not to start from Word :-) The solutions are all kludges and are not reliable in the long term or at high volume.

    A radical solution is to use Markdown and go straight to PDF. It will work better because Word's design goal is to make it easy to tweak layouts or styles. This militates against getting uniform-looking documents.

    Markdown leaves all the styling to the program creating the PDF, so the user cannot make it look different. This actually saves everyone a lot of time and effort.

    Replies
    1. Thank you for your comment, Dave. I fully agree, provided you can freely choose your source format. In enterprises you might encounter a large zoo of source formats delivered by users and established business applications, MS Office formats among them.
