Are the PDF/A space requirements a show stopper for archiving?

PDF/A requires that all resources, such as fonts and color profiles, be embedded in the file. The archiving of transactional documents can therefore be a nightmare, because such documents are usually short by nature, yet each contains its own copy of the same Frutiger font, sRGB color profile, and company logo. Many archives therefore prefer TIFF over PDF/A when it comes to born-digital documents. But that is certainly not the idea of a uniform standard. How can this problem be solved?

PDF/A is widely accepted in archives for scanned documents. This is mainly due to the fact that PDF/A offers strong, standardized compression algorithms which can reduce a color-scanned page to less than 50 KB. Even for individual born-digital documents, PDF/A is the preferred file format. However, the use of PDF/A in the mass archiving of transactional documents is still disputed. In my opinion, though, this is not a problem of the format; it is a problem of the archiving system and must therefore be solved there.

Most archiving systems take pride in storing 'objects' without caring about their format. This unawareness has a crucial disadvantage, however: they cannot handle the files in an appropriate and intelligent way. Most solutions for the mass archiving of PDF/A documents therefore add a software layer to the archiving system which tries to reduce the negative effects of repetitively embedded resources. There are two main approaches to this software layer.

The first approach collects individual documents and merges them into a single container file whose resources can be optimized so that each distinct resource occurs only once. This container file is then submitted to the archive. When a document is retrieved, the container file is fetched and split back into the original documents.
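The container approach can be sketched as follows. This is a minimal illustration, not how any real archiving product implements it; the representation of a document as a body plus a dictionary of named resource payloads, and all function names, are hypothetical.

```python
import hashlib

def build_container(documents):
    """Merge (body, resources) documents into one container in which
    each distinct resource payload is stored only once, keyed by its
    content hash. `documents` is a list of (body_bytes, {name: bytes})."""
    pool = {}       # sha256 digest -> resource bytes, one copy each
    entries = []    # per-document body plus references into the pool
    for body, resources in documents:
        refs = {}
        for name, payload in resources.items():
            digest = hashlib.sha256(payload).hexdigest()
            pool.setdefault(digest, payload)  # keep only the first copy
            refs[name] = digest
        entries.append({"body": body, "refs": refs})
    return {"pool": pool, "entries": entries}

def split_container(container):
    """Rebuild the original documents from the container on retrieval."""
    pool = container["pool"]
    return [(entry["body"],
             {name: pool[digest] for name, digest in entry["refs"].items()})
            for entry in container["entries"]]
```

With two invoices that embed the same font and logo, the container pool holds each payload once, and splitting the container reproduces the original documents byte for byte.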

The second approach separates each document into individual resource files and a body document which refers to them. The resources are then optimized by replacing identical copies with a single instance. The optimized resource files and the body documents are submitted to the archive. When a document is retrieved, it is rebuilt from its parts.
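A toy model of this second approach might look like the following. Again, this is a sketch under assumptions: the archive is modeled as an in-memory object store, and the class and method names are invented for illustration.

```python
import hashlib

class DedupArchive:
    """Toy model of the split-and-deduplicate approach: resources are
    archived as separate objects, deduplicated by content hash, while
    the body document stores only references and is rebuilt on retrieval."""

    def __init__(self):
        self.resources = {}  # digest -> resource bytes, one copy each
        self.bodies = {}     # document id -> (body bytes, {name: digest})

    def store(self, doc_id, body, resources):
        refs = {}
        for name, payload in resources.items():
            digest = hashlib.sha256(payload).hexdigest()
            self.resources.setdefault(digest, payload)  # single instance
            refs[name] = digest
        self.bodies[doc_id] = (body, refs)

    def retrieve(self, doc_id):
        # Rebuild the document from its body and the shared resources.
        body, refs = self.bodies[doc_id]
        return body, {name: self.resources[d] for name, d in refs.items()}
```

Note that, unlike the container approach, each document can be stored and retrieved individually without touching its neighbors, which is where the performance advantage mentioned below comes from.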

I personally prefer the second approach, since it can be implemented with much higher performance than the first. However, people criticize that the archived 'objects' are no longer PDF/A documents. In my opinion this is not required, because the software layer which splits and merges the resources does so transparently for the user and guarantees that the document is identical before it is stored and after it is retrieved. This argument is easier to accept if the mechanism is compared to the compression or encryption algorithms within the storage layer of the archiving system: the data written to the media is no longer a PDF/A file if it is stored in compressed or encrypted form, yet after decompression or decryption it is the identical file again. The same is true for the resource management software layer.
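The compression analogy boils down to a lossless round trip. A minimal sketch, using zlib purely as a stand-in for whatever the storage layer actually does:

```python
import zlib

def store_on_media(pdf_bytes: bytes) -> bytes:
    # What sits on the media is no longer a valid PDF/A file.
    return zlib.compress(pdf_bytes)

def retrieve_from_media(stored: bytes) -> bytes:
    # After decompression it is the identical file again.
    return zlib.decompress(stored)

original = b"%PDF-1.7 ... (document content) ..."
assert retrieve_from_media(store_on_media(original)) == original
```

The resource management layer makes the same guarantee: the bytes in the archive differ from the submitted file, but the retrieved document is bit-identical to the original.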

I have implemented the second approach for customers with huge document volumes, and it works flawlessly, saves space, and reduces cost.

What do you think about this? Please let me know and post a comment!