Splitting and merging pages of PDF documents

Singling out pages from a number of input documents and rearranging them into a set of output documents is part of the daily routine in a document assembly application. At first glance, this seems to be a clear and simple task. But PDF has some special features that you should keep an eye on during assembly.

Essentially, a page split and merge tool must be able to handle two kinds of data structures:

All objects which belong to a specific page
Objects which belong to the document but relate to a specific page

Let us start with the first kind. Retrieving a page object from an input document's page tree and inserting it into the output document's page tree is straightforward and fairly easy to implement. All objects that are referenced by that page object are copied as well. This works well even if the referenced objects are shared objects such as page resources (fonts, color spaces etc.) and content stream objects. The tool only needs to ensure that shared objects remain shared in the output document, rather than being duplicated for every page that uses them, which is certainly not rocket science. So far so good.
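To make the "shared objects remain shared" point concrete, here is a minimal sketch in Python. It does not use a real PDF library: the documents are plain dictionaries that map object numbers to objects, and Ref, copy_object and copy_page are hypothetical names invented for this illustration. The key idea is the "copied" map, which remembers every source object that has already been copied so that it is copied only once. The sketch also leaves out the page's Parent entry, because following it would pull the whole source page tree into the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a PDF indirect reference such as '12 0 R'."""
    num: int


def copy_object(value, src, dst, copied):
    """Copy value from src into dst, reusing objects that were copied before.

    copied maps source object numbers to output object numbers.
    """
    if isinstance(value, Ref):
        if value.num not in copied:
            new_num = len(dst) + 1
            copied[value.num] = new_num
            dst[new_num] = None  # reserve the slot first so reference cycles terminate
            dst[new_num] = copy_object(src[value.num], src, dst, copied)
        return Ref(copied[value.num])
    if isinstance(value, dict):
        return {k: copy_object(v, src, dst, copied) for k, v in value.items()}
    if isinstance(value, list):
        return [copy_object(v, src, dst, copied) for v in value]
    return value  # numbers, names, strings


def copy_page(page_num, src, dst, copied):
    """Copy one page object and everything it references; return its new number."""
    # Skip Parent: the output document builds its own page tree.
    page = {k: v for k, v in src[page_num].items() if k != "Parent"}
    new_num = len(dst) + 1
    copied[page_num] = new_num
    dst[new_num] = None
    dst[new_num] = copy_object(page, src, dst, copied)
    return new_num


# Two pages that share one font object (object 3).
src = {
    1: {"Type": "Page", "Parent": Ref(9),
        "Resources": {"Font": {"F1": Ref(3)}}, "Contents": Ref(4)},
    2: {"Type": "Page", "Parent": Ref(9),
        "Resources": {"Font": {"F1": Ref(3)}}, "Contents": Ref(5)},
    3: {"Type": "Font", "BaseFont": "Helvetica"},
    4: "content stream of page 1",
    5: "content stream of page 2",
    9: {"Type": "Pages", "Kids": [Ref(1), Ref(2)]},
}
dst, copied = {}, {}
p1 = copy_page(1, src, dst, copied)
p2 = copy_page(2, src, dst, copied)
fonts = [obj for obj in dst.values() if isinstance(obj, dict) and obj.get("Type") == "Font"]
print(len(dst), len(fonts))  # 5 objects in the output, only 1 of them a font
```

Without the "copied" map, the font would end up in the output once per page, which inflates the file for no benefit.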

Things get slightly more complicated, however, for all objects which are not directly referenced by the specific page but relate to it in some way. Those objects belong to the document itself and are common to all pages. Examples of such objects are outline trees, named destination trees, forms and many more. In most cases it doesn't make sense to copy all these objects to the output document. The tool has to reduce those data structures to a meaningful subset. For example, only those outline entries are copied whose destinations lie within the set of pages in the output document. Finding out which objects relate to this page set is not always easy and may require the tool to follow configurable policies.
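As an illustration of such a reduction, the following sketch prunes an outline tree to the pages that end up in the output. It is only a sketch under simplifying assumptions: outline entries are modelled as dictionaries with a title, a source page index and a list of children, and prune_outline and its keep_empty_parents switch are hypothetical names for one possible policy, namely what to do with a parent entry whose own target page is gone while some of its children survive.

```python
def prune_outline(items, page_map, keep_empty_parents=True):
    """Return a copy of the outline restricted to the pages in page_map.

    page_map maps source page indices to output page indices.
    """
    result = []
    for item in items:
        children = prune_outline(item["children"], page_map, keep_empty_parents)
        target = page_map.get(item["page"])
        if target is None and not (children and keep_empty_parents):
            # The entry's own page is not in the output: drop it and
            # promote any surviving children one level up.
            result.extend(children)
            continue
        # target may be None here: the entry is kept as a title without destination.
        result.append({"title": item["title"], "page": target, "children": children})
    return result


outline = [
    {"title": "Chapter 1", "page": 0, "children": [
        {"title": "1.1", "page": 1, "children": []},
        {"title": "1.2", "page": 3, "children": []},
    ]},
    {"title": "Chapter 2", "page": 5, "children": [
        {"title": "2.1", "page": 6, "children": []},
    ]},
]

# The output document receives source pages 3 and 4 as its pages 0 and 1.
page_map = {3: 0, 4: 1}
kept = prune_outline(outline, page_map)
flat = prune_outline(outline, page_map, keep_empty_parents=False)
```

With the default policy, "Chapter 1" survives as a grouping entry without a destination and contains only "1.2". With keep_empty_parents=False, "1.2" becomes a top-level entry. Neither result is the right one in general, which is exactly why this has to be configurable.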

Merging pages from various source documents is much harder than splitting. Again, copying pages and their referenced objects is easy. Merging objects from document-level data structures such as outline trees, named destination trees etc. is generally not. This is because the names of tree elements from different sources may not be unique, and the tool must resolve these collisions. To do so, the tool again must follow configurable policies.
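The name collision problem can be sketched as follows. Again this is a simplified model and not the API of any real library: each source contributes a dictionary of destination names mapped to output page indices, and merge_named_destinations with its two policies, "rename" and "keep_first", is a hypothetical example of what a configurable policy might look like.

```python
def merge_named_destinations(sources, policy="rename"):
    """Merge destination dictionaries from several sources into one.

    sources is a list of (source_id, {name: target}) pairs.
    Returns the merged dictionary and a map of the renames that were applied.
    """
    merged = {}
    renamed = {}  # (source_id, old_name) -> new_name
    for source_id, dests in sources:
        for name, target in dests.items():
            if name not in merged:
                merged[name] = target
            elif policy == "keep_first":
                continue  # the later destination is dropped
            elif policy == "rename":
                counter = 2
                while f"{name}.{counter}" in merged:
                    counter += 1
                new_name = f"{name}.{counter}"
                merged[new_name] = target
                renamed[(source_id, name)] = new_name
            else:
                raise ValueError(f"unknown policy: {policy}")
    return merged, renamed


sources = [
    ("a.pdf", {"intro": 0, "summary": 4}),
    ("b.pdf", {"intro": 5, "index": 9}),
]
merged, renamed = merge_named_destinations(sources)
first_only, _ = merge_named_destinations(sources, policy="keep_first")
```

The rename map matters as much as the merged tree: every link or outline entry copied from b.pdf that pointed to "intro" has to be rewritten to the new name, otherwise it would silently jump into a.pdf.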

A well-designed split and merge tool, however, is able to handle the special situations described above. In addition to the main function, most tools provide further functions that are useful in the context of document assembly. Some of these are:

Decrypt input documents and encrypt output documents
Linearize output document
Rotate pages
Enlarge and shrink page sizes
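The last two functions in this list can be sketched on the same kind of simplified page dictionary. The functions rotate_page and scale_page are hypothetical names. The sketch assumes that the page carries its own Rotate and MediaBox entries (in a real PDF both may be inherited from the page tree), that the content is a single stream held as a string, and it ignores annotations, which would have to be scaled separately.

```python
def rotate_page(page, degrees):
    """Return a copy of the page rotated clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    rotated = dict(page)
    rotated["Rotate"] = (page.get("Rotate", 0) + degrees) % 360
    return rotated


def scale_page(page, factor):
    """Return a copy of the page with its size and its content scaled by factor."""
    scaled = dict(page)
    scaled["MediaBox"] = [round(v * factor, 3) for v in page["MediaBox"]]
    # Changing the box alone would only crop or pad the page, so the content
    # is wrapped in a scaling matrix (the cm operator) as well.
    scaled["Contents"] = f"q {factor} 0 0 {factor} 0 0 cm\n{page['Contents']}\nQ"
    return scaled


page = {"MediaBox": [0, 0, 612, 792], "Rotate": 270, "Contents": "0 0 100 100 re f"}
turned = rotate_page(page, 180)
halved = scale_page(page, 0.5)
print(turned["Rotate"], halved["MediaBox"])  # 90 [0.0, 0.0, 306.0, 396.0]
```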

I hope this article was helpful to you. Please let me know your thoughts and post a comment.