When I extract images from a PDF file, I sometimes get a bunch of slices of the original image instead of the image itself. Most slices consist of a few image rows or, in extreme cases, just one row. Why is that, and how can I get the entire image in one piece?
There are various reasons for dividing an image into slices and storing them as separate image objects in a PDF file. One obvious reason is that the PDF creation software imports an already sliced source image, e.g. a TIFF file containing strips or tiles, without merging the slices into one image. Another common reason is that the PDF creation software has architectural limits on the size of the image sample data, e.g. a native Windows application that creates a PDF file through a virtual printer driver. And sometimes a graphics library, such as GDI+, implements masked images by creating slices for the visible parts.
Once we have understood how slices arise, we also know how to put the pieces together again. But this is certainly not easy. Here's how I've done it in one of our products. Let's call it the image merger here.
The image merger reads the content stream object by object. If it encounters an image, it sets up an empty surface and an image mask with all bits set to 'invisible'. The slice is stored in the surface and the corresponding bits in the mask are set to 'visible'. If the next object is also an image, that slice is stored in the same way. This process is repeated until another object type is encountered or it is obvious that the image is not a slice of the current image, e.g. because its color space differs. When that happens, the enclosing rectangle of all slices is computed, the merged image is copied to the output file, and the surface is reset to its initial state.
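To make the steps above more concrete, here is a minimal sketch in Python. It is not the code from the product. The names ImageSlice, MergedImage, flush and merge_slices are made up for illustration, and the sketch assumes that every slice already comes with integer pixel coordinates and one sample per pixel. In a real PDF the position has to be derived from the transformation matrix in effect when the image is drawn. As a simplification, the sketch also collects the slices first and allocates the surface and the mask only when the run ends.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ImageSlice:
    """Hypothetical stand-in for an image object in a content stream."""
    x: int                 # left edge in pixels (assumed to be known already)
    y: int                 # top edge in pixels
    colorspace: str
    rows: List[List[int]]  # one sample per pixel, row by row

    @property
    def width(self) -> int:
        return len(self.rows[0])

    @property
    def height(self) -> int:
        return len(self.rows)


@dataclass
class MergedImage:
    """Hypothetical result: the merged surface plus its visibility mask."""
    x: int
    y: int
    colorspace: str
    rows: List[List[int]]
    mask: List[List[bool]]  # True = 'visible', False = 'invisible'


def flush(run: List[ImageSlice]) -> MergedImage:
    """Merge a run of consecutive slices into one image."""
    # Enclosing rectangle of all slices in the run.
    left = min(s.x for s in run)
    top = min(s.y for s in run)
    right = max(s.x + s.width for s in run)
    bottom = max(s.y + s.height for s in run)
    width, height = right - left, bottom - top

    # Empty surface and a mask with all bits set to 'invisible'.
    surface = [[0] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]

    # Store each slice and mark the pixels it covers as 'visible'.
    for s in run:
        for r, row in enumerate(s.rows):
            for c, sample in enumerate(row):
                surface[s.y - top + r][s.x - left + c] = sample
                mask[s.y - top + r][s.x - left + c] = True
    return MergedImage(left, top, run[0].colorspace, surface, mask)


def merge_slices(objects: List[Any]) -> List[Any]:
    """Replace each run of consecutive, compatible slices by one merged image."""
    output: List[Any] = []
    run: List[ImageSlice] = []
    for obj in objects:
        is_slice = isinstance(obj, ImageSlice)
        # A different object type or a different color space ends the run.
        if run and (not is_slice or obj.colorspace != run[0].colorspace):
            output.append(flush(run))
            run = []
        if is_slice:
            run.append(obj)
        else:
            output.append(obj)  # pass other objects through unchanged
    if run:
        output.append(flush(run))
    return output
```

Everything that is not an ImageSlice passes through unchanged, so the order of the other objects in the stream is preserved. Pixels that no slice covers stay 'invisible' in the mask, which is what keeps masked images intact after merging.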
I must admit that this is not a very sophisticated algorithm and I hope you will have a better approach. Please let me know how you would solve the problem and post a comment!