Digging for information by extracting data from a PDF document

Extracting text from a PDF document is one of the most popular information retrieval functions. But what about other information, such as images and metadata? Extraction can be simple, but it can also be tricky.

Metadata is among the easiest things to extract. Document metadata can usually be extracted as a short XMP stream. Even if the document contains an old-fashioned information dictionary instead, extracting its key/value pairs is not a big deal. Outlines (bookmarks) and navigation aids such as named destinations and links are similarly straightforward.
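To illustrate how little is needed for the XMP case, here is a minimal Python sketch using only the standard library. It relies on the fact that an XMP packet is wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers, so it can often be found by scanning the raw file bytes. The function names `extract_xmp_packet` and `xmp_title` are made up for this example. The sketch assumes the metadata stream is stored uncompressed and the document is not encrypted; when that does not hold, a real PDF library is required.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# An XMP packet is delimited by xpacket processing instructions.
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# Namespaces used by the Dublin Core title property inside XMP.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def extract_xmp_packet(pdf_bytes: bytes) -> Optional[str]:
    """Return the XML body of the first XMP packet found, or None.

    Assumption: the metadata stream is unfiltered (not compressed)
    and the file is not encrypted.
    """
    match = XPACKET.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("utf-8", errors="replace").strip()


def xmp_title(xmp_xml: str) -> Optional[str]:
    """Return the first dc:title entry of an XMP packet, or None."""
    root = ET.fromstring(xmp_xml)
    item = root.find(f".//{DC}title/{RDF}Alt/{RDF}li")
    return item.text if item is not None else None
```

A caller would read the file with `open(path, "rb").read()`, pass the bytes to `extract_xmp_packet`, and hand the result to `xmp_title` if it is not `None`.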

However, extracting the graphics contents of a page is much more complex. In theory, it is possible to extract each content object together with its associated resource objects and use them to create an HTML page, or a page in any other description language. In practice, this proves too complex because of the graphics model that PDF offers. This model has some unique features such as patterns, shadings and transparency groups with a variety of blend modes. Furthermore, the scan conversion rules differ significantly from those built into commercially available graphics processors. Thus, mapping a PDF page description to HTML, PCL or even PostScript can only be achieved by transforming the page description using transparency flattening and other techniques.

For this reason, if you have to convert page contents to another document format, it is much wiser to use a specialized converter tool such as the PDF to Image Converter.

Most applications deal with the extraction of text. Typical uses include the classification of transaction documents such as invoices and the implementation of a text search function in document repositories. For further information, please refer to this article: Why is the extraction of text from a PDF document such a hassle?

As outlined above, extracting information from a PDF document can be very simple but also quite tricky. It depends on what kind of information the application requires. To make programming such applications as easy as possible, we have created a specialized tool, the PDF Extract Tool. It offers an easy-to-use interface that has been designed based on the above insights. Most use cases can be handled with only a few lines of code. This is achieved by hiding some features of the PDF graphics model, such as coordinate transforms, from the programmer.
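To give an idea of what "coordinate transforms" means here, the following Python sketch shows the bookkeeping a programmer would otherwise have to do by hand. It is not the PDF Extract Tool's interface; the names `concat` and `apply` are invented for this illustration. In PDF, a transformation matrix is written as six numbers `a b c d e f`, and each `cm` operator in a content stream concatenates its matrix with the current transformation matrix (CTM), with the new matrix applied first.

```python
from typing import Tuple

# A PDF matrix "a b c d e f" stands for the 3x3 matrix
# [[a, b, 0], [c, d, 0], [e, f, 1]] acting on row vectors [x, y, 1].
Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1, 0, 0, 1, 0, 0)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Return m x ctm, which is what the cm operator does to the CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point from the current user space to the default page space."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


# Content stream fragment: "1 0 0 1 100 200 cm  2 0 0 2 0 0 cm"
ctm = IDENTITY
ctm = concat((1, 0, 0, 1, 100, 200), ctm)  # translate by (100, 200)
ctm = concat((2, 0, 0, 2, 0, 0), ctm)      # then scale by 2
```

With this CTM, `apply(ctm, 10, 10)` maps the point (10, 10) to (120, 220) on the page: it is scaled first and translated afterwards. Every extracted position, such as the origin of a text run or the corner of an image, has to go through this kind of mapping before it is meaningful in page coordinates.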

I hope this article was useful to you. If you have any questions, don't hesitate to post a comment. I'll be happy to answer them.