The art of repairing damaged PDF files

I have sometimes had a small shock when I tried to open a PDF file and the viewer showed only an error message. In some cases the viewer can repair the file and save it, but often it cannot. This experience made me think about a repair tool.

There may be various reasons for a PDF file being corrupt or damaged. Here are the most common:
  • Systematic errors caused by bad creator or processing software
  • Unwanted modification of the PDF stream caused by storage and communication software (e.g. the so-called ASCII mode of Source Safe, FTP, etc.)
  • Data loss or corruption by an unreliable communication line or a disk crash
If the repair tool is aware of the nature of the corruption, it can carry out specific fixes:
  • Systematic errors: If the creator software is known, the repair tool can find and fix its typical errors, such as missing mandatory dictionary entries, systematically wrong object structures (for example names instead of strings), badly formatted fonts, and the like.
  • Unwanted modification: A PDF file can be created in ASCII mode. If this is the case, modifications such as inserting line breaks or replacing carriage returns with the combination carriage return / line feed are not critical. If the PDF file was created in binary mode, the stream objects most likely cannot be recovered.
  • Data loss or corruption: In this case the file cannot be repaired, but the information that is still valid, such as scanned pages, can be recovered and put into a new output file.
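
To decide which kind of damage is present, a tool could start with a cheap byte-level check. The following Python lines are only a sketch of that idea. The function names are my own, and a file that contains only 7-bit bytes is merely a hint, not proof, that it was written in ASCII mode.

```python
def looks_ascii_only(data: bytes) -> bool:
    """True if every byte is 7-bit, which hints at a PDF written in ASCII mode."""
    return data.isascii()


def count_line_endings(data: bytes) -> dict:
    """Count CR/LF pairs and lone CR or LF bytes in the raw file."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lone_cr": data.count(b"\r") - crlf,
        "lone_lf": data.count(b"\n") - crlf,
    }
```

If the file is ASCII only, converted line endings mostly shift byte offsets, which a repair tool can recompute. If it contains binary streams, every converted line ending inside a stream has changed the stream data itself.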
A repair tool usually proceeds in two steps:
  1. Analysis: Check the file header and trailer, the cross reference table, the individual objects, the root object, the page tree and the related data structures.
  2. Repair or recovery: If the analysis shows that the root object can be found and the page tree is intact, the tool can repair the file. If not, it must recover as much as possible from the file and create a new one.
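
The first checks of the analysis step can be sketched in a few lines of Python. This is a minimal illustration, not a complete analyzer: the function name `analyze_pdf`, the report keys and the 1024-byte search windows are assumptions of mine, and the objects and the page tree are not inspected at all.

```python
import re


def analyze_pdf(data: bytes) -> dict:
    """Run a few cheap structural checks on the raw bytes of a PDF file."""
    report = {}
    # The header "%PDF-x.y" is expected near the start of the file.
    report["header_ok"] = re.search(rb"%PDF-\d\.\d", data[:1024]) is not None
    # The end-of-file marker is expected near the end of the file.
    tail = data[-1024:]
    report["eof_ok"] = b"%%EOF" in tail
    # "startxref" is followed by the byte offset of the cross reference section.
    offsets = re.findall(rb"startxref\s+(\d+)", tail)
    if offsets:
        target = data[int(offsets[-1]):][:32]
        # A classic table starts with "xref", a cross reference stream with "N G obj".
        report["xref_ok"] = (
            target.startswith(b"xref")
            or re.match(rb"\d+\s+\d+\s+obj", target) is not None
        )
    else:
        report["xref_ok"] = False
    return report
```

A report in which `xref_ok` is false while the other checks pass suggests that the cross reference table should be rebuilt rather than trusted.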
One of the most frequent kinds of damage is an invalid cross reference table. In general this is not critical: the tool can recover the table by scanning through the file and rebuilding it from the objects it finds. If an object is defined more than once, the latest definition is used.
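
Such a scan can be sketched as follows. This is a simplified illustration under my own assumptions: the name `rebuild_xref` is hypothetical, objects stored inside compressed object streams are not found, and byte sequences inside binary streams that happen to look like "N G obj" would be picked up by mistake.

```python
import re

# "<object number> <generation> obj" as it appears at the start of an object.
OBJ_RE = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(data: bytes) -> dict:
    """Map each object number to (generation, byte offset) by scanning the file."""
    table = {}
    for match in OBJ_RE.finditer(data):
        number, generation = int(match.group(1)), int(match.group(2))
        # Later definitions overwrite earlier ones, so the latest object wins.
        table[number] = (generation, match.start())
    return table
```

Because the scan runs from the start of the file to the end, an object appended by an incremental update simply replaces the older entry with the same number.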

Repairing the PDF object structure is quite straightforward. Some errors are not critical, such as a missing /Type /Page entry in a page node: the tool can simply add the missing entry. However, a missing /BBox [...] entry in a form XObject is fatal, and the tool must either try to recover it from the graphics contained in the object or remove the object altogether.
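
The difference between a harmless and a fatal omission can be illustrated with plain Python dictionaries standing in for parsed PDF dictionaries. Both function names and the `estimated_bbox` parameter are hypothetical. How a bounding box would be estimated from the graphics is left out here.

```python
def repair_page_node(node: dict) -> dict:
    """Return a copy of a page node with the /Type entry added if it is missing."""
    fixed = dict(node)
    fixed.setdefault("/Type", "/Page")
    return fixed


def repair_form_xobject(xobj: dict, estimated_bbox=None):
    """Return a repaired copy of a form XObject, or None if it must be removed."""
    if "/BBox" in xobj:
        return dict(xobj)
    if estimated_bbox is None:
        # Nothing to recover the bounding box from: the object has to go.
        return None
    fixed = dict(xobj)
    fixed["/BBox"] = list(estimated_bbox)
    return fixed
```

Both functions return copies, so the damaged input stays available for a later recovery attempt.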

Compressed streams, such as embedded fonts, ICC color profiles and image data, are difficult to recover if they have been damaged. Some compression algorithms are more robust than others. Sometimes a stream, such as image data, can be partly recovered. The tool can then replace the missing parts with white pixels.
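
For a Flate-compressed image, partial recovery might look like the sketch below. It rests on several assumptions of mine: the function name is hypothetical, the image has 8 bits per component in a gray or RGB color space (so that the byte 0xFF means white), and no predictor was applied before compression.

```python
import zlib


def recover_image_samples(compressed: bytes, width: int, height: int,
                          components: int = 3) -> bytes:
    """Decode as much of a Flate stream as possible and pad the rest with white."""
    expected = width * height * components
    decoder = zlib.decompressobj()
    recovered = bytearray()
    # Feed small chunks so that everything decoded before an error is kept.
    for start in range(0, len(compressed), 64):
        try:
            recovered += decoder.decompress(compressed[start:start + 64])
        except zlib.error:
            break
    recovered = recovered[:expected]
    # 0xFF is white for 8-bit gray and RGB samples (an assumption, see above).
    return bytes(recovered) + b"\xff" * (expected - len(recovered))
```

The result always has the size the image dictionary promises, so the image can be written to the new file with unchanged dimensions.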

Badly formatted font programs occur very often. In most cases they can be reformatted and fixed. If not, they must be re-embedded from the original font. If the original font is not available, it must either be replaced by a similar one or simply removed.

The above cases are just a short list of examples; a full-featured repair tool can of course do a lot more. Please post a comment and let me know which repair tool feature would be important for you.