Image detection in scanned images

Detecting pictures in scanned document pages brings several advantages, such as better compression rates and the ability to extract the pictures individually.

A scanned page is originally a raster image consisting of bi-level or color pixels. Because sophisticated compression methods are available, scanning in color is clearly preferable to pure black-and-white modes. One such method is mixed raster content (MRC), which separates the scanned image into a background, a mask and a foreground layer. Each layer can be compressed individually with a specialized algorithm parameterized for its purpose, for example JBIG2 for the mask and JPEG2000 for the background layer.
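To make the layer model concrete, here is a minimal sketch in plain Python. It assumes a grayscale page stored as nested lists and a fixed threshold of 128. Both are simplifications for illustration: real separation algorithms are adaptive, and the function names `separate` and `compose` are made up for this example. The mask selects, pixel by pixel, whether the foreground or the background value is shown.

```python
# Toy MRC decomposition of a grayscale page (0 = black, 255 = white).
# The fixed threshold is an assumption; real separators are adaptive.

def separate(page, threshold=128):
    """Split a page into (mask, foreground, background) layers."""
    # Mask: 1 where the pixel is dark enough to count as text or line art.
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    # Foreground keeps the value of masked pixels; elsewhere it holds a
    # don't-care value (0 here).
    foreground = [[px if m else 0 for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    # Background keeps unmasked pixels; masked holes are filled with white.
    background = [[255 if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, foreground, background


def compose(mask, foreground, background):
    """Rebuild the page: the mask picks foreground or background per pixel."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


page = [
    [250, 248, 20, 251],
    [249, 15, 18, 252],
    [247, 246, 22, 250],
]
mask, fg, bg = separate(page)
restored = compose(mask, fg, bg)
```

In this toy example the composition is lossless. In practice the foreground and background layers are usually compressed with lossy codecs, so the recomposed page only approximates the scan.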

There can be multiple foreground layers, e.g. for photographic images that are part of the scanned page. To separate these images from the background and mask layers, a dedicated segmentation algorithm must detect and isolate them. Each of these images can then form an individual foreground layer compressed with a suitable algorithm such as JPEG.
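The article does not describe the segmentation algorithm itself, so the following is only a sketch of one simple heuristic, not the method used in any product. It assumes that text regions are nearly bi-modal (dark ink on light paper) while photographs contain many mid-tones. The page is split into tiles, tiles dominated by mid-tones are marked as picture tiles, and their common bounding box is returned. The function name `find_picture_box` and all thresholds are hypothetical.

```python
def find_picture_box(page, tile=4, low=64, high=192, min_fraction=0.5):
    """Return (top, left, bottom, right) around mid-tone tiles, or None.

    page is a grayscale raster as nested lists (0 = black, 255 = white).
    Only a single bounding box is computed, which is a simplification:
    a page with several pictures would need connected-region grouping.
    """
    height, width = len(page), len(page[0])
    hits = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            pixels = [page[y][x]
                      for y in range(top, min(top + tile, height))
                      for x in range(left, min(left + tile, width))]
            # Count pixels that are neither near-black nor near-white.
            midtones = sum(1 for px in pixels if low <= px <= high)
            if midtones / len(pixels) >= min_fraction:
                hits.append((top, left))
    if not hits:
        return None
    top = min(t for t, _ in hits)
    left = min(l for _, l in hits)
    bottom = min(max(t for t, _ in hits) + tile, height)
    right = min(max(l for _, l in hits) + tile, width)
    return top, left, bottom, right


# An 8x8 white page with a 4x4 gray "photo" in the lower-left corner.
page = [[255] * 8 for _ in range(8)]
for y in range(4, 8):
    for x in range(4):
        page[y][x] = 100 + 10 * x

box = find_picture_box(page)
```

The returned box can then be cut out of the page and treated as its own foreground layer, while the remaining area goes on to the mask and background separation.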

So the MRC method for scanned pages can be accomplished as follows:

1. Segmentation algorithm: detect and isolate the images.
2. Separation algorithm: compute the pixels of the image mask and the color background.
3. Compress each layer using a dedicated compression algorithm.
4. Compose the layers according to an MRC schema such as RFC 2301 in TIFF or a masked image in PDF.
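The outcome of these steps can be pictured as a list of layers, each with a role, a codec and, for isolated images, a position on the page. The sketch below only builds such a plan and does not compress anything. The `Layer` class and `plan_layers` function are hypothetical names, and the codec assignments simply mirror the examples given in this article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels


@dataclass
class Layer:
    role: str            # "background", "mask" or "foreground"
    codec: str           # codec named in the article for this role
    box: Optional[Box]   # None means the layer covers the whole page


def plan_layers(picture_boxes: List[Box]) -> List[Layer]:
    """Describe the layers of one page, in composition order."""
    layers = [
        Layer("background", "JPEG2000", None),
        Layer("mask", "JBIG2", None),
    ]
    # Every isolated picture becomes its own foreground layer.
    for box in picture_boxes:
        layers.append(Layer("foreground", "JPEG", box))
    return layers


plan = plan_layers([(40, 30, 200, 180)])
for layer in plan:
    print(layer.role, layer.codec, layer.box)
```

A writer for TIFF or PDF would walk this list, compress each layer with the named codec and emit it in the container's own layer structure.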

Removing images from the scanned page may also speed up the text recognition process (OCR).

A more interesting function can be offered, though: once these images have been isolated and assigned an individual layer, they can easily be extracted from the document with a suitable tool.

Furthermore, when creating a PDF document from a scanned page, the optional content feature can be used to switch the background and foreground layers on and off.

Our products such as the 3-Heights™ TIFF Toolbox, the 3-Heights™ Scan Server and the 3-Heights™ Optimizer now support the features described in this article. To extract the images from a PDF document the 3-Heights™ PDF Extract tool can be used.

Is this article useful to you? Please let me know and post a comment.