PDF Tools AG | PDF expert blog<br />
PDF Tools Expert Blog - PDF Tools AG is a world-leading manufacturer of programming components and software solutions for generating, processing, displaying and archiving PDF and PDF/A files.<br />
<br />
<b>PAdES - PDF Advanced Electronic Signature</b> (2018-11-13)<br />
What is PAdES? What does it have to do with PDF? What can PAdES do? Detailed answers to all these questions can be found on the web. This article gives a brief overview, a small guide through the jungle of terms.<br />
The concept of digital signatures was introduced in PDF 1.3 and refined in later versions. The PDF Advanced Electronic Signature (PAdES) standard was published by ETSI (European Telecommunications Standards Institute) and is referred to in ISO 32000-2. It is based on the digital signature concept of PDF and describes a set of profiles that make these signatures compliant with the European eIDAS Regulation, which was adopted in July 2014 and is directly binding in all EU member states.<br />
<br />
Here is a brief overview of eIDAS and PAdES.<br />
<ul>
<li><b>ETSI TS 102 778:</b> "Old" Technical Standard (TS) for PDF signatures. Also called "Legacy PAdES".</li>
<li><b>ETSI TS 103 172:</b> "Newer" Technical Standard (TS) for PDF Signatures. This standard is referred to by the eIDAS Regulation.</li>
<li><b>ETSI EN 319 122-1:</b> Standard for CAdES signatures, which are essentially CMS (PKCS #7) signatures with a few extensions. This standard is not used for PDF.</li>
<li><b>ETSI EN 319 142-1:</b> Part 1 is the new European Norm (EN) for PDF signatures. It is based on CAdES, but restricts it so heavily that the two standards do not have much in common in practice. Defines the baseline signature levels <i>B-B, B-T, B-LT and B-LTA</i> (see below).</li>
<li><b>ETSI EN 319 142-2:</b> Part 2 defines additional signature profiles, especially PAdES-CMS, which also includes Legacy PAdES and other formats from ISO 32000-1.</li>
<li><b>ETSI TR 119 100:</b> Describes how to use the signature standards (for CAdES, XAdES and PAdES). Also, how the validity of old signatures can be extended.</li>
<li><b>ISO 14533-3:</b> Long term signature profiles for PDF Advanced Electronic Signatures (PAdES). This standard is referred to by PDF/A-4.</li>
</ul>
<div>
Commission Implementing Decision (EU) 2015/1506 under the eIDAS Regulation (Regulation (EU) No 910/2014) still refers to the previous legacy PAdES baseline signature standard, ETSI TS 103 172.<br />
<br />
The baseline signature levels:</div>
<div>
<ul>
<li><b>B-B: </b>Defines a level for short-term electronic signatures. Must include an electronic signature and the signing certificate.</li>
<li><b>B-T: </b>Like B-B, but additionally includes a time-stamp or time-mark that proves the signature existed at a certain date and time.</li>
<li><b>B-LT: </b>Like B-T, but adds VRI data to the DSS, such as OCSP responses or CRLs, plus all certificates of the trust chain, from the user certificate up to the root CA certificate. This level allows a document signature to be validated even after a long period of time, when the signing environment (e.g. the signing CA) is no longer available. The B-LT level is recommended for <i>Advanced Electronic Signatures</i>.</li>
<li><b>B-LTA: </b>Like B-LT, but includes a document time-stamp and VRI data for the TSA in the DSS. A B-LTA level may help to validate the signature beyond any event that may limit its validity. This level is recommended for <i>Qualified Electronic Signatures</i>.</li>
</ul>
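The level hierarchy above can be pictured as a simple classification over the artefacts a signature carries. The following sketch is purely conceptual; the flags are illustrative and do not correspond to the API of any particular signing toolkit:

```python
def baseline_level(has_timestamp: bool, has_validation_data: bool,
                   has_document_timestamp: bool) -> str:
    """Classify a PAdES baseline signature level from the artefacts
    present in the PDF (conceptual sketch, flag names are illustrative).

    has_timestamp:          a signature time-stamp or time-mark exists
    has_validation_data:    VRI data (OCSP/CRL, trust chain) is in the DSS
    has_document_timestamp: a document time-stamp covers the whole file
    """
    if has_timestamp and has_validation_data and has_document_timestamp:
        return "B-LTA"
    if has_timestamp and has_validation_data:
        return "B-LT"
    if has_timestamp:
        return "B-T"
    return "B-B"

print(baseline_level(True, True, False))  # prints B-LT
```

Each level strictly adds artefacts to the previous one, which is why a simple cascade of checks is enough here.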
<div>
And, the types of electronic signatures:</div>
</div>
<div>
<ul>
<li><b>Basic Level Electronic Signature: </b>Data in electronic form which is attached to or logically associated with other data in electronic form and which is used by the signatory to sign.</li>
<li><b>Advanced Electronic Signature: </b>The signatory can be uniquely identified and linked to the signature. The signatory must have sole control of the signature creation data (typically a private key) that was used to create the electronic signature. The signature must be capable of identifying if its accompanying data has been tampered with after the message was signed. In the event that the accompanying data has been changed, the signature must be invalidated.</li>
<li><b>Qualified Electronic Signature (QES): </b>The service provider must provide a valid time and date for created certificates. Signatures with expired certificates must be revoked immediately. Personnel employed by the qualified trust service provider must be appropriately trained. Software and hardware used by the service provider must be trustworthy and capable of preventing certificate forgery.</li>
</ul>
<div>
And finally, a few abbreviations:</div>
<div>
<div>
<ul>
<li><b>CA: </b>Certification Authority</li>
<li><b>CMS: </b>Cryptographic Message Syntax</li>
<li><b>CRL: </b>Certificate Revocation List</li>
<li><b>OCSP: </b>Online Certificate Status Protocol</li>
<li><b>PKCS: </b>Public Key Cryptography Standards (e.g. PKCS #7)</li>
<li><b>TSA: </b>Time-stamp Authority</li>
<li><b>VRI: </b>Verification Related Information (e.g. OCSP, CRL)</li>
<li><b>DSS: </b>Document Security Store (PDF)</li>
<li><b>XAdES:</b> XML Advanced Electronic Signature</li>
</ul>
</div>
</div>
<div>
We have implemented the new PAdES standard in our software so that digital signatures in PDF can easily be created, updated and verified in applications that must conform to the European eIDAS regulations. The implementation produces signatures that conform to all the PAdES standards mentioned above without any specific configuration. This makes the tool easy to use, because it does not require detailed knowledge of which standard to apply.<br />
<br /></div>
<div>
</div>
</div>
<div>
I guess this material is hard to digest. So if you have any questions, please let me know.</div>
<b>Does OCR make sense for digitally generated PDFs?</b> (2018-10-03)<br />
Scanned PDF files usually consist of one raster image for each page. The OCR engine can recognize the text in this image and make the document searchable. But what about digitally generated documents?<br />
Digitally born documents contain individually generated content objects, such as text, geometric figures and raster images. The objects are often overlaid by means of transparency and use spot colors for printing. In addition, the documents may be enriched with structural information such as articles, reading direction and tags (title, paragraph, header, footer, etc.).<br />
<br />
In many cases the text is embedded so that it is machine-readable. However, it is not uncommon for this information to be missing: often the text is embedded in the form of geometric lines and curves, or as part of a raster image.<br />
<br />
A naive approach would be to rasterize the page and then pass it to the OCR engine. As a result, you would lose all the details of the digitally generated page. It is therefore worthwhile to choose a different approach.<br />
<br />
A good OCR tool for digitally generated PDF files can enrich unreadable fonts with Unicode information, recognize text in embedded images, and even create missing structure information, thus preparing the document for PDF/A conformance level A. Furthermore, the tool should also be able to recognize bar codes and QR codes and write their content into the metadata of the document. With all these features, the tool may serve as an essential component of a Robotic Process Automation (RPA) solution.<br />
<br />
Of course, such a tool should be able to handle scanned, digital born and mixed files. As usual, scanned pages are straightened, stains removed, and the recognized text invisibly placed on top of the image, making it searchable like a digitally generated document.<br />
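A tool that handles scanned, digitally born and mixed files needs a routing decision per page. The sketch below is a purely illustrative heuristic (the flags and the 50% threshold are our assumptions, not the logic of any particular product):

```python
def needs_ocr(has_text_objects: bool, has_unicode_mapping: bool,
              image_area_ratio: float) -> bool:
    """Decide whether a page should be sent to the OCR engine.

    Illustrative heuristic: OCR is needed when the page has no
    machine-readable text at all, when the embedded fonts lack a
    Unicode mapping, or when a large part of the page area is covered
    by raster images that may contain text.
    """
    if not has_text_objects:
        return True          # purely scanned or vector-outline page
    if not has_unicode_mapping:
        return True          # text present, but not machine-readable
    return image_area_ratio > 0.5  # mixed page, mostly image
```

Routing like this also keeps the number of OCR engine calls low, since cleanly generated digital pages bypass the engine entirely.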
<br />
With the 3-Heights™ PDF OCR Tool we have created such a tool. As part of the 3-Heights™ PDF Quality Gate solution, it ensures that the documents are enriched for further processing. The 3-Heights™ PDF OCR tool also optimizes the number of accesses to the OCR engine to keep the license costs low and increase performance.<br />
<br />
<br /><b>Using native applications in a PDF document conversion service</b> (2018-09-05)<br />
Automated conversion of Office documents into PDF has become a popular service. When designing the architecture of such a service, the question arises whether the native application or a specially developed software library should carry out the conversion. The pros and cons are not obvious, so it is worth taking a closer look.<br />
If high-quality rendering is required, then using the native application is a must. This is especially true for the Microsoft Office products. Alternatives such as LibreOffice or software libraries show major flaws in the visual representation. A typical example is hidden content resulting from a faulty resolution of transparent objects. Likewise, if number columns are rendered left-justified rather than right-justified, the numbers become hard to read.<br />
<br />
Software libraries for format conversion are usually easy to integrate and operate. On the other hand, it is not trivial to use native applications for an automated conversion process, as they were designed primarily for interactive use. Although many of these applications can be controlled remotely via a program interface, they have some special characteristics that must be taken into consideration. For example, the user-specific configuration of the application influences the conversion process. Or pop-up windows suddenly appear that need to be dismissed. Most of these applications must also run in a user session and cannot run in the context of a service. Furthermore, many of these applications are not suitable for mass processing and must be regularly monitored and restarted.<br />
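The monitoring-and-restart requirement usually ends up as a watchdog around the native application. A minimal sketch of that logic, assuming the converter is driven as an external command line (the command here is a stand-in, not a real converter):

```python
import subprocess
import sys

def convert_with_retry(cmd, timeout_s=60, max_attempts=3):
    """Run a (hypothetical) conversion command line, killing and
    retrying it when it hangs or fails - a sketch of the watchdog
    an automated service needs around a native application."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return True   # conversion finished successfully
        except subprocess.TimeoutExpired:
            pass              # the application hung; it is killed and retried
    return False              # give up after max_attempts

# Example: a trivial stand-in command that succeeds immediately.
ok = convert_with_retry([sys.executable, "-c", "pass"], timeout_s=30)
```

A real service would additionally recycle the application process after a fixed number of conversions and route repeated failures to a quarantine queue.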
<br />
<div>
Although many of the native applications today have a "save as PDF/A" feature, the generated files lack fidelity or do not conform to the PDF/A standard. Therefore, in many cases it is better if the service first generates a normal PDF file and then converts it to PDF/A. In some cases a "Save as ..." function for normal PDF files is not available at all. In these cases, a specialized printer driver must be used, which ensures flawless rendering via the print function.<br />
<br />
It pays to use native applications for the conversion service: ultimately, the resulting quality of the generated documents is what the user sees. However, the better quality has its price, as the conversion service has to master all of these challenges. Simpler and cheaper products usually opt for conversion libraries; only products in the highest league are able to meet the high quality requirements for the rendered documents.</div>
<b>Importing images into a PDF file - a seemingly trivial task</b> (2018-08-09)<div>
A picture is worth a thousand words. That is why pictures are so readily embedded in PDF files. One would expect embedding images in a PDF file to be a simple task. Because it seems so easy, there are many tools for it, including free ones. But do these tools do what you expect them to do?</div>
<div>
<br /></div>
<div>
A closer look reveals that embedding images is anything but trivial.</div>
<div>
<br />
</div>
<div>
Let's start by embedding image data of the popular JPEG format. Many PDF creation programs simply take the JPEG data stream and embed it as-is in the PDF file. Does that just work? The answer is: in most cases, but not always. The PDF standard requires that only so-called baseline JPEGs be embedded. But there are also multi-scan, progressive and arithmetically encoded JPEGs. These are not allowed and cannot be displayed by many PDF viewers. The fact that Acrobat displays such PDF files without an error message does nothing to prevent their dissemination. This is especially troublesome when the file claims to conform to PDF/A, as many PDF validation tools do not examine the image streams for conformance with the standard.</div>
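Whether a JPEG stream is baseline can be determined by walking its marker segments and inspecting the start-of-frame (SOF) marker. A simplified sketch (it ignores restart markers and stops at the first SOF, which is sufficient between SOI and SOS):

```python
import struct

# SOF marker codes and the JPEG coding process they indicate.
SOF_TYPES = {
    0xC0: "baseline",
    0xC1: "extended sequential",
    0xC2: "progressive",
    0xC9: "arithmetic (extended)",
    0xCA: "arithmetic (progressive)",
}

def jpeg_coding_process(data: bytes) -> str:
    """Walk the JPEG marker segments and report the coding process,
    so a PDF writer can decide whether the stream may be embedded
    as-is or must be re-compressed (simplified sketch)."""
    if data[:2] != b"\xff\xd8":           # SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt marker stream")
        marker = data[pos + 1]
        if marker in SOF_TYPES:
            return SOF_TYPES[marker]
        # Every segment between SOI and SOS carries a 2-byte length.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    raise ValueError("no SOF marker found")
```

A writer would embed the stream unchanged only when this reports "baseline" and otherwise re-compress.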
<div>
<br /></div>
<div>
<div>
It becomes similarly problematic when fax G3 image data streams are transferred from a TIFF container into a PDF file. The specifications of TIFF and PDF are, for whatever reason, slightly different, so this transfer can go quite wrong.</div>
<div>
<br /></div>
<div>
When embedding image data, the format of the image source must be carefully analyzed and the data stream converted or even re-compressed to conform to the PDF specification.</div>
</div>
<div>
<br /></div>
<div>
But there are other reasons to edit the image stream. JPEG streams also often contain many segments that are not needed in PDF or that must be stored in another place. For example, Exif data (camera settings, GPS location, etc.) should be extracted, converted to the XMP metadata format, and assigned to the image object as a separate property. Other segments, such as private Photoshop data, can also be removed because they have no use and only take up a lot of space.</div>
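The per-segment decisions described above can be captured in a small policy table. The marker numbers below are the real JPEG APPn codes, but the policy itself is an illustrative example, not a normative rule:

```python
# Illustrative policy for JPEG application segments when embedding
# the stream into a PDF (marker codes are real; the policy is ours).
SEGMENT_POLICY = {
    0xE1: "extract",  # APP1: Exif -> convert to XMP, attach to image object
    0xED: "drop",     # APP13: private Photoshop data, dead weight in PDF
    0xEE: "keep",     # APP14: Adobe color transform hint, still needed
}

def segment_action(marker: int) -> str:
    """Action for one application segment; anything not listed is kept."""
    return SEGMENT_POLICY.get(marker, "keep")
```

Combined with a marker walk, this lets a writer rebuild the JPEG stream segment by segment while diverting the Exif data into the PDF's XMP metadata.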
<div>
<br /></div>
<div>
Apart from the image stream, there is other information in the source images that should be transferred to the PDF file but is often forgotten, typically color profiles and metadata. But it is not that simple. Since TIFF files and other image formats cannot be embedded directly in a PDF file, the containers must be unpacked and converted so that their content can be transferred to the PDF file. For example, the information in a TIFF file is stored in so-called tags, which contain metadata in addition to the image data. This metadata must first be converted to the XMP format before it can be embedded in the PDF file. It is similar with colors: descriptions of color spaces are often not stored simply as color profiles and have to be converted into a PDF color space description.</div>
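Unpacking a TIFF container starts with its byte-order header and the tag entries of the first image file directory (IFD). A simplified sketch of that first step (no multi-page TIFFs, no BigTIFF, and only the tag IDs are read):

```python
import struct

def tiff_tag_ids(data: bytes):
    """Read the tag IDs of the first IFD of a TIFF file.

    Simplified sketch: a real unpacker would also read each entry's
    type, count and value to recover metadata and color information.
    """
    byte_order = {b"II": "<", b"MM": ">"}[data[:2]]   # little/big endian
    magic, ifd_offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(byte_order + "H",
                             data[ifd_offset:ifd_offset + 2])
    tags = []
    for i in range(count):                            # 12 bytes per entry
        entry = data[ifd_offset + 2 + 12 * i: ifd_offset + 14 + 12 * i]
        (tag_id,) = struct.unpack(byte_order + "H", entry[:2])
        tags.append(tag_id)
    return tags
```

Tag 256, for instance, is ImageWidth; metadata-bearing tags found this way are what must be converted into XMP before embedding.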
<div>
<br /></div>
<div>
So there are many small details to consider that are not properly or not at all handled by many tools. It is therefore worthwhile to use a professional tool for a seemingly trivial task.</div>
<div>
<br /></div>
<div>
A picture is worth a thousand words. However, it is certainly necessary to write more than a thousand words about how images should be embedded correctly in a PDF file.</div>
<div>
<br /></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-80767744050152726042018-06-26T09:02:00.000+02:002018-07-05T18:19:40.013+02:00PDF 2.0 - A quick overviewIt is rare for industrial products to survive for more than 20 years – especially in the IT industry. Not even the inventors of the PDF could have imagined just how successful their file format would be when they launched the first version of Acrobat in June 1993. The members of the International Organization for Standardization (ISO) have been working on the next generation of this popular format.<br />
Since ISO 32000-1, entitled “Document Management – Portable Document Format – Part 1: PDF 1.7”, was published in mid-2008, the sixth edition of Adobe’s famous PDF Reference has not changed significantly – it was merely translated into the ISO language. But this has changed with the second part of the standard, “Part 2: PDF 2.0”, which has been published recently. This new version has been created by the ISO members, or to be precise by Technical Committee 171, Sub-committee 2. To make it clear that this is a new standard, a ‘2’ has been added to the main version number.<br />
<br />
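Whether a file announces itself as PDF 2.0 can be read directly from its header line. A minimal sketch; note that since PDF 1.4 the /Version key in the document catalog may override the header, which this sketch ignores:

```python
def pdf_version(first_bytes: bytes) -> str:
    """Read the version from the PDF header line, e.g. b'%PDF-2.0'.

    Sketch only: PDF 1.4 and later files may override the header
    version via the /Version key in the document catalog.
    """
    if not first_bytes.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    return first_bytes[5:8].decode("ascii")

print(pdf_version(b"%PDF-2.0\n"))  # prints 2.0
```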
The list of changes contains more than 50 entries. The most important changes and improvements relate to the following areas:<br />
<ul>
<li>Encryption: unencrypted wrapper for encrypted documents, 256-bit AES encryption, Unicode passwords</li>
<li>Digital signatures: signatures based on the CAdES standard, certificates based on elliptic curves, long-term signature validation (LTV)</li>
<li>Annotations: projections, 3D, rich media</li>
<li>Accessibility: pronunciation hints</li>
<li>3D: support for the new ISO standard ‘PRC’, 3D measurements</li>
<li>Document parts (introduced in PDF/VT)</li>
</ul>
The committee has also been brave enough to scrap some outdated features; the main ones are:<br />
<ul>
<li>XFA forms: Adobe’s XML-based form technology has been a constant source of frustration for many providers</li>
<li>Movie, sound: multimedia content is not compatible with the concept of a portable document format</li>
<li>Superfluous, redundant, outdated or non-portable information, such as the document information dictionary (replaced by XMP), outdated digital signatures, OS-dependent file names and rarely used standards, such as OPI (Open Prepress Interface)</li>
</ul>
There have also been some major revisions to the new part of the standard, particularly in the following chapters:<br />
<ul>
<li>Rendering</li>
<li>Transparency</li>
<li>Digital signatures</li>
<li>Metadata</li>
<li>Tagged PDF and accessibility support</li>
</ul>
But the numerous changes have taken their toll. It has taken seven years to create the second part, much longer than was needed for previous versions. In fact, Adobe managed to release seven versions in just 15 years – and in outstanding quality. On the plus side, the second part of the standard has received extensive input from the ISO members, and many parts of the text are worded more clearly. This makes it easier for the industry to understand the specification, increase the implementation quality and thereby improve interoperability. It is hoped that this will result in far fewer ‘bad’ PDFs.<br />
<br />
For the main uses of PDF – i.e. archiving (PDF/A), document exchange (PDF/X), engineering (PDF/E) and accessibility (PDF/UA) – ISO has defined special sub-standards, most of which are based on the first part of the PDF standard. It is likely that these standards will also be adapted to make them relevant to the second part. However, it should not be assumed that the master standard will now be ‘new’ and the sub-standards ‘old’.<br />
<br />
Instead, the development of these standards should be seen as an interaction. For example, many changes in the second part of the PDF standard are based on findings derived from working on the sub-standards and incorporated in the development. In addition – unlike the PDF master standard – there is no real urgency to change the PDF/X, PDF/E and PDF/UA sub-standards, as they have been optimized independently of Adobe for some time now. However, the situation for PDF/A is somewhat different.<br />
<br />
As soon as the first Version 2.0 PDF files are created, the question will arise as to how they can be archived in accordance with the standard. PDF/A must have an answer to this question. Unlike the other PDF sub-standards, this application is under a certain degree of time pressure. But the sheer number of changes is making it difficult to find a quick solution.<br />
<br />
The PDF 2.0 standard is still very young and there are hardly any files in circulation. Time will tell if the standard will work and how quickly manufacturers will implement it.<br />
<br />
<b>How to deal with poor PDF quality</b> (2018-06-06)<br />
"Quality is remembered long after the price is forgotten", says a Gucci family slogan. Nevertheless, creators of PDF documents, from private users to large companies, regularly produce files of insufficient quality, causing unexpected problems and costs in downstream document processing steps. Companies are therefore forced to equip their document 'inbox' with a quality control system.<br />
The exchange of electronic documents in business processes has become a matter of course. The quality of these documents therefore plays an important role in ensuring smooth operation. Many organizations have been surprised by the poor quality of the files delivered to them and have been forced into action by processing issues and even production downtime.<br />
<br />
An important measure is the establishment of a quality control system for incoming documents - a <a href="https://www.pdf-tools.com/pdf20/en/pdf-quality-gate/" target="_blank">quality gate</a>, as we call it. Before this can be done, one has to be clear about what quality means. In this context, we can distinguish two types of quality, the inherent quality and the dedicated quality. Inherent quality is usually understood as:<br />
<ul>
<li>Conformance with the file format specification (ISO 32000)</li>
<li>Efficient, non-redundant and memory-friendly use of the PDF language</li>
</ul>
<div>
Dedicated quality is mostly focused on applications such as:</div>
<div>
<ul>
<li>Scanning</li>
<li>Document exchange</li>
<li>Publishing</li>
<li>Printing</li>
<li>Archiving</li>
<li>etc.</li>
</ul>
<div>
For files that are to be printed, for example, it is important that fonts and colors are optimized, but usually no structural information is needed. For archiving, on the other hand, files must conform to the PDF/A standard, whereas this is not necessary for printing.<br />
<br />
Such a quality gate must therefore offer the following functions:<br />
<ul>
<li>Validation: standard conformance checks as well as configurable custom checks such as minimum resolution of scanned images, legibility of text, etc.</li>
<li>Repair and conversion: Fix file corruptions, ensure conformity with a standard such as PDF/A.</li>
<li>Optimization: reduce file size, merge and subset fonts, convert colors, adapt resolution of images.</li>
<li>Digital Signing: seal the document, enable detection and prevention of unauthorized modifications.</li>
</ul>
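The "configurable custom checks" mentioned above can be as simple as a resolution rule: the effective resolution of a placed image is its pixel count divided by the placed size in inches (72 pt = 1 inch). A sketch, where the 200 dpi threshold is an example value and not a recommendation:

```python
def effective_dpi(pixel_width: int, placed_width_pt: float) -> float:
    """Effective resolution of an image placed on a PDF page:
    pixels divided by the placed width in inches (72 pt = 1 inch)."""
    return pixel_width / (placed_width_pt / 72.0)

def resolution_check(pixel_width: int, placed_width_pt: float,
                     min_dpi: float = 200.0) -> bool:
    """One configurable custom check of a quality gate: reject scans
    whose effective resolution falls below a minimum."""
    return effective_dpi(pixel_width, placed_width_pt) >= min_dpi

# An A4-wide scan (595 pt) of 1654 px is roughly 200 dpi.
print(resolution_check(1654, 595))  # prints True
```

A quality gate would run a battery of such checks and route failing documents to repair, conversion or rejection.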
<div>
The introduction of a quality control system costs money. But as the slogan at the beginning indicates: customers and employees will not remember the initial cost, but rather the problems solved by the better quality. And this avoids the most important cost of all: the cost of a damaged reputation.<br />
<br />
Find out more about our <a href="https://www.pdf-tools.com/pdf20/en/products/pdf-converter-validation/conversion-service/" target="_blank">4-Heights™ Conversion Service</a>.</div>
</div>
</div>
<b>The versatility of a PDF viewer</b> (2017-05-22)<br />
Almost every user knows: a PDF viewer is not just a display tool for a well-known document format. It provides many more functions, runs on many platforms, offers interfaces in many technologies and serves as an important component of many applications. On closer inspection it is a true marvel.<br />
The majority of traditional viewers are standalone interactive programs running on a personal computer. But meanwhile, viewers are also available as seamlessly integrated controls in the user dialogs of applications such as MS Access. Viewers can also be found in embedded systems, such as a display unit in an airplane. Some viewers are specialized tools in digital signature applications, called "secure viewers". And nowadays viewers are part of web applications running on the HTML and JavaScript platform of a common web browser.<br />
<br />
Most navigation functions, such as scrolling pages, jumping to bookmarks, zooming and rotating, are self-evident. On top of these basic functions, some viewers provide a wide variety of more or less useful tools to manipulate a PDF document. The most requested functions, however, are:<br />
<ul>
<li>adding annotations</li>
<li>filling out forms</li>
<li>applying digital signatures</li>
</ul>
<div>
There are more complex functions, such as assembling documents from different source documents, but this is usually out of the scope of a simple viewer application and is provided as a specialized tool.</div>
<div>
<br /></div>
A viewer component that is meant to be part of a software application must provide interfaces for at least the .NET and Java technologies. And the COM technology ActiveX is still widely used in a variety of development environments. But even if a viewer component concentrates on .NET and Java, it has to provide interfaces for the various GUI flavors, such as WPF vs. Windows Forms on one side and AWT/Swing vs. SWT on the other.<br />
<br />
In order to cover the needs of our customers, we have developed viewer components for various technologies such as <a href="http://www.pdf-tools.com/pdf20/en/products/pdf-rendering-desktop-tools/#tab-list-4">.NET</a>, <a href="http://www.pdf-tools.com/pdf20/en/products/pdf-rendering-desktop-tools/#tab-list-3">Java</a> and <a href="http://www.pdf-tools.com/pdf20/en/products/pdf-rendering-desktop-tools/#tab-list-2">ActiveX</a>. Based on these viewer components we also provide a PDF Document Assembler Tool. The newest component is a pure JavaScript viewer running in web browsers. I will post separate articles on some of these topics.<br />
<br />
As always, I appreciate your feedback on this article.<br />
<br />
<b>The caveats of assembling PDF/A documents</b> (2017-01-29)<div>
Assembling PDF documents from various sources is a crucial part of an output management system. And, as the document needs to be archived in most cases, it should conform to the PDF/A standard. But, is there a way to assemble a document and accomplish PDF/A conformance in one step?</div>
<div>
</div>
<div>
<br /></div>
<div>
An assembled document may be a raw transaction document originating from an enterprise resource planning system. Usually it is embellished with corporate identity elements and complemented with some white space advertising before it is sent to the customer. Or, it might be a very complex FDA documentation for a drug development and approval process containing thousands of pages of lab reports, clinical studies and the like.<br />
<br />
Whatever the purpose of an assembled document might be, the common challenge is to create a document or a set of documents with a consistent appearance of all its parts. It should look as if it was created by a single application. In order to achieve this, most output management systems use document assembly toolboxes which typically offer the following functions:</div>
<div>
<ul>
<li>Merge documents from multiple sources</li>
<li>Insert empty pages (for duplex printing)</li>
<li>Insert pages which are created on-the-fly (table of contents, etc.)</li>
<li>Delete unnecessary pages</li>
<li>Sort pages in any order (booklet, reverse order, etc.)</li>
<li>Rotate pages (portrait, landscape)</li>
<li>Scale a page (shrink from A3 to A4, convert from Letter to A4, etc.)</li>
<li>Crop a page (make register and crop marks invisible)</li>
<li>Add page overlays and underlays (corporate identity)</li>
<li>Arrange multiple pages on one sheet (2-up, 4-up, 6-up etc.)</li>
<li>Add content to a page such as OMR marks, bar code, pagination, watermarks etc.</li>
<li>Add XMP metadata to the document</li>
<li>Set the document's output intent color profile</li>
<li>Remove unnecessary features such as named destinations, tagging etc.</li>
</ul>
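An operation such as "scale a page (shrink from A3 to A4)" boils down to a uniform scale factor and a PDF transformation matrix. A small sketch with common page sizes in PDF points (1 pt = 1/72 inch):

```python
# Common page sizes in PDF points (width, height).
A3 = (842.0, 1191.0)
A4 = (595.0, 842.0)
LETTER = (612.0, 792.0)

def fit_scale(src, dst):
    """Uniform scale factor that makes a source page fit entirely
    onto the destination page, e.g. A3 -> A4."""
    return min(dst[0] / src[0], dst[1] / src[1])

def scale_matrix(s):
    """PDF transformation matrix [a b c d e f] for uniform scaling
    about the page origin (a centering translation would go in e, f)."""
    return [s, 0.0, 0.0, s, 0.0, 0.0]

s = fit_scale(A3, A4)  # about 0.707: A3 content shrinks onto A4
```

A real assembly toolbox applies such a matrix to the page content and adjusts the page boxes accordingly.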
<div>
<div>
Furthermore, since the assembled document is sent to a customer or another business partner it has to be archived and thus conform to the PDF/A standard.</div>
<div>
<br /></div>
<div>
In general, there are two ways to make an assembled document conform to PDF/A: </div>
</div>
<div>
<ul>
<li>Assemble the documents while disregarding their PDF/A conformance in a first step and then convert the result into a PDF/A document in a second step.</li>
<li>Assemble the document from sources that already conform to PDF/A, create new conforming content and process all parts such that PDF/A conformance is maintained.</li>
</ul>
The first approach might be easier to implement, since it imposes fewer requirements on the quality of the source documents. In high-performance, high-volume applications, however, the second approach might be the only feasible solution.<br />
<br />
One of the main challenges to master is consolidating the output intents. Each input file can have a different output intent, so all color spaces must be checked and adapted to reflect the new output intent before they can be used in the output document. There are many other challenges, such as handling fonts, which I will cover in detail in separate articles.</div>
<div>
<br /></div>
<div>
We have designed a component which offers most of the above features and some more. In addition, the <a href="http://www.pdf-tools.com/pdf20/en/products/pdf-manipulation/pdf-toolbox/">3-Heights™ PDF Toolbox</a> is capable of creating PDF/A conforming output documents assembled from multiple sources and content generated on-the-fly. </div>
</div>
<div>
<br /></div>
<div>
I hope this article is useful. As usual, I would appreciate your feedback and comments.</div>
<b>Digging for information by extracting data from a PDF document</b> (2016-12-04)<br />
Extracting text from a PDF document is one of the most popular information retrieval functions. But what about other information, such as images, metadata and more? It can be simple - but also tricky.<br />
Among the easiest things to extract is metadata. The document metadata can usually be extracted as a short XMP stream, and even if the document contains an old-fashioned information dictionary, the extraction of the key/value pairs is not a big deal. Similarly straightforward are outlines (bookmarks) and navigation aids such as named destinations, links and the like.<br />
<br />
However, the extraction of the graphics content of a page is much more complex. Theoretically, it is possible to extract each content object and its associated resource objects and use them to create an HTML page, or a page in any other description language. In practical applications this proves too complex, due to the graphics model that PDF offers. This model has some unique features, such as patterns, shadings and transparency groups with a variety of blend modes. Furthermore, the scan conversion rules differ significantly from those built into commercially available graphics processors. Thus, the mapping of a PDF page description to HTML, PCL or even PostScript can only be achieved by transforming the page description using transparency flattening and other techniques.<br />
<br />
For this reason, if one has to convert the page contents to another document format then it is much wiser to use a specialized converter tool such as the <a href="http://www.pdf-tools.com/pdf/pdf-to-image-converter-tiff.aspx?l=en-us">PDF to Image Converter</a>.<br />
<br />
Most applications deal with the extraction of text. Typical areas of use are the classification of transaction documents such as invoices, the implementation of a text search function in document repositories and many more. For further information please refer to this article: <a href="http://blog.pdf-tools.com/2014/01/why-is-extraction-of-text-from-pdf.html">Why is the extraction of text from a PDF document such a hassle?</a><br />
<br />
As outlined above, the extraction of information from a PDF document can be very simple but also quite tricky. It depends on what kind of information the application requires. In order to make the programming of such applications as easy as possible we have created a specialized tool, the <a href="http://www.pdf-tools.com/pdf/pdf-extract-content-metadata-text.aspx">PDF Extract Tool</a>. It offers an easy-to-use interface which has been designed based on the above insights. Most use cases can be handled with only a few lines of code. This is achieved by hiding some features of the PDF graphics model such as coordinate transforms from the programmer.<br />
<br />
I hope this article was useful to you. If you have any questions, don't hesitate to post a comment. I'll be happy to answer it.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-28564731964131902052016-10-11T07:15:00.000+02:002017-03-29T07:21:00.820+02:00Using blockchains as an alternative to PKIs for digital signaturesThe traditional technical environment for a digital signature is the public key infrastructure (PKI). Digital signatures are also used to implement electronic money such as Bitcoin. However, Bitcoin uses a new technology, the blockchain. This new technical infrastructure can also be employed to sign documents. But what are the benefits?<br />
<br />
<a name='more'></a>A public key infrastructure allows the owner of a private key to create digital signatures and anyone with the corresponding public key to verify them. A blockchain is a global database that allows transactions (transfers of funds) to be securely authorized and verified. A blockchain can now be used to build a digital signature infrastructure with some unique properties.<br />
<br />
<b>Multiple Signatures</b><br />
<br />
With a blockchain it is possible to support quorums, e.g. requiring that 3 of 5 signers (or any other combination) sign a document in order for the signature to be valid. This feature can also be used for counter signatures.<br />
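The quorum idea can be sketched in a few lines. This is a purely illustrative sketch, not a real blockchain API; the function name and the representation of signers as simple set elements are assumptions made for the example.

```python
# Hypothetical sketch of an m-of-n signing quorum check.
# In a real system, `received` would be the set of signers whose
# cryptographic signatures verified against the document hash.
def quorum_satisfied(required: int, authorized: set, received: set) -> bool:
    """Return True if at least `required` of the authorized
    signers have produced a valid signature."""
    valid = received & authorized   # ignore signatures from unknown keys
    return len(valid) >= required

# A 3-of-5 quorum: signatures from A, C and E suffice
signers = {"A", "B", "C", "D", "E"}
print(quorum_satisfied(3, signers, {"A", "C", "E"}))  # True
print(quorum_satisfied(3, signers, {"A", "X"}))       # False
```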
<br />
<b>Time Constraints</b><br />
<br />
A time constraint can be set within which a document must be signed in order for the signature to be valid.<br />
<br />
<b>Time Stamping</b><br />
<br />
Every transaction which is recorded in the blockchain is time stamped within minutes.<br />
<br />
There are some other features that differentiate a blockchain-based digital signature scheme from a traditional PKI-based scheme. And although this technology is still in its pioneer phase, it has already been adopted in some applications in the financial industry.<br />
<br />
We are planning to adapt our 3-Heights(TM) Security Tool to support the interfaces of selected blockchain service providers.<br />
<br />
I would be very interested in your opinion about blockchain based digital signatures. If you know of current and future applications, requirements and benefits please share them and post a comment.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-63503555931145317122016-05-10T14:17:00.000+02:002017-03-29T07:16:53.670+02:00How to render the text of a PDF document if the font is not embedded?Every developer of a PDF viewer, a PDF printer and a PDF to Image Converter tool comes across the requirement to render non-embedded fonts and faces quite a challenging task. Not only developers but also users of these tools might be interested in non-embedded fonts and how they are treated by these tools.<br />
<a name='more'></a><br />
Every character of a text in a PDF document is assigned to a font which defines the appearance of the characters, e.g. the widths of the stems, the height of the small letters compared to the capital letters, serifs, character metrics and many more properties. All these properties and the exact appearance of a character are described in a font program (essentially either a TrueType or a PostScript Type 1 program).<br />
<br />
The creator of a PDF document can choose whether the font program is embedded in the file or whether it is only referred to by name. If the font program is embedded then it can be used by the rendering engine to display the text. However, if the font is not embedded and the file is displayed or printed on a different computer then it is not guaranteed that the original font program is still available. In this case the rendering engine has to find a replacement for the original font. However, this process can be quite challenging. Why? Consider the following:<br />
<ul>
<li>The name of the font does not guarantee that a system font with the same name is identical to the original font, e.g. the <i>Arial </i>font which was used to create the document might be different to the <i>Arial </i>font which is installed on the system which is used to display the document.</li>
<li>The name of the embedded font might be different to the name of the installed font, e.g. the embedded font may be named <i>TimesRoman </i>and the equivalent installed font may be named <i>Times New Roman</i>.</li>
<li>The embedded font may not be available at all on the target system, e.g. no installed font named <i>Coronet</i> may exist.</li>
</ul>
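The name-matching step described above can be sketched as follows. This is an illustrative sketch only: the alias table, the normalization rules and the subset-tag pattern are assumptions for the example, not the actual logic of any rendering engine.

```python
import re

# Hypothetical sketch: finding an installed replacement font by
# normalizing font names and applying a (made-up) alias table.
ALIASES = {"timesroman": "timesnewroman"}

def normalize(name):
    name = re.sub(r"^[A-Z]{6}\+", "", name)      # drop subset tag, e.g. "ABCDEF+"
    name = re.sub(r"[\s\-,]", "", name).lower()  # ignore spaces, hyphens, case
    return ALIASES.get(name, name)

def find_replacement(embedded_name, installed):
    wanted = normalize(embedded_name)
    for font in installed:
        if normalize(font) == wanted:
            return font
    return None   # no match: fall back to a synthesized font

print(find_replacement("ABCDEF+TimesRoman", ["Times New Roman", "Coronet"]))
# Times New Roman
```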
<div>
In order to solve these problems some rendering engines do not use system fonts as replacement fonts but provide a set of font templates which are used to construct the definitive font program on the fly, based on font metrics such as the font weight, the height of the small characters, the character widths etc. </div>
<div>
<br /></div>
<div>
Multiple master fonts are one such technology to tailor font programs. Multiple master fonts are font templates from which specific font instances can be produced by providing a font design vector containing elements such as the font weight and the character width. Conventional fonts such as TrueType or PostScript Type 1 fonts are not suited for this purpose. Why? If you change the width of a character of a TrueType font then you automatically change its stem width as well; with a multiple master font the stem width and the character width can be changed independently.</div>
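The core of a multiple master font is interpolation between master designs: an instance glyph is a weighted blend of the corresponding points of the masters. The toy sketch below illustrates only this blending idea; the coordinates are made up and real fonts blend full outline descriptions, not three points.

```python
# Toy sketch of multiple-master instance generation: each instance
# point is the weighted average of the corresponding master points.
def blend(masters, weights):
    assert abs(sum(weights) - 1.0) < 1e-9   # weights form a design vector
    n = len(masters[0])
    return [
        (sum(w * m[i][0] for m, w in zip(masters, weights)),
         sum(w * m[i][1] for m, w in zip(masters, weights)))
        for i in range(n)
    ]

light = [(0, 0), (10, 0), (10, 100)]   # hypothetical stem outline, light master
bold  = [(0, 0), (30, 0), (30, 100)]   # same outline in the bold master
semi = blend([light, bold], [0.5, 0.5])
print(semi)   # [(0.0, 0.0), (20.0, 0.0), (20.0, 100.0)]
```

Because each axis (weight, width, ...) gets its own pair of masters, properties that are coupled in a single TrueType outline can be varied independently.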
<div>
<br /></div>
<div>
Multiple master fonts are only available as a replacement for simple fonts, that is fonts with a small, defined set of characters. For fonts with large character sets such as Chinese, Korean and Japanese fonts it is better to use a set of defined font program resources.</div>
<div>
<br /></div>
<div>
Our 3-Heights(TM) PDF Rendering Engine uses a mixture of the described technologies: it tailors replacement fonts and uses installed system fonts.</div>
<br />
I hope this article was useful to you. If you have further questions or comments, please post a comment.<br />
<br />
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-73322765794792018452016-03-31T07:18:00.000+02:002017-03-29T07:21:52.922+02:00Inline images and Type 3 fontsI often hear that the inline image construct is a major flaw in the design of the PDF page description language. Inline images are an often used feature in Type 3 fonts. However, the stomach pain of some experts even caused them to adjust this feature in the upcoming PDF 2.0 standard. What are inline images and why do some programmers of PDF readers feel uncomfortable about them?<br />
<a name='more'></a><br />
The PDF page description language consists of operators by which text, graphics and images can be placed onto a blank page. If one wants to paint a raster image onto a page then the image object is given a name and added to the page resource dictionary. The painting operator then refers to the image resource by its name. The separation between operators and resources has the advantage that the page description is short and that the resource can be reused many times, e.g. if the image represents a company logo which is used on each page of a document.<br />
<br />
The appearance of the characters (glyphs) of a Type 3 font is described with the same operator language that is used to describe the appearance of the page. The appearance of such a glyph is mostly described by a small image mask. Since a font can have many glyphs and they usually have a unique appearance, these image masks cannot be reused, and the overhead of putting each of them in a separate resource object is high. For this and similar use cases, PDF allows for placing small images directly into the operator stream. This feature is called an 'inline image'.<br />
<br />
Some programmers now argue that these inline images are difficult to parse. Indeed, if the pixel data is compressed then the length of this image data can only be determined if the data is decompressed by the parser. And if the parser doesn't know the length of the data then it cannot properly find the next operator. Therefore, an optional length attribute was introduced in PDF 2.0.<br />
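The parsing issue can be illustrated with a minimal sketch. The function below handles only the easy case of uncompressed data, where a naive search for the EI marker works; with compressed data the end of the image cannot be found this way, which is exactly what the optional PDF 2.0 length entry addresses. The content stream bytes are made up for the example.

```python
# Simplified sketch of locating an inline image (BI ... ID <data> EI)
# in a content stream. Only uncompressed data is handled; a real parser
# facing compressed data must decode it to know where it ends.
def split_inline_image(stream: bytes):
    bi = stream.index(b"BI")                 # begin inline image
    id_op = stream.index(b"ID", bi)          # end of parameter dictionary
    data_start = id_op + 3                   # "ID" plus one whitespace byte
    ei = stream.index(b"\nEI", data_start)   # naive search for the end marker
    return stream[bi:id_op], stream[data_start:ei]

content = b"q BI /W 2 /H 2 /BPC 8 /CS /G ID \x00\xff\x80\x40\nEI Q"
params, data = split_inline_image(content)
print(len(data))  # 4 pixel bytes
```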
<br />
However, my personal opinion is that inline images, if used carefully in the situations for which they have been designed, are a useful means to reduce the size of a PDF file. Decompressing the images in the parser isn't a real problem since they are usually very small. And the length attribute doesn't really help since it is optional. Furthermore, the feature has worked for more than 20 years.<br />
<br />
If I missed something or if you have a different opinion please let me know and post a comment.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-57597192741356240992016-01-25T08:25:00.003+01:002016-01-26T08:37:20.298+01:00How to convert signed documents to PDF/A?I often get the question whether it is possible to convert digitally signed documents to PDF/A. Because there's no short answer to this I thought it would be helpful to explore the topic a bit into more detail.<br />
<a name='more'></a><br />
What technical means does PDF offer to perform such a conversion?<br />
<br />
Generally speaking, the conversion of a PDF document to a PDF/A document is done by deleting and updating existing objects and adding new objects. Theoretically, this could be performed by using the incremental update mechanism of PDF, which would not break the existing digital signature. In practice, however, the most important viewer, Acrobat Reader, only accepts very limited updates, such as adding comments, once a document is digitally signed. Thus, the incremental update feature cannot be used for this purpose and, unfortunately, there is no other suitable mechanism.<br />
<br />
But how can the problem be solved then?<br />
<br />
There are several possibilities. A simple solution would be to convert the document to a PDF/A-3 file and attach the original file to it. However, in some archiving environments, PDF/A-3 is not permitted. In these cases the conversion tool can create a page containing the verification protocol of the digital signature, convert the original file without the signature to a PDF/A file, merge both files and re-sign the result.<br />
<br />
How would you solve the problem in your environment? Please let me know and post a comment.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-68355096218378458562015-12-02T11:04:00.001+01:002015-12-08T15:11:37.603+01:00Replacing rich black by true black in PDF documentsWhen it comes to printing, all colors in a PDF document are transformed to the native color space of the printing device. If, for example, a text uses a black RGB color, it is transformed to an equivalent CMYK value which contains contributions from all four color channels. In mass printing applications in particular, these "rich black" values are not wanted; instead, "true black" colors which use only the K channel are required. This article gives some ideas how this transform can be achieved.<br />
<br />
<a name='more'></a><br /><br />
Rich black colors are usually created by color management systems (CMS) which are used in viewers or printing applications. The transform itself uses the ICC color profiles of the source and target device. If the source is e.g. an sRGB color space then the black is first transformed to the linear XYZ space, then to the L*a*b space, and then these values are used to look up the equivalent CMYK values using the color profile of the printing device.<br />
<br />
In order to create true black values the CMS transform must be bypassed. For simple fill and stroke colors this can be done by comparing the values of the RGB color channels. If they are equal or almost equal within a given tolerance then the value can directly be converted to a true black value. For images, however, all pixel values must be analysed first, and if every pixel is on a gray scale then the image can be transformed to true black.<br />
<br />
If the source color space, however, is not RGB but CMYK because the colors have already been transformed by an earlier processing stage, then things get more complicated. A naive approach would be to transform them into RGB and thereafter use the algorithm above. However, one gets more accurate results by computing the gray line (a line which represents all values from the black point to the white point) within the color space and then computing the distance of the CMYK value from this gray line.<br />
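The two checks described above can be sketched as follows. This is a deliberately simplified illustration: the tolerances are arbitrary, and the gray line is approximated naively as c=m=y, whereas a real implementation would compute it in the device's color space using its ICC profile.

```python
# Sketch of detecting near-gray colors that are candidates for
# conversion to true black (K channel only). Tolerances are made up.
def rgb_is_near_black_gray(r, g, b, tol=0.05):
    """An RGB color on (or near) the gray axis has equal channels."""
    return max(r, g, b) - min(r, g, b) <= tol

def cmyk_distance_from_gray_line(c, m, y, k):
    """Distance of a CMYK value from the naive gray line c=m=y.
    A real implementation would derive the gray line from the
    printing device's ICC profile."""
    avg = (c + m + y) / 3
    return max(abs(c - avg), abs(m - avg), abs(y - avg))

print(rgb_is_near_black_gray(0.02, 0.03, 0.02))                    # True
print(cmyk_distance_from_gray_line(0.6, 0.58, 0.61, 0.9) < 0.05)   # True
```

For images the same test would be applied to every pixel; only if all pixels pass can the whole image be converted.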
<br />
Since converting rich black colors into true black colors is not a trivial task, we created a new tool to perform it. Besides this, the tool can also convert other color values to CMYK and replace specific colors within a given tolerance, even supporting anti-aliased images.<br />
<br />
Please let me know if you find this article useful.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-49299058251093667832015-11-13T10:28:00.003+01:002015-11-13T10:28:46.588+01:00The problem with embedded fonts in PDF mass printing applicationsPDF is more and more finding its way into mass printing applications. However, PDF spool files often ask too much from a print engine resulting in aborts or, even worse, incomplete prints which may not be noticed. What is special about PDF mass printing and what can be done about it?<br />
<div>
<a name='more'></a><br />
Individual PDF files from various application software systems are assembled into large spool files along with print tickets before they are submitted to the mass print service. The print preparation steps such as merging, splitting, reformatting, pagination, barcode insertion etc. lead to spool files that contain huge amounts of font and other resources. In particular, it may happen that a spool file of 100'000 pages contains 300'000 embedded, slightly different font subsets of the same Times Roman or Helvetica font family. It is immediately clear that an average print engine can't properly handle such a spool file. </div>
<div>
<br /></div>
<div>
One possibility to solve the problem is to omit embedded fonts. However, since the PDF files often conform to PDF/A because they need to be archived, this is not a real option for a general solution. Furthermore, print service organizations are not used to handling these problems. Traditional spool file formats such as AFP and PostScript have been optimized to handle font resources in an economical way. So, one must find a general solution to reduce the amount of resources in the PDF spool file.</div>
<div>
<br /></div>
<div>
The general solution is an optimizer tool. It can replace redundant objects such as repeatedly embedded logo images by a single instance and merge subsets of the same font family into a single font program. However, the merging of font programs is not as easy as it seems for the following reasons:</div>
<div>
<ul>
<li>The font subsets have been derived from different versions of the same font family, e.g. Helvetica 1.0, Helvetica 1.1 etc.</li>
<li>The font subsets have been created by different PDF libraries using different subsetting and embedding rules.</li>
<li>The character code to glyph mapping is different for each subset.</li>
<li>The various subsets use different font technologies such as TrueType, Type 1, CFF or OpenType.</li>
<li>The subsets use different metrics for equivalent glyphs.</li>
<li>etc.</li>
</ul>
<div>
Since a powerful font merging algorithm is required for the purpose of mass print preparation, we have developed a special tool to perform the task. In the best case the tool is able to reduce the number of embedded fonts in the above-mentioned spool file from 300'000 to only 3 fonts.</div>
</div>
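The simpler half of the optimizer's job, deduplicating identical embedded resources such as repeated logo images, can be sketched as follows. The representation of resources as a name-to-bytes mapping is an assumption for the example; in a real PDF the optimizer works on indirect objects and must also compare the associated dictionaries.

```python
import hashlib

# Sketch of the deduplication step of an optimizer: identical embedded
# resource streams (logos, font programs, ...) collapse to one instance.
def deduplicate(resources: dict) -> dict:
    """resources maps name -> raw bytes. Returns a mapping of each
    name to the canonical name whose payload it shares."""
    canonical, mapping = {}, {}
    for name, payload in resources.items():
        digest = hashlib.sha256(payload).hexdigest()
        mapping[name] = canonical.setdefault(digest, name)
    return mapping

spool = {"Logo1": b"\x89PNG...", "Logo2": b"\x89PNG...", "F1": b"OTTO..."}
print(deduplicate(spool))  # {'Logo1': 'Logo1', 'Logo2': 'Logo1', 'F1': 'F1'}
```

Font merging is much harder than this byte-identical case, precisely for the reasons listed above: the subsets are similar but not identical.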
<div>
<br /></div>
<div>
<br /></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-40791914822610245702015-07-28T09:18:00.000+02:002018-06-12T08:03:28.095+02:00Is JBIG2 soon banned?JBIG2 is a compression algorithm for bitonal images and has been developed to replace the widely used CCITT G4 algorithm because it can reach better compression ratios. However, the algorithm has received a bad reputation which has led some security experts to recommend not using the algorithm anymore. Is this wise advice or just an overreaction? Why could it go so far?<br />
<div>
<a name='more'></a><br />
To understand this, let us start with some properties of the algorithm itself. JBIG2 can be used in two modes: lossless and lossy. In lossless mode the decompressed image is bit-for-bit identical to the original image. In lossy mode some pixels may differ in favor of a better compression rate. </div>
<div>
<br /></div>
<div>
To achieve this the compressor builds up a symbol dictionary consisting of bit patterns, e.g. for the character "e". On a scanned page this character can appear often, but the bit patterns may differ slightly. The compression algorithm replaces all occurrences of these patterns with references to the pattern stored in the symbol dictionary. Most compressors have a quality parameter which indicates how "similar" a pattern must be to a previously stored symbol in order to be replaced by it. It is obvious that this method can save space. </div>
<div>
<br /></div>
<div>
But if the quality parameter is set too low then the compressor may replace a bit pattern for "6" with a reference to the symbol "8". In this case we might get a problem. This possible behavior is the source of the whole discussion about the JBIG2 algorithm.</div>
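The effect of the quality parameter can be demonstrated with a toy version of symbol matching. This is not the actual JBIG2 matching algorithm (real compressors use more refined similarity measures than a raw pixel count); it only illustrates how a too-permissive threshold merges distinct glyphs.

```python
# Toy illustration of symbol matching: a pattern within `max_diff`
# differing pixels of a stored symbol is replaced by a reference to it.
def match_symbol(pattern, dictionary, max_diff):
    for idx, symbol in enumerate(dictionary):
        diff = sum(p != s for p, s in zip(pattern, symbol))
        if diff <= max_diff:
            return idx                # reuse the stored symbol
    dictionary.append(pattern)        # no match: store as a new symbol
    return len(dictionary) - 1

glyph_8 = [1,1,1, 1,0,1, 1,1,1, 1,0,1, 1,1,1]   # crude 3x5 "8"
glyph_6 = [1,1,1, 1,0,0, 1,1,1, 1,0,1, 1,1,1]   # crude 3x5 "6", one pixel off

print(match_symbol(glyph_6, [glyph_8], max_diff=0))  # 1: kept as its own symbol
print(match_symbol(glyph_6, [glyph_8], max_diff=2))  # 0: "6" becomes an "8"!
```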
<div>
<br /></div>
<div>
Due to the problems that might occur during compression, some experts recommend not using the algorithm at all. In particular, the German federal authority BSI (Bundesamt für Sicherheit in der Informationstechnik) revised the RESISCAN guideline accordingly. Although JBIG2 is not mentioned explicitly therein, it forbids pattern matching/replacement and soft pattern matching algorithms. This implies that JBIG2 shall be used neither in lossless nor in lossy mode. The Swiss KOST (Koordinationsstelle für die dauerhafte Archivierung elektronischer Unterlagen) also recommends not using JBIG2 anymore.</div>
<div>
<br /></div>
<div>
Technically speaking, if a user uses lossless JBIG2 compression then the described problem cannot occur. On the other hand, I can understand that BSI and KOST recommend not using the algorithm at all, since they assume that most users do not care about details such as lossy versus lossless modes and quality parameters.<br />
<br />
In order to avoid security discussions the setting of the quality parameter has been disabled in our software since version 4.6.5.0 with the effect that only lossless compression is being used.</div>
<div>
<br /></div>
<div>
I would be interested in your opinion. Is this an overreaction or a wise advice? Please let me know and post a comment.</div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-80204985116685054482015-06-27T21:00:00.001+02:002019-02-11T14:24:01.245+01:00Image detection in scanned images<div style="background-color: white; color: #4c4c4c; font-stretch: normal; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="font-family: inherit;">Detecting pictures in scanned document pages has many advantages such as better compression rates and the possibility to extract them individually.</span><br />
<a name='more'></a><span style="font-family: inherit;"><br /></span></div>
<span style="background-color: white; color: #4c4c4c; font-family: inherit;">A scanned page is originally a raster image consisting of bi-level or color pixels. Since sophisticated compression methods are available, scanning in color is clearly preferred over pure black/white modes. One such method is the mixed raster content (MRC) method which separates the scanned image into a background, mask and foreground layer. Each layer can be compressed individually using specialized algorithms parameterized for its specific purpose. Such algorithms are JBIG2 for the mask and JPEG2000 for the background layer. </span><br />
<div style="background-color: white; color: #4c4c4c; font-stretch: normal; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="font-family: inherit;">There can be multiple foreground layers, e.g. for photographic images that are part of the scanned page. In order to separate these images from the background and mask layer a specific segmentation algorithm must detect and isolate them. Each of these images can then form an individual foreground layer compressed with a specific algorithm such as JPEG.</span></div>
<div style="background-color: white; color: #4c4c4c; font-stretch: normal; font-weight: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="font-family: inherit;">So the MRC method for scanned pages can be accomplished as follows:</span></div>
<div style="background-color: white; font-stretch: normal; margin: 0.75em 0px 0px; position: relative;">
</div>
<ul>
<li><span style="color: #4c4c4c; font-family: inherit;">Segmentation algorithm: detect and isolate images</span></li>
<li><span style="color: #4c4c4c; font-family: inherit;">Separation algorithm: compute the pixels of the image mask and the color background.</span></li>
<li><span style="color: #4c4c4c; font-family: inherit;">Compress each layer using a dedicated compression algorithm</span></li>
<li><span style="color: #4c4c4c; font-family: inherit;">Compose the layers according to an MRC schema such as RFC 2301 in TIFF or a masked image in PDF.</span></li>
</ul>
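The separation step of the list above can be sketched for a grayscale scan as follows. This is a minimal illustration with a fixed threshold; real MRC separation uses adaptive thresholding and the segmentation step described earlier, and the 2-D-list page representation is an assumption for the example.

```python
# Minimal sketch of MRC separation for a grayscale scan: the mask layer
# holds the bi-level text, the background holds everything else.
def separate(page, threshold=0.5):
    """page: 2-D list of gray values in [0, 1]. Returns (mask, background)."""
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    background = [
        [1.0 if m else px for px, m in zip(row, mrow)]  # text punched out
        for row, mrow in zip(page, mask)
    ]
    return mask, background

page = [[0.1, 0.9], [0.95, 0.2]]   # dark text pixels on light paper
mask, bg = separate(page)
print(mask)  # [[1, 0], [0, 1]]
print(bg)    # [[1.0, 0.9], [0.95, 1.0]]
```

Each returned layer would then be handed to its dedicated compressor (JBIG2 for the mask, JPEG2000 for the background) and composed per the MRC schema.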
<span style="color: #4c4c4c; font-family: inherit;">Removing images from the scanned page may also speed up the text recognition process (OCR). </span><br />
<div style="background-color: white; font-stretch: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="color: #4c4c4c; font-family: inherit;">However, an even more interesting function can be offered. If the said images have been isolated and assigned an individual layer, they can easily be extracted from the document by a suitable tool.</span></div>
<div style="background-color: white; font-stretch: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="font-family: inherit;">Furthermore, when creating a PDF document from a scanned page, the optional content feature can be used to switch the background and foreground layers on and off.</span></div>
<div style="background-color: white; font-stretch: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="font-family: inherit;"><span style="color: #4c4c4c;">Our products such as the </span><span style="background-color: transparent;"><span style="color: #4c4c4c;">3-Heights™ TIFF Toolbox, the </span></span><span style="background-color: transparent;"><span style="color: #4c4c4c;">3-Heights™ Scan Server and the </span></span><span style="background-color: transparent;"><span style="color: #4c4c4c;">3-Heights™ Optimizer now support the features described in this article. To extract the images from a PDF document the 3-</span></span></span><span style="background-color: transparent;"><span style="color: #4c4c4c;">Heights™ PDF Extract tool can be used.</span></span></div>
<div style="background-color: white; font-stretch: normal; margin: 0.75em 0px 0px; position: relative;">
<span style="background-color: transparent; color: #4c4c4c;">Is this article useful to you? Please let me know and post a comment. </span></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-42857122736704898262015-05-11T08:25:00.001+02:002018-11-13T09:34:04.435+01:00Font subsetting - how it works and when to useIn order to reduce the file size, PDF producers use a technique called font subsetting. What exactly happens to the fonts and what are the consequences?<br />
<a name='more'></a><br />
If a PDF creator software adds text to a page description then it refers to a font. The font contains a collection of characters with a description of their graphic appearance called glyphs, metrics and other relevant information to render the text. There exist various types of font formats which can be used in a PDF document such as Type 1, TrueType, CFF and OpenType fonts.<br />
<br />
Fonts do not need to be embedded in a PDF file. In this case they are referred to by name and must be available on the target system to correctly reproduce the document. To guarantee that the fonts are always available, the PDF creator software can embed them as binary streams in the file. The size of the font files themselves can be reduced by removing all information that is not actually needed to correctly render the document. The PDF specification lists exactly the font parts that are required. Besides this, the font size can be reduced by keeping only the character descriptions which are actually referred to by the text objects. The removal of unneeded character descriptions is called subsetting.<br />
<br />
Fonts with PostScript outlines (Type 1, CFF, OpenType) can be reduced by just removing the unused charstrings, as glyph selection is done using unique glyph names or character identifiers (CID). For TrueType outlines there are various options to perform the subsetting. One option is to completely remove the glyphs; the other is to remove only the outlines and keep empty glyphs. The advantage of the first option is that the resulting size is smaller. The advantage of the latter is that the glyph identifiers (GID) do not change. This is important because the glyphs are selected by their GID. When using the first option, the font encoding tables (cmap) or the CIDToGIDMap data structure must be adapted to reflect the changes in the glyph numbering. But there are also some special cases.<br />
<br />
Font programs may contain compound glyphs, e.g. the glyph 'ä' may refer to two separate glyphs, 'a' and '¨'. When subsetting such a font, the compound glyph descriptions must be updated if the glyph numbering changes, and the referenced component glyphs must not be removed even if they are not directly referred to by the text in the document. Other information such as the encoding tables (cmap), font metrics (head, hhea, hmtx) and instructions (prep, fpgm, cvt) may also refer to glyph numbers and must be updated accordingly.<br />
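The compound-glyph rule amounts to computing a closure over component references before deciding which glyphs to drop. The sketch below illustrates just that step; representing the component graph as a plain dictionary is an assumption for the example, whereas a real subsetter reads the references out of the glyf table.

```python
# Sketch of computing the glyph closure for subsetting: components of
# compound glyphs must be retained even when no text refers to them.
def glyph_closure(used, components):
    """used: glyph ids referenced by the text.
    components: glyph id -> ids of referenced component glyphs."""
    keep, todo = set(), list(used)
    while todo:
        gid = todo.pop()
        if gid in keep:
            continue
        keep.add(gid)
        todo.extend(components.get(gid, ()))   # follow compound references
    return keep

# 'ä' (gid 5) is composed of 'a' (gid 1) and the dieresis (gid 9)
components = {5: [1, 9]}
print(sorted(glyph_closure({5}, components)))  # [1, 5, 9]
```

Everything outside the closure can be dropped (or emptied, if GIDs are to be preserved).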
<br />
Subsetting inhibits the editing of PDF documents. Especially in interactive forms the fonts which are used to fill out form fields must not be subsetted.<br />
<br />
The subsetting of fonts is a complex and error-prone task. The majority of bad real-world PDF files contain malformed embedded fonts resulting from non-functioning subsetting algorithms.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-8876731584282317682015-01-31T11:11:00.001+01:002015-05-10T19:02:39.305+02:00Digital signatures in PDF/ADigital signatures are still not very widely used and the knowledge about them is often fuzzy. This article tries to give an overview of this huge and complex topic.<br />
<a name='more'></a><br />
The term digital signature refers to implementations of the more generic concept of an electronic signature on digital computers. The term electronic signature is used more in conjunction with the legal aspects of such signatures. The functions of an electronic signature are to<br />
<ol>
<li>Replace the handwritten signature</li>
<li>Ensure the integrity of a document (electronic seal)</li>
<li>Convey the authenticity of the signer (electronic identity)</li>
</ol>
<div>
In most countries electronic signatures are subject to national legislation, e.g. ZertES in Switzerland.</div>
<div>
<br /></div>
<div>
A digital signature is a cryptographic method to implement the above functions. In most cases the signer owns a digital certificate and a private key. The private key is stored on a secure token or in a hardware security module (HSM). It is used to create the digital signature. The signer's certificate contains the corresponding public key and can be used to verify the signature.</div>
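The private-key-signs / public-key-verifies relationship can be illustrated with a toy RSA example. This uses deliberately tiny numbers and is in no way secure: real signatures use large keys, padding schemes (e.g. PKCS #1) and are embedded in CMS structures, none of which is shown here.

```python
import hashlib

# Toy RSA sign/verify, purely to illustrate the asymmetry:
# only the private key d can sign, anyone with (n, e) can verify.
p, q = 61, 53
n, e = p * q, 17                      # public key (n, e)
d = pow(e, -1, (p - 1) * (q - 1))     # private key d

def sign(message: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)               # requires the private key

def verify(message: bytes, signature: int) -> bool:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h  # requires only the public key

sig = sign(b"contract")
print(verify(b"contract", sig))       # True
print(verify(b"contract!", sig))      # almost surely False (tampered message)
```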
<div>
<br /></div>
<div>
PDF defines three types of signatures:</div>
<div>
<ol>
<li>Document signature: Any user of the document can apply such a signature and a document can be signed multiple times. Each user can add annotations to the document before it is signed. Each signature creates a specific revision of the document at the moment it is applied. This revision can later be reliably restored.</li>
<li>Modification Detection and Prevention (MDP) signature: The author of the document can add a signature connected with specific action rights such as filling out forms which do not invalidate the integrity of a document. Only one such signature can be added to a document.</li>
<li>Usage Rights (UR) signature: Software can add this type of signature to enable specific reader functions such as the well-known Acrobat Reader Extensions.</li>
</ol>
<div>
The signatures themselves are a mixture of PDF objects and strings in a cryptographic message syntax. In order to provide maximum interoperability the embedding of a digital signature must follow specific rules which are listed here:</div>
<div>
<ul>
<li>PDF/A-1 is based on PDF 1.4 and does not specifically define any rules. The PDF/A Competence Center therefore created a document called Tech Note #6. I happen to be the editor of this document. You can get it from the PDF Association website.</li>
<li>PDF/A-2 and PDF/A-3 are based on PDF 1.7, which refers to the PAdES standard.</li>
</ul>
<div>
As it is always the case with blog posts this article is far from being in depth or even complete. The main goal is to invite you to post questions and start discussions. So, please post a comment and share your thoughts with others.</div>
</div>
</div>
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-8909527971548289616.post-71608822897693465762014-12-13T11:16:00.000+01:002015-05-10T19:01:04.984+02:00Handling embedded and non-embedded fonts in PDF & PDF/A documentsAlthough the first part of the PDF/A standard was published in 2005 there is still a need for some clarifications regarding fonts and embedding. What does the standard exactly require? How should PDF to PDF/A converters handle fonts? How do viewers actually deal with them and how should they?<br />
<a name='more'></a><br />
<b>PDF/A Creation</b><br />
<br />
Let us start with the easiest case. If you create a PDF/A document then in general you have to embed all used fonts. This is true for any flavor of PDF/A such as PDF/A-1b or PDF/A-3u etc. There is only one exception to this: if the text is not visible (text rendering mode 3) then embedding is not required. Invisible text is often used to overlay a scanned page with the text from an OCR engine in order to allow for searching text in a scanned document as if it were a born-digital document.<br />
<br />
If embedding is required, however, then the font can be minimized so that it only contains those characters which are being used by the document, e.g. if a document shows the single text string "help" in Arial then the embedded Arial font program can be reduced to contain only four characters. This process is called subsetting and it is extensively used to reduce the size of the created file.<br />
<br />
But creators must pay attention to characters which are composed from others, such as the German "ä", which can be composed from "a" and the combining diaeresis "¨". This is one of the sources of bad PDFs with incomplete font programs.<br />
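A real subsetter works on glyph IDs inside the font program, but the character-collection step can be sketched in a few lines of Python. The NFD normalization mimics the decomposition issue described above; this is an illustration only, not how any particular product implements it.

```python
import unicodedata

def subset_charset(text):
    """Collect the characters a subsetted font must cover.

    Includes the canonical decomposition of each character so that
    composed glyphs like 'ä' (= 'a' + combining diaeresis) do not
    end up referencing components missing from the subset.
    """
    chars = set()
    for ch in text:
        chars.add(ch)
        # add the decomposed components as well
        chars.update(unicodedata.normalize("NFD", ch))
    return chars
```

For the string "help" this yields exactly four characters, while for "ä" the set additionally contains "a" and the combining diaeresis U+0308.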
<div>
<br /></div>
If an embedded font is used to fill in text in form fields, then the whole font must be embedded, since the creator doesn't know in advance which characters will eventually be selected by the user. From a technical point of view, the text remains editable if the associated font is not subsetted, and vice versa. But there are also legal constraints.<br />
<br />
Embedding and subsetting a font are subject to the licensing terms of the font manufacturer. The majority of licenses grant the right to freely use the font for reproduction, such as viewing and printing, but restrict the creation and editing of text to the license owner. In any case, you should carefully check the license conditions before using a font, to avoid legal issues.<br />
<br />
TrueType and OpenType fonts contain usage rights information (the fsType flags in the OS/2 table) which tells the creator software whether the font may be embedded or not. Some creators obey these flags, others don't. Whatever this information tells you, it can only be regarded as a hint. In the end, the written license text which comes with the purchased font is the only decisive source of information.<br />
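For the curious, here is a minimal sketch of how the embedding flags can be read from raw TrueType/OpenType data. Error handling is omitted, font collections (.ttc) are not covered, and the bit meanings are quoted from the OpenType specification.

```python
import struct

# fsType bits (per the OpenType spec): 0x0002 restricted license,
# 0x0004 preview & print, 0x0008 editable embedding,
# 0x0100 no subsetting, 0x0200 bitmap embedding only
def read_fstype(font_bytes):
    """Return the fsType value from a font's OS/2 table, or None."""
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for i in range(num_tables):
        record = 12 + 16 * i  # table directory starts after the 12-byte header
        tag, _checksum, offset, _length = struct.unpack_from(">4sIII", font_bytes, record)
        if tag == b"OS/2":
            return struct.unpack_from(">H", font_bytes, offset + 8)[0]
    return None  # no OS/2 table found
```

A value of 0 means installable embedding with no restrictions; but remember that the license text, not this flag, is decisive.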
<br />
<b>PDF to PDF/A Conversion</b><br />
<b><br /></b>
A PDF to PDF/A converter has to embed fonts that aren't embedded. For a well-formed PDF input document this is not a problem. If the font is found (by name) in the installed font collection of the operating system, it is used. If it is not found, it is replaced by a font with similar characteristics. Such replacement fonts are often synthesized from a generic font template for serif and sans-serif designs (Multiple Master fonts) rather than taken from the installed fonts.<br />
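The lookup-and-substitute step can be sketched as follows. The characteristics compared here (serif class and weight) are illustrative only; real converters compare many more properties such as width class, italic angle and the PANOSE classification.

```python
def pick_replacement(requested, installed):
    """Pick an installed font for a missing one by comparing characteristics.

    'requested' and the values of 'installed' are dicts with the
    (illustrative) keys 'serif' and 'weight'.
    """
    if requested["name"] in installed:
        return requested["name"]  # exact match by name
    # otherwise prefer the same serif class, then the closest weight
    def distance(item):
        _name, props = item
        return (props["serif"] != requested["serif"],
                abs(props["weight"] - requested["weight"]))
    return min(installed.items(), key=distance)[0]
```

For example, a request for a missing "Arial" (sans-serif, weight 400) would pick an installed Helvetica over Times New Roman.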
<br />
If the PDF input document is not well-formed (e.g. if there are non-embedded fonts which are symbolic, or CID fonts without a known CMap), then the converter must use similar heuristics to those a viewer would use in such a situation. But since these algorithms aren't bulletproof, the result might not look as expected, or the conversion may even fail.<br />
<br />
<b>PDF/A Viewer </b><br />
<br />
A viewer (in general, any software which renders PDF files) may behave differently depending on whether the document claims to conform to the PDF/A standard. If the document carries the PDF/A label, the viewer is required to use the embedded fonts, whereas for a regular PDF document it may use installed fonts instead. Using an installed font is usually faster than loading the embedded font from its compressed and possibly encrypted data stream. On the other hand, even fonts with the same name may look and behave differently.<br />
<br />
If I have left some questions open or raised new ones, please let me know and post a comment.<br />
<br />
<b>PDF validation with customer specific extensions</b> (2014-11-04)<br />
While talking about PDF validation workflows I often come across questions like "Can I let the validation fail if the paper format does not match our corporate rules?". This and other customer specific requirements are indeed useful extensions to the pure file format and standard conformance tests.<br />
<a name='more'></a><br />
Customer specific tests depend on whether the document was scanned or produced from a digital source and how it is intended to be used. For scanned documents the following checks could make sense:<br />
<ul>
<li>Resolution of the scanned images</li>
<li>Compression algorithms</li>
<li>Manufacturer of scanner (stored as the producer property in the document's metadata)</li>
<li>Presence of OCR text</li>
</ul>
<div>
For born-digital documents the following information could be helpful:</div>
<div>
<ul>
<li>Names of the used fonts</li>
<li>Font embedding information (if not PDF/A)</li>
<li>Creator and producer application name</li>
</ul>
<div>
And, in general:</div>
</div>
<div>
<ul>
<li>Page format (A4, Letter, etc.)</li>
<li>PDF minimum and maximum version (e.g. from 1.4 to 1.7)</li>
<li>Presence of specific features such as embedded files, transparency, patterns, shadings, color spaces etc.</li>
</ul>
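To make the page format check concrete, here is a sketch of an orientation-independent classifier. The page dimensions would come from each page's MediaBox; the size table and the tolerance are my own choices, not part of any standard.

```python
# common paper sizes in PDF points (1 pt = 1/72 inch)
PAGE_FORMATS = {
    "A4": (595.276, 841.89),
    "A3": (841.89, 1190.55),
    "Letter": (612.0, 792.0),
    "Legal": (612.0, 1008.0),
}

def classify_page_format(width_pt, height_pt, tolerance_pt=3.0):
    """Match a page size against known formats, ignoring orientation."""
    w, h = sorted((width_pt, height_pt))
    for name, (fw, fh) in PAGE_FORMATS.items():
        if abs(w - fw) <= tolerance_pt and abs(h - fh) <= tolerance_pt:
            return name
    return None  # unknown format -> fail the corporate rule check
```

A validation extension would then simply reject documents for which the classifier returns None or a format outside the corporate whitelist.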
<div>
I'd like to learn more about your specific requirements. Please let me know them and post a comment!</div>
</div>
<b>Scan to PDF/A - some insights</b> (2014-10-27)<br />
Traditionally, a scanner produces a TIFF or JPEG image for each page. Some scanners can directly produce PDF files, and newer devices produce files conforming to the PDF/A standard. However, the quality of the produced files differs significantly. Why is this, and why is it worth using a central scan server?<br />
<a name='more'></a><br />
Of course, the scan to PDF conversion process is not just about embedding an image in a PDF envelope. It can also involve text and barcode recognition, the embedding of metadata, and digital signatures. But in this article I'd like to concentrate on image data compression, which is marketed as a main advantage of PDF/A over TIFF. It is said that PDF/A is better because it offers more advanced compression mechanisms than TIFF. So, let us have a closer look at this particular topic.<br />
<br />
One of the main requirements in the scan to PDF/A conversion process is to reduce the file size. A smaller size is often achieved at the price of lower quality. There are some factors which influence the quality / size ratio:<br />
<ul>
<li>Color vs. Gray vs. Black / White</li>
<li>Choice of compression algorithm (lossless vs. lossy)</li>
<li>Multi vs. single page</li>
<li>MRC (Mixed Raster Content) mechanism</li>
</ul>
<div>
The most widely used bi-tonal (black and white) compression algorithms are G4 (standard name ITU-T T.6) and JBIG2. G4 is lossless, whereas JBIG2 can operate in both lossless and lossy mode. In order to achieve a better compression rate, lossy JBIG2 may store symbols such as text characters in a table and reuse them. This symbol table can save a significant amount of space, especially in multi-page documents, since it can be shared across all pages. The downside of this mechanism is that it may unexpectedly mix up similar-looking symbols, which is why the lossy mode of JBIG2 is often disabled. But even in lossless mode, JBIG2 generally achieves a better compression rate than G4.</div>
<div>
<br /></div>
<div>
For gray and color images the most frequently used algorithms are JPEG and JPEG2000. JPEG can only be used in lossy mode, whereas JPEG2000 can again be used in both modes. In lossy mode, both algorithms offer a parameter which controls the quality / size ratio. Although JPEG2000 is more modern, it cannot simply be called 'better' than JPEG: measurements show that at higher quality settings JPEG2000 achieves better compression rates, whereas at lower quality settings JPEG is generally better. The quality loss introduces image artifacts, such as shadows, which are typical for both algorithms. JPEG has an additional artifact called blocking, which has its origin in the subdivision of the image into 8 x 8 pixel blocks that are compressed independently. In addition, the JPEG algorithm usually halves the resolution of the chrominance signal with respect to the luminance signal, which increases the compression rate but amplifies the blocking artifacts.</div>
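The effect of the chrominance down-sampling on the amount of data to compress is easy to quantify. The following sketch compares the raw YCbCr sample counts with and without a 2x subsampling factor (4:4:4 vs. 4:2:0), assuming for simplicity that the image dimensions are multiples of the factor:

```python
def ycbcr_sample_count(width, height, chroma_factor=1):
    """Raw sample count for one luma plane plus two chroma planes."""
    luma = width * height
    chroma = 2 * (width // chroma_factor) * (height // chroma_factor)
    return luma + chroma
```

For a 16 x 16 tile, 4:4:4 yields 768 samples while 4:2:0 yields only 384, i.e. the input data is halved before entropy coding even starts, which is exactly where the extra compression comes from.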
<div>
<br /></div>
<div>
If converting color scans to PDF then often some sort of mixed raster content mechanism is used. MRC separates the color information into layers: a background layer, a mask layer and a number of foreground layers. A typical example is a page that contains black text with some words emphasized in red and blue. The mask then contains the shapes of the characters, and the background layer the color of the text. Obviously, the mask can be efficiently compressed with G4 or JBIG2, and the background layer with either JPEG or JPEG2000 at a very low resolution. Using this mechanism, a scanned page can be reduced to approximately 40 KB with good quality. This result cannot be achieved by a lossy compression algorithm alone. However, if the page contains graphics or images, then these have to be isolated and compressed with good quality in one or several foreground layers. This isolation process is called segmentation, and it is an essential part of the MRC mechanism.</div>
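The layer separation can be illustrated on a toy grayscale "page" (a list of pixel rows, 0 = black, 255 = white). The threshold-based segmentation below is deliberately naive; production MRC encoders use far more elaborate segmentation than this sketch.

```python
def mrc_split(pixels, threshold=128, factor=2):
    """Split a grayscale page into a bitonal mask and a low-res background.

    The mask marks dark (text) pixels; the background layer averages the
    remaining pixels over factor x factor blocks, defaulting to white.
    """
    height, width = len(pixels), len(pixels[0])
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    background = []
    for by in range(0, height, factor):
        row = []
        for bx in range(0, width, factor):
            samples = [pixels[y][x]
                       for y in range(by, min(by + factor, height))
                       for x in range(bx, min(bx + factor, width))
                       if not mask[y][x]]
            row.append(sum(samples) // len(samples) if samples else 255)
        background.append(row)
    return mask, background
```

The mask would then go to G4 or JBIG2, and the quarter-size background to JPEG or JPEG2000.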
<div>
<br /></div>
<div>
Now, after reviewing the various compression schemes, it is time to discuss them in the context of archiving systems. Of course, the file size is often the most important issue, but not always. In many scenarios the display speed is crucial, and with respect to this requirement JPEG2000 has often proven too slow, especially when combined with an MRC mechanism. As we have seen, JPEG is better at higher compression rates, so why not use it at least for the background layer? The disturbing blocking artifacts can be reduced by disabling the down-sampling of the chrominance signal. A bigger problem is that many scanners deliver color images in JPEG compression only, which significantly reduces the power of a server-based compressor, because the JPEG artifacts make the segmentation and MRC compression much more difficult. But why not use the scanner's built-in image to PDF conversion feature? This may be adequate in a personal environment, but in enterprise applications there are many reasons to use a central server. The most important are better quality, smaller file sizes, better OCR quality, and additional post-scan processing steps.</div>
<div>
<br /></div>
<div>
And, last but not least: is PDF/A better than TIFF? The answer is definitely yes! But not with respect to compression. TIFF offers essentially the same compression algorithms as PDF/A does. The real strength of PDF/A is that it provides for the embedding of color profiles, metadata and optically recognized text in a standardized manner. Furthermore, PDF/A is a uniform standard for scanned as well as born-digital documents.</div>
<div>
<br /></div>
<div>
Is this article useful to you? Please let me know and post a comment!</div>
<div>
<br /></div>
<div>
</div>
<br />
<br />
<div>
<br /></div>
<b>Are the PDF/A space requirements a show stopper for archiving?</b> (2014-09-23)<br />
PDF/A requires that all resources, such as fonts and color profiles, be embedded in the file. Archiving transactional documents can be a nightmare, because such documents are usually short by nature, yet the archive ends up with a huge number of copies of the same Frutiger font, sRGB color profile and company logo. Many archives therefore prefer TIFF over PDF/A when it comes to born-digital documents. But that is certainly not the idea of a uniform standard. How can this problem be solved?<br />
<a name='more'></a><br />
PDF/A is widely accepted in archives for scanned documents. This is mainly due to the fact that PDF/A offers strong, standardized compression mechanisms which allow a color scanned page to be reduced to less than 50 KB. Even for individual born-digital documents, PDF/A is the preferred file format. However, the use of PDF/A for the mass archiving of transactional documents is still disputed. In my opinion, though, this is not a problem of the format; it is a problem of the archiving system and must therefore be solved there.<br />
<br />
Most archiving systems are proud of the fact that they store 'objects' without caring about their format. This unawareness has a crucial disadvantage, however. They cannot handle the files in an appropriate and intelligent way. Therefore most solutions for the mass archiving of PDF/A documents add a software layer to the archiving system which tries to reduce the negative effects of repetitively embedded resources. There are two main approaches for this software layer.<br />
<br />
The first approach collects individual documents and merges them into a single container file, in which each resource occurs only once. This file is then submitted to the archive. When a document is retrieved, the container file is fetched and split into the original documents.<br />
<br />
The second approach separates each document into individual resource files and a body document which refers to them. The resources are then optimized by replacing identical copies with a single instance. The optimized resource files and the body documents are submitted to the archive. When a document is retrieved, it is rebuilt from its parts.<br />
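The second approach boils down to content-addressed storage. Here is a sketch: each resource is replaced by the hash of its bytes, identical copies collapse to a single stored instance, and the body plus the resource store suffice to rebuild the original document. Function and field names are illustrative, not taken from any product.

```python
import hashlib

def split_documents(documents):
    """Replace each document's resources with content hashes.

    'documents' is a list of (body, [resource_bytes, ...]) pairs.
    Returns the stripped documents plus a shared resource store.
    """
    store = {}  # digest -> resource bytes, one instance per distinct resource
    stripped = []
    for body, resources in documents:
        refs = []
        for res in resources:
            digest = hashlib.sha256(res).hexdigest()
            store.setdefault(digest, res)  # keep only the first copy
            refs.append(digest)
        stripped.append((body, refs))
    return stripped, store

def rebuild_document(body, refs, store):
    """Reassemble a document from its body and the resource store."""
    return body, [store[d] for d in refs]
```

Two invoices embedding the same font then cost the archive one copy of the font program instead of two, and the round trip through split and rebuild is lossless.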
<br />
I personally prefer the second approach, since it can be implemented with much higher performance than the first. However, people criticize that the 'objects' are then no longer PDF/A documents. In my opinion this is not required, because the software layer which splits and merges the resources does so transparently for the user and guarantees that the document is the same before it is stored and after it is retrieved. This argument is usually better understood if the mechanism is compared to the compression or encryption algorithm within the storage layer of the archiving system: the data which is stored on the media is no longer a PDF/A file if it is stored in compressed or encrypted form, yet after decompression or decryption it is the identical file again. The same is true for the resource management software layer.<br />
<br />
I have implemented the second approach for customers with huge document volumes, and it works flawlessly, saves space and reduces cost.<br />
<br />
What do you think about this? Please let me know and post a comment!<br />
<br />
<b>What can I do about sliced images?</b> (2014-09-15)<br />
If I try to extract images from a PDF file, it sometimes happens that I get a bunch of slices of the original image, mostly consisting of a few image rows per slice or, in extreme cases, just one row. Why is that, and how can I get the entire image in one piece?<br />
<a name='more'></a><br />
There are various reasons for dividing an image into slices and storing them as separate image objects in a PDF file. One obvious reason could be that the PDF creation software imports an already sliced source image, e.g. a TIFF file containing strips or tiles, without merging the slices into one image. Another frequent reason is that the PDF creation software has architectural limits regarding the size of the image sample data, e.g. a Windows native application that creates a PDF file through a virtual printer driver. And sometimes a graphics library, like GDI+, implements masked images by creating slices for the visible parts.<br />
<br />
Once we have understood how slices arise, we also know how to put the pieces together again. But this is certainly not easy. Here's how I've done it in one of our products. Let's call it the image merger here.<br />
<br />
The image merger reads the content stream object by object. If it encounters an image, it sets up an empty surface and an image mask with all bits set to 'invisible'. The slice is stored in the surface and the corresponding bits in the mask are set to 'visible'. If the next object is also an image, its slice is stored in the same way. This process is repeated until another object type is encountered or it becomes obvious that the image is not a slice, e.g. if the color space of the image changes. When this happens, the enclosing rectangle of all slices is computed, the merged image is copied to the output file, and the surface is reset to its initial state.<br />
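In pseudo-Python, the surface bookkeeping looks roughly like this. Slices are given as a vertical offset plus their pixel rows; the real implementation of course also tracks horizontal position, color space and width, and keeps the visibility mask separately.

```python
def merge_slices(slices):
    """Assemble image slices into one surface.

    'slices' is a list of (y_offset, rows) pairs, where rows is a list of
    pixel rows. Cells never covered by a slice stay None (the 'invisible'
    bits of the mask); the result is cropped to the enclosing rectangle.
    """
    top = min(y for y, _ in slices)
    bottom = max(y + len(rows) for y, rows in slices)
    width = max(len(row) for _, rows in slices for row in rows)
    surface = [[None] * width for _ in range(bottom - top)]
    for y, rows in slices:
        for i, row in enumerate(rows):
            for j, value in enumerate(row):
                surface[y - top + i][j] = value  # mark this cell 'visible'
    return surface
```

Three one-row slices at consecutive offsets thus reassemble into a single three-row image.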
<br />
I must admit that this is not a very sophisticated algorithm, and I hope you have a better approach. Please let me know how you would solve the problem and post a comment!<br />
<br />
<b>Automating the conversion of Microsoft Office documents to PDF/A</b> (2014-09-07)<br />
A central service to convert Microsoft Office documents to PDF or PDF/A has obvious advantages. The conversion is done on an enterprise-wide platform with well defined software versions and conversion process configurations. This guarantees consistent quality and makes the deployment and operation of client-based software obsolete. The price for this, however, is that the central service must automate the native applications, such as Microsoft Word, which are designed for interactive use, not for server operation.<br />
<a name='more'></a><br />
If I had to build such a service, my first naive approach would be to let it perform the following automation steps: run the application, then call its "open file" and "save as PDF/A" functions. Unfortunately, it is not that easy. Why not?<br />
<br />
Applications such as Microsoft Word, Excel and PowerPoint are designed for interactive use. They can only run as a single instance in a user session. A service, however, must be prepared to convert documents in parallel to make optimal use of the computer's resources. Furthermore, most applications notify the user with pop-up dialog boxes and similar user interface features. If this happens within the context of a service, the application will block the process, because there is no user to press the OK or Cancel button. Moreover, interactive applications are not robust enough to process thousands of documents; they become unstable after a while and need to be terminated. Finally, the produced PDF/A document sometimes does not conform to the standard or is of inferior visual quality.<br />
<br />
For these reasons, a service to automate the conversion of Microsoft Office documents to PDF/A must do much more than my naive approach described above. The most important tasks are:<br />
<ul>
<li>Run the application in multiple instances of a "worker session". This allows for the execution of conversions in parallel.</li>
<li>Automate the conversion process by controlling the application through an API, and run a "robot" to operate the user interface of the application (press OK buttons, read messages and act on them, etc.).</li>
<li>Monitor the sanity of the applications and restart them accordingly.</li>
<li>Use the optimal means to create a PDF file and convert it to PDF/A in a post-processing step. Some applications have a built-in "Save as PDF" function, others can print to a virtual printer driver, and some can produce a file format (XPS, PostScript, etc.) which can be converted to PDF/A.</li>
</ul>
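The sanity monitoring among these tasks can be sketched as a small wrapper around one application instance: a worker is recycled after a crash, or preventively after a fixed number of conversions, long before the application becomes unstable. The class and its limits are illustrative; a real service also needs timeouts, session isolation and the UI "robot" described above.

```python
class OfficeWorker:
    """Tracks one application instance and decides when to restart it."""

    def __init__(self, max_jobs=100):
        self.max_jobs = max_jobs  # preventive recycling limit
        self.jobs_done = 0
        self.healthy = True

    def convert(self, convert_fn, document):
        try:
            result = convert_fn(document)
        except Exception:
            self.healthy = False  # crash: recycle before the next job
            raise
        self.jobs_done += 1
        if self.jobs_done >= self.max_jobs:
            self.healthy = False  # preventive restart to avoid instability
        return result

    def needs_restart(self):
        return not self.healthy
```

A dispatcher would keep a pool of such workers, feed each one conversion jobs in parallel sessions, and replace any worker whose needs_restart() returns True.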
<div>
The conversion is not the only function which such a service can provide. Once the service gets hold of a document, it can perform some additional, very useful post-processing steps. Here are some of them:</div>
<div>
<ul>
<li>Merging documents</li>
<li>Applying digital signatures</li>
<li>Embedding XML data in electronic invoice documents (ZUGFeRD standard)</li>
<li>Embedding XMP metadata</li>
<li>Stamping</li>
</ul>
<div>
If you found this article useful, please let me know and post a comment.</div>
</div>