Embedding fonts in PDF - a never-ending struggle

I have been collecting bad PDFs since the Reference Manual 1.0 was published in 1993, and today I can draw on a database of more than 100,000 real-world PDF files with all kinds of faults in them. The vast majority of the problems, however, are related to fonts. But why does dealing with fonts in PDF files turn out to be so troublesome?

I suspect that dealing with fonts is difficult because a developer has to digest a sickening amount of documentation before he or she can write PDF producer software that handles fonts correctly, especially embedded ones. First, one must understand simple and composite fonts and the various mechanisms of encoding and glyph selection for symbolic and non-symbolic fonts, which are completely independent of the mechanisms for text extraction and Unicode mapping. Then, one must understand the internal structure of Type 1, CFF, TrueType, and OpenType font programs. And finally, one must know the secrets of correctly building font subsets of all these types. This is not easy at all, and real-world PDF files reveal all kinds of misunderstandings of the basic concepts.
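To illustrate why glyph selection and text extraction can disagree, here is a minimal toy model in Python. It is only a sketch: the two tables stand in for a simple font's /Encoding and /ToUnicode entries, and all names (encoding, to_unicode, glyphs_shown, text_extracted) are hypothetical and do not belong to any real PDF library.

```python
# Toy model of a simple (single-byte) PDF font.
# Hypothetical names throughout; not the API of any real PDF library.

# Glyph selection: character code -> glyph name (the role of /Encoding).
encoding = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}

# Text extraction: character code -> Unicode (the role of /ToUnicode).
# Deliberately faulty: the entry for code 0x04 is missing.
to_unicode = {0x01: "H", 0x02: "e", 0x03: "l"}


def glyphs_shown(codes: bytes) -> list[str]:
    """Glyph names a viewer would paint for the given character codes."""
    return [encoding.get(c, ".notdef") for c in codes]


def text_extracted(codes: bytes) -> str:
    """Text a consumer would extract; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


content = bytes([0x01, 0x02, 0x03, 0x03, 0x04])
print(glyphs_shown(content))    # the page looks fine
print(text_extracted(content))  # but the last character is lost on extraction
```

The point of the sketch is that the two lookups never consult each other, so a file can display perfectly and still yield garbage when searched or copied.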

In my experience, training people did not really help, because training doesn't make the font data structures any easier to understand. And the PDF standard cannot be changed to simplify font handling, because it must stay compatible with existing PDF files.

During my quest for a solution, I found that most problems related to fonts and font embedding occur in documents with Latin character sets. Similar problems occur much less often in fonts with Asian character sets. How can that be? One reason might be that we have more Latin files in our database. Another reason could be, however, that the PDF standard specifies predefined CMaps only for Asian but not for Latin character sets. I think that a predefined CMap for Latin character sets would significantly simplify glyph selection and Unicode mapping for almost all languages used in the Americas and Europe. It would also help OCR applications create invisible text.
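To give an idea of the work a producer has to do today for every Latin font, here is a rough Python sketch that writes a ToUnicode CMap for one-byte character codes. The function name to_unicode_cmap and the sample mapping are assumptions made for illustration, and the output is a simplified form of the usual CMap boilerplate, not a validated implementation.

```python
def to_unicode_cmap(mapping: dict[int, str]) -> str:
    """Return the text of a ToUnicode CMap for one-byte character codes.

    Hypothetical helper for illustration; mapping is code -> Unicode string.
    """
    items = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # The CMap format allows at most 100 entries per bfchar block,
    # so the entries are emitted in chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination strings are written as UTF-16BE hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Sample mapping (assumed for this example): A, Euro sign, e with acute.
cmap = to_unicode_cmap({0x41: "A", 0x80: "\u20ac", 0xE9: "\u00e9"})
print(cmap)
```

With a predefined Latin CMap, a producer could refer to a well-known name instead of generating and embedding a stream like this for each font, which is where many of the mistakes creep in.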

What do you think? Please post a comment. I would be glad to hear your opinion! Or, if I can help you with a specific font issue, just let me know.