Font subsetting - how it works and when to use it

To reduce file size, PDF producers use a technique called font subsetting. What exactly happens to the fonts, and what are the consequences?

When PDF creator software adds text to a page description, the text refers to a font. The font contains a collection of glyphs (descriptions of the graphic appearance of characters), metrics and other information needed to render the text. Several font formats can be used in a PDF document, such as Type 1, TrueType, CFF and OpenType.

Fonts do not need to be embedded in a PDF file. If they are not embedded, they are referred to by name and must be available on the system that reproduces the document for it to render correctly. To guarantee that the fonts are always available, the PDF creator software can embed them as binary streams in the file. The size of the embedded font files can be reduced by removing all information that is not needed to render the document correctly. The PDF specification lists exactly which font parts are required. In addition, a font file can be shrunk by keeping only the character descriptions that are actually referred to by the text objects. This removal of unneeded character descriptions is called subsetting.
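The basic idea can be illustrated with a minimal Python sketch. It is a toy model, not a real font parser: the font is assumed to be a plain dictionary from character to glyph description, the byte strings are placeholders, and all names (toy_font, used_characters, subset) are hypothetical.

```python
# Toy model: a "font" maps each character to its glyph description.
# The byte strings are placeholders, not real outline data.
toy_font = {
    "H": b"\x01" * 120,
    "e": b"\x02" * 90,
    "l": b"\x03" * 40,
    "o": b"\x04" * 85,
    "x": b"\x05" * 110,
    "z": b"\x06" * 95,
}


def used_characters(text_objects):
    """Collect every character referred to by the document's text objects."""
    return {ch for text in text_objects for ch in text}


def subset(font, used):
    """Keep only the glyph descriptions that the text actually refers to."""
    return {ch: desc for ch, desc in font.items() if ch in used}


def total_size(font):
    """Sum of the sizes of all glyph descriptions, in bytes."""
    return sum(len(desc) for desc in font.values())


text_objects = ["Hello", "Hell"]
subset_font = subset(toy_font, used_characters(text_objects))

print(sorted(subset_font))                             # ['H', 'e', 'l', 'o']
print(total_size(toy_font), "->", total_size(subset_font))  # 540 -> 335
```

A real subsetter must also keep the tables and entries that the font format requires regardless of the text. This sketch ignores them.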

Fonts with PostScript outlines (Type 1, CFF, CFF-based OpenType) can be reduced by simply removing the unused charstrings, because their glyphs are selected by unique glyph names or character identifiers (CIDs). For TrueType outlines there are different ways to perform the subsetting. One option is to remove the unused glyphs completely; the other is to remove only their outlines and keep empty glyphs. The advantage of the first option is that the resulting font is smaller. The advantage of the second is that the glyph identifiers (GIDs) do not change, which matters because TrueType glyphs are selected by their GID. With the first option, the font encoding tables (cmap) or the CIDToGIDMap data structure must be adapted to reflect the changes in the glyph numbering. There are also some special cases.
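The two TrueType options can be compared in a short Python sketch. It assumes a toy glyph table that is simply a list indexed by GID, and a hypothetical dictionary standing in for the mapping from character identifiers to glyph identifiers. It also assumes that GID 0 (the missing-glyph entry) is always kept. Real glyph and mapping tables are binary structures and are not modeled here.

```python
# Toy glyph table indexed by GID; the entries are placeholder outlines.
glyf = [b"notdef", b"A-outline", b"B-outline", b"C-outline", b"D-outline"]

# Hypothetical mapping from character identifier to GID.
cid_to_gid = {65: 1, 66: 2, 67: 3, 68: 4}

# GIDs referred to by the text; GID 0 is assumed to be kept in any case.
used_gids = {0, 2, 4}


def subset_keep_gids(glyf, used):
    """Keep every slot but empty the outlines of unused glyphs.

    The glyph numbering is unchanged, so no mapping has to be adapted.
    """
    return [outline if gid in used else b"" for gid, outline in enumerate(glyf)]


def subset_renumber(glyf, used, cid_to_gid):
    """Remove unused glyphs completely and renumber the remaining ones.

    Returns the compacted glyph table and the adapted mapping.
    """
    kept = sorted(used)
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_glyf = [glyf[old] for old in kept]
    new_map = {
        cid: old_to_new[gid]
        for cid, gid in cid_to_gid.items()
        if gid in old_to_new
    }
    return new_glyf, new_map


print(subset_keep_gids(glyf, used_gids))
# [b'notdef', b'', b'B-outline', b'', b'D-outline']

new_glyf, new_map = subset_renumber(glyf, used_gids, cid_to_gid)
print(new_glyf)  # [b'notdef', b'B-outline', b'D-outline']
print(new_map)   # {66: 1, 68: 2}
```

The first function leaves two empty slots in the table but keeps every existing reference valid. The second yields a shorter table, at the price of having to rewrite every structure that stores a GID.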

Font programs may contain compound glyphs; for example, the glyph 'ä' may refer to two separate glyphs, 'a' and '¨'. When such a font is subset, the compound glyph descriptions must be updated if the glyph numbering changes, and the glyphs they refer to must not be removed even if the text in the document does not refer to them directly. Other information such as the encoding tables (cmap), font metrics (head, hhea, hmtx) and instructions (prep, fpgm, cvt) may also refer to glyph numbers and must be updated accordingly.
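The handling of compound glyphs amounts to computing a transitive closure over the component references before anything is removed. The following Python sketch shows this under simplifying assumptions: the font is reduced to a hypothetical dictionary that maps each GID to the GIDs of its components, and the function names are illustrative only.

```python
# Toy model: each GID maps to the list of GIDs it refers to as components.
components = {
    0: [],       # missing-glyph entry
    1: [],       # 'b'
    2: [],       # 'a'
    3: [],       # dieresis
    4: [2, 3],   # 'ä', composed of 'a' and the dieresis
}


def glyph_closure(used, components):
    """Return the used glyphs plus every glyph they refer to, transitively."""
    keep = set()
    stack = list(used)
    while stack:
        gid = stack.pop()
        if gid in keep:
            continue
        keep.add(gid)
        stack.extend(components.get(gid, []))
    return keep


def renumber_components(keep, components):
    """Renumber the kept glyphs and update the component references."""
    old_to_new = {old: new for new, old in enumerate(sorted(keep))}
    return {
        old_to_new[gid]: [old_to_new[c] for c in components[gid]]
        for gid in sorted(keep)
    }


# The text only uses 'ä' (GID 4); GID 0 is assumed to be kept as well.
keep = glyph_closure({0, 4}, components)
print(sorted(keep))                           # [0, 2, 3, 4]
print(renumber_components(keep, components))  # {0: [], 1: [], 2: [], 3: [1, 2]}
```

Although the text never refers to 'a' or the dieresis directly, both survive the subsetting, and the references inside 'ä' are rewritten from GIDs 2 and 3 to GIDs 1 and 2.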

Subsetting hampers the editing of PDF documents, because a subset font contains only the glyphs for the text that was present when the file was created. In interactive forms in particular, the fonts used to fill out form fields must not be subset.
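The effect on form fields can be shown with a small Python sketch. It assumes that the set of characters remaining in the embedded font is already known; the function name missing_characters and the sample strings are hypothetical.

```python
def missing_characters(embedded_chars, new_text):
    """Return the characters of new_text that the embedded font cannot render."""
    return sorted(set(new_text) - set(embedded_chars))


# Hypothetical subset: only the characters of the original label were kept.
subset_chars = set("Name:")

# A user types a value into the form field.
print(missing_characters(subset_chars, "Anna"))  # ['A', 'n']
```

With the complete font embedded, the same check would return an empty list for any text the font supports.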

Subsetting fonts is a complex and error-prone task. The majority of bad real-world PDF files contain malformed embedded fonts produced by faulty subsetting algorithms.