8 How can I extract embedded fonts from a PDF as valid font files?

I’m aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not.

Now the problem: given that I have PDF files with embedded fonts – how can I extract those fonts in such a way that they are re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?

You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include the full, complete font face when they have a font embedded. Mostly they include just the subset of glyphs used in the document.

8.1 Method 1: Using `pdftops`

One of the most frequently used methods to do this on *nix systems consists of the following steps:

Convert the PDF to PostScript, for example by using XPDF’s pdftops helper program (on Windows: pdftops.exe).
The fonts will now be embedded in .pfa (PostScript ASCII) format, and you can extract them using a text editor.
You may need to convert the .pfa (ASCII) file to a .pfb (binary) file, using a tool such as t1binary from the t1utils package, or pfa2pfb.
PDFs never embed .pfm or .afm files (font metric files), because PDF viewers have internal knowledge about these. Without them, font files are hardly usable in a visually pleasing way.
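The text-editor step can also be scripted. The following is a minimal sketch, not a definitive implementation. It assumes the PostScript file wraps each embedded font in the DSC comments `%%BeginResource: font <name>` and `%%EndResource`; files that do not follow this convention simply yield no matches. The function names are made up for illustration.

```python
import re

# Assumption: each embedded font sits between the DSC comments
#   %%BeginResource: font <name>   and   %%EndResource
FONT_RESOURCE = re.compile(
    r"^%%BeginResource: font (\S+)[^\n]*\n(.*?)^%%EndResource",
    re.DOTALL | re.MULTILINE,
)


def extract_font_resources(ps_text):
    """Return {font_name: font_program_text} for every font resource found."""
    return {m.group(1): m.group(2) for m in FONT_RESOURCE.finditer(ps_text)}


def looks_like_type1(font_text):
    """Heuristic check for the usual first line of an ASCII Type 1 (.pfa) font."""
    return font_text.lstrip().startswith(("%!PS-AdobeFont", "%!FontType1"))
```

To use it, read the PostScript file with `open(path, encoding="latin-1").read()` (so that stray binary bytes do not raise decoding errors), pass the text to `extract_font_resources`, and write each value that passes `looks_like_type1` to its own `.pfa` file.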

8.2 Method 2: Using `fontforge`

Another method is to use the Free font editor FontForge:

Bring up the “Open Font” dialog box, which is used when opening files.
Then select “Extract from PDF” in the filter section of the dialog.
Select the PDF file with the font to be extracted.
A “Pick a font” dialog box opens – select here which font to open.

Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.

8.3 Method 3: Using `mupdf`

Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don’t know about MuPDF, which is still relatively new and unknown: “MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.” It is written by developers at Artifex Software, the same company that gave us Ghostscript.)
<sub>(The better known SumatraPDF (Windows only) program is based on MuPDF, and it also ships with pdfextract.exe.)</sub>

Note: pdfextract.exe is a command-line program. To use it, do the following:

    c:\>  pdfextract.exe  c:\path\to\filename.pdf         # (on Windows)
    $>    pdfextract  /path/to/filename.pdf               # (on Linux, Unix, Mac OS X)

This command will dump all of the extractable files from the referenced PDF file into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The font names will be like FGETYK+LinLibertineI-0966.ttf if the font’s PDF object number was 966.
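If you need to map the dumped files back to PDF objects, the naming scheme can be parsed mechanically. This is a small sketch that assumes file names follow the two examples given above (a stem, a hyphen, a zero-padded object number, an extension). The function name and the extension sets are illustrative assumptions, and other MuPDF versions may name files differently.

```python
import re

# Assumed pattern, derived only from the two example names in the text.
EXTRACTED_NAME = re.compile(r"^(?P<stem>.+)-(?P<objnum>\d+)\.(?P<ext>[A-Za-z0-9]+)$")

# Extensions mentioned in the text; anything else is reported as "other".
FONT_EXTENSIONS = {"ttf", "cff", "cid"}
IMAGE_EXTENSIONS = {"png"}


def describe_extracted_file(filename):
    """Return name, PDF object number and kind for a dumped file, or None."""
    match = EXTRACTED_NAME.match(filename)
    if match is None:
        return None
    ext = match.group("ext").lower()
    if ext in FONT_EXTENSIONS:
        kind = "font"
    elif ext in IMAGE_EXTENSIONS:
        kind = "image"
    else:
        kind = "other"
    return {
        "name": match.group("stem"),
        "object_number": int(match.group("objnum")),
        "kind": kind,
    }
```

Files whose names do not fit the pattern return `None`, so the function can be applied to every entry of `os.listdir(".")` without special-casing unrelated files.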

CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.

Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
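A quick way to spot subsets: the PDF specification names a subset font with a tag of six uppercase letters followed by a plus sign, as in FGETYK+LinLibertineI. The sketch below (the function name is made up for illustration) splits such a tag off a font name. Note that a missing tag does not prove that the embedded font is complete.

```python
import re

# Subset tag: exactly six uppercase letters, then "+", then the font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when no subset tag is present."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name
```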

Update: (Jul 2013) Recent versions of mupdf have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a ‘Swiss army knife’-like binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. These support the sub-commands info, clean, extract, poster and show. Unfortunately, the official documentation for these tools isn’t up to date (yet). If you’re on a Mac using ‘MacPorts’, the utility was renamed in order to avoid clashes with other utilities using identical names, and you may need to use mupdfextract.

To achieve (roughly) the same results with mutool as the earlier pdfextract tool produced, just run mutool extract ...

So to extract fonts and images, you may need to run one of the following command lines.

On Windows:

    c:\>  mutool.exe extract filename.pdf

On Linux, Unix, Mac OS X:

    $>    mutool     extract filename.pdf
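Because the binary has been renamed so often, a script may want to probe for whichever name is installed. The following is a hedged sketch: the candidate list only contains the names mentioned in this chapter, the assumption that mupdfextract takes no sub-command (like the old pdfextract) is mine, and the function names are made up for illustration.

```python
import os
import shutil
import subprocess

# Tool names from the text, newest first, with the arguments each one is
# assumed to need before the PDF path.
CANDIDATES = [
    ("mutool", ["extract"]),
    ("mubusy", ["extract"]),
    ("pdfextract", []),
    ("mupdfextract", []),  # MacPorts name; assumed to behave like pdfextract
]


def build_extract_command(pdf_path, which=shutil.which):
    """Return the argument list for the first tool found on PATH, or None."""
    for tool, args in CANDIDATES:
        exe = which(tool)
        if exe:
            return [exe, *args, pdf_path]
    return None


def extract_into(pdf_path, out_dir):
    """Run the extraction with out_dir as the current directory."""
    cmd = build_extract_command(os.path.abspath(pdf_path))
    if cmd is None:
        raise FileNotFoundError("no MuPDF extraction tool found on PATH")
    # The tool dumps its files into the current directory, hence cwd=out_dir.
    subprocess.run(cmd, cwd=out_dir, check=True)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use `extract_into("filename.pdf", "some/output/dir")` is all that is needed.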

8.4 Method 4: Using `gs` (Ghostscript)

Finally, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.

To use it, you need to run both this file, extractFonts.ps, and your PDF file. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the ‘forward slash’, /, as a path separator on Windows too!):

   gswin32c.exe                  ^
     -q -dNODISPLAY              ^
      c:/path/to/extractFonts.ps ^
     -c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"

or on Linux, Unix or Mac OS X:

   gs                          \
     -q -dNODISPLAY            \
      /path/to/extractFonts.ps \
     -c "(/path/to/your/PDFFile.pdf) extractFonts quit"
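When the command is assembled by a script, the PDF path ends up inside a PostScript string literal, so parentheses and backslashes in the path need escaping. Here is a minimal sketch of building the same argument list in Python; the function names are made up for illustration, and the paths are placeholders just as in the commands above.

```python
import subprocess


def ps_string(text):
    """Quote text as a PostScript string literal, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def build_gs_command(pdf_path, extract_fonts_ps, gs="gs"):
    """Return the Ghostscript argument list shown in the text."""
    return [
        gs,
        "-q",
        "-dNODISPLAY",
        extract_fonts_ps,
        "-c",
        ps_string(pdf_path) + " extractFonts quit",
    ]


def run_extract_fonts(pdf_path, extract_fonts_ps, gs="gs"):
    """Run Ghostscript; pass gs="gswin32c.exe" on Windows."""
    subprocess.run(build_gs_command(pdf_path, extract_fonts_ps, gs), check=True)
```

Passing the arguments as a list avoids a second layer of shell quoting around the `-c` argument.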

I tested the Ghostscript method a few years ago. At the time it extracted *.ttf (TrueType) fonts just fine. I don’t know whether other font types will be extracted at all, and if so, whether in a re-usable way. Nor do I know whether the utility blocks the extraction of fonts which are marked as protected.

8.5 Caveats:

In any case you need to follow the license that applies to the font. Some font licenses do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.
Most PDFs in the wild do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all.

Please do also read the following about Pros and (more) Cons regarding font extraction efforts:

http://typophile.com/node/34377
