
4 How can I unit test a Python function that draws PDF graphics?

I’m writing a CAD application that outputs PDF files using the Cairo graphics library. Much of the unit testing, such as computing the expected bounding boxes of the objects, does not require actually generating the PDF files. However, I want to make sure that the generated PDF files “look” correct after I change the code.

Is there an automated way to do this? How can I automate as much as possible? Do I need to visually inspect each generated PDF? How can I solve this problem without pulling my hair out?

4.1 Answer

I’m doing the same thing using a shell script on Linux that wraps:

ImageMagick’s compare command
the pdftk utility
Ghostscript (optionally)

(It would be rather easy to port this to a .bat Batch file for DOS/Windows.)

I have a few reference PDFs created by my application which are “known good”. Newly generated PDFs after code changes are compared to these reference PDFs. The comparison is done pixel by pixel and is saved as a new PDF. In this PDF, all unchanged pixels are painted in white, while all differing pixels are painted in red.

This method utilizes three different building blocks: pdftk, compare (part of ImageMagick) and Ghostscript.

pdftk

Use this command to split multipage PDF files into multiple singlepage PDFs:

    pdftk reference.pdf  burst output somewhere/reference_page_%03d.pdf
    pdftk comparison.pdf burst output somewhere/comparison_page_%03d.pdf
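
If several documents have to be split on every test run, the two commands above can be wrapped in a small shell function. The following is only a sketch: the function name `burst_pdf` and its three arguments are made up for illustration, and it assumes that `pdftk` is available on the PATH.

```shell
# burst_pdf INPUT OUTDIR PREFIX   (hypothetical helper, not part of pdftk)
# Splits INPUT into OUTDIR/PREFIX_page_001.pdf, OUTDIR/PREFIX_page_002.pdf, ...
burst_pdf() {
    input=$1
    outdir=$2
    prefix=$3
    # Make sure the target directory exists before pdftk writes into it.
    mkdir -p "$outdir" || return 1
    # Same invocation as above, with the file names parameterized.
    pdftk "$input" burst output "$outdir/${prefix}_page_%03d.pdf"
}
```

It could then be called as `burst_pdf reference.pdf somewhere reference` and `burst_pdf comparison.pdf somewhere comparison`.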

compare

Use this command to create a “diff” PDF page for each of the pages:

   compare                                  \
         -verbose                           \
         -debug coder -log "%u %m:%l %e"    \
          somewhere/reference_page_001.pdf  \
          somewhere/comparison_page_001.pdf \
         -compose src                       \
          somewhereelse/reference_diff_page_001.pdf
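
To run the comparison over every page rather than only page 001, a loop along the following lines may help. It is a sketch under some assumptions: the function name `diff_pages` is hypothetical, the file names follow the `reference_page_NNN.pdf` / `comparison_page_NNN.pdf` pattern used above, and `compare`, when called with `-metric AE`, is assumed to exit with status 0 for matching pages and with a non-zero status otherwise.

```shell
# diff_pages REFDIR CMPDIR DIFFDIR   (hypothetical helper)
# Writes one diff PDF per page into DIFFDIR and prints the name of every
# reference page whose comparison page differs or could not be compared.
# Returns 0 only if no page was reported.
diff_pages() {
    refdir=$1
    cmpdir=$2
    diffdir=$3
    failed=0
    mkdir -p "$diffdir" || return 2
    for ref in "$refdir"/reference_page_*.pdf; do
        [ -e "$ref" ] || continue              # glob matched nothing
        page=${ref##*/reference_page_}         # e.g. "001.pdf"
        other="$cmpdir/comparison_page_$page"
        out="$diffdir/reference_diff_page_$page"
        # Assumption: exit status 0 means "no differing pixels".
        # -metric AE reports the differing-pixel count on stderr; it is discarded here.
        if ! compare -metric AE "$ref" "$other" -compose src "$out" 2>/dev/null; then
            echo "$ref"
            failed=1
        fi
    done
    return $failed
}
```

A test script could call `diff_pages somewhere somewhere somewhereelse` and treat a non-zero return value as a failure; the pages listed on standard output are the ones worth inspecting by eye.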

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for file comparisons based on MD5 hashes.

If you want to automatically discover all cases which consist of purely white pages, you could also convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (reference and comparison), or for the diff PDF pages:

    gs                               \
      -o reference_diff_page_001.bmp \
      -r72                           \
      -g595x842                      \
      -sDEVICE=bmp256                \
       reference_diff_page_001.pdf

    md5sum reference_diff_page_001.bmp

If the MD5 sum matches the one you expect for an all-white page of 595x842 PostScript points, then your unit test has passed.
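
One way to avoid hard-coding the expected hash is to keep a known all-white bitmap, rendered once with the same Ghostscript settings, and compare hashes against it. The sketch below assumes such a file exists (called `blank_page.bmp` in the usage note; the name is hypothetical), and the helper names `md5_of` and `same_md5` are made up for illustration.

```shell
# md5_of FILE: prints only the hash column of md5sum's output.
md5_of() {
    md5sum "$1" | cut -d' ' -f1
}

# same_md5 EXPECTED_FILE ACTUAL_FILE   (hypothetical helper)
# Succeeds only if both files could be hashed and the hashes are equal.
same_md5() {
    expected=$(md5_of "$1")
    actual=$(md5_of "$2")
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

With these in place, `same_md5 blank_page.bmp reference_diff_page_001.bmp` succeeds when the diff page is entirely white, so its exit status can serve directly as the test result.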
