I 100 Tips and Tricks

1 Downloading the tools

My own preferred work environments are Linux and Mac OS X. However, most of the methods explained in the following chapters can be applied to Windows too. Readers will benefit most from this book if they reproduce and play with some of the example commandlines given. To do that they should install the following tools.

1.1 Windows

These are the preferred download sites. They are the ones offered by the respective developers themselves. Do not ever use any other, third-party download links, unless you really know what you are doing!

A Warning Note about Downloading from Sourceforge

Sourceforge.net used to be a resource that was very useful for the Open Source community. However, in recent years the site has become more and more overloaded with ads. Some of the more recent stories on the Internet even suggest that the site may be poisoned with links that lead you to third party ‘drive-by’ malware download sites.

Unfortunately, some of the tools recommended in this eBook (such as qpdf) still host their source code or their Windows binaries on this platform. I’m hoping the developers of these tools will find some other hosting soon…

:-(

1.2 Mac OS X

My recommendation is to use the ‘MacPorts’ framework for installing additional software packages:

After you have put MacPorts in place, open a Terminal.app window and start installing the packages by typing:

    sudo port -p install ghostscript poppler ImageMagick coreutils qpdf pdftk

Be aware that this may take a while. Ghostscript depends on additional packages for its functionality, like libpng, jpeg, tiff, zlib and more. The same applies for the other tools. The port command downloads, compiles and installs all these dependencies automatically, so this may take quite a while…

1.3 Linux

Debian, Ubuntu, …

You can run this command in a terminal window:

    sudo apt-get install ghostscript poppler-utils imagemagick coreutils qpdf

Or you use the package manager of your choice to find and install the packages…

RedHat, Fedora

Command in a terminal window:

    sudo yum install ghostscript poppler-utils ImageMagick coreutils qpdf

Slackware

Command in a terminal window:

    TODO

1.4 Documentation

    TODO

2 How can I convert PCL to PDF?

Is it possible to use Ghostscript for converting PCL print files to PDF format? If so, how?

2.1 Answer

No, it’s not possible with Ghostscript itself…

But yes, it’s very well possible with another cool piece of workhorse software from the same barn: its name is GhostPCL.

Ghostscript developers in recent years integrated their sister products GhostXPS, GhostPCL and GhostSVG into their main Ghostscript source code tree, which switched from Subversion to Git some time ago. The complete family of products is now called GhostPDL. So all of these additional functionalities (load, render and convert XPS, PCL and SVG) are now available from there.

Previously, though GhostPCL was available as a source code tarball, it was hard to find on the ‘net. The major Linux distributions (Debian, Ubuntu, RedHat, Fedora, OpenSUSE, …) don’t provide packages for their users either. On MacPorts it is missing too.

This means you have to build the programs yourself from the sources. You could even compile the so-called language switching binary, pspcl6. This binary, in theory, can consume PCL, PDF and PostScript and convert this input to a host of other formats. Just run make ls-product in the top level Ghostscript source directory in order to build it. The resulting binary will end up in the ./language-switch/obj/ subdirectory. Run make ls-install in order to install it.

WARNING: While it worked for me whenever I needed it, the Ghostscript developers recommend against using the language switching binary (it is ‘almost non-supported’, as they say, and it will possibly go away in the future).

Instead they recommend using the explicit binaries:

  • pcl6 or pcl6.exe for PCL input,
  • gsvg or gsvg.exe for SVG input (also ‘almost non-supported’ but it may work better than the language switching binary pspcl6) and
  • gxps or gxps.exe for XPS input (support status unclear to me).

So for ‘converting PCL code to PDF format’, as the request reads, you could use the pcl6 commandline utility. It is the PCL-consuming sister product to Ghostscript’s gs (Linux, Unix, Mac OS X) and gswin32c.exe/gswin64c.exe (Windows), which consume PostScript and PDF input.

Sample commandline (Windows):

    pcl6.exe            ^
      -o output.pdf     ^
      -sDEVICE=pdfwrite ^
       [...more parameters as required (optional)...] ^
      -f input.pcl

Sample Commandline (Linux, Unix, Mac OS X):

    pcl6                \
      -o output.pdf     \
      -sDEVICE=pdfwrite \
       [...more parameters as required (optional)...] \
      -f input.pcl

Explanation

-o output.pdf
The -o parameter determines the location of the output. In this case it will be the file output.pdf in the current directory (since we did not specify any path prefix for the filename). At the same time, using -o saves us from typing -dBATCH -dNOPAUSE, because -o implicitly sets these two parameters as well.
-sDEVICE=pdfwrite
This parameter determines which kind of output to generate. In our current case it will be PDF. If you wanted to produce a multipage grayscale TIFF with CCITT compression, you would change that to -sDEVICE=tiffg4 (don’t forget to modify the output file name accordingly too: -o output.tif).
-f input.pcl
This parameter determines which file to read as input. In this case it is the file input.pcl in the current directory.
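If you have many PCL files to convert, the single-file command above can be wrapped in a small shell loop. The following is only a sketch under my own assumptions (the function name and the dry-run echo are not part of the original recipe); it prints the pcl6 commands it would run, so you can inspect them before dropping the echo:

```shell
# Hypothetical sketch: batch-convert every *.pcl file in the current
# directory to PDF. The 'echo' makes this a dry run; remove it to
# actually invoke pcl6.
convert_pcl_dir() {
  for f in *.pcl; do
    [ -e "$f" ] || continue          # no *.pcl files matched the glob
    out="${f%.pcl}.pdf"              # derive output name: input.pcl -> input.pdf
    echo pcl6 -o "$out" -sDEVICE=pdfwrite -f "$f"
  done
}
```

In a directory containing a.pcl and b.pcl, convert_pcl_dir prints the two corresponding pcl6 commandlines.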

Update: The Ghostscript website now offers pre-compiled 32-bit binaries for GhostPCL and GhostXPS, at least for Windows users.

See also the hints in *['How can I convert XPS to PDF?'](#convert-xps-to-pdf)*.

TODO! Hint about the GhostPCL licensing. Esp. important: hint about the URW fonts which are not GPL (they require commercial licensing for commercial use).

3 How can I convert XPS to PDF?

How can I convert XPS to PDF? Is this possible with Ghostscript?

3.1 Answer

Ghostscript developers in recent years have integrated a sister product named GhostXPS into their main Ghostscript source code tree, which is based on Git now. (They have also included two other products, named GhostPCL and GhostSVG.) The complete family of products is now called GhostPDL. So all of these additional functionalities (load, render and convert XPS, PCL and SVG) are now available from one location.

Unfortunately, none of the major Linux distributions (Debian, Ubuntu, RedHat, Fedora, OpenSUSE, …) currently provides packages for their users. On MacPorts, GhostXPS is missing too, as are GhostPCL and GhostSVG.

This means you have to build the programs yourself from the sources – unless you are a Windows user. In this case you are lucky: there is a *.zip container on the Ghostscript website, which contains a pre-compiled Win32 binary (which also runs on Windows 64 bit!):

While you are at it and building the code yourself, you could even build the so-called language switching binary. The Makefile has targets prepared for that. This binary can consume PCL, PDF and PostScript, and it converts these input formats to a host of other file types. Just run make ls-product && make ls-install in the top level Ghostscript source directory in order to get it installed.

WARNING: While it worked for me whenever I needed it, the Ghostscript developers recommend against using the language switching binary (it is ‘almost non-supported’, as they say, and it will possibly go away in the future).

Instead they recommend using the explicit binaries, which are also supported as build targets in the Makefile:

  • pcl6 or pcl6.exe for PCL input,
  • gsvg or gsvg.exe for SVG input (also ‘almost non-supported’) and
  • gxps or gxps.exe for XPS input (support status unclear to me).

Sample commandline (Windows):

    gxps.exe            ^
     -o output.pdf      ^
     -sDEVICE=pdfwrite  ^
      [...more parameters as required (optional)...] ^
     -f input.xps

Sample commandline (Linux, Unix, Mac OS X):

    gxps                \
     -o output.pdf      \
     -sDEVICE=pdfwrite  \
      [...more parameters as required (optional)...] \
     -f input.xps

Explanation

-o output.pdf
The -o parameter determines the location of the output. In this case it will be the file output.pdf in the current directory (since we did not specify any path prefix for the filename). At the same time, using -o saves us from typing -dBATCH -dNOPAUSE, because -o implicitly sets these two parameters as well.
-sDEVICE=pdfwrite
This parameter determines which kind of output to generate. In our current case that’s PDF. If you wanted to produce a PostScript level 2 file, you would change that to -sDEVICE=ps2write (don’t forget to modify the output file name accordingly too: -o output.ps).
-f input.xps
This parameter determines which file to read as input. In this case it is the file input.xps in the current directory.

See also the hints in *['How can I convert PCL to PDF?'](#convert-pcl-to-pdf)*.

TODO! Hint about the GhostPCL licensing. Esp. important: hint about the URW fonts which are not GPL (they require commercial licensing for commercial use).

4 How can I unit test a Python function that draws PDF graphics?

I’m writing a CAD application that outputs PDF files using the Cairo graphics library. A lot of the unit testing does not require actually generating the PDF files, such as computing the expected bounding boxes of the objects. However, I want to make sure that the generated PDF files “look” correct after I change the code.

Is there an automated way to do this? How can I automate as much as possible? Do I need to visually inspect each generated PDF? How can I solve this problem without pulling my hair out?

4.1 Answer

I’m doing the same thing using a shell script on Linux that wraps

  1. ImageMagick’s compare command
  2. the pdftk utility
  3. Ghostscript (optionally)

(It would be rather easy to port this to a .bat Batch file for DOS/Windows.)

I have a few reference PDFs created by my application which are “known good”. Newly generated PDFs after code changes are compared to these reference PDFs. The comparison is done pixel by pixel and is saved as a new PDF. In this PDF, all unchanged pixels are painted in white, while all differing pixels are painted in red.

This method utilizes three different building blocks: pdftk, compare (part of ImageMagick) and Ghostscript.

pdftk

Use this command to split multi-page PDF files into multiple single-page PDFs:

    pdftk  reference.pdf  burst  output  somewhere/reference_page_%03d.pdf
    pdftk  comparison.pdf burst  output  somewhere/comparison_page_%03d.pdf

compare

Use this command to create a “diff” PDF page for each of the pages:

    compare                                  \
          -verbose                           \
          -debug coder -log "%u %m:%l %e"    \
           somewhere/reference_page_001.pdf  \
           somewhere/comparison_page_001.pdf \
          -compose src                       \
           somewhereelse/reference_diff_page_001.pdf

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases which consist of purely white pages, you could also convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (reference and comparison), or for the diff-PDF pages:

    gs                               \
      -o reference_diff_page_001.bmp \
      -r72                           \
      -g595x842                      \
      -sDEVICE=bmp256                \
       reference_diff_page_001.pdf

    md5sum reference_diff_page_001.bmp

If the MD5 sum is what you expect for an all-white page of 595x842 PostScript points, then your unit test passed.
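The pass/fail decision itself can be scripted: render one known all-white page to a reference bitmap once, then compare every diff bitmap against it by MD5 sum. A sketch; the function name and file names are my own assumptions, and the logic works on any pair of files:

```shell
# Hypothetical sketch: exit status 0 if the two bitmaps are bit-identical
# (i.e. the diff page is all white and the unit test passes), 1 otherwise.
is_all_white() {
  ref_sum=$(md5sum "$1" | cut -d' ' -f1)   # known all-white reference bitmap
  new_sum=$(md5sum "$2" | cut -d' ' -f1)   # freshly rendered diff bitmap
  [ "$ref_sum" = "$new_sum" ]
}
```

For example: is_all_white all_white_595x842.bmp reference_diff_page_001.bmp && echo "unit test passed" (both file names are assumed).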

5 How can I compare 2 PDFs on the commandline?

I’m looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create diff-PDFs in a batch-process. The PDF files are construction plans, so pure text-compare doesn’t work.

Something like:

    <tool> file1.pdf file2.pdf -o diff-out.pdf

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.

5.1 Answer

What you want can be achieved by using ImageMagick’s compare command. And this will work on all important operating system platforms: Windows, Mac OS X, Linux and various Unix variants.

The basic command is very simple:

    compare  file1.pdf  file2.pdf  delta1.pdf

First, please note: this only works well for PDFs which use the same page/media size.

The comparison is done pixel by pixel between the two input PDFs. In order to get the pixels, the pages are rendered to raster images first, by default using a resolution of 72 ppi (pixels per inch). The resulting file is an image showing the “diff” like this:

  • Each pixel that is identical on each input file becomes white.
  • Each pixel that is different between the two input files is painted in red.
  • The ‘source’ file (the first one named in the command) will, for context, be used to provide a gray-scale background to the diff output.

The above command outputs a PDF file, delta1.pdf. Should you prefer a PNG image or a JPEG image instead of a PDF, simply change the suffix of the ‘delta’ filename:

    compare  file1.pdf  file2.pdf  delta2.png
    compare  file1.pdf  file2.pdf  delta3.jpeg

In some cases the default resolution of 72 ppi used to render the PDF pages may be insufficient to uncover subtle differences. Or, on the contrary, it may over-emphasize differences which are triggered by extremely minimal shifts of individual characters or lines of text caused by some computational rounding of real numbers.

So, if you want to increase the resolution, add the -density NNN parameter to the commandline. To get 720 ppi images, use this:

    compare  -density 720  file1.pdf  file2.pdf  delta4.pdf

Note that increasing the density/resolution also increases processing time and output file sizes accordingly. A 10-fold increase in density leads to a 100-fold increase in the total number of pixels that need to be compared and processed.

All of the above examples only work for 1-page PDF files. For multi-page PDFs you need to add a [N] notation to the file name, where N is the zero-based page number (page 1 is noted as [0], page 2 as [1], page 3 as [2], and so forth). The following compares page 4 of file1.pdf with page 18 of file2.pdf:

    compare  file1.pdf[3]  file2.pdf[17]  delta5.pdf
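Since the bracket index is zero-based, looping over all pages of two equally long PDFs means subtracting 1 from the human page number. A small sketch (the file names and the dry-run echo are my own assumptions):

```shell
# Hypothetical sketch: print one compare command per page; page N of the
# document is index N-1 in ImageMagick's bracket notation.
compare_pages() {
  pages=$1                           # number of pages in both PDFs
  p=1
  while [ "$p" -le "$pages" ]; do
    i=$((p - 1))                     # zero-based page index
    echo compare "file1.pdf[$i]" "file2.pdf[$i]" "delta_page_${p}.pdf"
    p=$((p + 1))
  done
}
```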

If you do not want the gray-scale background created from the source file, use a modified command:

    compare  file1.pdf  file2.pdf  -compose src  delta1.pdf

This modification changes the output to purely red and white: all pixels which differ between the two input files are red, identical pixels are white.

In case you do not like the red and white default colors to visualize the pixel differences, you can add the following commandline parameters:

  • -highlight-color blue (change default color for pixel differences from ‘red’ to ‘blue’)
  • -lowlight-color yellow (change default color for identical pixels from ‘white’ to ‘yellow’)

or any other color combination you desire. Besides color names, you can also use #RRGGBB values for RGB shades.

Note that ImageMagick’s compare command does not process the PDF input files directly. compare was originally designed to process raster images only. You can easily test this by replacing the PDFs in the above commands with some image files; just make sure that the files are ‘similar enough’ to give sensible results, and also ensure that the compared images have the same dimensions in width and height.

To process PDFs, ImageMagick needs to resort to Ghostscript as its ‘delegate’ program for PDF input. Ghostscript gets called behind the curtains by compare in order to create the raster files on which compare then works its magic.

To see the exact commandline parameters that ImageMagick uses for the Ghostscript call, just add the -verbose parameter to the compare commands. The output on the terminal/console will then be much more verbose and reveal what you want to know.

Examples

I’m using this very same method, for example, to discover minimal page display differences when font substitution comes into play during PDF processing.

It can easily be the case that there is no visible difference between two PDFs, even though they are completely different in MD5 hash, file size or internal PDF code structure. In this case the delta1.pdf output page from the above command would come out all-white. You can automatically detect this condition and delete the all-white delta files, so you only have to visually investigate the non-white PDFs.

To give you a more visual impression about the way this comparison works, I’ve constructed a few different input files. I used Ghostscript to do this. (The exact commands I used are documented at the end of this chapter.)

Example 1

The following image shows two PDF pages side by side. Most people will notice the differences between these two pages from a quick look:

Two PDF pages which do differ – differences can be spotted by looking twice…

Now use the following commands to create a few different visualizations of the ‘deltas’:

    compare  file1.pdf  file2.pdf  delta1.png                 # default resolution, 72 ppi
    compare  file1.pdf  file2.pdf  -compose src  delta2.png   # default resolution, 72 ppi
    compare  -density 720  file1.pdf  file2.pdf  delta3.png   # resolution of 720 ppi
    compare  -density 720  file1.pdf  file2.pdf  -compose src  delta4.png   # 720 ppi

The resulting ‘delta’ images are shown in the following picture.

Four different visualizations of differences. The top two use a 72 ppi resolution, the bottom two a 720 ppi resolution. The 2nd and the 4th do not show a grayscale context background, but only white and red pixels.

As you can easily see, the 72 ppi-based comparison of the two input PDFs shows clearly visible ‘pixelization’ in the results (top two images). Zoom in to see this in more detail. The 720 ppi version comes out much more smoothly. However, for this specific case 72 ppi would be ‘good enough’ to discover that the two PDFs use a ‘0’ (number zero) instead of an ‘O’ (capital letter O) at two different spots.

Example 2

The following image shows two other PDF pages side by side. Hardly anybody will be able to spot the differences between these two at a first glance:

Two PDF pages which do differ – differences can only be spotted by looking very closely.

Now use the following commands to create a few different visualizations of the ‘deltas’:

    compare                file3.pdf  file4.pdf                delta5.pdf
    compare                file3.pdf  file4.pdf  -compose src  delta6.pdf
    compare  -density 720  file3.pdf  file4.pdf                delta7.pdf
    compare  -density 720  file3.pdf  file4.pdf  -compose src  delta8.pdf

The resulting differences are shown in the following picture.

Four different ways to visualize the differences between the last two input files. Again a 72 ppi resolution for the top two and a 720 ppi resolution for the bottom ones. The 1st and the 3rd do show a grayscale context background, the others do not. Please zoom in to spot the finer pixel differences between the different resolutions…

Again, the 72 ppi-based comparison of the two input PDFs shows clearly visible ‘pixelization’ in the results (top two images). The 720 ppi version shows the differences much more clearly: the text is simply shifted slightly to the right and to the top in the second input. If you zoom in far enough into the 720 ppi versions, you can even count the pixels: the shift for each single character of the text is consistently 5 pixels to the right and 5 pixels to the top. The 72 ppi version cannot bring out this subtle difference so clearly: at this resolution the shift is only 1/2 pixel to the right and 1/2 pixel to the top. This means that for some characters no shift occurs at all, while other characters move by a full pixel in either direction. This becomes clearly visible in the fact that some characters do not look changed at all while others clearly do.

Example 3

The following image shows two other PDF documents. Can you spot the difference?

Two PDF documents which do differ. Try to spot the difference!

Creating visualizations in red/white pixels will give the following results.

Four different ways to visualize the differences between the last two input files. Again a 72 ppi resolution for the top two and a 720 ppi resolution for the bottom ones. The 1st and the 3rd do show a grayscale context background, the others do not…

If you have access to the original delta files and zoom in on no. 3, you can clearly see that the second document contains a changed price: it goes up by US$ 2,000, because the original ‘6’ was changed to an ‘8’.

Update

For those of you who want to reproduce the commands shown above, you’d also need access to the same source files I used. That’s easy: I used Ghostscript to create these example input PDFs. Here are the commands for this:

    gs                                            \
       -o file1.pdf                               \
       -sDEVICE=pdfwrite                          \
       -g5950x1100                                \
       -c "/Courier findfont 72 scalefont setfont \
           30   30   moveto (HELL0, WORLD\!) show \
           showpage"

    gs                                            \
       -o file2.pdf                               \
       -sDEVICE=pdfwrite                          \
       -g5950x1100                                \
       -c "/Courier findfont 72 scalefont setfont \
           30   30   moveto (HELLO, W0RLD\!) show \
           showpage"

    gs                                            \
       -o file3.pdf                               \
       -sDEVICE=pdfwrite                          \
       -g5950x1100                                \
       -c "/Courier findfont 72 scalefont setfont \
           30   30   moveto (Hi, Universe\!) show \
           showpage"

    gs                                            \
       -o file4.pdf                               \
       -sDEVICE=pdfwrite                          \
       -g5950x1100                                \
       -c "/Courier findfont 72 scalefont setfont \
           30.5 30.5 moveto (Hi, Universe\!) show \
           showpage"

6 How can I remove white margins from PDF pages?

I would like to know a way to remove white margins from a PDF file. Just like Adobe Acrobat X Pro does. I understand it will not work with every PDF file.

I would guess that the way to do it is by finding the text margins, then cropping to those margins.

PyPdf is preferred.

iText finds text margins based on this code:

    public void addMarginRectangle(String src, String dest)
        throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        TextMarginFinder finder;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            finder = parser.processContent(i, new TextMarginFinder());
            PdfContentByte cb = stamper.getOverContent(i);
            cb.rectangle(finder.getLlx(), finder.getLly(),
                finder.getWidth(), finder.getHeight());
            cb.stroke();
        }
        stamper.close();
    }

6.1 Answer

I’m not too familiar with PyPDF, but I know Ghostscript will be able to do this for you. Here are links to some other answers on similar questions:

  1. Convert PDF 2 sides per page to 1 side per page (SuperUser.com)
  2. Freeware to split a pdf’s pages down the middle? (SuperUser.com)
  3. Cropping a PDF using Ghostscript 9.01 (StackOverflow.com)

The third answer is probably what made you say ‘I understand it will not work with every PDF file’. It uses pdfmark to try and set the /CropBox in the PDF page objects.

The method of the first two answers will most likely succeed where the third one fails. This method uses a PostScript command snippet of <</PageOffset [NNN MMM]>> setpagedevice to shift and place the PDF pages on a (smaller) media size defined by the -gNNNNxMMMM parameter (which defines device width and height in pixels).

If you understand the concept behind the first two answers, you’ll easily be able to adapt the method used there to crop margins on all 4 edges of a PDF page.

Here is an example command which crops a letter-sized PDF (8.5x11in == 612x792pt) by half an inch (== 36pt) on each of the 4 edges (the command is for Windows):

    gswin32c.exe         ^
       -o cropped.pdf    ^
       -sDEVICE=pdfwrite ^
       -g5400x7200       ^
       -c "<</PageOffset [-36 -36]>> setpagedevice" ^
       -f input.pdf

The resulting page size will be 7.5x10in (== 540x720pt). To do the same on Linux or Mac, use:

    gs                   \
       -o cropped.pdf    \
       -sDEVICE=pdfwrite \
       -g5400x7200       \
       -c "<</PageOffset [-36 -36]>> setpagedevice" \
       -f input.pdf
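The -g geometry and the /PageOffset values in these commands can be computed from the page size and the desired crop margin. A sketch of the arithmetic (the helper function is my own), assuming pdfwrite’s default internal resolution of 720 dpi, where 1 PostScript point corresponds to 10 device pixels in -g:

```shell
# Hypothetical sketch: derive the Ghostscript crop parameters from the
# page width/height and the margin to cut off each edge (all in points).
crop_params() {
  w=$1; h=$2; m=$3
  new_w=$((w - 2 * m))               # cropped width in points
  new_h=$((h - 2 * m))               # cropped height in points
  # -g wants device pixels: points * 10 at 720 dpi
  echo "-g$((new_w * 10))x$((new_h * 10)) -c \"<</PageOffset [-$m -$m]>> setpagedevice\""
}
```

crop_params 612 792 36 reproduces the -g5400x7200 and [-36 -36] values of the letter-size example above.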

Update: How to determine ‘margins’ with Ghostscript

A comment asked for ‘automatic’ determination of the white margins. You can use Ghostscript for this too. Its bbox device can determine the area covered by the (virtual) ink on each page (and hence, indirectly, the whitespace at each edge of the canvas).

Here is the command:

    gs                     \
      -q -dBATCH -dNOPAUSE \
      -sDEVICE=bbox        \
       input.pdf

Output (example):

    %%BoundingBox: 57 29 562 764
    %%HiResBoundingBox: 57.265030 29.347046 560.245045 763.649977
    %%BoundingBox: 57 28 561 667
    %%HiResBoundingBox: 57.265030 28.347046 560.245045 666.295011

The bbox device renders each PDF page in memory (without writing any output to disk) and then prints the BoundingBox and HiResBoundingBox info to stderr. You can modify the command as follows to make the results easier to parse:

    gs                       \
        -q -dBATCH -dNOPAUSE \
        -sDEVICE=bbox        \
         input.pdf           \
         2>&1                \
     | grep -v HiResBoundingBox

Output (example):

    %%BoundingBox: 57 29 562 764
    %%BoundingBox: 57 28 561 667

This would tell you…

  • …that the lower left corner of the content rectangle of Page 1 is at coordinates [57 29] and its upper right corner at [562 764];
  • …that the lower left corner of the content rectangle of Page 2 is at coordinates [57 28] and its upper right corner at [561 667].

This means:

  • Page 1 uses a whitespace of 57pt on the left edge (72pt == 1in == 25.4mm).
  • Page 1 uses a whitespace of 29pt on the bottom edge.
  • Page 2 uses a whitespace of 57pt on the left edge.
  • Page 2 uses a whitespace of 28pt on the bottom edge.

As you can see from this simple example already, the whitespace is not exactly the same on each page. Depending on your needs (you likely want the same size for each page of a multi-page PDF, no?), you have to work out the minimum margin for each edge across all pages of the document.
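Working out the per-edge minima across all pages can be automated. A sketch (the awk filter is my own addition, not part of the original answer) that consumes the %%BoundingBox lines printed by the gs bbox run shown above:

```shell
# Hypothetical sketch: read %%BoundingBox lines on stdin and report the
# smallest left (llx) and bottom (lly) whitespace across all pages.
min_margins() {
  grep '^%%BoundingBox:' |
  awk '{ if (NR == 1 || $2 < l) l = $2      # $2 is llx of this page
         if (NR == 1 || $3 < b) b = $3 }    # $3 is lly of this page
       END { print "left=" l " bottom=" b }'
}
```

It would be fed like this: gs -q -dBATCH -dNOPAUSE -sDEVICE=bbox input.pdf 2>&1 | min_margins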

Now what about the whitespace at the right and top edges? To calculate that, you need to know the original page size of each page. The simplest way to determine this is the pdfinfo utility. Example command for a 5-page PDF:

    pdfinfo      \
      -f 1       \
      -l 5       \
       input.pdf \
    | grep "Page "

Output (example):

    Page    1 size: 612 x 792 pts (letter)
    Page    2 size: 612 x 792 pts (letter)
    Page    3 size: 595 x 842 pts (A4)
    Page    4 size: 842 x 1191 pts (A3)
    Page    5 size: 612 x 792 pts (letter)

This will help you determine the required canvas size and the required (maximum) white margins of the top and right edges of each of your new PDF pages.

These calculations can all be scripted too, of course.
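Such a script essentially boils down to one subtraction per edge. A sketch (the function and the letter-size numbers in the usage line are assumed examples; a real script would feed in the pdfinfo page size and the bbox values from above):

```shell
# Hypothetical sketch: whitespace on all four edges of one page.
# Arguments: page width, page height, llx, lly, urx, ury (all in points).
edge_margins() {
  pw=$1; ph=$2                       # media size, e.g. from pdfinfo
  llx=$3; lly=$4; urx=$5; ury=$6     # bounding box, from the gs bbox device
  echo "left=$llx bottom=$lly right=$((pw - urx)) top=$((ph - ury))"
}
```

For a letter-sized page, edge_margins 612 792 57 29 562 764 reports left=57 bottom=29 right=50 top=28.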

But if your PDFs all share a uniform page size, or if they are 1-page documents, it all is much easier to get done…

7 Using Ghostscript to get page size

Is it possible to get the page size (from e.g. a PDF document page) using Ghostscript?

I have seen the bbox device, but it returns the bounding box (which differs per page), not the TrimBox (or CropBox) of the PDF pages. (See the Prepressure website for info about page boxes.) Any other possibility?

7.1 Answer 1

Unfortunately it doesn’t seem quite easy to get the (possibly different) page sizes (or *Boxes for that matter) inside a PDF with the help of Ghostscript.

But since you asked for other possibilities as well: a rather reliable way to determine the media sizes of each page (and even each one of the embedded {Trim,Media,Crop,Bleed}Boxes) is the commandline tool pdfinfo. This utility is part of the XPDF tools from http://www.foolabs.com/xpdf/download.html. You can run the tool with the -box parameter, and tell it with -f 3 to start at page 3 and with -l 8 to stop processing at page 8.

Example output

    C:\downloads>pdfinfo -box -f 1 -l 3 _IXUS_850IS_ADVCUG_EN.pdf
    Creator:        FrameMaker 6.0
    Producer:       Acrobat Distiller 5.0.5 (Windows)
    CreationDate:   08/17/06 16:43:06
    ModDate:        08/22/06 12:20:24
    Tagged:         no
    Pages:          146
    Encrypted:      no
    Page    1 size: 419.535 x 297.644 pts
    Page    2 size: 297.646 x 419.524 pts
    Page    3 size: 297.646 x 419.524 pts
    Page    1 MediaBox:     0.00     0.00   595.00   842.00
    Page    1 CropBox:     87.25   430.36   506.79   728.00
    Page    1 BleedBox:    87.25   430.36   506.79   728.00
    Page    1 TrimBox:     87.25   430.36   506.79   728.00
    Page    1 ArtBox:      87.25   430.36   506.79   728.00
    Page    2 MediaBox:     0.00     0.00   595.00   842.00
    Page    2 CropBox:    148.17   210.76   445.81   630.28
    Page    2 BleedBox:   148.17   210.76   445.81   630.28
    Page    2 TrimBox:    148.17   210.76   445.81   630.28
    Page    2 ArtBox:     148.17   210.76   445.81   630.28
    Page    3 MediaBox:     0.00     0.00   595.00   842.00
    Page    3 CropBox:    148.17   210.76   445.81   630.28
    Page    3 BleedBox:   148.17   210.76   445.81   630.28
    Page    3 TrimBox:    148.17   210.76   445.81   630.28
    Page    3 ArtBox:     148.17   210.76   445.81   630.28
    File size:      6888764 bytes
    Optimized:      yes
    PDF version:    1.4

7.2 Answer 2

Meanwhile I found a different method. This one uses Ghostscript only (just as you required). No need for additional third party utilities.

This method uses a little helper program, written in PostScript, shipping with the source code of Ghostscript. Look in the toolbin subdir for the pdf_info.ps file.

The included comments say you should run it like this in order to list the fonts and media sizes used:

    gswin32c -dNODISPLAY   ^
       -q                  ^
       -sFile=____.pdf     ^
        [-dDumpMediaSizes] ^
        [-dDumpFontsUsed [-dShowEmbeddedFonts]] ^
        toolbin/pdf_info.ps

I did run it on a local example file, with commandline parameters that ask for the media sizes only (not the fonts used). Here is the result:

    C:\> gswin32c    ^
         -dNODISPLAY ^
         -q          ^
         -sFile=c:\downloads\_IXUS_850IS_ADVCUG_EN.pdf ^
         -dDumpMediaSizes ^
          C:/gs8.71/lib/pdf_info.ps

     c:\downloads\_IXUS_850IS_ADVCUG_EN.pdf has 146 pages.
     Creator: FrameMaker 6.0
     Producer: Acrobat Distiller 5.0.5 (Windows)
     CreationDate: D:20060817164306Z
     ModDate: D:20060822122024+02'00'

     Page 1 MediaBox: [ 595 842 ] CropBox: [ 419.535 297.644 ]
     Page 2 MediaBox: [ 595 842 ] CropBox: [ 297.646 419.524 ]
     Page 3 MediaBox: [ 595 842 ] CropBox: [ 297.646 419.524 ]
     Page 4 MediaBox: [ 595 842 ] CropBox: [ 297.646 419.524 ]
     [....]