Generating PDFs With (and Without) Python

Programmatically creating PDFs is fairly common among server side and desktop applications. In this post, I’ll share the results of my experiences with PDFs and what to think about when making a decision about libraries to use and the trade-offs with them. To understand the trade-offs though, we’ll start from the basics.

What’s a PDF?

Just about everybody has interacted with a PDF file or two, but it is worth going over some details. The PDF format is a document presentation format. HTML which probably more readers of this blog are familiar with is also a sort of document presentation format but the two have different goals. PDF is designed for fixed layouts on specific sizes of paper while HTML uses a variable layout that can differ across screen sizes, screen types and potentially across browsers. With relatively few exceptions, the same document should look exactly the same on two different PDF viewers even on different operating systems, screen sizes or form factors.

PDF was originally developed as a proprietary format by Adobe but is now an open standard (ISO 32000-1:2008). The standard document weighs in at over 700 pages and that doesn’t even cover every aspect of PDFs. As you might guess from the previous sentence, the format itself is extremely complicated and so usually tools are used when creating PDFs. The PDF format is fairly close in capability to Postscript, a format and programming language used by printers and PDFs are used most commonly to convey some data intended for printing. PDFs print fairly precisely and uniformly even across printers and OSs and so it’s a good idea to create a PDF for any document that is intended for printing.

Generating a PDF

If you only need to generate a single PDF, then you probably should stop reading. Creating a PDF from a word document for example can be done from the print dialog in MacOS and many Linux distros and a tool like Acrobat can accomplish the same thing on Windows. However, when it comes to generating a lot of PDFs — say a different PDF for every invoice in your freelancing job — you don’t want to spend hours in the print dialog saving files as PDFs. You’d rather automate that. There are lots of tools to accomplish this but some are better than others and more suited to specific tasks. In general though, most of these tools work along the same lines. You have a fixed layout in your document but some of the content is variable because it comes from a database or some external source. All invoices look more or less the same, but what the invoice is for and the amounts, dates and whatnot differ from invoice to invoice.

ReportLab

If you use Python regularly, you may have already heard of ReportLab. ReportLab is probably the most common Python library for creating PDFs and it is backed by a company of the same name. It is very robust albeit a bit complicated for some. ReportLab comes with a dual license of sorts. The Python library is open source but the company produces a markup language (RML, an XML-based markup) and an optimized version of the library for faster PDF generation which they sell. They claim that their proprietary improvements make it around a factor of five faster. Using markup languages like RML comes with the nice advantage that you can use your favorite template library (like Jinja2 or Django’s built-in one) to separate content from layout cleanly.

While the above example only writes paragraphs to a document, ReportLab can draw lines or shapes on a document, or support using images with your PDF.

RestructuredText and Sphinx

While not a PDF generator by itself, if you’ve ever created a Python module, you’ve probably heard of Sphinx, a module used to create documentation. It was created for the Python documentation itself but has been used by Django, Requests and many other big projects. It is the documentation format that powers Read the Docs. In general, you write a document in RestructuredText (reST), a wiki-like markup format and then run the Sphinx commands to put the whole thing together. Frequently, Sphinx is used to generate HTML documentation but it can be used to generate PDFs. It does this by creating LaTeX documents from reST and then creating PDFs from those. This can be a good solution if you’re looking for documents similar to what Sphinx already outputs.

I have a post from long ago on Sphinx and documentation in general if you desire further reading.

LaTeX

LaTeX is a document preparation system that is used widely in scientific and mathematical publishing as well as some other technical spheres. It can produce very high quality documents that are suitable for being published as books, presentations (see an example at the end of the post), handouts or articles. LaTeX has a reputation for being fairly complicated and it involves a fairly large installation to get going. LaTeX is also not a Python solution although there are Python modules to help somewhat with LaTeX. However, because it is a plain text macro language, it can be used with a template library to generate PDFs. The big advantage is that it can very precisely typeset documents and has a very extensive user base (it’s own stackexchange) and a large amount of modules to do everything from tables, figures, charts, syntax highlighting, bibliographies and more. It takes more to get going but you can generate just about anything you can think of with LaTeX. Compared with all the other solutions, it also offers the best performance in my testing for non-trivial documents.

Alternatives

There are lots of other alternatives from using Java PDF libraries via Jython to translators that translate another format into PDFs. This post is not meant to be exhaustive and new modules to generate PDFs pop up all the time. I’ll briefly touch on a few.

XHTML2PDF
There are quite a few HTML to PDF converters and they all suffer from problems that PDF has lots of concepts (like paper size) that HTML doesn’t have. XHTML2PDF works around this by adding some special additional markup tags. This allows doing things like having headers and footers on each page which HTML doesn’t do very well. XHTML2PDF uses ReportLab to actually generate the PDFs so it is strictly less powerful than ReportLab itself. It also only supports a subset of HTML and CSS (for example, “float” isn’t supported) and so it’s best to just consider it another ReportLab markup language. With that said, it’s fairly easy to get off the ground and reasonably powerful. The library itself isn’t maintained very much though. Update: see the update at the end of the post.
wkhtmltopdf
I haven’t used this one myself although I looked into when choosing a PDF library. It has many of the same issues as XHTML2PDF. It uses a headless Webkit implementation to actually layout the HTML rather than translating it to ReportLab as XHTML2PDF does. It has a C library with bindings to it. It has some support for headers and footers as well. People I’ve talked to say it is fairly easy to get started but you run into problems quickly when trying to precisely layout complex documents.
PhantomJS
I haven’t looked into this one at all and it is not Python whatsoever but it is worth mentioning. It also translates HTML to PDF.
Inkscape
Inkscape is a program used to create SVGs, an XML based document for vector graphics. However, Inkscape can be used headlessly to translate an SVG into a PDF fairly accurately. SVGs have a concept of their size which is important in translating to PDF. However, SVG isn’t a format designed for multiple pages and so it may not be the best fit for multi-page documents. As an aside, SVGs can be used by LaTeX and ReportLab as imported graphics so Inkscape could also be part of a larger solution.
Resources

If you’re just looking for a simple recommendation, I’d say ReportLab or LaTeX are the best choices although it can depend on your use case and requirements. They are a little tougher to get off the ground but give you a higher ceiling in terms of capability. Where you don’t need to create particularly complicated documents, Sphinx or an HTML to PDF translator could work.

This post is a longer version of a talk (pdf) I gave at San Diego Python on August 27, 2015. The repository for the talk is on github.

Update

As of October 2016, XHTML2PDF is no longer supported in favor of WEasyPrint. I have never personally used WEasyPrint but it looks significantly more capable than when I first saw it a few years ago. It still likely suffers from some similar problems to XHTML2PDF in that HTML is not designed for the same reasons as PDF. With that said, it has much better support. It does not rely on Reportlab and instead looks like it relies on Cairo, a Python library for converting SVGs to PDFs.

More updates

It may also be worth mentioning that there are paid cloud services in this space depending on your budget and use case. As of January 2017, I have not integrated with these so I can’t comment on them too much. DocRaptor and HyPDF are both services which integrate with Heroku and convert HTML/CSS to PDFs. There are a number of other players in this space as well.