Python Command Line Apps

Recently, a junior engineer at my company was tasked with building a command line app and I wanted to point him in the right direction. So I thought I would just Google some resources, send them to him and he’d be on his merry way. However, I couldn’t find anything complete. I found lists of command line libraries for Python as well as guides on using specific libraries but little that gave a good overview of why things are done in certain ways for command line apps. Hopefully this helps.

Why build command line apps

The biggest advantage of command line apps (sometimes called CLIs or command line interfaces) is that they are easier to combine with other programs to build something new. Unlike a mobile app, where all the functionality is built up front and designed by the developers, command line apps are much more flexible. When “grep” — a program for searching for text in files — was first built, there’s no chance that all of its possibilities and power were conceived of in advance. A person might search for some text in a file, filter the set of results with a second invocation of grep, then refine the results further by piping them to another command or even execute yet another program for each match. If you want to automate something, it is much easier if you start with a CLI.

It’s hard to avoid programming overcomplicated monoliths if none of your programs can talk to each other.
Eric S. Raymond in “The Art of Unix Programming”

Because command line apps are so easy to combine and chain together, they work well when they are single purpose. This leads to easy-to-understand and easy-to-maintain programs. There’s no need to build everything including the kitchen sink into an app. Just make sure it has a clear, well-defined and easy-to-understand interface. To help with that, there are a number of conventions, libraries and considerations when building CLIs.

Conventions and terminology of well-behaved CLIs

Options
Options are optional parameters passed to a command line program. On most *nix systems, these start with - or -- and commonly start with / on Windows. The most widely used is --help which is used to get short documentation on how a program is used. The order of options almost never matters.
Arguments (or positional parameters)
Arguments differ from options in that they are frequently not optional, usually do not start with any prefix, and their order usually matters and is often critical to how the program behaves. When Python is executed with python FILENAME.py, FILENAME.py is the argument to the python program.
Commands (or subcommands)
Commands are a way to split functionality of a command line app. The first argument is the “command” and based on this command there are different sets of options and arguments available. Not all programs use commands but complex command line apps frequently do. For example, when executing pip install --upgrade django, install is the command, django is an argument and --upgrade is an option specific to install.

pip accepts a number of possible commands and each of them has its own possible arguments and options.

Standard output (stdout)
Stdout is where the normal output of command line apps goes. This output can be redirected to a file instead of the terminal (with the > operator) or piped to another command (with the | operator). Python’s print writes to stdout, but stdout can also be accessed directly at sys.stdout.
Standard error (stderr)
Stderr is where error output from CLIs goes, as well as informational updates from a longer running app. It can be redirected separately from stdout, but it is reasonably safe to assume the user sees it. Stderr is accessed at sys.stderr. While only occasionally relevant, stderr is “unbuffered” (stdout is buffered), meaning that content written to stderr is flushed to the stream immediately rather than accumulating in an internal buffer until enough data has been written.
Standard input (stdin)
Stdin is a stream of input passed to the command line app. Not all apps require stdin. A good rule of thumb is that if your program accepts an argument that is a file path, there should be a way to pass the actual file contents to stdin. This makes it much easier to chain command line apps together — meaning to pass the stdout from one app as stdin to another app. For example, grep can read and filter a file (grep -i error LOGFILE) or stdin (tail -f LOGFILE | grep -i error). Stdin is accessed at sys.stdin.
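
For example, a minimal sketch of that file-or-stdin pattern (the helper name read_input is just for illustration):

    import sys

    def read_input(path=None):
        """Read from the named file, or from stdin when no path (or '-') is given."""
        if path is None or path == '-':
            return sys.stdin.read()
        with open(path) as f:
            return f.read()
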
Exit status (or return code or exit code or status code)
Command line apps return an exit status to their parent process when they complete. This can inform the caller whether the command succeeded or failed. In Python, this is usually set by calling sys.exit, but it is set automatically when a program raises an uncaught exception. For best compatibility between operating systems, the value should be between 0 and 127, with 0 being success and all other values indicating different failure states (exit statuses are commonly truncated to 8 bits, so sys.exit(256) can end up reported as 0, i.e. “success”, so be careful). This exit status is frequently used to stop chains of command line apps when there’s a failure.
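
As a rough sketch (do_work here is a hypothetical stand-in for whatever the app does):

    import sys

    def main():
        try:
            do_work()                      # hypothetical: the app's real work
        except Exception as exc:
            sys.stderr.write('error: %s\n' % exc)
            sys.exit(1)                    # non-zero exit status signals failure
        sys.exit(0)                        # 0 signals success (also the default)

    if __name__ == '__main__':
        main()
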
Signals
Signals are a way for a user or outside process to send further instructions to a running program. For example, there can be a signal to indicate to a running program to re-read a configuration file. I have never actually seen a Python program that handles signals but I’m including it here for completeness. The standard library has the signal module for setting asynchronous signal handlers.
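
For illustration, here is a minimal sketch of a handler for SIGHUP, the conventional “re-read your configuration” signal on *nix (reload_config is a made-up name):

    import signal

    def reload_config(signum, frame):
        # Re-read the configuration file here.
        print('caught SIGHUP, reloading configuration')

    # SIGHUP only exists on *nix; on Windows you would pick a different signal.
    signal.signal(signal.SIGHUP, reload_config)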

Modules & libraries for building Python CLIs

There are a number of Python libraries and modules to help build a command line app, from parsing arguments and options to testing to full blown CLI “frameworks” which handle things like colorized output, progress bars, sending email or pluggable modules. There is not one single “best” module; they all have trade-offs and are better suited for apps of a certain size and complexity. I hesitated to call out specific libraries as it will result in this post becoming outdated as modules come into and go out of fashion, but it’s important to discuss the tradeoffs, and this approach can be used to evaluate modules I didn’t mention. For a good list of modules, see the Python guide or see the links at the bottom of this post for more details and usage on different ones.

Argparse

Argparse is probably the most common modern library used to help parse command line arguments and options, and it provides a simple and uniform interface for documenting the CLI app. It is very versatile in how it handles arguments and has built-in support for type checking (ensuring an argument or option is an integer or a file path, for example), subcommands, and automatic --help generation. It supports both Python 2.7 and Python 3.x (although there are some gotchas), and because argparse is part of the Python standard library, there’s nothing extra for users to install with a command line app based on it.
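
For illustration, here is a minimal sketch of an argparse-based app (the file name greet.py and its options are made up for this example):

    # greet.py
    import argparse

    def main():
        parser = argparse.ArgumentParser(
            description='Greet someone from the command line.')
        parser.add_argument('name', help='name of the person to greet')
        parser.add_argument('--greeting', default='Hello', help='greeting to use')
        args = parser.parse_args()
        print('%s, %s!' % (args.greeting, args.name))

    if __name__ == '__main__':
        main()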

Running the above sketch, both normally and with --help, results in output roughly like the following (the exact help formatting varies slightly between Python versions):
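
    $ python greet.py World
    Hello, World!

    $ python greet.py --help
    usage: greet.py [-h] [--greeting GREETING] name

    Greet someone from the command line.

    positional arguments:
      name                 name of the person to greet

    optional arguments:
      -h, --help           show this help message and exit
      --greeting GREETING  greeting to use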

To me, argparse represents the minimum functionality that any module for parsing and documenting command line arguments and options should provide. Parsing command line arguments manually is virtually always a mistake, even for a trivial app, and all other modules should be compared against argparse.

Click

Click is a third-party module for building CLI apps, and it does more than argparse. Click includes features to write colorized output, show progress bars, process subcommands, and prompt for input, among other things. Sensible common conventions (like passing - to signify reading a file from stdin or -- to signify the end of options) are built into Click. It also makes it much easier to test a command line app, and I can’t stress enough how big an advantage I’ve personally found this to be. Not only does Click support both Python 2.x and 3.x, it has helpers for some common gotchas. It is very well documented, although it might benefit from some tutorials.
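
As a rough point of comparison, here is a sketch of the same hypothetical greet.py from the argparse section written with Click:

    # greet.py
    import click

    @click.command()
    @click.option('--greeting', default='Hello', help='Greeting to use.')
    @click.argument('name')
    def greet(greeting, name):
        """Greet someone from the command line."""
        click.echo('%s, %s!' % (greeting, name))

    if __name__ == '__main__':
        greet()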

The above very contrived example functions identically to the one further up that uses argparse. While it doesn’t really showcase any of the big advantages of Click, Click is the module of choice for me when I build larger CLIs. For an example of a larger app, see an implementation of the Coreutils in Python that I’m working on or any of the examples in the Click docs.

Considerations

  • Using a library built-in to the standard library like argparse has some advantages when it comes to distribution since the app won’t require any dependencies.
  • The smaller the app, the less likely I am to miss some of the features of larger frameworks like Click.
  • If you’re planning on distributing your CLI across multiple Python versions or operating systems, you want a module that helps deal with that unless the app is fairly simple. Notably, sys.stdout/stderr/stdin deal with bytes in Python 2 and strings in Python 3.
  • End to end testing (including exit statuses, stderr, etc.) can be hard to achieve with argparse alone.

Future topics

There’s a number of nuances I haven’t yet explored that might be worth a whole post on their own. These include:

  • Packaging command line apps for distribution – to reach as wide an audience as possible, it is usually best to publish to PyPI as a regular Python module, but there are some tips and tricks.
  • Testing command line apps – this can be surprisingly tricky especially if the app needs to work across Python versions (2.x and 3.x) and different operating systems.
  • Handling configuration with CLIs
  • Structuring command line apps – there’s some overlap between this and packaging for distribution but it might be worth a post.

Links

  • Of the Python command line videos out there, I think Mark Smith’s EuroPython 2014 talk “Writing Awesome Command-Line Programs in Python” was the best.
  • Kyle Purdon put together a great post comparing argparse, docopt, click and invoke for building CLI apps.
  • Vincent Driessen has a good post on getting started fast with Click and Cookiecutter.
  • The Python documentation has an argparse tutorial which is much more useful for beginners than the module documentation.

Generating PDFs With (and Without) Python

Programmatically creating PDFs is fairly common among server side and desktop applications. In this post, I’ll share the results of my experiences with PDFs and what to think about when making a decision about libraries to use and the trade-offs with them. To understand the trade-offs though, we’ll start from the basics.

What’s a PDF?

Just about everybody has interacted with a PDF file or two, but it is worth going over some details. The PDF format is a document presentation format. HTML, which probably more readers of this blog are familiar with, is also a sort of document presentation format, but the two have different goals. PDF is designed for fixed layouts on specific sizes of paper, while HTML uses a variable layout that can differ across screen sizes, screen types and potentially across browsers. With relatively few exceptions, the same document should look exactly the same on two different PDF viewers, even on different operating systems, screen sizes or form factors.

PDF was originally developed as a proprietary format by Adobe but is now an open standard (ISO 32000-1:2008). The standard document weighs in at over 700 pages and that doesn’t even cover every aspect of PDFs. As you might guess from the previous sentence, the format itself is extremely complicated, so tools are usually used when creating PDFs. The PDF format is fairly close in capability to PostScript, a format and programming language used by printers, and PDFs are most commonly used to convey data intended for printing. PDFs print fairly precisely and uniformly, even across printers and operating systems, so it’s a good idea to create a PDF for any document that is intended for printing.

Generating a PDF

If you only need to generate a single PDF, then you probably should stop reading. Creating a PDF from a Word document, for example, can be done from the print dialog in macOS and many Linux distros, and a tool like Acrobat can accomplish the same thing on Windows. However, when it comes to generating a lot of PDFs — say a different PDF for every invoice in your freelancing job — you don’t want to spend hours in the print dialog saving files as PDFs. You’d rather automate that. There are lots of tools to accomplish this, but some are better than others and more suited to specific tasks. In general though, most of these tools work along the same lines. You have a fixed layout in your document but some of the content is variable because it comes from a database or some external source. All invoices look more or less the same, but what the invoice is for and the amounts, dates and whatnot differ from invoice to invoice.

ReportLab

If you use Python regularly, you may have already heard of ReportLab. ReportLab is probably the most common Python library for creating PDFs and it is backed by a company of the same name. It is very robust albeit a bit complicated for some. ReportLab comes with a dual license of sorts. The Python library is open source but the company produces a markup language (RML, an XML-based markup) and an optimized version of the library for faster PDF generation which they sell. They claim that their proprietary improvements make it around a factor of five faster. Using markup languages like RML comes with the nice advantage that you can use your favorite template library (like Jinja2 or Django’s built-in one) to separate content from layout cleanly.
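
As an illustration, here is a minimal sketch using the open source ReportLab library’s high-level platypus API to put a couple of paragraphs into a PDF (the file name and text are arbitrary):

    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

    styles = getSampleStyleSheet()
    doc = SimpleDocTemplate('invoice.pdf')

    # The "story" is a list of flowables that ReportLab lays out onto pages.
    story = [
        Paragraph('Invoice #42', styles['Heading1']),
        Spacer(1, 12),
        Paragraph('Thanks for your business!', styles['Normal']),
    ]
    doc.build(story)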

While the above example only writes paragraphs to a document, ReportLab can also draw lines or shapes on a document or embed images in your PDF.

reStructuredText and Sphinx

While not a PDF generator by itself, if you’ve ever created a Python module, you’ve probably heard of Sphinx, a module used to create documentation. It was created for the Python documentation itself but has been used by Django, Requests and many other big projects. It is the documentation format that powers Read the Docs. In general, you write a document in reStructuredText (reST), a wiki-like markup format, and then run the Sphinx commands to put the whole thing together. Frequently, Sphinx is used to generate HTML documentation but it can also be used to generate PDFs. It does this by creating LaTeX documents from reST and then creating PDFs from those. This can be a good solution if you’re looking for documents similar to what Sphinx already outputs.

I have a post from long ago on Sphinx and documentation in general if you desire further reading.

LaTeX

LaTeX is a document preparation system that is used widely in scientific and mathematical publishing as well as some other technical spheres. It can produce very high quality documents that are suitable for being published as books, presentations (see an example at the end of the post), handouts or articles. LaTeX has a reputation for being fairly complicated and it involves a fairly large installation to get going. LaTeX is also not a Python solution, although there are Python modules to help somewhat with LaTeX. However, because it is a plain text macro language, it can be used with a template library to generate PDFs. The big advantage is that it can typeset documents very precisely and has a very extensive user base (its own Stack Exchange site) and a large number of packages covering everything from tables, figures and charts to syntax highlighting and bibliographies. It takes more to get going but you can generate just about anything you can think of with LaTeX. Compared with all the other solutions, it also offers the best performance in my testing for non-trivial documents.
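
Because LaTeX source is plain text, the template approach might look something like this sketch (using Jinja2 and assuming pdflatex is installed; the template and file names are made up):

    import subprocess
    from jinja2 import Template

    # A trivial inline LaTeX template; real templates usually live in their own
    # .tex files, and it is common to change Jinja's delimiters so they don't
    # clash with LaTeX braces.
    TEMPLATE = Template(r"""
    \documentclass{article}
    \begin{document}
    \section*{Invoice}
    Invoice number: {{ number }} \\
    Total due: {{ total }} USD
    \end{document}
    """)

    with open('invoice.tex', 'w') as f:
        f.write(TEMPLATE.render(number=42, total='100.00'))

    # Compile the generated source into invoice.pdf.
    subprocess.check_call(['pdflatex', '-interaction=nonstopmode', 'invoice.tex'])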

Alternatives

There are lots of other alternatives from using Java PDF libraries via Jython to translators that translate another format into PDFs. This post is not meant to be exhaustive and new modules to generate PDFs pop up all the time. I’ll briefly touch on a few.

XHTML2PDF
There are quite a few HTML to PDF converters, and they all suffer from the same problem: PDF has lots of concepts (like paper size) that HTML doesn’t have. XHTML2PDF works around this by adding some special additional markup tags. This allows doing things like having headers and footers on each page, which HTML doesn’t do very well. XHTML2PDF uses ReportLab to actually generate the PDFs, so it is strictly less powerful than ReportLab itself. It also only supports a subset of HTML and CSS (for example, “float” isn’t supported), so it’s best to just consider it another ReportLab markup language. With that said, it’s fairly easy to get off the ground and reasonably powerful. The library itself isn’t maintained very much though. Update: see the update at the end of the post.
wkhtmltopdf
I haven’t used this one myself, although I looked into it when choosing a PDF library. It has many of the same issues as XHTML2PDF. It uses a headless WebKit implementation to actually lay out the HTML rather than translating it to ReportLab as XHTML2PDF does. It has a C library with bindings available for it. It has some support for headers and footers as well. People I’ve talked to say it is fairly easy to get started, but you run into problems quickly when trying to precisely lay out complex documents.
PhantomJS
I haven’t looked into this one at all and it is not Python whatsoever but it is worth mentioning. It also translates HTML to PDF.
Inkscape
Inkscape is a program used to create SVGs, an XML-based format for vector graphics. However, Inkscape can be used headlessly to translate an SVG into a PDF fairly accurately. SVGs have a concept of their own size, which is important in translating to PDF. However, SVG isn’t a format designed for multiple pages and so it may not be the best fit for multi-page documents. As an aside, SVGs can be used by LaTeX and ReportLab as imported graphics, so Inkscape could also be part of a larger solution.
Resources

If you’re just looking for a simple recommendation, I’d say ReportLab or LaTeX are the best choices although it can depend on your use case and requirements. They are a little tougher to get off the ground but give you a higher ceiling in terms of capability. Where you don’t need to create particularly complicated documents, Sphinx or an HTML to PDF translator could work.

This post is a longer version of a talk (pdf) I gave at San Diego Python on August 27, 2015. The repository for the talk is on github.

Update

As of October 2016, XHTML2PDF is no longer supported in favor of WeasyPrint. I have never personally used WeasyPrint but it looks significantly more capable than when I first saw it a few years ago. It still likely suffers from some problems similar to XHTML2PDF’s, in that HTML is not designed for the same purposes as PDF. With that said, it has much better support. It does not rely on ReportLab; instead it looks like it relies on Cairo, a 2D graphics library with Python bindings, for rendering.

More updates

It may also be worth mentioning that there are paid cloud services in this space depending on your budget and use case. As of January 2017, I have not integrated with these so I can’t comment on them too much. DocRaptor and HyPDF are both services which integrate with Heroku and convert HTML/CSS to PDFs. There are a number of other players in this space as well.

Code longevity and rewrites

A while ago I interviewed with a company planning to rewrite their existing money-making product with lots of users, which consisted of CGI scripts with most of the business logic in Oracle stored procedures. The developers and management seemed sharp. All but one of the original product devs were gone and nobody really knew the system. The size of the rewrite was a concern of mine that I brought up in the interview. If stored procedures were a pain point, swapping them out piece by piece seemed wise to me, but they thought that focusing their developers on a big switchover would be less costly. Their Python rewrite didn’t go too well. I believe they’re on Rewrite Part II with .NET and SQL Server after hiring an ex-Microsoft manager to take over the project. In the meantime, nobody wants to maintain the old product, which is on its way out.

Rewrite everything… all the time

Joel Spolsky calls rewrites the single worst strategic mistake a company can make. At the same time, only dead code is static. New features and extensions are a sign of a healthy project. Sometimes these improvements involve big and invasive changes. At his product tech talk a few months ago, Adam Wolff talked about the culture of Facebook and how they do software development. As part of their approach, he said that Facebook “rewrites everything all the time”. This strategy — at least on the surface — sounds similar to what Joel Spolsky sees as a complete disaster. Facebook can’t be making such huge mistakes constantly, though, or the competition would catch up. Adam mentioned Joel’s post in his talk and then went on to describe how Facebook developers find a bit of the stack with some code odor, carve out their niche and then rewrite it.

After thinking on this for a bit, I came to the realization that this is not a rewrite at all. At least it isn’t a rewrite in the same way Joel described Netscape rewriting Navigator. Facebook as a whole hasn’t changed. A part of it was just ripped out and replaced with a newer version with all the critical features of the last revision and more. In my mind, I would characterize that as a refactor. I don’t want to argue semantics, but Facebook’s approach — after some reflection — seems quite good. Code that doesn’t change doesn’t get looked at, and when the original developers leave, a company can get left with a code base nobody knows. Jeff Atwood says that understanding means rewriting, so hopefully Facebook has a system that’s been through many iterations and that a number of people understand. When my profile page became a timeline, I bet a number of engineers learned a lot.

Size, among other things, does matter

It is very telling that the Safari team didn’t choose to write a browser from scratch. Instead, they took a relatively small 140kloc layout and JavaScript engine and then built all the browser bits around it (having a smart and talented team along with some Netscape hindsight might have helped too). By building the pieces around it, they also built up some expertise in the initial code base as well. While Safari may not currently be the talk of the browser town, there’s little doubt that WebKit is a big success. So if rewriting a browser from scratch is crazy talk that will sink your company and doing a full rewrite on say your personal blog is OK, what’s the difference? One big difference is size. I could switch my blog from WordPress to a custom site of my own design in about a day. In his talk, Adam described a big shift in Facebook’s deployment system to use BitTorrent that a few engineers wrote in a weekend hackathon to overcome problems with the speed of deploying the latest software to their fleet. Developing a browser from scratch, by contrast, is a multi-man-year effort.

Legacy of code

Code becomes legacy code when nobody touches it for a long time, the developers who know it leave, and everything else changes around it. While code doesn’t rust or degrade with time, the know-how around it does fade away. Developers want to create something great that will be used for a long time; they are architects at heart. However, they don’t want to maintain a bloated, poorly documented system built by somebody else (who is probably gone and creating the next bloated, poorly documented system). The best way I’ve seen as a developer to keep code from becoming “legacy” is to make it my own. To make it my own doesn’t mean rewriting the whole thing from scratch, but rather making incremental improvements where I can or swapping out components where it makes sense. This builds a sense of ownership for the next generation of maintainers. The key here is to write software such that making these types of improvements or swapping things out is easy. Good APIs and separation of concerns are critical.

When big changes are on the horizon and the future of a software project looks least sure, I’ve found that breaking a project into smaller pieces always helps. Unless you end up working on something completely different, at least some of the old code will always be useful to the new project provided the API makes it easy to integrate. To give a concrete example from a project I’m working on relating to automation against an array of Android phones, a solid Python interface to ADB (the Android debug bridge) will almost definitely be worthwhile regardless of what else changes. Hopefully the legacy I leave won’t be a large unmaintainable system but a large number of small systems that are easier to understand.

Getting started with pygit2

I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.

Installation

While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x. Compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On MacOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2 followed by pip install pygit2.

The repository

The first class almost any user of pygit2 will interact with is Repository. I’ll be traversing and introspecting the Twitter bootstrap repository in my examples.
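
A rough sketch of opening a repository (assuming a local clone of bootstrap sits next to the script; the attribute names are from the pygit2 v0.18 era):

    import pygit2

    # discover_repository finds the .git directory from a path inside the clone.
    repo_path = pygit2.discover_repository('bootstrap')
    repo = pygit2.Repository(repo_path)

    print(repo.path)      # path to the .git directory
    print(repo.workdir)   # path to the working directory
    print(repo.is_bare)   # False for a normal clone
    print(repo.is_empty)  # False once there are commits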

There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.

Git objects

There are four fundamental git objects — commits, tags, blobs and trees — which reference snapshots of the git working directory (commits), potentially annotated named references to commits (tags), chunks of data (blobs) and an organization of blobs or directory structure (trees). I’ll save blobs and trees for a later post, but here’s some examples of using commits and tags in pygit2.

Since commits and tags are user facing, most git users should be familiar with them. Essentially commits point to a version of the working copy of the repository in time. They also have various interesting bits of metadata.
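
For example, continuing with the repo object from above (a sketch using pygit2 v0.18-era attribute names, several of which changed in later releases):

    # Look up the commit HEAD points to and poke at its metadata.
    commit = repo.revparse_single('HEAD')

    print(commit.hex)            # 40-character SHA-1 of the commit
    print(commit.message)        # full commit message
    print(commit.author.name)    # author signature: name, email, timestamp
    print(commit.author.email)
    print(commit.commit_time)    # Unix timestamp
    print([parent.hex for parent in commit.parents])

    # Bracket notation works too.
    same_commit = repo[commit.hex]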

One tip/issue with using the repository bracket notation (repo[hex | oid]) is that the key must be a unicode string when specifying the object hash; if it is a byte string, pygit2 assumes it is the binary version of the hash, called the oid.

Tags are essentially named pointers to commits but they can contain additional metadata.
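
For example (v3.0.0 is just a tag that happens to exist in bootstrap, and this sketch assumes it is an annotated tag, in which case revparse_single returns a Tag object rather than the commit it points to):

    tag = repo.revparse_single('v3.0.0')

    print(tag.name)         # 'v3.0.0'
    print(tag.message)      # the annotation (or a PGP signature for signed tags)
    print(tag.tagger.name)  # who created the tag

    # tag.target is the oid of the object (usually a commit) the tag points to.
    commit = repo[tag.target]
    print(commit.hex)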

You can read all about the types of parameters that revparse_single handles at man gitrevisions or in the Git documentation under specifying revisions.

Typically, you won’t ever need to convert between hex encoded hashes and oids, but in case you do the conversion is trivial:
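
    # Sketch: .hex and .oid are the pygit2 v0.18-era attribute names
    # (later releases renamed .oid to .id).
    commit = repo.revparse_single('HEAD')

    hex_hash = commit.hex    # unicode hex string
    oid = commit.oid         # binary Oid object

    print(oid.hex == hex_hash)   # True: an Oid knows its hex form
    print(len(oid.raw))          # 20 raw bytes

    # Both forms can be used with the bracket notation.
    assert repo[hex_hash].hex == repo[oid].hex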

Walking commits

The Repository object makes available a walk method for iterating over commits. This script walks the commit log and writes it out to JSON.
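
A sketch of such a script (the JSON field names are arbitrary):

    import json
    import pygit2

    repo = pygit2.Repository(pygit2.discover_repository('bootstrap'))
    head = repo.revparse_single('HEAD')

    log = []
    # GIT_SORT_TIME walks commits in reverse chronological order.
    for commit in repo.walk(head.oid, pygit2.GIT_SORT_TIME):
        log.append({
            'hash': commit.hex,
            'message': commit.message,
            'author': commit.author.name,
            'commit_time': commit.commit_time,
        })

    with open('commits.json', 'w') as f:
        json.dump(log, f, indent=2)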

Dump repository objects

This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.
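
Here is a sketch of what that could look like (exactly what iterating a repository yields has varied between pygit2 releases; this assumes it yields object ids):

    import json
    import pygit2

    repo = pygit2.Repository(pygit2.discover_repository('bootstrap'))

    commits, tags = [], []
    for oid in repo:                 # iterating a repository yields object ids
        obj = repo[oid]
        if obj.type == pygit2.GIT_OBJ_COMMIT:
            commits.append({'hash': obj.hex, 'message': obj.message})
        elif obj.type == pygit2.GIT_OBJ_TAG:
            tags.append({'name': obj.name,
                         'message': obj.message,
                         'target': repo[obj.target].hex})

    with open('objects.json', 'w') as f:
        json.dump({'commits': commits, 'tags': tags}, f, indent=2)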

Notes
  • There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
  • It helps to think of a Git repository as a tree (or directed acyclic graph if you’re into that sort of thing) where the root is the latest commit. I used to think about version control where the first commit is the root, but instead it is a leaf!
  • If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
  • I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)

GitHub Data Challenge II

GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! I decided to focus only on the top 200 repositories (by forks) in order to have a more manageable set of data. Each comment, pull request or commit is tied to a repository which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored, and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they were coding from.

If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.

Contributions

Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses, but I think I’m not introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks — which I ignore — and later send pull requests, which I do count. However, this means a large fork that merged many commits counts the same as a one-line pull request. The person who actually merges the pull request gets the same credit as the author, which actually makes sense on second thought.

One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actually uses GitHub, virtually all commits end up in the main repository through pull requests or via somebody with permission to push directly, both of which appear in the githubarchive.org data. For a repository like Linux, which only stores code on GitHub and doesn’t accept pull requests there, it would be nice to actually analyze the commit history. I bet most of the Linux contributors have GitHub accounts to attribute their work to.

Geocoding

Geocoding messy data is well… messy. The location field for users on GitHub is simply a fill-in-the-blank field and users can type anything in there, from their city to their university to an IRC channel. Sometimes people just type in a country name, which is fine for Singapore but doesn’t really narrow it down too much for Canada. The locations listed for contributors on the top 200 repositories were surprisingly clean, however. It wasn’t without somewhat humorous errors, though.
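
For what it’s worth, the lookup itself is only a few lines with a geocoding client; here is a sketch using geopy’s Nominatim wrapper (not necessarily what I used at the time, and the sample location strings are made up):

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent='github-data-challenge')

    for location_field in ['San Diego, CA', 'Singapore', 'an IRC channel']:
        location = geolocator.geocode(location_field)
        if location is None:
            print('could not geocode %r' % location_field)
        else:
            print(location_field, location.latitude, location.longitude)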

Links