Lessons learned with RabbitMQ & Celery

A couple months ago, I posted about using asynchronous jobs with RabbitMQ and Celery. This is a follow-up with some lessons I learned the hard way.

Celery settings for good performance

Do not run millions of jobs with DEBUG = True. You will run out of memory — even if you have 48GB of it. On top of that, you might want to consider the celeryd option –maxtasksperchild.

Be extra careful with CELERY_SEND_TASK_ERROR_EMAILS = True. I sent 9000 emails to myself in a couple minutes. My phone which syncs my email really didn’t like it. I’m running with CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True and I’m looking to get a dashboard view of it with django-sentry. I think I’m almost there.

Persistence & disk space

RabbitMQ stores messages intelligently so you don’t have to keep track of them. It’s very good at this. However, problems can arise when you’re queuing tasks faster than you’re processing them. Use rabbitmqctl which ships with RabbitMQ. If you see things like this:

There’s probably going to be some issues. Ten million messages have to be stored somewhere. By default on CentOS, they’re stored in /var. RabbitMQ really doesn’t like it when you run out of disk space for it to write persistent messages so be careful.

The new persistence engine in RabbitMQ 2.x handles this much better than before. In 1.x, the persistence log has to copy itself over every so often and copying multi-GB files all the time really slows the queue to a halt and adds to the problem of not processing tasks fast enough. On top of this, RabbitMQ writes a ton of logs, which is a good thing, but can backfire when disk runs out.

Task sets

Celery’s task sets work like magic. Instead of this:

Use this:

Note: the first parameter to subtask is a tuple of arguments to process_item.

General tips
  • If you can make your tasks re-entrant — meaning they can be run with the same parameters multiple times without any side effects — your life will be a lot easier. Django’s get_or_create works wonders.
  • Try to break tasks into smaller subtasks. Instead of one 45 minute task, break it into 2,000 tasks that take a second or two.
  • If you are clever with your logging, debugging things will be a lot easier. This is generally always true, but it becomes much more apparent with celery’s concurrency.

Djangocon Day One

As I promised, here’s some (semi-live) blogging from Djangocon.

The first talk of the day was Scaling the World’s Largest Django Application given by the guys at Disqus (slides). The basic gist is how they scaled up Django to handle a bajillion requests per month. One thing that I noticed is that to reach this scale, they are effectively throwing away referential integrity in their database. When asked about it, one of the guy mentioned that they have some scripts to verify that things have some level of integrity which I thought was a little scary. I was happy to see that they are basically measuring everything. On top of the usual continuous integration stuff like unit tests, lint (Pyflakes, actually) they are logging every query and every traceback using a package they opensourced named Sentry. In addition, the brief mention of the save() method’s concurrency issues was interesting.

Jeff Balogh’s talk on Switching addons.mozilla.org [AMO] from CakePHP to Django (slides – pdf) was on the similar subject of switching the high traffic site AMO to Django from PHP. My favorite point from this talk is how they handle the classic stale DB replication problem of a user submitting new data (to the master) and then not seeing their data (from the slave). Basically, they use a custom Django router that detects a POST and switches that user’s session to always read and write from the master which is pretty damn clever. Mozilla also de-normalized their database in order store a reference to the latest add-on (a classic problem). However, Jeff did mention that they might switch that to storing that info in cache. The main gist of this talk was cache everything.

Russell Keith-Magee’s So you want to be a core developer? and James Bennett’s Topics of Interest (going on now — see live blogging) are both on the future of Django and how to get involved and what needs to change in the Django community. The Django people need more people to get involved. James said there are only 14 committing developers and Russell said that to get Django 1.2 out the door he had to review tickets for 5 hours a night for 2 weeks because nobody else was doing it. As James said, their bus numbers — the number of people who get hit by a bus and then you’re screwed — is frighteningly small. There’s only 2 people who know Django Oracle support.

Themes from the con
  • Git seems to be winning in the DVCS space. It really seems like git and github are taking over and Bazaar and Mercurial are being left at the wayside. I’ve seen tons of links to github and zero to bitbucket or launchpad. James Bennett just now briefly mentioned bitbucket and launchpad.
  • Django developers think that database referential integrity is overrated. Multiple people mentioned that integrity is a farce when things scale insanely.
  • Deploy early and often. The Mozilla guys deploy at least weekly. The Disqus guys deploy daily or more often. Although Jeff Balogh from Mozilla didn’t say it, I wonder if this means that like Disqus they are running out of trunk rather than branching and releasing.
  • Celery (see my previous post) is awesome and everybody seems to be adopting it.
  • If you aren’t using Pip and Virtualenv, you should (previous post). However, you probably shouldn’t deploy a production box from Pip and Pypi.

Django and asynchronous jobs

I’m going to focus on RabbitMQ & Celery, what it buys you and why and when to use it. Eric’s blog always seems to be a couple months ahead of mine, so instead of re-hashing what he wrote, I’ll let you read his blog and then come back and finish my complimentary post. Go there now. I’ll wait.

Here’s the mini-rehash for those who didn’t actually go read it. Basically, RabbitMQ is a general purpose messaging queue system based on the AMQP protocol. When you need something to get executed, you simply queue up a message with RabbitMQ. It won’t get lost if Rabbit restarts, you can be pretty confident that your tasks will eventually get executed or you’ll be able to find out why they didn’t. Celery (and django-celery) is a Python library for queuing up tasks to be executed and for actually executing these tasks asynchronously. It gets its tasks from RabbitMQ although it does work with other queues as well.

Why RabbitMQ & Celery?

Not everyone needs to execute asynchronous tasks. If you’re reading this and asking yourself “why would I ever need that,” you probably don’t need it. My project involves data analysis that takes minutes for each item in the database. In addition, I periodically re-analyze items in the database. At first, I built a cronjob (using Django management commands) which would wake up and check for new items queued in the database every minute. Then, I built a second cronjob for re-analysis. As the system grew, this became hokey and had issues anytime the database crashed during this analysis. Anytime your Django view connects to some external service that may or may not be up (source control, web services) or may take a while (shell commands, massive database queries, etc.), it’s probably best done asynchronously. It’s also great for anything that needs to be executed periodically at a particular interval (caching, expensive calculations). Celery also provides built-in ways to retry failed tasks a specified number of times at a specified interval which is particularly useful for ensuring that things actually get done.

Setup & integration with Django

My Django views didn’t change much after I switched to Celery. Instead of adding an item to the database when I want to queue up an item for analysis, I simply queue up the execution of a @task with django-celery. The particular tasks that your application can handle are simply placed in a tasks.py file in your Django applications. To process the tasks, you simply run python manage.py celeryd. It can also be setup to run at boot time using init.d (Ubuntu/Debian, Redhat/Fedora). Celeryd clients can connect to Rabbit from the same server as RabbitMQ or a different server or servers in order to distribute the load.

The nitty gritty

One useful bit that I had to dig to find in the Celery documentation was on locking a particular task so that two different workers don’t work on the same thing at the same time. This could happen in my application if two users queued up analysis on the same item for example. The trick is to use Django’s cache framework and lock a particular item in the cache.

A word on Djangocon

Unlike last year, I convinced work to send me to Djangocon in Portland next week. I’ll probably do some live blogging on some of the interesting topics. If you’ll also be there, say hi!