Django-taggit versus Django-tagging

For some time, there was one re-usable Django tagging application — django-tagging — and that was it. If you didn’t like it, you rolled your own. It was certainly a decent application. You could pretty easily tag anything and provided some decent features out of the box. However, very recently, another app showed up: django-taggit. This post is going to compare and contrast the two and why I decided to switch my work project to django-taggit.

Old school tagging

Back in March when we were just starting out, I went with django-tagging because that was the only tagging app around. Back then I didn’t think about it, but now 6 months later there have been no new updates or releases to django-tagging. I think this really led to the creation of django-taggit. Django-tagging had some nice features that were pretty useful like a template tag for a tag cloud. While django-tagging got me up and running quickly, it wasn’t without its annoyances. Deleting an object left dangling references to its tags since those would not be deleted because it used generic foreign keys. There was an issue filed for this back in 2008 that was never resolved. It resulted in me having to override the delete method of every object that got tagged. If I had 3 objects that got tagged, I repeated the same snippet 3 times!

The fact that such seemingly important functionality had not been added since 2008 pointed to the fact that django-tagging had been left fallow for a while.

New school tagging

At Djangocon, I first heard about django-taggit. Immediately, I liked the fact that the docs were a little bit more fleshed out than django-tagging’s docs. In addition, I found the search API for taggit to be much more intuitive.

The little things

The nicest part about django-taggit is that it integrates much better with the admin. To tag an object in the admin with django-tagging, I would need to figure out the primary key id of the object I want to tag and then go to the TaggedItem admin and then tag it by id. It was unintuitive and error prone. With django-taggit, it’s as easy as editing an object the normal way. A “tags” field shows up and it explains that it simply accepts a comma separated list.

The one feature I liked from django-tagging that taggit doesn’t implement is the tag cloud. I can understand that different folks want clouds done slightly differently and that it’s not a feature that has one right way to do it. However, it was pretty damn convenient.

All told, django-taggit seems to do the job that I want it to do and it stays out of the way otherwise. It’s much more intuitive to setup and use. It’s actively maintained and the docs are better. There’s nothing not to like. At the same time, I don’t want to say it’s the end all. I’d love to see the django-tagging guys come back with some great new features because apps can always be better.

Django Security Update September 2010 Edition

Yesterday, the Django team released a security update. The post basically says it all. If you are on Django 1.2.1, UPDATE NOW!

The details

The issue is a standard non-persistent cross site scripting (XSS) exploit. Django explicitly trusted the cross site request forgery token which is supposed to be a hexdigest based on your SECRET_KEY in settings.py. However, cookies are simply stored on the client filesystem and they should generally be considered untrusted user input.

Exploit howto

First, setup a simple Django project that has the admin enabled. Visit your simple website and you’ll be issued a CSRF token that is saved in your cookie. Simply edit the token (with Edit this Cookie for Chrome maybe) and enter a script tag and save it. Reload the page and the script tag you entered will get echoed back unescaped and executed.

Edit: I should note that this vulnerability affects any form that uses a CSRF token, not just the admin.

Djangocon 2010 Day Two

More live-ish blogging…

The keynote

Eric Florenzano’s talk Why Django Sucks, and How We Can Fix It (slides) builds on top of the 2008 keynote by Cal Henderson entitled Why I Hate Django by pointing out instances where Django can improve. Like Cal, Eric complained about some things — some of which may not be solvable — and hopefully like some of the things Cal complained about they’ll get fixed. The note about needing more contributors again came up. Becoming a core developer is pretty much impossible. He complained about the reusability of apps citing django-avatar as an example. By rigidly defining a model, a “reusable app” becomes somewhat locked and cannot easily store new metadata. I really liked the concept of lazy foreign keys.

I’m somewhat torn on the idea of just switching Django’s source to github. I don’t fully buy Russell’s argument that you can just checkin to github and it will trigger to subversion. While that is a true statement, by having mirrors in bitbucket, launchpad and github, the Django core has fractioned the social aspects of those services. User comments are going to be split between between those services rather than being concentrated. However, similar to how Django used to allow comments in the documentation, these comments may not be that useful. I also think removing the Django admin from Django would be a travesty.

Django Security

Adam Baldwin’s Pony Pwning (notes, slides) was a decent hundred foot overview of Django security and web security in general. I would have really liked to see more details although it looked like some security vulnerability he found was redacted from his slides to give the Django developers time to fix it. Other than that, he said that the Django community as a whole “gets” security (I generally agree) and that while Django is fairly secure by default developers still manage to make mistakes. He did point out that there is no clickjacking protection for the admin and using X-FRAME-OPTIONS would be a good addition. Also, it seems that Django’s escaping could be improved. I liked that he pushed pen testing with w3af and running a web app firewall like mod_security. While frameworks can buy a certain level of security cheaply since it’s unlikely that a single developer or group will get everything right and it’s more likely that a well thought through framework will be more secure. However, then problems in the framework are discovered and basically all sites using it are suddenly vulnerable. I think some security researcher really just needs to spend some time with Django and really push the limits of its security model. I’ve talked about doing it at work, but buyin will be tough.

Other Stuff

Andrew Godwin’s (the developer of South) talk Step Away From That Database (slides – pdf) was the hundred foot overview of the various data stores for Django. One interesting trend that I’m picking up on is that Django developers seem to dislike MySQL a lot and Postgres is preferred by far. This might have something to do with MySQL development stalling and getting forked into Drizzle.

I enjoyed, but don’t entirely agree with Malcolm Tredinnick’s Modeling challenges (data) which seemed to be more about how to model complex data into Django’s models but in reality could have very well left out Django entirely and focused on databases and data modeling. I really liked the part about modeling dates that have different precisions. However, I am less sold on his implementation of modeling sports teams and players from data retrieved from retrosheet.

While this will result in fairly normalized data form data that looks like “keara001″,”Austin Kearns” where 001 is Kearns’ first season and gets incremented yearly. Basically, this easily allows you to join data easily and find all the history of where a player was and is, but it doesn’t take into account seasons very well. I think retrosheet does it the way they do so that you can easily get a player’s stats for a given year. It’s a complicated problem but I’ve seen retrosheet’s solution in many places.

Django add-ons

I think the best talk of the day so far (it JUST finished) was Eric Holscher’s Large Problems in Django, Mostly Solved (slides) which basically gives the hundred foot overview of the add-ons available and the external packages you should be using. Some stuff is pretty clear like pip, south, celery, fabric and sphinx, but on top of that there were packages that I hadn’t heard of or knew relatively little about like Haystack for search and Gunicorn for easy or simple deployments (it also might be the right solution for an easy Pythonic Hudson replacement). I was interested that Eric sees that Piston is in competition with Tastypie and that django-tagging is being overtaken by django-taggit. At work, we setup django-tagging but since then it seems that django-taggit has emerged since then. I also loved Eric’s metric of lines of docs and lines of tests as a metric of how good a project is.

Themes
  • djangopackages.com is the new place to go for Django add-ons.
  • 1.3 will probably have some good logging stuff built-in

Djangocon Day One

As I promised, here’s some (semi-live) blogging from Djangocon.

The first talk of the day was Scaling the World’s Largest Django Application given by the guys at Disqus (slides). The basic gist is how they scaled up Django to handle a bajillion requests per month. One thing that I noticed is that to reach this scale, they are effectively throwing away referential integrity in their database. When asked about it, one of the guy mentioned that they have some scripts to verify that things have some level of integrity which I thought was a little scary. I was happy to see that they are basically measuring everything. On top of the usual continuous integration stuff like unit tests, lint (Pyflakes, actually) they are logging every query and every traceback using a package they opensourced named Sentry. In addition, the brief mention of the save() method’s concurrency issues was interesting.

Jeff Balogh’s talk on Switching addons.mozilla.org [AMO] from CakePHP to Django (slides – pdf) was on the similar subject of switching the high traffic site AMO to Django from PHP. My favorite point from this talk is how they handle the classic stale DB replication problem of a user submitting new data (to the master) and then not seeing their data (from the slave). Basically, they use a custom Django router that detects a POST and switches that user’s session to always read and write from the master which is pretty damn clever. Mozilla also de-normalized their database in order store a reference to the latest add-on (a classic problem). However, Jeff did mention that they might switch that to storing that info in cache. The main gist of this talk was cache everything.

Russell Keith-Magee’s So you want to be a core developer? and James Bennett’s Topics of Interest (going on now — see live blogging) are both on the future of Django and how to get involved and what needs to change in the Django community. The Django people need more people to get involved. James said there are only 14 committing developers and Russell said that to get Django 1.2 out the door he had to review tickets for 5 hours a night for 2 weeks because nobody else was doing it. As James said, their bus numbers — the number of people who get hit by a bus and then you’re screwed — is frighteningly small. There’s only 2 people who know Django Oracle support.

Themes from the con
  • Git seems to be winning in the DVCS space. It really seems like git and github are taking over and Bazaar and Mercurial are being left at the wayside. I’ve seen tons of links to github and zero to bitbucket or launchpad. James Bennett just now briefly mentioned bitbucket and launchpad.
  • Django developers think that database referential integrity is overrated. Multiple people mentioned that integrity is a farce when things scale insanely.
  • Deploy early and often. The Mozilla guys deploy at least weekly. The Disqus guys deploy daily or more often. Although Jeff Balogh from Mozilla didn’t say it, I wonder if this means that like Disqus they are running out of trunk rather than branching and releasing.
  • Celery (see my previous post) is awesome and everybody seems to be adopting it.
  • If you aren’t using Pip and Virtualenv, you should (previous post). However, you probably shouldn’t deploy a production box from Pip and Pypi.

Django and asynchronous jobs

I’m going to focus on RabbitMQ & Celery, what it buys you and why and when to use it. Eric’s blog always seems to be a couple months ahead of mine, so instead of re-hashing what he wrote, I’ll let you read his blog and then come back and finish my complimentary post. Go there now. I’ll wait.

Here’s the mini-rehash for those who didn’t actually go read it. Basically, RabbitMQ is a general purpose messaging queue system based on the AMQP protocol. When you need something to get executed, you simply queue up a message with RabbitMQ. It won’t get lost if Rabbit restarts, you can be pretty confident that your tasks will eventually get executed or you’ll be able to find out why they didn’t. Celery (and django-celery) is a Python library for queuing up tasks to be executed and for actually executing these tasks asynchronously. It gets its tasks from RabbitMQ although it does work with other queues as well.

Why RabbitMQ & Celery?

Not everyone needs to execute asynchronous tasks. If you’re reading this and asking yourself “why would I ever need that,” you probably don’t need it. My project involves data analysis that takes minutes for each item in the database. In addition, I periodically re-analyze items in the database. At first, I built a cronjob (using Django management commands) which would wake up and check for new items queued in the database every minute. Then, I built a second cronjob for re-analysis. As the system grew, this became hokey and had issues anytime the database crashed during this analysis. Anytime your Django view connects to some external service that may or may not be up (source control, web services) or may take a while (shell commands, massive database queries, etc.), it’s probably best done asynchronously. It’s also great for anything that needs to be executed periodically at a particular interval (caching, expensive calculations). Celery also provides built-in ways to retry failed tasks a specified number of times at a specified interval which is particularly useful for ensuring that things actually get done.

Setup & integration with Django

My Django views didn’t change much after I switched to Celery. Instead of adding an item to the database when I want to queue up an item for analysis, I simply queue up the execution of a @task with django-celery. The particular tasks that your application can handle are simply placed in a tasks.py file in your Django applications. To process the tasks, you simply run python manage.py celeryd. It can also be setup to run at boot time using init.d (Ubuntu/Debian, Redhat/Fedora). Celeryd clients can connect to Rabbit from the same server as RabbitMQ or a different server or servers in order to distribute the load.

The nitty gritty

One useful bit that I had to dig to find in the Celery documentation was on locking a particular task so that two different workers don’t work on the same thing at the same time. This could happen in my application if two users queued up analysis on the same item for example. The trick is to use Django’s cache framework and lock a particular item in the cache.

A word on Djangocon

Unlike last year, I convinced work to send me to Djangocon in Portland next week. I’ll probably do some live blogging on some of the interesting topics. If you’ll also be there, say hi!