MIT Sloan Sports Analytics Conference 2012

This will be a slight divergence from the usual programming…

MIT Sloan Sports Analytics Conference Attendees List
I’m at the MIT Sloan Sports Analytics Conference this weekend and I’m having a blast. Jeff van Gundy is hilarious. Also, I was feeling a little snarky when I registered (about 4 months ago), and I was at work doing security stuff at the time, so I put a joke in my company name. The attendees list is sorted by company, and now I’m at the top of the list.

However, getting down to San Diego sports, I took a look at the attendees list and nobody is here from the Chargers or Padres. I’m hoping they’re just incognito, but I’m guessing nobody came. Hopefully the home teams don’t get left behind…

Visualizing Data: Startup Edition

Lately, I’ve been working on visualizing some security data we have at work. While I can’t share exactly what I’m doing, I thought I’d share something along the same lines.

I created a treemap of technology company market capitalization data as of today using Protovis. The different colors correspond to different sub-sectors. The startup valuations are hazy, since there’s no public market for their shares, so I did the best I could. My goal was to compare the sizes of these technology companies and see if anything interesting stood out. One interesting note is that Facebook is about as big as the rest of the startups combined. Google and IBM rule the services world. Apple rules the hardware world, but I’m not sure I’d classify them as a hardware company.

Regardless, enjoy!

Working with World Bank Data in R

Although I generally stick to Python, I am going to go off on a tangent about statistics, data sets and R. You’ve been warned.

Getting the data

Last week, the World Bank released some of its underlying data that it uses as development indicators. The data is fairly clean and easy to work with. I grabbed the USA data in Excel format and transposed it (using “paste special”) so that each year was a row instead of having the years as columns. Then I saved it as a CSV file on my desktop.

Working with the data in R

R is a programming language that focuses on statistics and data visualization. Unlike Python, R has a number of useful statistics functions built into the language. These features allow you to easily find means, minimums, maximums and standard deviations, summarize data sets, plot graphs and more. Working with the data is very interesting and it provides a good way to learn R.
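As a quick sketch of those built-ins, here they are applied to a small vector. The numbers are made up for illustration and are not World Bank figures.

```r
# Illustrative values only, not taken from the World Bank data.
x <- c(12, 15, 9, 21, 18)

mean(x)     # arithmetic mean
min(x)      # smallest value
max(x)      # largest value
sd(x)       # sample standard deviation
summary(x)  # minimum, quartiles, median, mean and maximum in one call
```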

First off, you can easily read in the saved CSV file.
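A minimal sketch of that step, using base R’s read.csv. So that it runs anywhere, it first writes a tiny stand-in file to a temporary location. The column names (Year, Population, EnergyUse), the values and the "~/Desktop/usa.csv" path mentioned in the comment are all assumptions for illustration, not the real World Bank export.

```r
# Hypothetical stand-in for the transposed export: one row per year.
# Column names and values are placeholders, not real World Bank figures.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Year,Population,EnergyUse",
  "2005,100,7500",
  "2006,110,7400",
  "2007,120,7300"
), csv_path)

# With the real file, csv_path would instead point at wherever it was saved,
# for example "~/Desktop/usa.csv" (an assumed name).
usa <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Show the structure: one column per indicator, one row per year.
str(usa)
```

Setting stringsAsFactors = FALSE keeps any text columns as plain character vectors, which makes later conversions to numbers less surprising.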

The variable usa contains all columns of data and the columns can be accessed easily:
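For example, assuming a data frame with hypothetical Year and Population columns (the real column names depend on the World Bank export), a column can be pulled out by name in a few equivalent ways:

```r
# Placeholder data frame standing in for the one read from the CSV file.
usa <- data.frame(
  Year = c(2005, 2006, 2007),
  Population = c(100, 110, 120)  # illustrative values only
)

names(usa)                # list the available column names
usa$Population            # access a column with the $ operator
usa[["Population"]]       # the same column, using a string name
usa[usa$Year == 2006, ]   # the full row for a single year
```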

Plotting with R

Visualizing the data is the really interesting part, and this is where R shines. First we need to get the columns we want to graph.

There are some missing data points in both the population and energy use columns for the most recent years. It is possible that the data hasn’t yet been collected and verified. By coercing the data into an integer vector, any entries that can’t be parsed as numbers will be converted into R’s NA value (and any fractional part is dropped). While similar to null or Python’s None, NA indicates that the data is not available, and those points will be left out of the plot. Once the data is ready, it can be plotted easily.
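Here is a rough sketch of the coercion and the plot. The column names, the values and the ".." marker for missing entries are assumptions for illustration; the real file may mark gaps differently.

```r
# Placeholder data: the last two years have no usable energy figure.
usa <- data.frame(
  Year = 2005:2008,
  EnergyUse = c("7800", "7700", "..", ""),  # illustrative values only
  stringsAsFactors = FALSE
)

year <- usa$Year

# Going through as.character() first guards against factor columns, where
# as.integer() alone would return the internal level codes.
# Entries that are not numbers become NA; the expected coercion warning
# is suppressed here.
energy <- suppressWarnings(as.integer(as.character(usa$EnergyUse)))

is.na(energy)  # TRUE marks the years with no data

# NA points are simply not drawn.
plot(year, energy, type = "b",
     xlab = "Year", ylab = "Energy use per capita",
     main = "US Energy Usage")
```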

When I saw the resulting graph I thought to myself: WOW, that’s a lot of energy. I don’t think I use multiple tons of oil per year, but I assume this also includes industrial, commercial and military usage. Still, that’s a lot of energy. It’s interesting to note that the peak of US energy usage was 1978 and then there’s the subsequent decline due to the energy crisis. The next thing I thought about was how energy usage has leveled off while population has continued to grow. So I decided to put population on the same chart.
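Since population and per-capita energy use are on very different scales, one way to draw both on a single chart in base R is to overlay a second plot with its own axis on the right. This is only a sketch of that approach with placeholder numbers; it is not necessarily how the chart in this post was produced.

```r
# Placeholder series, not real World Bank figures.
year <- 2000:2004
energy <- c(8000, 7900, 7950, 7850, 7800)
population <- c(100, 103, 106, 109, 112)

# Leave extra margin on the right for the second axis label.
par(mar = c(5, 4, 4, 5))

# First series uses the left-hand axis.
plot(year, energy, type = "l", col = "red",
     xlab = "Year", ylab = "Energy use per capita")

# Overlay the second series on the same plotting region, without axes.
par(new = TRUE)
plot(year, population, type = "l", col = "blue",
     axes = FALSE, xlab = "", ylab = "")

# Draw the population scale on the right-hand side and label it.
axis(side = 4)
mtext("Population", side = 4, line = 3)

legend("top", legend = c("Energy use", "Population"),
       col = c("red", "blue"), lty = 1)
```

An alternative would be to rescale both series to an index (say, first year = 100) and plot them against one shared axis, which makes relative growth easier to compare.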

US Energy Usage
The leveling of energy usage may not be as amazing as I first thought, since a significant percentage of it must be industrial use, which is probably declining. It is still interesting and fairly impressive. While the population has continued to grow fairly linearly, energy usage is flat or slightly lower than it was 35 years ago. I guess those slightly more efficient water heaters and refrigerators are paying off.