Data analysis the IPython Notebook way… In a single day

I’m at Mozilla in Toronto for a few days to meet Software Carpentry’s Greg Wilson and to help out with a SWC boot camp here. In my (free) time, I am working on an automated way to gather skills survey results for other boot camps (prior to when a workshop is conducted), process and analyze these results, and then send the results to other instructors (so they can have background information about their students before they teach).

I started out this morning using R statistics to do descriptives. It was quick and easy. But as I started getting into the specifics of writing R code to do exactly what I want, I realized that the SWC way to do this would be to develop in an ipython notebook. So, I switched over to developing in python using the ipynb.

This post is dedicated to documenting how I got the script up and running today. When I tried using matplotlib in ipynb on my laptop today, it wasn’t working, so I did a fresh install. Thus, it’s your lucky day: here is a zero-to-sixty walkthrough of how to do descriptives of data and display output in charts/graphs in the ipython notebook.

What is the starting point? You have an Excel spreadsheet (or comma-separated, or tab-delimited) file and are willing to develop in the ipynb. That’s it. Here we go…

—————————————————————————————

(1) Download and install Enthought Python.

Now, in the past, installing the IPython Notebook and all of its dependencies has been really difficult. Thanks to resources like Enthought and Anaconda, the install process is streamlined. I have found that Enthought works very well for my Mac needs. And I’ve heard that Anaconda is great for Windows. Both are freely available to users and I highly recommend using one of them to install.

Time: Less than 5 minutes.

(2) Use the easy_install feature to add a couple of extra packages. openpyxl is for reading in excel spreadsheets. statlib is for statistical analyses – imagine that.

Using the Enthought Canopy distribution, installing Python packages is wildly easy. It really incentivizes using other packages, so +10 to having a working easy_install feature. The other packages you’ll need for this walkthrough are openpyxl and statlib. Use the following commands to get them:

$ easy_install openpyxl
$ easy_install statlib

Time: 30 seconds.

(3) Assuming you used a Google Form to collect your survey data, download the data from the google form spreadsheet. Read the data into ipynb from the Excel file.

Basically, using openpyxl:

#open the Excel workbook
workbook = openpyxl.load_workbook(filename = f, use_iterators=True)

#select the worksheet in the workbook
worksheet = workbook.get_sheet_by_name(name = 'Sheet1')

#initialize an empty list to 
table = []
#iterate through the rows in the worksheet
for row in worksheet.iter_rows():
#iterate through the columns
for column in row:
#append the value to the list
table.append(column.internal_value)

Make sure to double-check your data values to see that they are entered correctly (in the order you intend). Alter how you are storing it in Python data structures as necessary. In my case, I actually ended up with a list of lists, where the larger list is of length 16, and each smaller list is of length 29, for a total matrix of 464 values. Each sub-list represents a single variable (column in my Excel spreadsheet). The first value in the list is the survey question/prompt, and the following 28 values are participant responses. Modify your data structure as you see fit for later use in Python.

Time: 30 minutes (trial and error with openpyxl)

(4) Calculate descriptive statistics for your data.

Here is where statlib comes into our script. Calculating the descriptives themselves is actually very easy. In my case, I have a list representing each variable. Here is an example:

list1 = ["Check out a working copy of a project from Github, 
add a file called paper.txt, and commit the change.", 
"I could struggle through it", 
"I could struggle through it", 
"I could struggle through it", 
"I wouldn't know where to start", 
"I could do this easily", 
"I wouldn't know where to start", 
"I could struggle through it",
...]

Note that list1[0] is the survey prompt, and all following values in lists are different individuals’ responses.

In order to simplify the data analysis, I changed these categorical values into integers:

#convert the categorical string value to an integer
def string_to_number(s):
#category 1
if s == "I could do this task easily":
#highest integer represents most perceived ability
return 3
#category 2
elif s == "I could struggle through it":
#mid value
return 2
#category 3
elif s == "I wouldn't know where to start":
#lowest integer represents least perceived ability 
return 1
#not categorized
else:
#original string returned if no matching category
return s

So, my list becomes:

list1 = ["Check out a working copy of a project from Github, 
add a file called paper.txt, and commit the change.", 
2, 
2, 
2, 
1 
3, 
1, 
2,
...]

And now for the descriptives themselves:

#calculate item frequency
#(appropriate for any categorical data)
frequencies = stats.itemfreq(list1[1:])
#calculate mean
#appropriate for ordinal data
m = stats.mean(list1[1:])
#calculate standard deviation
#appropriate for ordinal data
stdv = stats.stdev(list1[1:])

Time: 5 minutes (a few minutes for copy-pasting the categorical string values, really)

(5) Draw the charts and graphs for your data using pylab, matplotlib, numpy, all that good stuff.

Don’t worry, it’s already installed with Enthought. All you have to do is call the appropriate functions. In my case, I want to create a pie chart based on the item frequencies. I also want to know the mean and standard deviation. This code was all adapted from examples of matplotlib provided online, particularly this example.

#create a list of category labels
labels = ["I wouldn't know where to start",
"I could struggle through it",
"I could do this task easily"]

#filter out just the frequencies, not the categories
freq = []
for item in frequencies:
freq.append(item[1])

#set dimensions and figure number
figure(str(f), figsize=(6,6))

#set axes
ax = axes([0.1, 0.1, 0.8, 0.8])

explode=(0, 0.05, 0, 0)

#set colors for chart
c = ['slategray','darkseagreen','rosybrown']

#make the pie chart
pie(freq, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=c)
title("Figure " + str(i), bbox={'facecolor':'0.8', 'pad':10})

#display the chart
show()

That code gives me the following graph (note that the graph was generated when I ran my entire script on the actual dataset, so it will not match the above code exactly for input values):

Screen Shot 2013-05-09 at 11.22.02 PM

Time: 20 minutes (the graph started drawing quickly, but I wanted to change some default values, such as labeling each section of the chart)

(6) Make your output pretty and understandable.

The default colors for graphing in the ipynb can be harsh on the eyes at times. Here is an awesome tutorial about some color theory and best practices for visual presentation.

Remember, I’m trying to create a file that other SWC instructors can read to learn about their student audience. So, I added the entire survey question text to each question-response-analysis. I also printed the mean and standard deviation.

Screen Shot 2013-05-09 at 11.26.32 PM

The image isn’t perfect (or particularly advanced), but I don’t cringe when I look at it, and I think it’s pretty clear and readable for others. Getting it to look even nicer will probably take a bit more work, but this is a zero-to-sixty tutorial, so we aren’t worrying about that too much.

Time: 10 minutes

(7) Convert your ipynb to a format share-able with others.

In this case, I can’t just hand out my data files to instructors because the files contain private student data. At the same time, I want to take my output shown in the pictures above and send it directly to the instructors (saving me the extra work of transferring it to a different file format). Luckily, the ipynb people exist, and they’ve already written a converter that does exactly that. Really, what haven’t they done??

The converter is called nbconvert. In order to get this up and running, I couldn’t use the package installer. Instead, I did a quick download from GitHub, and then ran the following line in the shell:

./nbconvert.py --format=pdf yourfile.ipynb 

You definitely have to be in the directory containing nbconvert.py to do this command, which was a good quick-fix for me. Time? Less than 3 minutes. Including the download. The README file tells you how to set up an appropriate symbolic link so that you have your very own nbconvert command from the shell.

Once I did all this, I got this file: skills_survey_stats.

Time: 5 minutes

Yeah, it’s not perfect:

  1. My equals signs and parentheses are disappearing in the LaTeX conversion.
  2. The print statements do not have bounding boxes and therefore print off the margins of the page.
  3. The code isn’t fancy, flashy, or fast in any way.

But you know what matters? It works. And it’s really close to being distributable and readable to instructors without sacrificing individual privacy. And it is replicable. And very reusable (in fact, we are going to use this sort of process for all SWC events that we can). Best of all, I got this all up-and-running from scratch, with very limited matplotlib background, no openpyxl or statlib background, and no working install of ipynb on my computer and it only took a couple of hours.

That’s really how computational scientists can and should be working, and that’s what SWC is all about.

Time to write this blog post? 40 minutes. Sigh.

Advertisements

2 thoughts on “Data analysis the IPython Notebook way… In a single day

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s