9 ways Software Carpentry has changed my life

I recently left a position as an intern for Greg Wilson with Software Carpentry. I want to document the impact my involvement with SWC has had on my life and encourage anyone who is looking for a way to get involved in the open source community to consider entering via SWC.

(1) Greg (and SWC) always believed in me. I never once doubted my worth as a member of the organization. I knew that I belonged and that I was appreciated. I also knew that I was heard. I knew that Greg would be there to fight for me, that I always had someone on my side. I have been lucky to have some great mentors in my life, but I definitely took Greg’s support for granted. Now I know that when you find an expert professional mentor, you hold onto them tight. Don’t let them escape!

(2) Greg pushed me to be my best. I was recently reading a post called “20 things 20-year-olds don’t get,” which says that (in your 20s) you should be getting your butt kicked every day. By that, the writer meant that you should improve every day. Greg pushes me to do my best academic work, but he also pushes me to be healthy and effective as a working adult (and to keep a balance with real life).


Greg pushes me to go places and meet people that (as an introvert) I never would otherwise

(3) I got to travel and meet all different sorts of people. I worked in Toronto, Logan, UT, San Francisco, Monterey Bay, and Chicago. I had opportunities to go to Oklahoma, Virginia, and Massachusetts that I just didn’t get time for. I was able to live the dream of being a 20-something traveler, seeing the world.


Mountains in Logan, UT

(4) SWC has a comprehensive code of conduct. In a male-dominated world of engineers and computer scientists, it’s absolutely refreshing to be a part of a community that is as open and accepting as SWC is. SWC has also helped me connect with other groups, like PyLadies, that help me feel supported as a woman in tech. Overall, SWC taught me to expect the best of people, not to settle for the status quo treatment of women in tech.


(5) I was able to find a series of role models. Through SWC, I met people who define the essence of who I want to be as a professional. Titus Brown teaches me to be genuine and not to be afraid to make waves. Matt Davis teaches me to be patient, to listen, and never to be afraid to offer a helping hand. Tracy Teal teaches me to stand up for myself, to build an individual identity, to ask any questions I have. Ethan White teaches me how to be an engaging speaker and how to be welcoming to newcomers. And most of all, Greg Wilson teaches me the power of supporting another person, the power of believing in another person.

Watching Ethan White teach about the shell, inspiring research scientists

(6) I was never once treated as “the new kid.” There is no hazing. There is no Noogler hat. In fact, I believe SWC actually focuses on creating a truly open and accessible environment for newcomers and new-learners of all different levels in the open source software / scientific community. I know now the importance of treating even newcomers as full, contributing, valued members of a community.

(7) SWC taught me to believe in openness. To be fair, Titus Brown has had a huge hand in that lesson as well. But I am now a believer in tweeting and blogging my own work, documenting on GitHub, and generally sharing what I am doing in whatever ways possible. As I am transitioning into a job in a corporate environment, I am struggling with saying goodbye to the vast openness that I experienced when working with SWC. I miss it, and I still believe it is the best way to work.

(8) I had the opportunity to meet and befriend really awesome, talented, driven, exciting people through SWC. I’ve now added Dan McGlinn, Karthik Ram, Jessica McKellar, Jon Pipitone, Anthony Scopatz, and tons of others, to my circle of programmer-friends. They spice up my life, keep me inspired, and give me ideas for how to incorporate my own unique interests / personality into the work that I do.

An ipynb love letter I wrote using Matt Davis’s ipython blocks


Jon Pipitone from Toronto made me an art journal out of his old lab notebooks and left it as a surprise gift for me at a local cafe

(9) SWC is actually making a difference. The work I was doing with SWC was based on interviewing and surveying previous participants of the programming boot camps to find out if the workshops were having a significant impact. What I learned from those interviews, and from meeting all those people, is that SWC and the messages it delivers about open science, reproducibility, and the idea that anyone can program are actually making a difference in the lives of real people. Really, if you’re going to join a volunteer organization, what better group to join than one you know is already out there changing lives for the better? I knew I was making a difference in the world and having a positive, lasting impact.

My first chance teaching for SWC, at Utah State University

I already miss the collaboration I enjoyed with SWC over the past 8-10 months, and I hope to get involved again soon now that I am finally adjusting to my new job. In the meantime, I want to encourage people who are looking for a clear, welcoming, supportive entry point into open source to check out Software Carpentry.


A Rant about the Science of Teaching and Learning

Lately I have been looking at tools used to teach young people how to program. That includes environments and languages like Scratch, Turtle Graphics (in Python), Alice, Greenfoot, etc. It also includes full curriculum programs like Girls Who Code, Black Girls Code and online programs through the Khan Academy and Coursera.

I’m hitting a wall though, and I need to take a minute to discuss the problems I am seeing – otherwise I might implode.

[0] There are a lot of people working on the same problem. First and foremost, I keep uncovering more teachers (through CSTA, CS4HS, the Khan Academy, code.org, independent blogs, Twitter) who are encountering the exact same problems and are asking the same sorts of questions. These people are all doing really incredible work, but they are doing it independently (in their own little corners of the internet). We need to get on the same page.

[1] The tools are repetitive. I once worked on a research project developing algorithm visualizations for JHAVE. While I thought the system was super cool and it was an excellent foray into CS education research for me, I soon realized that a ton of visualization programs actually already exist. In fact, you can see them all at AlgoViz. At first, it seems great that there are so many tools and options out there.

But the problem I’m seeing is that computer scientists are too quick to just create a new tool, rather than extend, mash up, or adapt a tool already in existence. When we find a problem with the existing teaching tools, we build from the ground up, scrapping all the design decisions embodied in other systems. Instead, I think we should try using what is already out there, evaluate whether or not it actually works in a research sense, and make informed decisions about how to continue developing better CS teaching tools based on that research. Yes, that means we have to work together, collaborate. Open-source style, anyone?

[2] The tools are designed for independent learners. I am really starting to think that the “independent learner” design that I am seeing built into teaching tools for CS is part of our pipeline problem. No wonder minorities aren’t getting into computing; we are not giving them access points, an ecosystem of support. Instead, let’s build mentoring, tutoring, conversation, socialization, collaboration, and feedback into our teaching tools.

[3] The actual curriculum out there is limited. If you search “How to teach mean-median-mode to middle schoolers,” you get a series of examples of lesson plans, tools used for teaching, possible assessments for students, feedback about how certain types of lessons target different student demographics. This portfolio of resources allows a teacher to go out and teach immediately, without having to build a curriculum framework around a teaching tool. In CS, often our teaching tools are still stand-alone and are not accompanied by curriculum, lesson plans, assessments, use-case examples, etc. We need to package our tools with the curriculum, not just create the tools and expect teachers to know how to implement them.

[4] The tools do not follow best practices for teaching. Since the tools are often designed primarily by engineers (sometimes without even consulting educators), they often miss the perspective of the “science of teaching and learning.” There is a ton of research out in the world about cognitive psychology, philosophies of education, mental models, the notional machine, inquiry-based learning, and problem-based learning, just to name a few fields and concepts related to CS education. This sort of research should not just inform the development of CS teaching tools; it should fundamentally drive the direction of their development.

[5] The tools do not talk about the “whys” of programming and computational thinking. The world is aflutter with the general need to learn to program and to do so quickly. But why? If you’ve ever attended a middle or high school math class, I’m sure you’ve heard students ask “When would I ever use this in the real world? Why do I have to learn this?” As an engineer, I can brainstorm a thousand reasons why programming is important and exciting and useful and relevant. We have to build this sort of learning direction into our tools and lessons in order to motivate student audiences. For example, “at the end of this lesson, you will know a bit about how search engines work, and consequently you will know how to trick web searches into ranking your personal website/blog/Tumblr highly.”

[6] While we are at the drawing board, let’s consider cultural relevance of teaching tools. Who are we to define what is “cool” and “hip” for young people? Let’s start asking some of them! I cannot speak for middle-schoolers today, but I can tell you that when I was 13, I thought myself pretty “mature” and “adult-like.” So, let’s design teaching systems that play off of the world views held by middle-schoolers. In another example, when I teach on Pine Ridge Reservation, I use Lakota culture and folklore in my lessons about programming. Computer science is not just for rich, white males, so let’s teach it in diverse ways and build that diversity into our teaching tools.

Forming, Storming, Norming, Performing

I was recently introduced to Tuckman’s stages of group development, and I’m having one of those how-have-I-never-seen-this-before? moments. As far as models go, Tuckman’s does not seem particularly revolutionary. What it is, however, is universal. All different types of groups – sports teams, research labs, small project teams in elementary school – go through this set of phases. Perhaps most interesting is Tuckman’s hypothesis that all four phases are necessary steps for teams that are going to grow, face challenges, tackle problems, create solutions, and deliver work.

So, what are the stages?

  1. Forming: The group is just coming together. Individual roles are not clear and aims/objectives may not be set (or may be conflicting).
  2. Storming: Different ideas compete for consideration. The team must address what problems they plan to solve, solidify all aims and objectives, create patterns for individual and group interactions, and adopt a leadership model. At this point, the group will start to see that individuals bring different perspectives to the task. They may begin to confront one another, and these issues must be resolved in order for the group to move forward.
  3. Norming: The group manages to define a single common goal and objective. They have a mutual plan of attack. Often, compromising on ideas is necessary for the team to function. Here, all members take responsibility for the team and begin to feel an identity as a team.
  4. Performing: The team is motivated and knowledgeable. They move through appropriate conflict smoothly and without external supervision. Team members are competent and autonomous. Work is getting done.

It is important to note that after forming, a group can cycle through the stages of development as conflicts arise, new leadership appears, or tasks and goals change. Thus groups stay in a fluid state, transitioning from one stage to the next as appropriate.

Another interesting facet of Tuckman’s development model is its implications for leaders. Managers, coaches, captains, and the like all have certain responsibilities in each stage of the development process. Some aspects, like the forming stage, require a very hands-on, guiding approach with a lot of initiative. Other stages, like storming, require the supervisor to take a step back and allow individuals to resolve their differences a bit more organically.

Now, all of this may seem very obvious when we talk about group development in the abstract. What I challenge you to do is to take this model and apply it to your own life. Think of a very effective, productive team that you have been a member of; can you describe it? What qualities made the team a positive experience? In contrast, think of a dysfunctional group that you have been a member of at one time or another. What qualities made that group ineffective? Which stages were your two different groups in?

And for the teachers out there, try and keep these stages in mind as you divide students into small groups for projects. It is important to ensure that the groups in forming and storming stages especially have the resources they need to move into the subsequent stages of development. What is your role as a teacher or facilitator to help this process along?

A Translation of Software Engineering Jargon

In the past few days, I attended two different workshops co-located with the International Conference on Software Engineering. The workshops were: the International Workshop on Software Engineering for Computational Science and Engineering (SECSI) and the International Workshop on Conducting Empirical Studies in Industry (CESI). I’ve learned one major thing from both of these workshops:

Software engineers have a lot of fancy terms for concepts that Software Carpentry teaches.

This post outlines some of the buzzwords that I heard at the conference (in the order they appeared), and how I see them relating (or not relating) to principles taught by Software Carpentry (SWC). Note: I am *not* a software engineer; these descriptions of software engineering concepts are my best understanding of the ideas/techniques in lay terms.

Agile software development is a development method that is based on iterative and incremental development. Requirements for the software develop through self-organizing, cross-functional teams. Adaptive planning is encouraged. Rapid and flexible response to change are inherent components of this development process.

SWC: We definitely teach incremental development, and we advocate iterative development (although I’m not sure that we truly teach it). Requirements for software in computational science are often vague, because the researchers may not yet know what the outcomes should be. Adaptive planning, therefore, is a must.

Lean software development is emerging from within the agile development movement. Lean development has seven major principles: (1) eliminate waste, (2) amplify learning, (3) decide as late as possible, (4) deliver as fast as possible, (5) empower the team, (6) build integrity in, (7) see the whole. Specifically, eliminating waste refers to taking out any part of software development that does not add value to the customer (like unnecessary functionality, unclear requirements, insufficient testing). Building integrity in refers to ensuring that all of the system’s components work well together as a whole; one way to achieve this quality is refactoring to keep simplicity, clarity, minimum amount of features in the code. In addition, the complete and automated build process should include developer and customer tests.

SWC: I think most computational scientists practice variations of lean software development. Since the products created in this environment generally have very specific use-cases, they only have a very specific (constrained) set of functionalities. SWC advocates building integrity as well, but I’m not sure how much students actually take away the importance of software integrity from workshops.

Continuous integration in software engineering requires all developer workspaces to merge into a shared mainline several times a day. It is intended to be used in combination with automated unit tests. In addition to the automated tests, organizations that use continuous integration usually use a build server to implement continuous processes of applying quality control. Principles of continuous integration are: (1) maintain a code repository, (2) automate the build, (3) make the build self-testing, (4) everyone commits to the baseline every day, (5) every commit (to baseline) should be built, (6) keep the build fast, (7) test in a clone of the production environment, (8) make it easy to get the latest deliverables, (9) everyone can see the latest of the build, (10) automate deployment.

SWC: With the use of version control and git, I believe we are advocating ideas in line with continuous integration, although we do not outright teach or require it. Often, during workshops participants have to pull additions to a git repo of course materials on day 2, which reinforces the concepts of getting latest deliverables, everyone can see the latest build, and deployment is automated.

Test-driven development is a process that requires software developers to write an automated test case that defines an improvement or new function. Initially, the test will fail. Then, the developer writes the minimum amount of code to pass that test. Finally, the developer refactors the new code to suit project standards. This process results in a short development cycle.
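To make the test-first cycle concrete, here is a minimal sketch using Python’s built-in unittest module (the slugify function is a made-up example of mine, not anything from the workshops):

```python
import unittest

# Step 1 (red): the tests below are written first, before slugify exists,
# and they fail. Step 2 (green): write the minimum code to pass them.
# Step 3: refactor to project standards while the tests stay green.
def slugify(title):
    return title.strip().lower().replace(" ", "-")

class TestSlugify(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("My First Post"), "my-first-post")

    def test_surrounding_whitespace_is_stripped(self):
        self.assertEqual(slugify("  Hello  "), "hello")
```

Run it with python -m unittest; writing the failing test before any code is what keeps the cycle short.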

SWC: TDD is taught at workshops now. Usually, the second half of day 2 focuses on testing.

Extreme programming (XP) is a software development methodology that claims to organize people to produce higher-quality software more productively. Four major activities are defined as part of this process: (1) coding, (2) testing, (3) listening, and (4) designing. XP also recognizes five major values: (1) communication, (2) simplicity, (3) feedback, (4) courage, and (5) respect.

SWC: I do not think we directly advocate XP with SWC. We do encourage working with a partner on programming and facilitating communication between programmers.

Pair programming is an agile software development technique where two programmers work together at a single workstation. The “driver” writes the code while the “navigator” reviews each line of code as it is being written. The programmers switch roles frequently.

SWC: The workshops do not have enough time to do actual pair programming, but participants work in pairs on small exercises assigned by instructors.

Unit testing is a method to isolate and test individual units of source code. Each test case should be independent from other tests. A suite of unit tests provides a strict, written contract that a piece of code must satisfy. Benefits of unit testing are: finding problems early, facilitating change, simplifying integration, providing documentation, replacing formal design.
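As a sketch of what isolated, independent test cases look like in practice (the mean function here is a hypothetical unit, not workshop code):

```python
# the unit under test: the kind of small function a research script contains
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / float(len(values))

# each test builds its own input and checks one behaviour in isolation;
# no test depends on another test having run first
def test_mean_of_ints():
    assert mean([1, 2, 3, 4]) == 2.5

def test_mean_of_single_value():
    assert mean([7]) == 7.0

def test_mean_rejects_empty_input():
    try:
        mean([])
        assert False, "expected ValueError for empty input"
    except ValueError:
        pass
```

Because every case stands alone, a failure points directly at the behaviour that broke, which is what makes the suite useful as documentation.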

SWC: Just like TDD, unit testing is taught in SWC boot camps in the second part of day 2. In particular, the benefits of testing are clearly stated for participants.

Scrum is an iterative and incremental technique used in agile software development. In scrum, there are three core roles and a variety of ancillary roles. The core roles are referred to as “pigs” and the ancillary roles as “chickens.” The core roles are: product owner (stakeholders and voice of the customers), development team (responsible for delivering potentially shippable product increments at the end of each sprint), and scrum master (the facilitator of the scrum, accountable for removing impediments to the team’s ability to deliver). A “sprint” is the basic “time-boxed” unit of development in scrum. The process of a sprint follows:

  • Preliminary planning meeting where tasks for sprint are identified and previous progress is reviewed
  • The product owner identifies items for the sprint from the “backlog,” an ordered list of requirements
  • The backlog can only be modified by the development team during the sprint
  • Development occurs over a fixed time period (one week to one month, generally)
  • After a sprint is completed, the development team demonstrates how to use the software

A sprint often has daily meetings called the “daily scrum.” Such meetings have specific guidelines: all members come prepared with updates; the meeting starts precisely on time, even if members are missing; the meeting should happen at the same location and time every day; the meeting length is timeboxed for 15 minutes; and all are welcome but only the core roles speak. Generally, the meetings are conducted with all members standing, to discourage a lengthier meeting. During the daily scrum, team members answer the following questions: (1) what have you done since yesterday? (2) what are you planning to do today? (3) any impediments/stumbling blocks?

SWC: Scrumming is not used in Software Carpentry workshops. At first glance, I don’t think it really has a place in SWC. But, when we are teaching participants “best practices” for computational science, maybe we should teach them the ideas of a daily “check-in” with small goals and incremental steps in software development.

Data analysis the IPython Notebook way… In a single day

I’m at Mozilla in Toronto for a few days to meet Software Carpentry’s Greg Wilson and to help out with a SWC boot camp here. In my (free) time, I am working on an automated way to gather skills survey results for other boot camps (prior to when a workshop is conducted), process and analyze these results, and then send the results to other instructors (so they can have background information about their students before they teach).

I started out this morning using R to compute descriptive statistics. It was quick and easy. But as I started getting into the specifics of writing R code to do exactly what I want, I realized that the SWC way to do this would be to develop in an ipython notebook. So, I switched over to developing in python using the ipynb.

This post is dedicated to documenting how I got the script up and running today. When I tried using matplotlib in ipynb on my laptop today, it wasn’t working, so I did a fresh install. Thus, it’s your lucky day: here is a zero-to-sixty walkthrough of how to do descriptives of data and display output in charts/graphs in the ipython notebook.

What is the starting point? You have an Excel spreadsheet (or comma-separated, or tab-delimited) file and are willing to develop in the ipynb. That’s it. Here we go…


(1) Download and install Enthought Python.

Now, in the past, installing the IPython Notebook and all of its dependencies has been really difficult. Thanks to resources like Enthought and Anaconda, the install process is streamlined. I have found that Enthought works very well for my Mac needs. And I’ve heard that Anaconda is great for Windows. Both are freely available to users and I highly recommend using one of them to install.

Time: Less than 5 minutes.

(2) Use the easy_install feature to add a couple of extra packages. openpyxl is for reading in excel spreadsheets. statlib is for statistical analyses – imagine that.

Using the Enthought Canopy distribution, installing Python packages is wildly easy. It really incentivizes using other packages, so +10 to having a working easy_install feature. The other packages you’ll need for this walkthrough are openpyxl and statlib. Use the following commands to get them:

$ easy_install openpyxl
$ easy_install statlib

Time: 30 seconds.

(3) Assuming you used a Google Form to collect your survey data, download the data from the google form spreadsheet. Read the data into ipynb from the Excel file.

Basically, using openpyxl:

#openpyxl was installed in step (2)
import openpyxl

#open the Excel workbook
workbook = openpyxl.load_workbook(filename = f, use_iterators=True)

#select the worksheet in the workbook
worksheet = workbook.get_sheet_by_name(name = 'Sheet1')

#initialize an empty list to hold the rows
table = []
#iterate through the rows in the worksheet
for row in worksheet.iter_rows():
    values = []
    #iterate through the columns
    for column in row:
        #append the cell's value to this row's list
        values.append(column.internal_value)
    table.append(values)
Make sure to double-check your data values to see that they are entered correctly (in the order you intend). Alter how you are storing it in Python data structures as necessary. In my case, I actually ended up with a list of lists, where the larger list is of length 16, and each smaller list is of length 29, for a total matrix of 464 values. Each sub-list represents a single variable (column in my Excel spreadsheet). The first value in the list is the survey question/prompt, and the following 28 values are participant responses. Modify your data structure as you see fit for later use in Python.
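If, like me, you end up wanting one list per column rather than one per row, transposing the row-wise table is a one-liner. A minimal sketch, with toy values standing in for the real survey data:

```python
# rows as read from the worksheet (toy values, not the actual survey data)
table = [["Q1", "Q2"],
         ["I could do this task easily", "I wouldn't know where to start"],
         ["I could struggle through it", "I could do this task easily"]]

# transpose so each sub-list holds one variable: the prompt first,
# followed by every participant's response to that prompt
columns = [list(col) for col in zip(*table)]
```

After this, columns[0] is the first variable: the prompt "Q1" followed by each response to it.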

Time: 30 minutes (trial and error with openpyxl)

(4) Calculate descriptive statistics for your data.

Here is where statlib comes into our script. Calculating the descriptives themselves is actually very easy. In my case, I have a list representing each variable. Here is an example:

list1 = ["Check out a working copy of a project from Github, "
         "add a file called paper.txt, and commit the change.",
         "I could struggle through it",
         "I could struggle through it",
         "I could struggle through it",
         "I wouldn't know where to start",
         "I could do this task easily",
         "I wouldn't know where to start",
         "I could struggle through it",
         #...and so on for the remaining responses
         ]

Note that list1[0] is the survey prompt, and all following values in the list are different individuals’ responses.

In order to simplify the data analysis, I changed these categorical values into integers:

#convert the categorical string value to an integer
def string_to_number(s):
    #category 1
    if s == "I could do this task easily":
        #highest integer represents most perceived ability
        return 3
    #category 2
    elif s == "I could struggle through it":
        #mid value
        return 2
    #category 3
    elif s == "I wouldn't know where to start":
        #lowest integer represents least perceived ability
        return 1
    #not categorized:
    #original string returned if no matching category
    return s
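A more compact variant of the same conversion uses a dictionary lookup; this is a sketch of an alternative design, not the code I actually ran:

```python
# map each categorical response to an integer; higher = more perceived ability
CATEGORIES = {
    "I could do this task easily": 3,
    "I could struggle through it": 2,
    "I wouldn't know where to start": 1,
}

def string_to_number(s):
    # fall back to the original string if no category matches
    return CATEGORIES.get(s, s)

responses = ["I could struggle through it",
             "I wouldn't know where to start",
             "I could do this task easily"]
converted = [string_to_number(s) for s in responses]
```

The dict keeps the category-to-integer mapping in one place, which makes it easier to add or rename categories later.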

So, my list becomes:

list1 = ["Check out a working copy of a project from Github, "
         "add a file called paper.txt, and commit the change.",
         2, 2, 2, 1, 3, 1, 2,
         #...and so on for the remaining responses
         ]

And now for the descriptives themselves:

#the stats module comes from statlib, installed in step (2)
from statlib import stats

#calculate item frequency
#(appropriate for any categorical data)
frequencies = stats.itemfreq(list1[1:])
#calculate mean
#appropriate for ordinal data
m = stats.mean(list1[1:])
#calculate standard deviation
#appropriate for ordinal data
stdv = stats.stdev(list1[1:])

Time: 5 minutes (a few minutes for copy-pasting the categorical string values, really)

(5) Draw the charts and graphs for your data using pylab, matplotlib, numpy, all that good stuff.

Don’t worry, it’s already installed with Enthought. All you have to do is call the appropriate functions. In my case, I want to create a pie chart based on the item frequencies. I also want to know the mean and standard deviation. This code was all adapted from examples of matplotlib provided online, particularly this example.

#create a list of category labels
labels = ["I wouldn't know where to start",
          "I could struggle through it",
          "I could do this task easily"]

#filter out just the frequencies, not the categories
#(each entry of frequencies is an [item, count] pair)
freq = []
for item in frequencies:
    freq.append(item[1])

#set dimensions and figure number
figure(str(f), figsize=(6,6))

#set axes
ax = axes([0.1, 0.1, 0.8, 0.8])

#offsets for pulling slices out of the pie (not used in the call below)
explode=(0, 0.05, 0, 0)

#set colors for chart
c = ['slategray','darkseagreen','rosybrown']

#make the pie chart
pie(freq, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=c)
title("Figure " + str(i), bbox={'facecolor':'0.8', 'pad':10})

#display the chart
show()
That code gives me the following graph (note that the graph was generated when I ran my entire script on the actual dataset, so it will not match the above code exactly for input values):


Time: 20 minutes (the graph started drawing quickly, but I wanted to change some default values, such as labeling each section of the chart)

(6) Make your output pretty and understandable.

The default colors for graphing in the ipynb can be harsh on the eyes at times. Here is an awesome tutorial about some color theory and best practices for visual presentation.

Remember, I’m trying to create a file that other SWC instructors can read to learn about their student audience. So, I added the entire survey question text to each question-response-analysis. I also printed the mean and standard deviation.


The image isn’t perfect (or particularly advanced), but I don’t cringe when I look at it, and I think it’s pretty clear and readable for others. Getting it to look even nicer will probably take a bit more work, but this is a zero-to-sixty tutorial, so we aren’t worrying about that too much.

Time: 10 minutes

(7) Convert your ipynb to a format share-able with others.

In this case, I can’t just hand out my data files to instructors because the files contain private student data. At the same time, I want to take my output shown in the pictures above and send it directly to the instructors (saving me the extra work of transferring it to a different file format). Luckily, the ipynb people exist, and they’ve already written a converter that does exactly that. Really, what haven’t they done??

The converter is called nbconvert. In order to get this up and running, I couldn’t use the package installer. Instead, I did a quick download from GitHub, and then ran the following line in the shell:

./nbconvert.py --format=pdf yourfile.ipynb 

You definitely have to be in the directory containing nbconvert.py to run this command, but that was a good enough quick fix for me; the whole thing took less than 3 minutes, including the download. The README file tells you how to set up an appropriate symbolic link so that you have your very own nbconvert command from the shell.

Once I did all this, I got this file: skills_survey_stats.

Time: 5 minutes

Yeah, it’s not perfect:

  1. My equals signs and parentheses are disappearing in the LaTeX conversion.
  2. The print statements do not have bounding boxes and therefore print off the margins of the page.
  3. The code isn’t fancy, flashy, or fast in any way.

But you know what matters? It works. And it’s really close to being distributable and readable by instructors without sacrificing individual privacy. And it is replicable. And very reusable (in fact, we are going to use this sort of process for all SWC events that we can). Best of all, I got this all up-and-running from scratch, with very limited matplotlib background, no openpyxl or statlib background, and no working install of ipynb on my computer, and it only took a couple of hours.

That’s really how computational scientists can and should be working, and that’s what SWC is all about.

Time to write this blog post? 40 minutes. Sigh.

Reliability when Evaluating Learning

This is a complementary post to my previous piece about validity when evaluating learning. Together these two posts give an overview of the major aspects of creating rigorous assessments/evaluations.

If you haven’t yet read about validity, I suggest you start there. A (very brief) overview: does your assessment test what it claims to test?

In contrast, a reliable assessment produces consistent measures of skills or knowledge under varying conditions. The higher the reliability, the more consistent the results. The image below depicts the difference between validity and reliability in terms of a target:

There are a variety of ways to test an instrument’s reliability. It is not necessary to run all of these tests on your instrument, but it is important to run as many of them as possible. In addition, certain types of reliability are more appropriate for certain types of instruments. Below are descriptions of the different types of reliability:

  1. Inter-observer: Do different observers and evaluators examine the same project/intervention/lesson/performance and agree on its overall rating on one or more dimensions? This form of reliability is especially important whenever observation or other possibly subjective types of evaluation are used.
  2. Test-retest: When the test is given at two different times, does it yield similar results? This type of reliability can establish the stability of scores from an assessment. A possible way to obtain test-retest reliability is to administer a test once, and then again a second time approximately a week later.
  3. Parallel-forms: Do two different measurements of the same knowledge or skill yield comparable results? In order to establish this form of reliability, two different but similar assessments must be administered to the same population. The scores from the two different test versions can be correlated to evaluate the consistency of results across versions.
  4. Internal consistency reliability: Do different test items that probe the same construct produce similar results?
  5. Split-half reliability: When one half of a set of test items is compared to the other half, do they yield similar results? This is determined by splitting all items of a test that probe the same area of knowledge into two sets. The entire test is administered to a group of individuals, the total score for each set is computed, and the split-half reliability is obtained by calculating the correlation between the two total set scores.
  6. Average inter-item correlation: A form of internal consistency reliability – do two items that measure the same construct yield similar results? Obtained by calculating the pairwise correlation coefficient for all items that measure the same construct and then averaging the coefficients.
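The last two measures above are straightforward to compute. As a minimal sketch (the response matrix below is entirely made up for illustration), split-half reliability correlates the totals of two halves of the items, and average inter-item correlation averages the pairwise correlations between items:

```python
import numpy as np

# Hypothetical data: 5 respondents x 4 items that probe the same construct.
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
])

# Split-half reliability: correlate each respondent's total on one half
# of the items with their total on the other half (here, odd vs. even items).
first_half = scores[:, ::2].sum(axis=1)
second_half = scores[:, 1::2].sum(axis=1)
split_half_r = np.corrcoef(first_half, second_half)[0, 1]

# Average inter-item correlation: mean of the pairwise correlations
# between all items (upper triangle of the item correlation matrix).
item_corr = np.corrcoef(scores.T)
avg_inter_item = item_corr[np.triu_indices(scores.shape[1], k=1)].mean()

print(f"split-half r = {split_half_r:.2f}, "
      f"average inter-item r = {avg_inter_item:.2f}")
```

With consistent responses like these, both values come out strongly positive; in practice you would interpret them against conventional thresholds for your field.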

Lastly, here are some ways to improve the reliability of an instrument:

  • Make sure that all questions and methodology are clear
  • Use explicit definitions of terms
  • Use already tested and proven methods

Validity when Evaluating Learning

I want to spend some time writing about major topics that pertain to assessment. If you’re new to assessment or if you’re looking for ideas for how to do really good evaluation of educational interventions, then this is definitely something you should read. Today’s topic is validity.

What is validity?

Validity refers to the accuracy of an assessment. Does it measure what it is supposed to measure? A toy example: you have a scale in your bathroom that consistently displays your weight. Every day your weight is the same: 140 lbs. But your actual weight is 150 lbs. Such a scale is not valid.

Validity requires that the purpose of the assessment is carefully defined. In order for researchers to determine if the items on an assessment target their goal, the goal must be explicit for the researchers. For all of your assessments, you need to answer the following questions:

  • What do you want to measure? In education, this often means: for students, what sort of activities would indicate “success?”
  • Why do we want to measure this specific topic?
  • Can it be measured?
  • How can we best ensure that what we measure is actually measuring what we want/intend?

Now, for an example of aligning goals of assessment with purpose.
In CSE 231 (introductory computer science) at MSU, students attend lecture twice a week and a lab session once a week. During lecture, they passively listen to a speaker talk about programming. In lab, the students work in pairs to complete a programming assignment. In addition, every week they have a new programming project to complete as homework. The skills targeted by these projects are: (1) basic programming competence and (2) problem-solving using computing.

The exams in the course, however, are multiple choice and based on code-reading (see picture below for a sample question). Other than on the exams, students never practice multiple-choice questions or code-reading exercises. The *goal* of the exam is to measure a student’s programming ability with respect to solving problems. In practice, the exam actually measures: (1) a student’s ability to adapt their Python knowledge to unfamiliar question types, (2) a student’s ability to read code (but not write it), and (3) a student’s ability to interpret a given solution to an unspecified problem. My concern here is that the students who succeed on the exams are not the students who exhibit the ability to solve problems with Python, but students who can interpret unfamiliar code in Python (or who are good test-takers on multiple choice exams).

What does all of this talk about purpose and goals mean for Software Carpentry? Well, it means that we need to define our goals for SWC assessment very clearly, and then make sure that our assessment items target those goals. Some possible goals I have seen:

  • Actual computational ability (proficiency with bash, Python, SQL, etc.)
  • Actual software design ability (proficiency with modularity, testing, version control)
  • Efficiency in workflow
  • Amount and quality of scientific work produced

These goals are very different from what I believe we are currently assessing in SWC (via Libarkin 2012 and Aranda 2012). The goals targeted by those assessments, as I see them, are:

  • Self-perceived computational ability
  • Self-perceived software design ability
  • Confidence (or anxiety) related to computational ability
  • Perceived learning gains due to SWC intervention

In order to have valid assessments for SWC, we are going to need to explicitly define our goals and then restructure our assessments so that they target those goals.

What are some types of validity?

There are a lot of different types of validity that an assessment may seek to achieve. It is important to consider at least two or three of them with respect to any assessment. Often, the purpose of the evaluation indicates which types of validity are important to consider. When developing your own assessment, try to cover as many types of validity as possible. It is possible to explicitly demonstrate validity, and doing so may require special tests or input from experts. Note that the names of each type of validity sometimes vary across fields.

The following are types of internal validity, which refer to the confidence that researchers can place on the proposition that the assessment shows a cause-and-effect relationship. Note that internal validity does not establish generalizability of the measure.

Face validity: Do the assessment items appear (on face-value) to measure the desired construct? A panel of experts can be used to establish face validity.

Content validity: Do the items cover all of the content targeted by the goals? A panel of experts can be used to review the items on an assessment (with respect to the goals) for content validity.

Construct validity: Does the assessment measure the construct it claims to measure? Or, does it measure a similar (yet, still different) construct? (If you don’t know what a construct is, then read this blog post). Demonstration of comparative test performance results or a pre-test/post-test framework can be used to show construct validity. There are two major sub-types of construct validity to be discussed separately: discriminant and convergent.

Discriminant validity: Is it clear that measurements that should not be related actually are not related? Lack of correlation can establish discriminant validity.

Convergent validity: If two constructs are considered to be related, are their measurements also related? Positive correlation can establish convergent validity.

Criterion-related validity: How “good” is your measure? To establish criterion-related validity, compare the measure with some other (outside) validated measure. Sub-types of criterion-related validity are described separately, including predictive and concurrent validity.

Predictive validity: Can the assessment be used to predict a recognized association between the target construct and a different construct? In order to show predictive validity, one measure is used at an earlier time to predict the results on a later measure.

Concurrent validity: Does the measure positively correlate with another measure that was previously validated? Typically, to establish concurrent validity, two different assessments for the same construct are employed.
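Several of the types above (convergent, discriminant, concurrent) come down to computing a correlation between two sets of scores. As a sketch, here is how concurrent validity might be checked by correlating a new instrument against an established one (the scores below are invented for illustration):

```python
import numpy as np

# Hypothetical scores for the same 6 learners on a new assessment
# and on a previously validated assessment of the same construct.
new_assessment = np.array([55, 62, 70, 48, 81, 66])
validated_assessment = np.array([58, 60, 74, 50, 85, 63])

# Pearson correlation between the two sets of scores.
r = np.corrcoef(new_assessment, validated_assessment)[0, 1]
print(f"concurrent validity correlation: r = {r:.2f}")
```

A strong positive correlation supports concurrent validity; for discriminant validity you would instead expect a correlation near zero between measures of supposedly unrelated constructs.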

A different type of validity is external validity, which asks “Which populations, variables, and treatments can this measure be applied to?”

Population validity: How well can the sample measured be extrapolated to the population as a whole? Was the sampling procedure acceptable/appropriate?

Ecological validity: How does the testing environment influence behavior of participants taking an assessment? To what extent do these measures apply to real-world settings?

For more reading about each of these types of validity, as well as some examples, see this website. Yes, I know it has a lot of ads. But I surveyed a lot of sites about validity, and this one has the best information and the most comprehensive coverage. Just ignore the ads. :]

What are threats to validity?

Now that we have talked about a bunch of different types of validity, let’s spend some time brainstorming things that might threaten the validity of an assessment/measure/instrument. The list I have below is adapted from a post here:

  • Inappropriate selection of constructs (or poor definition of constructs)
  • Inappropriate selection of measures (like in CSE 231 exams)
  • Measurement is performed in too few contexts
  • Measurement is performed with too few variables measured
  • Too much variation appears in the data
  • Target subjects (sample population) selected inadequately
  • Constructs interact complexly
  • Subjects are biased when being assessed
  • Experimental method is not valid
  • Operation of experiment is not rigorous

For a more in depth discussion of threats to validity and ways they can be minimized, see these slides.