I want to spend some time writing about major topics that pertain to assessment. If you’re new to assessment or if you’re looking for ideas for how to do really good evaluation of educational interventions, then this is definitely something you should read. Today’s topic is validity.
What is validity?
Validity refers to the accuracy of an assessment. Does it measure what it is supposed to measure? A toy example: you have a scale in your bathroom that consistently displays your weight. Every day it shows the same number: 140 lbs. But your actual weight is 150 lbs. Such a scale is consistent, but it is not valid.
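To make the scale example concrete, here is a minimal sketch in Python. The five daily readings are made up for illustration; the point is only that a measure can have zero spread (it is perfectly consistent) and still be off by a fixed amount (it is not valid).

```python
from statistics import mean, pstdev

TRUE_WEIGHT_LBS = 150.0  # actual weight from the example above

# Hypothetical readings: the scale shows 140 lbs every single day.
readings = [140.0, 140.0, 140.0, 140.0, 140.0]

spread = pstdev(readings)                # how much the readings vary
bias = mean(readings) - TRUE_WEIGHT_LBS  # how far off they are on average

print(f"spread: {spread} lbs, bias: {bias} lbs")
```

A spread of zero with a bias of -10 lbs is exactly the situation described above: the scale agrees with itself, but not with reality.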
Validity requires that the purpose of the assessment is carefully defined. In order for researchers to determine if the items on an assessment target their goal, the goal must be explicit for the researchers. For all of your assessments, you need to answer the following questions:
- What do you want to measure? In education, this often means: what sort of student activities would indicate “success”?
- Why do we want to measure this specific topic?
- Can it be measured?
- How can we best ensure that our measurement actually captures what we intend to measure?
Now, for an example of aligning goals of assessment with purpose.
In CSE 231 (introductory computer science) at MSU, students attend lecture twice a week and a lab session once a week. During lecture, they passively listen to a speaker talking about programming. In lab, the students work in pairs to complete a programming assignment. In addition, every week they have a new programming project to complete as homework. The skills we target with these projects are: (1) basic programming competence and (2) problem-solving using computing.
The exams in the course, however, are multiple choice and based on code-reading (see picture below for a sample question). Other than on the exams, students never get practice with multiple-choice questions or code-reading exercises. The *goal* of the exam is to measure a student’s programming ability with respect to solving problems. In practice, the exam actually measures: (1) a student’s ability to adapt their Python knowledge to unfamiliar question types, (2) a student’s ability to read code (but not write it), and (3) a student’s ability to interpret a given solution to an unspecified problem.
My concern here is that the students who succeed on the exams are not the students who exhibit the ability to solve problems with Python, but students who can interpret unfamiliar Python code (or who are good test-takers on multiple-choice exams).
What does all of this talk about purpose and goals mean for Software Carpentry? Well, it means that we need to define our goals for SWC assessment very clearly, and then make sure that our assessment items target those goals. Some possible goals I have seen:
- Actual computational ability (proficiency with bash, Python, SQL, etc.)
- Actual software design ability (proficiency with modularity, testing, version control)
- Efficiency in workflow
- Amount and quality of scientific work produced
These goals are very different from what I believe we are currently assessing in SWC (via Libarkin 2012 and Aranda 2012). The goals targeted by those assessments, as I see them, are:
- Self-perceived computational ability
- Self-perceived software design ability
- Confidence (or anxiety) related to computational ability
- Perceived learning gains due to SWC intervention
In order to have valid assessments for SWC, we are going to need to explicitly define our goals and then restructure our assessments so that they target those goals.
What are some types of validity?
There are a lot of different types of validity that an assessment may seek to achieve. It is important to consider at least 2 or 3 of them with respect to any assessment. Often, the purpose of the evaluation indicates which types of validity are important to consider. When developing your own assessment, try to cover as many types of validity as possible. It is possible to explicitly show validity, and doing so may require special tests or input from experts. Note that the names of the types of validity sometimes vary across fields.
The following are types of internal validity, which refers to the confidence that researchers can place in the proposition that the assessment shows a cause-and-effect relationship. Note that internal validity does not establish generalizability of the measure.
Face validity: Do the assessment items appear (on face-value) to measure the desired construct? A panel of experts can be used to establish face validity.
Content validity: Do the items cover all of the content targeted by the goals? A panel of experts can be used to review the items on an assessment (with respect to the goals) for content validity.
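Before handing items to an expert panel, a quick first-pass check for content coverage can be sketched in a few lines of Python. Everything here is hypothetical: the goal list and the item-to-goal mapping are invented for illustration, and a real review would still need experts to judge whether each item really targets the goal it is tagged with.

```python
# Hypothetical goals the assessment is supposed to cover.
goals = {"bash", "Python", "SQL", "testing", "version control"}

# Hypothetical mapping from each assessment item to the goals it targets.
item_goals = {
    "Q1": {"Python"},
    "Q2": {"Python", "testing"},
    "Q3": {"bash"},
    "Q4": {"SQL"},
}

# Goals touched by at least one item, and goals no item touches.
covered = set().union(*item_goals.values())
uncovered = goals - covered

print("Goals with no items:", sorted(uncovered))
```

With this made-up mapping, no item targets version control, so the assessment would have a gap in content coverage.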
Construct validity: Does the assessment measure the construct it claims to measure? Or, does it measure a similar (yet, still different) construct? (If you don’t know what a construct is, then read this blog post). Demonstration of comparative test performance results or a pre-test/post-test framework can be used to show construct validity. There are two major sub-types of construct validity to be discussed separately: discriminant and convergent.
Discriminant validity: Is it clear that measurements that should not be related actually are not related? Lack of correlation can establish discriminant validity.
Convergent validity: If two constructs are considered to be related, are their measurements also related? Positive correlation can establish convergent validity.
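Here is a minimal sketch of how correlation can be used for both checks, assuming Pearson's r is an acceptable measure of "related" for your data. The scores are invented for six imaginary learners, and `pearson_r` is a small helper written for this example, not part of any assessment toolkit.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical scores (made-up data for illustration only).
coding_test = [55, 62, 70, 78, 85, 93]
debugging_task = [50, 65, 68, 80, 83, 95]  # expected to be related to coding
shoe_size = [8, 10, 9, 9, 10, 8]           # expected to be unrelated to coding

convergent_r = pearson_r(coding_test, debugging_task)
discriminant_r = pearson_r(coding_test, shoe_size)

print(f"coding vs. debugging: r = {convergent_r:.2f}")
print(f"coding vs. shoe size: r = {discriminant_r:.2f}")
```

A strong positive r for the first pair supports convergent validity, and an r near zero for the second pair supports discriminant validity. With only six data points this is purely illustrative; real evidence needs a much larger sample.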
Criterion-related validity: How “good” is your measure? To establish criterion-related validity, compare the measure with some other (outside) validated measure. Sub-types of criterion-related validity are described separately, including predictive and concurrent validity.
Predictive validity: Can the assessment be used to predict a recognized association between the target construct and a different construct? In order to show predictive validity, one measure is used at an earlier time to predict the results on a later measure.
Concurrent validity: Does the measure positively correlate with another measure that was previously validated? Typically, to establish concurrent validity, two different assessments for the same construct are employed.
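For predictive validity, one simple approach is to fit a line from the earlier measure to the later one and see how well it predicts. The sketch below uses ordinary least squares on made-up numbers; the variable names and the data are assumptions for illustration, and a real study would check the predictions against learners who were not used to fit the line.

```python
# Hypothetical data: an earlier assessment score and a later outcome
# for the same five learners (made-up numbers for illustration only).
earlier_score = [1.0, 2.0, 3.0, 4.0, 5.0]
later_outcome = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(earlier_score)
mean_x = sum(earlier_score) / n
mean_y = sum(later_outcome) / n

# Ordinary least-squares slope and intercept.
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(earlier_score, later_outcome))
s_xx = sum((x - mean_x) ** 2 for x in earlier_score)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(score):
    """Predicted later outcome for a given earlier score."""
    return intercept + slope * score

print(f"later = {intercept:.2f} + {slope:.2f} * earlier")
```

If predictions like `predict(3.0)` turn out to be close to what new learners actually achieve later, that is evidence of predictive validity for the earlier measure.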
A different type of validity is external validity, which asks “Which populations, variables, and treatments can this measure be applied to?”
Population validity: How well can the sample measured be extrapolated to the population as a whole? Was the sampling procedure acceptable/appropriate?
Ecological validity: How does the testing environment influence behavior of participants taking an assessment? To what extent do these measures apply to real-world settings?
For more reading about each of these types of validity, as well as some examples, see this website. Yes, I know the site has a lot of ads. But I surveyed a lot of sites about validity, and this one has the best information and the most comprehensive coverage. Just ignore the ads. :]
What are threats to validity?
Now that we have talked about a bunch of different types of validity, let’s spend some time brainstorming things that might threaten the validity of an assessment/measure/instrument. The list I have below is adapted from a post here:
- Inappropriate selection of constructs (or poor definition of constructs)
- Inappropriate selection of measures (like in CSE 231 exams)
- Measurement is performed in too few contexts
- Measurement is performed with too few variables measured
- Too much variation appears in the data
- Target subjects (sample population) selected inadequately
- Constructs interact complexly
- Subjects are biased when being assessed
- Experimental method is not valid
- Operation of experiment is not rigorous
For a more in-depth discussion of threats to validity and ways they can be minimized, see these slides.