This is a complementary post to my previous piece about validity when evaluating learning. Together these two posts give an overview of the major aspects of creating rigorous assessments/evaluations.
If you haven’t yet read about validity, I suggest you start there. A (very brief) overview: does your assessment test what it claims to test?
In contrast, a reliable assessment produces consistent measures of skills or knowledge under varying conditions. The higher the reliability, the more consistent the results. The image below depicts the difference between validity and reliability in terms of a target:
There are a variety of ways to test an instrument’s reliability. It is not necessary to run all of these tests on your instrument, but it is important to apply as many of them as possible. In addition, certain types of reliability are more appropriate for certain types of instruments. Below are descriptions of the different types of reliability:
- Inter-observer: Do different observers or evaluators who examine the same project/intervention/lesson/performance agree on its overall rating on one or more dimensions? This form of reliability is especially important whenever observation or other potentially subjective types of evaluation are used.
- Test-retest: When the test is given at two different times, does it yield similar results? This type of reliability can establish the stability of scores from an assessment. One way to obtain test-retest reliability is to administer a test once, and then administer it a second time approximately a week later.
- Parallel-forms: Do two different measurements of the same knowledge or skill yield comparable results? In order to establish this form of reliability, two different but similar assessments must be administered to the same population. The scores from the two different test versions can be correlated to evaluate the consistency of results across versions.
- Internal consistency reliability: Do different test items that probe the same construct produce similar results?
- Split-half reliability: When one half of a set of test items is compared to the other half, do they yield similar results? This is determined by splitting all items of a test that probe the same area of knowledge into two different sets. The entire test is administered to a group of individuals, the total score for each set is computed, and the split-half reliability is the correlation between the two set totals.
- Average inter-item correlation: A form of internal consistency reliability – do two items that measure the same construct yield similar results? Obtained by calculating the pairwise correlation coefficient for all items that measure the same construct and then averaging the coefficients.
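To make inter-observer reliability concrete, here is a minimal Python sketch. The post does not name a specific agreement statistic, so this uses Cohen’s kappa, one standard choice for two raters assigning categorical ratings; the function name and sample data are illustrative, not from the original.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    rater_a, rater_b: equal-length lists of categorical ratings, one
    entry per rated project/performance.
    """
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance; values above roughly 0.6 are often read as substantial agreement, though cutoffs vary by field.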
Lastly, here are some ways to improve the reliability of an instrument:
- Make sure that all questions and methodology are clear
- Use explicit definitions of terms
- Use already tested and proven methods