Understanding reliability

Reliability can be assessed in many ways, but in this blog, we will discuss three common methods.

This post was written by students in the Masters of I/O Science program at the University of Guelph, located in Ontario, Canada. Special mentions to the students for their generous contributions: Melissa Pike, Molly Contini, Julia Kearney, and Jordan Moore. The work was supervised by Marian Pitel, VP Research at nugget.ai.

Welcome to the first post in our series about measurement! In organizations, it is incredibly important that research be conducted with rigor. The rigor and quality of research can be assessed through the measurement of validity and reliability. Validity is defined as the extent to which a measure accurately assesses what it intends to measure (Heale & Twycross, 2015). Reliability, on the other hand, refers to the consistency of a test. More specifically, reliability can be defined as whether the test consistently produces similar scores within people or across time and situations (Nunnally & Bernstein, 1994). This post will further explore the concept of reliability. If you missed our post about validity, you can read it here.

Although reliability can be a difficult concept to wrap our heads around, many of us think about it more than we realize. Have you ever taken an online personality test and thought “Why does it say that I’m an introvert? When I took it last week, I was an extravert!” Or have you stepped on your bathroom scale and thought “I can’t weigh that much, I weighed so much less last week!”

If you have encountered anything similar, you have questioned the reliability of a test.

Although it is not the same as validity, reliability is an important part of determining whether a test is valid. That is, if a test cannot produce similar scores over time (i.e. the scores on this test are not reliable), we cannot conclude that it measures the construct that it claims to measure (i.e. that the scores on this test are valid).

Practitioners want to use tests that will consistently return results that are indicative of the applicant’s true ability and are not influenced by outside factors (e.g., time of day, time of year, test environment, or other factors).

If we can trust that scores on a test are consistent across time and situations, then we can be more confident that it measures what it is intended to measure.

Reliability can be assessed in many ways, but in this post we will discuss three common methods: internal consistency, parallel-forms/split-half, and test-retest reliability.

Internal Consistency Reliability

The internal consistency of a test refers to the degree to which the items within that test assess the same thing (Henson, 2001). It is typically measured using Cronbach’s alpha, an index of how closely related all the items on a test are, with a higher value indicating greater internal consistency (Gliem & Gliem, 2003; Tavakol & Dennick, 2011). Cronbach’s alpha typically ranges from 0 to 1; a negative value is possible and usually signals a problem, such as items that need to be reverse-scored. While a higher value is typically a good thing, a very high Cronbach’s alpha could indicate that some items on the test are redundant and could be removed from future iterations of the test. For example, if a manager assesses the internal consistency of a written math test, they may hope to find that their test has a Cronbach’s alpha value between 0.7 and 0.9 (Tavakol & Dennick, 2011). Such a finding would indicate that the questions on the test appear to be assessing the same overarching thing (mathematical ability, in this case), but the items do not all provide the same information. More specifically, some questions may assess addition, while others assess multiplication or division.
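For readers who like to see the arithmetic, here is a minimal Python sketch of how Cronbach’s alpha could be computed for the math test example. It uses the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), with sample variances. The function name `cronbach_alpha` and the scores are made up for illustration; they are not data from a real test.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: one row per applicant, one column per test item.
    """
    k = len(scores[0])  # number of items on the test
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two applicants")

    # Sample variance of each item (column) across applicants.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each applicant's total score.
    total_variance = variance([sum(row) for row in scores])

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: 5 applicants, 4 math items scored 0-5.
math_scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

alpha = cronbach_alpha(math_scores)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With so few applicants and such similar items, the value for this toy data lands above the 0.7 to 0.9 range mentioned above, which is the kind of result that would prompt a check for redundant items.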

Parallel-Forms/Split-Half Reliability

Reliability can also be assessed using parallel-forms and split-half reliability. Two tests are considered parallel if they aim to measure the same thing, defined in the exact same way. To be clear, if two tests are designed by two different people and assess the same thing in two different ways, they would not be parallel. Parallel-forms reliability is assessed by creating two versions of the same test, both of which are thought to measure the same thing (e.g., intelligence) in the same way. Individuals complete both versions of the test, and their scores on the two versions are correlated (a correlation is a statistical index that represents the strength of the relationship between two variables; Bobko, 2001). The correlation between the two parallel forms, therefore, provides an estimate of the reliability of the test.

For example, if a manager wants to ensure that the new math test they have developed consistently assesses applicants’ mathematical ability, the manager could have the applicants complete one version of the test on Monday and the second version on Wednesday. Similar scores on both versions would lead to a high correlation, indicating strong parallel-forms reliability.

Split-half reliability is very similar to parallel-forms reliability; however, instead of comparing two versions of a test, you compare two halves of the same test. In this case, the test is split in half to create two separate tests that aim to measure the same thing, and scores on the two halves are correlated. The relationship between the two halves provides an estimate of the test’s reliability. Continuing with the math example, the manager could assess the split-half reliability of the new math test by splitting it into odd-numbered and even-numbered questions, both of which assess the same thing (i.e., mathematical ability). After applicants fill out the whole test, the manager would correlate scores on the odd-numbered half with scores on the even-numbered half. A high correlation between halves would indicate strong split-half reliability.
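Both approaches described above come down to correlating two sets of scores from the same applicants. The Python sketch below shows one way this might look, using a Pearson correlation. The helper names (`pearson_r`, `split_half_r`) and all scores are hypothetical and only meant to illustrate the idea.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def split_half_r(item_scores):
    """Correlate odd-numbered with even-numbered question totals."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # questions 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # questions 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Parallel forms: hypothetical total scores for 5 applicants on each version.
form_a = [18, 12, 25, 9, 21]   # version taken on Monday
form_b = [17, 14, 24, 10, 19]  # version taken on Wednesday
parallel_r = pearson_r(form_a, form_b)

# Split-half: hypothetical right/wrong (1/0) answers on a 6-question test.
answers = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
half_r = split_half_r(answers)

print(f"Parallel-forms correlation: {parallel_r:.2f}")
print(f"Split-half correlation:     {half_r:.2f}")
```

The odd/even split is used here because it matches the example in the text; other ways of halving a test are possible and can give somewhat different estimates.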

Test-Retest Reliability

Test-retest reliability is perhaps the most common method to assess reliability. This form of reliability involves administering a test to the same people on multiple occasions to determine whether their scores on the test are consistent (James, Mackenzie, & Capra, 2010; Stanley, 1992). Test-retest reliability is calculated by correlating the scores on the test at one point in time with the scores at a later date (Stanley, 1992). For example, a manager could administer the new math test to applicants on one date, then administer the same test two weeks later. If the results of the test at both time points show a high correlation, then the test can be said to exhibit strong test-retest reliability.

One consideration for test-retest reliability is that the scores at the second time point can be influenced by the individual’s knowledge of the test from taking it the first time (James et al., 2010). This increased knowledge of the test could raise their score on the second attempt and, if some people benefit more than others, decrease the correlation between the scores on each administration. For this reason, the time between the two assessments should be long enough that this knowledge doesn’t impact scores, but short enough that the individual doesn’t naturally learn more about the topic in between assessments. Researchers suggest that the ideal time between test administrations could be anywhere between seven and 60 days (Mount, Muchinsky, & Hanse, 1977; Stanley, 1992). Managers should therefore choose an interval within this range to best measure test-retest reliability.
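As a rough illustration of the points above, the Python sketch below correlates scores from two administrations, checks whether the gap between them falls inside the seven-to-60-day window, and reports the average score gain as a simple hint of whether test familiarity may have played a role. The function name `retest_reliability`, the dates, and the scores are all hypothetical.

```python
from datetime import date
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def retest_reliability(time1, time2, date1, date2, min_days=7, max_days=60):
    """Summarize a test-retest check for the same applicants at two dates."""
    gap_days = (date2 - date1).days
    return {
        "r": pearson_r(time1, time2),
        "gap_days": gap_days,
        # Window taken from the 7-to-60-day range suggested in the text.
        "gap_in_suggested_window": min_days <= gap_days <= max_days,
        # Average change from the first to the second administration.
        "mean_gain": sum(b - a for a, b in zip(time1, time2)) / len(time1),
    }


# Hypothetical math test scores for the same 5 applicants, two weeks apart.
first_attempt = [14, 9, 18, 11, 16]
second_attempt = [15, 10, 18, 13, 17]

summary = retest_reliability(
    first_attempt, second_attempt, date(2024, 3, 4), date(2024, 3, 18)
)
print(summary)
```

A high correlation alongside a positive average gain would suggest that applicants kept roughly the same rank order even though scores drifted upward on the second attempt.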

Let’s think back to our earlier examples. If you’re having doubts about the results you are receiving from a test, it might be time to assess its reliability! It’s important to note that not every type of reliability will apply to every test. Most of the time, testing the reliability of a test with one or two of the methods described here will do the trick. Keep these forms of reliability in mind, and they might provide some insight into why the test may or may not be reliable!

To learn more about our research, contact Marian here.
