From manual scoring to AI-based assessments at

A primer on how we developed an AI-based assessment model backed by decades of scientific research, with roots in modern practices in Industrial-Organizational Psychology

Authored by: Marian Pitel, PhD

VP of I/O Psychology @

Marian is a very smart lady.

Hiring the best person for the job has never been a simple task. Often, the hiring process consists of a series of steps, and the final hiring decision is made after considering how well applicants have done at each stage. How the rankings and ratings of job candidates are determined at each stage is important to think about when developing and evaluating a company’s hiring process.

This task of assessing how an applicant performs at each stage is easier if the hiring tool currently being used comes with a pre-established way of measuring applicant performance.

Marian Pitel, PhD in I/O Psychology at University of Guelph

For example, a published and validated cognitive ability test will likely come with a specially tailored rubric that can be used to identify which of the candidates who complete the test perform better than others. However, if the hiring tool is new or otherwise does not have a complementary scoring guide, then it can be difficult to determine how candidates should be assigned ratings or rankings based on their performance on the tool. Nonetheless, this is an important task to take on as it has direct implications for which candidates end up being hired by companies.

How we tackled it

We developed and tested a scoring model, which we called a scoring aid, for manually assigning scores that indicate candidates' performance on the assessment. We placed the utmost focus on determining which indicators would be prioritized and used to judge the overall success of the scoring aid. The success criteria of the manual scoring model were decided a priori to consist of:

  1. Consistency between raters using the manual scoring model
  2. Consistency between performance scores assigned by raters using the scoring model and holistic assessments of candidate performance by hiring managers
  3. Consistency between the ranking of candidates based on manual assessment scores and ranking of candidates based on holistic evaluations by hiring managers of resume and general job applications
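The third criterion compares two rankings of the same candidates. One common way to quantify that kind of agreement is a Spearman rank correlation; the article does not say which statistic was actually used, so the sketch below is only an illustration. The function names and the sample scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values so the highest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are equal")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: manual assessment scores and hiring-manager ratings
# for the same five candidates, in the same order.
manual_scores = [82, 75, 91, 60, 70]
manager_ratings = [4, 5, 3, 1, 2]
rho = spearman(manual_scores, manager_ratings)
```

A value near 1 would mean the two rankings largely agree, while a value near 0 would mean they are unrelated.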

The first draft...

After several iterations of the manual scoring model and rounds of pilot testing, we evaluated the scoring model and reached the following conclusions.

  • User-friendly semi-automated interface: The first key accomplishment was the development of a user-friendly manual scoring tool on Qualtrics, a commonly used survey creation application. This scoring tool combined binary forced-choice, single-option multiple-choice, select-all-that-apply multiple-choice, slider, and open-ended questions, with each format chosen to match the nature of the question. Creating the tool on Qualtrics had several benefits: (1) allowing the rater to focus on one assessment and one section at a time; (2) not giving the rater access to other candidates’ assessment scores, a confidentiality issue that would be present if the scores were manually entered into an Excel sheet; (3) automating the development of a master database of all candidate scores; and (4) keeping track of the length of time it takes raters to score each candidate.

  • Specific score-tallying system: The second key accomplishment was the development of a specific score-tallying system with equal weights assigned to each item. The score-tallying system was points-based and allowed us to assign a score for each section of the assessment as well as an overall score. The benefit of this system is that it allows us to report back to the client each candidate's absolute and partitioned scores and/or their ranking in the hiring pool relative to the performance of other candidates.
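As a rough illustration of a points-based tally with equal item weights, the sketch below sums item points into section scores and an overall score, then ranks candidates by the overall score. The section names, point values, and function names are hypothetical and are not taken from the actual scoring tool.

```python
def tally(item_points):
    """Sum unweighted item points into per-section totals and an overall total.

    item_points maps a section name to the list of points earned on its items.
    Every item counts equally: no per-item or per-section weights are applied.
    """
    section_totals = {section: sum(points) for section, points in item_points.items()}
    overall = sum(section_totals.values())
    return section_totals, overall


def rank_candidates(overall_scores):
    """Return candidate ids ordered from highest to lowest overall score."""
    return sorted(overall_scores, key=overall_scores.get, reverse=True)


# Hypothetical scoring sheets for three candidates.
sheets = {
    "candidate_a": {"analysis": [1, 0, 1], "communication": [1, 1]},
    "candidate_b": {"analysis": [1, 1, 1], "communication": [1, 1]},
    "candidate_c": {"analysis": [0, 0, 1], "communication": [1, 0]},
}

results = {cid: tally(sheet) for cid, sheet in sheets.items()}
overall_scores = {cid: overall for cid, (_, overall) in results.items()}
ranking = rank_candidates(overall_scores)
```

Keeping the section totals alongside the overall score is what makes it possible to report both partitioned and absolute scores, as well as a relative ranking.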

Not too bad for a first try... Despite the advantages of the manual scoring model, we also found the following critical issues, which pushed us toward further improvement.

  • Statistical issue: The first major problem we encountered in developing a manual human scoring model was the large share of variance attributable to noise rather than to differences between candidates. A reliability analysis showed that only 36% of the variance in the scores assigned to candidates was accounted for by differences between the candidates; the remaining 64% was due to error.

  • Conceptual issue: The second major problem was the high degree of subjectivity in interpreting aspects of the scoring tool. At many points, each rater had differing, even directly competing, interpretations of what section of the completed work sample and what component of the candidate’s ability were being assessed by aspects of the work sample. This lack of clarity led to lengthy discussions around the intent of the components of the scoring tool.
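The share of score variance attributable to differences between candidates is the kind of quantity an intraclass correlation (ICC) estimates. The article does not state which reliability statistic was computed, so the following is a minimal sketch assuming a one-way random-effects ICC for single ratings; the function name and the sample ratings are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings is a list of rows, one row per candidate, each holding the scores
    given by k raters. The result estimates the proportion of score variance
    due to differences between candidates rather than to error.
    """
    n = len(ratings)      # number of candidates
    k = len(ratings[0])   # number of ratings per candidate
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-candidate and within-candidate sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ratings: two candidates, each scored by two raters.
example = [[1, 3], [3, 5]]
icc = icc_oneway(example)
```

Under this assumed interpretation, a value of about 0.36 would correspond to the 36% figure reported above, with the remainder treated as error.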

AI to the rescue

After comparing the strengths and weaknesses of the manual scoring model, we wanted to develop a different scoring model that was less reliant on human decision-making. Because our goals are to standardize the scoring process and use a single scoring model across multiple types of assessments, we determined that an artificial intelligence (AI)-based scoring model was a better fit in our situation, given the more objective markers and level of sophistication that such a model can provide.


Our discussions around developing and testing the manual human scoring model led us to the conclusion that this kind of model is more useful and effective in two kinds of situations: (1) when there is an abundance of opportunities and resources to conduct several briefing sessions such that raters can be calibrated on what to expect from each aspect of the scoring, and (2) when the scoring model can be specifically tailored to individual assessments and the scoring tool can be more concrete and specific, leaving less room for subjective interpretations. Otherwise, an AI-centered evaluation model can often be the more efficient and effective approach.

Nearly 12 months later, our data science and research team is in constant pursuit of improvement. Each datapoint we collect reinforces our model's predictive power, allowing us to measure it against reliability and viability standards. Backtesting models are constantly deployed to guard against bias and to support fairness and diversity.

To learn more about our research, contact Marian here.

At, we continuously work towards gathering insights and perspectives to foster a global culture around workplace performance and development using big data and machine learning. Visit us to learn more about what we do.
