
The importance of proper data infrastructure to avoid hiring biases

A dive into the various requirements companies must fulfill when setting out to use data.

This is part 2 of a 5-part series on Ethical AI. If you missed part 1, Ethical AI Standards: A Process Approach, click here.

It is through perception that we gain access to the world. By way of our sense organs, we access different dimensions of the world, and as a result we are each left with our own cohesive experience. AI, on the other hand, must see the world through the lens of data.

AI has no sensory organs, so we supply it with information by feeding it inputs, much as the world feeds us all the sights, sounds, tastes, and so on that it has to offer. To that end, we can only expect accurate work from AI on the condition that the inputs correspond to the actual world.

This means that data must be properly parsed, be representative of what it is meant to describe, and be updated continuously to improve and fine-tune the machine learning models in use.

Let’s now dive into the various requirements companies must fulfill when setting out to use data, and we will demonstrate how the data at nugget.ai is solid in its foundation and utilization.

Ontology & Taxonomy’s Role in Proper Data Infrastructure

Ontology is the study of being, and it is a way in which we can remove our own personal biases towards our perceptions. How is this the case? We consider the ontology of any particular object (e.g., the ontology of being the oldest sibling in a family) and we generate only the rules that capture the essence of what it would mean to be, e.g., the oldest sibling (more on this in the Logical Reasoning section). These rules are the kinds of qualities which are important to determine correctly and give us the proper tools to use and make decisions with.

Human behavior tends to add qualities to pieces of information; it is a way of “filling in the blanks” and it seems quite natural to do so. In some instances, new qualities necessarily follow from one piece of information due to its structure (e.g., if you know John has an older brother, then John cannot be the oldest child). However, we often find ourselves adding qualities which are improper and which reveal our biases, shortcomings, or quirks. For example, someone who infers that an unmarried woman in her mid-forties must be a feminist shows that they take feminism to mean something other than what it actually means. Gender, age, and marital status have no bearing on whether someone is a feminist; only their political values bear on the matter.

This alone is a fascinating subject, and it is the exact way in which stereotypes are formed. It is quite easy to generate a bigoted or prejudiced sentiment merely by saying: “person x is a member of G, therefore person x has quality F”, where x is any arbitrary individual, G is any group being stereotyped, and F is a quality which you wish to stereotype group G with. This activity is quite ingrained in our way of thinking, and while it has its everyday uses, in the workplace it can often cause issues, as we tend to place people into groups and claim they have a quality well before we know the person in question. We at nugget.ai actively fight these sentiments, as illustrated by the following use case:

An applicant applies to a firm for an entry-level job. They have no work experience in an office, but they do have a master’s degree in hand. An HR worker knows these facts and deems the applicant unsuitable for the job, reasoning that a lack of work experience means they would not be able to deal with the organizational structure of a company. Simply put, the applicant would just be a disruption.

This line of reasoning is what backs up experience requirements on job postings: screening is necessary to keep hiring moving at an efficient pace. However, nugget can circumvent this line of reasoning while keeping screening efficient, simply by testing applicants on their competency in navigating organizational structure; the umbrella skill is system fluency (click here to learn more about how we measure soft skills like system fluency).

Return now to the applicant. Instead of being rejected based on a loose line of reasoning which may or may not hold true for them, they are now able to prove themselves by taking the challenge. The HR worker, now armed with better information about this applicant, can make the right choice and hire based on the applicant’s competencies, or in other words the ontological merit of the person in question.

Through this example, we have demonstrated the various gifts of ontological analysis:

  • removing guesswork
  • removing ambiguity
  • removing bias

Finally, we are able to talk about taxonomy’s position in all this. A taxonomy gives us a workable database filled with important variables, organized to be relevant to an ontological analysis. In nugget’s case, we have used the O*NET database extensively to determine which soft skills to test for, generate lists for occupations, determine proper skill alignment, and provide overall solutions to ontological problems. O*NET is a database which houses an extensive number of occupations complete with required skills, tasks, descriptions, and so on, giving nugget a wealth of data to work with. While no taxonomy is perfect or complete (and O*NET is no exception), human review of it leads to proper results, with changes made along the way (e.g., in the accountant role, mathematics is a required skill, yet the definition includes not just arithmetic, algebra, and statistics, but also geometry, which is markedly not required for accounting).
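As a rough illustration of how human review can sit on top of a taxonomy, the sketch below stores the accountant example as a small lookup table and layers reviewer exclusions over it. The records are hand-written for this example and are not real O*NET data; `TAXONOMY`, `REVIEWER_EXCLUSIONS`, and `required_topics` are hypothetical names, and this is not a description of how nugget’s own pipeline is built.

```python
# Hypothetical, hand-written excerpt modelled on the accountant example.
# These are NOT real O*NET records.
TAXONOMY = {
    "Accountant": {
        "Mathematics": {"arithmetic", "algebra", "geometry", "statistics"},
    },
}

# Decisions made by human reviewers, kept separate from the taxonomy itself
# so the source data stays untouched and every change is traceable.
REVIEWER_EXCLUSIONS = {
    ("Accountant", "Mathematics"): {"geometry"},
}


def required_topics(occupation, skill):
    """Return the taxonomy's topics for a skill, minus reviewer exclusions."""
    topics = TAXONOMY[occupation][skill]
    excluded = REVIEWER_EXCLUSIONS.get((occupation, skill), set())
    return topics - excluded


print(sorted(required_topics("Accountant", "Mathematics")))
# ['algebra', 'arithmetic', 'statistics']
```

Keeping the exclusions in their own table, rather than editing the taxonomy in place, is one possible design choice: it makes it easy to see which parts of the result come from the database and which come from human judgment.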

Ontological Arguments Via Logical & Modal Reasoning

To provide justification, I will demonstrate how valid logical reasoning shows us we are in fact on the right track. Observe: (For those untrained in formal logic, look here, here, and here for basic rules/syntax, quantification syntax, and modal-specific rules, respectively)

I: Sound Ontological Argument

Verbal remarks:

We utilize the fact that it is necessarily the case that if John is the oldest sibling, no sibling is older than John. However, we know John has an older sibling, and so John is not the oldest sibling. This argument works because its first premise truly captures the essence of being the oldest sibling.

A deceptively similar argument fails because it does not capture the essence of the concept, despite relying on a true premise. Consider being the youngest sibling. Here’s a fact about being the youngest sibling: if John is the youngest sibling, he must have at least one older sibling. John does have an older sibling, so he must be the youngest. This reasoning (affirming the consequent) fails to consider that John may be a middle child, with at least one older sibling and at least one younger sibling. We would have to say instead: if John is the youngest sibling, then he has no younger sibling. Of course, none of our premises can tell us whether John is in fact the youngest sibling, so we can only conclude that he is not the oldest one. We can see that picking out certain facts is not sufficient to determine ontological merit; rather, we must do an ontological analysis and pick out the rules that leave no room for error.
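The difference between the two sibling arguments can also be checked mechanically. The sketch below is a simplification under stated assumptions: it enumerates every family of two to four siblings and every age rank John could hold, treats “necessarily” as “true in every enumerated family”, and looks for cases where the premises hold but the conclusion fails. All function names are hypothetical, and a bounded search like this can only show that no counterexample exists within the bound.

```python
def models(max_siblings=4):
    """Yield every (family_size, john_rank) pair; rank 0 is the oldest sibling."""
    for family_size in range(2, max_siblings + 1):
        for john_rank in range(family_size):
            yield family_size, john_rank


# Basic facts about John in a family of n siblings where John has age rank r.
def is_oldest(n, r):
    return r == 0


def is_youngest(n, r):
    return r == n - 1


def has_older(n, r):
    return r > 0


def has_younger(n, r):
    return r < n - 1


def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q


def counterexamples(premises, conclusion):
    """Return the models where every premise holds but the conclusion fails."""
    return [
        m for m in models()
        if all(p(*m) for p in premises) and not conclusion(*m)
    ]


# Oldest-sibling argument: "if oldest, then no older sibling" plus
# "John has an older sibling", therefore "John is not the oldest".
oldest_argument = counterexamples(
    premises=[
        lambda n, r: implies(is_oldest(n, r), not has_older(n, r)),
        has_older,
    ],
    conclusion=lambda n, r: not is_oldest(n, r),
)

# Youngest-sibling argument: "if youngest, then has an older sibling" plus
# "John has an older sibling", therefore "John is the youngest".
youngest_argument = counterexamples(
    premises=[
        lambda n, r: implies(is_youngest(n, r), has_older(n, r)),
        has_older,
    ],
    conclusion=is_youngest,
)

print(oldest_argument)    # []
print(youngest_argument)  # [(3, 1), (4, 1), (4, 2)]
```

The empty list means no enumerated family breaks the oldest-sibling argument. The three counterexamples to the youngest-sibling argument are exactly the families in which John is a middle child.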

II: Unsound Ontological Argument

Verbal remarks:

The first premise assumes that all unmarried women in their forties are feminists. Such a premise is completely unfounded from an ontological standpoint, and so it fails our test outright. Additionally, since there are unmarried women in their forties who aren’t feminists, the premise cannot be true, much less match the essence in question. Rather, we must match qualities that track the essence; this is what leads to a correct basis for a premise (e.g., for feminism it would be a collection of beliefs about gender equality and so on).

So, we can form a table that demonstrates these ideas succinctly:

Proper Data Practices

To cap off this post, we will list the qualities a good dataset has, all of which help keep AI effective, ethical, and smooth.

  1. Data should be properly labelled. While seemingly trivial, having quality data is of the utmost importance. Units should be uniform across all entries (e.g., if ‘height’ is one of the variables, then all entries must stick to one unit of measure, such as cm). There should also be as few data entry errors as possible (e.g., writing ‘bed’ instead of ‘bad’ would cause confusion as the machine tries to learn about language). Correct labelling is the aim here: if a model takes one variable for another, it will learn incorrectly. If pictures of dogs are labelled as ‘cats’, it would be no surprise if the model begins looking for doglike features when checking whether an object is a cat. While the cats-and-dogs example is easy to fix, labelling on more complex topics, such as the breed of a dog, should be handled by experts in their respective fields (e.g., biologists for animal taxonomy, accountants for financial objects, etc.).
  2. Data should be filled out as much as possible. Leaving entries blank will cause issues in cases where such information is used to calculate important results, so entries should be as complete as possible. This of course is not always feasible; however, there is a plethora of techniques available for handling missing data. Since the data may be varied, different situations call for different measures (e.g., an entry might be thrown out because it has more missing fields than filled fields). This article provides 7 techniques as well as the pros and cons of each, which can help in making these choices that may impact the quality of the data.
  3. Datasets should strive to move away from sampling bias. Easier said than done, but the fact of the matter is that a disproportionate dataset will skew results. Novel situations are essential to the improvement of AI: simply imagine two AI systems, one fed pictures of highways, the second fed fewer pictures of highways but also dirt paths, roads that contain railroad tracks, etc. Which one do you think will know how to detect traversable terrain for vehicles? Clearly the one with the more diverse set of data covering multiple kinds of terrain. So, diversity in data will help overcome the shortcomings of homogeneous data, improving results for those that would otherwise be left out and unaccounted for. To that end, we can derive a simple principle that, if followed, will achieve the best mix of data: to the best of one’s ability, match the ratios of the population in your subset of data. So, if a population of 1000 is split 45/55, then a sample of 100 should take 45 of the 450 and 55 of the 550 in the two groups, respectively.
  4. Data should be continuously updated and added to whenever feasible. To fine-tune the model, new data must be fed into the training dataset. For the AI to improve, it must encounter novel situations that further update the model to be more generalizable. However, the same principles of keeping data labelled correctly apply. Therefore, as new data is given by users and gives a more complete picture of the world, that data must still be processed in the same way the original training dataset was: filled out properly and verified by experts.
  5. Data should be collected objectively. This last point rests on the idea that data should be collected in such a way that two independent data collectors would, ceteris paribus, obtain the same datasets. The data collectors themselves should not affect the data in any way, and so their interpretations are not to be included. Strictly speaking, data should contain facts. Even in the case of opinion, it is still an objective truth that so-and-so holds such-and-such opinion, and the aggregate demonstrates the general sentiment of a given sample. Therefore, the collection process is to be seen as a mechanical process, not an art.
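Two of the rules above are concrete enough to sketch in code: discarding entries that have more missing fields than filled ones, and sampling each group in proportion to its share of the population. The following is a minimal sketch, assuming missing fields are stored as `None` and each group is a plain list; `mostly_empty`, `proportional_allocation`, and `stratified_sample` are hypothetical names, and dropping mostly-empty entries is only one of several possible ways to handle missing data.

```python
import random


def mostly_empty(entry):
    """True when an entry has more missing (None) fields than filled ones."""
    missing = sum(value is None for value in entry.values())
    filled = len(entry) - missing
    return missing > filled


def proportional_allocation(group_sizes, sample_size):
    """Split sample_size across groups in proportion to each group's size."""
    total = sum(group_sizes.values())
    # Integer quotient and remainder of each group's exact share.
    shares = {
        name: divmod(size * sample_size, total)
        for name, size in group_sizes.items()
    }
    allocation = {name: quotient for name, (quotient, _) in shares.items()}
    # Places lost to rounding down go to the groups with the largest remainders.
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda name: shares[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


def stratified_sample(groups, sample_size, seed=0):
    """Draw a random sample whose group ratios match the population's."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sizes = {name: len(members) for name, members in groups.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {
        name: rng.sample(members, allocation[name])
        for name, members in groups.items()
    }


# The 45/55 example from the text: a population of 1000, a sample of 100.
print(proportional_allocation({"group_a": 450, "group_b": 550}, 100))
# {'group_a': 45, 'group_b': 55}

# An entry with two of three fields missing would be discarded under this rule.
print(mostly_empty({"height_cm": None, "weight_kg": None, "age": 30}))
# True
```

The allocation step is kept separate from the random draw so that the ratios can be inspected and reviewed before any data is actually sampled.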

To learn more about Data Privacy, click here!

Nicholas Tessier 🧠

Product Manager