A word on how we, data practitioners, should be more aware of and attentive to our own biases
(story originally posted on Medium)
In the world of data science, we define bias as the phenomenon in which a system overgeneralizes from its data and learns the wrong thing. When this happens, the usual first reaction is to point fingers at the data or the training process, saying “this data is bad” or “I should further tune my hyperparameters.” Sure, this could be part of the problem. However, before spending more time and processing power, I’d like to invite you to stop, take a step back, and think about how the data we are using came to be and, more importantly, reason about how we are interpreting it.
Unlike machines and smart learners, we humans suffer from bias, a bias that could have been introduced for numerous reasons, such as moments we have previously experienced or concepts and definitions that are already part of who we are. Unfortunately, this bias can influence the way we handle and interpret data, creating a problem when we inadvertently transfer those ideas and assumptions into our datasets, and consequently into our machine learning models and their outcomes. Examples of these consequences are often mentioned in the media (generally with headlines that include an undesirable dose of fearmongering), such as the case of the famous ‘sexist’ recruitment model from Amazon that preferred male candidates over female ones.
In this article, I discuss three sources of bias: confirmation bias, the availability heuristic, and sample bias. I write about how I have acknowledged their presence and effects, alongside several techniques I apply to deal with them.
Since 2016 I have been working on the Antispam team of a dating and social platform, where my goal is to build solutions that detect spammers and prevent their proliferation. At the beginning of my time at the company, I knew nothing about our users (as expected); I did not know our demographics, nor their behavioral patterns. In other words, from a simple glimpse I could not tell whether a user was a spammer or not; anybody could be one! Then, with each passing day, you start to experience and learn things. Ah, this geographical region seems to be more spammy; ah, this email domain is terrible news; ah, names like this are never good; and so on. In simpler words, I was creating a mental profile of what a spammer is, based purely on what I had learned, seen, and dealt with. Now I have to ask myself: Is this knowledge fair? Is this profile representative of the whole population? Is my mental portrait of the “ideal” spammer an impartial one? These are some of the questions I ask myself every time I work with data and, most importantly, each time I train a new machine learning model. Why do I ask these questions? Well, for starters, I believe that in this line of work you should always question yourself. Secondly, because that is how I acknowledge human-based bias and the effect it could have if I ignored it.
Of all the many existing biases, there are three main ones (confirmation bias, the availability heuristic, and sample bias) that I believe could cause unwanted effects in my models if I did not consider them. This does not mean that I don’t mind the other biases; it’s just that these are the ones that keep me on my toes. In the following paragraphs, I will define these biases and give some examples of how they could get me.
Confirmation bias, a type of cognitive bias, refers to the tendency to interpret information, evidence, and data in a way that supports and confirms a person’s existing views and hypotheses while disregarding any conflicting evidence. It is one of the most common biases, and it is not hard to imagine why: after all, favoring and confirming ideas we already support sounds, in some way, like the logical thing to do. Earlier, I mentioned some characteristics I could learn about spammers after working with them for such a long time, for example, the likelihood of a user being a spammer if they are located in a specific region. This is plausible. Some regions do have a higher concentration of spammers than others, and because this pattern is somewhat common, I might unconsciously learn and confirm that if user X comes from region Y, they might be a spammer. But is this enough reason to conclude that this user is an actual spammer? Of course not! Nonetheless, under specific and unfavorable circumstances, for example, if I had to flag a user profile during a stressful day, I could accidentally mark the user as suspicious, thus confirming my biased belief just because my hypothesis stated that this user might indeed be a spammer. Fortunately, I seldom do this manually, so the chances of this happening are close to zero.
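To see why region alone is weak evidence, consider a quick base-rate calculation. All numbers here are invented for illustration; they are not real figures from any platform.

```python
# Hypothetical numbers: even in a "spammy" region, the base rate matters.
# Suppose 2% of all users are spammers, spammers come from region Y 30% of
# the time, and legitimate users come from region Y only 5% of the time.
p_spam = 0.02
p_region_given_spam = 0.30
p_region_given_ham = 0.05

# Bayes' rule: P(spammer | user is from region Y)
p_region = p_region_given_spam * p_spam + p_region_given_ham * (1 - p_spam)
p_spam_given_region = p_region_given_spam * p_spam / p_region
print(f"P(spammer | region Y) = {p_spam_given_region:.1%}")  # about 10.9%
```

Even though region Y is six times more common among spammers than among legitimate users in this toy setup, a random user from region Y is still almost 90% likely to be legitimate. Confirming the hypothesis from the region alone would flag far more innocent users than spammers.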
The availability heuristic, another cognitive bias, describes the tendency to give importance to the most recent and immediate experience, information, or example a person thinks of when facing a decision-making situation. The main idea behind this mental shortcut is that if a person remembers a piece of information, it must mean that said information is important. When dealing with data and decision systems, ignoring this assumption can lead to disastrous results. Here’s why.
Usually, during my working hours, colleagues approach me asking whether a profile is a spammy one or not. More often than not, I answer right away because I am well familiar with what a spammer looks like (do I sound biased?). Having said that, I must admit that there have been cases in which I was reluctant to answer with a quick yes or no without giving it a second thought. Why is this? Because I was sure I had seen a case like that before. For example, every day I see many profiles and their usernames, and I have become familiar with many patterns and keywords that indicate whether a username belongs to a spam profile. So, if you randomly asked me what a typical spammy username looks like, I might have an answer.
Another example arises while labeling data. Even though this process is automatic, every now and then I dive into the dataset looking for outliers or strange cases that require a human eye. During these expeditions through a sea of rows and features, I might see a particular example in which my brain, through the availability heuristic, determines that a profile is a spammy or a good one based on a recent experience. In such a case, the easiest solution would be to listen to the little voice in my head and just switch the label (which, honestly, I would do if I were 100% sure); however, since I am aware of this bias, I first consult our other sources to confirm or refute my belief.
Lastly, there is sample bias, a statistical bias. This kind of bias occurs when the data selected for training a system does not represent the population on which the model will be used. The outcome is a biased sample, a dataset that over-represents some groups and under-represents others. Getting rid of this bias is not an easy task, and it will most likely occur in practice because, as Wikipedia says, it is “practically impossible to ensure perfect randomness in sampling”; however, being aware of its existence can help alleviate its influence. There are probably infinite ways in which this bias could show up in my day-to-day work, and in the next paragraphs, I illustrate some of the ones I have identified.
For starters, I am always thinking about timezones. Every time I do something time-related, for example, selecting X data from the last Y hours, my sample will mostly consist of observations from the geographical regions that were at their peak activity during those Y hours. For example, I am in Europe, so if at 9 a.m. I run a query to select X from the last hour, my sample will most likely be made of European users and people with insomnia. So, in some way, I am adding bias to my sampled data.

Another case I have identified is the difference between platforms and app versions. While querying data, we have to keep in mind that users are on different platforms, or release versions of the app, meaning they might be generating distinct kinds of data. For example, suppose that one day a product team decides that in the next version of the app, users will be allowed to upload a million images to their profiles. Then, by some random and very unfortunate chance, on that same day I decide to build a model that detects spammers based on the number of pictures, without being aware of such a change in the app. Since the “million images” feature is new and not everybody will have the update, I won’t have a good representation of this new group of people who have a million images on their profiles, which will produce unwanted results during training and inference.
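The timezone effect from the first example can be sketched with a quick simulation. The regions, peak hours, and activity distribution below are all assumptions made up for illustration, not real user data.

```python
import random
from collections import Counter

random.seed(0)

# Made-up setup: two regions whose users are most active around local
# mid-afternoon, expressed in UTC hours (assumed peaks, not real figures).
peak_utc_hour = {"Europe": 13, "Americas": 19}

events = []
for region, peak in peak_utc_hour.items():
    for _ in range(10_000):
        # Activity clustered around the region's peak hour, wrapped to 0-23.
        hour = int(random.gauss(peak, 3)) % 24
        events.append((region, hour))

# Naive sample: "everything from the last hour", run at 09:00 CET (08:00 UTC).
sample = [region for region, hour in events if hour == 8]
print(Counter(sample))  # heavily skewed toward Europe
```

Both regions contribute the same total activity, yet the hour-bounded query returns an overwhelmingly European sample, which is exactly the kind of silent skew a model trained on that sample would inherit.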
Is there a way to completely avoid human-based bias? I don’t know, but I am sure there are steps we, as practitioners, can take to mitigate its effects on our datasets and, consequently, on our decision-making models.
My first recommendation is to be data-driven. I don’t mean data-driven in the sense of “oh yes, I read my data before making a decision” and running a few queries. What I mean is to squeeze, become one with, delve into, and, heck, love those datasets. Draw their distributions, remove outliers, cluster them, test them, reduce their dimensionality, and so on. Make sure you truly know them.
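As a minimal sketch of that habit, here is an outlier check on a made-up feature; the values and the two-standard-deviation threshold are both illustrative assumptions, not a prescribed method.

```python
import statistics

# Hypothetical feature: messages sent in the first hour after signup.
messages_sent = [3, 1, 4, 2, 0, 5, 2, 3, 1, 250, 4, 2, 300, 1]  # made-up values

mean = statistics.mean(messages_sent)
stdev = statistics.stdev(messages_sent)

# Flag observations more than two standard deviations from the mean --
# a crude check, but enough to force a second look before training.
outliers = [x for x in messages_sent if abs(x - mean) > 2 * stdev]
print(f"mean={mean:.1f}, stdev={stdev:.1f}, outliers={outliers}")
```

Whether those extreme values are spammers, a logging bug, or a new app feature is exactly the question this kind of inspection forces you to answer before the data reaches a model.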
Another tip that goes hand in hand with the previous one is to identify the possible sources of bias. Write down, on a sheet of paper, a wiki page, a sticky note, or the back of your hand, what could introduce bias into your system. Is it time, or differences across app versions, as I mentioned? Question yourself the same way I did. Ask yourself whether your sample is representative of the population, or whether the decision you are about to make is based on a genuine piece of information or on a gut feeling caused by that data point you remember from yesterday.
Lastly, share your process with others. Talk to the person next to you and ask what they think about your code or query, or create a pull request so that others can scrutinize your work. Sometimes, because we are so close and attached to the material, we fail to see mistakes and details that others could catch.
We, humans, are biased. If this human-based bias is not handled correctly, it can affect the way we work with and interpret data, and it will ultimately influence the outcome and performance of our machine learning models, usually in an undesirable way. In this article, I introduced three kinds of bias: confirmation bias, the availability heuristic, and sample bias. I talked about the many ways they could manifest in my daily work and offered some suggestions on how we could lessen their impact.
Ignoring the existence of these biases can cause unwanted and disastrous behavior in our systems, responses that in most cases would leave us with “huh? Why does my model believe this is a monkey?”-kind-of-questions, resulting in sensationalist and fearmongering articles stating that AI is racist, sexist, elitist, or just plainly unjust. I sincerely believe that every person who works with data should be aware of the impact bias could have on their work. With the rapid adoption of machine learning in every facet of our lives, our products could turn out to be biased systems that, unfortunately, end up causing a fatal accident, prescribing an incorrect treatment, or blocking an entire user base.