
How to Critically Evaluate the Next Data Science Project You Come Across | by Murtaza Ali | Mar, 2023



Opinion

Qualitative methods, data manipulation, and media sources — as well as a detailed look into how numbers can lie

Photo by Laurenz Kleinheider on Unsplash

Much of the craze behind data science focuses on the glitz and glamor: big data that tracks your every movement, powerful models that predict the state of the Earth, intelligent systems that can simulate human thought better than we ever thought possible.

While these brilliant achievements are indeed a part of data science, they shouldn’t be taken at face value. Whether you’re actively working on your own data science project(s) or perusing the products other data scientists are quickly developing, it is absolutely essential that you know how to critically evaluate the data therein.

It’s called data science for a reason. In any project, the underlying data is of the utmost importance. In this article, we’ll take a look at four powerful ways to help ensure we’re analyzing it as effectively as possible.

The case for human-centered data science

With artificial intelligence (AI) becoming more prevalent by the day (Exhibit A: ChatGPT), folks are finally starting to pay a bit more attention to the ethical component of data science.

When you interact with most AI systems, they will claim that because they are machines, they have no opinions. However, this simply isn’t true. Every artificial intelligence system is shaped by the data used to train it, and any biases inherent in that data are transferred into the model. This is why many AI systems can be discriminatory — a famous example is bias in facial recognition technology [1]. Furthermore, in many cases, raw quantitative data simply isn’t enough to capture the aspects of a problem that actually matter.

Sourojit Ghosh, a longtime collaborator of mine who specializes in social recommendation algorithms (SRAs), makes the following case in his recent publication [4]:

“A common thread to the functioning and development of SRAs is their dehumanization of users by reducing expressive individuals to data and metadata archived into databases. Such algorithms learn and are taught to infer human expressions only as ‘content’ and ‘variables.’ However, human beings are more than just their data, and there are many contexts and circumstances that go into creating content.

We imagine that integrating social and individual contexts into an SRA’s understanding of user-generated content could improve the end results on a number of levels. At different times during a user’s creation of content and engagement with recommended content, system prompts could ask optional qualitative questions about the created content, such as social and local context, and the user’s motivations behind creating the content, as a few possibilities. Should users choose to answer these questions, the additional qualitative content can be incorporated into the SRA’s choices in making recommendations, as it strives to produce more meaningful content for a user to engage with. Such a feature could let users determine which levels of additional information and context they offer to SRAs to infer and incorporate into suggestions.”

The above illustrates at a high level how qualitative analysis techniques can be combined with traditional quantitative approaches to gain deeper insight into data. There are many examples of qualitative techniques within data science — here are two common ones:

  • Ethnographies: An ethnography involves searching for predictable patterns in the social and cultural experiences of a particular group of people by carefully observing and actively participating in their lives [2].
  • User Interviews: This is an intermediate step in the process of building some product or model, and involves going out and testing it on an actual subset of eventual users [3].

The future is now, and the future is human-centered. Don’t get left behind.

Figure out how to properly manipulate data

This might seem obvious if you’re an experienced data scientist, but it isn’t necessarily clear when you’re just getting started.

The best way to illustrate this point is through the example of a particular data science workflow. Here, we’ll use the process of analyzing data and then presenting it to some audience by way of a data visualization.

Folks new to visualization might be tempted to look at a data set and simply produce charts of the metrics already available to them. For example, say we have the following data set:

Image By Author

We’ve kept the data set small for simplicity, but you can imagine that such a data set could be collected for people in a much larger setting — think something like a university or a city.

Looking at the above, we might build a bar chart of everyone’s heights. Perhaps we can plot a histogram of the distribution of ages. Better still, we could use two variables, and make a scatter plot of height vs. weight.

There is nothing inherently wrong with these ideas; in fact, we might well want to build some of these plots in a visualization dashboard for this data set. That said, these questions are limited in their scope. What if we wanted to delve deeper?

  • What is the average height and weight when people are grouped by sex? How about if they are grouped by age? Is there a relationship there?
  • Can we make the height plots in centimeters as well, in the event that our primary audience is outside the U.S.?
  • Can we make plots that filter out people who aren’t in a certain age range, allowing us to better generalize our plots to whatever our target population may be?

The answers to these questions are in the data set, but they aren’t available to us in the data’s raw form. To get at them, we need to manipulate our data through various transformations, aggregations, and filters before we start generating visualizations. This sort of manipulation is a skill you absolutely must master if you wish to be a data scientist.
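As a minimal sketch of what those manipulations look like in Pandas: the tiny data set below is invented for illustration (the column names and values are my assumptions, not the table from the article), but the three operations — an aggregation, a transformation, and a filter — correspond directly to the three questions above.

```python
import pandas as pd

# Hypothetical data set standing in for the small table above
# (names and values are made up for illustration)
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "sex": ["F", "M", "F", "M"],
    "age": [21, 34, 28, 45],
    "height_in": [64, 70, 66, 68],
    "weight_lb": [130, 180, 140, 175],
})

# Aggregation: average height and weight grouped by sex
by_sex = df.groupby("sex")[["height_in", "weight_lb"]].mean()

# Transformation: add a metric-unit column for an audience outside the U.S.
df["height_cm"] = df["height_in"] * 2.54

# Filter: restrict to a target age range before generating any plots
adults_under_40 = df[df["age"].between(18, 40)]
```

None of these answers existed as columns in the raw data; each had to be derived before a single chart could be drawn.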

There are many tools you can use to manipulate data:

  • If you’re comfortable programming, then your best bet is to use Pandas, Python’s powerful data science module.
  • If you’re more of a qualitative UX person, there are task-specific tools you can use. For example, if you use Tableau for data visualization, you should be aware that it also has many features that facilitate data manipulation.
  • If you just want to fix the data in its initial format before exporting it to any external tool, then Excel or Google Sheets have plenty of functionality that might be to your taste.

The point is, learn at least one tool that lets you manipulate your data.

More is better, but one is an absolute necessity.

If you can, go back to the data source

In his excellent book How Charts Lie [5], information designer Alberto Cairo states the following general rule of media literacy:

“Distrust any publication that doesn’t clearly mention or link to the sources of the stories they publish.”

Viewing this through the lens of data science, we would do well to adopt an important mindset: If any model, visualization, or otherwise data-related deliverable makes claims without providing their data source, you should be immediately skeptical.

Put in simple terms, this seems like common sense. Of course we should distrust claims made without explicitly provided data. Who in their right mind would even do such a thing?

Many people, as it happens — some of whom are rather influential.

Cairo provides the following example in his book, a graphic which was tweeted by the White House in December 2017 [6]:

Public Tweet, Graphic Available on Trump White House Archives

There are plenty of visual critiques the above image is practically begging for, but seeing as this article is not about visualization, we will leave them aside for now. There is one particular point upon which you should focus your attention.

This graphic is, in a phrase, completely made up.

There is no data source linked, which checks out, because no data went into constructing the visualization. The depicted base-3 exponential growth is random at best, and a serious exaggeration at worst.

Nevertheless, millions of people saw this graphic, and many of them probably drew strong conclusions from it, despite the reality that there is nothing factual presented in the above image. It should be self-explanatory why that is not good, and even potentially dangerous.

The next time you examine a data science project, check the source. It matters.

Mathematically correct ≠ Accurate

How can this be right? Surely if something is mathematically correct, then it must by definition be accurate.

It is important to define what we mean by accurate here. For data science — which by its very nature is embedded in social structures — accuracy includes the interpretation of the data. That is to say, context matters.

Thus, what we mean by accuracy is that the claims we make from data must not be misleading to a broader audience. This makes more sense with an example, so let’s look at one from an excellent online textbook, Computational and Inferential Thinking [7].

Consider a large population of people struck by a disease that only affects a small subset of people within it. Specifically, 4 in 1000 people are infected by the disease. We also have a medical test that lets us predict whether or not a person has the disease with high accuracy: a false positive rate of 5/1000 and a false negative rate of 1/100.

This in mind, ponder the following question: If a person is randomly selected from the population and tests positive for the disease, what is the probability that they actually have the disease?

Take a moment to think about this before scrolling.

Now then, what do you think? With our test being so accurate, maybe somewhere around 80%? 85%?

The actual answer may surprise you: Approximately speaking, there is a 44% chance that this person actually has the disease. We leave the mathematical details of this aside, but you are encouraged to click the link above if that interests you.

Here, we focus on the intuition behind why this is the case. Is there an error in the math? As you may have guessed by the title of this subsection, there in fact is not.

The reason for this perplexing result is that the prevalence of this hypothetical disease is so incredibly low in the underlying population that despite our test’s apparent accuracy, there are more false positives than true positives. In other words, there are so many people that don’t have the disease that even the seemingly small rate of false positives corresponds to a pretty high raw number.
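The arithmetic behind that 44% is short enough to check yourself. The sketch below applies Bayes’ rule using the numbers from the textbook example quoted above; the variable names are my own.

```python
# Disease-testing example: numbers from the textbook problem above
prevalence = 4 / 1000            # P(disease)
false_positive_rate = 5 / 1000   # P(positive | no disease)
false_negative_rate = 1 / 100    # P(negative | disease)

# Among the whole population, the share who test positive correctly...
true_positives = prevalence * (1 - false_negative_rate)
# ...versus the share who test positive despite being healthy
false_positives = (1 - prevalence) * false_positive_rate

# Bayes' rule: P(disease | positive test)
p_disease_given_positive = true_positives / (true_positives + false_positives)
print(round(p_disease_given_positive, 2))  # → 0.44
```

Note that the false-positive share (0.996 × 0.005 ≈ 0.005) is larger than the true-positive share (0.004 × 0.99 ≈ 0.004) — exactly the imbalance described above.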

This is what we mean when we say something can be mathematically correct but still not quite accurate. If such a test were given blindly without any discussion of the underlying contextual factors, it could result in myriad misled or even panicked people. The context — in this case the distribution of the underlying population — is incredibly important when dealing with data.

Final Thoughts + Recap

When dealing with data, it is incredibly important that you critically evaluate every aspect of it. Here’s your cheat sheet to doing so:

  1. It isn’t always all about the numbers. Human-centered data science is the way forward.
  2. Raw data is almost never enough. Learn how to manipulate it.
  3. A deliverable is only as good as the data. Check data sources religiously.
  4. Numbers do in fact lie. When the math seems right but the conclusion is questionable, look for contextual explanations.

If you aren’t considering your data carefully enough, it’s simply too easy for people to mislead you.

Don’t let that person be you. Stay informed.

Until next time, my friends.

References

[1] https://www.theregreview.org/2021/03/20/saturday-seminar-facing-bias-in-facial-recognition-technology/
[2] https://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1071&context=tqr
[3] https://www.interaction-design.org/literature/article/how-to-conduct-user-interviews
[4] Not Just Cells in a Dataset: Imagining a Human-Centered Data Science Approach to Social Recommendation Algorithms
[5] How Charts Lie, Alberto Cairo
[6] https://trumpwhitehouse.archives.gov/articles/time-end-chain-migration/?utm_source=twitter&utm_medium=social&utm_campaign=wh_20171218_Chain-migration_v2
[7] https://inferentialthinking.com/chapters/18/2/Making_Decisions.html


