
The Myth Of p-values: Why They’re Not the Holy Grail in Data Science | by Federico Trotta | Apr, 2023



Image by Gerd Altmann on Pixabay

In a recent article, I stated that we don't always need to calculate p-values in Data Science, showing how we found a good model to solve an ML problem without calculating them.

Now it’s time to write an article that clarifies this statement. I know: this will be a controversial article, and I’d really like to hear your opinions in the comments.

If you're an aspiring Data Scientist, here you'll find the definition of p-values and an interesting discussion on their usage. If you're already an expert, I know you know the definition of p-values, and I hope you'll enjoy the discussion on their usage.

My idea is not to convince anyone: these kinds of discussions are taken very seriously in the scientific world, as you'll read in the article. I just want to shed some light on the topic of p-values, discussing whether we, as Data Scientists, must use them or whether we can skip them and test our ML models in other ways (and we'll see how).

Table of Contents:

Defining p-values
Why we could avoid p-values as Data Scientists
What to use instead of p-values as Data Scientists
An example in Python

Quoting from Wikipedia:

In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.

Now, I don't know about you, but my way of learning mathematics has always been the same: translating it into practice. As an engineer, I love practical math because it helps me understand things better. But I have to tell you the truth: I've never really understood this definition at its core.

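Just to try to translate the definition into practice, here's a minimal sketch of my own (not part of the original workflow): a one-sample t-test on simulated data, where the p-value is exactly the probability, computed assuming the null hypothesis is true, of getting a test statistic at least as extreme as the one observed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Null hypothesis: the population mean is 0.
# We draw a sample whose true mean is actually 0.3.
sample = rng.normal(loc=0.3, scale=1.0, size=50)

# One-sample t-test against the hypothesized mean of 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

# "If the true mean really were 0, how often would we see a t-statistic
# at least this extreme just by chance?" That's the p-value.
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
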
The good news for me is that…I’m not the only one in the world! In fact, I encourage you to read this (very short) article here where we can read that:

To be clear, everyone I spoke with at METRICS could tell me the technical definition of a p-value […] but almost no one could translate that into something easy to understand.

It’s not their fault, said Steven Goodman, co-director of METRICS. Even after spending his “entire career” thinking about p-values, he said he could tell me the definition, “but I cannot tell you what it means, and almost nobody can.”

Well, let me tell you that when I read these words I was astonished and happy at the same time. Astonished because it means that the scientific world seems, somehow, to be working with something it can't fully explain. Happy because… I wasn't the one who was wrong!

As you may know, a p-value of 0.05 is generally considered acceptable. And here's another problem: why this number? George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an American Statistical Association (ASA) discussion forum (here's the complete article):

Q: Why do so many colleges and grad schools teach p = 0.05?

A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?

A: Because that’s what they were taught in college or grad school.

Which is like: which came first, the chicken or the egg?

The problem with that approach is that the p-value won’t tell us anything about the results. This led ASA to state that:

Principle 5: A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

The threshold of statistical significance that is commonly used is a p-value of 0.05. This is conventional and arbitrary. It does not convey any meaningful evidence of the size of the effect. A p-value of 0.01 does not mean the effect size is larger than with a p-value of 0.03.

This is the fifth of six principles developed by ASA on p-values. Let's read them all (here's the complete article by ASA); right after the list, a small simulation will make principle 5 concrete:

1. p-values can indicate how incompatible the data are with a specified statistical model.

2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
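
To make principle 5 (and, partly, principle 6) concrete, here's a minimal sketch of my own, with made-up numbers: a tiny effect measured on a huge sample can produce a much smaller p-value than a large effect measured on a small sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Tiny effect (true mean 0.05) measured on a huge sample
tiny_effect_big_n = rng.normal(loc=0.05, scale=1.0, size=100_000)

# Large effect (true mean 0.8) measured on a small sample
big_effect_small_n = rng.normal(loc=0.8, scale=1.0, size=15)

p_tiny = stats.ttest_1samp(tiny_effect_big_n, popmean=0).pvalue
p_big = stats.ttest_1samp(big_effect_small_n, popmean=0).pvalue

print(f"tiny effect,  n = 100000 -> p = {p_tiny:.2e}")
print(f"large effect, n = 15     -> p = {p_big:.2e}")

# The smaller p-value typically belongs to the *smaller* effect here:
# the p-value reflects sample size at least as much as effect size.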

So, given what we've read so far from experts in the field, we can say that p-values, and especially their usage, are a somewhat controversial topic. So, I'll continue this article by explaining why, in my opinion, we can avoid using p-values as Data Scientists, except in some situations. Then, we'll implement an example in Python.

I'm going to tell you something that may sound bad, but please let me explain. Data Science is a very broad field, and many professionals work as Data Scientists in different industries, coming from different backgrounds. So, in my opinion and in my experience, there are three kinds of Data Scientists:

  • The “Statisticians”.
  • The “Software Engineers”.
  • “All the others”.

The "Statisticians" are people who graduated in statistics, math, physics, or related fields, who have deep knowledge of advanced statistics, and whose work applies a lot of it. For example, people who work in biology studying viruses and the like. These Data Scientists are generally researchers.

The "Software Engineers" are generally people who studied Computer Engineering (or are proficient in the field) and, since they like data, turned to Data Science. They are particularly specialized in programming and learned statistics to become Data Scientists. Of course, they too can work in research, but I've found they generally work in some specific field delivering value to the business (more than to research).

"All the others" are… just all the other people who became Data Scientists coming from different backgrounds. These people have deep domain knowledge and study programming and statistics to apply Data Science to their field. For example, I'm a side-hustle Data Scientist: I graduated in Mechanical Engineering and learned statistics and programming to apply DS to the domain I know (mostly, the industrial field). So, I fall into the "all the others" category. Of course, these people can also work in research, but I've found the majority of them working to deliver value to the business.

This is to say that we all know statistics is important in Data Science but, in my experience, we generally need the basics (and just a bit more, in some cases) when we deliver value to businesses, because we are not researchers.

And, please, don't get me wrong: I'm not giving research less credit than business. I myself contribute somehow to research through some collaborations, so I can see that there is a difference between contributing to research and to business. That's simply my point here.

So, here's the controversial idea: p-values are more meaningful when we perform advanced statistical analysis using the scientific method, especially when contributing to research. I mean, the scientific method, as the name says, is a method that consists of a few simple steps:

  • Identify a problem to solve.
  • Make hypotheses to solve the problem.
  • Solve the actual problem.

Easy, isn't it? Well, yes and no: it depends mainly on the field and on the problem. For example, we all know that in science there are problems that are unsolved or only partially solved. The Navier-Stokes equations, for example, can (currently) be solved only under some precise and particular hypotheses, and the general case is still unsolved. Well, to be precise: no one has yet shown that a solution to the Navier-Stokes equations always exists and, if it exists, whether it is unique (if my memory doesn't betray me: please let me know if I'm wrong).

But here's the point: in science, we always make hypotheses. Remember our algebra and analysis classes: when our professors taught us theorems, they always started from a problem, stated the hypotheses, and proved the result under those hypotheses.

Now, the problem with p-values, as we’ve shown above, is that (quoting):

The threshold of statistical significance that is commonly used is a p-value of 0.05. This is conventional and arbitrary. It does not convey any meaningful evidence of the size of the effect

So, given this statement, my experience tells me that this kind of methodology is more suitable for applications of Data Science that have to do with research and advanced math: biology, medicine, particle physics, and the like.

In these fields, in fact, the scientific method is very important to arrive at a conclusion, because we need to follow standardized methodologies so that our work can be compared with others'. This is often because conclusions are published in scientific papers, and a strict methodology is needed to present the results.

But let's be honest: Data Science today is way more than just research. This is why I'm telling you that, in all the cases where we apply math and statistics to "everyday" problems, we generally need only basic statistics. And, when we arrive at a result, what matters is the result itself, because it answers some business need, which does not require scientific validation.

So, here’s what we can do instead of using p-values, in these cases.

Data Science, today, is also used to give businesses insights from their data, often relying on Machine Learning to make predictions.

To do so, we can use the "spot-check" methodology, for example, which is a very simple methodology that works like so (a minimal sketch in code follows the list):

  • We need to solve a business problem by making predictions with Machine Learning.
  • We get the data, clean them, and split them into train and test sets.
  • We test 3–5 ML models on the training set and, from them, we choose the 2–3 that perform the best.
  • We tune the hyperparameters of the 2–3 ML models that performed well on the training set and test them on the test set.
  • We decide which one of the 2–3 models is the best performing one.
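
Here's a minimal sketch of what spot-checking can look like with scikit-learn (the dataset, the candidate models, and the metric are my own choices, just for illustration):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# A toy regression problem standing in for "the business problem"
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Spot-check a handful of candidate models on the training set
candidates = {
    'linear': LinearRegression(),
    'ridge': Ridge(),
    'tree': DecisionTreeRegressor(random_state=42),
    'forest': RandomForestRegressor(random_state=42),
    'knn': KNeighborsRegressor(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print(f'{name:>7}: mean R^2 = {scores.mean():.2f} (+/- {scores.std():.2f})')

# We would then keep the 2-3 best performers, tune their hyperparameters,
# and only at the end evaluate them on the held-out test set.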

So, given this methodology, we can solve business problems by making predictions with a pool of ML models. If, instead, we relied on p-values, we would calculate the p-value for one ML model, test it, find unsatisfactory results, and then switch to another ML model.

So, given that p-values can't tell us anything about the result, why use them in a non-strictly-scientific environment? Sorry, but to me it doesn't make much sense…

Also, the fact that we generally try multiple ML models on our data is related to the so-called "no free lunch" theorem (NFL). It states (quoting from Wikipedia):

any two optimization algorithms are equivalent when their performance is averaged across all possible problems

This means that an algorithm can perform very well on a particular ML problem, but that gives us no reason to believe that it will perform as well on a different problem where the same assumptions may not work.

In other words, some algorithms may generally perform better than others on certain types of problems, but every algorithm has pros and cons due to the a priori assumptions it has.

So, here's a fundamental concept to keep in mind: we can't simply apply an ML model to a new problem just because we got good performance on a previous, similar problem, because doing so biases the solution.

This is why we'd better test different ML models on the same dataset before choosing the best one. We can't just use one model because it worked on a similar problem: we'd be biasing our solution.
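
To see the no free lunch idea in miniature, here's a small sketch of my own (toy datasets and models chosen just for illustration): the same two models swap rankings depending on whether the underlying relationship is linear or strongly non-linear.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(80, 1))

# Problem A: a noisy linear relationship
y_a = 2 * X[:, 0] + rng.normal(0, 1.0, 80)
# Problem B: a strongly non-linear relationship
y_b = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 80)

models = {'LinearRegression': LinearRegression(),
          'DecisionTree': DecisionTreeRegressor(random_state=42)}

for problem, y in [('A (linear)', y_a), ('B (non-linear)', y_b)]:
    for name, model in models.items():
        score = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
        print(f'problem {problem:>14} | {name:>16}: mean R^2 = {score:.2f}')

# Typically the linear model wins on problem A and loses badly on problem B:
# no single model is the best choice for every problem.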

So, let's work through an example in Python. For the sake of learning, we'll create a simple dataset generated by a third-degree polynomial (what a big spoiler!). We then split it into the train and test sets:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Set random seed for reproducibility
np.random.seed(42)

# Generate some random data
n_samples = 1000
X = np.random.uniform(-10, 10, size=(n_samples, 5))
y = 3*X[:,0]**3 + 2*X[:,1]**2 + 5*X[:,2] + np.random.normal(0, 20, n_samples)

# Create a pandas DataFrame
df = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4', 'x5'])
df['y'] = y

# Quick sanity check on the full dataset: regress y on a rough per-column
# cubic transform (x**3 + x**2 + x for each column; not a full polynomial
# expansion) and compute its R^2. This value is not used later on.
X = df[['x1', 'x2', 'x3', 'x4', 'x5']]
y = df['y']
poly_reg = LinearRegression()
poly_reg.fit(X ** 3 + X ** 2 + X, y)
y_pred = poly_reg.predict(X ** 3 + X ** 2 + X)
r2 = r2_score(y, y_pred)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now, we'll fit the linear regression model to the X_train set and use the statsmodels.api package to produce a statistical summary, p-values included:

import statsmodels.api as sm

# Add a constant term to X_train
X_train_with_const = sm.add_constant(X_train)

# Fit linear regression model with statsmodels
lin_reg_sm = sm.OLS(y_train, X_train_with_const).fit()

# Print summary of the model
print(lin_reg_sm.summary())

The statistical summary for the linear regression model (note: your results may be different due to the stochastic nature of ML). Image by Author.

So, using the Ordinary Least Squares (OLS) method, we get a lot of statistics about our model, including R² and the adjusted R². If you don't know what we mean by that, you can read my article here:

The summary above also gives us the p-values and, for each coefficient, the bounds of the 95% confidence interval (the "[0.025" and "0.975]" columns, i.e. the 2.5% and 97.5% percentiles). In particular, we are interested in the "t" and "P>|t|" columns.

Let me explain:

  • First of all, we can see that these values are calculated for each feature ( x1, x2, x3, x4, x5), including the constant value (const).
  • The “t” column measures how many standard deviations the estimated coefficient is from its hypothesized value under the null hypothesis. The larger the absolute value of the t-statistic, the more significant the corresponding coefficient is in predicting the target variable.
  • The "P>|t|" column reports the p-value for each coefficient. A low p-value indicates that the corresponding coefficient is statistically significant, while a high p-value suggests that it is not. By convention, as we said, a coefficient is considered statistically significant when its p-value is below 0.05 (a small sketch of how to access these values programmatically follows this list).
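
If we prefer to access these quantities programmatically rather than reading the printed table, statsmodels exposes them as attributes of the fitted results object. A minimal sketch, reusing lin_reg_sm from above:

# Coefficients, t-statistics, p-values and 95% confidence intervals
print(lin_reg_sm.params)      # estimated coefficients (const, x1, ..., x5)
print(lin_reg_sm.tvalues)     # t-statistics
print(lin_reg_sm.pvalues)     # p-values for each coefficient
print(lin_reg_sm.conf_int())  # lower/upper bounds of the 95% CI

# For example, keeping only the coefficients with p-value < 0.05:
significant = lin_reg_sm.pvalues[lin_reg_sm.pvalues < 0.05]
print(significant)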

So, in the case of the linear regression model, a lot of features seem significant. Good! Now let's try, for example, a 2-degree and a 3-degree polynomial and see what happens.

To do so, we need to transform our features into quadratic and cubic ones and fit the linear regression model on them. Then, we print the summaries like so:

#creating quadratic and cubic features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)

X_quad = quadratic.fit_transform(X_train) #quadratic transformation
X_cubic = cubic.fit_transform(X_train) #cubic transformation

# Add a constant term to X_train
X_train_with_const_quad = sm.add_constant(X_quad)
X_train_with_const_cub = sm.add_constant(X_cubic)

# Fit polynomial regression models with statsmodels
quad_reg_sm = sm.OLS(y_train, X_train_with_const_quad).fit()
cub_reg_sm = sm.OLS(y_train, X_train_with_const_cub).fit()

# Print summaries of the models
print(quad_reg_sm.summary())
print(cub_reg_sm.summary())

The statistical summary for the quadratic regression model (note: your results may be different due to the stochastic nature of ML). Image by Author.
The statistical summary for the cubic regression model (note: your results may be different due to the stochastic nature of ML). Image by Author.

First of all, the polynomial models have more features than the linear model because the formula of a polynomial model is:

y = w0 + w1*x + w2*x^2 + … + wd*x^d

The formula of the polynomial model (shown here for a single feature x and degree d).

So the formula adds polynomial (and, with several features, interaction) terms to the formula of the linear model.
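
Just to see how quickly the number of features grows, here's a tiny sketch (the counts include the bias column that PolynomialFeatures adds by default):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.zeros((1, 5))  # 5 original features, as in our dataset

for degree in (1, 2, 3):
    n_out = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]
    print(f'degree {degree}: {n_out} features')

# degree 1: 6, degree 2: 21, degree 3: 56 (bias + powers + interaction terms)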

So, commenting on our results, we get an improvement in R² and in the adjusted R², but many features don't seem statistically significant according to the p-values. In particular, R² = 1 for the third-degree polynomial (of course! We created the dataset with a third-degree polynomial, so… everything fits!), and yet many of its features don't appear statistically significant.

Now, let me ask you a question: based on those statistics, which model would you choose? The linear model seems pretty good, doesn't it?

So, here’s the problem: as we said, we have no guarantees on the results by using p-values.

Also, remembering that we built this dataset with a third-degree polynomial, how is it possible that nearly all the features come out as not statistically significant? Well, it's hard to say. We just know that one or more assumptions of the model are not respected (the assumptions of the polynomial model are the same as those of the linear model; you can find them in this article here). But in these cases, multicollinearity is generally the most likely culprit.
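
If we wanted to check the multicollinearity hypothesis explicitly, a common diagnostic is the variance inflation factor (VIF). A minimal sketch of my own, reusing the cubic design matrix built above (values well above 10 are usually read as strong multicollinearity):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of each column of the cubic design matrix
vifs = [variance_inflation_factor(X_train_with_const_cub, i)
        for i in range(X_train_with_const_cub.shape[1])]

for i, v in enumerate(vifs):
    print(f'column {i}: VIF = {v:.1f}')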

Now, let's use the spot-check methodology to solve this ML problem. Remember that the linear regression model has no hyperparameters, so we can fit it (and the polynomial models too) on the training set and evaluate it directly on the test set:

NOTE:

for the sake of clarity, we apply the 2-degree and 3-degree polynomial
transformations again, even though we wouldn't need to if we were working
in a Jupyter Notebook (we've already done the transformation above).

# Fit linear regression model to the train set
lin_reg = LinearRegression().fit(X_train, y_train)

# Create quadratic and cubic features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)

X_quad = quadratic.fit_transform(X_train) #quadratic transformation
X_cubic = cubic.fit_transform(X_train) #cubic transformation

# Fit linear regression on the quadratic and cubic transformed sets
quad_reg = LinearRegression().fit(X_quad, y_train)
cubic_reg = LinearRegression().fit(X_cubic, y_train)

# Calculate R^2 on test set
print(f'Coeff. of determination linear reg test set:{lin_reg.score(X_test, y_test): .2f}')
print(f'Coeff. of determination quadratic reg test set:{quad_reg.score(quadratic.transform(X_test), y_test): .2f}')
print(f'Coeff. of determination cubic reg test set:{cubic_reg.score(cubic.transform(X_test), y_test): .2f}')

>>>

Coeff. of determination linear reg test set: 0.81
Coeff. of determination quadratic reg test set: 0.81
Coeff. of determination cubic reg test set: 1.00

So, we get high values of R² for all the models. In particular, the third-degree polynomial has R² = 1 on both the train and the test sets. And this should tell us something…

Now, let's validate our models on the test set with KDE plots as well. If you don't know what a KDE plot is and how to use it, you can read my article here:

We can use them like so (computing the predictions on the test set first, since we haven't done that yet):

#Kernel Density Estimation plot
y_test_pred_lin = lin_reg.predict(X_test) #predictions of the linear model on the test set
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred_lin, color='b', label='Predicted Values', ax=ax) #predicted values

#showing title
plt.title('Actual vs Predicted values for linear model')
#showing legend
plt.legend()
#showing plot
plt.show()

KDE plot for the linear model. Image by Author.
#Kernel Density Estimation plot
y_test_pred_quad = quad_reg.predict(quadratic.transform(X_test)) #predictions of the quadratic model
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred_quad, color='b', label='Predicted Values', ax=ax) #predicted values

#showing title
plt.title('Actual vs Predicted values for quadratic model')
#showing legend
plt.legend()
#showing plot
plt.show()

KDE plot for the quadratic model. Image by Author.
#Kernel Density Estimation plot
y_test_pred_cub = cubic_reg.predict(cubic.transform(X_test)) #predictions of the cubic model
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred_cub, color='b', label='Predicted Values', ax=ax) #predicted values

#showing title
plt.title('Actual vs Predicted values for the cubic model')
#showing legend
plt.legend()
#showing plot
plt.show()

KDE plot for the cubic model. Image by Author.

So, looking at the KDEs, we can say that the cubic model is clearly overfitting the data (we could already tell, since we got R² = 1 on both the train and the test sets). But, again: we found something obvious here, and it's good to discover obvious things, because it confirms we're doing things right!

Also, even though we got very good R² values on both sets for the linear and quadratic models, the KDEs clearly tell us that these are not good models for this ML problem. So, we should investigate other models.

Finally, a piece of final advice: if we really want to calculate p-values to make sure our model respects the hypotheses, we can do it at the end, calculating them only for the model we believe is the best one.

But remember: judging by the p-values alone, the linear model could have seemed a good one for this ML problem, while the KDE showed us the contrary!

Thanks for reading this article: I’d really like to hear your opinions on that in the comments.

Let me summarize the ideas:

  • The usage of p-values is a somewhat controversial topic. They are useful for testing our hypotheses, but they don't give us any guarantees about the results, as the ASA points out.
  • My personal idea is that p-values are more suitable for scientific research, where it's important to show we used a rigorous method to arrive at our results (remembering that, as the ASA says, conclusions mustn't be based only on p-values).
  • Considering that, because of the no free lunch theorem, we need to test different ML models on a given dataset, when providing business value with ML there's really no need to use p-values to give a scientific basis to our results. Anyway, if we want to use them, we can test the hypotheses at the very end, to show that the model we've chosen respects them.

