
Hacking Statistical Significance: Hypothesis Testing with ML Approaches | by Marco Cerliani | Jan, 2023



Photo by Christian Stahl on Unsplash

The importance of data analytics is well-known in every field. From business to academics, carrying out proper analysis is the key to reaching cutting-edge results. In this sense, it is crucial to correctly manipulate and extract meaningful insights from the data at our disposal. Data Analysts/Scientists are responsible for filling the gap between theoretical hypotheses and practical evidence.

Providing an analytical answer to all the questions that may arise is an expensive and hard journey. Translating a question or need into analytical language is the first step. The quality of this translation is crucial, since it influences the correctness of the final results. During the preliminary phases, it is important to understand the analytic goals and identify the best data sources, frameworks, and people to engage in order to reach the best outcomes.

Most of the time, answering a question analytically means carrying out a statistical test. Many statistical tests work as follows:

  • State a null hypothesis, the default description of the world.
  • State an alternative, complementary hypothesis.
  • Calculate a test statistic (a function of the data) and outline the final results.

Since the distribution of the test statistic is known, the probability of observing any value of the underlying statistic can be easily calculated (the p-value). If the p-value is smaller than a prefixed significance level (generally 0.01 or 0.05), the null hypothesis is rejected in favor of the alternative hypothesis.
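For concreteness, this is how the recipe looks in code for a classical two-sample t-test. The sketch below uses synthetic, normally distributed data and scipy.stats; it is only an illustration of the standard workflow:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)  # synthetic sample A
group_b = rng.normal(loc=10.8, scale=2.0, size=100)  # synthetic sample B, shifted mean

# H0: the two groups share the same mean; H1: the means differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
if p_value < alpha:
    print(f"p={p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p={p_value:.4f} >= {alpha}: fail to reject the null hypothesis")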

There’s nothing wrong with statistical testing, but there are some hidden pitfalls we should pay attention to:

  • Strict assumptions on the data. Most of the time the underlying data must follow a normal or otherwise known distribution. As we know, real-world phenomena are often far from normally distributed.
  • Limitation to standard quantities/statistics. If we want to test something custom or more complicated, we may be in trouble.

In this post, we introduce some simple yet flexible methods to test hypotheses and extract meaningful insights from the data at our disposal. We reach conclusions not with standard statistical tests but with simulations and permutations.

In order to explain the methodologies, we use a dataset containing records of house sales in King County (USA). The dataset is accessible from Kaggle and is available under the CC0 Public Domain License. It contains prices for houses sold between May 2014 and May 2015 in King County (including Seattle).

The dataset contains around 20,000 entries of houses sold with different numerical attributes: selling prices, number of bedrooms, number of bathrooms, square footage of the living space, number of floors, latitude/longitude, the building year, and much more.
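All the snippets that follow operate on a pandas DataFrame named df holding these records. A minimal loading sketch, assuming the Kaggle CSV is named kc_house_data.csv:

import numpy as np
import pandas as pd

# Load the King County house sales records
# (the file name is an assumption based on the Kaggle dataset page).
df = pd.read_csv('kc_house_data.csv')

print(df.shape)  # roughly 21,000 rows
print(df[['price', 'bedrooms', 'bathrooms', 'yr_built']].head())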

In a standard predictive application, it would be interesting to forecast the selling price of the houses given their features. Here we are not interested in this kind of application. We want to answer some questions by looking at the data in a way that differs from classical statistical testing but is equally effective (and perhaps more flexible).

Let's imagine we are interested in knowing whether there's an association between the building year of a house and its selling price.

Selling prices have a distribution that is far from normal. As we might expect, a clear linear relationship between price and building year does not exist.

Price distribution (left). Price vs building years (right) [image by the author]
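A quick way to check the non-normality claim in code, assuming df from above, is a normality test plus a skewness estimate:

from scipy import stats

# D'Agostino-Pearson normality test on the selling prices:
# a tiny p-value means the prices are very unlikely to come from a normal distribution.
stat, p_norm = stats.normaltest(df['price'])
print(f"normality test p-value: {p_norm:.2e}")

# A strong positive skew is another hint that the distribution is far from normal.
print(f"skewness: {df['price'].skew():.2f}")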

The median selling price is $450,000. The houses built in 2015 (the newest ones in our dataset) have a higher median price. This seems reasonable, but it would be interesting to understand whether this effect is "due to chance".

Lowest and highest median selling prices by year [image by the author]

With "due to chance" we are referring to the fact that we observe only a sample of the entire population. The data at our disposal covers only part of all the house transactions that happened between 2014 and 2015 in King County. There may be more houses built in 2015 that were sold in this period but aren't recorded in our dataset.

The best we can do in this situation is to take note of the limitations and try to estimate the real median of the sold houses built in 2015. We can do this through simulations.
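One standard simulation technique for this kind of interval estimate is the bootstrap. It is not the test developed below, just a natural companion to it; a minimal sketch, again assuming df from above:

prices_2015 = df[df['yr_built'] == 2015]['price']

# Bootstrap: resample the 2015 prices with replacement and collect the medians.
n_boot = 10_000
boot_medians = np.asarray([
    prices_2015.sample(frac=1, replace=True).median()
    for _ in range(n_boot)
])

# 95% confidence interval for the "real" median of houses built in 2015.
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"median: {prices_2015.median():,.0f}, 95% CI: [{ci_low:,.0f}, {ci_high:,.0f}]")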

As the first step, we compute and store the observed difference between the median selling price of houses built in 2015 and the median selling price over all the data at our disposal. This value (the observed difference) is the price gap we observe and want to verify.

year = 2015

# count and median selling price of the houses built in 2015
y = df[df['yr_built'] == year]['price'].agg(['count', 'median'])

# absolute gap between the 2015 median and the overall median price
observed_diff = abs(y['median'] - df['price'].median())

At this point, we want to check whether our observed difference could just as well be produced by a random subgroup of sales. We randomly sample groups of the same size as the 2015 group and compute the difference between their median price and the dataset's median price.

from tqdm import tqdm

n_simulation = 1_000

# draw a random group of prices of the same size as the 2015 group
sampling = lambda x, y: x['price'].sample(n=int(y['count']))

# absolute gap between each random group's median and the overall median
sim_diffs = np.asarray([
    abs(sampling(df, y).median() - df['price'].median())
    for i in tqdm(range(n_simulation))
])

Lastly, we count how many times the simulated price differences are at least as large as our observed difference. The proportion of such simulations is our estimated p-value.

p_value = np.mean(sim_diffs >= observed_diff)

The lower the p-value, the more confident we are in rejecting the null hypothesis in favor of the alternative. In our case, we can reject the hypothesis that there is no price difference between the houses built in 2015 and the others.

Simulation results for 2015 and 2012 [image by the author]

We can carry out this test for any building year of interest. Below are the results of the test for all years.

Simulation results for all the building years with their median selling prices [image by the author]
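A sketch of how such a sweep might be implemented, wrapping the steps above into a reusable function (median_diff_test is a hypothetical helper, not part of the original code):

def median_diff_test(df, year, n_simulation=1_000):
    # simulation-based test for a single building year, returning the estimated p-value
    y = df[df['yr_built'] == year]['price'].agg(['count', 'median'])
    observed_diff = abs(y['median'] - df['price'].median())
    sim_diffs = np.asarray([
        abs(df['price'].sample(n=int(y['count'])).median() - df['price'].median())
        for _ in range(n_simulation)
    ])
    return np.mean(sim_diffs >= observed_diff)

# p-value of the test for every building year in the dataset
p_values = {
    year: median_diff_test(df, year)
    for year in sorted(df['yr_built'].unique())
}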

This is an incredible result! With a few lines of code, we can test and verify empirical questions of this kind. Our study verifies the presence of selling price differences between building years. Does this mean that houses built in 2015 are different from those built in the 80s? Not exactly, since we have only verified a difference in prices. Many factors may discriminate houses built in different years. Fortunately, our dataset has many other features that we can use to investigate further possible differences.

As before, we want to check whether houses built in 2015 differ from the others, but now we don't look only at selling prices: we consider all the available features. To carry out this kind of multidimensional test efficiently, we fit a binary classification model that discriminates between the 2015 houses and the others, and we register the cross-validated ROC-AUC as our metric of goodness (the observed score).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

year = 2015

# mean 5-fold cross-validated ROC-AUC of a random forest that tries to
# separate the 2015 houses from all the others
cv_scoring = lambda x, y: np.mean(cross_val_score(
    RandomForestClassifier(n_estimators=10),
    x, y, cv=5, scoring='roc_auc', n_jobs=-1,
    error_score='raise'
))

observed_score = cv_scoring(
    df.drop(['yr_built', 'date', 'id'], axis=1),
    (df['yr_built'] == year).astype(int)
)

Then we check whether our observed score could just as well be achieved on a random subgroup of house sales. We randomly permute the labels, so each simulated group has the same size as the 2015 group, fit a binary classifier to discriminate it, and register the obtained ROC-AUC.

n_simulation = 1_000

# shuffle the labels to break any real association (a permutation test)
sim_scores = np.asarray([
    cv_scoring(
        df.drop(['yr_built', 'date', 'id'], axis=1),
        (df['yr_built'] == year).sample(frac=1).astype(int)
    )
    for i in tqdm(range(n_simulation))
])

Finally, we check, as before, how often the simulated scores are at least as high as the observed one and compute the corresponding p-value.

p_value = np.mean(sim_scores >= observed_score)

In our case, we can reject the hypothesis that there is no overall difference between the houses built in 2015 and the others.

Simulation results for 2015 and 2012 [image by the author]

In this post, we presented a simulation-based method to answer questions that arise when observing the data at our disposal. The flexibility of the proposed approach makes it suitable for many contexts, without particular preliminary assumptions. We also proposed a multivariate generalization to test differences between data subgroups, as a further demonstration that the methodology can be extended to verify a wide range of hypotheses.

