
What to do when your experiment returns a non-statistically significant result | by Jordan Gomes



A lot can be learned from a non-statistically significant result, and a lot of care is needed when you report it

Photo by Ameer Basheer on Unsplash
  • When you get a non-statistically significant result, there are two main possible explanations: (1) the impact (if any) is too small to be captured, or (2) your study was underpowered
  • Deep diving into the result and getting a sense of why it is not statistically significant can unlock learnings that will help improve your experiment design
  • When reporting, it is important to present all the results alongside their confidence levels (and to make sure your audience understands what that confidence level means), so that sound business decisions are made based on your results

Imagine:

  • You are working for a gaming company that released a free-to-play game (with the possibility of removing ads for paid users).
  • You are supporting the acquisition team whose goal is to increase the number of paid users. They have a quarterly OKR against this metric.
  • They worked on a new email campaign that invites free users to make the switch to paid, and they want to understand the impact of this campaign

You are generally debating with them whether it makes sense to run an experiment for this kind of campaign, but since the team is concerned about the potential negative impact of sending those emails (e.g. users opting out of email communications), you agree to set one up:

  • You define an overall evaluation criterion (OEC) with the team, i.e. a composite metric that takes different factors into account and will serve as the criterion for deciding whether the experiment is successful
  • You define which post-experiment analyses you will want to run (e.g. to make sure your study is powered enough for you to slice the data)
  • You discuss with the team which timeframe would be best for the experiment, and how they would define ‘success’ for it
  • Based on the OEC you defined and how much you believe you can move it, you calculate what would be a good experiment sample size (see the sketch right after this list)
  • You run the experiment, with group A being targeted by the email campaign and group B not receiving it.
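
As a rough illustration of that sample-size step, here is a minimal sketch in Python using statsmodels. The baseline paid-conversion rate, the minimum lift worth detecting, the significance level and the target power are hypothetical numbers chosen for the example, not values from the scenario above.

```python
# Minimal sample-size sketch for a two-proportion test (hypothetical numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.020   # assumed paid-conversion rate without the email
target_rate = 0.024     # smallest lift considered worth detecting (assumption)
alpha = 0.05            # significance level agreed with the team
power = 0.80            # probability of detecting the lift if it is real

# Cohen's h for the difference between the two proportions
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Required number of users per group (treatment and control)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"~{n_per_group:,.0f} users needed in each group")
```

With numbers like these, the answer comes out around ten thousand users per group, which is often the first reality check on whether the mailing list is big enough for the experiment.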

‘The timeframe you chose’ days later, the results are in, but they are unfortunately not statistically significant.

There are two potential reasons why your results are not statistically significant:

  • Your campaign had no impact or its impact was quite small
  • Your study was underpowered

It is important to have a good understanding of what really caused your result to not be statistically significant. This will allow you to make sure the decision taken following the experiment is the right one.

Your campaign had an extremely small impact

Imagine two identical freshwater lakes, and you pour one cup of salt into one of them. Did you have an impact on the salt level of the lake in your treatment group? Technically you did add salt to the lake, but the impact is going to be very hard, if not impossible, to measure.

In the situation described above, you sent an email to your free users. Most likely not everyone in the treatment group received and opened it: your intent was to treat the whole treatment group, but you only treated a subgroup. Maybe your email had an impact, but because of the number of users in your treatment group who were never actually treated, that impact can be tricky to detect.

In that specific case, and depending on your research question and what you are trying to prove, you might want to use a method such as the complier average causal effect (CACE), which allows you to estimate the impact of your email on the group that actually received and opened it.
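
To make the idea concrete, here is a minimal sketch of the simplest version of this kind of estimator: a Wald-style ratio under one-sided noncompliance, where control users cannot receive the email and “compliance” is defined as opening it. The data and numbers are simulated purely for illustration.

```python
# Minimal CACE sketch under one-sided noncompliance (simulated data).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

treated = rng.integers(0, 2, size=n)               # 1 = assigned to receive the email
opened = treated * rng.binomial(1, 0.35, size=n)   # only treated users can open it
converted = rng.binomial(1, 0.02 + 0.01 * opened)  # hypothetical lift among openers

# Intent-to-treat effect: difference in conversion by assignment
itt = converted[treated == 1].mean() - converted[treated == 0].mean()

# Compliance rate: share of the assigned group that actually opened the email
compliance_rate = opened[treated == 1].mean()

# CACE (Wald estimator): the ITT effect scaled up to the compliers
cace = itt / compliance_rate
print(f"ITT: {itt:.4f}, compliance: {compliance_rate:.2%}, CACE: {cace:.4f}")
```

This simple ratio assumes that being assigned to the treatment group only affects conversion through the email itself, and that nobody in the control group could have opened it.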

But sometimes, even with those methods, the effect size is still too small to come out as statistically significant. In that case, the question is: do you want to move forward with the change if it has little to no impact?

Your study was underpowered

In simpler terms, this means you don’t have enough observations to say confidently that the change you observed (if any) is due to your treatment or simply due to random chance.

And this is very dependent on the level of confidence you want to have and the metric you are using to track “change”. That is why it is always recommended to do a power analysis before running your experiment, to understand what sample size is needed to get a statistically significant result at the significance level you want, with the success metric you chose for your experiment.

Having to report a “not statistically significant impact due to too small a sample size” is not an ideal scenario, because it leaves a lot of room for interpretation and discussion, which is not what you were aiming for when you designed your experiment. The silver lining is that, if you do end up in that kind of situation, you now have a better understanding of the likely magnitude of the impact of your treatment (and its direction), and you can redo a power analysis with this updated data. That should give you a more appropriate sample size, and with it more confidence in the effect size of your campaign.
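
Continuing the earlier sketch, that follow-up power analysis might look like the snippet below: you plug the effect you actually observed back into the same calculation to see how underpowered the first run was, and what a rerun would need. The conversion rates and sample size are again made-up numbers.

```python
# Re-doing the power analysis with the (hypothetical) observed effect.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

control_rate = 0.020      # observed conversion in the control group (made up)
treatment_rate = 0.022    # observed conversion in the treatment group (made up)
n_per_group_run1 = 5_000  # users per group in the first experiment (made up)

observed_effect = proportion_effectsize(treatment_rate, control_rate)
analysis = NormalIndPower()

# How much power did the first run actually have for this effect size?
achieved_power = analysis.solve_power(
    effect_size=observed_effect, nobs1=n_per_group_run1, alpha=0.05
)

# How many users per group would a rerun need to reach 80% power?
n_needed = analysis.solve_power(
    effect_size=observed_effect, alpha=0.05, power=0.80
)
print(f"Power of first run: {achieved_power:.0%}; ~{n_needed:,.0f} users per group needed for a rerun")
```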

(Shameless self-promotion: in a follow-up article, I’ll deep dive into the different parameters involved in calculating power, and how you can play with them to make sure you will be able to answer the questions you wanted to answer.)

There is definitely a bias toward reporting statistically significant results. But over time, it seems we have lost the full meaning of the term.

You Keep Using That Word, I Do Not Think It Means What You Think It Means

“Statistically significant” means that the result we are seeing would be unlikely to occur by chance alone if there were truly no effect. But the threshold for “unlikely” is (or at least should be) set by you and your stakeholders, based on how comfortable you are with the risk of being wrong and calling out a result that is really just noise.

Usually, ~95% confidence is selected as the threshold (the famous p<0.05). But this is simply a convention; it doesn’t have to be what you use for your project. In the case of the email campaign presented above, if the campaign showed an impact of +20% on adoption, but you are only sure about this result with ~90% confidence, most likely you will still proceed with launching the campaign (while technically not having a statistically significant result at the 95% level).

Generally speaking, to avoid this binary approach, the literature (1) advises reporting the observed differences (the effect size) along with the p-value (perhaps highlighting the results that are ‘stat sig’). This way you get a full overview of the effect size, together with the confidence you have in that effect.
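
As an illustration of that kind of reporting, here is a minimal sketch that puts the observed lift, a normal-approximation confidence interval and the p-value side by side for the (hypothetical) email experiment, rather than reducing everything to a single significant / not significant flag. The conversion counts are invented for the example.

```python
# Reporting effect size, confidence interval and p-value together (hypothetical counts).
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conv_treat, n_treat = 120, 5_000   # paid conversions / users in treatment (made up)
conv_ctrl, n_ctrl = 95, 5_000      # paid conversions / users in control (made up)

p_treat, p_ctrl = conv_treat / n_treat, conv_ctrl / n_ctrl
lift = p_treat - p_ctrl

# Normal-approximation 95% confidence interval for the difference in proportions
se = np.sqrt(p_treat * (1 - p_treat) / n_treat + p_ctrl * (1 - p_ctrl) / n_ctrl)
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

# Two-sided z-test for the difference in proportions
_, p_value = proportions_ztest([conv_treat, conv_ctrl], [n_treat, n_ctrl])

print(f"Lift: {lift:+.2%} (95% CI: {ci_low:+.2%} to {ci_high:+.2%}), p-value: {p_value:.3f}")
```

A small table like this for each metric (lift, interval, p-value) usually supports a better discussion with stakeholders than a single pass/fail label.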

In the business world, we like the binary approach of stat sig vs non-stat sig, because it feels like a ‘science-backed’ green light for us to move on with the decision.

But this can have a damaging effect, from incentivizing people not to report non-stat-sig results, to killing projects that didn’t show a stat-sig positive impact even though they were truly having one.

Ultimately a good understanding of what the experiment is actually showing and a dose of common sense can help you make the most out of those results.

