Gain Insights from a Comparison in Three Lines of Code
Imagine you are trying to determine whether there is a significant difference in the median total fare between the cities where taxis pick up passengers. You decide to create a box plot to observe the total fare per pickup city.
This plot gives you a rough idea of how the total fare differs between cities, but it doesn’t tell you whether those differences are statistically significant.
Wouldn’t it be nice if you could add statistical annotations to a box plot like below? That is when statannotations comes in handy.
statannotations is a Python package to optionally compute statistical tests and add statistical annotations on plots generated with seaborn.
To install statannotations, type:
pip install statannotations
To learn how to use statannotations, let’s start by loading the dataset of taxis in New York from seaborn.
Let’s look at the median total fare for each city:
We can see that the median total fare for taxis that pick up customers from Queens is the highest, followed by Bronx, Brooklyn, and Manhattan.
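The median calculation could be sketched as below. This is a minimal sketch, assuming the seaborn `taxis` dataset, in which the pickup city is stored in the `pickup_borough` column and the total fare in the `total` column. The helper name `median_total_by_city` is illustrative and not taken from the original article.

```python
import pandas as pd


def median_total_by_city(df: pd.DataFrame) -> pd.Series:
    """Return the median total fare per pickup city, highest first."""
    return (
        df.groupby("pickup_borough")["total"]  # one group per pickup city
        .median()
        .sort_values(ascending=False)
    )
```

With the seaborn dataset, this would be called as `median_total_by_city(sns.load_dataset("taxis"))`, which downloads the data on first use.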
To get a better idea of the distribution of the total fare per city, we can create the box plot for the total fare per city:
To add statistical annotations to the plot, we will use statannotations.
Start with getting the total fares for all rides per city:
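One way to do this is sketched below, again assuming the `pickup_borough` and `total` columns of the seaborn `taxis` dataset. The helper `totals_by_city` is a hypothetical name used for illustration; the original article may have selected the fares differently.

```python
import pandas as pd


def totals_by_city(df: pd.DataFrame) -> dict:
    """Map each pickup city to the Series of total fares for its rides."""
    return {
        city: group["total"]
        for city, group in df.groupby("pickup_borough")
    }
```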
Next, get all possible pairs of cities for the comparisons:
[('Manhattan', 'Brooklyn'),
('Manhattan', 'Bronx'),
('Manhattan', 'Queens'),
('Brooklyn', 'Bronx'),
('Brooklyn', 'Queens'),
('Bronx', 'Queens')]
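The list above can be generated with `itertools.combinations` from the standard library. This is a small sketch; the variable names `cities` and `pairs` are chosen for illustration, and the order of `cities` is picked to reproduce the order of the pairs shown.

```python
from itertools import combinations

# The four pickup cities compared in this article.
cities = ["Manhattan", "Brooklyn", "Bronx", "Queens"]

# Every unordered pair of distinct cities: 4 choose 2 = 6 pairs.
pairs = list(combinations(cities, 2))
print(pairs)
```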
Now we are ready to add statistical annotations to the plot! Specifically, we will use the Mann-Whitney U test to compare two independent groups.
The null hypothesis is that the total fares of the two cities follow the same distribution. The alternative hypothesis is that they do not.
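The annotation step might look like the sketch below. It assumes the `Annotator` class from statannotations, the `pickup_borough` and `total` columns of the seaborn `taxis` dataset, and a list of city pairs to compare. The wrapper `annotate_boxplot` and the `ANNOTATOR_CONFIG` dictionary are illustrative names, not part of the library, and the settings shown are one plausible configuration rather than necessarily the one used for the output below.

```python
# Assumed settings: which test to run, how to label it, where to draw it.
ANNOTATOR_CONFIG = {"test": "Mann-Whitney", "text_format": "star", "loc": "inside"}


def annotate_boxplot(df, pairs, x="pickup_borough", y="total", **overrides):
    """Draw a box plot of y per x and annotate each pair with a test result."""
    # Imported lazily so this module loads even without the plotting stack.
    import seaborn as sns
    from statannotations.Annotator import Annotator

    ax = sns.boxplot(data=df, x=x, y=y)

    # The three lines that do the work: create, configure, apply.
    annotator = Annotator(ax, pairs, data=df, x=x, y=y)
    annotator.configure(**{**ANNOTATOR_CONFIG, **overrides})
    annotator.apply_and_annotate()  # runs the tests and draws the brackets
    return ax
```

Keyword arguments passed to the wrapper override the defaults, so the test or the label format can be changed without editing the function.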
Manhattan vs. Brooklyn: Mann-Whitney-Wilcoxon test two-sided, P_val:7.225e-01 U_stat=9.979e+05
Brooklyn vs. Bronx: Mann-Whitney-Wilcoxon test two-sided, P_val:1.992e-02 U_stat=1.608e+04
Bronx vs. Queens: Mann-Whitney-Wilcoxon test two-sided, P_val:1.676e-02 U_stat=2.768e+04
Manhattan vs. Bronx: Mann-Whitney-Wilcoxon test two-sided, P_val:5.785e-04 U_stat=2.082e+05
Brooklyn vs. Queens: Mann-Whitney-Wilcoxon test two-sided, P_val:3.666e-12 U_stat=9.335e+04
Manhattan vs. Queens: Mann-Whitney-Wilcoxon test two-sided, P_val:2.929e-30 U_stat=1.258e+06
The meaning of the number of stars in the plot:
ns: p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
ns stands for not statistically significant. In general, the smaller a p-value is, the stronger evidence there is in favor of the alternative hypothesis.
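To make the legend concrete, the thresholds can be written as a small function. This is an illustrative sketch of the mapping listed above, not the library's own implementation; the name `p_to_stars` is hypothetical.

```python
def p_to_stars(p: float) -> str:
    """Convert a p-value to the star notation listed in the legend."""
    # Upper bounds are checked from the strictest to the most lenient.
    thresholds = [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]
    for upper_bound, label in thresholds:
        if p <= upper_bound:
            return label
    return "ns"  # above 0.05: not statistically significant
```

For example, the Manhattan vs. Brooklyn p-value of 7.225e-01 maps to ns, while the Brooklyn vs. Bronx p-value of 1.992e-02 maps to a single star.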
In the plot above, we can see that there is a significant difference in the median total fare between every pair of cities except Manhattan and Brooklyn.
If you don’t like the star notation and want to add p-values to your plot instead, specify text_format="simple":
And you will see the p-values for the comparison between a particular pair of cities!
Congratulations! You have just learned how to add statistical annotations to your seaborn plot. I hope this article will give you the skill to investigate the relationship between two groups of data on a deeper level.
Feel free to play with and fork the source code of this article here: