Statistics Bootcamp 6: Building Our Confidence | by Adrienne Kline | Nov, 2022

By Jessie Hobb On Nov 3, 2022

Learn the math and methods behind the libraries you use daily as a data scientist

To more formally address the need for a statistics lecture series on Medium, I have started to create a series of statistics boot camps, as seen in the title above. These will build on one another and as such will be numbered accordingly. The motivation for doing so is to democratize the knowledge of statistics in a ground-up fashion to address the need for more formal statistics training in the data science community. These will begin simple and expand upwards and outwards, with exercises and worked examples along the way. My personal philosophy when it comes to engineering, coding, and statistics is that if you understand the math and the methods, the abstraction now seen using a multitude of libraries falls away and allows you to be a producer, not only a consumer of information. Many facets of these will be a review for some learners/readers, however having a comprehensive understanding and a resource to refer to is important. Happy reading/learning!

This article is dedicated to confidence intervals in our estimates and hypothesis testing.

Estimating values is at the core of inferential statistics. Estimation is the method of approximating a particular value relative to a a known or unknown ground truth value. In stats this tends to be the value of our parameter that we estimate from a given sample of data. Estimates do not necessarily reproduce the value of our parameter being estimated, but ideally our sample comes as close as possible. Therefore, while we rarely have access to population data for a full comparison, the errors of our estimates based on our sample can be assessed.

According to a Allan Bluman, a good estimator should have three properties. It should be unbiased, meaning the expected value is equal to the the value calculated. It should be consistent, as our sample size and therefore information increase, the value of the estimate should approach the true parameter value. Lastly, it should be efficient, meaning it has the smallest variance possible [1].

We can think about a couple different kinds of estimates. One is a point estimate. A point estimate is a specific numerical value, usually existing along a continuum of possible values. For example, when we are performing a numeric value imputation for missing data we are estimating a specific value. Contrast this with an interval estimate, which provides a range of values that may or may not contain the true parameter (think accuracy where we are trying to get as close as possible to a true value) [1]. It is here we should gain some intuition regarding the notion of a confidence interval. If we have an interval estimate that may contain our true parameter value, we want to be __% certain that the point estimate is contained within the interval estimate. Therefore confidence intervals are an example of interval estimates.

It is unlikely that any particular sample mean will be exactly equal to the population mean, μ. Why, you make ask? Due to sampling error. To allow for sampling error, a parameter is usually estimated to be within a range of values, called a confidence interval. These intervals give a set of plausible values for the true parameter.

More formally, a confidence interval is the probability that the interval will contain the true value (population parameter or point estimate) if we repeatedly sampled the population and performed the calculations. Said another way, how close is our grouping of the point estimates we obtain from our sample? We derive a confidence interval by using data obtained from a sample and specifying the confidence level we wish to have in our estimate. Three common confidence intervals used are the 90th, 95th and the 99th percentile.

The formula for the confidence interval of the parameter (population) mean for a specific level alpha (α), when σ is known can be calculated by:

α : significance level
α = 1 — confidence intervals
90% confidence interval: z_α/2 = 1.65
95% confidence interval: z_α/2 = 1.96
99% confidence interval: z_α/2 = 2.58

Confidence Intervals for One Population Mean When σ is Known

Given the confidence interval 1 – α, we need to find the associated z-score (critical value) from a z-score distribution table. Another way to write the equation prior as as shown here:

Xbar : sample mean
z_α/2 : z-score/critical value
σ : population standard deviation
n : sample size

Example. If you wanted to be sure 95% of the time the population mean (parameter mean) would land within a range of values you specify, determine the confidence interval based on the sample mean, Xbar. A 95th confidence interval translates to α = 0.05. So let’s calculate this based on the equation specified previously.

One Sample z-Interval

Our purpose is to find a confidence interval for a population mean, μ. Our assumptions include that our sample was random, that our sample is normally distributed or large enough that we can assume it is (n ≥ 30), and that the standard deviation σ is known. We need σ in order to perform the calculate as per the formula above.

1. Identify a confidence level (1 – α)

2. Using a z-table, find z_{α/2}

3. The confidence interval for population/parameter mean, μ ranges from

where z_{α/2} is derived from step 1, n is the sample size and xbar is the mean computed from the sample data.

4. Interpret your findings. This means stating “Our population mean, μ will fall within the range (__,__), (confidence level) amount of the time.

Note that the confidence interval is considered to be exact for normal populations and is approximately correct for large samples from non-normal populations [1].

Example. Here we have the prices of a can of cat food from 32 different stores in downtown Chicago. Find a 95% confidence interval for the mean price of cat food, μ, of all the pet stores in downtown Chicago. Assume that the population standard deviation of the ages is 0.8 dollars.

1.96       2.43       2.32       2.45       2.00       3.21       2.97       1.90       3.04       1.63       3.31       2.39       2.00       2.78       3.45       3.54       4.70       3.82       4.15       3.16       3.54       2.49       2.96       3.35       2.47       2.94       1.96       3.40       1.74       1.51       2.23       1.66

Because the population standard deviation is known, sample size is 32, which is large (≥30), we can use the one sample z-interval when σ is known to find the required confidence interval.

The confidence interval for the true population mean goes from 2.32 to 2.88 dollars. Our interpretation of this is as follows: We can be 95% confident that the mean price of a can of cat food in downtown Chicago is somewhere between 2.32 and 2.88 dollars.

Accuracy

“Accuracy is true to intention, precision is true to itself” — Simon Winchester [2]

Accuracy is a loaded term, but regardless it implies there is a ground truth by which we as evaluators can assess how correct or close we are to that ground truth. To that end, as the width of our confidence interval increases, the accuracy of our estimate decreases. Intuitively this should make sense because there are more possible point estimates for the parameter now contained within that range. For example, a 95% confidence interval (2.32, 2.88) versus a 99% confidence interval (2.24, 2.96) as per the example above. As we can plainly see, the second is wider. This is mediated by the fact we have to be MORE sure our estimate falls in this range. In the previous example Xbar is 2.6, so with a 95% confidence level, the error in our estimation is 2.6–2.32 = 0.28 dollars, and at the 99th the error in our estimation is 2.6–2.24 = 0.36 dollars.

Half the range of our confidence interval (latter portion of our equation) can be parsed out as show below. This is referred to as our margin of error, denoted by E. It represents the largest possible difference in our parameter estimate and our parameter value as based on our confidence level.

Looking at this equation, if we have a pre-specified confidence level (z_{α/2}) we are working towards, how can we decrease the error E (i.e. increase the accuracy of our estimate)?

INCREASE THE SAMPLE SIZE!

Since ’n’ is the only variable in the denominator, it will be the only thing we can easily do to decrease ‘E’, since we can’t force our population to have a different standard deviation.

Sample Size

Suppose we’d like to have a smaller margin of error. And say we need to be sure 90% of the time the population mean would be covered within 0.2 dollars of the sample mean. What sample size would we need to be able to state this (assume σ = 0.8)?

We can solve for n in the margin of error formula:

We should always round up when determining sample size ’n’ from a margin of error. This is so E is never larger than we want.

How can we use what we’ve learned about confidence intervals to answer questions such as…

Will a new medication lower a person’s a1c (lab used to detect diabetes)?
Will a newly designed seatbelt reduce the number of driver casualties in car accidents?

These types of questions can be addressed through statistical hypothesis testing — a decision-making process for evaluating claims about a population.

Hypothesis Testing (Intuitive)

Let’s say I complain that I only earn $1,000 a year, and you believe me. Then, I invite you to tag along, and we climb into my BMW, drive to my private hangar to get in my private jet, and fly to my 3,000 square ft. apartment in downtown Paris. What is the first thing you think about my complaint? That I am a liar…

Hypothesis Testing (Logic)

You assume a certain reality (I earn $1,000 annually)
You observe something related to the assumption (you saw all my belongings)
You think, “How likely is it that I would observe what I have observed, based on the assumption?”
If you do not believe it to be likely (i.e. little chance it would happen, as in this case), you reject the assumption. If you believe it to be likely (i.e. enough chance of this happening) you don’t reject it and keep going.

Hypothesis Testing (Formal)

A statistical hypothesis is a postulation about a population parameter, typically μ. There are two components to formalizing a hypothesis, a null and alternative hypothesis.

Null hypothesis (H₀): A statistical hypothesis that states that a parameter is equal to a specific value (or sets of parameters in different populations are the same).
Alternative hypothesis (Ha or H₁): A statistical hypothesis that states that a parameter is either < (left-tailed), ≠ (two-tailed) or > (right-tailed) than a specified value (or states that there is a difference among parameters).

So our steps to hypothesis test go as follows:

State our null and alternative hypotheses — H₀ and H₁
Determine from our hypotheses if this consistutes a right, left or two-tailed test
Ascertain our significance level alpha.
Based on alpha, find the associated critical values from the appropriate distribution table (these differ based on the kinds of ‘tails’ you have and should show the directionality at the top of the table)
Compute the test statistic based on our data
Compare our test statistic to our critical statical value
Make the decision to reject or not reject the null hypothesis. If it falls in rejection region, reject otherwise accept the null
Interpret your findings — ‘there is sufficient evidence to support the rjeection of the null’ OR ‘there is not sufficient evidence to reject the null hypothesis’

A test statistic is a calculated value based on our collected from a sample and is compared against the a-priori threshold (critical value) to determine significance. Critical values act as the boundary separating the region of rejection and non-rejection (significance and non-significance). These are determined based on the relevant statistic table. We have discussed the z-table thus far but will cover other statistical tables in subsequent bootcamps. See the figure below for a visual representation of how a two-tailed (two rejection regions).

Reject Regions

Part of our hypothesis testing is deciding which way we expect the relationship to be, and thus which ‘tail’ we will be investigating. There are 3 options available — two-tailed, left-tailed and right-tailed. In a two-tailed test, the null hypothesis is rejected when the test statistic is either smaller OR larger (rejections areas on left AND right sides) than our critical statistic value (determined a-priori). This is synonymous with investigating ‘is this different than the critical value regardless of direction?’. In a left-tailed test, the null hypothesis is rejected only when the test statistic is smaller (rejection area is on the left) than the critical statistic. Lastly, in a right-tailed test, the null hypothesis is rejected when the test statistic is larger (rejection area is on the right side of the bell curve) than the critical statistic. See the figure below:

Conclusion from Hypothesis Tests

If a hypothesis test is conducted to reject the null hypothesis (H₀), we can make the conclusion that: “based on our sample, the test result is statistically significant and there is sufficient evidence to support the alternative hypothesis (H₁) which can now be treated as true”. In reality, we rarely know the population parameter. Thus, if we do not reject the null hypothesis, we must be cautious and non overstate our findings by concluding that the data did not provide sufficient evidence to support the alternative hypothesis. For example, ‘There is not sufficient evidence/information to determine that the null hypothesis is false.’

One-Sample z-Test

In a one-sample z-test, we are comparing a single sample to information from the population in which it arises. It is one of the most basic statistical tests we can test our hypothesis by following these 8 steps:

1. The null hypothesis is H0: μ = μ₀, and the alternative hypothesis is one of 3 possible options depending on whether direction matters, and if so which way:

2. Determine the directionality of the test — which tail(s) you are investigating

3. Decide on a significance level, α.

4. Based on alpha, find the associated critical values from the appropriate distribution table (these differ based on the kinds of ‘tails’ you have and should show the directionality at the top of the table)

5. Compute the test statistic based on our data

6. Compare our test statistic to our critical statical value — is Zcalc >,<, ≠ ?

7. Make the decision to reject or not reject the null hypothesis. If it falls in rejection region, reject otherwise accept the null

8. Interpret the findings— ‘there is sufficient evidence to support the rejection of the null’ OR ‘there is not sufficient evidence to reject the null hypothesis’

Example. A doctor is interested in finding out whether a new asthma medication will have any undesirable side effects. Specifically, the doctor is concerned with the spO2 of their patients. Will the spO2 remain unchanged after the medication is administered to the patient? The doctor knows the average spO2 for an otherwise healthy population is 95%, the hypotheses for this case are:

This is called a two-tailed test (if H₀ is rejected, μ is not equal to 95, thus it can be either less than or greater than 95). We’ll look at how to follow this up in a following bootcamp. What if the question was whether the spO2 decreases after the medication is administered?

This is notation for a left-tailed test (if H0 is rejected, μ is considered to be less than 95). What if the question was whether the spO2 increases after the medication is administered?

This is the notation for a right-tailed test (if H0 is rejected, μ is considered to be greater than 95).

Example. A climate researcher wishes to see if the mean number of days of >80 degrees Fahrenheit in the state of California is greater than 81. A sample of 30 towns in California are selected randomly has a mean of 83 days. At α = 0.05, test the claim that the mean number of days of >80 degrees F is greater than 81 days. The standard deviation of the mean is 7 days.

Let’s check our assumptions: 1) we have obtained a random sample, 2) sample size is n=30, which is ≥ 30, 3) the standard deviation of the parameter is known.

State the hypotheses. H0: μ = 81, H1: μ ≥ 81
Set a significance level, α = 0.05
Directionality of the test: right-tailed test (since we are testing ‘greater than’)
Using α = 0.05 and knowing the test is right-tailed, the critical value is Zcrit = 1.65. Reject H0 if Zcalc > 1.65.
Compute the test statistic value.

6. Compare the Zcalc and Zcrit. Since Zcalc = 1.56 < Zcrit = 1.65,

7. Zcalc does not fall into the reject region, therefore, we fail to reject the null hypothesis.

8. Interpret our findings. There is not enough evidence to support the claim that the mean number of days is greater than 81 days.

Example. The apartment association of Chicago reports that the average cost of rent for a 1 bedroom apartment downtown is $2,200. To see if the average cost at an individual apartment building is different, a renter selects a random sample of 35 apartments within the building and finds that average cost of a 1-BR apartment is $1,800. The standard deviation (σ) is $300. At α = 0.01, can it be concluded that the average cost of a 1-BR at an individual apartment is different from $2,200?

Let’s check our assumptions: 1) we have obtained a random sample, 2) sample is 35, which is ≥ 30, 3) the standard deviation of the population is provided.

State the hypotheses. H0: μ = 2,300, H1: μ ≠ 2,200
α = 0.01
Directionality of the test: two-tailed

4. Find the critical value, based on the z-table. Since α = 0.01 and the test is a two-tailed test, the critical value is Zcrit(0.005) = ±2.58. Reject H0 if Zcalc > 2.58 or Zcalc < -2.58.

5. Compute the z-test statistic

6. Compare the calculated statistic against the one determined from step 2

7. Since Zcalc = -7.88 < -2.58, it falls into the rejection region, we reject H0.

8. Interpret our findings. There is sufficient evidence to support the claim that the average cost of a 1-BR apartment at that individual apartment building is different from $2,200. Specifically, it is CHEAPER than the parameter mean.

We’ve covered how to gain confidence in our statistical estimates generated from our data through the use of confidence intervals. After reading this article you should have a firm grasp of the meaning of small versus large confidence intervals and the implications from an inference standpoint. In the engineering world we discuss tolerances on machined components and mathematical ‘tolerances’ are no exception. Here we usually quantify our mathematical estimates using the term repeatability. We can define repeatability as the ability to generate the same result consistently. If a metric or design does not have repeatability, it would produce scattered results (i.e. wider confidence intervals).

The next bootcamp will entail detailing the trade-off between type 1 and 2 errors, so stay tuned!

Previous boot camps in the series:

#1 Laying the Foundations
#2 Center, Variation and Position
#3 Probably… Probability
#4 Bayes, Fish, Goats and Cars
#5 What is Normal

All images unless otherwise stated are created by the author.

References

[1] Allan Bluman, Statistics, E. Elementary Statistics.
[2] Winchester, Simon. The perfectionists: how precision engineers created the modern world. HarperLuxe, 2018.