By Ebrahim Feghhi
Starting in 2011, a group of 270 scientists attempted to replicate 100 psychological studies. They found that only 39 of these 100 studies could be successfully replicated (Open Science Collaboration, 2015). Issues concerning reproducibility have also been found in neuroscience (Button et al., 2013). The reproducibility crises in psychology and neuroscience highlight the need for greater education on statistical hypothesis testing in these fields, because statistical hypothesis testing allows researchers to infer how well the evidence their results provide for a given hypothesis generalizes to the broader population. Improper application of statistical techniques can create the false impression that the results obtained in a study are generalizable. Therefore, a proper understanding of how statistical tests work is paramount for enhancing reproducibility.
In this article, we’ll dive into the one-sample t-test and z-test. We start with these tests because they are a great launching pad for understanding more complicated statistical tests like ANOVA. Both the one-sample t-test and z-test allow researchers to determine whether a sample statistic differs from a population parameter. As an example, the sample statistic may be the average height of a classroom of students in California, and the population parameter could be the average height of all people in California. The t-test is almost always used in practice, because the z-test can only be performed if the population’s standard deviation is known (see note 1). However, we’ll explain the z-test as well because it is highly similar to the t-test.
To make this explanation concrete, we’ll use an example from a research study by Hoskin et al. (2019). In this study, fMRI data were collected while participants viewed one of four image types: a face on the right side of the screen, a face on the left side of the screen, or a scene from nature, which could also be on either the right or left side of the screen. The authors trained a computer model to predict which image type the participant was viewing from the participant’s brain activity as measured by fMRI. The model performance was the following:
|Measure|Value|
|---|---|
|Mean accuracy across participants|67%|
|Standard deviation across participants|18%|
|Number of participants|36|
How to run the t- and z-tests
Step 1) Define the alternative and null hypotheses
The first step in statistical hypothesis testing is to define the alternative and null hypotheses. In a one-sample test, the alternative hypothesis states that there is some meaningful difference between what we see in our test sample and what we would expect in the general population. The null hypothesis states that the difference between our test sample and what we would expect in the general population is not meaningful, and that any observed difference is due to sample variability. Stated another way, the null hypothesis states that if we could include the entire population in the sample, the sample statistic would equal the population parameter. Since this is typically not realistic due to the number of people in the population of interest, we can leverage statistical tests to make inferences based on data from a smaller sample.
In the example from the paper, our goal is to determine whether the mean accuracy of the model, which is 67%, is significantly above chance. Since there are four image types, chance performance in this setting is 25%. Therefore, the hypotheses would be:
Alternative hypothesis: The model accuracy is significantly greater than 25%.
Null hypothesis: The model accuracy is equivalent to chance performance, and any observed difference between the model performance and chance performance is due to random sources of variability.
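For readers who prefer symbols, the same pair of hypotheses can be sketched as follows. Here μ stands for the true mean accuracy of the model across the whole population of participants; the subscripts 0 and 1 are just the conventional labels for the null and alternative hypotheses, not notation taken from the study.

```latex
H_0: \mu = 0.25 \qquad \text{versus} \qquad H_1: \mu > 0.25
```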
Step 2) Select a statistical test
When selecting a statistical test, there are a few important criteria to keep in mind. First, in this example there is only one sample mean (the mean accuracy value across participants in the table above), and so we need to apply a one-sample test. By contrast, suppose the authors of the study had trained a second computer model and were interested in which model performed better. In that case, we would use a two-sample test because there are two sample means (one from each model). Second, because our goal is to compare only two values (model accuracy: 67%, versus chance accuracy: 25%), we need to use either a t-test or z-test (see note 1). If we wanted to compare more than two values, we would need to turn to other statistical tests such as the ANOVA family of tests. Finally, it’s important to consider whether the alternative hypothesis makes a prediction in one direction or two directions. In Step 1, our alternative hypothesis was unidirectional: we predicted that the model performance is greater than 25%. Therefore, we will apply a one-tailed test. If our alternative hypothesis stated that model accuracy is simply different from 25% (i.e., greater than OR less than), then we would use a two-tailed test because our alternative hypothesis would allow for a difference in either direction. More details on the differences between a one-tailed and two-tailed test are covered in the optional math section.
Step 3) Define the alpha level
The one-sample t-test outputs two values: a t-statistic and a p-value. The t-statistic indicates the magnitude of the difference between the expected population mean and the sample mean, relative to the variability we would expect from sampling. The same is true for a z-test, which outputs a z-statistic. The p-value is the probability of observing a t-statistic (or z-statistic) at least as extreme as the one obtained, assuming that the null hypothesis is true. For instance, let’s say we run our t-test and it returns a p-value of 0.01. This means there is a 1% probability of observing a t-statistic at least as extreme as the one we obtained, given that the null hypothesis is true. Relating it to the example, a p-value of 0.01 means there is a 1% probability that we would observe a mean model accuracy greater than or equal to 67% given that the computer model predicts at chance levels.
What should we do with this number? That’s where the alpha level comes in: the alpha level is a user-defined threshold for rejecting the null hypothesis based on the p-value. If the p-value returned by the test is lower than the alpha level (see note 2), then we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis (but see Nuzzo, 2014 for a more nuanced perspective on this matter). The lower the alpha level, the harder it is to reject the null hypothesis, but the less likely we are to reject a null hypothesis that is actually true. For our example, we’ll set the alpha level to 0.01. What this means is that if the probability of obtaining the model accuracy from Table 1 (or a model accuracy greater than that) under the null hypothesis is less than 1%, we’ll reject the null hypothesis. Rejecting it would mean that the model performance of 67% is significantly better than chance. Otherwise, we fail to reject the null hypothesis.
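The decision rule itself is simple enough to sketch in a few lines of Python. The function name `decide` and the example p-values below are made up for illustration; only the alpha level of 0.01 comes from this article.

```python
def decide(p_value: float, alpha: float = 0.01) -> str:
    """Apply the decision rule: reject the null hypothesis only if p < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical p-values, chosen only to show both branches of the rule
print(decide(0.004))  # below alpha = 0.01
print(decide(0.03))   # above alpha = 0.01
```

Note that the same p-value of 0.03 would lead to rejection under a looser alpha level of 0.05, which is why the alpha level should be chosen before running the test.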
Step 4) Compute the probability of the experimental outcome under the null hypothesis
As described in Step 3, statistical tests return the probability of observing an experimental outcome at least as extreme as the one observed, given that the null hypothesis is true. Let’s think about how we would do this intuitively. There are three parameters that must be considered: the sample mean, the sample standard deviation, and the sample size. Let’s examine how manipulating each one, while keeping the other two fixed, should change the p-value. First, the closer our sample mean is to the value specified by the null hypothesis (in this case, a model accuracy of 25%), the higher the p-value should be. For instance, if the mean accuracy of the model were 28% instead of 67%, then the p-value should be much higher because the model is only 3 percentage points above chance. Second, the lower the standard deviation of our sample (a measure of the data’s spread, see optional math section for formula), the more confident we should be that any deviation from the null hypothesis value is meaningful. As an example, let’s say the model mean accuracy is fixed at 28% but the standard deviation is now lowered to 0.1% from the original 18%. Now, even though the difference is only 3 percentage points, there is an extremely low probability of obtaining a mean accuracy of 28% under the null hypothesis because the spread of the data is extremely low. The last factor is the sample size. Let’s imagine we had only 5 participants in the model decoding experiment; then we should be less confident that any difference between the sample statistic and the population parameter is meaningful than if we had, let’s say, 500 participants. Intuitively, if you want to know the average height of people in California, you’ll be less confident in your measurement if you sample only 5 people rather than 500. This is because with a sample size of 5 you might have happened to select a group of really tall or really short people, whereas the sample is more likely to reflect the entire population as it grows larger.
Therefore, the probability of observing the experimental outcome under the null hypothesis is generated based on three principles: 1) the difference between the experimental outcome and the value predicted by the null hypothesis, 2) the standard deviation of the experimental data, 3) the sample size. Both t-tests and z-tests provide a way to translate these principles into a mathematical formula.
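To see these three principles in action, here is a rough sketch that computes a one-tailed p-value directly from summary statistics, using the t-statistic formula covered in the optional math section. The helper `one_tailed_p` is our own illustrative function rather than part of SciPy, and the scenarios are the hypothetical ones discussed above.

```python
import math
from scipy import stats


def one_tailed_p(sample_mean, sample_std, n, null_value=0.25):
    """One-tailed ("greater than") one-sample t-test p-value from summary statistics."""
    standard_error = sample_std / math.sqrt(n)
    t_statistic = (sample_mean - null_value) / standard_error
    # Survival function = area under the t-distribution to the right of t_statistic
    return stats.t.sf(t_statistic, df=n - 1)


scenarios = {
    "reported result (mean 67%, sd 18%, n = 36)": (0.67, 0.18, 36),
    "mean close to chance (28%)": (0.28, 0.18, 36),
    "mean 28%, tiny spread (sd 0.1%)": (0.28, 0.001, 36),
    "mean 28%, only 5 participants": (0.28, 0.18, 5),
    "mean 28%, 500 participants": (0.28, 0.18, 500),
}
for label, (mean, std, n) in scenarios.items():
    print(f"{label}: p = {one_tailed_p(mean, std, n):.3g}")
```

Running it should show the p-value rising as the mean moves toward chance, and falling again as the spread shrinks or the sample grows.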
Optional Section: Understand the math
So far, we’ve given an intuitive explanation for how these tests operate. But how is this intuition translated into formulas? If you’re not interested in the math feel free to skip this section, but otherwise dive in. For those of you who want to go even deeper into the math, we recommend this introductory course (Schmitz, 2012).
To translate our intuition into formulas, it’s helpful to introduce the following names to make the notation more compact:
Population mean: $\mu$
Population standard deviation: $\sigma$
$i$th data point: $x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Sample size: $n$
For our example, x̄ = 67% and s = 18%.
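As a quick check on the notation, the sketch below computes the sample mean and sample standard deviation by hand and compares them with NumPy’s built-in functions. The accuracy values are invented for illustration and are not data from the study.

```python
import numpy as np

# Hypothetical per-participant accuracies (illustrative values, not from the study)
accuracies = np.array([0.55, 0.80, 0.62, 0.71, 0.49, 0.90])

n = accuracies.size
sample_mean = accuracies.sum() / n
# Deviations are measured from the sample mean, and the sum is divided by n - 1
sample_std = np.sqrt(((accuracies - sample_mean) ** 2).sum() / (n - 1))

# NumPy gives the same answers; ddof=1 selects the n - 1 denominator
print(sample_mean, accuracies.mean())
print(sample_std, accuracies.std(ddof=1))
```

One thing to watch for: NumPy’s `std` divides by n unless `ddof=1` is passed, so the default does not match the sample standard deviation defined above.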
Now, our first goal is to translate the three principles outlined in Step 4 into math. We want a function that takes as input the distance between x̄ and μ, the standard deviation (σ or s), and the sample size n, and outputs a p-value. The central limit theorem (CLT) is key to deriving this function. The CLT says that for sufficiently large n, x̄ is a random variable that approximately follows a normal distribution. This distribution is formally known as the “sampling distribution of the sample mean”. In our setting, it is the distribution of the sample means (in our case, the mean accuracy of the computer model) that we would expect under the null hypothesis. In other words, if the null hypothesis were true, we would expect the sample mean to follow the distribution outlined in Formula 1.
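$\bar{x} \sim \mathcal{N}(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$, where $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma / \sqrt{n}$ (Formula 1)
Here, $\mu_{\bar{x}}$ is the mean and $\sigma_{\bar{x}}$ is the standard deviation of the sample mean. $\sigma_{\bar{x}}$ is commonly referred to as the “standard error of the mean”.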
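If Formula 1 feels abstract, a small simulation can make it tangible. The sketch below assumes a uniform population on [0, 1] (our choice, picked because it is clearly not normal) and repeats an imaginary experiment of n = 36 observations many times. The spread of the resulting sample means should land close to σ divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only so the run is repeatable

# Assumed population for illustration: uniform on [0, 1]
population_mean = 0.5
population_std = np.sqrt(1 / 12)  # standard deviation of a uniform(0, 1) variable
n = 36
num_experiments = 20_000

# Each row is one simulated experiment with n observations; average each row
sample_means = rng.uniform(0, 1, size=(num_experiments, n)).mean(axis=1)

predicted_standard_error = population_std / np.sqrt(n)
print("mean of sample means:    ", sample_means.mean())
print("std of sample means:     ", sample_means.std(ddof=1))
print("predicted standard error:", predicted_standard_error)
```

Plotting a histogram of `sample_means` would also show the familiar bell shape, even though the individual observations are uniformly distributed.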
All we have to do now to obtain the p-value is determine the probability of obtaining a sample mean at least as extreme as x̄ under this distribution. Let’s unpack how the sampling distribution of the sample mean embodies the three principles discussed in Step 4. First, the sampling distribution of the sample mean is centered at the population mean. This means that, all other variables being fixed, a smaller distance between x̄ and μ will return a higher p-value. Second, the standard error $\sigma_{\bar{x}}$ decreases as σ decreases, making the tails of the normal distribution thinner. This means there is a lower probability of observing a given difference between x̄ and μ under the null hypothesis, resulting in a lower p-value (the reverse is true as σ increases). Finally, increasing n also decreases the standard error of the mean, which again results in thinner tails and a lower p-value.
When performing a z-test, we z-score x̄ by subtracting $\mu_{\bar{x}}$ and dividing the result by $\sigma_{\bar{x}}$ to obtain the z-statistic (z; Formula 2). This is only done for convenience, as it converts the sampling distribution of the sample mean to a standard normal distribution (which has a mean of 0 and a standard deviation of 1). The z-statistic tells us how many standard errors away from the population mean our sample mean is. We can then use a z-table or statistical software to determine the p-value. If we are doing a one-tailed test, we directly compare the p-value to the alpha level. If we are doing a two-tailed test, we have to consider both sides of the standard normal distribution (also called the z-distribution). To take this into account, we double the p-value obtained and then compare it to the alpha level. Remember that p-values represent the probability of obtaining a statistic at least as extreme as the one observed, and for a two-tailed test we need to consider the probability of obtaining a value at least this far from the mean in either direction. Because a z-table only returns the probability (area under the curve) of observing a value more extreme in one direction, we double it to take both directions into account. We can simply double it because the normal distribution is symmetric.
$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$, where $z \sim \mathcal{N}(0, 1)$ (Formula 2)
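Here is how the z-test calculation might look for our running example. One big caveat: a real z-test needs the population standard deviation, which the study does not report, so this sketch simply pretends that the 18% from the table is σ. Instead of a z-table, it uses `norm.sf` from SciPy, which returns the area to the right of a value under the standard normal curve.

```python
import math
from scipy.stats import norm

sample_mean = 0.67
null_mean = 0.25   # chance performance, the mean under the null hypothesis
sigma = 0.18       # ASSUMPTION: treated as the known population standard deviation
n = 36

standard_error = sigma / math.sqrt(n)            # 0.18 / 6 = 0.03
z = (sample_mean - null_mean) / standard_error   # 0.42 / 0.03, roughly 14

p_one_tailed = norm.sf(z)            # area to the right of z
p_two_tailed = 2 * norm.sf(abs(z))   # doubled, thanks to symmetry

print("z-statistic:", z)
print("one-tailed p-value:", p_one_tailed)
print("two-tailed p-value:", p_two_tailed)
```

A sample mean about 14 standard errors above chance is extremely unlikely under the null hypothesis, so both p-values come out far below our alpha level of 0.01.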
In most cases, we don’t have access to σ and instead only have access to s. Therefore, we need to apply the t-test instead of the z-test. The formula for the t-statistic is identical to the z-statistic formula, except that s is substituted for σ (Formula 3). Additionally, the t-statistic follows a Student’s t-distribution instead of a standard normal distribution. Unlike the normal distribution, there is a different Student’s t-distribution for every n, and the tails of this distribution grow thicker as n grows smaller (thereby making it harder to reject the null hypothesis). Because of this design feature, the t-test can be applied even when the sample size is small (n<30). Crucially, when dealing with small sample sizes the CLT no longer holds, and so a core assumption of the t-test when working with small sample sizes is that the population is normally distributed. Tests for normality are beyond the scope of this article, but check out this resource for more information.
$t = \frac{\bar{x} - \mu_{\bar{x}}}{s / \sqrt{n}}$ (Formula 3)
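Formula 3 can be applied to the summary statistics from the table directly. The sketch below assumes n − 1 degrees of freedom, which is the usual choice for a one-sample t-test, and compares the resulting p-value with the one a standard normal distribution would give for the same statistic.

```python
import math
from scipy.stats import norm, t

sample_mean = 0.67
null_mean = 0.25
s = 0.18           # sample standard deviation from the table
n = 36

t_statistic = (sample_mean - null_mean) / (s / math.sqrt(n))

# One-tailed p-value under the Student's t-distribution with n - 1 degrees of freedom
p_from_t = t.sf(t_statistic, df=n - 1)
# Same statistic evaluated under the standard normal, for comparison only
p_from_normal = norm.sf(t_statistic)

print("t-statistic:", t_statistic)
print("p-value (t-distribution):", p_from_t)
print("p-value (standard normal):", p_from_normal)
```

Both p-values are tiny, but the one from the t-distribution is larger, which is the thicker tails at work.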
Demo: Performing the test in Python and R
We’ve now covered the fundamental concepts underlying the one-sample z- and t-tests. Let’s now illustrate an example of performing this test in Python using the SciPy and NumPy packages.
First, we’ll import the one-sample t-test function from SciPy as well as the NumPy package:
from scipy.stats import ttest_1samp
import numpy as np
Next, we’ll enter the data from the table above:
model_performance_mean = 0.67
model_performance_std = 0.18
chance_performance = 0.25
num_participants = 36  # matches the number of participants in the table above
Since we don’t have access to the study data, we’ll generate fake data based on the mean and standard deviation of the model performance using the NumPy package. We’ll generate this data based on the normal distribution. Note that one of the assumptions of the t-test is that the population data is normally distributed. Statistical tests that assume the data follows a certain distribution are called parametric; if you are unsure whether your data follows a certain distribution, use a non-parametric test that does not assume any underlying distribution.
fake_participant_data = np.random.normal(model_performance_mean, model_performance_std, num_participants)
We can now enter this data into the ttest_1samp function. The function returns both the t-statistic and p-value.
t_statistic, p_value = ttest_1samp(fake_participant_data, chance_performance, alternative='greater')
With data generated this way, the p-value returned is far below the alpha level of 0.01 we defined in Step 3, meaning we can reject the null hypothesis and state that the model performed significantly above chance.
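As a sanity check, we can confirm that SciPy’s output matches the formula from the optional math section. This sketch regenerates fake data with a fixed seed (the seed value is arbitrary, chosen only so the run is repeatable) and recomputes the t-statistic and p-value by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatability only
fake_data = rng.normal(0.67, 0.18, 36)
chance = 0.25
alpha = 0.01

# SciPy's one-sample, one-tailed t-test
t_scipy, p_scipy = stats.ttest_1samp(fake_data, chance, alternative="greater")

# The same quantities computed by hand
n = fake_data.size
standard_error = fake_data.std(ddof=1) / np.sqrt(n)
t_manual = (fake_data.mean() - chance) / standard_error
p_manual = stats.t.sf(t_manual, df=n - 1)

print("t-statistic (SciPy, manual):", t_scipy, t_manual)
print("p-value (SciPy, manual):", p_scipy, p_manual)
print("reject the null hypothesis:", p_scipy < alpha)
```

The two sets of numbers should agree up to floating-point rounding, which shows that `ttest_1samp` is doing nothing more mysterious than the calculation described above.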
The analogous R code is shown below:
model_mean <- 0.67
model_std <- 0.18
chance <- 0.25
num_participants <- 36
fake_data <- rnorm(n = num_participants, mean = model_mean, sd = model_std)
t_result <- t.test(x = fake_data, alternative = "greater", mu = chance)
1. Strictly speaking, one must know the population standard deviation when applying the z-test. Because researchers almost never have access to the population standard deviation, the t-test is much more commonly used. However, when the sample size is large (e.g., greater than 30), some resources suggest using the z-test even when the population standard deviation is not known. This is because the z-statistic follows a standard normal distribution, and the t-statistic follows a Student’s t-distribution. For large sample sizes, the Student’s t-distribution becomes virtually identical to the standard normal distribution, and so substituting in the z-test yields similar results.
2. If we are performing a two-tailed test, we double the p-value before comparing it to the alpha level. Statistical software will automatically perform this step, but an explanation of why this is done can be found in the optional math section.
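To illustrate the first note, the sketch below compares one-tailed critical values (the value a test statistic must exceed to be significant at alpha = 0.01) for the t-distribution and the standard normal distribution. The sample sizes are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

alpha = 0.01
z_critical = norm.ppf(1 - alpha)  # cutoff under the standard normal distribution

for n in (5, 10, 30, 100, 1000):
    # Cutoff under the Student's t-distribution with n - 1 degrees of freedom
    t_critical = t.ppf(1 - alpha, df=n - 1)
    print(f"n = {n:4d}: t critical = {t_critical:.3f}, z critical = {z_critical:.3f}")
```

The t cutoff is noticeably higher for small samples and drifts toward the z cutoff as n grows, which is why the two tests give similar answers for large samples.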
Written by Ebrahim Feghhi
Illustrated by Himani Arora
Edited by Liza Chartampila and Shiri Spitz-Siddiqi
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews. Neuroscience, 14(5), 365–376.
Hoskin, A. N., Bornstein, A. M., Norman, K. A., & Cohen, J. D. (2019). Refresh my memory: Episodic memory reinstatements intrude on working memory maintenance. Cognitive, Affective & Behavioral Neuroscience, 19(2), 338–354.
Nuzzo, R. (2014). Scientific method: Statistical errors. Nature, 506, 150–152.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Schmitz, A. (2012). Introductory Statistics. Saylor Academy. https://saylordotorg.github.io/text_introductory-statistics/