
Do You Even Data

A data-driven marketing blog


How to Do Statistical Significance Testing Correctly

We often use SPSS or R tables to conduct hypothesis testing—also known as testing for statistical significance—between means.  These outputs generally provide t-test results, and in SPSS at least, produce a little alphabetical footnote when a difference is statistically significant.   It’s quick and easy, but the problem is that it’s wrong. 

Even with the Bonferroni adjustment (which compensates for the multiple comparisons made when more than 2 groups are involved), the custom tables procedure can commit Type II errors (saying something is not significant when it is) as well as the more common Type I errors (saying something is significant when it’s not).

So what do we do?

When there are 3 or more groups, use ANOVA, which is the appropriate way to test differences between means. If you’re simply comparing one group against another, custom tables with t-tests are fine.

Here’s an example.  In the first table, statistically significant differences are labeled from the SPSS custom tables procedure:


Notice that the only significant difference for Group 1 is against Group 2: 6.6 versus 5.3. The difference between the means is 1.3. But the difference between Groups 1 and 3 is also 1.3, yet it is not considered significant.

We might naively assume that Group 3 has a smaller sample size (n) than Group 2, and that this is why its 1.3 difference is not significant. That might be true in some cases, but the reality is that the sample size for Group 2 is 275, while that of Group 3 is 721! So what’s going on?

The fact is that using custom tables to hypothesis test means is a bad idea when there are more than two groups. This is because the 95 percent confidence level you assume applies to only 1 pair of means. As you add pairs, the effective confidence level across the whole set of comparisons drops, whether you know it or not.
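To see how quickly that confidence erodes as pairs are added, here is a minimal Python sketch. It assumes every pairwise comparison is independent and tested at the same 0.05 level, which is a simplification (comparisons that share a group are not truly independent), so treat the numbers as a rough guide. The function names are illustrative only.

```python
from math import comb

ALPHA = 0.05  # per-comparison significance level (95 percent confidence)

def pairwise_comparisons(n_groups):
    """Number of distinct pairs of group means that can be compared."""
    return comb(n_groups, 2)

def familywise_error_rate(n_comparisons, alpha=ALPHA):
    """Chance of at least one false positive across all comparisons.

    Assumes the comparisons are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** n_comparisons

for n_groups in (2, 3, 5):
    pairs = pairwise_comparisons(n_groups)
    rate = familywise_error_rate(pairs)
    print(f"{n_groups} groups -> {pairs} pairs, "
          f"chance of at least one false positive: {rate:.1%}")
```

With five groups there are 10 pairs, and under the independence assumption the chance of at least one false positive is roughly 40 percent rather than the 5 percent you thought you had chosen.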

About confidence intervals

Selecting a 95 percent confidence interval does not reflect a "too-high" or "too-low" action standard benchmark. It is simply a measure of whether the deltas between the survey results, at whatever level, are due to reality or to random sampling error. The lower the confidence level, the more likely it is that a difference flagged as significant is due to sampling error, not reality.

In market research, almost no one uses something as stringent as a 99 percent CI; this is usually reserved for medical research or social research, where the certainty of results is subject to a higher standard. Ninety percent is generally too low and could yield Type I errors, where a survey result is considered "real" when in fact it's not.

As an example, say there is an action standard of a 20 percent increase over the control sample. So a benchmark of 8.0 must be exceeded by a 9.6 or better. The role of significance testing is to determine whether that 1.6 delta is due to reality or to sampling error. It may look "real" (that is, statistically significant) at the 90 percent CI but not at the 95 percent CI, and certainly not at the 99 percent CI.

More precisely, in addition to the delta itself, the sample sizes and, in this case (since we're comparing means with Student's t-test), the standard deviations determine whether that 1.6 is significant or not. The same inputs drive the "statistical power" of the test.
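The sketch below shows how the same 1.6 delta can pass at one confidence level and fail at another. Only the two means (8.0 and 9.6) come from the example above. The standard deviations and sample sizes are made-up values chosen for illustration, and the test uses a normal (z) approximation to the t-test, which is reasonable only for fairly large samples.

```python
from math import sqrt
from statistics import NormalDist

# Means are from the action-standard example; everything else is hypothetical.
control_mean, test_mean = 8.0, 9.6
control_sd, test_sd = 6.0, 6.0   # assumed standard deviations
control_n, test_n = 100, 100     # assumed sample sizes

# Standard error of the difference between two independent means.
standard_error = sqrt(control_sd**2 / control_n + test_sd**2 / test_n)
z = (test_mean - control_mean) / standard_error

def is_significant(z_stat, confidence):
    """Two-sided test using the normal approximation."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return abs(z_stat) > critical

results = {level: is_significant(z, level) for level in (0.90, 0.95, 0.99)}

for level, significant in results.items():
    print(f"{level:.0%} confidence: {'significant' if significant else 'not significant'}")
```

With these assumed inputs the test statistic is about 1.89, which clears the 90 percent threshold (about 1.645) but not the 95 percent (about 1.96) or 99 percent (about 2.576) thresholds. Larger samples or smaller standard deviations would shrink the standard error and could make the same delta significant at every level.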


Now look at what happens in the first example, on grocery shopping, when we run the same comparisons with ANOVA.

[Table: Multiple Comparisons, from running the same comparisons with ANOVA. Survey item: "When I shop for groceries online, I always look for the lowest price." Use a scale from 1 to 10, where 10 means "Agree completely."]

Look at the “sig” column. Anywhere the value is < 0.05, the difference is significant. The ANOVA output also puts an asterisk (*) by the significant differences in the “Mean Difference” column. Using the Scheffé post hoc comparison function, the mean for Group 1 is significantly different from both Groups 2 and 3. Their p-values are displayed as 0.000 (that is, below 0.0005), which is highly significant. But the p-values for Group 1 versus Groups 4 and 5 are 0.075 and 0.146, respectively, which are both north of the α of 0.05.
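For readers who want to see what a one-way ANOVA with a Scheffé post hoc comparison actually computes, here is a minimal sketch in plain Python. The ratings are made up and are not the survey data behind the table. The sketch is deliberately limited to three groups, because with 2 numerator degrees of freedom the F distribution has a simple closed-form tail probability; with more groups you would take the F tail probability from a statistics package instead.

```python
from itertools import combinations
from statistics import mean

# Hypothetical agreement ratings for three made-up groups (not the article's data).
groups = {
    "Group 1": [7, 8, 6, 7, 9, 7, 8, 6],
    "Group 2": [5, 6, 5, 4, 6, 5, 7, 5],
    "Group 3": [6, 7, 7, 6, 8, 6, 7, 7],
}

def f_tail_2df(f_value, df_within):
    """P(F > f_value) for an F distribution with exactly 2 numerator df."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)

k = len(groups)  # must be 3 for the closed-form tail above to apply
n_total = sum(len(values) for values in groups.values())
grand_mean = mean(x for values in groups.values() for x in values)

ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
df_between, df_within = k - 1, n_total - k
ms_within = ss_within / df_within

# Omnibus test: is there any difference among the group means?
f_omnibus = (ss_between / df_between) / ms_within
p_omnibus = f_tail_2df(f_omnibus, df_within)
print(f"ANOVA: F = {f_omnibus:.2f}, sig = {p_omnibus:.4f}")

# Scheffe post hoc: each pair is judged against the same F(k-1, N-k) reference.
scheffe = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    diff = mean(a) - mean(b)
    f_pair = diff ** 2 / (ms_within * (1 / len(a) + 1 / len(b))) / df_between
    scheffe[(name_a, name_b)] = (diff, f_tail_2df(f_pair, df_within))

for (name_a, name_b), (diff, sig) in scheffe.items():
    flag = "*" if sig < 0.05 else ""
    print(f"{name_a} vs {name_b}: mean difference = {diff:.3f}{flag}, sig = {sig:.4f}")
```

Every pair is evaluated against a single pooled error term and one shared reference distribution. That is what keeps the error rate for the whole family of comparisons at the chosen α, instead of letting it grow with each additional pair.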

The right tool produces the right results.

Dino Fire

Dino serves as President, Market Research & Data Science. Dino seeks the answers to questions and predictions of consumer behavior. Previously, Dino served as Chief Science Officer at FGI Research and Analytics. He is our version of Curious George, constantly seeking a different perspective on a business opportunity, such as new product design or needs-based segmentation. If you can write an algorithm for it, Dino will become engaged. Dino spent almost a decade at Arbitron/Nielsen in his formative years. Dino holds a BA from Kent State and an MS from Northwestern. Dino seems to have a passion for all numeric expressions.