We often use SPSS or R tables to conduct hypothesis testing—also known as testing for statistical significance—between means. These outputs generally provide t-test results, and in SPSS at least, produce a little alphabetical footnote when a difference is statistically significant. It’s quick and easy, but the problem is that it’s wrong.

Even with the Bonferroni adjustment (which compensates for the fact that more than two groups are being compared), the custom tables procedure commits Type II errors (reporting a difference as not significant when it is) as well as the more common Type I errors (reporting a difference as significant when it's not).

So what do we do?

When there are 3 or more groups, use analysis of variance (ANOVA), which is the appropriate way to test differences between means. If you're simply comparing one group against another, custom tables with t-tests are fine.

Here’s an example. In the first table, statistically significant differences are labeled from the SPSS custom tables procedure:

Notice that the only significant difference for Group 1 is against Group 2, the 6.6 versus 5.3. The difference between the means is 1.3. But the difference between groups 1 and 3 is also 1.3, yet it is not considered significant.

We might naively assume that it’s because Group 3 has a smaller sample size (n) than Group 2, therefore that 1.3 difference is not significant. That might be true in some cases, but the reality is that the sample size for Group 2 is 275, while that of Group 3 is 721! So what’s going on?

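The fact is that using custom tables to hypothesis test means is a bad idea. This is because what you assume is a 95 percent confidence interval (CI) really applies to only one pair of means. Every additional pair you test is another chance for a false positive, so the effective confidence for the full set of comparisons drops as you add pairs, whether you know it or not.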
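To make the effect of adding pairs concrete, here is a minimal sketch of how the overall false-positive risk grows with the number of groups. It assumes every pairwise test is run at α = 0.05 and treats the tests as independent, which is only an approximation for comparisons that share groups; the function name is illustrative.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise comparisons, chance of at least one false positive).

    Assumes the comparisons are independent, which is only an
    approximation for pairwise tests that share groups.
    """
    n_pairs = comb(n_groups, 2)
    return n_pairs, 1 - (1 - alpha) ** n_pairs

for n_groups in (2, 3, 5):
    n_pairs, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {n_pairs} pairs, "
          f"overall false-positive risk ~ {fwer:.1%}")
```

Under these assumptions, two groups keep the risk at 5 percent, but five groups mean 10 pairs and a risk of roughly 40 percent that at least one difference is flagged by chance alone.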

**About confidence intervals**

Selecting a 95 percent confidence interval does not reflect a "too-high" or "too-low" action standard benchmark. It is simply a measure of whether the deltas between the survey results, at whatever level, are due to reality or to random sampling error. The lower the confidence level you choose, the greater the risk of treating a delta that is only sampling error as real.

In market research, almost no one uses something as stringent as a 99 percent CI; this is usually reserved for medical or social research, where the certainty of results is held to a higher standard. Ninety percent is generally too low and could yield Type I errors, where a survey result is considered "real" when in fact it's not.

For example, say there is an action standard of a 20 percent increase over the control sample. So against a benchmark of 8.0, the test result must reach 9.6 or better. The purpose of significance testing is to determine whether that 1.6 delta is due to reality or to sampling error. It may look "real" (i.e., statistically significant) at the 90 percent CI but not at the 95 percent CI, and certainly not at the 99 percent CI.
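The following sketch shows how one and the same delta can pass at 90 percent yet fail at 95 and 99 percent. The standard error of 0.9 is a made-up value chosen for illustration, and the test uses a large-sample normal approximation rather than the exact t distribution.

```python
from statistics import NormalDist

control_mean = 8.0
test_mean = 9.6        # 20 percent above the control benchmark
standard_error = 0.9   # hypothetical standard error of the difference

z = (test_mean - control_mean) / standard_error

def is_significant(z_stat, confidence):
    """Two-sided test under a large-sample normal approximation."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return abs(z_stat) > critical

results = {conf: is_significant(z, conf) for conf in (0.90, 0.95, 0.99)}
for conf, significant in results.items():
    label = "significant" if significant else "not significant"
    print(f"{conf:.0%} confidence: {label}")
```

With these assumed inputs the statistic is about 1.78, which clears the 90 percent cutoff (about 1.64) but not the 95 percent (about 1.96) or 99 percent (about 2.58) cutoffs.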

More precisely, in addition to the delta itself, the absolute value of the numbers being compared, the sample size and, in this case (since we're comparing means with Student's t-test), the standard deviation together determine whether that 1.6 is significant or not.
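As a rough illustration of how sample size and standard deviation change the verdict on the same 1.6 delta, the sketch below computes a two-sample t statistic from summary figures. It uses the Welch (unequal-variance) form, which gives the same statistic as the pooled Student's version when both groups share the same standard deviation and sample size, as they do here. All standard deviations and sample sizes are hypothetical.

```python
from math import sqrt

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for the difference between two independent means (Welch form)."""
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Same 1.6 delta in every scenario; only spread and sample size change.
scenarios = {
    "sd=4, n=30": welch_t(9.6, 4.0, 30, 8.0, 4.0, 30),
    "sd=4, n=300": welch_t(9.6, 4.0, 300, 8.0, 4.0, 300),
    "sd=2, n=30": welch_t(9.6, 2.0, 30, 8.0, 2.0, 30),
}
for label, t in scenarios.items():
    print(f"{label}: t = {t:.2f}")
```

A larger sample or a tighter spread pushes the t statistic up, so the identical delta moves from below the usual cutoff of roughly 2 to well above it.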

**Using ANOVA**

Now look at what happens when we run the same comparisons from the first (grocery shopping) example with ANOVA.

**Multiple Comparisons**

*When I shop for groceries online, I always look for the lowest price. Use a scale from 1 to 10, where 10 means you "Agree completely."*

Look at the “sig” column, which holds the p-value for each comparison. Anywhere the value is < 0.05, the difference is significant. The ANOVA output also puts an asterisk (*) by the significant differences in the “Mean Difference” column. Using the Scheffé post hoc comparison function, the mean for Group 1 is significantly different from those of both Groups 2 and 3. Their p-values are reported as 0.000 (that is, below 0.0005), which is highly significant. But the p-values for Group 1 versus Groups 4 and 5 are 0.075 and 0.146, respectively, which are both north of the alpha (α) of 0.05.
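For readers who want to see what a Scheffé comparison does under the hood, here is a minimal sketch in plain Python. The ratings are invented for illustration and are not the survey data. The example is deliberately limited to three groups because the F distribution with 2 numerator degrees of freedom has a simple closed-form tail probability; with more groups, a statistics package would be the practical choice.

```python
from itertools import combinations
from statistics import mean

# Hypothetical 1-to-10 agreement ratings for three groups (not the survey data).
groups = {
    "Group 1": [7, 8, 6, 7, 7],
    "Group 2": [5, 6, 5, 4, 5],
    "Group 3": [6, 7, 6, 5, 6],
}

k = len(groups)
n_total = sum(len(v) for v in groups.values())
df_between, df_within = k - 1, n_total - k
assert df_between == 2  # the closed-form tail below only holds for three groups

# Within-group mean square: the pooled error variance that ANOVA uses.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
ms_within = ss_within / df_within

def f_tail_2df(f_stat, df2):
    """P(F > f_stat) for an F distribution with (2, df2) degrees of freedom."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

scheffe_p = {}
for a, b in combinations(groups, 2):
    diff = mean(groups[a]) - mean(groups[b])
    se_sq = ms_within * (1 / len(groups[a]) + 1 / len(groups[b]))
    # Scheffé divides the pairwise F by (k - 1) to cover all possible contrasts.
    f_scheffe = diff ** 2 / se_sq / df_between
    scheffe_p[(a, b)] = f_tail_2df(f_scheffe, df_within)
    print(f"{a} vs {b}: mean difference {diff:+.1f}, p = {scheffe_p[(a, b)]:.3f}")
```

In this toy data set only Group 1 versus Group 2 falls below 0.05. Because every pair is judged against the same pooled error variance and a threshold that already accounts for all comparisons, the p-values can be read together without the confidence level eroding.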

The right tool produces the right results.