An arduous journey into statistical significance
All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking “are the effects different?” is foolish.
This is a quote by John Tukey, and I’d like to start with it to get to the core of this post.
What are we really trying to find out/explain/infer when modeling information coming from a sample?
We generally look for effects: we want to establish whether something affects something else. The standard reasoning uses the evidence of an effect at the sample level to get a clue about what happens at the population level. If the data show reasonably strong evidence of an effect, we may conclude that A and B differ with respect to the feature we are analyzing.

The issue is that such a decision about A and B is strongly influenced by the sample size and by the confidence level we adopt. That is logically unavoidable, I agree, but we would still like an answer that does not flip with these choices and can safely guide decisions. The fact that a 90% confidence interval obtained from 40 units can say the opposite of a 95% confidence interval obtained from 30 units is a clear pitfall for statistics-driven decisions.

The question we pose is: “Are A and B different?”. The answer will (almost?) always be: “Yes, they are, because there is zero probability of observing something perfectly equal to something else on a continuous scale”. Here may lie the death of statistical inference (at least of classical inference). In reality, things should be interpreted differently: increasing the sample size at a given confidence level only makes the method more certain that the different measurements observed for A and B are actually different. This is what we usually call “power” or “sensitivity”: the method gets better at spotting that two values are not the same.

The better question is: “How different are they?”. This translates into “Are we really interested in a 0.007 difference between A and B?”. It may be a lot, it may be negligible; it depends on the context and on the scale of the variable. This is to say that we have to be careful with statistical significance and reason much more about the magnitude and sign of an effect. Proceeding this way lets us interpret results more objectively and make decisions that are harder to question.
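To make the sample-size effect concrete, here is a minimal sketch in Python. The means (10.000 vs 10.007), the standard deviation, and the sample sizes are all invented for illustration; it is a simple two-sided z-test with known standard deviation, not a recipe for real analyses. The point is that the same fixed, tiny 0.007 difference goes from “nowhere near significant” to “highly significant” purely because n grows.

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_a, mean_b, sd, n):
    """Two-sided p-value for two equal-size samples, known sd (z-test)."""
    se = sd * math.sqrt(2.0 / n)          # standard error of the difference
    z = abs(mean_a - mean_b) / se         # standardized difference
    return 2.0 * (1.0 - NormalDist().cdf(z))

# The true difference is fixed at 0.007; only the sample size changes.
for n in (50, 5_000, 500_000, 5_000_000):
    p = two_sample_z_p_value(10.000, 10.007, 1.0, n)
    print(f"n = {n:>9}: p = {p:.4f}")
```

With small n the p-value is large; with enough data the same negligible difference becomes “statistically significant”, which says nothing about whether 0.007 matters on this scale.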
That said, let’s now give some tips on interpreting an effect. Say we observe that taller students score 1 to 3 points worse in math, and this looks significant, since it was found in a large sample. Can we conclude that height affects math scores? The first answer must be another question: is a 1-to-3 point gap a substantial gap? If it is, are we missing some covariates that are linked to the phenomenon and would display it from a much better perspective? For instance, have the tall students attended more basketball classes during the school year, leaving them less time for math homework? If so, you should adjust the school timetable rather than use math tests to show senseless differences.
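The confounding idea above can be sketched numerically. Every record below is hypothetical (height group, basketball hours, math score are all invented): tall students are over-represented in the high-basketball-hours stratum, so the raw comparison shows a 2.4-point gap even though, within each stratum, tall and short students score exactly the same.

```python
from collections import defaultdict

# Hypothetical records: (height_group, basketball_hours, math_score).
# All numbers are invented purely to illustrate confounding.
records = (
      [("tall", "high", 74)] * 40 + [("tall", "low", 80)] * 10
    + [("short", "high", 74)] * 20 + [("short", "low", 80)] * 30
)

def mean_by(key_fn):
    """Mean math score grouped by an arbitrary key function."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        k = key_fn(rec)
        sums[k][0] += rec[2]
        sums[k][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

# Raw comparison: tall students look 2.4 points worse.
raw = mean_by(lambda r: r[0])
print(raw)  # {'tall': 75.2, 'short': 77.6}

# Stratified by basketball hours: the gap vanishes within each stratum.
adjusted = mean_by(lambda r: (r[0], r[1]))
print(adjusted)
```

The raw gap is entirely explained by who spends more hours in basketball class, which is exactly why a “significant” marginal difference should prompt a search for covariates before any causal conclusion.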