dr.med. Branko Soric
Zagreb, March - June 2001
Branko Soric:
opinions have been expressed about the reliability and correctness of statistical testing. We do not know today how much truth
there is in parts of medicine (as well as other sciences) that have only been "verified" statistically, without any other
more reliable proofs. With the usual statistical significance levels, such as 5% or 1%, the percentage of untruth can easily
be greater than 10%, or 20%, or 50% ... However, we should KNOW that the percentage of fallacies is smaller than 1% (or, perhaps,
5%) - because science should comprise known facts, and not the unknown! It is necessary to correct statistical textbooks
as well as the practice.
Karl Pearson, having investigated the fairness of the Monte Carlo roulette in 1894, rejected
a null hypothesis on the ground of an extremely small probability (far less than 10^-9, or one in a thousand million) that
the observed phenomena could occur by chance with an unbiased roulette. Because such occurrences were "practically impossible"
with an unbiased roulette, he inferred that the roulette must be biased. (Note: A null hypothesis is an assumption that some
phenomenon or effect does NOT exist. To reject a true null hypothesis means to make a false discovery.)
Later,
statisticians (unjustifiably!) greatly loosened the criterion for rejecting a null hypothesis, in order to achieve
"statistical verification" of statements more easily, i.e. to make more scientific discoveries. An event with a probability of
about 0.1% could hardly be called "practically impossible"! Still, statisticians now say that even a 5-percent probability
is small enough to reject a null hypothesis! This seems to have rendered today's science insufficiently credible. In the
last ten or more years (as far as I know) neither research practice nor the statistical textbooks have improved.
If there is a large number (a) of true null hypotheses in a very large number (n) of independent experiments, and
if these n experiments produce r results ("discoveries") significant at the level ¤ (=alpha), then the probability that a
discovery is false is not ¤ (as is often imagined) but it is: ¤a/r = ¤a/(¤a+fb) (and this is different from ¤, except if a
= r; but this may not be so, and this is also unknown to us, because "a" is an unknown number). (See Figure 1).

¤a = number of false discoveries;
fb = number of true discoveries (b = number of false null hypotheses, f = average power of the tests used); a+b=n
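The relation above can be checked numerically. The following is a minimal sketch, not part of the original papers: the function name and the example numbers (900 true null hypotheses, 100 false ones, average power 0.5) are purely hypothetical and were chosen only to show that the proportion of false discoveries can be far larger than alpha.

```python
def false_discovery_proportion(alpha, a, f, b):
    """Proportion of false discoveries, alpha*a / (alpha*a + f*b).

    alpha : significance level
    a     : number of true null hypotheses
    f     : average power of the tests (assumed known here, for illustration only)
    b     : number of false null hypotheses
    """
    false_discoveries = alpha * a   # expected number of false rejections
    true_discoveries = f * b        # expected number of correct rejections
    r = false_discoveries + true_discoveries
    if r <= 0:
        raise ValueError("no discoveries: the proportion is undefined")
    return false_discoveries / r


# Hypothetical example: 1000 experiments, 900 of them with a true null hypothesis.
q = false_discovery_proportion(alpha=0.05, a=900, f=0.5, b=100)
print(f"proportion of false discoveries: {q:.3f}")
```

In practice neither a nor f is known, which is why only an upper bound on this proportion can be calculated from n, r and alpha.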
In previously published papers (Soric B., 1981 and 1989; see below: References)
I explained how to calculate the maximal percentage (or proportion, Qmax) of false discoveries in a very large set of statistical
discoveries (or, if the set is somewhat smaller, an approximate value of Qmax), using the formula published
there. I also gave there a simple derivation of the formula. (Although the derivation is simple, still - as far as I know
- no similar derivation or such a formula had been published before). Here is the formula:
Qmax = ¤ * a_max / r = [(n/r) - 1] / [(1/¤) - 1]
(where ¤ stands for "alpha", and a_max is the largest possible number of true null hypotheses).
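One way to arrive at the formula (a sketch under the assumption that the average power f cannot exceed 1): from r = alpha*a + f*b and b = n - a it follows that r <= alpha*a + (n - a), hence a <= (n - r)/(1 - alpha), and multiplying this largest a by alpha/r gives the expression above. The small Python function below evaluates it. The function name, the input checks and the capping of the result at 1 are additions made for illustration and do not come from the cited papers.

```python
def qmax(n, r, alpha):
    """Maximal proportion of false discoveries among r significant results.

    n     : number of independent experiments (assumed very large)
    r     : number of results significant at level alpha
    alpha : significance level used in every experiment
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0 < r <= n:
        raise ValueError("r must satisfy 0 < r <= n")
    value = ((n / r) - 1) / ((1 / alpha) - 1)
    # Assumption of this sketch: a proportion cannot exceed 1, so the bound
    # is capped when r is not larger than alpha * n.
    return min(1.0, value)


# Hypothetical numbers: 1000 experiments, 200 significant at the 5-percent level.
print(f"Qmax = {qmax(1000, 200, 0.05):.3f}")
```

When r is no larger than alpha*n the bound reaches 1, i.e. the data do not exclude the possibility that every discovery is false.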
Strictly speaking, that formula applies to infinitely large sets of experiments. It
can give only approximate values of Qmax for sets amounting to several thousand experiments. Still, for such sets, a practically-largest-possible
Qmax value can be calculated accurately enough (as will be shown in a more extensive text).
In a favourable case, i.e.
if the Qmax is found to be small enough, it means that we have obtained useful information. In the opposite case, if the
Qmax value is too large, it is not necessary to definitively discard the whole set of discoveries, but these discoveries require
additional verification.
The true (unknown) Q value can be smaller than the calculated Qmax value. For example,
if we obtain Qmax = 30%, it is possible that Q is less than 5%, or even less than 1%. Such a possible failure to become aware
of an actually satisfactory Q value is not necessarily so great a disadvantage as it may seem; because, in fact, our situation
is generally not good enough! Namely, it is not very useful to learn the percentage of true discoveries in a large set r without
knowing the average difference between populations (i.e. the average effect size) in the set of true discoveries. If this
average effect size is negligibly small, such true discoveries can hardly be useful to us. However, unfortunately, whenever
we obtain a result significant at the 5-percent level (or 1-percent level, or the like), such an obtained statistical significance
level contains no information about the effect size! We can easily be wrong in believing that an effect size is larger than
zero, so we are even less justified in supposing that an effect size exceeds some definite larger-than-zero value. For the
same reason the 95-percent (or 99-percent) confidence intervals do not provide any additional information about the effect
size. (Explanations will be given in a more extensive text).
Of what avail would it be to make 100 discoveries with
only one of them being false, if the populations in the remaining 99 true discoveries would practically not differ from the
null populations?! Taking that into account, the apparent disadvantage (mentioned above) might in fact be considered to be
an advantage: namely, it is not bad to rely upon a calculated Qmax value, because, if the difference between Qmax and Q is
greater (i.e. if Q becomes smaller in comparison with Qmax), the average effect size in the set of discoveries (r) becomes
smaller. In an extreme case the average effect can even be negligibly small. So, there would not necessarily be much damage
even if such discoveries were lost. However, they don't have to be lost, because we can further test and verify them.
In the opposite case, if a very small Qmax value is obtained, we know that the average effect is not negligible (and it can even
be large), because it corresponds to the average power of the statistical tests used, which is f > r/n (the value r/n
is known in such a case).
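The bound f > r/n can be illustrated with a short calculation. This is only a sketch under stated assumptions: it takes the relation r = alpha*a + f*b with a + b = n from the text, assumes r > alpha*n, and solves for f at several hypothetical values of the unknown number a. The function name and all numbers are illustrative.

```python
def average_power(n, r, alpha, a):
    """Average power f implied by r = alpha*a + f*b, with b = n - a."""
    b = n - a
    if b <= 0:
        raise ValueError("at least one null hypothesis must be false")
    return (r - alpha * a) / b


# Hypothetical numbers: 1000 experiments, 200 significant at the 5-percent level.
n, r, alpha = 1000, 200, 0.05
a_max = (n - r) / (1 - alpha)   # largest a compatible with f <= 1

for a in (0, 250, 500, 750, a_max):
    f = average_power(n, r, alpha, a)
    print(f"a = {a:7.1f}   f = {f:.3f}   (r/n = {r / n:.3f})")
```

Under these assumptions f equals r/n only if there are no true null hypotheses at all (a = 0) and grows as a grows, so the known ratio r/n serves as a lower limit for the average power.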
In a SINGLE experiment it is important to attain a very high (or even an extremely high)
significance level - i.e. a very small p value (smaller than a very small probability ¤ [=alpha]), because that makes it possible
to assess the effect size. For example: If 20% of untreated patients die (of a certain disease), and if an investigation shows
that only 14% of 5000 treated patients died - which is extremely significant (p = 4×10^-26) - then we can reject not only the
null hypothesis (implying 20% of deaths among the treated patients) but also the hypothesis of 17% of deaths in the treated
population, because p is still very small (p = 2×10^-8). Thus we learn that the treatment saves at least 3% of the
patients, which means that the effect size is not only larger than zero but also larger than 3 percentage points.
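The example can be retraced with a rough calculation. The text does not say which test produced the quoted p values, so the sketch below assumes a two-sided test based on the normal approximation to the binomial distribution; the function name is hypothetical. Under that assumption the results should fall in the neighbourhood of the quoted values, although the exact figures depend on the test used.

```python
import math


def two_sided_p(observed_rate, hypothesized_rate, sample_size):
    """Two-sided p value for one proportion (normal approximation, assumed here)."""
    standard_error = math.sqrt(
        hypothesized_rate * (1 - hypothesized_rate) / sample_size
    )
    z = (observed_rate - hypothesized_rate) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# 14% of 5000 treated patients died; test against 20% and against 17% mortality.
p_null = two_sided_p(0.14, 0.20, 5000)
p_17 = two_sided_p(0.14, 0.17, 5000)
print(f"hypothesis of 20% deaths: p = {p_null:.1e}")
print(f"hypothesis of 17% deaths: p = {p_17:.1e}")
```

Both hypotheses are rejected at any of the very strict levels discussed in this text, which is what allows a statement about the size of the effect and not merely about its existence.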
Perhaps the greatest
problem is that we cannot even rely on the truth of the gathered data that are to be statistically analysed. It is
not rare that untrue or "strained" data are used, either unintentionally or consciously, which makes it easy to achieve new
"discoveries". There are also fallacies on the part of some other experts who sincerely intend to be quite objective. (A few
examples will be given in the more extensive text. Such is, e.g., the fallacy about 95-percent or 99-percent confidence intervals,
which are imagined to make it possible to assess the effect sizes of not-very-highly significant "discoveries").
It seems
to me that some experts do not quite understand what exactly should be written and explained in every textbook of statistics,
which is the following:
today the risk is unknown (as distinguished from what is written in textbooks!).
A more correct procedure would be
this: EITHER (1) we should consider the number (r) of significant results ("discoveries") obtained in a known large number
(n) of experiments and hence (approximately) calculate the maximal percentage (Qmax) of false discoveries among all the r
discoveries; OR (2) we should reject a null hypothesis in a SINGLE experiment only on the ground of a much higher
significance level (than those currently usual), i.e. a much smaller p value, e.g. near 0.000000001
= 10^-9 or (perhaps) p<0.000001 = 10^-6 (although I do not rule out the possibility that some other p value might,
PERHAPS, suffice for rejecting null hypotheses in single experiments, say p<0.0001 (?), or p<0.001 (??), but THAT MUST
BE PROVED in the first place!)
[I wrote about the above problems in 1981 to 1989 (see the References below: Soric
B., 1981, 1989). Other authors also wrote about the incorrect understanding of the meaning and usefulness of statistical significance
(for example: Iyengar S. and Greenhouse J. B. 1988; Morrison D. E and Henkel R. E. 1973; Oakes M. W. 1986; Soric B. and Petz
B. 1987; see References, below)].
Accordingly, how should researchers act? It is not bad to publish
results with achieved p values of about 0.05 to 0.001 (or the like), but these should not be considered sufficiently verified
discoveries! Instead, such results should be understood only as "proposals" for further verification (until, perhaps, the
really sufficient level of significance is established for single experiments). If an alternative hypothesis is really true,
it will be easy to reach a much higher significance level by repeating such a single experiment on larger samples; OR,
large-enough sets of experiments can be repeated in order to achieve a small-enough Qmax proportion (possibly even at very
low significance levels, such as, e.g., p=0.05 or p=0.1 or 0.2 etc.). Unless the experiments are repeated, the results must
not be considered true discoveries, even if extremely high significance levels are attained, because of possible (unintentional)
systematic errors in gathering the data (or some other possible inaccuracy or mistake).
This text (as well as the more extensive one) contains mainly those explanations which had to be omitted, for the sake of brevity,
in my previously published works (see the References below: Soric B., 1981, 1989). In addition, I shall here give a more general
formula, which makes it possible to calculate a Qmax value when the number (n) of experiments is unknown.
Other authors
have also dealt with the problems of inference based on statistical testing (see the References below).
I don't know
if my efforts will help to change things for the better. Nevertheless, I am writing this, because it might be even less useful
to give up trying.
I am publishing this text without having asked anybody to review it, because all the essential
topics have already been published in my previous articles after having been approved by competent reviewers. Still, due to
the lack of a new review, THERE MAY BE MISTAKES IN THIS TEXT, and I beg the readers to alert me about the possible mistakes
in order that I can correct them.
(I hope to publish here a more-or-less-extensive text
in a few weeks or months).
