Welcome! Here you can read about some of the fallacies in today's
This text is written by dr.med. Branko Soric
e-mail address: firstname.lastname@example.org
Note: In this text the Greek letter alpha is replaced by the following symbol: ¤
The exponents are denoted
by ^ ; for example: 10^-6 = 0.000001 (= "ten to minus sixth power"). Instead of subscripts, numbers follow after letters,
e.g.: r1, r2, ¤1, ¤2; likewise: Qmax.
Zagreb, March - June 2001
SCIENCE IS INSUFFCIENTLY VERIFIED STATISTICALLY
(IT IS NECESSARY EITHER TO CALCULATE A MAXIMAL PERCENTAGE
OF FALSE DISCOVERIES, OR TO ATTAIN HIGHER SIGNIFICANCE LEVELS IN SINGLE EXPERIMENTS)
A SURVEY (SUMMARY)
opinions have been expressed about the reliability and correctness of statistical testing. We do not know today how much truth
there is in parts of medicine (as well as other sciences) that have only been "verified" statistically, without any other
more reliable proofs. With chosen statistical significance levels, such as 5% or 1%, the percentage of untruth can easily
be greater than 10%, or 20%, or 50% ... However, we should KNOW that the percentage of fallacies is smaller than 1% (or, perhaps,
5%) - because the science should comprise known facts, and not the unknown! It is necessary to correct statistical textbooks
as well as the practice.
Karl Pearson, having investigated the correctness of the Monte Carlo roulette in 1894, discarded
a null hypothesis on the ground of an extremely small probability (far less than 10^-9 or one in a thousand million) that
the observed phenomena could randomly occur with an unbiased roulete. Namely, because such occurrences were "practically impossible"
with an unbiased roulette, he inferred that the roulette must be biased. (Note: A null hypothesis is an assumption that some
phenomenon or effect does NOT exist. To reject a true null hypothesis means to make a false dicovery).
statisticians have (unjustifiably!) greatly loosened the criterion for discarding a null hypothesis, in order to more-easily
achieve "statistical verification" of statements i.e. to make more scientific discoveries. An event with a probability of
about 0.1% could hardly be called "practically impossible"! Still, the statisticians now say that even a 5-percent probability
is good enough to discard a null hypothesis! This seems to have rendered the today's science insufficiently credible. In the
last ten or more years (as far as I know) neither the research practice nor the statistical textbooks have been made any better.
If there is a large number (a) of true null hypotheses in a very large number (n) of independent experiments, and
if these n experiments produce r results ("discoveries") significant at the level ¤ (=alpha), then the probability that a
discovery is false is not ¤ (as is often imagined) but it is: ¤a/r = ¤a/(¤a+fb) (and this is different from ¤, except if a
= r; but this may not be so, and this is also unknown to us, because "a" is an unknown number). (See Figure 1).
¤a = number of false discoveries;
fb = number of true discoveries; a+b=n
In previously published papers (Soric B., 1981 and 1989; see below: References)
I explained how to calculate a maximal percentage (or proportion, Qmax) of false discoveries in a very large set of statistical
discoveries (or, if the set is somewhat smaller, an approximate value of Qmax can be calculated) using the there-published
formula. Also, I gave there a simple derivation of the formula. (Although the derivation is simple, still - as far as I know
- no similar derivation or such a formula had been published before). Here is the formula:
Qmax = ¤(a,max)/r = [(n/r)-1]/[(1/¤)-1]
(where ¤ stands instead of "alpha").
That formula in fact applies to infinitely large sets of experiments. It
can render only approximate values of Qmax for sets amounting to several thousand experiments. Still, for such sets, a practically-largest-possible
Qmax value can be calculated exactly enough (as will be shown in a more extensive text).
In a favourable case, i.e.
if the Qmax is found to be small enough, it means that we have obtained a useful information. In the opposite case, if the
Qmax value is too large, it is not necessary to definitively discard the whole set of discoveries, but these discoveries require
an additional verification.
The true (unknown) Q value can be smaller than the calculated Qmax value. For example,
if we obtain Qmax = 30%, it is possible tha Q is less than 5%, or even less than 1%. Such a possible failure to become cognizant
of an actually satisfactory Q value is not necessarily so great a disadvantage as it may seem; because, in fact, our situation
is generally not good enough! Namely, it is not very useful to learn the percentage of true dicoveries in a large set r without
knowing the average difference between populations (i.e. the average effect size) in the set of true discoveries. If this
average effect size is negligibly small, such true discoveries can hardly be useful to us. However, unfortunately, whenever
we obtain a result significant at the 5-percent level (or 1-percent level, or the like), such an obtained statistical significance
level contains no information about the effect size! We can easily be wrong in believing that an effect size is larger than
zero, so we are even less justified in supposing that an effect size exceeds some definite larger-than-zero value. For the
same reason the 95-percent (or 99-percent) confidence intervals do not provide any additional information about the effect
size. (Explanations will be given in a more extensive text).
Of what avail would it be to make 100 discoveries with
only one of them being false, if the populations in the remaining 99 true discoveries would practically not differ from the
null populations?! Taking that into account, the apparent disadvantage (mentioned above) might in fact be considered to be
an advantage: namely, it is not bad to rely upon a calculated Qmax value, because, if the difference between Qmax and Q is
greater (i.e. if Q becomes smaller in comparison with Qmax), the average effect size in the set of discoveries (r) becomes
smaller. In an extreme case the average effect can even be negligibly small. So, there would not necessarily be much damage
even if such discoveries were lost. However, they don't have to be lost, because we can further test and verify them.
the opposite case, if a very small Qmax value is obtained, we know that the average effect is not negligible (and it can even
be large), because it corresponds to to the average power of the used statistical tests, which is f > r/n (the value r/n
is known in such a case).
In a SINGLE experiment it is important to attain a very high (or even an extremely high)
significance level - i.e. a very small p value (smaller than a very small probability ¤ [=alpha]), because that makes possible
to assess the effect size. For example: If 20% of untreated patients die (of a certain disease), and if an investigation shows
that only 14% of 5000 treated patients died - which is extemely significant (p = 4×10^-26) - then we can reject not only the
null hypothesis (implying 20% of deaths among the treated patients) but also a hypothesis about 17% of deaths in the treated
population can be rejected, because p is still very small (p = 2×10^-8). Thus we learn that the treatment saves 3% of the
patients, which means that the effect size is not only larger than zero but is also larger than 3%.
Perhaps, the greatest
problem is, that we cannot even rely on the truth of the gathered data that are to be statistically elaborated. It happens
not seldom that untrue or "strained" data are used either unintentionally or consciously, which makes it easy to achieve new
"discoveries". There are also fallacies on the part of some other experts who sincerely intend to be quite objective. (A few
axamples will be given in the more extensive text. Such is, e.g., the fallacy about 95-percent or 99-percent confidence intervals
which are imagined to make possible assessing the effect sizes of not-very-highly significant "discoveries").
to me that some experts do not quite understand what exactly should be written and explained in every textbook of statistics,
which is the following:
IT IS WRONG (AT LEAST IN THE TODAY'S SITUATION) TO REJECT A NULL HYPOTHESIS, IN A SINGLE EXPERIMENT,
ON THE GROUND OF A RELATIVELY LOW LEVEL OF STATISTICAL SIGNIFICANCE (FOR EXAMPLE 5%, PERHAPS ALSO 1%)!
That is wrong because
today the risk is unknown (as distinguished from what is written in textbooks!).
A more correct proceeding would be
EITHER (1) we should consider the number (r) of significant results ("discoveries") obtained in a known large number
(n) of experiments and hence (approximately) calculate the maximal percentage (Qmax) of false discoveries among all the r
OR (2) we should reject a null hypothesis in a SINGLE experiment only on the ground of a much-higher
significance level (than those levels that are currently usual) i.e. a much-smaller p value, e.g.
near to 0,000 000 001
= 10^-9 or (perhaps)
p<0,000001 = 10^-6
(although I do not rule out the possibility that some other p-value might,
PERHAPS, suffice for rejecting null hypotheses in single experiments, say p<0,0001 (?), or p<0,001 (??) but THAT MUST
BE PROVED in the first place!)
[I wrote about the above problems in 1981 to 1989 (see the References below: Soric
B., 1981, 1989). Other authors also wrote about the incorrect understanding of the meaning and usefulness of statistical significance
(for example: Iyengar S. and Greenhouse J. B. 1988; Morrison D. E and Henkel R. E. 1973; Oakes M. W. 1986; Soric B. and Petz
B. 1987; see References, below)].
Accordingly, in what manner should the researchers act?
It is not bad to publish
results with the achieved p-values about 0,05 to 0,001 (or the like), but these should not be considered as sufficiently-verified
discoveries! Instead, such results should be understood only as "proposals" for further verification (until, perhaps, the
really-sufficient level of significance is established for single experiments). If an alternative hypothesis is really true,
it will be easy to reach a much-higher significance level by repeating such a single experiment on larger samples;
large-enough sets of experiments can be repeated in order to achieve a small-enough Qmax proportion (possibly even at very
low significance levels, such as, e.g.: p=0,05 or p=0,1 or 0,2 etc.). Unless the experiments are repeated, the results must
not be considered true discoveries, even if extremely-high significance levels are attained, because of the possible (unintentional)
systematic errors in gathering the data (or some other possible incorrectness or mistake).
text (as well as the more extensive one) contains mainly those explications which had to be omitted, for the sake of briefness,
in my previously published works (see the References below: Soric B., 1981, 1989).
In addition, I shall here give a more-general
formula, which makes it possible to calculate a Qmax value when the number (n) of experiments is unknown.
have also dealt with the problems of inference based on statistical testing (see the References below).
I don't know
if my efforts will help to change things for the better. Nevertheless, I am writing this, because it might be even less useful
to give up trying.
I am publishing this text without having asked anybody to review it, because all the essential
topics have already been published in my previous articles after having been approved by competent reviewers. Still, due to
the lack of a new review, THERE MAY BE MISTAKES IN THIS TEXT, and I beg the readers to alert me about the possible mistakes
in order that I can correct them.
THE END OF THE SURVEY
(I hope to publish here a more-or-less-extensive text
in a few weeks or months).
Pearson K.: "Science and Monte Carlo", Fortnightly Review, Februray 1894
Hogben L.: Mathematics for
the Million, Pan Books Ltd., London 1967., p. 589.
Petz B.: Osnovne statisticke metode za nematematicare, 3. izdanje,
Naklada Slap, Jastrebarsko 1997., pp.126. to 138.
Serdar V.: Udzbenik statistike, Skolska knjiga, Zagreb 1977., p.
Soric B.: "Kritika statistickog odlucivanja", Zdravstvo, 23, 1981. (pp. 143.-153.)
Soric B.: "Poboljsanje
metode i kontrola ispravnosti statistickog odlucivanja", Zdravstvo, 23, 1981. (pp. 154-170.)
Soric B. i Petz B.: "Koliki
postotak znanstvenih otkrica nisu otkrica?", Arhiv za higijenu rada i toksikologiju, 38, 1987. (pp. 251-260.)
B.: "Statisitcal 'Discoveries' and Effect-size Estimation", Journal of the American Statistical Association, Vol. 84, no.
406 (Theory and Methods), 1989 (pp. 608-610.)
Morrison D. E. and Henkel R. E. (eds.): The Significance Test Controversy
(2nd ed.), Aldine, Chicago, 1973
Oakes M. W.: Statistical Inference: A Commentary for the Social and Behavioural Sciences,
John Wiley, 1986, New York
Moore D.S.: Statisitcs Concepts and Cotroversies, 4th ed., W. H. Freeman and Company, 1997
Vrhovac B., i dr.: Klinicko ispitivanje lijekova, Skolska knjiga, Zagreb, 1984. (pp.27-28., 65.)
S. and Greenhouse J. B.: "Selection Models and the File Drawer problem", Statistical Science, 3, 1988 (pp. 109-135.)
Some links (added on July 17, 2009)
Sorić B.: "Statistical 'Discoveries' and Effect-size Estimation", Journal of the American Statistical
Association, Vol. 84, No. 406 (Theory and Methods), 1989 (str. 608-610.)
UW Biostatistics Working Paper Series, University
of Washington, Year 2005 Paper 259 / Storey, John D.: The Optimal Discovery Procedure:
A New Approach to Simultaneous Significance
Testing / 3 Optimal Discovery Procedure: Theory / 3.1 Optimality goal
(Page 5) "(......) ....expected number of false positives (EFP). (......) Similarly, the sum of powers across all true alternative hypotheses is the ETP. Even
though there are many ways in which one can combine these values in order to assess the performance of multiple significance
tests, this particular one is focused on the overall “discovery rate” (Soric 1989). (......) An exact equality exists for large numbers of tests
with certain convergence properties (Storey et al. 2004), under Bayesian mixture model assumptions (Storey 2003), and under
alternative definitions of the FDR (Benjamini & Hochberg 1995, Storey 2003) (......)" 6.2 False discovery rate optimality
by the ODP (Page 16) "(......) The FDR is the proportion of false
positives among all tests called significant (Soric 1989, Benjamini & Hochberg 1995) (......)"
J. R. Statist.
Soc. B (1995) 57, No. 1, pp. 289-300 / Controlling the--- False Discovery Rate: a Practical and Powerful
Approach to Multiple -Testing / By YOAV BENJAMINI and YOSEF HOCHBERG / Tel Aviv University, Israel [Received January 1993. Revised March 1994]
290) "(......) In this work
we suggest a new point of view on the problem of multiplicity. (......) ....a desirable error rate to control may be the expected
proportion of errors among the rejected hypotheses, which we term the false discovery rate (FDR). This criterion integrates
Spjotvoll's (1972) concern about the number of errors committed in multiple-comparison problems, with Soric's (1989 concern
about the probability of a false rejection given a rejection. We use the term FDR after Soric (1989), who identified a rejected
hypothesis with a 'statistical discovery' (......)"
PMC Biophysics / Genome Biology
A reanalysis of a published Affymetrix GeneChip
Alan R Dabney and John
Department of Biostatistics, University of Washington, Seattle,
WA 98195, USA
Biology 2006, 7:401 doi:10.1186/gb-2006-7-3-401
Published: 22 March 2006 "(......)
False-discovery rates were originally proposed by Soric  and Benjamini and Hochberg . The g-value was developed as the FDR analog of the p-value [4-7]. There is sound statistical justification behind both FDR and g-value
Statistical Rules of Thumb
(Page 2) "(......) Hochberg (see Benjamini and Hochberg, 1995)
made a fundamental contribution to the multiple comparison problem by defining the False Discovery Rate (FDR). Their work can be linked to a seminal paper by Sorić (1989). Rather than fixing the Type I error rate he
proposed fixing the rejection region. (......) The Hochberg approach has found particular usefulness in situations where there
are many multiple comparisons
such as in microarray analysis with hundreds or even thousands of comparisons. Storey (2002) has sharpened the Hochberg procedure
Resampling-based multiple testing
for microarray data analysis
Yongchao Ge, Sandrine Dudoit, and Terence
P. Speed; Jan.
2003 Technical Report # 633 / Department of Statistics, University of California,
Berkeley / Division of Biostatistics,
University of California, Berkeley / Division
of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical Research, Australia /
"(......) Since a typical microarray experiment measures expression levels for several thousand genes simultaneously,
we are faced with an extreme multiple testing problem. (......)" (Page 4)
"(......) The present paper introduces a new algorithm for computing the Westfall
& Young (1993) step-down minP adjusted p-values. A second line of multiple testing is developed by Benjamini &
Hochberg (1995). They propose procedures to control the false discovery rate. This was further developed by Storey (2002)
(......)" (Page 21)
Benjamini & Hochberg (1995) suppose that V=R = 0 when R = 0, while Storey (2002) uses the conditional expectation
of V=R given R > 0, termed the positive false discovery rate. Earlier ideas related to FDR can be found in Seeger (1968)
and Sorić (1989) (......)".
On the False Discovery Rates of a
Frequentist: Asymptotic Expansions
Anirban DasGupta and Tonglin Zhang, Department of Statistics, Purdue University (2006)
(Page 1) "Abstract: Consider a testing problem for the null hypothesis (......) The standard frequentist
practice is to reject the null hypothesis when the p-value is smaller than a threshold value (alpha), usually 0.05. We ask
the question how many of the null hypotheses a frequentist rejects are actually true.
(......) We show that the Benjamini-Hochberg FDR in fact converges to
(delta n) almost surely under g for any fixed n. For one-sided null hypotheses, we derive a third order asymptotic expansion for (delta n).... (......) 1. Introduction - In a strikingly interesting short note, Sorić  raised the question
of establishing upper bounds on the proportion of fictitious statistical discoveries in a battery of independent experiments.
Thus, if m null hypotheses are tested independently, of which mo
are rejected at a significance level (alpha), and another S among the
false ones are also rejected, Sorić essentially suggested
E(V)/(V +S) as a measure of the false discovery rate in the chain of m independent
experiments. Benjamini and Hochberg  then looked at the question in much greater detail and gave a careful discussion....
(......) The practical importance comes from its obvious relation to statistical
discoveries made in clinical trials, and in modern microarray experiments. The continued importance of the problem is reflected
in two recent articles, Efron , and Storey , who provide serious Bayesian connections and advancements in the problem.
See also Storey , Storey, Taylor and Siegmund , Storey and Tibshirani , Genovese and Wasserman , and Finner
and Roters , among many others in this currently active area.
the same time that Sorić raised the issue of fictitious frequentist discoveries made by a mechanical adoption of the
use of p-values, a different debate was brewing in the foundation literature. Berger and Sellke , in a thought provoking
article, gave analytical foundations to the thesis in Edwards, Lindman and Savage  that the frequentist practice of rejecting
a sharp null at a traditional 5% level amounts to a rush to judgment against the null hypothesis. (......)".
imsart ver. 2006/01/04
file: pvaluerev.tex date: April 9, 2006