In part 1, we defined evidence and showed that evidence across independent studies can be aggregated by addition; if Alice’s results provide 2 units of evidence and Bob’s results provide 3 units of evidence then we have a total of 5 units of evidence. The problem with this is that it doesn’t account for our intuition that single experiments cannot be trusted too much until they are replicated. 10 congruent studies each reporting 2 units of evidence should prevail over one conflicting study showing -20 units of evidence.

Let’s try to model this by assuming that every experiment has a chance of being flawed due to some mistake or systematic error. Each study can have its own probability of failure, in which case the results of that experiment should not be used at all. This is our first assumption: that any result is either completely valid or completely invalid. It is a simplification but a useful one.

We define trust (T) in a particular study as the logarithm of the odds ratio for the being valid versus being invalid. In formal terms:

I am going to introduce a statistical framework for quantifying evidence as a series of blog posts. My hope is that by doing it through this format, people will understand it, build on these ideas, and actually use it as a practical replacement for p-value testing. If you haven’t already seen my post on why standard statistical methods that use p-values are flawed, you can check it out through this link.

My proposal builds on Bayesian hypothesis testing. Bayesian hypothesis testing makes use of the Bayes factor, which is the likelihood ratio of observing some data D for two competing hypotheses H_{1} and H_{2}. A Bayes factor larger than 1 counts as evidence in favor of hypothesis H_{1}; a smaller than one Bayes factor counts as evidence in favor of H2.

In classical hypothesis testing, we typically set a threshold for the p-value (say, p<0.01) below which a hypothesis can be rejected. But in the Bayesian framework, no such threshold can be defined as hypothesis rejection/confirmation will depend on the prior probabilities. Prior probabilities (i.e., the probabilities assigned prior to seeing data) are subjective. One person may assign equal probabilities for H_{1} and H_{2}. Another may think H_{1} is ten times more likely than H_{2}. And neither can be said to be objectively correct. But the Bayesian method leaves this subjective part out of the equation, allowing anyone to multiply the Bayes factor into their own prior probability ratio to obtain a posterior probability ratio. Depending on how likely you think the hypotheses are, you may require more or less evidence in order to reject one in favor of the other.

Let us define ‘evidence‘ as the logarithm of the Bayes factor. The logarithmic scale is much more convenient to work with, as we will quickly see.

Evidence is a quantity that depends on a particular observation or outcome and relates two hypothesis to one another. It can be positive or negative. For example, one can say Alice’s experimental results provide 3 units of evidence in favor of hypothesis H_{1} against hypothesis H_{2}, or equivalently, -3 units of evidence in favor of hypothesis H_{2} against hypothesis H_{1}.

Suppose I tell you that only 1% of people with COVID have a body temperature less than 97°. If you take someone’s temperature and measure less than 97°, what is the probability that they have COVID? If your answer is 1% you have committed the conditional probability fallacy and you have essentially done what researchers do whenever they use p-values. In reality, these inverse probabilities (i.e., probability of having COVID if you have low temperature and probability of low temperature if you have COVID) are not the same.

To put it plain and simple: in practically every situation that people use statistical significance, they commit the conditional probability fallacy.

When I first realized this it hit me like a ton of bricks. P-value testing is everywhere in research; it’s hard to find a paper without it. I knew of many criticisms of using p-values, but this problem was far more serious than anything I had heard of. The issue is not that people misuse or misinterpret p-values. It’s something deeper that strikes at the core of p-value hypothesis testing.

This flaw has been raised in the literature over and over again. But most researchers just don’t seem to know about it. I find it astonishing that a rationally flawed method continues to dominate science, medicine, and other disciplines that pride themselves in championing reason and rationality.