## Quantifying Evidence (2): Evidence Is Limited By How Much a Study Can Be Trusted

In part 1, we defined evidence and showed that evidence across independent studies can be aggregated by addition; if Alice’s results provide 2 units of evidence and Bob’s results provide 3 units of evidence then we have a total of 5 units of evidence. The problem with this is that it doesn’t account for our intuition that single experiments cannot be trusted too much until they are replicated. 10 congruent studies each reporting 2 units of evidence should prevail over one conflicting study showing -20 units of evidence.

Let’s try to model this by assuming that every experiment has a chance of being flawed due to some mistake or systematic error. Each study can have its own probability of failure, in which case the results of that experiment should not be used at all. This is our first assumption: that any result is either completely valid or completely invalid. It is a simplification but a useful one.

We define trust (T) in a particular study as the base-10 logarithm of the odds of the study being valid versus being invalid. In formal terms:

$$T = \log_{10} \frac{\Pr(\text{valid})}{\Pr(\text{invalid})}$$

A trust T=2 corresponds to a belief that the odds of the study being flawed are 1 to 100. T=3 corresponds to odds of 1 to 1000. In my view, 1<T<3 is reasonable for the typical study. But trust is something subjective that cannot be objectively calculated. Much like priors, it depends on the person sitting outside of the study interpreting its results.
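To get a feel for the scale, here is a minimal sketch that converts between trust and the probability that a study is valid. The function names are my own for illustration; the only thing assumed is the definition above (trust as base-10 log odds of valid versus invalid).

```python
import math


def trust_to_probability(trust: float) -> float:
    """Convert trust T (base-10 log odds of valid vs. invalid) to Pr(valid)."""
    odds = 10.0 ** trust
    return odds / (1.0 + odds)


def probability_to_trust(p_valid: float) -> float:
    """Convert Pr(valid) back to trust T."""
    return math.log10(p_valid / (1.0 - p_valid))


# Print the probability of validity implied by a few trust values.
for t in (1.0, 1.5, 2.0, 3.0):
    print(f"T = {t}: Pr(valid) = {trust_to_probability(t):.4f}")
```

Because the scale is logarithmic, moving from T=2 to T=3 shrinks the assumed chance of a flawed study by a factor of roughly ten.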

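Take a study with data $D$ that reports evidence $E$. That study can either be valid or invalid (represented by $V$ and $\bar{V}$). The reported evidence was calculated under the assumption that the study was valid. So:

$$E = \log_{10} \frac{\Pr(D \mid H_1, V)}{\Pr(D \mid H_2, V)}$$

From the perspective of an observer interpreting a study with trust $T$, we can calculate the effective evidence, $E_{\text{eff}}$. The observer does not know whether the study is valid, so each likelihood is a mixture over the valid and invalid cases (treating validity as independent of which hypothesis is true):

$$E_{\text{eff}} = \log_{10} \frac{\Pr(D \mid H_1)}{\Pr(D \mid H_2)} = \log_{10} \frac{\Pr(D \mid H_1, V)\Pr(V) + \Pr(D \mid H_1, \bar{V})\Pr(\bar{V})}{\Pr(D \mid H_2, V)\Pr(V) + \Pr(D \mid H_2, \bar{V})\Pr(\bar{V})}$$

We define G as the resulting evidence in case the study is invalid.

$$G = \log_{10} \frac{\Pr(D \mid H_1, \bar{V})}{\Pr(D \mid H_2, \bar{V})}$$

Here we are going to make another simplification: G = 0. The second assumption is that if a study is invalid, it provides no evidence for or against H1 versus H2. This means $\Pr(D \mid H_1, \bar{V}) = \Pr(D \mid H_2, \bar{V}) = \Pr(D \mid \bar{V})$. Substituting these values, dividing through by $\Pr(\bar{V})$, and using $\Pr(V)/\Pr(\bar{V}) = 10^{T}$, we get:

$$E_{\text{eff}} = \log_{10} \frac{10^{T}\Pr(D \mid H_1, V) + \Pr(D \mid \bar{V})}{10^{T}\Pr(D \mid H_2, V) + \Pr(D \mid \bar{V})}$$

And to simplify the above formula we define another term: believability. Believability (B) is defined below, written for the case where the data favor H1 (E ≥ 0).

$$B = -\log_{10} \frac{\Pr(D \mid \bar{V})}{\Pr(D \mid H_1, V)}$$

Substituting B we get the following:

$$E_{\text{eff}} = \log_{10} \frac{10^{T+B} + 1}{10^{T+B-E} + 1}$$

When the data favor H2 instead (E < 0), the same expressions hold with H1 and H2 exchanged, which flips the sign of the result and replaces E with |E|.

It’s alright if you didn’t closely follow the math up to here. What is important is that we now have a formula for calculating effective evidence $E_{\text{eff}}$ based on reported evidence $E$, trust $T$, and believability $B$.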

The reported evidence is an objective number we get from the study. Trust is a subjective quantity that the people interpreting the study must determine for themselves, independent of the outcome of the study. Believability is a bit more complex. Believability is a number ascribed to a particular outcome or observation, much like evidence is. But in contrast to evidence, believability cannot be determined objectively. This is because of the term $\Pr(D \mid \bar{V})$ (the probability of the data given that the study is invalid), which has to be determined by the interpreter; it is subjective and can vary for different people. I will write more about believability in the next part of this series. (Suffice it to say that a study can be designed to guarantee a believability of B≥0.)

To gain a better understanding about how the above formula works, I made the following plot.

Effective evidence begins to grow linearly with respect to reported evidence. But it plateaus at (T+B). In other words, evidence is effectively limited by how much a study can be trusted plus the believability of the study’s outcome. To first approximation, the magnitude of effective evidence is roughly equal to min(|E|, T+B). This approximation is least accurate when |E| ≈ T+B or when T+B < 1.
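The plateau can be checked numerically. The sketch below assumes the effective-evidence formula has magnitude log10((10^(T+B) + 1) / (10^(T+B−|E|) + 1)) and carries the sign of E, with B measured relative to the hypothesis the data favor. That form is consistent with the behavior described above, but the function name and signature are mine.

```python
import math


def effective_evidence(reported: float, trust: float, believability: float = 0.0) -> float:
    """Effective evidence for a study reporting `reported` units of evidence.

    Assumed form: sign(E) * log10((10**(T+B) + 1) / (10**(T+B-|E|) + 1)).
    """
    cap = trust + believability
    magnitude = math.log10((10.0 ** cap + 1.0) / (10.0 ** (cap - abs(reported)) + 1.0))
    return math.copysign(magnitude, reported)


# Compare the exact value with the min(|E|, T+B) approximation for T=3, B=0.
trust, believability = 3.0, 0.0
for reported in (0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 20.0):
    exact = effective_evidence(reported, trust, believability)
    approx = min(abs(reported), trust + believability)
    print(f"E = {reported:5.1f}  effective = {exact:6.3f}  min(|E|, T+B) = {approx:4.1f}")
```

The two columns should agree closely except near |E| = T+B, where the exact curve bends smoothly instead of turning a sharp corner.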

This formalizes our intuition that no single study can be used to decisively confirm or deny a hypothesis, no matter how strong the evidence turns out to be in that study. The amount of trust one places in a study limits the amount of evidence that can be acquired from it. For example, if you place a trust of T=1.5 in the typical paper, no single study can convince you by more than 1.5 units of evidence (assuming B=0; more on believability later). You would need to add the effective evidence from multiple independent studies to establish that there is higher than 1.5 units of evidence for something. This aspect of our framework is nice, because astronomically large or small values are commonplace when working with likelihood ratios. By accounting for trust, extremely large amounts of reported evidence are no longer extremely informative.

A meta-analysis of multiple studies can be done by calculating the effective evidence for each study and then summing the values. 10 studies that each report 2 units of evidence will almost certainly prevail over one conflicting study that reports -20 units of evidence, given that no study can reasonably be trusted with T≥20. If T = 3 and B = 0, then the overall evidence in this case is approximately 10×2-3 = 17. (Each of the first 10 studies will have an effective evidence of about 2 and the single conflicting study will have an effective evidence of about -3.)
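Here is a minimal sketch of that meta-analysis. It assumes the same effective-evidence form as before (magnitude log10((10^(T+B) + 1) / (10^(T+B−|E|) + 1)), signed like E); `pooled_evidence` is a hypothetical helper name.

```python
import math


def effective_evidence(reported: float, trust: float, believability: float = 0.0) -> float:
    """Assumed form: sign(E) * log10((10**(T+B) + 1) / (10**(T+B-|E|) + 1))."""
    cap = trust + believability
    magnitude = math.log10((10.0 ** cap + 1.0) / (10.0 ** (cap - abs(reported)) + 1.0))
    return math.copysign(magnitude, reported)


def pooled_evidence(reported_values, trust: float, believability: float = 0.0) -> float:
    """Sum the effective evidence of independent studies."""
    return sum(effective_evidence(e, trust, believability) for e in reported_values)


# Ten congruent studies reporting 2 units each, plus one conflicting study reporting -20.
studies = [2.0] * 10 + [-20.0]

naive_total = sum(studies)                   # plain addition of reported evidence
pooled = pooled_evidence(studies, trust=3.0)  # addition after capping by trust

print(f"naive sum of reported evidence: {naive_total:.2f}")
print(f"sum of effective evidence (T=3, B=0): {pooled:.2f}")
```

Plain addition lets the single extreme study cancel the other ten. After accounting for trust, each weak study keeps almost all of its 2 units while the conflicting study is capped near -3, so the replicated result prevails.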

Now, here is a problem that will lead us to the next part. How do we deal with believability? From the perspective of a researcher, we would like to minimize it since it effectively limits the evidence that can be deduced from a study.

If the outcome of an experiment is a continuous value, then all the probabilities in the above formulas become marginal probabilities, meaning that the denominator can get infinitesimally small. The numerator depends on the person evaluating our study and can be infinitely large for some inconveniently skeptical interpreter. So there is no limit to how negative believability can get! If believability is not dealt with in a study, there is no guarantee that an interpreter will take away any information from that study. What can be done to guarantee something like this will not happen? I will discuss this in part 3.

## Quantifying Evidence (1): What Are Units of Evidence?

I am going to introduce a statistical framework for quantifying evidence as a series of blog posts. My hope is that by doing it through this format, people will understand it, build on these ideas, and actually use it as a practical replacement for p-value testing. If you haven’t already seen my post on why standard statistical methods that use p-values are flawed, you can check it out through this link.

My proposal builds on Bayesian hypothesis testing. Bayesian hypothesis testing makes use of the Bayes factor, which is the ratio of the likelihoods of observing some data D under two competing hypotheses H1 and H2. A Bayes factor larger than 1 counts as evidence in favor of hypothesis H1; a Bayes factor smaller than 1 counts as evidence in favor of H2.

In classical hypothesis testing, we typically set a threshold for the p-value (say, p<0.01) below which a hypothesis can be rejected. But in the Bayesian framework, no such threshold can be defined as hypothesis rejection/confirmation will depend on the prior probabilities. Prior probabilities (i.e., the probabilities assigned prior to seeing data) are subjective. One person may assign equal probabilities for H1 and H2. Another may think H1 is ten times more likely than H2. And neither can be said to be objectively correct. But the Bayesian method leaves this subjective part out of the equation, allowing anyone to multiply the Bayes factor into their own prior probability ratio to obtain a posterior probability ratio. Depending on how likely you think the hypotheses are, you may require more or less evidence in order to reject one in favor of the other.

Let us define ‘evidence’ as the base-10 logarithm of the Bayes factor. The logarithmic scale is much more convenient to work with, as we will quickly see.

$$E = \log_{10} \frac{\Pr(D \mid H_1)}{\Pr(D \mid H_2)}$$

Evidence is a quantity that depends on a particular observation or outcome and relates two hypotheses to one another. It can be positive or negative. For example, one can say Alice’s experimental results provide 3 units of evidence in favor of hypothesis H1 against hypothesis H2, or equivalently, -3 units of evidence in favor of hypothesis H2 against hypothesis H1.

But what does, for instance, 3 units of evidence mean? How do we interpret this number? 3 units of evidence means that it was 10³ = 1000 times more likely to observe that particular outcome under hypothesis H1 compared to H2. And this number can be multiplied into one’s prior odds ratio to get a posterior odds ratio. If prior to seeing Alice’s data, you believed the probability for H1 was half that of H2 (Pr(H1)/Pr(H2) = 0.5) then after seeing Alice’s data with 3 units of evidence, you update your odds ratio to Pr(H1)/Pr(H2) = 0.5×10³ = 500. After seeing Alice’s data you attribute a probability to H1 that is 500 times larger than the probability you attribute to H2.

What’s nice about this definition is that evidence from independent observations can be added. This definition aligns with our colloquial usage of the term when we say “adding up” or “accumulating” evidence. So if Alice reports 3 units of evidence and Bob independently reports 2 units of evidence, it is as if we have a total of 5 units of evidence in favor of H1 against H2. And if Carol then comes along with new experimental data providing 1.5 units of evidence in favor of H2 against H1 (conflicting with the other studies), the total resulting evidence is 3+2-1.5 = 3.5.
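The bookkeeping above is simple enough to sketch in a few lines. This assumes evidence is the base-10 logarithm of the Bayes factor; the function names and the example likelihoods passed to `evidence` are illustrative only.

```python
import math


def evidence(likelihood_h1: float, likelihood_h2: float) -> float:
    """Units of evidence for H1 against H2: log10 of the Bayes factor."""
    return math.log10(likelihood_h1 / likelihood_h2)


def posterior_odds(prior_odds: float, total_evidence: float) -> float:
    """Update the odds Pr(H1)/Pr(H2) with the accumulated evidence."""
    return prior_odds * 10.0 ** total_evidence


# Alice and Bob favor H1; Carol's data favor H2, so her evidence is negative.
alice, bob, carol = 3.0, 2.0, -1.5
total = alice + bob + carol

print(f"evidence from a 1000:1 likelihood ratio: {evidence(0.1, 0.0001):.1f}")
print(f"odds after Alice alone (prior odds 0.5): {posterior_odds(0.5, alice):.0f}")
print(f"total evidence from all three studies: {total}")
```

Adding evidence on the log scale is the same as multiplying Bayes factors, which is why independent studies combine by simple addition.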

None of what I have written up to here is new. I am not even sure if my definition of evidence is entirely original. I’ve seen people use log likelihood ratios and call it evidence. But from here on is where we begin constructing something new.

It is commonly accepted that a scientific result needs to be replicated before it can be trusted. If two independent labs obtain congruent evidence for something (say Alice found 3 units of evidence and Bob found 2 units of evidence) it should count as stronger evidence than if just one of them found very strong evidence for it (say Alice had instead found 5 units of evidence). But Bayes factors do not seem to reflect this very well, since both cases are said to result in 5 units of evidence. To take this to an extreme, 10 independent studies all reporting 2 units of evidence in favor of H1 should prevail over one study reporting 20 units of evidence in favor of H2. But the way we currently set it up, they cancel each other out. How can we improve this framework to incorporate our intuition about the need for replication? I will discuss this in part 2.

## There Is a Fundamental Flaw in How We Do Statistics in Science

Suppose I tell you that only 1% of people with COVID have a body temperature less than 97°. If you take someone’s temperature and measure less than 97°, what is the probability that they have COVID? If your answer is 1% you have committed the conditional probability fallacy and you have essentially done what researchers do whenever they use p-values. In reality, these inverse probabilities (i.e., probability of having COVID if you have low temperature and probability of low temperature if you have COVID) are not the same.
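To see how far apart the two inverse probabilities can be, here is a small sketch using Bayes’ rule. Only the 1% figure comes from the example above; the prevalence of COVID and the rate of low temperatures among people without COVID are hypothetical numbers chosen purely for illustration.

```python
def posterior(prior: float, likelihood_if_true: float, likelihood_if_false: float) -> float:
    """Bayes' rule for a binary hypothesis: Pr(hypothesis | observation)."""
    numerator = likelihood_if_true * prior
    return numerator / (numerator + likelihood_if_false * (1.0 - prior))


p_low_temp_given_covid = 0.01     # from the example: 1% of people with COVID
p_covid = 0.02                    # hypothetical prevalence
p_low_temp_given_no_covid = 0.05  # hypothetical rate among people without COVID

p_covid_given_low_temp = posterior(p_covid, p_low_temp_given_covid, p_low_temp_given_no_covid)

print(f"Pr(low temperature | COVID) = {p_low_temp_given_covid:.3f}")
print(f"Pr(COVID | low temperature) = {p_covid_given_low_temp:.4f}")
```

The second number depends on the two quantities the original statement never mentioned, which is exactly why it cannot be read off from the first.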

To put it plain and simple: in practically every situation that people use statistical significance, they commit the conditional probability fallacy.

When I first realized this it hit me like a ton of bricks. P-value testing is everywhere in research; it’s hard to find a paper without it. I knew of many criticisms of using p-values, but this problem was far more serious than anything I had heard of. The issue is not that people misuse or misinterpret p-values. It’s something deeper that strikes at the core of p-value hypothesis testing.

This flaw has been raised in the literature over and over again. But most researchers just don’t seem to know about it. I find it astonishing that a rationally flawed method continues to dominate science, medicine, and other disciplines that pride themselves on championing reason and rationality.

## What Is Statistical Significance?

This is how hypothesis testing is done in today’s research. When researchers decide to test a hypothesis, they collect data — either from experiments they themselves design or from observational studies — and then test to see if their data statistically confirms the hypothesis. Now, instead of directly testing the hypothesis, they attempt to rule out what is called the null hypothesis. Think of the null hypothesis as the default.

For example, if I want to show that “turmeric slows cancer”, I rule out the null hypothesis that “turmeric has no effect on cancer”. The null hypothesis can be ruled out by showing that the data is unlikely to have occurred by chance if the null hypothesis were true. In our example, it would be something like saying “it is unlikely that this many people would have recovered if turmeric has no effect on cancer”. In other words, the data is statistically significant.

To quantify statistical significance, we use a measure called the p-value. The p-value for some observed data represents the probability, under the null hypothesis, of getting results at least as extreme as what was observed. The lower the p-value, the less likely it is for the observations to have occurred by chance, hence the more significant the observations. So, in the turmeric treatment example, I may obtain p=0.008, which means the probability of having at least that many patients recover by chance would have been 0.8% (assuming the null hypothesis: that turmeric actually has no effect). Since it is so unlikely for us to have obtained these results by chance, we say these results are significant and it is therefore reasonable to conclude that the drug does indeed affect cancer.
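As a concrete illustration of “at least that many patients recover by chance”, here is a minimal sketch of a one-sided binomial p-value. The trial size, the number of recoveries, and the recovery rate under the null hypothesis are all hypothetical; they are not meant to reproduce the p=0.008 figure above.

```python
from math import comb


def binomial_p_value(recovered: int, patients: int, null_rate: float) -> float:
    """Pr(at least `recovered` of `patients` recover | null hypothesis).

    Under the null, each patient recovers independently with probability `null_rate`.
    """
    return sum(
        comb(patients, k) * null_rate ** k * (1.0 - null_rate) ** (patients - k)
        for k in range(recovered, patients + 1)
    )


# Hypothetical trial: 20 patients, 20% recover without treatment, 9 recovered.
p = binomial_p_value(recovered=9, patients=20, null_rate=0.2)
print(f"one-sided p-value: {p:.4f}")
```

Note what this number is: a probability of the data given the null hypothesis, Pr(D|H0). Nothing in the calculation refers to the probability of the null hypothesis itself.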

The standard method for p-value testing is to choose a significance threshold before looking at the data. It is typical to set the threshold at p<0.05, or sometimes p<0.01. If the p-value is less than the threshold, the data is considered to be statistically significant and the null hypothesis is rejected. Otherwise, the study is considered to be inconclusive and nothing can be said about the null hypothesis.

## The Fundamental Flaw

A rational observer rules something out if its probability is low. So perhaps we can all agree that it is rational to reject the null hypothesis (H0) if its probability falls below some threshold, say 1%. Now if we gather some new data (D), what needs to be examined is the probability of the null hypothesis given that we observed this data, not the inverse! That is, Pr(H0|D) should be compared with a 1% threshold, not Pr(D|H0). In our current methods of statistical testing, we use the latter as a proxy for the former.

The conditional probability fallacy is when one confuses the probability of A given B with the probability of B given A. For example, the probability that you are sick if you have a fever is not equal to the probability that you have a fever if you are sick; if you have a fever you are very likely sick, but if you are sick the chances of you having a fever are not equally high.

By using p-values we effectively act as though we commit the conditional probability fallacy. The two values that are conflated are Pr(H0|p<α) and Pr(p<α|H0). We conflate the chances of observing a particular outcome under a hypothesis with the chances of that hypothesis being true given that we observed that particular outcome.

Now how exactly can this fallacy lead to incorrect inference? If we overestimate the probability of the null hypothesis, well, that is not such a serious problem; the study will be declared “inconclusive”. It is a more serious problem if we underestimate the probability of the null hypothesis and reject it when we shouldn’t have.

There are two sources of irrational hypothesis rejection: 1) high prior probability for the null hypothesis and 2) low statistical power.

### The Priors (or Base Rate)

One factor that can lead to irrational hypothesis rejection is the base rate (or prior probabilities). There is an illustrative example of this in the Wikipedia entry on “Base Rate Fallacy”.

The more we test hypotheses that are unlikely to be true to begin with, the higher the rate of error. For example, if the number of ineffective drugs that we test outnumbers the number of effective drugs, we will end up declaring too many drugs as effective.

A common argument that is raised in defense of p-values is that priors are inaccessible, subjective, and cannot be agreed upon. How does one measure the probability of a hypothesis independent of data? For example, how does one objectively find the probability of a drug’s effectiveness prior to any data analysis? This is a fair point. But priors are not the only factor that leads to irrational hypothesis rejection.

### It’s Not Just About Priors

Statistical power is the probability of correctly rejecting the null hypothesis. Typically, the larger the sample size in a study, the higher its statistical power (i.e. the higher the chance of being able to declare statistical significance in the data). Rarely do scientists ever calculate statistical power or impose any constraints on it. Researchers worry less about statistical power because we [incorrectly] think that the only harm in low-powered tests is obtaining inconclusive results. (Think of the case where we test a new drug and don’t find evidence that it cures cancer while it actually does.) That seems like a problem that can be fixed by collecting more data and increasing the sample size in future studies.

What we really want to avoid are type I errors (incorrectly rejecting the null hypothesis), more so than type II errors (failing to reject a null hypothesis that is false). Compare concluding that a drug cures cancer when it actually has no effect (which can have awful consequences) with failing to find evidence that a drug cures cancer when it actually does (which is bad but can be corrected through more research). Since we are primarily concerned with type I errors, we impose constraints on the significance levels of our tests. So we set a 0.05 threshold for p-values if a 5% type I error rate is deemed acceptable. However, this line of reasoning is problematic. As I show below, low statistical power can lead to more frequent type I errors.

## Calculating the Error Rate

Let us calculate the probability of unjustifiably rejecting the null hypothesis. Remember, a rational person sets a threshold for Pr(H0|p<0.01), not Pr(p<0.01|H0), given they consider 1% probability to be an appropriate threshold for rejecting a hypothesis.

Assume that we have designed a test with statistical significance threshold α and statistical power 1-β to reject the null hypothesis H0. What is the probability that a conclusive study commits a type I error? In other words, what is the probability that H0 is true given that the data passes our significance test (p<α)? Using Bayes’ rule, we have:

$$\Pr(H_0 \mid p<\alpha) = \frac{\Pr(p<\alpha \mid H_0)\Pr(H_0)}{\Pr(p<\alpha \mid H_0)\Pr(H_0) + \Pr(p<\alpha \mid \neg H_0)\Pr(\neg H_0)} = \frac{\alpha \Pr(H_0)}{\alpha \Pr(H_0) + (1-\beta)\Pr(\neg H_0)}$$

Let us set the probability of committing a type I error to be less than e and rearrange the equation.

$$1-\beta > \alpha \cdot \frac{1-e}{e} \cdot \frac{\Pr(H_0)}{\Pr(\neg H_0)}$$

This formula makes it clear how the significance level, statistical power, and the prior odds ratio can affect the rate of irrational hypothesis rejection. Assuming a prior odds ratio of 1 and α=0.01 (meaning that p<0.01 implies statistical significance), our test must have a statistical power of 1-β>99% to achieve an error rate of e<1%. That is quite a high standard for statistical power.
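The relationship can be explored numerically. The sketch below assumes the Bayes’ rule expression Pr(H0 | p<α) = α·Pr(H0) / (α·Pr(H0) + (1-β)·Pr(¬H0)) and its rearrangement for the minimum power; the function names are mine.

```python
def false_rejection_probability(alpha: float, power: float, prior_null: float) -> float:
    """Pr(H0 is true | the test rejected H0), by Bayes' rule."""
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


def required_power(alpha: float, max_error: float, prior_odds_null: float) -> float:
    """Minimum power (1 - beta) that keeps Pr(H0 | rejected) below `max_error`.

    `prior_odds_null` is Pr(H0) / Pr(not H0).
    """
    return alpha * (1.0 - max_error) / max_error * prior_odds_null


# The case from the text: prior odds of 1, alpha = 0.01, target error rate 1%.
print(f"required power: {required_power(0.01, 0.01, 1.0):.2f}")
print(f"error rate at that power: {false_rejection_probability(0.01, 0.99, 0.5):.3f}")

# The same alpha with a weaker test gives a noticeably higher error rate.
print(f"error rate at 35% power: {false_rejection_probability(0.01, 0.35, 0.5):.3f}")
```

Holding α fixed, lowering the power or raising the prior probability of the null both push the error rate among rejections upward.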

The diagram below is a graphical demonstration of the formula above. The math should be sufficiently convincing, but some feel that the Bayesian philosophy towards probability may give different results than the frequentist philosophy towards probability, and p-value tests were founded on the latter. So I made the figure below to visualize it. Each scientific study is represented by a paper.

In this figure, the significance level was set at p < 0.05. But the error rate among conclusive studies is not 5%. It is 41.5%. Close to half of the conclusive studies in the diagram commit type I errors. This is despite the low significance level α=0.05 and the modest statistical power (1-β = 35%). Notice how decreasing statistical power (1-β) is like moving the dashed line between the green and blue area to the right. This will lead to a smaller green area and therefore a larger error rate. Likewise notice how increasing the base rate (i.e. testing more hypotheses that are unlikely to begin with) is like moving the horizontal dashed line down, resulting in more red, less green, and a higher rate of type I errors.
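For readers who prefer a purely frequentist check, the same effect shows up when simply counting simulated studies. The sketch below uses α=0.05 and 35% power as in the figure; the fraction of studies in which the null hypothesis is actually true (80%) is a hypothetical choice of mine, so the resulting rate is not meant to reproduce the figure’s number.

```python
import random


def simulate_error_rate(n_studies: int, alpha: float, power: float,
                        prior_null: float, seed: int = 0) -> float:
    """Fraction of null-rejecting studies in which the null was actually true."""
    rng = random.Random(seed)
    rejections = 0
    false_rejections = 0
    for _ in range(n_studies):
        null_is_true = rng.random() < prior_null
        # A true null is rejected with probability alpha, a false one with probability `power`.
        reject_probability = alpha if null_is_true else power
        if rng.random() < reject_probability:
            rejections += 1
            if null_is_true:
                false_rejections += 1
    if rejections == 0:
        raise ValueError("no study rejected the null; increase n_studies")
    return false_rejections / rejections


rate = simulate_error_rate(n_studies=200_000, alpha=0.05, power=0.35, prior_null=0.8)
print(f"share of conclusive studies that are type I errors: {rate:.3f}")
```

No Bayesian interpretation is needed here: the rate is a plain long-run frequency among the studies that declared significance, and it lands far above the nominal 5%.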

Note, the standard terminology can be very misleading. The “type I error rate” is defined as the rate of error among the cases where the null hypothesis is true, rather than the rate of error among the cases where the null hypothesis was rejected. What we should be looking at is the error rate among the conclusive studies, i.e. studies that reject the null hypothesis.

## Dodgy Rationalizations

It is often said that if you use p-values appropriately and do not misinterpret them, then all is fine. “Just understand what they mean and don’t draw inappropriate conclusions”. So what is a p-value supposed to mean then? How is it intended to be used?

What we call null-hypothesis testing is a hybrid between two [incompatible] approaches that are each fallacious by themselves: Fisher’s approach and the Neyman-Pearson approach.

Ronald Fisher, who popularized p-values, stated that p-values can be interpreted as “a rational and well-defined measure of reluctance to accept the hypotheses they test”. This statement is demonstrably false. Reluctance to accept a hypothesis should be measured by the probability of that hypothesis. And it is irrational to confuse inverse probabilities. Fisher interpreted p-values as a continuous measure of evidence against the null hypothesis, rather than something to be used in a test with a binary outcome.

The Neyman-Pearson approach, on the other hand, avoids assigning any interpretation to p-values. Rather, it proposes a “rule of behavior” for using p-values: one must reject the null hypothesis when its p-value falls below a predetermined threshold. Neyman and Pearson dodge the issue of inference and claim that this decision process leads to sufficiently low error rates in the long run.

But to rationalize p-values using a decision-making framework is absurd. Why would it matter what someone believes in if they behave indistinguishably from someone committing the conditional probability fallacy?

## Stop Trying to Salvage P-values

The concept of p-values is nearly a hundred years old. Despite the fact that its fundamental problem has long been known, measuring significance remains the dominant method for testing statistical hypotheses. It is difficult to do statistics without using p-values and even more difficult to get published without statistical analysis of data.

However, things may be changing with the rise of the replication crisis in science [1, 2]. The extent to which the replication crisis can be attributed to the use of p-values is under debate. But awareness of it has unleashed a wave of criticism and reconsideration of p-value testing. In 2016 the American Statistical Association published a statement saying: “Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis or about the probability that random chance produced the observed data. The p-value is neither.” And in 2015 the journal Basic and Applied Social Psychology completely banned the use of p-values, declaring that “The Null Hypothesis Significance Testing Procedure (NHSTP) is invalid” [3, 4].

Some have suggested abandoning statistical significance in favor of using continuous unthresholded p-values as a measure of evidence (Fisher’s approach) [5, 6, 7]. Others have suggested abandoning p-values as a continuous measure in favor of a dichotomous statistical significance (Neyman-Pearson’s approach) [8]. And others have suggested using more stringent thresholds for statistical significance [9]. But neither Fisher’s nor Neyman-Pearson’s approach is mathematically sound.

The mathematically sound approach is to abandon p-values, statistical significance, and null hypothesis testing altogether and “to proceed immediately with other measures”, which is a “more radical and more difficult [proposal] but also more principled and more permanent” (McShane et al. 2018).

What alternatives do we have to p-values? Some suggest using confidence intervals to estimate effect sizes. Confidence intervals may have some advantages but they still suffer from the same fallacies (as nicely explained in Morey et al. 2016). Another alternative is to use Bayes factors as a measure for evidence. Bayesian model comparison has been around for nearly two decades but has not gained much traction, for a number of practical reasons.

The bottom line is that there is practically no correct way to use p-values. It does not matter if you understand what they mean or if you frame them as a decision procedure rather than a method for inference. If you use p-values you are effectively behaving like someone who confuses conditional probabilities. Science needs a mathematically sound framework for doing statistics.

In future posts I will suggest a new, simple framework for quantifying evidence. This framework is based on Bayes factors but makes a basic assumption: that every experiment has a probability of error that cannot be objectively determined. From this basic assumption a method of evidence quantification emerges that is highly reminiscent of p-value testing but is 1) mathematically sound and 2) practical. (In contrast to Bayes factors, it produces numbers that are not extremely large or small.)