ABOUT THE INTERPRETATION OF STATISTICAL TESTS *

 

Albert FRANK

 

 

PART I

 

Let's assume the following hypothesis: if the reliability of a dichotomous test is f, then the probability that it gives a wrong result is 1-f.

The following question arises: below what reliability does a positive test result have a probability of less than 0.5 of being correct?

Let P be the number of elements in the population, a the (known) probability that an element of this population has a definite feature K, and f the reliability of the test. The number of K-elements detected by the test equals a f P. The number of non-K-elements detected (wrongly) is (1-a) (1-f) P. The probability that an element detected by the test is actually a K-element is therefore a f P / (a f P + (1-a) (1-f) P). It equals 0.5 when a f P = (1-a) (1-f) P, which is equivalent to f = 1-a. So, as soon as f ≤ 1-a, the test becomes meaningless.
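
The reasoning above can be checked numerically. What follows is only a minimal Python sketch under the hypothesis stated above; the function name prob_detection_valid is an illustrative choice and does not come from the article. The population size P cancels out of the ratio, so it is not passed in.

```python
def prob_detection_valid(a, f):
    """Probability that an element detected by the test really is a K-element.

    a: proportion of K-elements in the population (0 < a < 1)
    f: reliability of the test; it gives a wrong result with probability 1 - f
    """
    valid = a * f                # share of the population correctly detected
    invalid = (1 - a) * (1 - f)  # share of the population wrongly detected
    return valid / (valid + invalid)


a = 0.01
threshold = 1 - a  # reliability at which a detection is right half the time

print(round(prob_detection_valid(a, threshold), 3))
print(prob_detection_valid(a, 0.95) < 0.5)
```

With a = 0.01, the first line printed is the probability at the threshold f = 1-a, and the second shows that a reliability of 0.95 already falls on the wrong side of it.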

The rarer the feature a test attempts to detect, the more reliable the test must be.

This simple fact is very often neglected.

           

Let's take an example: the alcohol test. We assume that one driver out of 100 is at '0.8 or more' (the European norm for a serious offence is in excess of 0.8 g/L). In the following table, we examine, for several reliabilities of the test, the probability that somebody with a positive test is actually positive. We take a population of 100,000 persons, of whom 1,000 are assumed to be 'at 0.8 or more.'

 

 

Reliability    Valid         Invalid       Probability a
of the test    detections    detections    "detection" is valid

.999                  999            99    0.91
.99                   990           990    0.5
.95                   950          4950    0.16
.9                    900          9900    0.08
.8                    800         19800    0.04
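
The rows of the table can be recomputed with a few lines of Python. This is a sketch only: the function name table_row and the rounding of counts to whole persons are assumptions made for the illustration, while the population of 100,000 and the proportion of 1 in 100 are the figures used in the example.

```python
population = 100_000
a = 0.01  # assumed proportion of drivers at 0.8 or more


def table_row(f):
    """Return (valid detections, invalid detections, probability valid)."""
    k_elements = round(population * a)   # 1,000 drivers at 0.8 or more
    others = population - k_elements     # 99,000 drivers below 0.8
    valid = round(k_elements * f)        # correctly detected
    invalid = round(others * (1 - f))    # wrongly detected
    return valid, invalid, valid / (valid + invalid)


for f in (0.999, 0.99, 0.95, 0.9, 0.8):
    valid, invalid, prob = table_row(f)
    print(f"{f:<6} {valid:>6} {invalid:>6}   {prob:.2f}")
```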

 

 

We can imagine the dangers of bad interpretations of tests in, for example, the medical field.

 

PART II

 

In the first part, we assumed the following hypothesis: if the reliability of a dichotomous test is f, then the probability that it gives a wrong result is 1-f.

 

Let's now see what happens if we drop this hypothesis.

Let P be the number of elements in the population, a the (known) probability that an element of this population has a definite feature K, f1 the probability that a K-element is actually detected as a K-element, and f2 the probability that a non-K-element is not erroneously detected by the test. In practical cases, we have f1 < f2.

 

The number of K-elements detected by the test equals a f1 P. The number of non-K-elements detected incorrectly is (1-a) (1-f2) P. The probability that an element detected by the test is actually a K-element equals 0.5 when a f1 P = (1-a) (1-f2) P. This is equivalent to the special test condition f2 = 1 + a f1/(a-1).

 

The test becomes meaningless if f2 < 1 + a f1/(a-1).

 

The ratio a/(a-1) is usually very small in absolute value. (For a = 0.01, this ratio becomes -1/99 and the special test condition becomes f2 = 1 - f1/99.)

 

For a = 0.01 and any reasonable value of f1 between 0.8 and 0.999, the test will make no sense if f2 falls below about 0.99!
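
The special test condition can be evaluated for several values of f1. The following Python sketch is an illustration only; the function names critical_f2 and prob_detection_valid are assumptions, not part of the article.

```python
def critical_f2(a, f1):
    """Value of f2 at which a detection is right only half the time.

    Obtained by solving a*f1 = (1 - a)*(1 - f2) for f2.
    """
    return 1 + a * f1 / (a - 1)


def prob_detection_valid(a, f1, f2):
    """Probability that a detected element really is a K-element."""
    valid = a * f1                # K-elements correctly detected
    invalid = (1 - a) * (1 - f2)  # non-K-elements wrongly detected
    return valid / (valid + invalid)


a = 0.01
for f1 in (0.8, 0.9, 0.999):
    f2 = critical_f2(a, f1)
    print(f1, round(f2, 4), round(prob_detection_valid(a, f1, f2), 3))
```

For a = 0.01, the critical value of f2 stays within a narrow band around 0.99 over this whole range of f1, which is why f1 matters so little here.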

 

This can easily be shown with the following example. In a population of 100,000 elements, let a = 0.01, so that 1000 elements are actually K-elements. If f2 < 0.99, more than one percent of the 99,000 non-K-elements (that's more than 990) will be invalidly detected as K-elements. Even if f1 were to have the value 1 (all the K-elements are detected), we would have the 1000 valid detections but also more than 990 invalid ones, so about half of all detections, or more, would be wrong.
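
The counts in this example can be reproduced directly. In the sketch below, the function name detections and the value f2 = 0.985 are hypothetical, chosen only to illustrate a value of f2 below 0.99; counts are rounded to whole elements.

```python
def detections(population, a, f1, f2):
    """Return (valid detections, invalid detections) as whole elements."""
    k_elements = round(population * a)
    valid = round(k_elements * f1)                         # K-elements found
    invalid = round((population - k_elements) * (1 - f2))  # false detections
    return valid, invalid


# Best case for the test: f1 = 1, every K-element is detected.
valid, invalid = detections(100_000, 0.01, 1.0, 0.985)
print(valid, invalid, round(valid / (valid + invalid), 2))
```

Even with a perfect f1, the invalid detections outnumber the valid ones as soon as f2 drops this far.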


* I want to thank Fred Vaughan for his help in writing this article in good English.