
Assessing the performance of diagnostic test accuracy measures



American Journal of Orthodontics and Dentofacial Orthopedics, 2022-05-01, Volume 161, Issue 5, Pages 748-751, Copyright © 2021


Diagnosis of a dental condition is the process by which we determine whether or not a person has the condition of interest (the target condition), and this is usually achieved by using relevant diagnostic tests. This short paper aimed to introduce the measures used to describe the performance of diagnostic tests.

Measures of test accuracy

Diagnostic accuracy refers to the ability of a test to correctly detect the presence or absence of the target condition. Diagnostic tests are not perfect in discriminating between patients with and without the target condition, and therefore the diagnostic accuracy of each test should be evaluated.

The evaluation comes through comparing the diagnostic test examined, called the index test, with a reference standard. The reference standard is a test or a procedure considered a reliable guide as to the absence or presence of the target condition. An index test can make 2 types of error: false-positive and false-negative results. A test result is a false positive when it is positive although the person does not have the target condition, and a false negative when it is negative although the person has the condition ( Table I ).

Table I
Two-by-two classification table of a diagnostic test (columns: target condition status according to the reference standard)

Index test result | With the target condition (diseased) | Without the target condition (healthy)
Positive | True positive (TP): the participant has the condition, and the test result is positive | False positive (FP): the participant does not have the condition, and the test result is positive
Negative | False negative (FN): the participant has the condition, and the test result is negative | True negative (TN): the participant does not have the condition, and the test result is negative

Index tests can be based on a binary marker, directly providing a positive or negative test result, like radiographs, which can directly reveal a root fracture. There are also continuous index tests, like blood tests; these tests usually measure the levels of a substance or biomarker and require setting a cutoff (threshold) value to dichotomize the test results and then decide on the basis of this cutoff. A test is considered positive if the measured value exceeds the predefined threshold. For example, when testing for periodontitis by measuring C-reactive protein levels in the blood, either a threshold of 5 mg/dL or 7 mg/dL can be used. Hence, when a participant has a C-reactive protein value greater than 5 (or 7) mg/dL, he or she is considered to have periodontitis. Often, a set of different thresholds can be used for a single index test; however, as can be seen in Figure 1 and Table II , lower thresholds produce more true-positive and more false-positive results.
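As a minimal sketch of this dichotomization step (the measurements, disease labels, and the classify helper below are hypothetical illustrations, not the article's data), a continuous marker can be cross-tabulated against the reference standard for any chosen cutoff:

def classify(crp_values, has_condition, threshold):
    """Count TP, FP, FN, TN for a given cutoff on a continuous marker."""
    tp = fp = fn = tn = 0
    for value, diseased in zip(crp_values, has_condition):
        positive = value > threshold  # index test is positive above the cutoff
        if positive and diseased:
            tp += 1
        elif positive and not diseased:
            fp += 1
        elif not positive and diseased:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical C-reactive protein values (mg/dL) and reference-standard status
crp = [3.1, 5.4, 6.8, 7.9, 4.2, 8.5, 5.1, 2.7]
diseased = [False, True, True, True, False, True, False, False]

print(classify(crp, diseased, threshold=5))  # lower cutoff: more positive results
print(classify(crp, diseased, threshold=7))  # higher cutoff: fewer positive results

Rerunning the count with a higher cutoff moves participants from the positive to the negative cells, which is exactly the threshold effect discussed below.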

Fig 1
The number of participants with and without the target condition, classified at 2 different thresholds. A higher threshold (T2, purple) produces fewer true-positive (TP) results, decreasing sensitivity, and more true-negative (TN) results, increasing the specificity of the test. FP, false positive; FN, false negative.

Table II
C-reactive protein levels in the blood to test for periodontitis at threshold 1 (5 mg/dL) and 2 (7 mg/dL)
Threshold | Data | Sensitivity | Specificity | DOR | LR+ | PPV | NPV
Threshold 1 (5 mg/dL) | TP = 238, FN = 17, FP = 104, TN = 255 | 238/(238 + 17) = 93% | 255/(255 + 104) = 71% | 32.5 | 3.2 | 70% | 94%
Threshold 2 (7 mg/dL) | TP = 116, FN = 71, FP = 51, TN = 376 | 116/(116 + 71) = 62% | 376/(376 + 51) = 88% | 12.0 | 5.2 | 69% | 84%

TP, true positive; TN, true negative; FP, false positive; FN, false negative.
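The entries in Table II follow directly from the four counts in each row. As an illustration only (the accuracy_measures helper below is not part of the article), the sensitivity, specificity, LR+, PPV, and NPV reported in the table can be recomputed from the counts and match after rounding:

def accuracy_measures(tp, fn, fp, tn):
    """Compute the Table II measures (except the DOR) from the four counts."""
    sensitivity = tp / (tp + fn)           # true-positive rate
    specificity = tn / (tn + fp)           # true-negative rate
    lr_positive = sensitivity / (1 - specificity)
    ppv = tp / (tp + fp)                   # positive predictive value
    npv = tn / (tn + fn)                   # negative predictive value
    return sensitivity, specificity, lr_positive, ppv, npv

for label, counts in {"5 mg/dL": (238, 17, 104, 255),
                      "7 mg/dL": (116, 71, 51, 376)}.items():
    sens, spec, lr_pos, ppv, npv = accuracy_measures(*counts)
    print(f"{label}: sensitivity {sens:.0%}, specificity {spec:.0%}, "
          f"LR+ {lr_pos:.1f}, PPV {ppv:.0%}, NPV {npv:.0%}")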

The diagnostic performance of index tests is commonly described using 2 basic concepts: sensitivity and specificity. The sensitivity, or true-positive rate, is the probability that an individual has a positive index test result when the target condition is present; it describes the ability of a test to correctly identify diseased patients. The specificity, or true-negative rate, is defined as the probability that an individual has a negative index test result when the target condition is absent; it describes the ability of a test to correctly identify healthy participants. Both can be treated as proportions.
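Written compactly with the cell counts of Table I (the conditional-probability notation, with T denoting the index test result and D the target condition status, is added here for brevity and is not used in the original):

\[
\text{Sensitivity} = P(T^{+} \mid D^{+}) = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = P(T^{-} \mid D^{-}) = \frac{TN}{TN + FP}
\]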

An ideal test would have both sensitivity and specificity close to 100%, in the sense that false negatives and false positives are close to zero. High sensitivity and high specificity indicate that a test would be very useful, especially if it is easier to conduct than the gold standard; for example, a diagnosis from clinical examination (index test) vs magnetic resonance imaging (gold standard) for temporomandibular joint disc displacement. Unfortunately, 100% sensitivity and specificity are very uncommon in real life, and the choice between optimal sensitivity vs optimal specificity can depend on the question at hand. High sensitivity is important when the cost of a false negative is high, because a negative result from a highly sensitive test helps rule out the target condition, whereas high specificity is important when the cost of a false positive is high, because a positive result from a highly specific test helps confirm it.

When test thresholds vary, sensitivity and specificity are inversely related; with a threshold change, an increase in sensitivity leads to a decrease in specificity and vice versa (threshold effect). In Figure 1 , the 2 different threshold values for testing for periodontitis are displayed: a lower one (5 mg/dL) and a higher one (7 mg/dL). When the threshold increased from 5 to 7 mg/dL, the number of true-positive cases decreased, whereas the number of true-negative cases increased. Consequently, the test’s sensitivity (ie, the ratio of true positives over patients with the target condition) decreased, whereas the specificity (ie, the ratio of true negatives over patients without the target condition) increased.

In Table II , at the 5 mg/dL threshold, sensitivity is 93% and specificity 71%. That means that the test correctly gives a positive result for 93% of participants with periodontitis (7% of participants with the target condition are classified falsely as negative), and a negative test result for 71% of healthy participants regarding periodontitis (29% of participants without the target condition are classified falsely as positive). When the threshold increases to 7 mg/dL, sensitivity decreases to 62%, and specificity increases to 88%. In brief, threshold selection plays a crucial role in diagnostic test accuracy studies, as a change in threshold may change the patients’ classification and, consequently, the diagnostic test accuracy measures.

A likelihood ratio (LR) of a diagnostic test describes how much the probability of having the target condition changes, given a test result. It is defined as the probability of a given test result among participants with the target condition divided by the probability of the same test result among participants without the target condition. Test results are either positive or negative. Consequently, there are 2 ratios, the positive LR (LR+) and the negative LR (LR−), which describe how many times more likely positive (or negative, for LR−) test results are in the group of participants with the target condition than in the group of participants without the target condition. LRs range from zero to infinity and can be derived using sensitivity and specificity ( Table III ). The further the LR+ is above 1, the better the test at confirming the target condition, and the lower the LR−, the better the test at ruling out the target condition. For example, in the data provided in Table II , the LR+ at the 5 mg/dL threshold is 3.2. This means that a positive periodontitis test result is 3.2 times more likely in participants with periodontitis than in participants without periodontitis.
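Using the rounded sensitivity (93%) and specificity (71%) from Table II at the 5 mg/dL threshold, the formulas in Table III give the following (LR− is not reported in Table II; the value shown here is simply the formula applied to these rounded figures):

\[
LR^{+} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} = \frac{0.93}{1 - 0.71} \approx 3.2,
\qquad
LR^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}} = \frac{1 - 0.93}{0.71} \approx 0.10
\]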

Table III
Measures of diagnostic test accuracy

Measure | Definition | Formula
Sensitivity | Probability of the test to detect the diseased patients | TP/(TP + FN)
Specificity | Probability of the test to detect the healthy patients | TN/(TN + FP)
Likelihood ratio | Positive: how many times more likely positive test results are in participants with the target condition vs participants without it. Negative: how many times more likely negative test results are in participants with the target condition vs participants without it | LR+ = Sensitivity/(1 − Specificity); LR− = (1 − Sensitivity)/Specificity
Diagnostic odds ratio | The ratio of the odds of a positive test result in participants with the target condition over the odds of a positive test result in participants without it | (Sensitivity × Specificity)/((1 − Sensitivity) × (1 − Specificity))
Predictive values | Positive: probability to have the condition given a positive test result. Negative: probability not to have the condition given a negative test result | PPV = TP/(TP + FP); NPV = TN/(TN + FN)
Prevalence | The proportion of participants with the target condition | (TP + FN)/(TP + FN + TN + FP)
ROC curve | A plot of sensitivity against 1 − specificity, constructed to illustrate the diagnostic performance of a test. The closer the curve to the upper left corner of the ROC space, the better the test | —

The sensitivity and specificity of a test are typically reported as a pair. The diagnostic odds ratio (DOR) is a common approach to combine the 2 quantities into a single measure; it is defined as the ratio of the odds of test positivity in diseased over the odds of test positivity in healthy patients and can also be derived using the estimated sensitivity and specificity. The DOR is easy to calculate but often difficult to interpret. It ranges from zero to infinity: a DOR greater than 1 indicates that the test can discriminate between patients with and without the target condition, and the higher the DOR, the better the test. In the data provided in Table II , the DOR at the 5 mg/dL threshold is 32.5, and the DOR at the 7 mg/dL threshold is 12.0.
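With the rounded sensitivity and specificity from Table II at the 5 mg/dL threshold, the formula in Table III reproduces the reported value:

\[
DOR = \frac{\text{Sensitivity} \times \text{Specificity}}{(1 - \text{Sensitivity}) \times (1 - \text{Specificity})} = \frac{0.93 \times 0.71}{0.07 \times 0.29} \approx 32.5
\]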

Sensitivity and specificity refer to the performance of a test: given the true status of the medical condition, they tell us whether the test performs well or poorly. However, a question of interest is the reverse: given the result of the test, what is the condition of the person? This is answered by the positive predictive value (PPV) and the negative predictive value (NPV).

PPV is the probability that a participant truly has the target condition given a positive index test result. NPV is the probability that a participant does not have the target condition given a negative index test result. In the data provided in Table II , at the 5 mg/dL threshold, PPV is 70%, and NPV is 94%. Hence, a patient is 70% likely to have periodontitis, given a positive test result, whereas a patient is 94% likely not to have periodontitis, given a negative test result.
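These two values follow directly from the counts in Table II at the 5 mg/dL threshold:

\[
PPV = \frac{TP}{TP + FP} = \frac{238}{238 + 104} \approx 70\%,
\qquad
NPV = \frac{TN}{TN + FN} = \frac{255}{255 + 17} \approx 94\%
\]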

The prevalence measures how common the target condition is in a defined population and is expressed as a proportion. Sensitivity and specificity are characteristics of the test and remain unaffected by any prevalence changes. Consequently, because the DOR and the likelihood ratios are estimated through sensitivity and specificity, they are also robust measures, irrespective of the prevalence of the target condition. However, changes in prevalence can influence the predictive values. More specifically, as prevalence increases, PPV increases whereas NPV decreases, because among positive test results there are proportionally fewer false positives, whereas among negative results there are more false negatives. In contrast, a decrease in prevalence decreases PPV and increases NPV.
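This dependence on prevalence can be made concrete with a small sketch (illustrative only; the predictive_values helper and the prevalence values below are not from the article). For a test with the rounded sensitivity (93%) and specificity (71%) of the 5 mg/dL threshold, Bayes' theorem gives the predictive values at any assumed prevalence:

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' theorem for a given prevalence."""
    true_pos = sensitivity * prevalence               # P(test positive and diseased)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(test positive and healthy)
    true_neg = specificity * (1 - prevalence)         # P(test negative and healthy)
    false_neg = (1 - sensitivity) * prevalence        # P(test negative and diseased)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

for prevalence in (0.1, 0.3, 0.5, 0.7):
    ppv, npv = predictive_values(0.93, 0.71, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
# PPV rises and NPV falls as prevalence increases, while sensitivity
# and specificity stay fixed.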

Receiver operating characteristic curve

The receiver operating characteristic (ROC) curve is a graphical way to represent the performance of diagnostic tests. A ROC curve is created by plotting sensitivity (y-axis) against 1 − specificity (x-axis); it illustrates the trade-off between sensitivity and specificity at every threshold included. The closer the curve to the upper left corner, the better the test; such a test would have sensitivity and specificity close to 100%. In ROC space, the diagonal line represents tests with no accuracy. A test with a ROC curve close to the diagonal line tends to be less accurate, whereas a ROC curve beneath the diagonal implies a misclassification problem (healthy participants are classified as diseased and vice versa). In Figure 2 , the ROC curves of blue fluorescence (BF), violet fluorescence (VF), and orange fluorescence (OF) for diagnosing dental caries are displayed. The BF ROC curve is the closest to the top left corner, above the VF and OF ROC curves. This shows that BF is the best among the 3 tests. However, the OF ROC curve lies under the no-accuracy line, which means that patients with dental caries may be wrongly classified as healthy when using OF.

Fig 2
ROC curves of the blue fluorescence, violet fluorescence, and orange fluorescence devices (tests) for diagnosing dental caries. The closer the curve to the upper left corner, the better the test.
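As a minimal sketch of how such a curve is constructed (the scores, disease labels, and the roc_points helper below are hypothetical, not the fluorescence data behind Figure 2), the threshold is swept over a continuous test and each cutoff contributes one (1 − specificity, sensitivity) point; the area under the curve can then be approximated with the trapezoidal rule, where 0.5 corresponds to the no-accuracy diagonal:

def roc_points(scores, diseased):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cutoff."""
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, diseased) if d and s >= cutoff)
        fp = sum(1 for s, d in zip(scores, diseased) if not d and s >= cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical continuous test values and reference-standard status
scores = [0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
diseased = [False, False, True, False, True, True, False, True]

points = roc_points(scores, diseased)
auc = sum((x2 - x1) * (y1 + y2) / 2                 # trapezoidal rule
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(f"AUC = {auc:.2f}")   # 0.5 corresponds to the no-accuracy diagonal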
