The goal of this program is to provide an introductory framework for clinicians who want to interpret statistical tests of reliability and validity. After you study the information presented here, you will be able to —
Approval Information
Gannett Healthcare Group is an approved sponsor by the New York State Education Department of continuing education for physical therapists and physical therapist assistants from October 21, 2009 to October 21, 2012.
This activity is provided by the Texas Board of Physical Therapy Examiners Accredited Provider #GED012010TPTA2012004 and meets continuing competence requirements for physical therapist and physical therapist assistant licensure renewal in
As of 4/5/10, Gannett Education is recognized by the Physical Therapy Board of California as an approved reviewer and provider of continuing competency courses for the state of
Gannett Education was approved as a provider of continuing education by the North Carolina Physical Therapy Association (provider no. 09-0215-001PR) from March 8, 2009 through March 8, 2010.
This course has been approved as meeting the continuing education requirements for PTs and PTAs by the Ohio Physical Therapy Association (approval no. 10S0348 for 2/14/10 to 2/14/11; 11S0674 for 02/15/11 to 02/15/12; 12S0171 for 02/16/12 to 02/16/13), the Florida Physical Therapy Association (approval no. CP100015261, expiration date 12/31/10; CE110017074 for 01/01/11 to 12/31/11; CE120417123 for 01/01/12 to 12/31/12); the Tennessee Physical Therapy Association (approval no. 3735 for 04/12/11 to 04/11/12) for Class 1 Continuing Education Requirement; the Pennsylvania Board of Physical Therapy (approval no. PTCE002234 for 05/22/11 to 12/31/12); and the New Jersey Board of Physical Therapy Examiners (approval no. 295-2010 for 2/1/10 to 1/31/12). Approval of this course does not necessarily imply the Florida Physical Therapy Association supports the views of the presenter or the sponsors.
This course has been approved by the Maryland State Board of Physical Therapy Examiners for 0.1 CEU for 03/24/11 to 03/23/15.
The Illinois Chapter Continuing Education Committee has certified that this course meets the criteria for approval of Continuing Education offerings established by The Illinois Physical Therapy Association (approval no. 437.3197 for 02/01/10 to 02/01/11; 437-3776 for 04/01/11 to 04/01/12). According to the Rules for the Administration of the Illinois Physical Therapy Act (section 13460.61) published by the Illinois Department of Professional Regulation, a physical therapist or physical therapist assistant applying for re-licensure in Illinois can earn a maximum of 50 percent of their required continuing education hours from self-study. The hours awarded of this course are designated for self-study CE credit.
Other states may accept this course for meeting their CE requirements. Check with your state association or board.
With the advent of evidence-based practice, an increased responsibility has been placed on clinicians to understand and apply evidence to patient care. Many clinicians have recognized this responsibility only to find themselves at an impasse when arriving at the results and data analysis section of a journal article. Although research reports are supposed to be written with the goal of meeting the diverse needs of both researchers and those involved in patient care, some of these reports fall short of clinical utility. Research reports may be written with a degree of intricacy or use of nomenclature that is only appreciated by the researcher or academician; thus clinicians may fail to find value in the results.
Research is an evolving process, and some PTs may have gaps in their research acumen, leading to barriers when attempting to translate research findings into clinical practice. Interpreting research requires active learning and a reasonable starting point, as numerous statistical tests, analyses, and measurement principles exist. Learning every type of statistical test or experimental design would be an arduous task for anyone not pursuing formal research training, so choosing a realistic and relevant starting point is essential. From an introductory perspective, interpreting statistical reports of tests and measurements is a reasonable skill for the clinician to acquire, and it has direct implications for patient care.
Most PTs would agree that tests and measurements are an integral part of the examination, as they provide a means to describe and interpret clinical findings. Moreover, tests and measurements are the basis for clinical diagnosis and identifying appropriate interventions. The selection of tests and measurements is often based upon the inclination of the clinician with factors such as familiarity and educational dogma driving these choices. Unfortunately, many of the textbooks used to illustrate such tests and measurements offer no guide to interpreting the clinical value of these tests, leading the clinician to assume that all tests may be of equal value. With the evolving paradigm shift toward evidence-based practice, many clinicians have begun to recognize that reproducibility (reliability) and validity are necessary considerations when choosing a test or measurement. With an understanding of reliability and validity, clinicians can recognize the merits and limitations of a particular test and thus have confidence in their interpretation of the clinical examination.
Clinical Scenarios
Scenario A: A PT measures a patient’s shoulder internal rotation range of motion and records 55° using a gravity-based inclinometer. If the PT repeated this measurement the next day and recorded 58°, would you assume the patient has increased mobility, or is the change within the expected range of error? If you were to measure internal rotation after applying an intervention, how would you know whether an increase in mobility is the result of true change or of measurement error?
This scenario highlights a clinical question requiring an understanding of reliability and interpretation of change.
Scenario B: You perform the drop arm test for a patient suspected of having a rotator cuff tear and record a negative result. The validity of this test has been reported as having Sensitivity = 0.10 and Specificity = 0.98 [1]. Could you assume that this patient does not have a rotator cuff tear? If you recorded a positive result, could you assume with some degree of certainty that this patient has a rotator cuff tear?
This scenario highlights a clinical question requiring an understanding of validity.
Measurement Levels
Researchers conducting scientific inquiry assign a value to their data before performing statistical tests. These values allow the researcher to classify the level or scale of a measurement that is necessary to perform the appropriate statistical tests. Four levels or scales have been systematically defined: nominal, ordinal, interval, and ratio measurements (see Table 1).
Table 1: Measurement Levels

| Measurement Level | Description | Example |
| --- | --- | --- |
| Nominal | Mutually exclusive categories | Gender, + or - test, arm dominance |
| Ordinal | Data is ranked with unequal intervals between ranks | Manual muscle test, verbal pain scale |
| Interval | Data is ranked with equal intervals between ranks | Temperature |
| Ratio | Data represents numbers with a true zero point and equal intervals between numbers | Range of motion, height, weight, strength from hand-held dynamometer |

Nominal data is the lowest, most basic level of measurement. It serves to label or classify a characteristic or outcome into a category. Dichotomous results of a test or question are classified as nominal. An example of a nominal measurement is a test that has either a “positive or negative” result or a “yes or no” response. A key characteristic of a nominal measurement is that the classification is exhaustive and mutually exclusive: there is a category for every outcome, and no one can be assigned to more than one category. Technically speaking, there is no ranking to nominal measurements, as one category is not necessarily “greater” than another. Since the categories cannot be ranked, nominal data cannot be used to demonstrate change.
Ordinal measurements require a ranking, and data is organized into categories exhibiting a greater-than or less-than relationship. An example of an ordinal measurement is manual muscle testing, which can be ranked from 0 to 5, or the verbal pain scale, which is often rated from 0 to 10. Ordinal measurements by their definition do not have a true zero point; thus the zero is often arbitrary. Another facet of ordinal measurements is that the differences between ranks are not true quantities. In other words, the change in strength from a 3 to a 4 is not necessarily the same as a change from a 2 to a 3. Thus the difference between ranks does not have a true value.
Interval and ratio measurements have equal values between intervals or ranks; however, the zero point of interval measurements is arbitrary, as with ordinal measurements. Temperature is an example of an interval measurement. The ratio level of measurement is the highest level of measurement, and it has equal intervals between measurements. For example, we can state with certainty that a change in shoulder internal rotation from 50° to 60° is the same amount as a change from 60° to 70°, as they both are 10°. Unlike an interval measurement, a ratio measurement has a true zero point. The zero in a ratio measurement represents the absence of an attribute or property. Examples of a ratio measurement are range of motion and force. The difference between an arbitrary and a true zero point can be seen by comparing temperature and range-of-motion readings. A reading of zero degrees corresponds to a different temperature in Fahrenheit than in Celsius, and 0° does not imply the absence of the temperature attribute. A 0° range-of-motion measurement implies an absence of motion. Table 1 summarizes the facets for each of the previously described measurement levels.
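The distinction between an arbitrary and a true zero point can be checked with simple arithmetic. The following is a minimal sketch, with illustrative readings chosen for this example (they do not come from the text): the ratio between two temperature readings changes with the unit, whereas the ratio between two range-of-motion readings does not.

```python
import math


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval scale (temperature): zero is arbitrary, so "twice as hot"
# depends on the unit. Readings of 10 and 20 degrees C are illustrative.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale (range of motion): zero means no motion, so 60 degrees is
# twice 30 degrees whether expressed in degrees or radians.
ratio_degrees = 60 / 30
ratio_radians = math.radians(60) / math.radians(30)

print(ratio_celsius, round(ratio_fahrenheit, 2))
print(ratio_degrees, round(ratio_radians, 2))
```

The same pair of temperatures gives a ratio of 2 in Celsius but about 1.36 in Fahrenheit, which is why statements such as "twice as much" are only meaningful for ratio-level data.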
Understanding the level of a measurement has relevance for performing statistical analyses, as statistical tests for nominal measurements are usually not appropriate for ratio measurements and vice versa. Each level has a set of rules for data analysis for which the relevance will be established in the subsequent sections denoted for reliability and validity. While it may not be clinicians’ responsibility to assign measurement levels to their clinical tests, an awareness of measurement scales may alert them when an inappropriate statistical test is applied.
The extent to which clinicians interpret the consistency and degree of error in their measurements requires an understanding of reliability. Validity, on the other hand, is a necessary prerequisite to ensure the chosen test is measuring what it is intended to measure. Validity allows us to draw inferences from the test. Scenarios A and B illustrate the importance of understanding the reliability and validity of tests and measurements.
Reliability
The reliability of a test or measurement may be viewed as the reproducibility of that measurement or test. If a measurement is reproducible and free from error, it will produce a consistent response when repeated by the same or a different examiner. A test that is reproducible or consistent when the same examiner performs the test on multiple occasions is said to have intrarater reliability; whereas a test that is reproducible when different examiners perform the measurement is said to have interrater reliability. Typically, measurements have higher intrarater reliability than interrater as a result of procedural variations between testers. An example of this could be seated range-of-motion measurements of shoulder flexion where one clinician asks a patient to elevate his arm in the strict sagittal plane unlike another clinician who may allow a degree of deviation from the sagittal plane. This deviation would produce two different measurements. This difference would be reflected in the calculation of interrater reliability.
When assessing the reliability of a test, both correlation and agreement need to be considered. Correlation is essentially an association: it reflects whether two sets of scores rise and fall together, without regard to their actual values. Agreement reflects whether the scores are the same. For example, consider the case of two PTs who are measuring the same patients. They record different measurements but are consistent in their differences. In this case, a reliability calculation based on correlation alone, using a Pearson value, would be high, since the Pearson test does not take agreement into account. As a result, there are inherent limitations to reliability reports that use the Pearson statistic.
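The two-PT example can be illustrated numerically. The sketch below uses hypothetical range-of-motion readings, invented for illustration, in which the second PT consistently records 5° more than the first. The helper names `pearson_r` and `mean_difference` are likewise assumptions of this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_difference(x, y):
    """Average of (y - x): the systematic offset between two raters."""
    return sum(b - a for a, b in zip(x, y)) / len(x)


# Hypothetical readings (degrees) on five patients.
pt_a = [50, 55, 60, 65, 70]
pt_b = [a + 5 for a in pt_a]  # second PT always reads 5 degrees higher

r = pearson_r(pt_a, pt_b)
bias = mean_difference(pt_a, pt_b)
exact_matches = sum(a == b for a, b in zip(pt_a, pt_b))

print(r, bias, exact_matches)
```

The correlation is perfect (r = 1) even though the two PTs never record the same value and differ by 5° on every patient, which is the limitation of the Pearson statistic described above.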
The preferred test for reliability is the Intraclass Correlation Coefficient because it measures both correlation and agreement. The Intraclass Correlation Coefficient may be used for ratio, interval, or ordinal data. For nominal data, the percent agreement beyond chance is determined with the Kappa statistic. Interpretations of the Kappa and Intraclass Correlation Coefficient values are offered in Tables 2 and 3, respectively. Clinicians should, however, recognize that interpretation is not an absolute scale, and individual considerations must be made. For example, a measuring instrument that has an Intraclass Correlation Coefficient of 0.49 would be considered poor, according to information presented in Table 3. However, if this is the only measuring instrument or technique available, the clinician may not have a better option. In these cases, clinical decisions can be made with an understanding of both the merits and limitations of the instrument. Clinicians faced with choosing between two measurement instruments should opt for the instrument with higher reliability to ensure more consistent measurement. Lastly, when considering reliability values, one must recognize that a nominal test with two categories (i.e., + or -) can achieve high reliability more easily than an ordinal scale, such as the pain scale, which has 11 possible categories.
Table 2: Interpretation of Agreement Using the Kappa Value

| Kappa Value | Interpretation of Agreement |
| --- | --- |
| < 0.00 | Poor |
| 0.00 to 0.20 | Slight |
| 0.21 to 0.40 | Fair |
| 0.41 to 0.60 | Moderate |
| 0.61 to 0.80 | Substantial |
| 0.81 to 1 | Almost perfect |

Adapted from Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.

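To show what "agreement beyond chance" means in practice, here is a minimal sketch of Cohen's Kappa for two raters scoring a dichotomous test. The ratings are hypothetical, and the function names `cohens_kappa` and `interpret_kappa` are assumptions of this example. The interpretation bands follow Table 2.

```python
def cohens_kappa(rater_1, rater_2):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(rater_1)
    categories = set(rater_1) | set(rater_2)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance, from each rater's category proportions.
    expected = sum(
        (rater_1.count(c) / n) * (rater_2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map a Kappa value to the Landis and Koch labels in Table 2."""
    if kappa < 0.0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"


# Hypothetical + / - results from two PTs on the same 10 patients.
rater_1 = ["+"] * 6 + ["-"] * 4
rater_2 = ["+"] * 5 + ["-"] * 4 + ["+"]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2), interpret_kappa(kappa))
```

In this example the raters agree on 8 of 10 patients (80%), yet Kappa is only about 0.58, because roughly half of that agreement would be expected by chance alone.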
Table 3: Interpreting Reliability Coefficients Using the Intraclass Correlation Coefficient

| Reliability Coefficient | Interpretation |
| --- | --- |
| < 0.50 | Poor |
| 0.50 to 0.75 | Moderate |
| > 0.75 | Good |
| > 0.90 | Excellent; recommended for clinical decision making |

Interpretation of Change
Assume you are in the clinic and want to determine if your patient has made progress with a particular outcome measure. One question to consider is whether the change is the result of error or limitations in reproducibility, or whether it represents a true difference. Another question is whether the change is clinically important to the patient. For example, a patient presents for a shoulder examination due to pain and functional limitations. During the examination, you decide to use the Shoulder Pain and Disability Index, which is a 13-item self-report questionnaire that measures shoulder pain and disability. The Shoulder Pain and Disability Index is scored on a visual analog or numerical pain scale, with a total score ranging from 0 to 100 (0 is the best score and suggestive of no pain and disability attributed to the shoulder; 100 is the worst possible score). Assume your patient scored a 40 on his initial examination and one week later scored a 25. Did the patient experience improvement? How much change is truly needed to exceed the threshold of error and/or to be clinically important?
The two statistical calculations that provide this information are the minimum detectable change and the minimum clinically important difference. The minimum detectable change is a statistical calculation that takes into account reliability values and measurement error. The minimum clinically important difference may be either a statistical calculation based on research using a known outcome measure for comparison or determined by expert consensus.
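The text does not give a formula for the minimum detectable change. One commonly used approach, offered here only as a sketch, derives it from the standard error of measurement (SEM): SEM = SD × √(1 − reliability), and the minimum detectable change at the 95% confidence level = 1.96 × √2 × SEM. The standard deviation and reliability values below are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)


def minimum_detectable_change_95(sd, reliability):
    """MDC at 95% confidence = 1.96 * sqrt(2) * SEM."""
    return 1.96 * sqrt(2) * standard_error_of_measurement(sd, reliability)


# Hypothetical inputs: between-subject SD of 8 points and a reliability
# coefficient (e.g., an Intraclass Correlation Coefficient) of 0.90.
sem = standard_error_of_measurement(sd=8.0, reliability=0.90)
mdc95 = minimum_detectable_change_95(sd=8.0, reliability=0.90)

print(round(sem, 2), round(mdc95, 2))
```

The sketch makes the dependence on reliability visible: as the reliability coefficient approaches 1, the SEM shrinks and a smaller change can be distinguished from measurement error.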
The minimum detectable change is the amount of change that will exceed the threshold of error; the minimum clinically important difference is the smallest change that represents an important difference for the patient. The minimum detectable change for the Shoulder Pain and Disability Index has been reported as 18 points, and the minimum clinically important difference as 13 points [2]. Comparing these values to the 15-point change in the above example, we can state that this patient made a clinically important change; however, the change did not exceed the threshold of error based on the minimum detectable change. When considering change scores, it is desirable to exceed both the minimum detectable change and the minimum clinically important difference, but exceeding either is still considered change.
With regard to the information presented in Scenario A, we can refer to a study that reported an Intraclass Correlation Coefficient of 0.99 for internal rotation range of motion and a minimum detectable change of 4° [3]. We can interpret the reliability coefficient as being excellent and appropriate for clinical decision making. The minimum detectable change indicates that a change of greater than or equal to 4° is required to represent a true difference; thus the 3° change reported in Scenario A would likely be the result of measurement error. An important consideration when interpreting statistical values of tests is that the results are specific to the procedures used. For example, a reliability coefficient reported for measuring shoulder range of motion in supine would not be applicable to measurements performed seated.
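The comparisons made in the two examples above can be expressed as a small check. This is a minimal sketch using the values quoted in the text (a 15-point change on the Shoulder Pain and Disability Index against thresholds of 18 and 13, and a 3° change against a minimum detectable change of 4°). The function name `interpret_change` is an assumption of this example.

```python
def interpret_change(change, mdc, mcid=None):
    """Compare the size of a change score with the minimum detectable
    change (mdc) and, if given, the minimum clinically important
    difference (mcid)."""
    magnitude = abs(change)
    result = {"exceeds_mdc": magnitude >= mdc}
    if mcid is not None:
        result["exceeds_mcid"] = magnitude >= mcid
    return result


# Shoulder Pain and Disability Index: 40 at baseline, 25 one week later.
spadi = interpret_change(change=40 - 25, mdc=18, mcid=13)

# Scenario A: internal rotation measured at 55 degrees, then 58 degrees.
rom = interpret_change(change=58 - 55, mdc=4)

print(spadi)
print(rom)
```

The questionnaire change is clinically important but falls within measurement error, and the 3° range-of-motion change does not reach the 4° threshold, matching the interpretations given above.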
Validity/Diagnostic Utility
Validity is defined as the degree to which an instrument measures what it is intended to measure [4]. For example, goniometry is a valid measure of mobility, and isokinetic dynamometry is a valid measure of strength. Palpable tenderness, on the other hand, is not necessarily a valid measure of disability. Numerous classifications of validity have been described; however, a discussion of all classifications is well beyond the intended scope of this module. The focus of this reading is on criterion-related validity, which is determined by comparing a clinical test to a known gold or reference standard. The calculation is based on whether or not the results found during the clinical test match the result of a gold standard.
For example, the straight leg raise clinical test has been found to be a valid indicator of a lumbar spine discogenic disorder when studies were performed comparing the results from the straight leg raise to the gold standard, magnetic resonance imaging. There is more than one gold standard for many diagnoses. A positive clinical test finding that is also positive when tested against the gold standard is considered a “true positive”; whereas a positive clinical test finding that has a negative diagnosis with the gold standard would be considered a “false positive.” A test that is negative in the clinic and negative when using the diagnostic gold standard would be referred to as a “true negative,” and a test that is negative in the clinic but positive with the gold standard would be a “false negative.” The more false positives or false negatives a test produces, the lower its validity.
Table 4 illustrates a standard 2 x 2 contingency table that is used by epidemiologists to determine the validity of tests. The 2 x 2 contingency table is limited to a dichotomous outcome, such as + or - or yes or no. When considering clinical tests, every possible result can be classified into one of the four cells. For example, a patient who tests negative on the empty can test but has a partial tear or tendinopathy of the supraspinatus on MRI would be considered to have a false negative empty can test and thus be categorized into cell C. A patient who centralizes in response to a repetitive extension test and who has a disc herniation based upon discography is said to have a true positive and thus would be classified into cell A [5]. Once the test outcomes are identified (false positive, true negative, etc.), the validity results are commonly reported based on the test’s sensitivity, specificity, and likelihood ratio. Other validity statistics may be used; however, these three are the most commonly encountered in research reports.
Table 4: Results of the Diagnostic/Reference Standard

| Results of Clinical Test | Reference Standard Positive | Reference Standard Negative |
| --- | --- | --- |
| Positive | True Positive (A) | False Positive (B) |
| Negative | False Negative (C) | True Negative (D) |

Sensitivity
Sensitivity is a test’s ability to obtain a positive result when the condition is truly present (positive on both clinical test and gold standard). The formula for this calculation is Sensitivity = A/(A+C), as illustrated in Table 4. For example, if we had 10 subjects and four were classified as true positive, four were false positive, one was a true negative, and one was false negative, the sensitivity would be 4/(4+1) = 4/5 = 0.80 or 80%. Notice that cell C (false negatives) is used to calculate sensitivity; therefore, a test that has few false negatives would have a high sensitivity, and a test with many false negatives would have a lower sensitivity. Clinically, a negative result on a test with high sensitivity can be trusted, because such a test rarely misses the condition. Sensitivity is therefore valuable in ruling out a disorder or condition. SnNout is a mnemonic applied to sensitivity: when a test has a high sensitivity, a negative result rules out the diagnosis.
Specificity
Specificity is a test’s ability to obtain a negative result when the condition is truly absent (negative on both clinical test and gold standard) and is calculated by the formula Specificity = D/(B+D). For the above situation, the specificity is 1/(1+4) = 1/5 = 0.20 or 20%. Notice that cell B (false positives) is used to calculate specificity; thus, a test that has few false positives would have a high specificity. Clinically, a positive result on a test with high specificity can be trusted, because such a test rarely labels an unaffected person as positive. Specificity is valuable in ruling in a disorder or condition. SpPin is a mnemonic applied to specificity: when a test has a high specificity, a positive result rules in the diagnosis.
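The 10-subject example used for both calculations can be reproduced in a few lines. This is a minimal sketch; the function and parameter names are assumptions of this example, and the cell letters refer to Table 4.

```python
def sensitivity(true_pos, false_neg):
    """Sensitivity = A / (A + C)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Specificity = D / (B + D)."""
    return true_neg / (true_neg + false_pos)


# 10 subjects: 4 true positives (A), 4 false positives (B),
# 1 false negative (C), 1 true negative (D).
sens = sensitivity(true_pos=4, false_neg=1)
spec = specificity(true_neg=1, false_pos=4)

print(sens, spec)
```

The many false positives (cell B) pull specificity down to 0.20 without affecting sensitivity, which depends only on cells A and C.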
Application of Sensitivity and Specificity
The clinical utility of sensitivity and specificity calculations determines whether a test is valuable for ruling in or ruling out a condition of interest. In other words, if a negative result is obtained in the clinic, can we be confident that the test is truly negative? If the test has been previously researched, we could ascertain this by referring to the test’s sensitivity. If the test has a high sensitivity, we can then recognize the SnNout mnemonic and could be confident that a negative result for that particular test is valuable for ruling out the condition of interest. If the sensitivity was low, we should not be confident that the negative result is valid, as this particular test has many false negative results.
Scenario B illustrates how interpretation of sensitivity and specificity can contribute to the decision-making process. In the scenario, the drop arm clinical test is performed on a patient with a shoulder disorder, and a negative result is recorded. An investigation of clinical tests reported a sensitivity of 0.10 or 10% and specificity of 0.98 or 98% for the test’s validity to identify a rotator cuff tear using arthroscopic examination as the gold standard for comparison [1]. Using SnNout and SpPin, we can recognize that the drop arm test is valuable for ruling in a rotator cuff tear. In other words, a positive result suggests a rotator cuff tear. With regard to the negative result, the sensitivity is low (10%); thus the drop arm test does not have appreciable validity for ruling out a rotator cuff tear, and the clinician cannot be confident that a negative result is a true negative.
Table 5 illustrates the use of a 2 x 2 contingency table for calculating sensitivity and specificity, using data from an investigation published by Calis and colleagues. In this example, the Hawkins-Kennedy test for shoulder impingement syndrome is reported to have a sensitivity of 91% and specificity of 25% based on comparison with a diagnostic gold standard, the subacromial injection test. Based on these results, the test could be considered valid for ruling out impingement syndrome (high sensitivity). A negative result for the Hawkins-Kennedy impingement test is valuable for ruling out the condition because the test has few false negatives based on the formula for sensitivity (SnNout). On the other hand, the test has many false positives, which accounts for the lower specificity; thus a positive clinical test may be a false positive.
In summary, tests with a low sensitivity often have many false negatives, whereas tests with low specificity have many false positives. With this understanding, clinicians familiar with the statistical values of the tests they use could interpret the value of a positive or negative result. One of the recognized limitations of sensitivity and specificity is that an explanation has not been offered as to the cut-off point for a high vs. low value. Additionally, there is no suggestion as to how these values may affect the probability of a condition being present or absent as a result of the test’s outcome. As a result, many researchers have recognized the value of reporting likelihood ratios in their validity studies.
Table 5: Results of the Diagnostic Standard (Injection Test)

| Results of Clinical Test (Hawkins-Kennedy) | Injection Test Positive | Injection Test Negative |
| --- | --- | --- |
| Positive | True Positive (A): 80 | False Positive (B): 27 |
| Negative | False Negative (C): 8 | True Negative (D): 9 |

Hawkins-Kennedy impingement test results on patients diagnosed with subacromial impingement syndrome based on the subacromial injection test.

Sensitivity = A/(A+C) = 80/(80+8) = 91% or 0.91
Specificity = D/(B+D) = 9/(27+9) = 25% or 0.25

Likelihood Ratios
The likelihood ratio is a statistical measure of validity that allows us to be more confident or certain in a suspected diagnosis. The likelihood ratio tells us how much more likely a person is to have a condition of interest after the test is performed [4]. The positive likelihood ratio (+LR) indicates how much the odds of the disease increase from pretest to post-test when a clinical test is positive. The negative likelihood ratio (-LR) indicates how much the odds of the disease decrease when a test is negative. The higher the +LR, the greater the value of a positive result and the chances of the person actually having the disease; the lower the -LR, the greater the value of a negative result and the chances of the person not having the disease. The statistical calculation of a likelihood ratio takes into account both the sensitivity and specificity of a test, offering greater diagnostic utility than sensitivity or specificity alone. The formulas for calculating likelihood ratios are: +LR = Sensitivity/(1 - Specificity) and -LR = (1 - Sensitivity)/Specificity.
Table 6 offers a guide for interpreting likelihood ratios. When interpreting a likelihood ratio, it should be recognized that a test with a positive or negative likelihood ratio of 1 in no way alters the probability of the condition. A likelihood ratio of 1 indicates that false positive rates are the same as true positive rates, and false negative rates are the same as true negative rates. For example, assume a clinical test is positive and that the particular test has been reported to have a positive likelihood ratio of 1. A clinician who has an understanding of likelihood ratios should then recognize that the positive result does not change the probability of the patient having the diagnosis beyond what was already known before performing the test. A positive likelihood ratio greater than 1 increases the probability that the patient has the diagnosis of interest, and the larger the ratio, the greater the increase.
In the study by Calis and colleagues, the Hawkins-Kennedy test was compared to a subacromial injection. A patient who had a negative Hawkins-Kennedy and a negative subacromial injection test was deemed to be a true negative. The validity statistics included a negative likelihood ratio of 0.32. According to Table 6, a negative Hawkins-Kennedy test therefore alters the post-test probability of impingement syndrome to a small degree [6].
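Likelihood ratios can be calculated directly from the cells of a 2 x 2 table using the standard formulas +LR = Sensitivity/(1 - Specificity) and -LR = (1 - Sensitivity)/Specificity. The following is a minimal sketch applied to the counts shown in Table 5. The function name `likelihood_ratios` is an assumption of this example.

```python
def likelihood_ratios(a, b, c, d):
    """Return sensitivity, specificity, +LR and -LR from 2 x 2 counts.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    sens = a / (a + c)
    spec = d / (b + d)
    positive_lr = sens / (1 - spec)
    negative_lr = (1 - sens) / spec
    return sens, spec, positive_lr, negative_lr


# Counts from Table 5 (Hawkins-Kennedy vs. subacromial injection test).
sens, spec, pos_lr, neg_lr = likelihood_ratios(a=80, b=27, c=8, d=9)

print(round(sens, 2), round(spec, 2), round(pos_lr, 2), round(neg_lr, 2))
```

With the Table 5 counts, the negative likelihood ratio comes out at about 0.36 and the positive likelihood ratio at about 1.21. The negative value is close to, but not identical with, the 0.32 quoted from the study, and both fall in the same Table 6 band (0.2 to 0.5, a small change in post-test probability).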
Table 6: Interpreting Likelihood Ratios

| +LR | Interpretation | -LR |
| --- | --- | --- |
| 1 to 2 | Alters post-test probability minimally | 0.5 to 1 |
| 2 to 5 | Alters post-test probability to a small degree | 0.2 to 0.5 |
| 5 to 10 | Alters post-test probability to a moderate degree | 0.1 to 0.2 |
| > 10 | Significantly alters post-test probability | Less than 0.1 |

Adapted from: Jaeschke R, Guyatt GH, Sackett DL. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA. 1994;271(9):703-707.

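To see how a likelihood ratio shifts the probability of a condition, the pretest probability can be converted to odds, multiplied by the likelihood ratio, and converted back to a probability. The sketch below assumes a hypothetical pretest probability of 50%. The function name `post_test_probability` and the likelihood ratio values are illustrative only.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert pretest probability to post-test probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


pretest = 0.50  # hypothetical pretest probability of the condition

unchanged = post_test_probability(pretest, 1.0)        # LR of 1
after_positive = post_test_probability(pretest, 10.0)  # +LR of 10
after_negative = post_test_probability(pretest, 0.1)   # -LR of 0.1

print(unchanged, round(after_positive, 2), round(after_negative, 2))
```

A likelihood ratio of 1 leaves the probability at 50%, whereas a +LR of 10 raises it to about 91% and a -LR of 0.1 lowers it to about 9%, consistent with the upper and lower rows of Table 6.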
Conclusion
Evidence-based practice is continuing to evolve with a growing consensus that it is important and here to stay. The goals of improving patient care and outcomes are important to PTs; however, clinicians and researchers must equally share the steps necessary to meet these goals. Clinicians hold the responsibility of directly implementing patient care, while researchers test and develop the procedures that clinicians use. Clinicians who acquire an understanding of reliability may confidently choose tests and measurements with the greatest reproducibility, reducing error in their results. Moreover, understanding the minimum detectable change and minimum clinically important difference allows clinicians to interpret whether measured change scores represent true change that exceeds error and is clinically important to patients and their conditions. Lastly, recognizing and interpreting statistical measures of validity allows clinicians to have confidence in their diagnosis and to interpret the attributes of a test with regard to its potential for false positive or false negative results.
Understanding statistical measures of reliability and validity has direct implications for clinical decision-making and patient care. Although it is not imperative for clinicians to memorize the reliability and validity values of all tests aligned with their subspecialty (e.g., musculoskeletal vs. pediatrics), those desiring to implement evidence-based practice into patient care should at minimum be familiar with the tests and measurements they routinely use. As patients and third-party payers become consumers of evidence-based practice and clinicians accept the responsibility for translating research into practice, common sense and clinical judgment must not be overlooked. A test with a reliability coefficient of 0.99 and a positive likelihood ratio of 10 is of no value if it is not appropriate or safe for your patient.
Editor’s note: Those desiring an in-depth review of statistical methods and analysis may want to read Foundations of Clinical Research: Applications to Practice (3rd Edition), by Leslie Gross Portney and Mary P. Watkins.
Gannett Education guarantees this educational activity is free from bias.
