
Glossary of common biostatistical and epidemiological terms

Author:
Peter A L Bonis, MD
Section Editors:
Joann G Elmore, MD, MPH
David M Rind, MD
Deputy Editor:
Carrie Armsby, MD, MPH
Literature review current through: Feb 2022. | This topic last updated: Mar 04, 2020.

INTRODUCTION — This topic review will provide a catalog of common biostatistical and epidemiological terms encountered in the medical literature.

STATISTICS THAT DESCRIBE HOW DATA ARE DISTRIBUTED

Measures of central tendency — Three measures of central tendency are most frequently used to describe data (a brief computational sketch follows this list):

Mean equals the sum of observations divided by the number of observations.

Median equals the observation in the middle when all observations are ordered from smallest to largest; when there are an even number of observations, the median is defined as the mean of the middle two data points.

Mode equals the observation that occurs most frequently.
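
As an illustration, the following sketch computes the three measures for a small set of hypothetical observations using Python's standard statistics module; the values are invented for demonstration only.

```python
import statistics

# Hypothetical observations (eg, hospital length of stay in days)
observations = [2, 3, 3, 4, 5, 7, 21]

mean = statistics.mean(observations)      # sum of observations / number of observations = 45/7
median = statistics.median(observations)  # middle value when sorted = 4
mode = statistics.mode(observations)      # most frequent value = 3

print(mean, median, mode)
```

Note that the single outlying value (21) pulls the mean well above the median, which is one reason the median is often preferred for skewed data.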

Measures of dispersion — Dispersion refers to the degree to which data are scattered around a specific value (such as the mean). The most commonly used measures of dispersion are listed below, followed by a brief computational sketch:

Range – The range equals the difference between the largest and smallest observation.

Standard deviation – The standard deviation measures the variability of data around the mean. It provides information on how much variability can be expected among individuals within a population. In samples that follow a "normal" distribution (ie, Gaussian), 68 and 95 percent of values fall within one and two standard deviations of the mean, respectively.

Standard error of the mean – The standard deviation of a sample should be distinguished from the standard error of the mean, which describes how much variability can be expected when the mean is estimated from several different samples. The standard error of the mean equals the standard deviation divided by the square root of the number of observations.

Percentile – The percentile equals the percentage of a distribution that is below a specific value. As an example, a child is in the 90th percentile for weight if only 10 percent of children of the same age weigh more than he or she does.

Interquartile range – The interquartile range refers to the upper and lower values defining the central 50 percent of observations. The boundaries are equal to the observations representing the 25th and 75th percentiles. The interquartile range is depicted in a box and whiskers plot (figure 1).
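
The sketch below computes each of these measures for the same hypothetical sample; note that statistics.quantiles uses its default ("exclusive") method, so the quartile boundaries may differ slightly from other conventions.

```python
import statistics

observations = [2, 3, 3, 4, 5, 7, 21]  # same hypothetical sample as above
n = len(observations)

value_range = max(observations) - min(observations)   # largest minus smallest observation
sample_sd = statistics.stdev(observations)            # variability of individuals around the mean
sem = sample_sd / n ** 0.5                            # expected variability of the sample mean
q1, _, q3 = statistics.quantiles(observations, n=4)   # 25th and 75th percentile boundaries
interquartile_range = (q1, q3)                        # central 50 percent of observations

print(value_range, round(sample_sd, 2), round(sem, 2), interquartile_range)
```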

TERMS USED TO DESCRIBE THE FREQUENCY OF AN EVENT — Incidence and prevalence are the two main terms used to describe the frequency of an event.

Incidence — Incidence represents the number of new events that have occurred in a specific time interval divided by the population at risk at the beginning of the time interval. The result gives the likelihood of developing an event in that time interval.

Prevalence — Prevalence refers to the number of individuals with a given disease at a given point in time divided by the population at risk at that point in time. Prevalence has been further defined as being "point" or "period." Point prevalence refers to the proportion of individuals with a condition at a specified point in time, while period prevalence refers to the proportion of individuals with a condition during a specified interval (eg, a year).

Person-years — Person-years refers to the total number of years of observation or treatment contributed by the members of a study population. It is obtained by summing the number of years that each member of the sample has had a certain condition or undergone a particular treatment; when all members are followed for the same duration, it equals that number of years multiplied by the number of members.
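
A minimal sketch of these frequency measures, using invented numbers for a hypothetical one-year follow-up:

```python
# Hypothetical population followed for one year (all numbers are invented)
population_at_risk = 1000
new_cases_during_year = 20
existing_cases_at_start = 50

incidence = new_cases_during_year / population_at_risk           # 0.02, ie, 2 percent per year
point_prevalence = existing_cases_at_start / population_at_risk  # 0.05, ie, 5 percent

# Person-years: sum each participant's observation time
# (eg, 800 people followed for 1 year and 200 followed for only 6 months)
person_years = 800 * 1.0 + 200 * 0.5                      # 900 person-years
incidence_rate = new_cases_during_year / person_years     # cases per person-year

print(incidence, point_prevalence, person_years, round(incidence_rate, 4))
```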

TERMS USED TO DESCRIBE THE MAGNITUDE OF AN EFFECT — The types of descriptors used to define the relationship among variables of interest in a data set and the effect of one variable on another depend upon the type of data. Important examples are the relative risk and odds ratio, which are commonly encountered expressions describing the relationship between nominal characteristics (ie, variables that are grouped as unique categories) (figure 2).

Relative risk — The relative risk (or risk ratio) equals the incidence in exposed individuals divided by the incidence in unexposed individuals. The relative risk can be calculated from studies in which the proportion of patients exposed and unexposed to a risk is known, such as a cohort study. (See 'Cohort study' below.)

Odds ratio — The odds ratio equals the odds that an individual with a specific condition has been exposed to a risk factor divided by the odds that a control has been exposed. The odds ratio is used in case-control studies and is often generated in multivariate analyses as well (see 'Case-control study' below). The odds ratio provides a reasonable estimate of the relative risk for uncommon conditions (figure 2).

The relative risk and odds ratio are interpreted relative to the number 1. An odds ratio of 0.6, for example, suggests that patients exposed to a variable of interest were 40 percent less likely to develop a specific outcome compared with the control group. Similarly, an odds ratio of 1.5 suggests that the risk was increased by 50 percent.
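
The following sketch calculates a relative risk and an odds ratio from a hypothetical 2 x 2 table; the counts are invented to show the arithmetic only.

```python
# Hypothetical 2 x 2 table
#                 outcome   no outcome
#   exposed        a = 30     b = 70
#   unexposed      c = 10     d = 90
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)                      # incidence among exposed = 0.30
risk_unexposed = c / (c + d)                    # incidence among unexposed = 0.10
relative_risk = risk_exposed / risk_unexposed   # 3.0

odds_ratio = (a * d) / (b * c)                  # (30 x 90) / (70 x 10), approximately 3.86

print(relative_risk, round(odds_ratio, 2))
```

Because the outcome in this example is common (30 percent among the exposed), the odds ratio (3.86) overestimates the relative risk (3.0); with a rarer outcome the two estimates would be close.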

Absolute risk — The relative risk and odds ratio provide an understanding of the magnitude of risk compared with a standard. However, it is often more useful to know the absolute risk (also known as the risk difference). As an example, a 40 percent increase in mortality due to a particular exposure does not provide direct insight into the likelihood that exposure in an individual patient will lead to mortality. In some cases, a large relative risk reduction may not be clinically important. A 50 percent reduction in an outcome, for example, is much more impressive if the baseline rate of the outcome is 25 percent than if it is 1 percent.

The "attributable risk" represents the difference in the rate of a disease in an exposed, compared with a nonexposed, population. It reflects the additional incidence of disease related to an exposure, taking into account the background rate of the disease. The attributable risk is calculated by subtracting the incidence of a disease in nonexposed persons from the incidence of disease in exposed persons.

A related term, the "population attributable risk" is used to describe the contribution that an exposure has on the incidence of a specific disease in a population. It is calculated by multiplying the attributable risk by the prevalence of exposure to a risk factor in a population. The population attributable risk is particularly important when considering public health measures and the allocation of resources intended to reduce the incidence of a disease.
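
A brief sketch of the attributable risk and population attributable risk calculations described above, using invented incidence and exposure figures:

```python
# Invented figures (events per person per year)
incidence_exposed = 0.030
incidence_unexposed = 0.010
prevalence_of_exposure = 0.25   # 25 percent of the population is exposed

attributable_risk = incidence_exposed - incidence_unexposed                # 0.02
population_attributable_risk = attributable_risk * prevalence_of_exposure  # 0.005

print(round(attributable_risk, 3), round(population_attributable_risk, 4))
```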

Number needed to treat — The benefit of an intervention can be expressed by the "number needed to treat (NNT)." NNT is the reciprocal of the absolute risk reduction (ARR; the event rate in the control arm minus the event rate in the treatment arm). The NNT can be interpreted as follows: "This study suggests that for every five patients treated with the new treatment, one additional death would be prevented compared with the control treatment."

As an example, consider a clinical trial involving 100 patients randomized to treatment with a new drug or placebo, with 50 patients in each arm. Thirty patients died during the study period (10 receiving active drug and 20 receiving placebo), giving a mortality rate of 20 percent with active drug versus 40 percent with placebo, as shown in the left panel of the figure (figure 3). The absolute risk difference between the two treatment arms is used to calculate NNT.

ARR = 40 percent minus 20 percent = 20 percent = 0.2

NNT = 1 divided by ARR = 1 divided by 0.2 = 5

Thus, this study suggests that only five patients need to be treated with the drug to prevent one death (compared with placebo).
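
The same calculation can be expressed as a short sketch using the trial numbers above:

```python
deaths_drug, n_drug = 10, 50          # active drug arm
deaths_placebo, n_placebo = 20, 50    # placebo arm

event_rate_drug = deaths_drug / n_drug           # 0.20
event_rate_placebo = deaths_placebo / n_placebo  # 0.40

arr = event_rate_placebo - event_rate_drug       # absolute risk reduction = 0.20
nnt = round(1 / arr)                             # number needed to treat = 5

print(round(arr, 2), nnt)
```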

Because it is intuitive, the NNT is a popular way to express absolute benefit or risk, potentially allowing for comparison of the relative benefit (or harm) of different interventions. However, the NNT can be misleading:

It implies that the option is to treat or not to treat, rather than to treat or switch to another more effective treatment [1].

There are variations on how NNT is determined; NNTs from different studies cannot be compared unless the methods used to determine them are identical [2]. This may be a particular consideration when NNTs are calculated for treatment of chronic diseases in which outcomes (such as mortality) do not cluster in time.

Calculation of the NNT depends upon the control rate (ie, the rate of events in the control arm). The control rate can be variable (particularly in small controlled trials, which are more vulnerable to random effects). As a result, the NNT may not accurately reflect the benefit of an intervention if events occurred in the control arm more or less than would be expected based upon the biology of the disease. This effect can be particularly problematic when comparing the NNTs among placebo-controlled trials (figure 3) [3].

When the outcome is a harm rather than a benefit, a number needed to harm (NNH) can be calculated similarly. (See 'Number needed to harm' below.)

Other variations that sometimes appear in the medical literature include number needed to prevent and number needed to diagnose.

Number needed to harm — NNH is a measure of harm caused by the investigational treatment. Like the NNT, the NNH is the reciprocal of the absolute risk difference, which in this case, is an increase rather than reduction (ie, the event rate in the treatment arm minus the event rate in the control arm). The NNH can be interpreted as follows: "This study suggests that treating 20 patients with the investigational treatment would result in one additional adverse event compared with the control treatment."

As an example, consider a randomized trial comparing an investigational new drug versus the current standard treatment for a certain condition. Adverse drug reactions occurred in 20 percent of patients treated with the new drug compared with 15 percent with the standard therapy. Thus, the absolute risk difference is 5 percent (20 minus 15), and the NNH is 20 (1 divided by 0.05). This means that for every 20 patients treated with the new drug, there would be one additional adverse drug reaction compared with standard therapy.
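
A parallel sketch for the NNH, using the adverse event rates from the example above:

```python
adverse_rate_new_drug = 0.20
adverse_rate_standard = 0.15

absolute_risk_increase = adverse_rate_new_drug - adverse_rate_standard  # 0.05
nnh = round(1 / absolute_risk_increase)                                 # number needed to harm = 20

print(round(absolute_risk_increase, 2), nnh)
```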

TERMS USED TO DESCRIBE THE QUALITY OF MEASUREMENTS — The most commonly used measures to describe the quality of an observation are reliability and validity.

Reliability — Reliability refers to the extent to which repeated measurements of a relatively stable phenomenon fall closely to each other. Several different types of reliability can be measured, such as inter- and intraobserver reliability and test-retest reliability.

As an example, the kappa statistic is a measure of the agreement between two observers who independently measure the same data. It can range from -1.0 to +1.0. If there is perfect agreement, the value is 1.0, whereas if the observed agreement is what would be expected by chance alone, the value is 0. If the degree of agreement is worse than what would be expected by chance, the kappa value will be negative, with complete disagreement resulting in a value of -1.0. Kappa statistics are often interpreted as follows (a brief computational sketch appears after this scale):

Excellent agreement – 0.8 to 1.0

Good agreement – 0.6 to 0.8

Moderate agreement – 0.4 to 0.6

Fair agreement – 0.2 to 0.4

Poor agreement – Less than 0.2
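
A minimal sketch of how the kappa statistic can be computed for two observers rating the same cases; the agreement table is hypothetical.

```python
# Hypothetical ratings of 100 cases by two observers (abnormal vs normal)
#                         observer B abnormal   observer B normal
#   observer A abnormal          a = 40               b = 10
#   observer A normal            c = 5                d = 45
a, b, c, d = 40, 10, 5, 45
n = a + b + c + d

observed_agreement = (a + d) / n                    # 0.85

# Agreement expected by chance, based on each observer's marginal rates
p_abnormal_a = (a + b) / n
p_abnormal_b = (a + c) / n
expected_agreement = (p_abnormal_a * p_abnormal_b
                      + (1 - p_abnormal_a) * (1 - p_abnormal_b))  # 0.50

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 2))  # 0.7, ie, "good" agreement on the scale above
```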

Validity — Validity refers to the extent to which an observation reflects the "truth" of the phenomenon being measured. Several types can be measured. In studies involving questionnaire development, for example, types of validity include content (the extent to which the measure reflects the dimensions of a particular problem), construct (the extent to which a measure is affirmed by an external established indicator), and criterion validity (the extent to which a measure can predict an observable phenomenon). These types of validity are useful since the "truth" may not be physically verifiable.

MEASURES OF DIAGNOSTIC TEST PERFORMANCE — The most common terms used to describe the performance of a diagnostic test are sensitivity and specificity (table 1). (See "Evaluating diagnostic tests".)

Sensitivity — The number of patients with a positive test who have a disease divided by all patients who have the disease. A test with high sensitivity will not miss many patients who have the disease (ie, few false-negative results).

Specificity — The number of patients who have a negative test and do not have the disease divided by the number of patients who do not have the disease. A test with high specificity will infrequently identify patients as having a disease when they do not (ie, few false-positive results).

Sensitivity and specificity are properties of tests that should be considered when tests are obtained. In addition, sensitivity and specificity are interdependent. Thus, for a given test, an increase in sensitivity is accompanied by a decrease in specificity and vice versa. This can be illustrated by the following example. Consider two populations of patients: One has chronic hepatitis as defined by a reference standard such as a liver biopsy, and the other does not. The diagnostic test being used to evaluate for chronic hepatitis is the serum alanine aminotransferase (ALT) concentration. The sensitivity and specificity of the ALT depend upon the value chosen as a cutoff (figure 4).
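
This cutoff dependence can be illustrated with a short sketch; the ALT values below are invented and are not intended to reflect actual clinical thresholds.

```python
# Invented ALT values for patients with and without chronic hepatitis
alt_with_disease = [35, 48, 60, 75, 90, 120, 150, 200]
alt_without_disease = [10, 15, 20, 25, 30, 35, 40, 55]

def sensitivity_specificity(cutoff):
    true_positives = sum(value >= cutoff for value in alt_with_disease)
    true_negatives = sum(value < cutoff for value in alt_without_disease)
    sensitivity = true_positives / len(alt_with_disease)
    specificity = true_negatives / len(alt_without_disease)
    return sensitivity, specificity

# Raising the cutoff increases specificity at the expense of sensitivity
for cutoff in (30, 45, 60):
    print(cutoff, sensitivity_specificity(cutoff))
```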

The interdependence of sensitivity and specificity can be depicted graphically using a receiver operating characteristic (ROC) curve. The ROC curve plots sensitivity on the Y axis and 1-specificity (which is the false-positive rate) on the X axis. The area under the ROC curve (the area to the right of and below the curve) gives an estimate of the accuracy of a test. An ideal test would have a cutoff value that perfectly discriminated between those with and without the disease and would have an area under the ROC curve of 1.00 (figure 5). The ROC curve can be adapted to multivariate analysis (such as logistic regression), in which case it provides an estimate of the accuracy of the statistical model (ie, how well it predicts an outcome).

Predictive values — In addition to sensitivity and specificity, the predictive values of a diagnostic test must be considered when interpreting the results of a test (calculator 1). The positive predictive value of a test represents the likelihood that a patient with a positive test has the disease. Conversely, the negative predictive value represents the likelihood that a patient who has a negative test is free of the disease (table 1).

The predictive values (and the proportion of positive and negative evaluations that can be expected) depend upon the prevalence of a disease within a population. Thus, for given values of sensitivity and specificity, a patient with a positive test is more likely to truly have the disease if the patient belongs to a population with a high prevalence of the disease (figure 6).

This observation has significant implications for screening tests, in which false-positive results may lead to expensive and sometimes dangerous testing and false-negative tests may be associated with morbidity or mortality. As an example, a positive stool test for occult blood is much more likely to predict colon cancer in a 70-year-old compared with a 20-year-old. Thus, routine screening of stools in young patients would lead to a high rate of subsequent false-positive examinations and is not recommended. The predictive values of a test should be considered when selecting among diagnostic tests for an individual patient in whom demographic or other clinical risk factors influence the likelihood that the disease is present (ie, the "prior probability" of the disease).
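
The dependence of predictive values on prevalence can be demonstrated with a short sketch; the sensitivity, specificity, and prevalences below are assumed values chosen for illustration.

```python
sensitivity, specificity = 0.90, 0.90   # assumed test characteristics

def predictive_values(prevalence, population=100_000):
    diseased = prevalence * population
    healthy = population - diseased
    true_pos = sensitivity * diseased
    false_pos = (1 - specificity) * healthy
    true_neg = specificity * healthy
    false_neg = (1 - sensitivity) * diseased
    ppv = true_pos / (true_pos + false_pos)   # positive predictive value
    npv = true_neg / (true_neg + false_neg)   # negative predictive value
    return round(ppv, 3), round(npv, 3)

print(predictive_values(0.01))   # low prevalence: PPV is only about 0.08
print(predictive_values(0.30))   # high prevalence: PPV rises to about 0.79
```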

Likelihood ratio — As discussed above, a limitation to predictive values as expressions of test characteristics is their dependence upon disease prevalence. To overcome this limitation, the likelihood ratio has been used as an expression of the performance of diagnostic tests [4]. Likelihood ratios are an expression of sensitivity and specificity that can be used to estimate the odds that a condition is present or absent (calculator 2). (See "Evaluating diagnostic tests".)

The likelihood ratio describes how a given test result changes the odds of having a disease relative to the pretest (prior) probability of the disease. The estimate is independent of the disease prevalence. A positive likelihood ratio is calculated by dividing sensitivity by 1 minus specificity: sensitivity/(1-specificity). Similarly, a negative likelihood ratio is calculated by dividing 1 minus sensitivity by specificity: (1-sensitivity)/specificity. Positive and negative likelihood ratios of 9 and 0.25, for example, mean that a positive result is seen 9 times as frequently, and a negative result 0.25 times as frequently, in those with the condition as in those without it. Likelihood ratios can be established for many cutoff points of a diagnostic test, permitting an appreciation of the relative importance of a large versus a small increase in a test result.
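
A short sketch of these calculations, including how a positive likelihood ratio converts a pretest probability into a posttest probability; the test characteristics and pretest probability are assumed values.

```python
sensitivity, specificity = 0.90, 0.90           # assumed test characteristics

lr_positive = sensitivity / (1 - specificity)   # 9.0
lr_negative = (1 - sensitivity) / specificity   # approximately 0.11

# Applying the positive likelihood ratio to an assumed pretest probability
pretest_probability = 0.20
pretest_odds = pretest_probability / (1 - pretest_probability)   # 0.25
posttest_odds = pretest_odds * lr_positive                       # 2.25
posttest_probability = posttest_odds / (1 + posttest_odds)       # approximately 0.69

print(round(lr_positive, 1), round(lr_negative, 2), round(posttest_probability, 2))
```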

Accuracy — The performance of a diagnostic test is sometimes expressed as accuracy, which refers to the number of true positives and true negatives divided by the total number of observations (table 2). However, accuracy by itself is not a good indicator of test performance, since it obscures important information related to its component parts.

Net reclassification improvement and integrated discrimination improvement — Net reclassification improvement (NRI) is a method for evaluating improvements in risk prediction from diagnostic tests and prediction models. It attempts to quantify the extent to which the addition of a diagnostic test or prediction model will influence clinical practice. A typical use of the NRI is to depict the extent to which a new model correctly classifies patients (eg, for risk of dying) compared with the old model. Although the measure is appealing, there is variability in how the NRI is calculated and reported [5].

Another measure, integrated discrimination improvement (IDI), also attempts to provide a quantitative view on how much value a new diagnostic test or prediction rule provides [6].

EXPRESSIONS USED WHEN MAKING INFERENCES ABOUT DATA

Confidence interval — A point estimate (ie, a single value) from a sample population may not reflect the "true" value from the entire population. As a result, it is often helpful to provide a range that is likely to include the true value. A confidence interval is a commonly used estimate. The boundaries of a confidence interval give values within which there is a high probability (95 percent by convention) that the true population value can be found. The calculation of a confidence interval considers the standard deviation of the data and the number of observations. Thus, a confidence interval narrows as the number of observations increases or its variance (dispersion) decreases. The interpretation of confidence intervals is discussed in more detail separately. (See "Proof, p-values, and hypothesis testing", section on 'Confidence intervals'.)
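
A minimal sketch of a 95 percent confidence interval for a sample mean, using the normal-approximation critical value of 1.96; the blood pressure values are invented, and for small samples a t-based critical value would give a slightly wider interval.

```python
import statistics

# Hypothetical systolic blood pressure measurements (mmHg)
sample = [118, 122, 125, 130, 134, 128, 121, 127, 133, 126]
n = len(sample)

mean = statistics.mean(sample)
sem = statistics.stdev(sample) / n ** 0.5    # standard error of the mean

lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(round(mean, 1), (round(lower, 1), round(upper, 1)))
```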

Credible interval — A credible interval is used in Bayesian analysis to describe the range in which a posterior probability estimate is likely to reside. As an example, a 95 percent credible interval for a posterior probability estimate of 40 percent could range from 30 to 50 percent, indicating that there is a 95 percent chance that the true posterior probability estimate lies within the 30 to 50 percent range. There are fundamental differences in how credible intervals are derived compared with the more commonly used confidence intervals. Nevertheless, their intuitive interpretation is similar.

Errors — Two potential errors are commonly recognized when testing a hypothesis:

A type I error (also referred to as an "alpha error") is incorrectly concluding that there is a statistically significant difference in a dataset; the probability of making a type I error is called "alpha." A typical value for alpha is 0.05. Thus, a p<0.05 leads to a decision to reject the null hypothesis, although lower values for claiming statistical significance have been proposed [7]. (See "Proof, p-values, and hypothesis testing", section on 'P-values'.)

A type II error (also referred to as a "beta error") is incorrectly concluding that there was no statistically significant difference in a dataset; the probability of making a type II error is called "beta." This error often reflects insufficient power of the study.

Power — The term "power" (calculated as 1 – beta) refers to the ability of a study to detect a true difference. Negative findings in a study may reflect that the study was underpowered to detect a difference. A "power calculation" should be performed prior to conducting a study to be sure that there are a sufficient number of observations to detect a desired degree of difference. The larger the difference, the fewer the number of observations that will be required. As an example, it takes fewer patients to detect a 50 percent difference in blood pressure from a new antihypertensive medication compared with placebo than a 5 percent difference. The interpretation of power calculations is discussed in more detail separately. (See "Proof, p-values, and hypothesis testing", section on 'Power in a negative study'.)
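
A sketch of a simple sample-size (power) calculation for comparing two proportions, using the common normal-approximation formula; the event rates, alpha, and power below are assumptions chosen only to show how a larger expected difference reduces the required sample size.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a difference between
    two proportions with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # approximately 1.96
    z_beta = NormalDist().inv_cdf(power)            # approximately 0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

print(sample_size_per_group(0.40, 0.20))   # large difference: roughly 80 per group
print(sample_size_per_group(0.40, 0.36))   # small difference: well over 2000 per group
```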

TERMS USED IN MULTIVARIATE ANALYSIS — The effect of more than one variable often needs to be considered when predicting an outcome. As an example, the effect of smoking status and age needs to be simultaneously considered when assessing the risk of lung cancer.

Statistical methods that can simultaneously account for multiple variables are known as "multivariate" (or multivariable) analysis. These methods help to "control" (or "adjust") for variables that are extraneous to the main causal question and might confound it. A commonly encountered form of multivariable analysis, logistic regression, is applied to models in which the outcome is dichotomous (eg, alive or dead, or a complication occurs or does not occur).
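
As a rough sketch of how such a model might be fit, the code below uses scikit-learn's LogisticRegression on invented data with two predictors (age and smoking status) and a dichotomous outcome; note that scikit-learn applies regularization by default, so the exponentiated coefficients only approximate unpenalized adjusted odds ratios.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: columns are age (years) and smoking status (1 = smoker)
X = np.array([[45, 0], [52, 1], [60, 1], [38, 0], [70, 1],
              [55, 0], [63, 1], [41, 0], [68, 0], [59, 1]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0])   # dichotomous outcome (1 = disease)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Exponentiated coefficients approximate the adjusted odds ratio for each predictor
adjusted_odds_ratios = np.exp(model.coef_[0])
print(adjusted_odds_ratios)

# Predicted probability of the outcome for a 50-year-old smoker
print(model.predict_proba([[50, 1]]))
```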

TIME-TO-EVENT ANALYSIS (SURVIVAL ANALYSIS) — Many examples of medical research deal with an event that may or may not occur in a given period of time (such as death, stroke, myocardial infarction). During the study, several outcomes are possible in addition to the outcome of interest (eg, patients might die of other causes or drop out from the analysis). Furthermore, the duration of follow-up can vary among individuals in the study. A patient who is observed for five years should count more in the statistical analysis than one observed for five months.

Several methods are available to account for these considerations. The most commonly used methods in medical research are Kaplan-Meier and Cox proportional hazards analyses.

Kaplan-Meier analysis — Kaplan-Meier analysis measures the ratio of surviving patients (or those free from an outcome) to the total number of patients at risk for the outcome. Every time a patient has an outcome, the ratio is recalculated. Using these calculations, a curve can be generated that graphically depicts the probability of survival as time passes (figure 7).
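
A simplified sketch of the Kaplan-Meier calculation for a small hypothetical group; each tuple holds the follow-up time in months and whether the outcome occurred (1) or the patient was censored (0). This toy version assumes no two patients share the same event time.

```python
# Hypothetical follow-up data: (months of follow-up, outcome occurred?)
follow_up = [(2, 1), (3, 0), (5, 1), (7, 1), (8, 0), (10, 1), (12, 0)]

at_risk = len(follow_up)
survival = 1.0
curve = [(0, survival)]

for months, event in sorted(follow_up):
    if event == 1:
        # Recalculate the surviving proportion each time an outcome occurs
        survival *= (at_risk - 1) / at_risk
        curve.append((months, round(survival, 3)))
    at_risk -= 1   # censored patients simply leave the risk set

print(curve)   # pairs of (time, estimated probability of remaining event free)
```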

In many studies, the benefit of a drug or intervention on an outcome is compared with a control population, permitting the construction of two or more Kaplan-Meier curves. Curves that are close together or cross are unlikely to reflect a statistically significant difference. Several formal statistical tests can be used to assess a significant difference. Examples include the log-rank test and the Breslow test.

Cox proportional hazards analysis — Cox proportional hazards analysis is similar to logistic regression because it can account for many variables that are relevant for predicting a dichotomous outcome. However, unlike logistic regression, Cox proportional hazards analysis permits time to be included as a variable and for patients to be counted only for the period of time in which they were observed.

The term "hazard ratio" is sometimes used when referring to variables included in the analysis. A hazard ratio is analogous to an odds ratio. Thus, a hazard ratio of 10 means that a group of patients exposed to a specific risk factor has 10 times the chance of developing the outcome compared with unexposed controls.

STUDY DESIGNS

Cohort study — A cohort is a clearly identified group of people to be studied. A cohort study might identify persons specifically because they were or were not exposed to a risk factor, or it might enroll a random sample of a given population. A cohort study can then move forward to observing the outcome of interest, even if the data are collected retrospectively. As an example, a group of patients who have variable exposure to a risk factor of interest can be followed over time for an outcome.

The Nurses' Health Study is an example of a cohort study. A large number of nurses are followed over time for an outcome such as colon cancer, providing an estimate of the risk of colon cancer in this population. In addition, dietary intake of various components can be assessed, and the risk of colon cancer in those with high and low intake of fiber can be evaluated to determine if fiber is a risk factor (or a protective factor) for colon cancer. The relative risk of colon cancer in those with high or low fiber intakes can be calculated from such a cohort study. (See 'Relative risk' above.)

Case-control study — A case-control study starts with the outcome of interest and works backward to the exposure. For instance, patients with a disease are identified and compared with controls for exposure to a risk factor. This design does not permit measurement of the proportion of the population who were exposed to the risk factor and then developed or did not develop the disease; thus, the relative risk or the incidence of disease cannot be calculated. However, in case-control studies, the odds ratio provides a reasonable estimate of the relative risk (figure 2). (See 'Odds ratio' above.)

If one were to perform a case-control study to assess the role of dietary fiber in colon cancer as noted above for the cohort study, a group of patients with colon cancer could be compared with matched controls without colon cancer; the fiber intake in the two groups would then be compared. The case-control study is most useful for uncommon diseases in which a very large cohort would be required to accumulate enough cases for analysis.

Randomized controlled trial — A randomized controlled trial (RCT) is an experimental design in which patients are assigned to two or more interventions. One group of patients is often assigned to a placebo (placebo control), but a randomized trial can involve two active therapies (active control).

As an example, patients with a prior colonic polyp could be randomly assigned to take a fiber supplement or a placebo supplement to determine if fiber supplementation decreases the risk of developing colon cancer.

RCTs are generally the only type of study that can adequately control for unmeasured confounders and are generally the best evidence for proving causality. (See "Proof, p-values, and hypothesis testing", section on 'Explanation for the results of a study'.)

Intention to treat — The central principle underlying intention-to-treat analysis is that study participants should be analyzed according to the groups in which they were randomized, even if they did not receive or comply with treatment. Such analysis is contrasted with "as-treated" (or "per-protocol") analysis, in which subjects are analyzed according to the actual treatment that they received.

The theoretical advantage of intention-to-treat analysis is that it preserves the benefits of randomization (ie, assuring that all of the unmeasured factors that could differ between the treatment and control groups remain accounted for in the analysis). For example, it is possible that patients who complied with treatment differed in some important ways from those who did not. Another way to consider the advantage of intention-to-treat analysis is that it better accounts for the factors that can influence the outcomes of a prescribed treatment, not just the effects on those who adhered to it. A drug that has serious side effects but is highly effective, for example, might look favorable in an "as-treated" analysis but less favorable in an intention-to-treat analysis if the majority of patients could not tolerate it.

Although conceptually simple, analyzing according to intention-to-treat principles can be complex. For example, optimal methods to account for subjects who were lost to follow-up remain uncertain [8]. Such patients could be considered treatment failures; however, this approach can be overly punitive for an otherwise promising therapy. Because of these complexities, studies that report having performed an intention-to-treat analysis may not always have done so, or may have modified the approach in some way [9].

Mendelian randomization — Mendelian randomization refers to a nonexperimental epidemiological study design that examines the impact of natural genetic variation in the population on the relationship between an environmental exposure and disease. The primary goal is to establish evidence for a causal association between the exposure and disease.

In this design, subjects comprising the study population are classified according to their genotype at a specific polymorphic locus known to modify the exposure of interest but not directly influence susceptibility to the disease of interest. Because subjects are typically unaware of their genotype status, and because a subject's genotypes are randomly assigned during meiosis (according to Mendel's law of independent assortment), this exposure modification can be considered a form of natural randomization. This is therefore referred to as "Mendelian randomization." (See "Mendelian randomization".)

Topic 2759 Version 22.0

References

1. Relative risk reduction versus number needed to treat as measures of lipid-lowering trial results.

2. Implications of trial results: the potentially misleading notions of number needed to treat and average duration of life gained.

3. Number-needed-to-treat and placebo-controlled trials.

4. A perspective on standardizing the predictive power of noninvasive cardiovascular tests by likelihood ratio computation: 1. Mathematical principles.

5. Net reclassification improvement: computation, interpretation, and controversies: a literature review and clinician's guide.

6. Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models.

7. The Proposal to Lower P Value Thresholds to .005.

8. Discordance between reported intention-to-treat and per protocol analyses.

9. The intention-to-treat approach in randomized controlled trials: are authors saying what they do and doing what they say?