With the routine use of electronic health records (EHRs) in hospitals, health systems, and physician practices, there has been rapid growth in the availability of health care data over the last decade. In addition to the structured data in EHRs, new methods such as natural language processing can derive meaning from unstructured data, permitting the capture of substantial clinical information embedded in clinical notes. Furthermore, the growth in the availability of registries and claims data and the linkages between all these data sources have created a big data platform in health care, vast in both size and scope.
Concurrently, new computational machine learning approaches promise ever-more-accurate prediction. The marvel of Google and of Watson, the inexorability of Moore’s law (ie, computing power doubles every 2 years for the same cost), suggest a future in which medicine will be transformed into an information science, and each clinical decision may be optimized based on a forecasting of outcomes under alternative treatment options, beyond the knowledge and understanding of the individual physician.
Yet despite these innovations and those to come, quantitative risk prediction in medicine has been available for several decades, based on more classical statistical learning from more structured data sources. Despite reports that risk models outperform physicians in prognostic accuracy,1 application in actual clinical practice remains limited. For example, more than 1000 cardiovascular clinical prediction models have been developed and cataloged, yet only a small number of these are routinely used to support decision making in clinical care.2 It seems unlikely that incremental improvements in discriminative performance of the kind typically demonstrated in machine learning research will ultimately drive a major shift in clinical care. In this Viewpoint, we describe 4 major barriers to useful risk prediction that may not be easily overcome by new methods in machine learning and, in some instances, may be more difficult to overcome in the era of big data.
Thoughtful Identification of Risk-Sensitive Decisions
Beyond just generating new models and predictions, the goal of big data analytics is to improve decision making. However, not every decision is sensitive to the type of prognostic information prediction models yield. Prediction modelers frequently give scant attention to the formal properties of clinical decisions that might make them risk-sensitive. Methods to assess a model’s ability to improve decision making have been developed,3 but models are most often assessed by measures of statistical accuracy. Merely providing the probability of a particular outcome, such as readmission risk or 1-year mortality risk, is unlikely to change physician or patient behavior in most settings. It is only when there is a clear decision to be made and the risk threshold for that decision (ie, the level of risk at which the balance of trade-offs changes from one decision to another) is near the population average risk that a prediction model is likely to influence clinical decisions.
For example, the CHA2DS2-VASc score (congestive heart failure, hypertension, age ≥75 years [doubled], diabetes, stroke/transient ischemic attack/thromboembolism [doubled], vascular disease [prior myocardial infarction, peripheral artery disease, or aortic plaque], age 65-74 years, sex category [female]) is a useful model (even with limited discrimination) because, on average, the harms and benefits of anticoagulation for patients with nonvalvular atrial fibrillation are finely balanced. Although such finely balanced and consequential decisions may seem abundant in medicine, and there is certainly no shortage of prediction models, it is surprising how rarely good predictions are brought to bear on difficult clinical decisions in a way that affects clinical outcomes.
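As a minimal sketch of how simple such a score is to compute, the point total can be tallied directly from the components listed above. The function and parameter names are illustrative assumptions, and the sketch returns only the point total; it does not convert points into a stroke risk or a treatment recommendation.

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_tia_te, vascular_disease):
    """Return the CHA2DS2-VASc point total (0 to 9).

    Parameter names are hypothetical; the weights follow the
    component list given in the text.
    """
    score = 0
    score += 1 if chf else 0               # congestive heart failure
    score += 1 if hypertension else 0      # hypertension
    if age >= 75:
        score += 2                         # age >= 75 years (doubled)
    elif age >= 65:
        score += 1                         # age 65-74 years
    score += 1 if diabetes else 0          # diabetes
    score += 2 if stroke_tia_te else 0     # stroke/TIA/thromboembolism (doubled)
    score += 1 if vascular_disease else 0  # prior MI, PAD, or aortic plaque
    score += 1 if female else 0            # sex category (female)
    return score


# Example: a 78-year-old woman with hypertension and a prior stroke.
example_score = cha2ds2_vasc(age=78, female=True, chf=False,
                             hypertension=True, diabetes=False,
                             stroke_tia_te=True, vascular_disease=False)
print(example_score)
```

The arithmetic is trivial; what makes the score useful is that the decision it informs sits close to a risk threshold.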
Calibration: The Achilles Heel of Prediction
Statistical performance is most typically decomposed into discrimination (Do patients with the outcome have higher risk predictions than those without?) and calibration (Do x of 100 patients with a risk prediction of x% have the outcome?).4 Discrimination is usually quantified with a concordance statistic (c statistic, equivalent to area under the receiver operating characteristic curve), whereas calibration can be assessed graphically and with key parameters, such as observed-to-expected ratios. A serious concern about model evaluation practices is that measures of discrimination are emphasized much more than measures of calibration,2 and when measures of calibration are reported they are most often reported in a way that is clinically uninterpretable (eg, a P value corresponding to the Hosmer-Lemeshow χ² statistic for goodness-of-fit).
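The two measures named above can be illustrated with a short sketch. It assumes binary outcomes coded 0 or 1 and uses a small, made-up set of predictions; the function names are illustrative, and the pairwise c statistic shown here is suitable only for small samples.

```python
def c_statistic(predicted, outcomes):
    """Share of (event, non-event) pairs in which the patient with the
    event has the higher predicted risk; tied pairs count as 0.5."""
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event in events:
        for p_non in non_events:
            if p_event > p_non:
                concordant += 1.0
            elif p_event == p_non:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def observed_to_expected(predicted, outcomes):
    """Observed number of events divided by the sum of predicted risks.
    A ratio near 1 indicates good calibration on average."""
    return sum(outcomes) / sum(predicted)


# Hypothetical data for illustration only.
predicted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]
outcomes = [0, 0, 1, 0, 1, 1]
print(round(c_statistic(predicted, outcomes), 3))
print(round(observed_to_expected(predicted, outcomes), 3))
```

In this toy example the observed-to-expected ratio exceeds 1, meaning the predictions underestimate risk on average, a finding that a c statistic alone would not reveal.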
Model calibration is frequently less stable than discrimination, yet highly consequential to decision making.5 Whereas stable discrimination relies on the consistency of the effects of the measured variables and the case mix of the population,6 stable calibration is much more sensitive to differences in the distribution and effects of unmeasured covariates. Indeed, poor calibration might often be expected with application of a model from one population to another. The consequences of poor calibration are frequently underappreciated; it can lead to harmful decisions.5
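A small sketch can make the point concrete. It assumes a hypothetical cohort of 10 patients, a treatment threshold of 25%, and a model that systematically overestimates risk (odds inflated by a factor of 2.5). The net benefit measure used here comes from decision curve analysis and is one common way to score threshold-based decisions; it is not necessarily the method cited in the text. All names and numbers are illustrative.

```python
def inflate_odds(p, factor):
    """Multiply the odds of p by factor; rank order is preserved."""
    odds = p / (1 - p) * factor
    return odds / (1 + odds)


def net_benefit(predicted, outcomes, threshold):
    """Net benefit of treating everyone at or above the threshold:
    true positives per patient minus false positives per patient,
    weighted by the odds of the threshold."""
    n = len(outcomes)
    pairs = list(zip(predicted, outcomes))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical cohort: the original predictions sum to the 3 observed events.
outcomes = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
calibrated = [0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.40, 0.40, 0.70, 0.80]
overestimated = [inflate_odds(p, 2.5) for p in calibrated]

threshold = 0.25
print(round(net_benefit(calibrated, outcomes, threshold), 3))
print(round(net_benefit(overestimated, outcomes, threshold), 3))
```

Because the inflation preserves the ranking of patients, the c statistic is identical for both sets of predictions. The overestimated risks nevertheless push additional patients without the outcome above the threshold, so net benefit falls even though discrimination is unchanged.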
User Trust, Transparency, and Commercial Interests
The incentives of business and the demands of cautious, rigorous, and transparent science are often at odds. In particular, as decision makers in health care delivery systems invest in data, tools, and personnel related to big data, there is enormous pressure for rapid results. The relative ease of implementing newly developed models rapidly in the EHR further contributes to a tendency to operationalize models in routine clinical care without an adequate understanding of their validity and potential influence. To sell their products, vendors of health care information technology have an incentive to overstate the value of predictive analytics generally and of their tools specifically. Thus, the only sure prediction about the future of big data and predictive analytics is that they are unlikely to live up to some of the hype.
More concerning, limited attention has been given to the significant risk of harm, both from wasted resources and from inaccurate predictions that support poor decisions. Some hospital administrators and clinicians are unlikely to have the methodological understanding to critically evaluate the products they purchase. Moreover, even though the current literature is replete with incompletely reported predictive models that cannot be independently evaluated (eg, because model intercepts are not reported), machine learning algorithms are typically more opaque than classic statistical models and developers are reluctant to transparently report proprietary models.
Data Quality and Heterogeneity
The quality of prediction models depends on the quality of the data from which they are derived. Even though increasing amounts of data are available, issues involving data inaccuracy, data missingness, and selective measurement remain substantial concerns when EHR data are used and potentially affect prognostic variables, treatment exposures, and outcome ascertainment. Because patients may see multiple clinicians who use different EHR platforms or who practice in different delivery systems in which data might not be shared, these data are often incomplete, thereby potentially creating bias in effect estimation and ultimately prediction. Although modern imputation methods can mitigate some bias due to informative missingness, these methods are less useful in EHR settings where it is not possible to distinguish the true absence of a relevant characteristic (such as a particular comorbidity) from data incompleteness.
Coding behavior for billing varies from hospital to hospital and EHR completeness varies from clinician to clinician. There is relatively little standardization of EHR data and structures, and information about the performance of natural language processing algorithms is typically unavailable. Application of models derived in one system for use across another is unlikely to yield accurate prediction and, likewise, results from validation studies in one EHR may not apply to others. Indeed, the use of automatically obtained claims and EHR data for model derivation may simply be inappropriate for some real-time clinical decision making, without more rigorous standardization across health care systems and practitioners.
With the wide availability of data and predictive analytics, developing prediction models has never been easier. Options for implementation increase with EHR availability. These developments underscore the need for rigorous studies that evaluate the effects of prediction models on health care decisions and patient outcomes, including cost and quality of care.
Big data and predictive analytics have substantial potential to support better, more efficient care, and there have been notable recent advances, particularly in image analytics.7 However, the potential of prediction to influence decision making also implies the potential for harm, through the dissemination of misinformation at the point of care. This potential for harm from insufficiently validated models in a profit-driven market suggests the need for oversight. The public demands such oversight for new medical technologies such as pharmaceuticals and devices. Physicians, too, are subject to licensing requirements and board certification. Although an inadequately prepared physician may make mistakes one patient at a time, a faulty algorithm could affect larger numbers of patients. An independent agency that certifies prediction models and approaches to deploy them into clinical practice could potentially address some of these challenges and may help ensure that predictive analytics deliver on their promise of better value and outcomes for individual patients and the entire health care system.