Promises and Pitfalls of 'Big' Medical Records Data for Research

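The widespread adoption of electronic health records has enabled the passive collection of large amounts of computerized medical data. Researchers hope to turn these data into insights that meaningfully improve clinical care and patient outcomes. However, enthusiasm for the impressive size and availability of these datasets should not diminish our awareness of their weaknesses; as with all research, conclusions must be drawn carefully and be well supported by the data. This talk gives an overview of electronic health record data, reviews its potential advantages, and outlines five common pitfalls, along with suggestions for mitigating them.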


1. Promises & pitfalls: Using ‘big’ medical records data for research Kathryn Rough, ScD XLDB Conference April 4, 2019 Confidential + Proprietary

2. Disclosures & acknowledgements I am employed by Google, where I do research on machine learning for health applications. This work was done in collaboration with John T. Thompson, MD

3. What is an electronic health record (EHR)?
Digitized records of clinical encounters
● Medical history
● Diagnoses
● Medications
● Procedures
● Treatment plans
● Allergies
● Laboratory and test results
● ...
>90% of US hospitals now use EHRs
~150k measurements per hospitalization

4. “U.S. health care data alone reached 150 exabytes in 2011. Five exabytes (10^18 gigabytes) of data would contain all the words ever spoken by human beings on earth.” Cottle M, Kanwal S, Kohn M, Strome T, Treister N (2013)

5. However...

6. EHR data has limitations. Having more data doesn’t necessarily solve these.

7. EHR data has limitations. Having more data doesn’t necessarily solve these. How do we draw useful conclusions from the data we have?

8. This talk
● ‘Big data’ basics
● Promises of EHR data
● Pitfalls of EHR data (& how to avoid them!)
● Q&A

9. ‘Big data’ basics

10. Dimensions of ‘big’ data: wide datasets
● Genetic data
● Single human genome contains >3 billion base pairs
● Typical genome-wide association study (GWAS):
  ○ 200,000 to 2 million single nucleotide polymorphisms (SNPs)
Visscher PM et al. (2017) American Journal of Human Genetics

11. Dimensions of ‘big’ data: long datasets
● Census data
● Survey of 117 million households
  ○ Over 308 million individuals covered
  ○ Only 10 questions
Lofquist D, Lugiala T, O’Connell M, Feliz S. (2010) Households and families: 2010 census brief.

12. Dimensions of EHR data
● Long
● Wide
● + Temporal component

13. What can we do with EHR data?
● Description
  ○ How many patients are diagnosed with hypertension each year?
● Prediction
  ○ Which patients are at highest risk of developing hypertension?
● Causal inference
  ○ When should I start a patient on antihypertensive drugs?

14. Promises

15. Promises of EHR data: scalable data collection
● EHR data
  ○ High initial cost to implement systems
  ○ Relatively small marginal cost for a new patient or a new measurement
    ■ Data collection must facilitate clinical reality
● Traditional medical research
  ○ Each additional patient and measurement increases study costs

16. Promises of EHR data: number and variety of patients
● EHR data
  ○ Large absolute number of patients
  ○ Includes all patients seeking care at clinical site
  ○ Captures rare diseases, uncommon events, and many patient subgroups
● Traditional medical research
  ○ Substantially smaller absolute number of patients
  ○ Depending on inclusion & exclusion criteria, may not be representative of any real-world population
  ○ Recruitment often condition-specific

17. Promises of EHR data: variety and depth of variables
● EHR data
  ○ Captures wide variety of clinical & utilization data
  ○ Information necessary for delivering clinical care
    ■ Substantial amount of detail included
    ■ Information captured on all clinical conditions
● Traditional medical research
  ○ Scope of variables collected constrained by specific research question
    ■ Requires a priori specification of needed information

18. Promises of EHR data: promptness of research
● EHR data
  ○ Data passively collected on an ongoing basis
  ○ As clinical research questions arise, data are available to investigators
● Traditional medical research
  ○ Data collection can begin only after research questions are specified
    ■ Can take years to complete

19. Pitfalls

20. Pitfall #1: data quality
● Errors in data entry
● Important information in unstructured text
● Rule-out diagnoses/upcoding
● Errors in data processing
[Slide graphics show the example values “250.0” and “98° C”]

21. Pitfall #1: data quality
● What can be impacted:
  ○ Inexact measurement of case definitions, inclusion criteria, predictors, exposures, or outcomes
● What can be done:
  ○ Think through (and check!) data quality when conducting analyses
  ○ Validation study for key measures
    ■ External source
    ■ Subset of data
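A validation study of the kind suggested above could be summarized as follows. This is only a minimal sketch: it assumes a binary EHR-derived measure (for example, presence of a diagnosis code) compared against a reference standard such as chart review on a subset of patients. The function name, the field names, and the ten-patient data are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def validate_measure(ehr_flags, reference_flags):
    """Compare an EHR-derived binary measure with a reference standard."""
    if len(ehr_flags) != len(reference_flags):
        raise ValueError("inputs must have the same length")
    tp = fp = fn = tn = 0
    for ehr, ref in zip(ehr_flags, reference_flags):
        if ehr and ref:
            tp += 1  # EHR says yes, reference confirms
        elif ehr and not ref:
            fp += 1  # EHR says yes, reference disagrees
        elif not ehr and ref:
            fn += 1  # EHR misses a true case
        else:
            tn += 1  # both say no

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return ValidationResult(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


# Hypothetical validation subset of 10 patients (made-up values).
ehr_code_present = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
chart_review_confirms = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

result = validate_measure(ehr_code_present, chart_review_confirms)
print(result)
```

A low positive predictive value in such a check would suggest that the code alone is a poor case definition, for instance when rule-out diagnoses are recorded with the same code as confirmed ones.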

22. Pitfall #2: patient loss to follow-up What we observe in the EHR data

23. Pitfall #2: patient loss to follow-up The ‘complete’ picture

24. Pitfall #2: patient loss to follow-up
● What can be impacted:
  ○ Descriptive studies: undercounting of events
  ○ Predictive studies: incorrect outcome labels, biased performance metrics
  ○ Causal inference studies: selection/collider bias
● What can be done:
  ○ Linkage to additional data sources
  ○ Consider alternative data sources (e.g., claims data)
  ○ Quantitative bias analysis
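One very simple form of quantitative bias analysis for undercounting is to ask how the event count would change under different assumptions about how many events the EHR actually captures. The sketch below assumes each true event is captured independently with the same probability; that assumption, the function name, and the observed count of 120 are all hypothetical, and real bias analyses are usually more elaborate.

```python
def corrected_event_count(observed_events, capture_probability):
    """Estimate the true event count under an assumed capture probability.

    Assumes every true event has the same, independent chance of being
    recorded in the EHR (an illustrative simplification).
    """
    if not 0 < capture_probability <= 1:
        raise ValueError("capture_probability must be in (0, 1]")
    return observed_events / capture_probability


observed = 120  # made-up number of events seen in the EHR

# Sweep over plausible capture probabilities rather than trusting one value.
for p in (1.0, 0.9, 0.75, 0.6):
    estimate = corrected_event_count(observed, p)
    print(f"assumed capture={p:.2f} -> estimated true events={estimate:.0f}")
```

Reporting the whole range, rather than a single corrected number, makes clear how sensitive a descriptive result is to patients receiving care outside the observed system.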

25. Pitfall #3: overemphasizing statistical significance

26. Pitfall #3: overemphasizing statistical significance
● Statistical significance ≠ clinical significance
● Magnification of model misspecification
● Data dredging & p-hacking
[Slide graphic: p < 0.001]

27. Pitfall #3: overemphasizing statistical significance
● What can be impacted:
  ○ Very small p-values for small magnitude of effect
  ○ Narrow confidence intervals that exclude the true value
  ○ Type-I errors (false positives)
● What can be done:
  ○ If p-values are presented, also report clinically meaningful effect estimates
  ○ Pre-specify analyses
  ○ Report all subgroup tests, regardless of significance
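The first point can be illustrated with a short calculation: the same small effect gives a very different p-value depending only on sample size. This is a sketch using a two-sample z-test with a known common standard deviation. The 0.2 mmHg difference in blood pressure, the 15 mmHg standard deviation, and the sample sizes are made-up values chosen for illustration, with the difference assumed to be clinically negligible.

```python
import math


def two_sample_z_p_value(mean_difference, sd, n_per_group):
    """Two-sided p-value for a difference in means (equal n, known common SD)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_difference) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


difference_mmhg = 0.2  # hypothetical, assumed clinically negligible
sd_mmhg = 15.0         # hypothetical common standard deviation

for n in (1_000, 1_000_000):
    p = two_sample_z_p_value(difference_mmhg, sd_mmhg, n)
    half_width = 1.96 * sd_mmhg * math.sqrt(2.0 / n)
    print(f"n per group={n:>9,}: p={p:.3g}, 95% CI half-width={half_width:.3f} mmHg")
```

The effect estimate is identical in both rows; only the p-value and the interval width change, which is why the effect size and its clinical meaning need to be reported alongside any p-value.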

28. Pitfall #4: confounding
[Diagram with nodes: Age, Influenza vaccination, Influenza complications]

29. Pitfall #4: confounding
[Diagram with nodes: Age, Influenza vaccination, Influenza complications]
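A small worked example may help show why confounding matters here. The sketch assumes that age influences both vaccination and complications, which is what the slide's diagram appears to depict. All counts are made up: older patients are more often vaccinated and have more complications, and within each age group vaccination lowers risk. The function names are hypothetical.

```python
# (age_group, vaccinated) -> (number of people, number with complications)
# All counts are hypothetical and chosen only to illustrate confounding.
counts = {
    ("young", True): (2000, 20),
    ("young", False): (8000, 160),
    ("old", True): (8000, 640),
    ("old", False): (2000, 200),
}


def crude_risk_difference(table):
    """Risk in vaccinated minus risk in unvaccinated, ignoring age."""
    people = {True: 0, False: 0}
    events = {True: 0, False: 0}
    for (_, vaccinated), (n, e) in table.items():
        people[vaccinated] += n
        events[vaccinated] += e
    return events[True] / people[True] - events[False] / people[False]


def standardized_risk_difference(table):
    """Age-stratified risk differences, weighted by each stratum's size."""
    strata = sorted({stratum for stratum, _ in table})
    total = sum(n for n, _ in table.values())
    result = 0.0
    for stratum in strata:
        n1, e1 = table[(stratum, True)]
        n0, e0 = table[(stratum, False)]
        weight = (n1 + n0) / total
        result += weight * (e1 / n1 - e0 / n0)
    return result


crude = crude_risk_difference(counts)
adjusted = standardized_risk_difference(counts)
print(f"crude risk difference:        {crude:+.3f}")
print(f"age-standardized difference:  {adjusted:+.3f}")
```

With these numbers the crude comparison makes vaccination look harmful, while the age-standardized comparison shows lower risk among the vaccinated. Stratification only handles confounders that are measured; it does nothing for those missing from the record.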