Promises & pitfalls: Using 'big' medical records data for research
Kathryn Rough, ScD
XLDB Conference
April 4, 2019

Disclosures & acknowledgements
I am employed by Google, where I do research on machine learning for health applications. This work was done in collaboration with John T. Thompson, MD.

What is an electronic health record (EHR)?
Digitized records of clinical encounters
● Medical history
● Diagnoses
● Medications
● Procedures
● Treatment plans
● Allergies
● Laboratory and test results
● ...

>90% of US hospitals now use EHRs
~150k measurements per hospitalization

"U.S. health care data alone reached 150 exabytes in 2011. Five exabytes (10^18 gigabytes) of data would contain all the words ever spoken by human beings on earth."
Cottle M, Kanwal S, Kohn M, Strome T, Treister N (2013)

EHR data has limitations. Having more data doesn't necessarily solve these.

How do we draw useful conclusions from the data we have?

This talk
● 'Big data' basics
● Promises of EHR data
● Pitfalls of EHR data (& how to avoid them!)
● Q&A

'Big data' basics

Dimensions of 'big' data: wide datasets
● Genetic data
● Single human genome contains >3 billion base pairs
● Typical genome-wide association study (GWAS):
○ 200,000 to 2 million single nucleotide polymorphisms (SNPs)

Visscher PM et al. (2017) American Journal of Human Genetics

Dimensions of 'big' data: long datasets
● Census data
● Survey of 117 million households
○ Over 308 million individuals covered
○ Only 10 questions

Lofquist D, Lugaila T, O'Connell M, Feliz S. (2010) Households and families: 2010 census brief.

Dimensions of EHR data
● Long
● Wide
● + Temporal component
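
To make the three dimensions concrete, here is a minimal Python sketch. The patient IDs, variable names, and values are invented for illustration and do not come from the talk. It stores measurements in "long" form (one row per patient, time point, and variable) and pivots them into a "wide" form with one column per variable.

```python
from collections import defaultdict

# Hypothetical long-format EHR rows: (patient_id, day, variable, value).
long_rows = [
    ("p1", 0, "systolic_bp", 142),
    ("p1", 0, "heart_rate", 88),
    ("p1", 3, "systolic_bp", 135),
    ("p2", 1, "systolic_bp", 118),
]


def to_wide(rows):
    """Pivot long rows into {(patient_id, day): {variable: value}}."""
    wide = defaultdict(dict)
    for patient_id, day, variable, value in rows:
        wide[(patient_id, day)][variable] = value
    return dict(wide)


wide = to_wide(long_rows)
n_patients = len({row[0] for row in long_rows})   # "long": how many patients
n_variables = len({row[2] for row in long_rows})  # "wide": how many variables
n_timepoints = len(wide)                          # temporal: patient-day pairs
```

In this toy example a patient-day with no heart rate simply lacks that key, which hints at how sparse a wide EHR table can become.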

What can we do with EHR data?
● Description
○ How many patients are diagnosed with hypertension each year?
● Prediction
○ Which patients are at highest risk of developing hypertension?
● Causal inference
○ When should I start a patient on antihypertensive drugs?

Promises

Promises of EHR data: scalable data collection
● EHR data
○ High initial cost to implement systems
○ Relatively small marginal cost for a new patient or a new measurement
■ Data collection must facilitate clinical reality
● Traditional medical research
○ Each additional patient and measurement increases study costs

Promises of EHR data: number and variety of patients
● EHR data
○ Large absolute number of patients
○ Includes all patients seeking care at clinical site
○ Captures rare diseases, uncommon events, and many patient subgroups
● Traditional medical research
○ Substantially smaller absolute number of patients
○ Depending on inclusion & exclusion criteria, may not be representative of any real-world population
○ Recruitment often condition-specific

Promises of EHR data: variety and depth of variables
● EHR data
○ Captures wide variety of clinical & utilization data
○ Information necessary for delivering clinical care
■ Substantial amount of detail included
■ Information captured on all clinical conditions
● Traditional medical research
○ Scope of variables collected constrained by specific research question
■ Requires a priori specification of needed information

Promises of EHR data: promptness of research
● EHR data
○ Data passively collected on ongoing basis
○ As clinical research questions arise, data are available to investigators
● Traditional medical research
○ Data collection can begin only after research questions are specified
■ Can take years to complete

Pitfalls

Pitfall #1: data quality

Common sources of data quality problems:
● Errors in data processing
● Important information in unstructured text
● Rule-out diagnoses / upcoding
● Errors in data entry
Example values shown on the slide: 250.0, 98° C

Pitfall #1: data quality
● What can be impacted:
○ Inexact measurement of case definitions, inclusion criteria, predictors, exposures, or outcomes
● What can be done:
○ Think through (and check!) data quality when conducting analyses
○ Validation study for key measures
■ External source
■ Subset of data
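
A validation study on a subset of the data could be summarized roughly as follows. This is a sketch under stated assumptions: the reference standard (for example, manual chart review), the counts, and the plausibility thresholds are all hypothetical, and the function names are illustrative rather than part of any library.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def validation_metrics(pairs):
    """Compare EHR-derived flags with a reference standard.

    pairs: iterable of (ehr_flag, reference_flag) booleans, one per patient
    in the validation subset.
    """
    pairs = list(pairs)
    tp = sum(1 for e, r in pairs if e and r)
    fp = sum(1 for e, r in pairs if e and not r)
    fn = sum(1 for e, r in pairs if not e and r)
    tn = sum(1 for e, r in pairs if not e and not r)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
    }


def implausible_temperatures(temps_celsius, low=30.0, high=45.0):
    """Flag body temperatures outside an assumed plausible range (deg C)."""
    return [t for t in temps_celsius if not low <= t <= high]


# Made-up validation subset: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
sample = (
    [(True, True)] * 40 + [(True, False)] * 10
    + [(False, True)] * 5 + [(False, False)] * 45
)
metrics = validation_metrics(sample)

# A simple range check catches entries such as a temperature of 98 deg C.
flagged = implausible_temperatures([36.8, 98.0, 37.2])
```

Which metric matters most depends on how the measure is used: positive predictive value for a case definition, sensitivity when missed events are the main concern.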

Pitfall #2: patient loss to follow-up

What we observe in the EHR data

Pitfall #2: patient loss to follow-up

The 'complete' picture

Pitfall #2: patient loss to follow-up
● What can be impacted:
○ Descriptive studies: undercounting of events
○ Predictive studies: incorrect outcome labels, biased performance metrics
○ Causal inference studies: selection/collider bias
● What can be done:
○ Linkage to additional data sources
○ Consider alternative data sources (e.g., claims data)
○ Quantitative bias analysis
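
As a rough illustration of a quantitative bias analysis for undercounted events, the sketch below rescales an observed event count under several assumed capture probabilities. It assumes that events recorded in the EHR are real (no false positives) and that the only error is events occurring outside the health system. The observed count and the capture probabilities are hypothetical.

```python
def corrected_event_count(observed_events, capture_probability):
    """Estimate the true event count if only a fraction of events is captured.

    Assumes no false positives, so true = observed / capture_probability.
    """
    if not 0 < capture_probability <= 1:
        raise ValueError("capture_probability must be in (0, 1]")
    return observed_events / capture_probability


observed = 450  # hypothetical number of events seen in the EHR

# Repeat the correction over a range of plausible capture probabilities
# rather than committing to a single value.
scenarios = {
    p: corrected_event_count(observed, p) for p in (0.6, 0.75, 0.9)
}
```

Reporting the whole range shows how sensitive a descriptive result is to the unknown amount of care received elsewhere.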

Pitfall #3: overemphasizing statistical significance

Example shown on the slide: p < 0.001
● Statistical significance ≠ clinical significance
● Magnification of model misspecification
● Data dredging & p-hacking

Pitfall #3: overemphasizing statistical significance
● What can be impacted:
○ Very small p-values for small magnitude of effect
○ Narrow confidence intervals that exclude the true value
○ Type-I errors (false positives)
● What can be done:
○ If p-values are presented, also report clinically meaningful effect estimates
○ Pre-specify analyses
○ Report all subgroup tests, regardless of significance
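
The first point can be illustrated with a small calculation. The sketch below applies a two-sided z-test to the same small difference in means at two sample sizes. The numbers (a 0.5 mmHg difference in systolic blood pressure with a common standard deviation of 15 mmHg) are invented, and the test assumes a known standard deviation and equal group sizes.

```python
import math


def two_sample_z(diff, sd, n_per_group):
    """Two-sided z-test for a difference in means with a known common SD.

    Returns (p_value, 95% confidence interval for the difference).
    """
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return p_value, ci


# Same hypothetical 0.5 mmHg difference, very different sample sizes.
p_small_n, ci_small_n = two_sample_z(0.5, 15.0, 100)
p_large_n, ci_large_n = two_sample_z(0.5, 15.0, 1_000_000)
```

With 100 patients per group the difference is far from significant. With a million per group the p-value is tiny and the interval is narrow, yet the effect is still only 0.5 mmHg, which is why the effect estimate belongs next to the p-value.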

Pitfall #4: confounding

Age → Influenza complications
Age → Influenza vaccination
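
The diagram can be made concrete with invented counts. In the sketch below, older patients are both more likely to be vaccinated and more likely to have complications, while vaccination has no effect within either age group. The Mantel-Haenszel risk ratio is used here as one standard way to adjust for a stratified confounder; it is not necessarily the approach the talk recommends, and all counts are hypothetical.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed divided by risk in the unexposed."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio across strata.

    strata: iterable of (exposed_cases, exposed_total,
                         unexposed_cases, unexposed_total).
    """
    numerator = denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


# Hypothetical counts: (vaccinated with complications, vaccinated total,
#                       unvaccinated with complications, unvaccinated total)
strata = {
    "older": (80, 800, 20, 200),   # 10% risk in both groups
    "younger": (2, 200, 8, 800),   # 1% risk in both groups
}

# Crude analysis ignores age by pooling the strata.
pooled = [sum(values) for values in zip(*strata.values())]
crude_rr = risk_ratio(*pooled)

# Stratum-specific and age-adjusted analyses.
stratum_rrs = {age: risk_ratio(*counts) for age, counts in strata.items()}
adjusted_rr = mantel_haenszel_rr(strata.values())
```

Here the crude risk ratio is well above 1 even though each stratum-specific ratio and the adjusted ratio equal 1, so the apparent harm of vaccination is produced entirely by age.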
