From Genomics to Medicine: Advancing Healthcare at Scale

With the exponential growth of genomic data sets, healthcare practitioners now have the opportunity to improve human outcomes at an unprecedented pace. These outcomes are difficult to realize in the existing ecosystem of genomic tools, where biostatisticians regularly chain together command-line tools on a single-node, on-premises setup. The Databricks Unified Analytics Platform for Genomics lets users perform end-to-end analysis on a massively scalable cloud platform: in only minutes, a data scientist can visualize an individual’s disease risk based on their raw genomic data. Built on Apache Spark, the platform provides click-button implementations of accepted best-practice workflows, as well as low-level Spark SQL optimizations for common genomics operations.

2. From Genomics to Medicine: Advancing Healthcare at Scale • Karen Feng, Databricks • #UnifiedAnalytics #SparkAISummit

3. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant https://www.washingtonpost.com/opinions/a-boys-mysterious-illness-a-bold-gamble-and-a-breakthrough-in-genetic-medicine/2016/04/20/13f20b16-e638-11e5-bc08-3e03a5b41910_story.html

4. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant https://www.researchgate.net/publication/318420329_Health_technology_assessment_of_next-generation_sequencing

5. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant • XIAP sequence alignment (figure): Human CFCCG http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html

6. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant • XIAP sequence alignment (figure): Human CFCCG; Chicken AFCCG; Zebra fish CFCCG; Frog CFHCD; House fly CVWCN http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html

7. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant • XIAP sequence alignment (figure): Nic’s XIAP CFCYG; Human CFCCG; Chicken AFCCG; Zebra fish CFCCG; Frog CFHCD; House fly CVWCN http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html

8. Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-years-after-dna-sequencing-b99602505z1-336977681.html

9. Genomics is a big data problem • Cost of sequencing a human genome: from $2.7B to <$1,000 (https://www.genome.gov/27541954/dna-sequencing-costs-data/) • Data volume: 40,000 petabytes / year by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195)

10. Agenda • Genomics overview – Big data problem – Real-world applications – Pain points at industrial scale • Joint genotyping – Existing approach – Databricks approach • Genomics on Databricks

12. The power of big genomic data: Accelerate Target Discovery • Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA • Goal: identify a biological target (e.g., a protein) that can be mediated with a drug • Approach: large-scale regressions to correlate DNA variants with the trait
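
The "large-scale regressions" on this slide amount to fitting one small model per variant. The following is a minimal sketch of that idea, assuming genotypes are coded as 0/1/2 alternate-allele counts and the trait is quantitative. The function names `regression_slope` and `scan_variants` and the toy data are hypothetical, and this is not the Databricks implementation, which the abstract says runs on Apache Spark.

```python
from statistics import mean


def regression_slope(genotype_counts, trait_values):
    """Least-squares slope of trait ~ intercept + genotype for one variant.

    genotype_counts: alternate-allele counts (0, 1, or 2), one per sample.
    trait_values:    quantitative trait, one value per sample, same order.
    Returns None when the variant does not vary across samples.
    """
    g_mean = mean(genotype_counts)
    y_mean = mean(trait_values)
    covariance = sum(
        (g - g_mean) * (y - y_mean)
        for g, y in zip(genotype_counts, trait_values)
    )
    variance = sum((g - g_mean) ** 2 for g in genotype_counts)
    # A variant with identical genotypes in every sample carries no signal.
    return covariance / variance if variance > 0 else None


def scan_variants(genotypes_by_variant, trait_values):
    """Run the same regression independently for every variant."""
    return {
        variant_id: regression_slope(counts, trait_values)
        for variant_id, counts in genotypes_by_variant.items()
    }


# Toy data (hypothetical): four samples, two variants.
toy_genotypes = {"variant_a": [0, 1, 2, 1], "variant_b": [1, 1, 1, 1]}
toy_trait = [1.0, 3.0, 5.0, 3.0]
effects = scan_variants(toy_genotypes, toy_trait)
```

Because each variant is fitted independently, the loop in `scan_variants` is the part that a distributed engine would parallelize across millions of variants.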

15. The power of big genomic data: Reduce Costs via Precision Prevention • Motivation: propose personalized lifestyle changes to decrease disease risk • Goal: calculate individual’s disease risk • Approach: large-scale regressions to identify contributing genetic variants

18. The power of big genomic data: Improve Survival with Optimized Treatment • Motivation: decrease ER admissions • Goal: personalize dosage based on genetic variants • Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants https://jamanetwork.com/journals/jama/fullarticle/2585977

19. The power of big genomic data: Improve Survival with Optimized Treatment • Motivation: decrease ER admissions • Goal: personalize dosage based on genetic variants • Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants http://www.bloodjournal.org/content/106/7/2329

21. The power of big genomic data • Accelerate Target Discovery • Reduce Costs via Precision Prevention • Improve Survival with Optimized Treatment

22. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together https://www.biostars.org/p/98582/

23. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • “GATK or VCFtools … have different chromosomal notation, one has Chr, the other does not.” https://www.biostars.org/p/98582/

24. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf https://www.biostars.org/p/98582/
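
The awk one-liner on this slide strips a leading "chr" from every line, which changes the CHROM column of data lines but leaves header lines (which start with "#") untouched. As a rough sketch of the same fix that also renames contig declarations in the header, assuming `##contig` lines list `ID` as their first key, one could write the following. The function name `strip_chr_prefix` is hypothetical and is not part of GATK, VCFtools, or any library.

```python
import re

# Matches the start of a contig declaration such as ##contig=<ID=chr1,...>
# Assumption: ID is the first key inside the angle brackets.
_CONTIG_HEADER = re.compile(r"^(##contig=<ID=)chr")


def strip_chr_prefix(line):
    """Remove a leading 'chr' from the contig name on one VCF line."""
    if line.startswith("##"):
        # Meta-information line: only contig declarations name a contig.
        return _CONTIG_HEADER.sub(r"\1", line)
    if line.startswith("#"):
        # Column header line (#CHROM POS ...): nothing to rename.
        return line
    # Data line: CHROM is the first tab-separated field.
    return line[3:] if line.startswith("chr") else line


def convert_lines(lines):
    """Apply the renaming lazily to an iterable of VCF lines."""
    return (strip_chr_prefix(line) for line in lines)
```

Keeping the header and the records consistent is a design choice here; downstream tools that validate CHROM values against the declared contigs may otherwise reject the converted file.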

25. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • “Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...” https://www.biostars.org/p/98582/

26. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

27. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf https://academic.oup.com/bioinformatics/article/31/13/2202/196142

28. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf https://academic.oup.com/bioinformatics/article/31/13/2202/196142 • Row in a TSV file
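
The `vt normalize` command on these slides rewrites each variant into one canonical representation, which the linked paper characterizes as left-aligned and parsimonious. The following is a minimal sketch of that left-align-and-trim procedure for a single variant, assuming the whole contig sequence fits in memory as a string. The function name `normalize_variant` and its signature are hypothetical, and this is not vt's actual code.

```python
def normalize_variant(pos, alleles, reference):
    """Left-align and trim one variant.

    pos:       1-based position of the first base of alleles[0] (the REF allele).
    alleles:   [REF, ALT1, ...] as strings of bases.
    reference: the full contig sequence (assumed to fit in memory).
    Returns (new_pos, new_alleles).
    """
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")

    changed = True
    while changed:
        changed = False
        # Step 1: if every allele ends with the same base, drop that base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # Step 2: if an allele became empty, extend all alleles one base left.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot left-extend at the start of the contig")
            pos -= 1
            left_base = reference[pos - 1]
            alleles = [left_base + a for a in alleles]
            changed = True

    # Step 3: drop shared leading bases while every allele keeps one base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles
```

For example, on the toy reference `GGGCACACACAGGG`, the deletion written as position 6, `CAC` to `C`, comes back as position 3, `GCA` to `G`: `normalize_variant(6, ["CAC", "C"], "GGGCACACACAGGG")` returns `(3, ["GCA", "G"])`. Two tools that describe the same deletion differently will only match after a step like this.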

29. Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • Pipeline stages (tools named on the slide): Raw Data; Alignment (BWA); Variant Calling; Annotation; Quality Control (Picard); Analysis (plink)