From Genomics to Medicine: Advancing Healthcare at Scale
1 .WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2 .From Genomics to Medicine: Advancing Healthcare at Scale Karen Feng, Databricks #UnifiedAnalytics #SparkAISummit
3 .Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant https://www.washingtonpost.com/opinions/a-boys-mysterious-illness-a-bold-gamble-and-a-breakthrough-in-genetic-medicine/2016/04/20/13f20b16-e638-11e5-bc08-3e03a5b41910_story.html
4 .Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant https://www.researchgate.net/publication/318420329_Health_technology_assessment_of_next-generation_sequencing
7 .Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant • Figure (XIAP sequence comparison): Nic’s XIAP CFCYG; Human CFCCG; Chicken AFCCG; Zebra fish CFCCG; Frog CFHCD; House fly CVWCN http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html
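The comparison on this slide can be checked mechanically. The sketch below is not from the talk: it takes the five-residue fragments shown on the slide, reports where each one differs from the human fragment, and lists the columns that are identical across a set of fragments. The names `diff_positions` and `conserved_columns` are hypothetical, and positions are 1-based offsets within the fragment, not coordinates in the XIAP protein.

```python
# Five-residue fragments copied from the slide (one-letter amino acid codes).
FRAGMENTS = {
    "Nic's XIAP": "CFCYG",
    "Human": "CFCCG",
    "Chicken": "AFCCG",
    "Zebra fish": "CFCCG",
    "Frog": "CFHCD",
    "House fly": "CVWCN",
}


def diff_positions(seq, ref):
    """Return (position, ref_residue, seq_residue) for each mismatch, 1-based."""
    if len(seq) != len(ref):
        raise ValueError("fragments must be the same length")
    return [(i + 1, r, s) for i, (s, r) in enumerate(zip(seq, ref)) if s != r]


def conserved_columns(fragments):
    """Return 1-based positions where every fragment has the same residue."""
    seqs = list(fragments.values())
    return [i + 1 for i, col in enumerate(zip(*seqs)) if len(set(col)) == 1]


if __name__ == "__main__":
    human = FRAGMENTS["Human"]
    for name, seq in FRAGMENTS.items():
        print(name, diff_positions(seq, human))
```

On the slide's data, the only column shared by all five species is position 4 of the fragment, and that is also the one position where Nic's fragment differs from the human one (C replaced by Y).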
8 .Genomics in the real world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-years-after-dna-sequencing-b99602505z1-336977681.html
9 .Genomics is a big data problem • Sequencing cost: from $2.7B to <$1,000 https://www.genome.gov/27541954/dna-sequencing-costs-data/ • Data volume: 40,000 petabytes / year by 2025 https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
10 .Agenda • Genomics overview – Big data problem – Real-world applications – Pain points at industrial scale • Joint genotyping – Existing approach – Databricks approach • Genomics on Databricks 10
12 .The power of big genomic data: Accelerate Target Discovery • Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA • Goal: identify a biological target (e.g. protein) that can be mediated with a drug • Approach: large-scale regressions to correlate DNA variants and the trait
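To make "regressions to correlate DNA variants and the trait" concrete, here is a minimal sketch that regresses a trait on each variant independently and ranks variants by strength of correlation. It is not the method used in the talk: the function names, variant names, and the six-sample data set are made up for illustration, genotypes are assumed to be coded as alternate-allele counts (0, 1, 2), and a real analysis would add covariates, significance testing, and multiple-testing correction.

```python
def slope_and_r(genotypes, trait):
    """Least-squares slope and Pearson r of trait on allele count (0, 1, 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((t - mean_t) ** 2 for t in trait)
    sxy = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no variation, nothing to correlate
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def scan(variants, trait):
    """Regress the trait on each variant independently; strongest |r| first."""
    results = [(name, *slope_and_r(g, trait)) for name, g in variants.items()]
    return sorted(results, key=lambda row: abs(row[2]), reverse=True)


# Illustrative, made-up data: six samples, two variants.
TRAIT = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]
VARIANTS = {
    "var_a": [0, 1, 2, 1, 0, 2],
    "var_b": [1, 1, 0, 2, 1, 1],
}

if __name__ == "__main__":
    for name, slope, r in scan(VARIANTS, TRAIT):
        print(f"{name}: slope={slope:.3f} r={r:.3f}")
```

Each variant is tested on its own, so the work is independent per variant. That independence is what makes this kind of scan a natural fit for a distributed engine once the number of variants and samples grows.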
15 .The power of big genomic data: Reduce Costs via Precision Prevention • Motivation: propose personalized lifestyle changes to decrease disease risk • Goal: calculate individual’s disease risk • Approach: large-scale regressions to identify contributing genetic variants
18 .The power of big genomic data: Improve Survival with Optimized Treatment • Motivation: decrease ER admissions • Goal: personalize dosage based on genetic variants • Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants https://jamanetwork.com/journals/jama/fullarticle/2585977
19 .The power of big genomic data: Improve Survival with Optimized Treatment • Motivation: decrease ER admissions • Goal: personalize dosage based on genetic variants • Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants http://www.bloodjournal.org/content/106/7/2329
21 .The power of big genomic data • Accelerate Target Discovery • Reduce Costs via Precision Prevention • Improve Survival with Optimized Treatment
23 .Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • “GATK or VCFtools … have different chromosomal notation, one has Chr, the other does not.” https://www.biostars.org/p/98582/
24 .Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf https://www.biostars.org/p/98582/
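The awk one-liner on this slide only rewrites lines that begin with `chr`, so `##contig` header entries keep the prefix while the data lines lose it. The sketch below shows one way the same fix might look when it is aware of the VCF layout. It is an illustration, not code from the talk; `strip_chr` and `strip_chr_file` are hypothetical names, and the sketch does not handle other naming differences such as `chrM` versus `MT`.

```python
import re

# Matches the ID field of a contig header line, e.g. ##contig=<ID=chr1,length=...>
_CONTIG_ID = re.compile(r"^(##contig=<ID=)chr")


def strip_chr(line):
    """Drop a leading 'chr' from the CHROM column and from ##contig header IDs."""
    if line.startswith("##contig="):
        return _CONTIG_ID.sub(r"\1", line)
    if line.startswith("#"):
        return line  # other header lines are left untouched
    chrom, sep, rest = line.partition("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + rest


def strip_chr_file(src, dst):
    """Stream an uncompressed VCF from src to dst, rewriting contig names."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(strip_chr(line))
```

A call such as `strip_chr_file("your.vcf", "no_chr.vcf")` mirrors the file names used in the awk command. Either way, this is glue code that exists only because two tools disagree on notation, which is the pain point the slide is making.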
25 .Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • “Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...” https://www.biostars.org/p/98582/
28 .Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf https://academic.oup.com/bioinformatics/article/31/13/2202/196142 • Figure caption: Row in a TSV file
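For readers unfamiliar with what variant normalization does, here is a minimal sketch of left-alignment and trimming for a single biallelic variant, following the general procedure described in the paper linked on this slide. It is not vt's implementation: `normalize` is a hypothetical function, the reference is a short in-memory string standing in for `seq.fa`, and multi-allelic records are not handled.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq: reference sequence as a string (stand-in for a FASTA file)
    pos:     1-based position of the first base of `ref`
    Returns the normalized (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    reference = "GGGCACACACAGGG"  # made-up sequence with a CA repeat
    # The same CA deletion written in the middle of the repeat...
    print(normalize(reference, 8, "CAC", "C"))
    # ...and a SNP padded with extra matching bases.
    print(normalize(reference, 4, "CAC", "CAT"))
```

The point of normalizing is that the same deletion inside a repeat can be written at several positions. Shifting every such variant to its leftmost, shortest form lets records from different tools be compared by simple equality.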
29 .Genomic analysis on big data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together • Pipeline figure: Raw Data; Alignment (BWA); Variant Calling; Annotation; Quality Control (Picard); Analysis (plink)