Accelerating Genomics SNPs Processing and Interpretation with Apache Spark

Interpreting SNP data is a non-trivial task. Processing whole-exome and/or whole-genome data and then interpreting it is challenging, and Apache Spark significantly speeds up the end-to-end analysis from FASTQ to an annotated VCF file. In this talk we’ll share how we use Apache Spark for bioinformatics purposes.

1.Spark Accelerated Genomics Processing: Accelerating Personal Genomics with Apache Spark. Kartik Thakore

2.Agenda 1. Intro to Genomics 2. Genomics processing 3. Traditional GATK Pipelines 4. App Ready

3.Intro to (Personal) Genomics ● DNA code ○ DNA sequence ■ sequence of A, C, T or G ○ Human Genome ■ 3 billion bases (letters) ● SNP (single nucleotide polymorphisms) ○ Represents a difference in a single DNA building block (called a nucleotide) ○ 4-5 million SNPs in a person’s Genome ○ More than 100 million SNPs in populations
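
As a toy illustration of the SNP definition above, the following sketch lists the positions at which two already-aligned sequences differ by a single base. The object name `SnpSketch`, the function name, and the example sequences are made up for illustration; real variant calling works on sequencing reads and is far more involved.

```scala
object SnpSketch {
  // Returns (0-based position, reference base, sample base) for every
  // position where two aligned, equal-length sequences differ.
  def singleBaseDifferences(reference: String, sample: String): Seq[(Int, Char, Char)] = {
    require(reference.length == sample.length, "sequences must be aligned to the same length")
    reference.zip(sample).zipWithIndex.collect {
      case ((refBase, sampleBase), pos) if refBase != sampleBase => (pos, refBase, sampleBase)
    }
  }
}
```

For example, comparing "ACGTAC" with "ACGTTC" reports one difference at position 4, where A is replaced by T.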

4.Intro to (Personal) Genomics ● Analyzing and interpreting the genome of a single individual ● Defining unique genetic characteristics ○ Non-medically relevant traits (taste preference, exercise, behavior, etc) ○ Ancestry information ○ Understand risk for various diseases ○ Carrier status ○ Disease diagnosis ○ Response to certain treatments & drugs (pharmacogenomics)

5.Obtaining personal genomic information ● From biological sample to raw data ● “Decoding” the DNA sequence ○ Technological revolution ○ Multiple different platforms (and providers) ○ Two technologies ■ Genotyping (23andMe, Ancestry) ■ Sequencing (full genome (WGS), exome)

6.Obtaining personal genomic information: Genotyping vs. Sequencing
Genotyping ● Looks at specific, interesting known variants in the DNA ● Technology: SNP arrays/chips ○ Efficient and cost-effective (~$50-$100) ○ Straightforward analysis ● BUT: ○ Requires prior identification of variants of interest ○ Limited information, no novel information ● Examples: 23andMe, Ancestry
Sequencing ● Reads whole sequences ● Technology: Next-Generation-Sequencing ○ More data and context: total information content ○ No prior knowledge needed: discover variation you don’t know about beforehand ○ Full picture: deeper discovery about the genetic underpinnings (rare diseases, ...) ● BUT: ○ Some information is not useful (much of the human genome does not vary between individuals, so it is redundant) ○ Big datasets → more elaborate analysis and storage ○ Bit more expensive (~ $1000 range) ■ Still significant advances in technology, decreasing the time and labor costs

7.SNPs Processing MVP - genotyping [Diagram labels: User, Genetic testing, Download raw data, Sign up app, Upload raw data, Data analysis pipeline, Ancestry analysis, Variant annotation (dbSNP, ClinVar, Polyphen, ...), Data Interpretation]

8.WGS Variant Calling: Traditional GATK ● GATK Best practices

9.GATK Pipeline Data Engineering ● Setup of tools ○ Usually involves downloading several tools ● Pipeline runs ○ Manually executing scripts ○ Manual verification of results ● Reproducibility and reliability ○ Particularly difficult ○ Need to hire specialist to run ● Future proofing ○ Updating data sets manually ○ Updating tools manually ● Can be automated with significant work

10.GATK Pipeline User Experience ● File Verification Real World Data ○ Pipeline assumes VCFs are valid ● Job failures are not trackable ● Development and dockerization is hard ● Scaling also becomes difficult
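
Because the pipeline assumes its VCF input is valid, a cheap structural check before a run can surface bad files early. The sketch below is one possible check, not the speaker's method: it relies only on general VCF conventions (a `##fileformat=VCF...` first line, a `#CHROM` column header, and at least eight tab-separated fields per data line). The object and function names are hypothetical.

```scala
object VcfCheckSketch {
  // Returns None when the lines look structurally like a VCF,
  // otherwise Some(reason) describing the first problem found.
  def firstProblem(lines: Iterator[String]): Option[String] = {
    if (!lines.hasNext) return Some("file is empty")
    if (!lines.next().startsWith("##fileformat=VCF")) return Some("missing ##fileformat header")

    var sawColumnHeader = false
    var lineNo = 1
    while (lines.hasNext) {
      val line = lines.next()
      lineNo += 1
      if (line.startsWith("##")) {
        // meta-information line: nothing to check in this sketch
      } else if (line.startsWith("#CHROM")) {
        sawColumnHeader = true
      } else if (!sawColumnHeader) {
        return Some(s"line $lineNo: data before #CHROM header")
      } else if (line.split("\t", -1).length < 8) {
        return Some(s"line $lineNo: fewer than 8 tab-separated fields")
      }
    }
    if (sawColumnHeader) None else Some("missing #CHROM header line")
  }
}
```

Taking an `Iterator[String]` lets the check stop at the first problem instead of reading the whole file, which matters for large inputs.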

11.Product and Design Considerations ● Transparent Processing ○ Actual progress ● As fast as possible feedback ○ File verification ● Job run results and failure inspectability ● Ability to painlessly scale ● Jobs must be able to handle large variations in file sizes ● Data store cannot lose files ○ High Friction for the User to upload again ● How the heck do we do this with an App?

12.Genomics Pipeline: Ingestion [Diagram labels: Mobile App, uploaded, TLS terminated, Unimatrix Pipeline, Stored and verified, Stored VCFs on GCS] 1. Variant Call Format (VCF) files from providers such as 23andMe are uploaded via the mobile app 2. Files are verified by the Unimatrix (Borg shout out :D ) a. This is done without processing the file!!! 3. If the files are invalid we can tell the user right away 4. Additionally, if files are valid we can also terminate the TLS connection, which allows the user to move on 5. The UX is significantly smoother

13.VCF to Dataframe

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{asc, col, when}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Output path: same name as the input, with .parquet instead of .txt
    val PARQUET_FILE_NAME = INPUT_FILE.stripSuffix(".txt") + ".parquet"

    val schema = StructType(
      List(
        StructField("rsid", StringType, false),
        StructField("chromosome", StringType, false),
        StructField("position", IntegerType, false),
        StructField("genotype", StringType, false)
      )
    )

    // Drop '#' comment lines, split on tabs, and build typed rows
    val sc_file = sc.textFile(INPUT_FILE)
    val rdd = sc_file
      .filter(line => !line.startsWith("#"))
      .map(line => line.split("\t"))
      .map(entries => Row(entries(0), entries(1), entries(2).toInt, entries(3)))
    val df = spark.createDataFrame(rdd, schema)

    // Replace "--" genotypes with ".." and order by chromosome and position
    val updated_df = df
      .withColumn("genotype", when(col("genotype").equalTo("--"), "..").otherwise(col("genotype")))
      .sort(asc("chromosome"), asc("position"))

14.Genomics Pipeline: Processing [Diagram labels: Google Cloud, Unimatrix Pipeline, Stored on GCS, DB Jobs trigger, AWS Cloud, DataBricks Job REST API, Genomics Ancestry Analysis, Progress and Job completion signalled back] ● Ancestry analysis is a notebook run as a job ● Parameters and file locations are sent via the Job REST API ● Results are stored back on GCS ● Results are then transformed and prepared for the mobile application (Mongo) ● Progress from job runs (times) is also periodically sent

15.Reference Mapping

    // Reference data is used to help map ancestry to an individual's SNP rsids
    // (spark.read.parquet is assumed here; the slide text shows only the file name)
    val df = spark.read.parquet("doc_ai_ancestry_frequencies_50_hgdp_pca_filtered_v3.parquet")
    df.createTempView("ANCESTRY_FREQS")

    // A SNP hashing lookup table is used to prepare the user's SNPs to join with ANCESTRY_FREQS
    val ancestryDF = spark.sql("select population_freq, snps, … from JOINED_USER_POPULATION_DF")

    // Finally, results are verified by a suite of assertions to detect potential issues

16.What it looks and feels like
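
Returning to the VCF to Dataframe step: the slide's code calls `entries(2).toInt` and indexes `entries(3)` directly, so a single malformed line would fail the whole job. The following is a minimal sketch of a more defensive row parser in plain Scala, assuming the same four tab-separated columns and the same "--" to ".." replacement as the slide. The names `GenotypeRowSketch`, `GenotypeRow`, `parseLine`, and `parseAll` are hypothetical and not part of the original pipeline.

```scala
import scala.util.Try

object GenotypeRowSketch {
  final case class GenotypeRow(rsid: String, chromosome: String, position: Int, genotype: String)

  // Parses one tab-separated data line; Left carries the reason a line was rejected.
  def parseLine(line: String): Either[String, GenotypeRow] =
    line.split("\t", -1) match {
      case Array(rsid, chromosome, pos, genotype) =>
        Try(pos.trim.toInt).toOption match {
          case Some(position) =>
            // Same replacement as the slide's code: "--" becomes ".."
            val normalized = if (genotype == "--") ".." else genotype
            Right(GenotypeRow(rsid, chromosome, position, normalized))
          case None =>
            Left(s"position is not an integer: '$pos'")
        }
      case fields =>
        Left(s"expected 4 fields, found ${fields.length}")
    }

  // Skips '#' comment lines, then splits the input into parsed rows and rejection reasons.
  def parseAll(lines: Seq[String]): (Seq[GenotypeRow], Seq[String]) = {
    val results = lines.filterNot(_.startsWith("#")).map(parseLine)
    (results.collect { case Right(row) => row }, results.collect { case Left(err) => err })
  }
}
```

Returning `Either` keeps the rejected lines and their reasons available, so they could be reported back to the user instead of surfacing as an opaque job failure.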

17.What’s next ● Testing the limits of our approach ● Askew VCFs ● Genotyped Samples over Sequenced ● Helps account for issues with vendor-specific VCFs ● dnaseq-pipeline-at-scale.html