03_Going Beyond The Browser

Ensembl Data: Going Beyond The Browser
1. Ensembl Data: Going Beyond The Browser • Dan Staines & Andy Yates, Genomics Technology Infrastructure, EMBL-EBI • @ensembl @ensemblgenomes @drdanstaines

2. Ensembl has lots of data types...

3. ...and complex data in many dimensions... • ~45k genomes • ~450 Gbp of sequence • ~175 million genes • ~170 million proteins • ~1 billion protein features • ~1.5 billion cross-references • ~500 million homologous pairs • ~800 million variants • ~200 billion genotypes

4. ...and complex data in many dimensions... • ~45k genomes 75k Bills • ~450 Gbp of sequence • ~175 million genes • ~170 million proteins 90k Potters • ~1 billion protein features • ~1.5 billion cross-references • ~500 million homologous pairs • ~800 million variants • ~200 billion genotypes

5. ...and can grow rapidly... [Chart: number of genes by year, 2009 to 2017; y-axis scale up to 150m]

6. Accessing Ensembl data

7. What we’re building [Diagram labels: {REST}, query manager, expression, genes, variation]

8. The Cambrian Explosion of Databases

10. Why Elastic? • Handles our complex, nested data structures • Scales horizontally • Meets our performance needs: complex query on >100 million genes in ~500 ms; retrieval of all the genes from the human genome in <2 minutes
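
As an illustration of the "complex, nested data structures" point, here is a minimal sketch of a nested gene document and a matching Elasticsearch `nested` query, built as plain Python dictionaries. The document layout and the field names (`transcripts`, `transcripts.biotype`, the `T1`/`T2` identifiers) are hypothetical and are not taken from the actual Ensembl indices. Only the `bool` / `term` / `nested` structure is standard Elasticsearch query DSL.

```python
import json

# Hypothetical gene document: transcripts are stored as nested objects
# inside the gene, so each transcript keeps its own fields together.
gene_doc = {
    "genome": "homo_sapiens",
    "name": "BRCA2",
    "transcripts": [
        {"id": "T1", "biotype": "protein_coding"},
        {"id": "T2", "biotype": "retained_intron"},
    ],
}


def build_nested_query(genome, biotype):
    """Build a query body for genes in `genome` that have at least one
    nested transcript with the given biotype (field names are assumed)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"genome": genome}},
                    {
                        "nested": {
                            "path": "transcripts",
                            "query": {
                                "term": {"transcripts.biotype": biotype}
                            },
                        }
                    },
                ]
            }
        }
    }


body = build_nested_query("homo_sapiens", "protein_coding")
print(json.dumps(body, indent=2))
```

The `nested` clause only works if the assumed `transcripts` field is mapped with type `nested` in the index. Without that mapping, Elasticsearch flattens the inner objects and per-transcript matching is lost.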

11. Gene search • 30k genomes • 110m genes • 3.2bn documents • 782 GB indices • 571 GB JSON dumps • 8 data nodes (2core/32G/200G)

12. REST API • Endpoints: /query and /fetch • Example parameters: query={"genome":"homo_sapiens", "name":"BRCA2"} fields=["name","description"] • [Diagram labels: query manager, expression, genes, variation]
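
The `query` and `fields` parameters on the slide can be sketched as a request URL in Python. This is only an illustration under stated assumptions. The base URL `https://example.org/api` is a placeholder, and the slide does not say whether the parameters travel in a GET query string (as assumed here) or in a POST body. Only the `/query` and `/fetch` paths and the parameter values come from the slide.

```python
import json
from urllib.parse import urlencode

# Placeholder base URL; the real service address is not given on the slide.
BASE_URL = "https://example.org/api"


def build_url(endpoint, query, fields):
    """Encode `query` and `fields` as JSON strings in a URL query string.

    `endpoint` is "query" or "fetch", the two paths named on the slide.
    Passing the parameters via the query string is an assumption.
    """
    params = {
        "query": json.dumps(query, separators=(",", ":")),
        "fields": json.dumps(fields, separators=(",", ":")),
    }
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


url = build_url(
    "query",
    {"genome": "homo_sapiens", "name": "BRCA2"},
    ["name", "description"],
)
print(url)
```

The same helper could build a `/fetch` URL by passing `"fetch"` as the endpoint, assuming that endpoint accepts the same parameters.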

13. Where are we now? • REST interface: beta by invitation (helpdesk@ensembl.org) • Web interface: initial prototyping to support endpoint development; full development now underway

14. Acknowledgements • Genomics Technology Infrastructure • Andy Yates • Ensembl Genomes • Mike Smith • Paul Kersey • Wolfgang Huber • Molecular Archives • Laura Clarke • Peter Harrison • Genome Analysis • Magali Ruffier • Jim Proctor • European Variation Archive • Mungo Carstairs • Cristina Yenyxe Gonzalez • Gene Expression Atlas • Irene Papatheodorou • Alfonso Munoz-Pomer Fuentes