PSH-DB, a key-value system to index and retrieve biological DNA sequences

1. PSH-DB, a key-value system to index and retrieve biological DNA sequences
Jocelyn DE GOËR, Myoung-Ah KANG, Xavier BAILLY, Engelbert MEPHU-NGUIFO
XLDB2017 - October 10th–12th | Clermont-Ferrand, France
UMR EPIA: Animal Epidemiology Unit
Axe 2 - DSI – MINERS Team: Data, Services and Interoperability

2. Biological context
DNA sequencing:
§ Sanger method (1977)
§ High-throughput sequencer (since 2005)
§ Portable sequencer (2014)
§ DNA sequencer for smartphone (2016)

3. Data management and analysis
§ Evolution of human genome sequencing price:
  § 2000: $100 million
  § 2017: $500
§ Challenges:
  § To analyze a constantly increasing amount of genomics data
  § To store all data during the analysis pipeline
Size of genome vs. capacity of storage media (1 base pair (bp) = 1 byte):
Genome                 Size         Storage medium                 Capacity
Flu                    0.013 Mbp    Floppy disk (3.5")             1.4 MB
Borrelia burgdorferi   0.9 Mbp      Compact Disc (CD)              700 MB
Escherichia coli       4.64 Mbp     Digital Versatile Disc (DVD)   4.7 GB
Ixodes ricinus         1 Gbp        Blu-ray Disc (BRD)             7.5 to 128 GB
Human                  3.2 Gbp      Hard drive                     500 GB to 10 TB
Wheat                  17 Gbp
Paris japonica         150 Gbp
Polychaos dubium       675 Gbp
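The comparison between genome sizes and storage media can be checked with a short calculation. The sketch below uses the slide's own figures and its convention of 1 base pair = 1 byte, and reports the smallest listed medium that holds each genome. The function name `smallest_medium`, the use of decimal units, and the choice of the upper capacity bounds for Blu-ray (128 GB) and hard drives (10 TB) are assumptions made for illustration.

```python
# Genome sizes in base pairs, taken from the slide.
GENOMES_BP = {
    "Flu": 13_000,
    "Borrelia burgdorferi": 900_000,
    "Escherichia coli": 4_640_000,
    "Ixodes ricinus": 1_000_000_000,
    "Human": 3_200_000_000,
    "Wheat": 17_000_000_000,
    "Paris japonica": 150_000_000_000,
    "Polychaos dubium": 675_000_000_000,
}

# Media capacities in bytes, ordered from smallest to largest.
# Assumption: decimal units, and the upper bound where the slide gives a range.
MEDIA_BYTES = [
    ("Floppy disk", 1_400_000),
    ("CD", 700_000_000),
    ("DVD", 4_700_000_000),
    ("Blu-ray disc", 128_000_000_000),
    ("Hard drive", 10_000_000_000_000),
]


def smallest_medium(genome_bp, bytes_per_bp=1):
    """Return the name of the smallest medium that fits the genome, or None."""
    needed = genome_bp * bytes_per_bp
    for name, capacity in MEDIA_BYTES:
        if needed <= capacity:
            return name
    return None


if __name__ == "__main__":
    for genome, size in GENOMES_BP.items():
        print(f"{genome}: {smallest_medium(size)}")
```

Under these assumptions a raw human genome fits on a DVD, while the two largest genomes in the list need a hard drive. This counts only one copy of the raw sequence, not the intermediate files produced during an analysis pipeline.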

4. Bioinformatics analysis
§ The first step of genomics data analysis
§ To identify DNA sequences through reference databases
§ Sequence assembly
§ Studying mutations
§ In phylogenetics
§ In metagenomics
  § Study the microbial diversity of a biological sample
  § Ex: seawater or intestinal microbiota

5. What is PSH-DB?
§ PSH: Perceptual Sequence Hashing
  § Adapted from perceptual hashing algorithms used to index images or sounds
  § Not reversible
  § Hash key length: 64 bits (8 chars)
§ Collision probability:
  § 50% chance of at least one collision for 5 billion hash keys
§ Key comparison:
  § Hamming distance
Seq1: GTGTAATAACCCGCCGGAAGCCTGGATAGTGTATAGTTGTTCCTTGATATGGAAGTTTCATCAG
Seq2: GTGTAACATCCCGCCGGAAGCCTGGAGATTGTCTAGTTGTTCCTTGATATGGAAGTTTCATCAG
Hash1: 0 1 1 1 1 1 1 0 1 1 0 0 1 0 1 0 1
Hash2: 0 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1
Hamming distance = 3
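Both numbers on this slide can be reproduced with a few lines of Python. The sketch below is not part of PSH-DB: `hamming_distance` and `collision_probability` are hypothetical helper names, the collision estimate uses the standard birthday-bound approximation for uniformly distributed keys (an assumption about the hash), and the two bit strings are the 17 bits shown on the slide rather than full 64-bit keys.

```python
import math


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer keys differ."""
    return bin(a ^ b).count("1")


def collision_probability(n_keys: int, key_bits: int = 64) -> float:
    """Birthday-bound approximation: p = 1 - exp(-n(n-1) / (2 * 2^bits)).

    Assumes keys are spread uniformly over the key space.
    """
    space = 2 ** key_bits
    return 1.0 - math.exp(-(n_keys * (n_keys - 1)) / (2 * space))


# The two hash fragments shown on the slide.
hash1 = int("01111110110010101", 2)
hash2 = int("01101110111110101", 2)

if __name__ == "__main__":
    print(hamming_distance(hash1, hash2))        # 3, as on the slide
    print(collision_probability(5_000_000_000))  # about 0.49
```

The XOR of the two keys has a 1 bit wherever they differ, so counting those bits gives the Hamming distance. For 5 billion 64-bit keys the approximation gives a probability close to one half, which agrees with the 50% figure quoted above.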

6. What is PSH-DB?
§ Data structure engine
  § Developed in C++11
  § In-memory
  § Client-server
§ Data structure
  § Keys: optimized for binary keys (3x less memory than Redis)
  § Values: integer or string
§ Queries
  § SET, GET, DEL, DBSIZE, FLUSHALL
§ Special query
  § HAMMING_DIST: search all keys with a specified Hamming distance
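To make the query set concrete, here is a toy in-memory stand-in written in Python. It is a sketch under stated assumptions, not the PSH-DB implementation (which is a C++11 client-server engine): the class name `ToyPSHStore`, the method signatures, the linear scan, and the reading of HAMMING_DIST as "all keys within a maximum distance of a query key" are all hypothetical.

```python
class ToyPSHStore:
    """Hypothetical in-memory stand-in for the query set listed on the slide."""

    def __init__(self):
        # Integer hash key -> integer or string value.
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False

    def dbsize(self):
        return len(self._data)

    def flushall(self):
        self._data.clear()

    def hamming_dist(self, query_key, max_distance):
        """Assumed semantics: keys within max_distance bits of query_key.

        A plain linear scan over all keys; a real engine may index differently.
        """
        return sorted(
            key
            for key in self._data
            if bin(key ^ query_key).count("1") <= max_distance
        )


if __name__ == "__main__":
    store = ToyPSHStore()
    store.set(0b1111, "seq_a")
    store.set(0b1101, "seq_b")
    store.set(0b0000, "seq_c")
    print(store.hamming_dist(0b1111, 1))  # [13, 15]
    print(store.dbsize())                 # 3
```

Because similar sequences are meant to produce keys that differ in few bits, a distance-bounded search like this returns candidate near-matches instead of only exact ones, which is what distinguishes HAMMING_DIST from a plain GET.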

7. Thank you for your attention
Jocelyn DE GOËR
http://epia.clermont.inra.fr/jgoer