Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud

Modern object storage offers the opportunity to combine software and hardware to create high-performance, disaggregated data infrastructure. By decoupling compute and storage, enterprises can tune their environments to meet an expanded set of use cases, including machine learning and big data. These modern object storage solutions boast throughput capable of saturating a 100 GbE switch, changing how we perceive, and ultimately deploy, object storage.

1. Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Frank Wessels, CTO, MinIO

2. S3 Select
[Diagram: Applications and S3, before and after S3 SELECT. After: up to 400% faster, up to 80% cheaper]
▪ Recent addition to S3 API
  ○ Offload filtering to storage
  ○ Formats: CSV, JSON, Parquet
▪ Advantages
  ○ Faster
  ○ Less network traffic
  ○ Smaller compute nodes
▪ S3 Select for Spark
  ○ https://github.com/minio/spark-select
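
As a rough illustration of "offload filtering to storage", the sketch below builds an S3 Select request and reads the filtered rows back. It assumes a boto3-style S3 client whose `select_object_content` method is called with these parameters; the bucket, key, and column names in the usage note are hypothetical, not taken from the talk.

```python
def build_select_request(bucket, key, sql):
    """Build keyword arguments for an S3 Select call over a CSV object with a header row."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Treat the first CSV line as column names so the query can use them.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(client, request):
    """Send the request and join the 'Records' payloads from the event stream."""
    response = client.select_object_content(**request)
    chunks = []
    for event in response["Payload"]:
        # Other event types (such as statistics or end-of-stream) carry no rows.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)
```

A call might look like `run_select(client, build_select_request("my-bucket", "parking-citations.csv", "SELECT * FROM S3Object s WHERE s.Make = 'FORD'"))`, where `client` could be created with `boto3.client("s3", endpoint_url=...)` to point at an S3-compatible server. Only the matching rows cross the network, which is where the traffic and compute savings come from.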

3. Introduction to MinIO
MinIO is a high-performance, distributed object storage server designed for peta-scale data infrastructure.
▪ S3-compatible
▪ Scalable
▪ Simple
▪ Performant
▪ Optimized for Intel/ARM/Power9 CPUs

4. Global Scale

5. Focus on Performance

6. S3 Select Performance on AWS

Format    Time (s)   Records    Throughput
csv       5.46       733K/s     94 MB/s
json      14.28      280K/s     98 MB/s
parquet   32.25      124K/s     4.3 MB/s
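
A quick sanity check on these figures: multiplying each time by its record rate gives roughly the same total for all three formats, which suggests the same dataset of about 4 million records was queried in each format. The sketch below does that arithmetic. It assumes the columns are elapsed seconds, records per second, and megabytes of input per second; the slide does not say exactly how throughput was measured.

```python
# (time in seconds, records per second, megabytes per second), copied from the slide
results = {
    "csv": (5.46, 733_000, 94.0),
    "json": (14.28, 280_000, 98.0),
    "parquet": (32.25, 124_000, 4.3),
}


def implied_totals(time_s, records_per_s, mb_per_s):
    """Return (total records, total megabytes) implied by a time and two rates."""
    return time_s * records_per_s, time_s * mb_per_s


for fmt, row in results.items():
    records, megabytes = implied_totals(*row)
    print(f"{fmt:8s} ~{records / 1e6:.2f}M records, ~{megabytes:.0f} MB")
```

The implied byte totals differ widely between formats, as expected: JSON repeats field names in every record, while Parquet is a compressed columnar format.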

7. Accelerating S3 Select on minio
▪ CSV: Parsing
▪ JSON: Parsing
▪ Parquet: Loading
▪ Evaluation (“where”)
▪ Processing (“select”)
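
The stages named on this slide can be sketched for the CSV case as follows. This is a plain Python illustration of parse, evaluate, then project, not MinIO's implementation; the function name and the sample columns are made up for the example.

```python
import csv
import io


def select_csv(data, columns, predicate):
    """Parse CSV text, keep rows where predicate(row) is true, project to columns."""
    reader = csv.DictReader(io.StringIO(data))   # parsing
    for row in reader:
        if predicate(row):                       # evaluation ("where")
            yield [row[c] for c in columns]      # processing ("select")


sample = "id,make,fine\n1,FORD,50\n2,AUDI,70\n3,FORD,25\n"
for fields in select_csv(sample, ["id", "fine"], lambda r: r["make"] == "FORD"):
    print(fields)
```

Every record passes through the parsing stage, while the later stages only touch rows and columns the query needs, so parsing is the natural place to look for speedups first.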

8. First 10X Acceleration: Zero Copy
▪ Manage memory allocations: garbage collected vs. non-garbage collected
Source: https://bitbucket.org/ewanhiggs/csv-game
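
To make the zero-copy idea concrete, here is a minimal Python sketch that splits a CSV line into fields without copying the field bytes: each field is a `memoryview` slice that points back into the original buffer. It is only an analogy for the approach on the slide. Python still allocates the small view objects, and quoted fields are deliberately ignored.

```python
def field_views(line: bytes, delimiter: bytes = b","):
    """Return memoryview slices over `line`, one per field, without copying field bytes."""
    view = memoryview(line)
    fields = []
    start = 0
    while True:
        end = line.find(delimiter, start)
        if end == -1:
            fields.append(view[start:])      # last field runs to the end of the line
            return fields
        fields.append(view[start:end])       # a window into `line`, not a new bytes object
        start = end + 1


row = b"1,FORD,50"
views = field_views(row)
print([bytes(v) for v in views])  # copies happen only here, on demand
```

A query that filters on one column can compare that one view and never materialize the others, which keeps allocation (and garbage collection) pressure low.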

9. Second 10X Acceleration: SIMD
▪ SIMD = Single Instruction Multiple Data
  ○ Intel: AVX2
▪ Process 32 bytes in parallel
  ○ delimiter / separator detection
  ○ bitmap handling & parsing
  ○ string compares
▪ Performance (single core)
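
The delimiter-detection and bitmap steps can be emulated in scalar Python to show the data flow. On AVX2 hardware, a byte-wise compare followed by a move-mask instruction produces the 32-bit bitmap for a whole 32-byte chunk at once; the loop in `delimiter_bitmap` below is a stand-in for those instructions, so this sketch shows the technique but none of the speed. The function names are illustrative.

```python
CHUNK = 32  # an AVX2 register holds 32 bytes


def delimiter_bitmap(chunk: bytes, delimiter: int = ord(",")) -> int:
    """Return an integer whose bit i is set when chunk[i] equals the delimiter."""
    mask = 0
    for i, byte in enumerate(chunk):
        if byte == delimiter:
            mask |= 1 << i
    return mask


def delimiter_positions(data: bytes, delimiter: int = ord(",")):
    """Yield absolute delimiter offsets by walking the set bits of each chunk's bitmap."""
    for base in range(0, len(data), CHUNK):
        mask = delimiter_bitmap(data[base:base + CHUNK], delimiter)
        while mask:
            lowest = mask & -mask                  # isolate the lowest set bit
            yield base + lowest.bit_length() - 1   # bit index -> byte offset
            mask ^= lowest                         # clear it and continue


print(list(delimiter_positions(b"1,FORD,50")))
```

Working on the bitmap means the parser branches once per delimiter rather than once per byte, which is a large part of why this style of parsing is fast.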

10. Results using select-simd
▪ Same queries as before
  ○ minio with select-simd vs AWS S3

11. Demo
■ Source data
  ○ parking-citations.csv (25M rows / 3.5 GB)
■ AWS region
  ○ us-east-1
■ minio with select-simd-integration branch running on a single instance: c5.2xlarge (8 vCPUs)
■ mc client running in same region on c5.large instance

12. Status and what’s next
▪ Work in progress
  ○ Initial focus on CSV
▪ Next: add support for
  ○ Parquet
  ○ JSON: https://github.com/lemire/simdjson
▪ Investigate AVX-512
  ○ erasure coding
    ▫ AVX-512 4x speedup over AVX2
  ○ k-registers are great / 2KB on-core register space
▪ Dynamic code generation (think LLVM)

13. High performance object storage
▪ Power9 CPUs
▪ PCIe Gen4
▪ 24x NVMe
▪ Dual Mellanox CX5 (4x100 GbE/s)

14. S3 Select benefits for Spark
▪ Benefits
  ○ Faster queries
  ○ Less network traffic
  ○ Smaller compute needs
▪ Stay tuned for overall impact
  ○ S3 “plain” vs S3 Select
  ○ minio/simd-select vs AWS S3 Select

15. Questions?
Visit our booth #509
@minio
https://github.com/minio/minio
https://slack.minio.io
https://minio.io