1. Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Frank Wessels, CTO, MinIO
2. S3 Select
▪ Recent addition to S3 API
  ○ Offload filtering to storage
  ○ Formats: CSV, JSON, Parquet
▪ Advantages
  ○ Faster
  ○ Less network traffic
  ○ Smaller compute nodes
■ S3 Select for Spark
  ○ https://github.com/minio/spark-select
[Diagram: applications before and after S3 Select; up to 400% faster, up to 80% cheaper]
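To make "offload filtering to storage" and "less network traffic" concrete, here is a minimal local sketch. It does not call any S3 API: `storage_side_select` is a hypothetical stand-in for the filtering a storage server would do, and the toy data is made up. It only compares how many bytes would cross the network with and without the filter.

```python
import csv
import io

def make_csv(rows):
    """Serialize rows (lists of strings) to CSV bytes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode("utf-8")

def storage_side_select(obj, column, wanted):
    """Hypothetical stand-in for a storage-side filter: keep the header and
    only the rows whose `column` equals `wanted`, re-serialized as CSV."""
    reader = csv.reader(io.StringIO(obj.decode("utf-8")))
    header = next(reader)
    idx = header.index(column)
    kept = [header] + [row for row in reader if row[idx] == wanted]
    return make_csv(kept)

# Toy object (made-up data): 1000 rows, one in ten matches the predicate.
rows = [["id", "state"]] + [
    [str(i), "CA" if i % 10 == 0 else "NY"] for i in range(1000)
]
obj = make_csv(rows)

plain_get = obj                                       # whole object is transferred
select_get = storage_side_select(obj, "state", "CA")  # only matching rows are

print(len(plain_get), len(select_get))
```

With a selective predicate the filtered response is a small fraction of the object, which is where the reduced network traffic and smaller compute nodes come from.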
3. Introduction to MinIO
MinIO is a high-performance, distributed object storage server designed for petascale data infrastructure.
▪ S3-compatible
▪ Scalable
▪ Simple
▪ Performant
▪ Optimized for Intel/ARM/Power9 CPUs
4. Global Scale
5. Focus on Performance
6. S3 Select Performance on AWS
Format    Time (s)   Records    Throughput
csv       5.46       733K/s     94 MB/s
json      14.28      280K/s     98 MB/s
parquet   32.25      124K/s     4.3 MB/s
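The three rows can be cross-checked with simple arithmetic. The sketch below assumes that the "Records" column is a rate (records per second) and that "Throughput" is measured against the stored object size; neither is stated on the slide. Under those assumptions, time multiplied by rate gives the number of records scanned and the approximate object size per format.

```python
# Figures copied from the table above: (time in s, records per s, MB per s).
results = {
    "csv": (5.46, 733_000, 94.0),
    "json": (14.28, 280_000, 98.0),
    "parquet": (32.25, 124_000, 4.3),
}

def derive(time_s, records_per_s, mb_per_s):
    """Return (total records, total MB) implied by one table row."""
    return time_s * records_per_s, time_s * mb_per_s

derived = {fmt: derive(*row) for fmt, row in results.items()}
for fmt, (records, megabytes) in derived.items():
    print(f"{fmt:8s} ~{records / 1e6:.2f}M records, ~{megabytes:.0f} MB")
```

All three formats work out to roughly 4 million records, which suggests the same dataset was queried in each format; the differences in time then come from format size and parsing cost.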
7. Accelerating S3 Select on minio
▪ CSV: Parsing
▪ JSON: Parsing
▪ Parquet: Loading
▪ Evaluation (“where”)
▪ Processing (“select”)
8. First 10X Acceleration: Zero Copy
▪ Manage memory allocations: garbage collected vs. non-garbage collected
Source: https://bitbucket.org/ewanhiggs/csv-game
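The zero-copy idea can be illustrated in a few lines. This is a sketch in Python rather than the implementation discussed in the talk: `field_views` is a hypothetical helper, the sample line is made up, and quoting is ignored. Each field is returned as a window onto the input buffer, so no per-field allocation happens until a caller explicitly asks for a copy.

```python
def field_views(buf: bytes, delimiter: bytes = b","):
    """Yield one memoryview per field of a single CSV line, without copying.

    Assumes no quoting or embedded delimiters (a simplification).
    """
    view = memoryview(buf)
    start = 0
    while True:
        end = buf.find(delimiter, start)
        if end == -1:
            yield view[start:]        # last field runs to the end of the line
            return
        yield view[start:end]
        start = end + 1

line = b"4272349605,2015-12-30,CA,WHITE"   # made-up sample record
fields = list(field_views(line))

# Each field is a window onto the original buffer, not a new bytes object.
assert all(f.obj is line for f in fields)

# Copies are made only here, on demand, for the fields we print.
print([bytes(f) for f in fields])
```

A "where" clause that only inspects one column would touch that single view and never materialize the others.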
9. Second 10X Acceleration: SIMD
▪ SIMD = Single Instruction Multiple Data
  ○ Intel: AVX2
▪ Process 32 bytes in parallel
  ○ delimiter / separator detection
  ○ bitmap handling & parsing
  ○ string compares
▪ Performance (single core)
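The delimiter-detection and bitmap steps listed above can be sketched as follows. This is a scalar emulation in Python, not SIMD code and not the select-simd implementation: `delimiter_mask` stands in for what a 32-byte compare followed by a movemask would produce in a single step on AVX2, and all function names are hypothetical.

```python
CHUNK = 32  # bytes handled per step, matching one 256-bit AVX2 register

def delimiter_mask(chunk: bytes, delimiter: int) -> int:
    """Return an int whose bit i is set when chunk[i] == delimiter.

    Scalar stand-in for a byte-wise compare followed by a movemask.
    """
    mask = 0
    for i, byte in enumerate(chunk):
        if byte == delimiter:
            mask |= 1 << i
    return mask

def positions_from_mask(mask: int, base: int):
    """Yield absolute offsets of the set bits, lowest first."""
    while mask:
        low = mask & -mask                  # isolate the lowest set bit
        yield base + low.bit_length() - 1   # its index within the chunk
        mask ^= low                         # clear it and continue

def find_delimiters(data: bytes, delimiter: bytes = b","):
    """Collect delimiter offsets by scanning data in 32-byte chunks."""
    d = delimiter[0]
    out = []
    for base in range(0, len(data), CHUNK):
        mask = delimiter_mask(data[base:base + CHUNK], d)
        out.extend(positions_from_mask(mask, base))
    return out

data = b"a,bb,ccc,dddd,eeeee,ffffff,ggggggg,hhhhhhhh\n"
print(find_delimiters(data))
```

The point of the bitmap form is that the per-byte loop in `delimiter_mask` disappears on real SIMD hardware, leaving only one bit operation per delimiter found rather than one branch per input byte.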
10. Results using select-simd
▪ Same queries as before
  ○ minio with select-simd vs AWS S3
11. Demo
■ Source data
  ○ parking-citations.csv (25M rows / 3.5 GB)
■ AWS region
  ○ us-east-1
■ minio with select-simd-integration branch running on a single instance: c5.2xlarge (8 vCPUs)
■ mc client running in same region on c5.large instance
12. Status and what’s next
▪ Work in progress
  ○ Initial focus on CSV
▪ Next: add support for
  ○ Parquet
  ○ JSON: https://github.com/lemire/simdjson
▪ Investigate AVX-512
  ○ erasure coding
    ▫ AVX-512 4x speedup over AVX2
  ○ k-registers are great / 2KB on-core register space
▪ Dynamic code generation (think LLVM)
13. High performance object storage
▪ Power9 CPUs
▪ PCIe Gen4
▪ 24x NVMe
▪ Dual Mellanox CX5 (4x100 GbE)
14. S3 Select benefits for Spark
▪ Benefits
  ○ Faster queries
  ○ Less network traffic
  ○ Smaller compute needs
▪ Stay tuned for overall impact
  ○ S3 “plain” vs S3 Select
  ○ minio/simd-select vs AWS S3 Select
15. Questions?
Visit our booth #509
@minio
https://github.com/minio/minio
https://slack.minio.io
https://minio.io