BlazingDB: Data Lake to AI on GPUs

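BlazingSQL is a GPU-accelerated SQL engine that runs on top of the RAPIDS ecosystem. RAPIDS builds on the Apache Arrow columnar memory format, and cuDF is its GPU DataFrame library, used to load, join, aggregate, filter, and otherwise manipulate data. BlazingSQL is the SQL interface to cuDF and includes a number of features that support large-scale data science workflows and operations on enterprise-scale datasets:

  • Query externally stored data: a single line of code registers remote storage such as Amazon S3, similar to Spark's data source API;
  • Simple SQL: straightforward SQL queries run against GPU DataFrames (GDFs), and the results are stored and accessed as GDFs;
  • Interoperability: GDFs are fully accessible to the RAPIDS libraries, so data science analysis can be built directly on them.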

1. Data Lake to AI on GPUs

2. CPUs can no longer handle the growing data demands of data science workloads
  • Slow Process: Preparing data and training models can take days or even weeks.
  • Suboptimal Infrastructure: Hundreds to tens of thousands of CPU servers are needed in data centers.
@blazingdb

3. GPUs are well known for accelerating the training of machine learning and deep learning models.
  • Machine Learning: Performance improvements increase at scale.
  • Deep Learning (Neural Networks): 40x Improvement over CPU.

4. But data preparation still happens on CPUs, and can't keep up with GPU accelerated machine learning.
  • Apache Spark: Query, ETL, ML Train
  • Apache Spark + GPU ML: Query, ETL, ML Train
Enterprise GPU users find it challenging to "Feed the Beast".

5. An end-to-end analytics solution on GPUs is the only way to maximize GPU power.
RAPIDS (All GPU): Query, ETL, ML Train
  • Expertise: GPU DBMS, GPU Columnar Analytics, Data Lakes
  • Expertise: CUDA, Machine Learning, Deep Learning
  • Expertise: Python, Data Science, Machine Learning

6. RAPIDS, the end-to-end GPU analytics ecosystem
A set of open source libraries for GPU accelerating data preparation and machine learning.
Stages: Data Preparation, Model Training, Visualization
In GPU Memory:
  • cuDF: Data Preparation
  • cuML: Machine Learning
  • cuGRAPH: Graph Analytics

    import cudf
    from cuml import KNN
    import numpy as np

    np_float = np.array([
        [1, 2, 3],  # Point 1
        [1, 2, 3],  # Point 2
        [1, 2, 3],  # Point 3
    ]).astype('float32')

    # Copy each column of the NumPy array into a GPU DataFrame
    gdf_float = cudf.DataFrame()
    gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
    gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
    gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])

    print('n_samples = 3, n_dims = 3')
    print(gdf_float)

    knn_float = KNN(n_gpus=1)
    knn_float.fit(gdf_float)  # Index the points before querying
    Distance, Index = knn_float.query(gdf_float, k=3)  # Get 3 nearest neighbors

    print(Index)
    print(Distance)
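To make the `query(..., k=3)` call on the RAPIDS slide more concrete, here is a minimal CPU-only sketch of a brute-force nearest-neighbour search in plain Python. The helper name `knn_query` is hypothetical and is not part of cuML. The slide does not say whether the library returns plain or squared Euclidean distances; this sketch uses plain Euclidean distance.

```python
import math


def knn_query(points, k):
    """Return (distances, indices) of the k nearest neighbours of each point.

    Brute force: every point is compared with every point, itself included.
    This is a hypothetical stand-in, not the cuML implementation.
    """
    distances, indices = [], []
    for p in points:
        # Euclidean distance from p to every point, paired with that point's row index.
        scored = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        nearest = scored[:k]
        distances.append([d for d, _ in nearest])
        indices.append([j for _, j in nearest])
    return distances, indices


# Same three identical points as on the slide.
points = [
    [1.0, 2.0, 3.0],  # Point 1
    [1.0, 2.0, 3.0],  # Point 2
    [1.0, 2.0, 3.0],  # Point 3
]
distances, indices = knn_query(points, k=3)
print(indices)
print(distances)
```

Because the three sample points coincide, every distance is zero, so the example mainly shows the shape of the result: one row of neighbour indices and one row of distances per input point.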

7. BlazingSQL: The GPU SQL Engine on RAPIDS A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack.

8. BlazingSQL, The GPU SQL Engine for RAPIDS
A SQL engine built on RAPIDS. Query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack.
In GPU Memory:
  • cuDF: Data Preparation
  • cuML: Machine Learning
  • cuGRAPH: Graph Analytics

    from blazingsql import BlazingContext

    bc = BlazingContext()

    # Register Filesystem
    bc.hdfs('data', host='', port=54310)

    # Create Table
    bc.create_table('performance', file_type='parquet', path='hdfs://data/performance/')

    # Execute Query
    result_gdf = bc.run_query('SELECT * FROM performance WHERE YEAR(maturity_date)>2005')

    print(result_gdf)
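For readers without a GPU, the register, create table, and query flow on the BlazingSQL slide can be approximated on a CPU. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine; it is not BlazingSQL and says nothing about its performance. The `loan_id` column and the four sample rows are hypothetical, since the slide does not show the table's schema or contents.

```python
import sqlite3

# CPU-only stand-in: an in-memory SQLite database plays the role of the SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance (loan_id INTEGER, maturity_date TEXT)")

# Hypothetical rows; on the slide the table is read from Parquet files on HDFS.
conn.executemany(
    "INSERT INTO performance VALUES (?, ?)",
    [(1, "2004-06-01"), (2, "2005-12-01"), (3, "2006-01-01"), (4, "2031-03-01")],
)

# SQLite has no YEAR() function, so the year is extracted with strftime().
rows = conn.execute(
    "SELECT loan_id, maturity_date FROM performance "
    "WHERE CAST(strftime('%Y', maturity_date) AS INTEGER) > 2005 "
    "ORDER BY loan_id"
).fetchall()
print(rows)
conn.close()
```

Only the loans maturing in 2006 or later pass the filter, which is the same predicate the slide's query applies with `YEAR(maturity_date)>2005`.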

9. Getting Started Demo

10. BlazingSQL + XGBoost Loan Risk Demo
Train a model to assess risk of new mortgage loans based on Fannie Mae loan performance data.
  • Mortgage Data: 4.22M Loans, 148M Perf. Records, CSV Files on HDFS
  • Pipeline: ETL/Feature Engineering, XGBoost Training
  • Cluster (4 Nodes): 8 vCPUs per node, 30GB RAM
  • Cluster (1 Node): 16 vCPUs per node, 1 Tesla T4 GPU (2560 CUDA Cores, 16GB VRAM)

11. RAPIDS + BlazingSQL outperforms traditional CPU pipelines
Demo Timings (ETL Phase)
[Bar chart: time in seconds (axis 0 to 3000) for 3.8GB (1 x T4), 3.8GB (4 Nodes), 15.6GB (1 x T4), and 15.6GB (4 Nodes); bar values not recoverable]

12. Scale up the data on a DGX 4 x V100 GPUs

13. BlazingSQL + Graphistry Netflow Analysis
Visually analyze the VAST netflow data set inside Graphistry in order to quickly detect anomalous events.
  • Netflow Data: 65M Events, 1,440 Devices, 2 Weeks
  • Pipeline: ETL, Visualization

14. Benchmarks Netflow Demo Timings (ETL Only)

15. Benefits of BlazingSQL
  • Data Lake to RAPIDS: Query data from Data Lakes directly with SQL into GPU memory, and let RAPIDS do the rest.
  • Blazing Fast: Massive time savings with our GPU accelerated ETL pipeline.
  • Minimal Code Changes Required: RAPIDS with BlazingSQL mirrors Pandas and SQL interfaces for seamless onboarding.
  • Stateless and Simple: Keeping the underlying services stateless reduces complexity and increases extensibility.

16. Upcoming BlazingSQL Releases
  • V0.1 Query GDFs: Use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the cuDF API.
  • V0.2 Direct Query Flat Files: Integrate FileSystem API, adding the ability to directly query flat files (Apache Parquet & CSV) inside distributed file systems.
  • V0.3 String Support: String support and string operation support.
  • V0.4 Distributed Scheduler: SQL queries are fanned out across multiple GPUs and servers.
  • V0.5 Physical Plan Optimizer: Partition culling for where clauses and joins.
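The "partition culling for where clauses" item on the roadmap can be illustrated with a small sketch. The idea, as generally understood, is to skip whole files whose value range cannot satisfy the filter, so they are never read. Everything below is an assumption made for illustration: the `Partition` class, the file names, and the per-file min/max years are hypothetical, and the slide does not describe how BlazingSQL implements this.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    path: str      # hypothetical file name
    min_year: int  # smallest value of the filter column in this file
    max_year: int  # largest value of the filter column in this file


def cull_partitions(partitions, lower_bound):
    """Keep only partitions that can contain rows with year > lower_bound."""
    return [p for p in partitions if p.max_year > lower_bound]


partitions = [
    Partition("part-0.parquet", 1999, 2003),
    Partition("part-1.parquet", 2004, 2005),
    Partition("part-2.parquet", 2005, 2008),
    Partition("part-3.parquet", 2009, 2012),
]

# For a filter such as "year > 2005", the first two files can be skipped entirely.
to_scan = cull_partitions(partitions, lower_bound=2005)
print([p.path for p in to_scan])
```

A partition that survives culling may still contain rows that fail the filter (here `part-2.parquet` holds some 2005 rows), so the engine still has to apply the predicate to the rows it does read.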

17. Get Started BlazingSQL is quick to get up and running using either DockerHub or Conda Install: