Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”

NEC has recently released a new vector system, “SX-Aurora TSUBASA”. Vector systems are usually used for HPC, but this one is also designed for data analytics: the vector processor is built as a PCIe-attached accelerator. Compared with GPGPU, it is well suited to memory-intensive workloads, which are often seen in statistical machine learning and data frame processing. To accelerate data analytics on Spark, we have created an acceleration framework, “Frovedis”, for SX-Aurora TSUBASA. It supports several MLlib machine learning algorithms and DataFrame processing, fully optimized for the vector processor. It is also optimized for distributed systems with multiple vector processors, and its API is mostly the same as that of Spark MLlib and DataFrame. These features enable Spark developers to use multiple vector processors seamlessly from Spark and get a large performance improvement. Our performance evaluation shows that Frovedis on the vector processor achieves a 10x to 50x speedup on several machine learning and data frame kernels compared with Spark on Xeon Gold.

1.Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Takeo Hosomi
Takuya Araki, Ph.D.
Data Science Research Laboratories, NEC Corporation
© NEC Corporation 2019

2.Summary
▌NEC released new vector processor SX-Aurora TSUBASA
  Different characteristics than GPGPU:
  • Larger memory and higher memory bandwidth
  • Compatible with standard programming languages
▌Vector processor evolved from HPC
  Optimized for unified Big Data analytics
  Especially suitable for statistical ML
▌Packaged with machine learning middleware in C++/MPI
  Distributed and vectorized implementation
  Adapts Apache Spark APIs
  Up to ~100x faster than Spark on x86

3.What is a Vector Processor?
Processes many elements with one instruction, which is supported by large memory bandwidth
Scalar processor
  Unit of computation is small
  Suitable for web server, etc.
  Memory bandwidth in the figure: 0.12TB/s
Vector processor
  Computes many elements at once (256 in the figure)
  Suitable for simulation, AI, Big Data, etc.
  Memory bandwidth in the figure: 1.2TB/s
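
As an illustration of "many elements with one instruction", here is a minimal C++ sketch of an element-wise loop with no dependence between iterations, which is the kind of loop a vectorizing compiler can map onto vector instructions. The function names `axpy` and `vector_ops_needed` are hypothetical, and the 256-element default is taken from the figure on this slide; whether a particular compiler actually vectorizes the loop is an assumption, not something the sketch guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i] for every element.
// No iteration reads a value written by another iteration, so the loop
// is a candidate for automatic vectorization.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}

// Number of vector instructions needed to cover n elements when one
// instruction handles up to `vlen` elements (256 is the figure's value).
std::size_t vector_ops_needed(std::size_t n, std::size_t vlen = 256) {
  assert(vlen > 0);
  return (n + vlen - 1) / vlen;  // ceiling division
}
```

Under these assumptions, a 1000-element loop needs four passes of a 256-element vector instruction where a scalar processor would issue 1000 separate multiply-add operations.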

4.New Vector Processor System “SX-Aurora TSUBASA”
Form factors shown: Rack (Supercomputer), Tower, On-card Vector Processor
Downsized supercomputer: can be used as an accelerator for Big Data and AI

5.On-card Vector Processor (Vector Engine)
• NEC-designed vector processor
• PCIe card implementation
• 8 cores / processor
• 4.9TF performance (single precision)
• 1.2TB/s memory bandwidth, 48GB memory
• Standard programming interface (C/C++/Fortran)

6.Processor Specifications
VE1.0 Specification
  cores/CPU:         8
  frequency:         1.6GHz
  core performance:  307GF (DP), 614GF (SP)
  CPU performance:   2.45TF (DP), 4.91TF (SP)
  cache capacity:    16MB shared
  Memory bandwidth:  1.2TB/s
  Memory capacity:   48GB
Block diagram labels: 8 cores at 307GF each (2.45TF total); 256 words vector length (16k bits); 0.4TB/s and 3TB/s between the cores and the 16MB software controllable cache; 1.2TB/s to HBM2 memory x 6
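
The figures on this slide are mutually consistent, which the following small C++ sketch checks by arithmetic. The constant and function names are hypothetical; the 64-bit word size is an assumption inferred from "256 words" being "16k bits", and the bytes-per-flop ratio is a derived number that does not appear on the slide.

```cpp
#include <cassert>

// Values copied from the slide.
constexpr double kCoreGflopsDP = 307.0;     // per-core double precision
constexpr int kCores = 8;                   // cores per CPU
constexpr double kMemBandwidthGBs = 1200.0; // 1.2TB/s
constexpr int kVectorWords = 256;           // vector length in words
constexpr int kBitsPerWord = 64;            // assumption: 64-bit words

// CPU peak = per-core peak x number of cores (about 2.45TF).
constexpr double cpu_gflops_dp() { return kCoreGflopsDP * kCores; }

// Memory bytes available per double-precision flop at peak.
constexpr double bytes_per_flop_dp() {
  return kMemBandwidthGBs / cpu_gflops_dp();
}

// One vector register: 256 words x 64 bits = 16384 bits ("16k bits").
constexpr int vector_register_bits() { return kVectorWords * kBitsPerWord; }
```

With these inputs the CPU peak comes to 2456GF and the ratio to roughly 0.49 bytes per flop, which gives a rough sense of why the slides describe the processor as suited to memory-intensive workloads.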

7.GPGPU and Vector Engine Execution Models
GPGPU: Offloading Model
  Parts of App. are executed on GPGPU
  OS and App run on x86; CUDA functions run on GPGPU, connected through PCIe
  Sequence: exec → Start Processing → Data Transmission → Function → Result Transmission → End Processing → exit
Vector Engine: Native Model
  Whole App. is executed on VE
  App runs on VE from exec to exit; OS functions (I/O, etc.) are handled on x86 through PCIe
Advantage of Native Model
• Can reduce the data transfer between x86 and Vector Engine

8.Usability
Programming Environment
  Vector Cross Compiler: automatic vectorization, automatic parallelization
    $ vi sample.c
    $ ncc sample.c
  OS: Red Hat Linux, CentOS
  Fortran: F2003, F2008 (partially)
  C: C11
  C++: C++14
  OpenMP: OpenMP 4.5
  MPI: MPI 3.1
Execution Environment
  Execution from x86:
    $ ve_exec ./a.out

9.Why Vector Engine?
Can accelerate memory-intensive workloads
• High memory bandwidth and large memory capacity
• Supports native execution model
• Standard programming model
• Scales to multiple vector processors
• Direct data transfer among multiple vector processors through PCIe and InfiniBand

10.Why are we here?
Spark machine learning framework and data frame processing, which require extremely large memory, can be accelerated by the Vector Engine
Figure: workloads and processors positioned by memory performance and computation performance. Labels: Vector Engine, Machine Learning, Data Frame, MLP, LSTM, CNN, General purpose CPU, GPU

11.Frovedis: Framework of vectorized and distributed data analytics

12.Frovedis: FRamework Of VEctorized and DIStributed data analytics
▌C++ framework similar to Spark
  Supports Spark/Python interface
▌MPI is used for high performance communication
▌Optimized for SX-Aurora TSUBASA (also works on x86)
Open Source!
Component stack: Spark / Python Interface; Matrix Library, Machine Learning, DataFrame; Frovedis Core

13.Frovedis Core
▌Provides Spark core-like functionalities (e.g. map, reduce)
  Internally uses MPI to implement distributed processing
  Inherently supports multiple cards/servers
▌Users need not be aware of MPI to write distributed processing code
  Write functions in C++
  Provide functions to the framework to run them in parallel
▌Example: double each element of distributed variable
    int two_times(int i) {return i * 2;}
    int main(...) {
      ...
      dvector<int> r =;   // distributed variable; run “two_times” in parallel
    }

14.Complete Sample Program (1/2)
▌Scatter a vector; double each element; then gather
    #include <frovedis.hpp>
    using namespace frovedis;
    int two_times(int i) {return i*2;}
    int main(int argc, char* argv[]) {
      use_frovedis use(argc, argv);                // initialization
      std::vector<int> v = {1,2,3,4,5,6,7,8};
      dvector<int> d1 = make_dvector_scatter(v);   // scatter to create dvector
      dvector<int> d2 =;
      std::vector<int> r = d2.gather();            // gather to std::vector
    }
▌Do not have to be aware of MPI (SPMD programming style)
  Looks more like a sequential program
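
The scatter, map, gather flow of the sample can be mimicked in a single process with only the C++ standard library. The sketch below is not the Frovedis API: `Chunks`, `scatter`, `map_chunks`, `gather`, and `demo` are hypothetical stand-ins in which each inner vector plays the role of the partition held by one rank, and the loop over chunks stands in for work that the framework would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a distributed vector: one chunk per "rank".
using Chunks = std::vector<std::vector<int>>;

// Split v into nranks contiguous chunks of nearly equal size.
Chunks scatter(const std::vector<int>& v, std::size_t nranks) {
  assert(nranks > 0);
  Chunks chunks(nranks);
  const std::size_t base = v.size() / nranks;
  const std::size_t extra = v.size() % nranks;
  std::size_t pos = 0;
  for (std::size_t r = 0; r < nranks; ++r) {
    const std::size_t len = base + (r < extra ? 1 : 0);
    const auto first = v.begin() + static_cast<std::ptrdiff_t>(pos);
    chunks[r].assign(first, first + static_cast<std::ptrdiff_t>(len));
    pos += len;
  }
  return chunks;
}

// Apply f to every element; each outer iteration touches only its own
// chunk, so in a distributed setting each rank could run it independently.
template <class F>
Chunks map_chunks(const Chunks& in, F f) {
  Chunks out(in.size());
  for (std::size_t r = 0; r < in.size(); ++r) {
    out[r].reserve(in[r].size());
    for (int x : in[r]) out[r].push_back(f(x));
  }
  return out;
}

// Concatenate the chunks back into one vector, in rank order.
std::vector<int> gather(const Chunks& chunks) {
  std::vector<int> v;
  for (const auto& c : chunks) v.insert(v.end(), c.begin(), c.end());
  return v;
}

int two_times(int i) { return i * 2; }

// Same data as the slide, processed with an arbitrary number of "ranks".
std::vector<int> demo(std::size_t nranks) {
  const std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8};
  return gather(map_chunks(scatter(v, nranks), two_times));
}
```

Because `two_times` is applied element by element and `gather` preserves rank order, the result is the same regardless of how many chunks the data is split into, which is what lets the program read like a sequential one.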

15.Complete Sample Program (2/2)
▌Works as an MPI program
    #include <frovedis.hpp>
    using namespace frovedis;
    int two_times(int i) {return i*2;}
    int main(int argc, char* argv[]) {
      use_frovedis use(argc, argv);
      // MPI_Init is called in the constructor, then branch:
      //  • rank 0: execute the below statements
      //  • rank 1-N: wait for RPC request from rank 0
      // In the statements below, rank 0 sends RPC request to rank 1-N to do the work
      std::vector<int> v = {1,2,3,4,5,6,7,8};
      dvector<int> d1 = make_dvector_scatter(v);
      dvector<int> d2 =;
      std::vector<int> r = d2.gather();
    }
    // In the destructor of “use”, rank 0 sends RPC request to rank 1-N to stop the program, and MPI_Finalize is called

16.Matrix Library
▌Implemented using Frovedis core and existing MPI libraries[*]
  [*] ScaLAPACK/PBLAS, LAPACK/BLAS, Parallel ARPACK
▌Supports dense and sparse matrix of various formats
  Dense: row-major, column-major, block-cyclic
  Sparse: CRS, CCS, ELL, JDS, JDS/CRS Hybrid (for better vectorization)
▌Provides basic matrix operations and linear algebra
  Dense: matrix multiply, solve, transpose, etc.
  Sparse: matrix-vector multiply (SpMV), transpose, etc.
Example
    blockcyclic_matrix<double> A = X * Y; // mat mul
    gesv(A, b); // solve Ax = b
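
To make the sparse side concrete, here is a minimal standard-library C++ sketch of sparse matrix-vector multiply (SpMV) over the CRS (compressed row storage) format named on this slide. The `CrsMatrix` struct and `spmv` function are hypothetical and are not the Frovedis types; the sketch is sequential and unoptimized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CRS container: nonzeros are stored row by row.
struct CrsMatrix {
  std::size_t nrows;
  std::size_t ncols;
  std::vector<double> val;           // nonzero values
  std::vector<std::size_t> col_idx;  // column index of each nonzero
  std::vector<std::size_t> row_ptr;  // row r owns [row_ptr[r], row_ptr[r+1])
};

// y = A * x
std::vector<double> spmv(const CrsMatrix& a, const std::vector<double>& x) {
  assert(x.size() == a.ncols);
  assert(a.row_ptr.size() == a.nrows + 1);
  assert(a.val.size() == a.col_idx.size());
  std::vector<double> y(a.nrows, 0.0);
  for (std::size_t r = 0; r < a.nrows; ++r) {
    double sum = 0.0;
    // The inner loop is only as long as the number of nonzeros in row r.
    for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
      sum += a.val[k] * x[a.col_idx[k]];
    }
    y[r] = sum;
  }
  return y;
}
```

In CRS the inner loop length equals the number of nonzeros in a row, which can be much shorter than a 256-element vector. That is one plausible reason the slide lists ELL, JDS, and the JDS/CRS hybrid "for better vectorization", since those layouts are organized so that the long loop runs across rows instead.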

17.Machine Learning Library
Implemented with Frovedis Core and Matrix Library
• Supports both dense and sparse data
• Sparse data support is important in large scale machine learning
▌Supported algorithms:
• Linear model
    • Logistic Regression
    • Multinomial Logistic Regression
    • Linear Regression
    • Linear SVM
• ALS
• K-means
• Preprocessing
    • SVD, PCA
• Word2vec
• Factorization Machines
• Decision Tree
• Naïve Bayes
• Graph algorithms
    • Shortest Path, PageRank, Connected Components
▌Under development:
• Frequent Pattern Mining
• Spectral Clustering
• Hierarchical Clustering
• Latent Dirichlet Allocation
• Deep Learning (MLP, CNN)
• Random Forest
• Gradient Boosting Decision Tree
▌We will support more!

18.DataFrame
▌Supports similar interface as Spark DataFrame
  Select, Filter, Sort, Join, Group by/Aggregate
  (SQL interface is not supported yet)
▌Implemented as distributed column store
  Each column is represented as distributed vector
  Each operation only scans argument columns: other columns are created when necessary (late materialization)
  Reduces size of data to access
Figure: columns A, B, C, D, each partitioned across rank #0, rank #1, rank #2
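
Late materialization in a column store can be sketched with plain C++ vectors. The example below is an illustration under assumptions, not the Frovedis implementation: `Table`, `filter_a_greater`, and `materialize_b` are hypothetical names, the table is held in one process, and only two columns are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-column table stored column by column.
struct Table {
  std::vector<int> a;
  std::vector<double> b;
};

// The filter scans only column a and returns matching row positions;
// column b is not touched at this stage.
std::vector<std::size_t> filter_a_greater(const Table& t, int threshold) {
  std::vector<std::size_t> rows;
  for (std::size_t i = 0; i < t.a.size(); ++i) {
    if (t.a[i] > threshold) rows.push_back(i);
  }
  return rows;
}

// Column b is built for the selected rows only when a later step asks for it.
std::vector<double> materialize_b(const Table& t,
                                  const std::vector<std::size_t>& rows) {
  std::vector<double> out;
  out.reserve(rows.size());
  for (std::size_t i : rows) {
    assert(i < t.b.size());
    out.push_back(t.b[i]);
  }
  return out;
}
```

The design point is that a filter followed by an aggregate over `a` alone never reads `b` at all, and a query that does need `b` reads only the surviving rows, which is how the approach reduces the amount of data accessed.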

19.Spark / Python Interface
▌Writing C++ programs is sometimes tedious, so we created a wrapper interface to Spark
  Call the framework through the same Spark API
  Users do not have to be aware of vector hardware
▌Implementation: created a server with the functionalities
  Receives RPC request from Spark and executes ML algorithm, etc.
  Only pre-built algorithms can be used from Spark
▌Other languages can also be supported by this architecture
  Currently Python is supported (scikit-learn API)

20.How it works
▌Rank 0 of the Frovedis server waits for RPC from driver of Spark
▌Data communication is done in parallel
  All workers/ranks send/receive data in parallel
  Assuming that the data can fit in the memory of the Frovedis server
Figure: an interactive operation reaches the Spark driver, which sends an RPC request to rank 0 of the Frovedis server; Spark workers 0 to 2 and Frovedis server ranks 0 to 2 exchange data (data communication)

21.Programming Interface
▌Provides same interface as the Spark’s MLlib
  By changing importing module
▌How to use:
  Original Spark program: logistic regression
    …
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    …
    val model = LogisticRegressionWithSGD.train(data)
    …
  Change to call Frovedis implementation
    …
    import //change import
    …
    FrovedisServer.initialize(...) // invoke Server; specify command to invoke server (e.g. mpirun, qsub)
    val model = LogisticRegressionWithSGD.train(data) // no change: same API
    FrovedisServer.shut_down() // stop Server

22.YARN Support
▌Resource allocation by YARN is also supported
  Implemented in the collaboration with Cloudera (formerly Hortonworks) team
▌Implementation:
  YARN is modified to support Vector Engine (VE) as resource (like GPU)
  Created a wrapper program of mpirun, which works as YARN client
  • Obtain VE from YARN Resource Manager, and run MPI program on the given VE
  Used the wrapper as the server invocation command
  • Specified in FrovedisServer.initialize(...)
Figure: Spark invokes the mpirun wrapper, which contacts the YARN RM and then runs mpirun on the VEs

23.Performance Evaluation: Machine Learning
▌Xeon (Gold 6126) 1 socket vs 1 VE, with sparse data (w/o I/O)
  LR uses CTR data provided by Criteo (1/4 of the original, 6GB)
  K-means and SVD used Wikipedia doc-term matrix (10GB)
  Spark version: 2.2.1
Speed Up (Spark = 1)
             Spark/x86   Frovedis/x86   Frovedis/VE
  LR             1           10.6           42.8
  K-means        1            8.8           56.8
  SVD            1            5.3          113.2

24.Performance Evaluation: DataFrame
▌Evaluated with TPC-H SF-20
  Q1: group by/aggregate
  Q3: filter, join, group by/aggregate
  Q5: filter, join, group by/aggregate (larger join)
  Q6: filter, group by/aggregate
Chart: Speed Up (Spark = 1) for Q01, Q03, Q05, Q06, with series Spark/x86, Frovedis/x86, Frovedis/VE. Spark/x86 is 1 for every query; the other bar labels are 47.3, 34.8, 33.8, 10.6, 10.1, 8.8, 5.8, 3.2

25.NEC X Vector Engine Data Acceleration Center (VEDAC)
▌NEC X is the innovation accelerator for NEC’s emerging technologies, located in Silicon Valley
▌VEDAC Lab opened today:
  Several servers with multiple SX-Aurora vector engine cards running Frovedis middleware
▌Remote access for qualified companies, universities and government labs; or physical access for entrepreneurs:
  Short series of tutorials on vector processing
  Upload data to VEDAC; get a true sense of performance in actual applications
▌Request access:
  Please follow @incNECX for the latest news

26.Conclusion
▌NEC released a new vector processor, SX-Aurora TSUBASA, that can accelerate data analytics and machine learning applications
▌We have developed data analytics middleware Frovedis for SX-Aurora TSUBASA
▌We show a 10x to 100x performance improvement on several machine learning and data frame processing kernels
▌NEC X has opened the VEDAC lab for accessing the SX-Aurora TSUBASA AI platform with Frovedis

27.Biographies
▌Takeo Hosomi
  Takeo Hosomi is a senior engineer at NEC Data Science Research Laboratories. He has broad experience in High Performance Computing and Big Data.
▌Takuya Araki, Ph.D.
  Takuya Araki received B.E., M.E., and Ph.D. degrees from the University of Tokyo, Japan in 1994, 1996, and 1999, respectively. He was a visiting researcher at Argonne National Laboratory from 2003 to 2004. He is currently a Senior Principal Researcher at NEC Data Science Research Laboratories. His research interests include parallel and distributed computing and its application to AI/machine learning. He is also a director of the Information Processing Society of Japan (IPSJ).

28.Biographies
▌NEC X
  NEC X, Inc. accelerates the development of innovative products and services based on the strengths of NEC Laboratories technologies. The organization was launched by NEC Corp. in 2018 to fast-track technologies and business ideas selected from inside and outside NEC. For companies launched by its Corporate Accelerator Program, often with partner venture capital investments, NEC X supports their business development activities to help them achieve revenue growth. Additionally, NEC X provides an option for entrepreneurs, startups and existing companies to use NEC’s emerging technologies in the Americas. The company is centrally located in Silicon Valley for access to its entrepreneurial ecosystem and its strong high-technology market. Learn more at