Encrypted Computation in Apache Spark

Homomorphic encryption is a relatively new encryption technology that allows computations to be done directly on encrypted data. "Microsoft SEAL" is an easy-to-use homomorphic encryption library that enables software engineers to build end-to-end encrypted data storage and computation services where the customer never needs to share their key with the service. The technology's bottlenecks are performance and the size of encrypted data, and Spark, along with hardware acceleration, can help solve these scalability challenges. In this talk we will describe homomorphic encryption at a high level and see what performance the technology can achieve when complex encrypted computations are executed in Spark. The talk will include demos.


1.Encrypted Computation in Apache Spark Kim Laine, Microsoft #UnifiedDataAnalytics #SparkAISummit

2.Brief intro Kim Laine kim.laine@microsoft.com Cryptography and Privacy Research Group Microsoft Research, Redmond, WA

3.Brief intro Joint work with Peizhao Hu (RIT), Asma Aloufi (RIT), and Wei Dai (Microsoft) Thanks to Asma Aloufi for letting me use some of her slides!

4.Goals • What is encrypted computation, and why does it matter? • What is homomorphic encryption? • What does Spark have to do with it? • Initial results

5.Three stages of data privacy • In transit to Cloud (TLS) • At rest in Cloud (AES) • During computation in Cloud (?) • Access Policies

6.Why does it matter? • Insider threats • Outsider threats • Data segregation issues • Regulations • Liability concerns

7.Could we have this? Privacy Barrier

8.Could we have this? Many ways to achieve encrypted computation! • Homomorphic Encryption • Secure Hardware • Secure Multi-Party Computation

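9.Homomorphic encryption [Figure: the data owner encrypts a secret input (20) and sends the ciphertext ($fA4!&s2FDfs4) to an untrusted environment, which computes on it while encrypted (-5, then x2) and returns the encrypted result (e#3Ad09!B%gD); the data owner decrypts it to obtain the computed result (30).]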
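The encrypt, compute, decrypt flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a toy Paillier cryptosystem, which is additively homomorphic, and not the schemes implemented in Microsoft SEAL. The primes are hypothetical demo values and far too small to be secure, and the multiplication by 2 is a multiplication by a public constant, not by another ciphertext.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic). NOT Microsoft SEAL and
# NOT secure: the primes below are hypothetical demo values chosen only so
# that the example runs instantly.
P, Q = 1789, 1861
N = P * Q                      # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1), secret


def _l(x):
    """The Paillier L function: L(x) = (x - 1) / N."""
    return (x - 1) // N


MU = pow(_l(pow(N + 1, LAM, N2)), -1, N)  # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt an integer m; negative values are encoded modulo N."""
    while True:
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m % N, N2) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt and map the result back to a signed integer."""
    m = (_l(pow(c, LAM, N2)) * MU) % N
    return m - N if m > N // 2 else m


def add(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N2


def scale(c, k):
    """Multiply the hidden plaintext by a public integer k."""
    return pow(c, k, N2)


# Data owner encrypts 20; the untrusted side computes (x - 5) * 2 without
# ever seeing x; the data owner decrypts the result.
ct = encrypt(20)
ct_result = scale(add(ct, encrypt(-5)), 2)
result = decrypt(ct_result)
print(result)
```

Only `decrypt` needs the secret values `LAM` and `MU`; `encrypt`, `add`, and `scale` use the public modulus alone, which is what lets the untrusted side do the computation.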

10.Homomorphic encryption • User/customer always keeps the key • Very strong data privacy guarantees • Open-source implementations available!

11.Microsoft SEAL • MIT licensed • Actively developed today • GitHub.com/Microsoft/SEAL

12.OK, what’s the catch? Encrypted computation is very complicated • Encrypted computations are limited to addition and multiplication • Best on small computations • Strange programming model so hard to use for developers • Big slow-down so prefer many threads (or cluster) • Data expansion so need lots of storage We are working on these challenges but some are inherent!

13.OK, so what’s the catch? • No branching allowed • Cannot detect overflow • Storage and computation are vectorized [Figure: the plaintext vectors (1, 2, 3, 4, ...) and (5, 6, 7, 8, ...) are each encrypted (Enc) into a single ciphertext; one Add on the two ciphertexts yields (6, 8, 10, 12, ...).]
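To make the "no branching" and "vectorized" constraints concrete, here is a minimal plaintext model in Python. It is a sketch, not an encrypted computation: each list stands in for the slots of one packed ciphertext, and the function names (`slot_add`, `slot_mul`, `select`) are hypothetical. The point is the programming style: every operation acts on all slots at once, and an if/else has to be rewritten as arithmetic on a 0/1 mask.

```python
# Plaintext stand-in for packed ciphertexts: each list models the slots of
# one ciphertext. With real homomorphic encryption these values would be
# hidden, and only slot-wise addition and multiplication would be available.

def slot_add(a, b):
    """Slot-wise addition of two equally sized slot vectors."""
    return [x + y for x, y in zip(a, b)]


def slot_mul(a, b):
    """Slot-wise multiplication of two equally sized slot vectors."""
    return [x * y for x, y in zip(a, b)]


def select(mask, if_one, if_zero):
    """Branch-free choice per slot: mask*if_one + (1 - mask)*if_zero.

    mask must contain only 0s and 1s; no comparison or 'if' is used.
    """
    inverse = [1 - m for m in mask]
    return slot_add(slot_mul(mask, if_one), slot_mul(inverse, if_zero))


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

total = slot_add(a, b)               # one operation adds every slot
picked = select([1, 0, 0, 1], a, b)  # takes a where mask is 1, b elsewhere
print(total, picked)
```

Because both "branches" are always computed and then blended, a branch-free program costs as much as its most expensive path, which is one reason small, regular computations suit the model best.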

14.Can this be useful? • Collaborative analytics on private data • ML: private predictions on private key-value store data



17.Spark integration • Costly computation and large data size • Simple and parallelizable computations work well • Lots of people already use Spark! Can we store encrypted data in Spark and bring encrypted computation directly to Spark?

18.Spark integration [Figure: HE+Spark at the intersection of Security, Distributed Systems, and Software Engineering.]

19.Spark integration
First approach: pipe data to custom-written HE programs • Easy to implement • Hard for developers to use • Abysmal performance
Second approach: provide high-level API to HE within Spark • Usage through Python, Java, Scala • Familiar to Spark users • Good performance possible • Hard to implement

20.SparkSEAL • Integrate Microsoft SEAL with Spark ○ Support data analytics and machine learning on encrypted data ○ Speed up homomorphic computations through parallelization • Map Microsoft SEAL operations into RDDs • Fully utilize ciphertext packing methods to reduce space and communication cost • Develop new task and resource allocation algorithms for Microsoft SEAL
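The speed-up from parallelization rests on homomorphic addition being associative, so partial results can be computed per partition and then combined. The sketch below illustrates that pattern with Python's standard library rather than Spark itself; the slides do not say which RDD operations SparkSEAL uses, and `distributed_sum`, `partition`, and the `he_add` parameter are hypothetical names. Integers stand in for ciphertexts.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def partition(items, num_parts):
    """Split items into at most num_parts contiguous chunks."""
    if not items:
        raise ValueError("nothing to partition")
    size = -(-len(items) // num_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def distributed_sum(ciphertexts, he_add, num_parts=4):
    """Reduce each partition independently, then combine the partials.

    he_add is any associative two-argument function; in a real system it
    would be a homomorphic addition of two ciphertexts.
    """
    parts = partition(ciphertexts, num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partials = list(pool.map(lambda p: reduce(he_add, p), parts))
    return reduce(he_add, partials)


# Stand-in data: integers play the role of ciphertexts and operator.add
# plays the role of homomorphic addition.
values = list(range(1, 101))
total = distributed_sum(values, operator.add)
print(total)
```

In Spark the same shape would typically be expressed with an RDD of serialized ciphertexts and a reduce over it, with the workers doing the per-partition work.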

21.SparkSEAL [Figure: architecture. SparkSEAL Examples (examples and tests in Java) are launched with spark-submit on Apache Spark (version 2.4.0), running on Mesos/Yarn/Standalone over a Distributed File System (HDFS). Components for HE support: sparkseal-api-java; SparkSEAL Plugin (loading library and register UDTs); libSparkSEAL (primitives and algorithms); libseal (Microsoft SEAL).]

22.SparkSEAL [Figure: component diagram with SparkSEAL Examples, SparkSEAL API, SparkSEAL Plugin, SWIG, libSparkSEAL, Apache Spark, and libseal.a.]

23.Programming Abstractions API available to data scientists
HE setup and IO operations • generate_keys • save_keys • load_keys • encrypt • decrypt • store_ciphertext_to_file • read_ciphertext_from_file
Basic HE operations • he_add • he_subtract • he_multiply • he_is_equal • he_total_sum • he_dot_product • he_slot_sum
Selective HE operations • he_sum_EncVec • he_add_EncVec • he_subtract_EncVec • he_multiply_EncVec • he_retrieve_EncVec
[Figure: call graph with Example code (Java), SparkSEAL, SWIG, generated native code, JNI, and Microsoft SEAL (C++).]
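The slides do not show how operations such as he_total_sum or he_dot_product are implemented. A common technique for summing the slots of a packed ciphertext is rotate-and-add, which needs only log2(n) rotations for n slots. The Python sketch below demonstrates it on plaintext lists standing in for ciphertext slots; the function names are hypothetical and are not the SparkSEAL API.

```python
# Plaintext stand-in: a list models the slots of one packed ciphertext.
# Homomorphic schemes with packing usually offer slot rotation alongside
# slot-wise addition and multiplication; only those three are used here.

def rotate(slots, k):
    """Cyclically rotate the slots left by k positions."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def slot_add(a, b):
    return [x + y for x, y in zip(a, b)]


def slot_mul(a, b):
    return [x * y for x, y in zip(a, b)]


def total_sum(slots):
    """Rotate-and-add: afterwards every slot holds the sum of all slots."""
    n = len(slots)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("slot count must be a power of two")
    acc = slots
    step = 1
    while step < n:
        acc = slot_add(acc, rotate(acc, step))
        step *= 2
    return acc


def dot_product(a, b):
    """Slot-wise multiply, then sum across slots."""
    return total_sum(slot_mul(a, b))


summed = total_sum([1, 2, 3, 4])
dot = dot_product([1, 2, 3, 4], [5, 6, 7, 8])
print(summed, dot)
```

The result is replicated in every slot, so a later step can read it from any position without an extra rotation.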

24.Conclusions and future work SparkSEAL enables computation on large encrypted datasets and hides much of the complexity of using Microsoft SEAL for encrypted computation! To do (ongoing work) • How to make the user experience easier • How to fully leverage vectorized storage and computation (HARD!) • Releasing the code under an OSS license

25. Thank You! Contact: kim.laine@microsoft.com