1.Great Models with Great Privacy: Optimizing ML & AI Over Sensitive Data Sim Simeonov, CTO, Swoop (firstname.lastname@example.org / @simeons) Slater Victoroff, CTO, Indico (email@example.com / @sl8rv) #UnifiedAnalytics #SparkAISummit
2.privacy-preserving ML/AI to improve patient outcomes and pharmaceutical operations e.g., we improve the diagnosis rate of rare diseases
3. Intelligent Process Automation for Unstructured Content using ML/AI with strong data protection guarantees e.g. we automate the processing of loan documents using a combination of NLP and Computer Vision
4.Regulation affects ML & AI • General Data Protection Regulation (GDPR) – Already in effect in the EU • California Consumer Privacy Act (CCPA) – Comes into effect Jan 1, 2020 • Many federal efforts under way – Information Transparency and Data Control Act (DelBene, D-WA) – Consumer Data Protection Act (Wyden, D-OR) – Data Care Act (Schatz, D-HI)
5.accuracy vs. privacy is a false dichotomy (if you are willing to invest in privacy infrastructure)
6.Privacy-preserving computation frontiers • Stochastic – Differential privacy (DP) • Encryption-based – Fully homomorphic encryption (FHE) • Protocol-based – Secure multi-party computation (SMC)
7.When privacy-preserving algorithms are immature, sanitize the data the algorithms are trained on
8.Session roadmap • The rest of this session – Sanitizing a single dataset using Spark • After the break – Sanitizing joinable datasets the smart way – Embeddings for unstructured data (deep learning!)
9.What does it mean to sanitize data for privacy?
10.Identifiability spectrum Personally-Identified Information (PII) De-Identified Information Device-Identified Information (DII)
11.Direct identifiability • Personal information (PI) Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140 • Sanitize with pseudonymous identifiers – Secure one-way mapping of PI to opaque IDs
12.Secure pseudonymous ID generation Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140 ... Simeon Simeonov; M; 1977-07-07 One Swoop Way, Suite 305, Cambridge, MA 02140 Sim|Simeonov|M|1977-07-07|02140 // canonical representation 8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx) gPGIoVw … nNpij1LveZRtKeWU= // master encryption (AES-xxx) Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption 6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption
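The first two steps on this slide (canonical representation, then destructive hashing) can be sketched in plain Python. This is a minimal illustration, not Swoop's implementation: the `NICKNAMES` table, the function names, and the choice of SHA-256 are assumptions, and the master and per-partner encryption layers are omitted because they need a cryptography library outside the standard library.

```python
import hashlib
from datetime import date

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"simeon": "sim"}


def canonical_record(first, last, gender, dob, zip_code):
    """Build a pipe-delimited canonical form from personal information."""
    first_key = first.strip().lower()
    first_norm = NICKNAMES.get(first_key, first_key).title()
    last_norm = last.strip().title()
    gender_code = gender.strip().upper()[:1]   # "Male" and "M" both become "M"
    zip5 = zip_code.strip()[:5]                # keep the 5-digit ZIP only
    return "|".join([first_norm, last_norm, gender_code, dob.isoformat(), zip5])


def pseudonymous_id(canonical):
    """One-way mapping of the canonical form to an opaque hex ID (SHA-256 assumed)."""
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = canonical_record("Sim", "Simeonov", "Male", date(1977, 7, 7), "02140")
b = canonical_record("Simeon", "Simeonov", "M", date(1977, 7, 7), "02140")
print(a)  # Sim|Simeonov|M|1977-07-07|02140
print(pseudonymous_id(a) == pseudonymous_id(b))  # True
```

Both input variants from the slide collapse to the same canonical string, so they hash to the same opaque ID.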
13.Dealing with dirty data Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …) tune fuzzification to use cases & desired FP/FN rates
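The `S551` in the fuzzified entry is consistent with American Soundex applied to the last name. A minimal sketch of that fuzzification, assuming Soundex for the last name, a first initial for the first name, and an optional month-level date; the function names and flags are illustrative only:

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """American Soundex: first letter plus up to three digits, zero-padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue              # H and W do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code               # a vowel resets prev, so a repeat is coded again
    return (result + "000")[:4]


def fuzzy_key(first, last, gender, dob, zip_code, month_only=False):
    """Canonical key with fuzzified names; dob is an ISO string (YYYY-MM-DD)."""
    dob_part = dob[:7] if month_only else dob
    return "|".join([first.strip()[:1].upper(), soundex(last), gender, dob_part, zip_code])


print(fuzzy_key("Sim", "Simeonov", "M", "1977-07-07", "02140"))
# prints S|S551|M|1977-07-07|02140
```

Coarser keys match more records, which lowers false negatives at the cost of more false positives, so the amount of fuzzification is a per-use-case choice.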
14.Building pseudonymous IDs with Spark
15.Indirect identifiability via quasi-identifiers
16.Sanitizing quasi-identifiers • k-anonymity – Generalize or suppress quasi-identifiers – Any record is “similar” to at least k-1 other records • (k, ε)-anonymity – adds noise
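To make the k-anonymity definition concrete, here is a small sketch that generalizes two quasi-identifiers and checks that every combination of values occurs at least k times. The records, the 10-year age bands, and the 4-digit ZIP prefix are made-up illustrations, not values from the talk.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=4):
    """Keep the first `keep` digits and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical records for illustration only.
raw = [
    {"age": 34, "zip": "02140"},
    {"age": 36, "zip": "02140"},
    {"age": 38, "zip": "02141"},
    {"age": 31, "zip": "02141"},
]
generalized = [
    {"age": generalize_age(r["age"]), "zip": generalize_zip(r["zip"])} for r in raw
]

print(is_k_anonymous(raw, ["age", "zip"], 2))          # False
print(is_k_anonymous(generalized, ["age", "zip"], 2))  # True
```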
17.Sanitizing quasi-identifiers in Spark • Optimal k-anonymity is an NP-hard problem – Mondrian algorithm: greedy O(n log n) approximation https://github.com/eubr-bigsea/k-anonymity-mondrian • Active research – Locality-sensitive hashing (LSH) improvements – Risk-based approaches (LBS algorithm) – Academic Spark implementations (TDS, MDSBA)
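The greedy idea behind Mondrian can be sketched on a single machine in a few lines. This is a simplified illustration for numeric quasi-identifiers, not the linked Spark implementation: it picks the attribute with the widest raw range (the published algorithm normalizes ranges first), splits at the median, and stops when a split would leave fewer than k records on either side. The sample ages are invented.

```python
def mondrian(records, quasi_identifiers, k):
    """Greedily partition non-empty `records` into groups of at least k records."""
    def spread(attr):
        values = [r[attr] for r in records]
        return max(values) - min(values)

    # Try the attribute with the widest range first.
    for attr in sorted(quasi_identifiers, key=spread, reverse=True):
        values = sorted(r[attr] for r in records)
        split = values[len(values) // 2]          # median value
        left = [r for r in records if r[attr] < split]
        right = [r for r in records if r[attr] >= split]
        if len(left) >= k and len(right) >= k:
            return (mondrian(left, quasi_identifiers, k)
                    + mondrian(right, quasi_identifiers, k))
    return [records]                              # no allowable split remains


def summarize(group, quasi_identifiers):
    """Generalize a group to (min, max) ranges per quasi-identifier."""
    return {a: (min(r[a] for r in group), max(r[a] for r in group))
            for a in quasi_identifiers}


# Hypothetical ages for illustration only.
people = [{"age": a} for a in (22, 24, 29, 31, 35, 38, 54, 58)]
groups = mondrian(people, ["age"], k=2)
for g in groups:
    print(summarize(g, ["age"]), len(g))
```

Every record in a group is then published with the group's ranges instead of its exact values, which is what makes it indistinguishable from at least k-1 others.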
18.Sanitizing joinable datasets • The industry standard: centralized sanitization • Sanitize the data as if it were joined together • Big increase in quasi-identifiers
19.The curse of dimensionality for sanitization We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity. On k-Anonymity and the Curse of Dimensionality 2005 Aggarwal, C. @ IBM T. J. Watson Research Center
20.Curse of dimensionality example • Titanic passenger data • k-anonymize for different values of k by – Age – Age & gender • Compute normalized certainty penalty (NCP) – Higher values mean more information loss
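The slides do not say which NCP variant was used. One common formulation for a numeric attribute charges each record the width of its group's range divided by the attribute's overall range, then averages over all records. A sketch under that assumption, with invented groups:

```python
def ncp_numeric(groups, attribute):
    """Average normalized certainty penalty for one numeric attribute.

    `groups` is a list of equivalence classes, each a list of records (dicts).
    Returns a value in [0, 1]; higher means more information loss.
    """
    all_values = [r[attribute] for g in groups for r in g]
    total_range = max(all_values) - min(all_values)
    if total_range == 0:
        return 0.0
    penalty = 0.0
    for g in groups:
        values = [r[attribute] for r in g]
        # Each record in the group pays the group's range.
        penalty += len(g) * (max(values) - min(values))
    return penalty / (len(all_values) * total_range)


# Hypothetical equivalence classes for illustration only.
example_groups = [
    [{"age": 20}, {"age": 30}],
    [{"age": 40}, {"age": 60}],
]
print(ncp_numeric(example_groups, "age"))  # 0.375
```

Here the overall range is 40, the two groups span 10 and 20 years, and the average penalty is (2·10 + 2·20) / (4·40) = 0.375.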
21.[Chart: k-anonymizing Titanic passenger survivability. Normalized Certainty Penalty (y-axis, 0% to 40%) versus k (x-axis, 2 to 10) for two series: age, and gender & age]
22.Centralized sanitization describes an alternate reality
23.Centralized sanitization increases risk We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing 2014 Fredrikson, M. et al. @ UW Madison and Marshfield Clinic Research Foundation
24.High-accuracy ML/AI requires federated sanitization
25. Federated sanitization (Swoop’s prAIvacy™) • Isolated execution environments • Workflows execute across EEs • Task-specific sanitization firewalls • Sanitization is often lossless Model condition X without mixing health data with other data
26. There is no standard anonymization framework for unstructured data: suppress or structure or ???
27.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.
28.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem: PII
29.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem: PII Solution(s) • Remove common names? • Tell doctors to stop using names in their notes? • Look up patient information in notes and intentionally remove it?
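The third option, looking up known patient information and removing it from the note, can be sketched with the standard `re` module. The function name, the placeholder, and the list of known values are assumptions for illustration.

```python
import re


def redact_known_pii(note, known_values, placeholder="[REDACTED]"):
    """Replace whole-word occurrences of known patient values, case-insensitively."""
    # Longest terms first so "John Malkovitch" is matched before "John".
    terms = sorted({v for v in known_values if v}, key=len, reverse=True)
    if not terms:
        return note
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, note)


note = (
    "John Malkovitch plays tennis in Winchester. "
    "He has been reporting soreness in his elbow."
)
# Values assumed to come from the patient's structured record.
known = ["John Malkovitch", "John", "Malkovitch"]
print(redact_known_pii(note, known))
# prints [REDACTED] plays tennis in Winchester. He has been reporting soreness in his elbow.
```

The name is removed, but the town, the upcoming 60th birthday, and the trip in the full note are untouched; they act as quasi-identifiers, which is why simple removal does not solve the problem with text.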