Great Models with Great Privacy: Optimizing ML and AI Over Sensitive Data (conti

There is a growing feeling that privacy concerns dampen innovation in machine learning and AI applied to personal and/or sensitive data. After all, ML and AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously affect model quality and predictive power. While this is technically true for some privacy-safe modeling techniques, it's not true in general. The root cause of the problem is twofold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-preserving machine learning and AI easy. This talk will challenge the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers will introduce a wide range of techniques, from simple hashing to advanced embeddings, for high-accuracy, privacy-safe model development. Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization, and ensuring downstream privacy in multi-party collaborations. In addition, we will dig into embeddings as a unique deep learning-based approach for privacy-preserving modeling over unstructured data. Special attention will be given to Spark-based production environments.
1.Great Models with Great Privacy: Optimizing ML & AI Over Sensitive Data • Sim Simeonov, CTO, Swoop (sim@swoop.com / @simeons) • Slater Victoroff, CTO, Indico (slater@indico.io / @sl8rv) #UnifiedAnalytics #SparkAISummit

2.privacy-preserving ML/AI to improve patient outcomes and pharmaceutical operations e.g., we improve the diagnosis rate of rare diseases

3. Intelligent Process Automation for Unstructured Content using ML/AI with strong data protection guarantees e.g. we automate the processing of loan documents using a combination of NLP and Computer Vision

4.Regulation affects ML & AI • General Data Protection Regulation (GDPR) – Already in effect in the EU • California Consumer Privacy Act (CCPA) – Comes into effect Jan 1, 2020 • Many federal efforts under way – Information Transparency and Data Control Act (DelBene, D-WA) – Consumer Data Protection Act (Wyden, D-OR) – Data Care Act (Schatz, D-HI)

5.accuracy vs. privacy is a false dichotomy (if you are willing to invest in privacy infrastructure)

6.Privacy-preserving computation frontiers • Stochastic – Differential privacy (DP) • Encryption-based – Fully homomorphic encryption (FHE) • Protocol-based – Secure multi-party computation (SMC)
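
To make the stochastic frontier concrete, here is a minimal sketch of the Laplace mechanism for a differentially private count. It is not from the talk: the function names, the `epsilon` value and the sample data are illustrative assumptions, and floating-point Laplace sampling like this is not considered safe for production use.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or
    # removed (sensitivity 1), so the noise scale is 1 / epsilon.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 70, 19, 44]  # made-up data
    rng = random.Random()
    print(dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` give stronger privacy and noisier answers, which is the accuracy cost the abstract refers to.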

7.When privacy-preserving algorithms are immature, sanitize the data the algorithms are trained on

8.Session roadmap • The rest of this session – Sanitizing a single dataset using Spark • After the break – Sanitizing joinable datasets the smart way – Embeddings for unstructured data (deep learning!)

9.What does it mean to sanitize data for privacy?

10.Identifiability spectrum: Personally-Identified Information (PII) • De-Identified Information • Device-Identified Information (DII)

11.Direct identifiability • Personal information (PI) Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140 • Sanitize with pseudonymous identifiers – Secure one-way mapping of PI to opaque IDs

12.Secure pseudonymous ID generation
Sim Simeonov; Male; July 7, 1977 One Swoop Way, Cambridge, MA 02140
...
Simeon Simeonov; M; 1977-07-07 One Swoop Way, Suite 305, Cambridge, MA 02140
Sim|Simeonov|M|1977-07-07|02140 // canonical representation
8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx)
gPGIoVw … nNpij1LveZRtKeWU= // master encryption (AES-xxx)
Vw50jZjh6BCWUz … mfUFtyGZ3q // partner A encryption
6ykWEv7A2lis8 … VT2ZddaOeML // partner B encryption
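
The first two steps on this slide (canonical representation, then one-way hashing) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the speakers' implementation: the `NICKNAMES` table, the function names and the choice of SHA-256 are hypothetical. The slide uses AES encryption for the master and partner IDs; because the Python standard library has no AES, the sketch swaps in a keyed HMAC to show partner-specific IDs, which, unlike encryption, cannot be reversed by the key holder.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical nickname table; a real system would need a much larger one.
NICKNAMES = {"simeon": "sim"}


def canonical_record(first, last, gender, dob, zip_code):
    """Normalize already-parsed fields into one pipe-delimited string."""
    first_key = first.strip().lower()
    first_norm = NICKNAMES.get(first_key, first_key).capitalize()
    last_norm = last.strip().capitalize()
    gender_norm = gender.strip()[:1].upper()  # "Male" -> "M"
    zip5 = zip_code.strip()[:5]
    return "|".join([first_norm, last_norm, gender_norm, dob.isoformat(), zip5])


def pseudonymous_id(canonical):
    """One-way (destructive) hash of the canonical representation."""
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def partner_id(master_id, partner_key):
    """Partner-specific ID. HMAC stands in for the slide's AES step."""
    return hmac.new(partner_key, master_id.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    a = canonical_record("Sim", "Simeonov", "Male", date(1977, 7, 7), "02140")
    b = canonical_record("Simeon", "Simeonov", "M", date(1977, 7, 7), "02140")
    print(a == b, pseudonymous_id(a) == pseudonymous_id(b))
```

Giving each partner a different key means two partners cannot join their datasets on the IDs they receive without going through the party that holds the keys.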

13.Dealing with dirty data
Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean
S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos
Sim|Simeonov|M|1977-07|02140 // generalize structured fields (dates, locations, …)
Tune fuzzification to use cases & desired FP/FN rates
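
The fuzzified key on this slide is consistent with a first initial plus the American Soundex code of the last name (`S551` for Simeonov). The sketch below assumes that reading; the function names are hypothetical, and the talk does not say which phonetic algorithm was actually used.

```python
from datetime import date

_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return "".join(out)[:4].ljust(4, "0")


def fuzzy_name_key(first, last, gender, dob, zip5):
    """Initial + Soundex(last name): tolerant of typos and short entries."""
    return "|".join([first[:1].upper(), soundex(last), gender, dob.isoformat(), zip5])


def month_generalized_key(first, last, gender, dob, zip5):
    """Keep names intact but generalize the birth date to year and month."""
    return "|".join([first, last, gender, f"{dob.year:04d}-{dob.month:02d}", zip5])


if __name__ == "__main__":
    dob = date(1977, 7, 7)
    print(fuzzy_name_key("Sim", "Simeonov", "M", dob, "02140"))
    print(month_generalized_key("Sim", "Simeonov", "M", dob, "02140"))
```

Coarser keys match more dirty records (fewer false negatives) at the cost of merging distinct people (more false positives), which is the tuning trade-off the slide mentions.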

14.Building pseudonymous IDs with Spark

15.Indirect identifiability via quasi-identifiers

16.Sanitizing quasi-identifiers • k-anonymity – Generalize or suppress quasi-identifiers – Any record is “similar” to at least k-1 other records • (k, ε)-anonymity – adds noise
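
As a small illustration of the k-anonymity definition above, the following sketch generalizes an age into a ten-year band and checks whether every combination of quasi-identifier values occurs at least k times. The helper names, the band width and the sample rows are assumptions made for the example.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in >= k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


if __name__ == "__main__":
    rows = [  # made-up records
        {"age": 22, "gender": "F"},
        {"age": 25, "gender": "F"},
        {"age": 31, "gender": "M"},
        {"age": 38, "gender": "M"},
    ]
    print(is_k_anonymous(rows, ["age", "gender"], k=2))
    banded = [{**r, "age": generalize_age(r["age"])} for r in rows]
    print(is_k_anonymous(banded, ["age", "gender"], k=2))
```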

17.Sanitizing quasi-identifiers in Spark • Optimal k-anonymity is an NP-hard problem – Mondrian algorithm: greedy O(n log n) approximation https://github.com/eubr-bigsea/k-anonymity-mondrian • Active research – Locality-sensitive hashing (LSH) improvements – Risk-based approaches (LBS algorithm) – Academic Spark implementations (TDS, MDSBA)
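
The greedy idea behind Mondrian can be sketched in a few lines of single-machine Python. This is a simplified illustration for numeric quasi-identifiers only, not the linked Spark implementation: it recursively splits on the attribute with the widest normalized range at the median, and stops when a split would leave fewer than k records on either side. All names and data are hypothetical.

```python
from statistics import median_low


def mondrian_partitions(rows, k):
    """Greedily partition numeric tuples so each partition has >= k rows."""
    rows = list(rows)
    n_dims = len(rows[0])
    # Full range of each attribute, used to normalize widths (avoid zero).
    full_span = [
        (max(r[d] for r in rows) - min(r[d] for r in rows)) or 1
        for d in range(n_dims)
    ]

    def split(part):
        def width(d):
            values = [r[d] for r in part]
            return (max(values) - min(values)) / full_span[d]

        # Try the widest attribute first, then fall back to narrower ones.
        for d in sorted(range(n_dims), key=width, reverse=True):
            pivot = median_low(r[d] for r in part)
            left = [r for r in part if r[d] <= pivot]
            right = [r for r in part if r[d] > pivot]
            if len(left) >= k and len(right) >= k:
                return split(left) + split(right)
        return [part]  # no allowable cut remains

    return split(rows)


def generalize(partition):
    """Describe a partition as one (min, max) range per attribute."""
    n_dims = len(partition[0])
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition))
        for d in range(n_dims)
    )


if __name__ == "__main__":
    # (age, gender encoded as 0/1); made-up data
    data = [(22, 0), (25, 1), (27, 0), (31, 1), (35, 0), (38, 1), (40, 0), (52, 1)]
    for part in mondrian_partitions(data, k=2):
        print(generalize(part), len(part))
```

Publishing the generalized ranges instead of the raw values gives a k-anonymous view of these attributes; the width of those ranges is the information loss being paid.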

18.Sanitizing joinable datasets • The industry standard: centralized sanitization • Sanitize the data as if it were joined together • Big increase in quasi-identifiers

19.The curse of dimensionality for sanitization “We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. ... we are faced with ... either completely suppressing most of the data or losing the desired level of anonymity.” (Aggarwal, C., “On k-Anonymity and the Curse of Dimensionality”, 2005, IBM T. J. Watson Research Center)

20.Curse of dimensionality example • Titanic passenger data • k-anonymize for different values of k by – Age – Age & gender • Compute normalized certainty penalty (NCP) – Higher values mean more information loss
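
For reference, one common simplified form of the normalized certainty penalty for numeric attributes can be computed as below. Each record is charged the width of its partition's range divided by the attribute's full range, averaged over attributes and records. The talk does not state which exact NCP variant or weighting was used, so treat the formula and names here as assumptions.

```python
def normalized_certainty_penalty(partitions):
    """NCP in [0, 1] for partitions of numeric tuples (higher = more loss)."""
    rows = [r for part in partitions for r in part]
    n_dims = len(rows[0])
    total = 0.0
    for d in range(n_dims):
        full = max(r[d] for r in rows) - min(r[d] for r in rows)
        if full == 0:
            continue  # a constant attribute carries no information to lose
        for part in partitions:
            width = max(r[d] for r in part) - min(r[d] for r in part)
            total += len(part) * width / full
    return total / (len(rows) * n_dims)


if __name__ == "__main__":
    ages = [[(20,), (30,)], [(40,), (60,)]]  # two partitions of made-up ages
    print(normalized_certainty_penalty(ages))
```

A value of 0 means no generalization was needed; a value of 1 means every record was generalized to the full range of every attribute.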

21.[Chart: k-anonymizing Titanic passenger survivability. Normalized certainty penalty (y-axis, 0% to 40%) versus k (x-axis, 2 to 10) for two series, "age" and "gender & age"; the chart carries a "300%" annotation.]

22.Centralized sanitization describes an alternate reality

23.Centralized sanitization increases risk “We find that for privacy budgets effective at preventing attacks, patients would be exposed to increased risk of stroke, bleeding events, and mortality.” (Fredrikson, M. et al., “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing”, 2014, UW Madison and Marshfield Clinic Research Foundation)

24.High-accuracy ML/AI requires federated sanitization

25. Federated sanitization (Swoop’s prAIvacy™) • Isolated execution environments • Workflows execute across EEs • Task-specific sanitization firewalls • Sanitization is often lossless Model condition X without mixing health data with other data

26. There is no standard anonymization framework for unstructured data: suppress or structure or ???

27.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.

28.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem: PII

29.The Problem With Text John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation. Problem: PII Solution(s): • Remove common names? • Tell doctors to stop using names in their notes? • Look up patient information in notes and intentionally remove it
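
The last option on this slide, removing values already known from the patient record, can be sketched with a few regular expressions. The function name, the placeholder labels and the patient fields are assumptions made for illustration.

```python
import re


def redact_known_values(note, known_values):
    """Replace known patient values (and their individual words) with labels."""
    for label, value in known_values.items():
        # Try the full value first, then each word on its own.
        for token in [value] + value.split():
            pattern = re.compile(r"\b" + re.escape(token) + r"\b", re.IGNORECASE)
            note = pattern.sub(lambda _match, lab=label: f"[{lab}]", note)
    return note


if __name__ == "__main__":
    note = (
        "John Malkovitch plays tennis in Winchester. "
        "His 60th birthday is in two weeks. "
        "After he returns from his birthday trip to Casablanca we will "
        "recommend a steroid shot to reduce inflammation."
    )
    patient = {"NAME": "John Malkovitch", "CITY": "Winchester"}
    print(redact_known_values(note, patient))
```

In this example the age, the approximate birth date and the travel destination all survive redaction, so lookup-based removal handles direct identifiers but leaves quasi-identifiers in the free text.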