Enterprise Data Governance and Compliance at Scale
1. Enterprise Data Governance and Compliance at Scale
Sri Esha Subbiah, ssubbiah@twilio.com, Data Platform, Twilio
Sunil Patil, spatil@twilio.com, Data Platform, Twilio
2. Who are We?
Presenters / Q&A
• Sri Esha Subbiah, Senior Engineering Manager, Data Platform, https://www.linkedin.com/in/sri-esha-subbiah/, @srieshas
• Sunil Patil, Senior Software Engineer, Data Platform, https://www.linkedin.com/in/wpcertification/, @pppsunil
• Jeechee Chen, Senior Software Engineer, Data Platform, https://www.linkedin.com/in/jeechee/
#EntSAIS11
3. Communication Cloud
• The Twilio Cloud Communication Platform provides programmable APIs for SMS, voice, video, IM chat, and more.
• Twilio's mission is to fuel the future of communications.
• 46,000+ customers, https://customers.twilio.com/
• 1 billion voice and message data points per day
• 1.9 million developers
• 100+ countries with varying compliance requirements
4. Twilio's Data Platform Scale
• 25+ teams
• 150K messages/sec
• 30+ brokers/nodes
• 210+ Kafka topics
• 150+ bulk loads
• Petabytes of data
• 350+ Spark cores
• Multiple sources
• Multiple destinations
5. Factors to Consider for Governance & Compliance
• Collect what is needed
• Metadata management
• Identify kinds of data and classify
• Data cleansing and wrangling
• Enable easy onboarding
• Collaboration and accessibility
• Visualization of data
• Data lineage
• Security
• Auditing
• Data retention and cleanup
• Compliance: SOX, GDPR, HIPAA, PCI
6. GDPR
• Personal Data
• Data Processing and Data Subjects
• Processor and Controller
Obligations:
● Lawfulness, fairness, and transparency
● Purpose limitation
● Data minimization
● Accuracy
● Storage limitation
● Integrity and confidentiality
● Accountability
Measures:
❖ Secure Storage
❖ Anonymization
❖ Encryption
❖ Retention Policies
❖ Deletion of Data
❖ Auditing
❖ Access Control
7. Kafka Pipeline
8. Schema Registry
The Schema Registry is a Dynamo-backed REST service used by different teams at Twilio.
• JVM client for producing and consuming compliant data
• HTTP API for producing JSON compliant with the schema
• API for managing schema entities
• API for storing the schema-entity-to-topic mapping
• Each Kafka topic has a schema entity associated with it
Entity -> Topic -> Redshift Table / ElasticSearch Index
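The entity-to-topic mapping and compliance check described on this slide can be sketched as follows. This is a minimal in-memory stand-in, not the actual registry, which is a REST service. The class names, the type-based field check, and the `sms_event` entity are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaEntity:
    """Hypothetical schema entity: maps field name -> expected Python type."""
    name: str
    fields: dict


@dataclass
class SchemaRegistry:
    """In-memory stand-in for the registry (the real service is a REST API)."""
    entities: dict = field(default_factory=dict)
    topic_to_entity: dict = field(default_factory=dict)

    def register_entity(self, entity):
        self.entities[entity.name] = entity

    def map_topic(self, topic, entity_name):
        # Each topic is associated with exactly one schema entity.
        if entity_name not in self.entities:
            raise KeyError(f"unknown schema entity: {entity_name}")
        self.topic_to_entity[topic] = entity_name

    def validate(self, topic, record):
        """Return a list of problems; an empty list means the record complies."""
        entity = self.entities[self.topic_to_entity[topic]]
        problems = []
        for name, expected in entity.fields.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected):
                problems.append(f"wrong type for field: {name}")
        for name in record:
            if name not in entity.fields:
                problems.append(f"unexpected field: {name}")
        return problems


registry = SchemaRegistry()
registry.register_entity(
    SchemaEntity("sms_event", {"account_id": str, "to": str, "body": str})
)
registry.map_topic("sms-events", "sms_event")
```

A producer would call `registry.validate("sms-events", record)` before publishing and reject the record if the returned list is non-empty.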
9. Sample Schema
10. Anonymization - Redaction
Redaction removes PII in a type-specific manner:
1. Phone number: remove the last 4 digits
2. Email: remove everything but the first letter and the domain
3. Customer text: remove completely
[Diagram: Input -> Redaction -> Redacted Output]
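The three redaction rules above can be sketched as small type-specific functions. This is an illustrative sketch only: the function names, the type labels (`phone`, `email`, `text`), and the choice to pass untyped fields through unchanged are assumptions, not Twilio's implementation.

```python
def redact_phone(phone):
    """Drop the last four digits, leaving any other characters in place."""
    remaining = 4
    kept = []
    for ch in reversed(phone):
        if ch.isdigit() and remaining > 0:
            remaining -= 1  # skip this digit
        else:
            kept.append(ch)
    return "".join(reversed(kept))


def redact_email(email):
    """Keep only the first letter of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        return ""  # not a well-formed address: treat it like free text
    return local[:1] + "@" + domain


def redact_text(_text):
    """Customer text is removed completely."""
    return ""


# Hypothetical type labels mapped to their redaction rule.
REDACTORS = {"phone": redact_phone, "email": redact_email, "text": redact_text}


def redact_record(record, field_types):
    """Redact each field whose type has a rule; pass other fields through."""
    redacted = {}
    for name, value in record.items():
        rule = REDACTORS.get(field_types.get(name))
        redacted[name] = rule(value) if rule else value
    return redacted
```

Driving the redaction from a field-to-type mapping keeps the rules in one place, so a new PII type needs one new function and one dictionary entry.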
11. Sample Schema with Twilio Type
12. Compliance - Anonymization Architecture
13. Historical Anonymization - Spark
14. Anonymization - Encryption
Twilio has field-level encryption in addition to volume-level encryption.
Encryption & Decryption API
1. Input: AccountId, value that needs encrypting
2. Output: encrypted value in base64
3. Uses an account-specific encryption key
4. Provides point and bulk APIs
5. Symmetric key encryption
[Diagram: Input -> Encryption -> Encrypted Output]
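The shape of such an API (account-specific symmetric key, base64 output, point and bulk calls) can be sketched with the Python standard library. Everything below is an assumption for illustration: the key derivation from a master secret, the function names, and especially the cipher. The standard library has no AES, so an HMAC-SHA256 keystream serves as a stand-in; a real system would use a vetted authenticated cipher and a key management service.

```python
import base64
import hashlib
import hmac
import os

# Assumption: a demo secret; real keys would come from a key service.
MASTER_SECRET = b"demo-master-secret"


def account_key(account_id):
    """Derive an account-specific key (hypothetical scheme)."""
    return hmac.new(MASTER_SECRET, account_id.encode("utf-8"), hashlib.sha256).digest()


def _keystream(key, nonce, length):
    """Stand-in keystream: HMAC-SHA256 in counter mode. Not a vetted cipher."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return stream[:length]


def encrypt(account_id, value):
    """Point API: encrypt one value; returns base64 of nonce + ciphertext."""
    data = value.encode("utf-8")
    nonce = os.urandom(16)
    stream = _keystream(account_key(account_id), nonce, len(data))
    cipher = bytes(a ^ b for a, b in zip(data, stream))
    return base64.b64encode(nonce + cipher).decode("ascii")


def decrypt(account_id, token):
    """Point API: reverse encrypt() using the same account-specific key."""
    raw = base64.b64decode(token)
    nonce, cipher = raw[:16], raw[16:]
    stream = _keystream(account_key(account_id), nonce, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, stream)).decode("utf-8")


def encrypt_bulk(account_id, values):
    """Bulk API: encrypt many values for one account."""
    return [encrypt(account_id, value) for value in values]
```

Because the key is derived per account, a value encrypted for one account cannot be decrypted with another account's key, which is the property the slide's "account-specific encryption key" point describes.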
15. Data Lake (TwilioFS) and not Swamp
Challenges:
– Teams across Twilio store data in different places and different forms, difficult for internal teams to access
– Will turn into a swamp if not managed
Solution:
– Metadata management: descriptive, structural, administrative
– Deduplicated
– Standardized timestamps, indexing, directories, tags, etc.
– Versioning, encryption
– Library for direct access to cleansed data in S3
– Access control based on roles, IAM rules, and type of data
– Auditing using CloudTrail
16. Data Cleansing and Wrangling - Spark
17. Data Processing - Spark
Challenges:
• Data lake at petabyte scale
• Various data processing requirements across different teams
• Compliance on a huge volume and variety of data
• Migrating from one system to another
• Standing up a new system with all the historical data
Solutions:
1. Dynamic transformers for entities
2. Transforming the data formats from sources
3. Compliance: bulk redaction and encryption processors using Spark
4. Transformation library for standard time zones and indexing
5. Parquet format suitable for crunching
6. SparkSQL, Spark DataFrames, RDD, Spark Streaming, Spark MLlib
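As one example of what a transformation library for standard time zones might do, the sketch below normalizes timestamps to UTC and derives a day key for indexing. The choice of UTC as the standard zone, ISO 8601 as the input format, and the function names are assumptions; the slides do not specify them.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp):
    """Parse an ISO 8601 timestamp with a UTC offset and return it in UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Refuse naive timestamps rather than guess their zone.
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc).isoformat()


def day_partition(timestamp):
    """Return the UTC calendar day (YYYY-MM-DD) a record belongs to."""
    return to_utc_iso(timestamp)[:10]
```

Normalizing before partitioning means records written from different regions land in the same day partition when they describe the same instant.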
18. Dynamic Transformers - Spark
19. Data Deletion and Retention - Spark
Requirements:
• Deleting our customers' data
• Deleting their customers' data
• Customer data legal holds
• Customer-initiated: ~200k deletions/day
Challenges:
• Deleting data in a data lake is not as simple as in a DB
• The number of days that we have to delete can go back to 3285 days
Solution:
• Spark for deleting and migrating data in bulk
– Distributed
– Simplicity: a few lines of code to achieve a variety of deletions, SparkSQL
– Scalability
• Load testing and tuning executors
• Indexing and deletion strategy
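The requirements above (customer-initiated deletions, legal holds, a lookback of up to 3285 days) can be sketched in plain Python. This is a stand-in for the logic only: in Spark the final step would be a filter over a DataFrame rather than a list comprehension, and the function names and record shape are assumptions.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3285  # from the slide: deletions can reach back this far


def plan_deletions(requested_accounts, legal_holds, today, lookback_days=LOOKBACK_DAYS):
    """Decide which accounts to delete and which day range must be scanned."""
    # Accounts under legal hold are excluded even if deletion was requested.
    accounts = sorted(set(requested_accounts) - set(legal_holds))
    oldest = today - timedelta(days=lookback_days)
    return {
        "accounts": accounts,
        "oldest_day": oldest.isoformat(),
        "newest_day": today.isoformat(),
    }


def apply_deletions(records, accounts):
    """Keep only records that do not belong to an account being deleted."""
    doomed = set(accounts)
    return [r for r in records if r["account_id"] not in doomed]
```

Separating the plan from its application makes it possible to log and audit what a job intends to delete before any data is rewritten.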
20. Compliance - Retention and Deletion
21. Spark Deletion - Performance & Capacity
• 3 approaches have been analyzed: Account Index, Group Index, Day Index

Account Index
Executor Core   Driver Core   Memory   Files Processed   Duration in Hours
100             50            50g      12                Failed
128             110           110g     24                42

Group Index
Executor Core   Driver Core   Memory   Files Processed   Duration in Hours
64              32            100      288 (24 hour)     72
32              16            100      288 (24 hour)     168
22. Spark Deletion - Learnings
• Make sure the data lake is indexed and partitioned appropriately
• Consider IO: too many small files and too many indexes
• Need for tracking and locking if the job runs longer

Day Index
Worker Core   Date Range                          Indexes Affected   Duration in Hours
56            2018-01-08 through 2018-01-15       115                1.5
56            2018-01-15 through 2018-01-22       260                2.6
56            2018-01-22 through 2018-01-         452                4.8
23. What Next?
Self Service at all Layers
24. Related Links
Twilio's GDPR White Paper: https://s3.amazonaws.com/ahoy-assets.twilio.com/Whitepapers/Twilio_Whitepaper_GDPR.pdf
PII Description: https://www.twilio.com/blog/2018/05/personally-identifiable-information-pii-fields-twilio-docs-gdpr-compliance.html
Twilio's Support: https://www.twilio.com/blog/2017/09/twilios-gdpr-commitment-support-for-customer-compliance-objectives.html
We are Hiring
Twilio Job Board: https://www.twilio.com/company/jobs
Sr. Engineering Manager: https://boards.greenhouse.io/twilio/jobs/961366
Sr. Software Engineer: https://boards.greenhouse.io/twilio/jobs/1101370
25. Thank You, Q & A
Twilio's Compliance Officer: Sheila Jambekar, https://www.linkedin.com/in/sheilajambekar/, @sheilajambekar