Cloud Storage Spring Cleaning: A Treasure Hunt

Trying to decide which data to keep, archive, or delete? Yelp extracts real, actionable business value from data accesses using Spark and Parquet. By gaining crucial insight into our data at the API response level, we can launch new initiatives around right-sizing, security audits, and provenance.

Over the last decade, Yelp amassed petabytes of data within Amazon S3. Classifying that data – and determining its value to the organization – is like walking through a flea market. Sure, some inventory is priceless… but most had little value when it was new, and it has zero value now. Retention is expensive, auditing is impossible, and analysis is harder than stealing the Declaration of Independence.

Serendipitously, we discovered access logs for Yelp’s most expensive data archive. We learned that no one had ever analyzed them, because the bucket contains millions of small objects. This makes processing with Spark difficult due to S3 behavior: typically, Spark drivers run out of memory in this scenario. As our hunt continued, we created a novel solution that began as a Jupyter notebook. By first processing key names – instead of using Spark’s HDFS abstraction – we transformed our data with RDDs, schematized it into dataframes, and converted it to Apache Parquet. From there, we saved it in our S3-based Data Lake.

This talk outlines our design, shares our configurations, calls out a few pitfalls, and ends by applying our results to use cases from security, accounting, and curation. Learn to take charge of your storage! Every new machine learning model – and every new product feature – creates dozens of intermediate data models and thousands of files stuffed with usually-useless logging and debug data. By attending our session, you will learn to manage your organization’s data sprawl with our quantitative, evidence-based approach.

2.Cloud Storage Spring Cleaning: A Treasure Hunt Zach Musgrave and Steven Moy Yelp Inc #UnifiedAnalytics #SparkAISummit

3.Agenda today • What is our treasure? • Where can we find some clues? • How do we uncover the treasure map? • What can we find with it?

4.Motivation • Insight into cloud storage patterns • Justification to delete or transition data • Classification of data via real-world accesses

5.Motivation • In the beginning, there was infinite storage. • Yelp said, "Yes, please!" • Now, we don't have a problem, per se...

6.Motivation • Now, we don't have a problem, per se... • We have dozens of PBs worth of problems :/

7.What is our treasure? Spark+AI Summit ztm & smoy

8.What is our treasure? We keep many things in S3, and we want to lifecycle data into the right cost tier. How do we make the right decision? For example, should we archive after 365 days? 400 days?

9.Motivation • Cloud object storage: It's everywhere! • S3 continues to eat the world's storage workloads • No more filing tickets to add hard drives • As long as you pay the monthly bill, it’s easy... • Put(key, object) • Get(key)

10.Eventually, leadership says... Why are we spending so much money? Do we really need tens of petabytes stored in the cloud?

11.Eventually, leadership says... Why are we spending so much money? Do we really need tens of petabytes for immediate access at all times?

12.Eventually, leadership says... Why are we spending so much money? Do we really access all data equally all the time?

13.And then we say... If only we knew something - anything! - about engineers' access patterns...

14.We need a map...

15.What is the treasure? • Large storage savings, measured in $$$ • You may end up in your org's quarterly slide deck • Security implications • Principle of least privilege made easier • Access trails • Make decisions without rolling a 12-sided die

16.Where are the clues? • We use AWS S3 as our example • Your storage provider likely has similar features • S3 Server Side Logging • Apache-like access logs for every operation

17.Where are the clues? • Use S3? • Get access logs. • Where do those go? • ... Inside S3 ... • If you enable them.
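"If you enable them" is a one-time bucket configuration. Here is a minimal boto3 sketch (not from the talk; bucket names and the prefix are placeholders, and the target bucket must already allow the S3 log delivery service to write to it):

```python
def logging_config(target_bucket, target_prefix):
    """Build the BucketLoggingStatus payload expected by put_bucket_logging."""
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,   # bucket where the logs land
            "TargetPrefix": target_prefix,   # e.g. "access-logs/my-bucket/"
        }
    }

def enable_access_logs(source_bucket, target_bucket, target_prefix):
    """Turn on S3 server access logging for source_bucket.

    Requires s3:PutBucketLogging permission; the call is idempotent.
    """
    import boto3  # imported lazily so the payload helper stays dependency-free
    boto3.client("s3").put_bucket_logging(
        Bucket=source_bucket,
        BucketLoggingStatus=logging_config(target_bucket, target_prefix),
    )
```

Pointing every source bucket's logs at one target bucket, namespaced by prefix, keeps the analysis job's prefix scan simple later on.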

18.Where are the clues? Analyze the S3 access logs...

19.How do we do that? Let's look at the contents of these logs.

20.S3 Access Logs • Uncompressed • Row-based format • Files can be very small • Range from KB to MB • Dozens - to many thousands - of rows • Require a complex regex to parse

21.S3 Access Logs: Example 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be awsexamplebucket [06/Feb/2019:00:00:38 +0000] 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /awsexamplebucket?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" - s9lzHYrFp76ZVxRcpX9+5cjAnEH2ROuNkd2BHfIa6UkFVdtjf5mKR3/eTPFvsiP/XV/VLi31234= SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader
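The complexity comes from mixed delimiters: fields are space-separated, but timestamps sit in [brackets] and the request URI and user agent sit in "quotes", so a plain split() breaks. A minimal Python sketch (field names follow the AWS access-log format documentation; this simplified tokenizer is not the full production regex the slides refer to):

```python
import re

# Match a [bracketed] timestamp, a "quoted" string, or a bare token.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Leading positional fields, per the S3 server access log format docs.
FIELDS = [
    "bucket_owner", "bucket", "time", "remote_ip", "requester",
    "request_id", "operation", "key", "request_uri",
    "http_status", "error_code", "bytes_sent", "object_size",
]

def parse_line(line):
    """Tokenize one access-log line into a dict of its leading fields."""
    return dict(zip(FIELDS, TOKEN.findall(line)))
```

For example, `parse_line(...)["operation"]` on a GET request yields a value like `REST.GET.OBJECT`, which is exactly the signal needed to tell real accesses apart from lifecycle noise.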

22.Persistence and structure • Conversion is a bit expensive • No need to repeat conversion over 10+ years of data • May as well store a queryable result • Determinism: Easy to replicate prior results.

23.Before we go further All queries, code fragments, snippets, etc. that we show you are included in our GitHub repository: aws_logs_to_parquet_converter. So, no need to photograph each slide :)

24.Where are the clues?

25.S3 Prefix Scan (boto3 + RDD -> dataframe) -> Batch Conversion (Spark + Jupyter) -> Columnar Files (Parquet) -> SQL Analytics (AWS Athena)

26.Clue #1 - SQL declarative analysis • Our developers and analysts speak SQL • We can execute SQL against flat files • Lots of different ways to do this • AWS Athena is easiest for us
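The first two stages of that stack can be sketched in PySpark as follows (a hedged sketch, not the repository's code: bucket, prefix, and output path are placeholders, and schema and error handling are omitted). The key move is enumerating keys with boto3 rather than Spark's HDFS-style listing, then batching them so each task amortizes many small GETs:

```python
def list_keys(bucket, prefix):
    """Enumerate every key under a prefix, paging past the 1,000-key limit."""
    import boto3  # assumed available on the driver
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def chunk(keys, size):
    """Group keys so each Spark task handles many small objects at once."""
    batch = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def convert(spark, bucket, prefix, out_path, batch_size=100):
    """Prefix scan -> RDD -> DataFrame -> Parquet (sketch, not production code)."""
    batches = list(chunk(list_keys(bucket, prefix), batch_size))
    rdd = spark.sparkContext.parallelize(batches, len(batches) or 1)

    def fetch(batch):
        import boto3  # one client per task, created on the executor
        s3 = boto3.client("s3")
        for key in batch:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            for line in body.decode("utf-8", "replace").splitlines():
                yield (key, line)  # parse raw lines into columns here

    df = rdd.flatMap(fetch).toDF(["source_key", "raw_line"])
    df.write.mode("overwrite").parquet(out_path)
```

Listing on the driver keeps only key names in driver memory, which is what avoids the out-of-memory failures the abstract describes for millions of small objects.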

27.Clue #1(a) Nomenclature • Use your naming conventions • s3://bucketname/logs/2019/03/04/any_file_name_here.tgz • Extract object's creation date: date_parse( array_join( regexp_extract_all(key, '/(\d+)', 1), '-' ), '%Y-%m-%d' ) AS dt_written ==> date(2019, 03, 04)
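The same extraction mirrored in Python is handy for sanity-checking the pattern against your own key layout before running it in Athena (a sketch; the Presto expression on the slide remains the authoritative version):

```python
import re
from datetime import date

def dt_written(key):
    """Recover the /YYYY/MM/DD/ components from a key path, like the
    regexp_extract_all + array_join + date_parse expression in SQL."""
    parts = re.findall(r'/(\d+)', key)[:3]  # e.g. ['2019', '03', '04']
    year, month, day = (int(p) for p in parts)
    return date(year, month, day)
```

Note the same caveat applies in both languages: any extra digit groups after a slash in the file name would also match, so the pattern leans on the naming convention being consistent.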

28.Clue #1(a) Nomenclature

29.Clue #1(b) Days Apart • We don't care when accesses occur • We do care how far in the past the access is date_diff( 'day', date_parse(array_join(regexp_extract_all(...... , date_trunc('day', request_time) ) AS days_apart,
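In Python terms, days_apart is just date subtraction (a sketch mirroring Presto's date_diff('day', ..., date_trunc('day', request_time)); dt_written is the date recovered from the key name on the previous slide):

```python
from datetime import date, datetime

def days_apart(request_time, dt_written):
    """Age of the object at access time, in whole days: positive when the
    access happens after the write, matching date_diff('day', written, read)."""
    return (request_time.date() - dt_written).days
```

Aggregating this column is what turns raw logs into lifecycle evidence: if 99% of reads have days_apart under 365, archiving at 365 days is a data-backed choice rather than a guess.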