Cloud Storage Spring Cleaning: A Treasure Hunt
Trying to decide which data to keep, archive, or delete? Yelp extracts real, actionable business value from data-access patterns using Spark and Parquet. By gaining crucial insight into our data at the API-response level, we can launch new initiatives around right-sizing, security audits, and provenance.

Over the last decade, Yelp amassed petabytes of data in Amazon S3. Classifying that data – and determining its value to the organization – is like walking through a flea market. Sure, some inventory is priceless… but most had little value when it was new, and it has zero value now. Retention is expensive, auditing is impossible, and analysis is harder than stealing the Declaration of Independence.

Serendipitously, we discovered access logs for Yelp’s most expensive data archive. We learned that no one had ever analyzed them, because the bucket contains millions of small objects – a scenario in which S3’s behavior makes processing with Spark difficult, and Spark drivers typically run out of memory. As our hunt continued, we created a novel solution that began as a Jupyter notebook. By first processing key names – instead of using Spark’s HDFS abstraction – we transformed our data with RDDs, schematized it into dataframes, and converted it to Apache Parquet. From there, we saved it in our S3-based Data Lake.

This talk outlines our design, shares our configurations, calls out a few pitfalls, and ends by applying our results to use cases from security, accounting, and curation. Learn to take charge of your storage! Every new machine-learning model – and every new product feature – creates dozens of intermediate data models and thousands of files stuffed with usually-useless logging and debug data. By attending our session, you will learn to manage your organization’s data sprawl with our quantitative, evidence-based approach.
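The key-name-first approach described above can be sketched in Python. This is a minimal illustration, not Yelp's actual code: the key layout (`<service>/<YYYY>/<MM>/<DD>/<filename>`), the `parse_key` helper, and the record fields are all hypothetical stand-ins for whatever naming convention a real bucket uses. The idea is that key strings are cheap to list and parse, so you can build structured records from them before ever touching object contents.

```python
import re
from typing import NamedTuple, Optional

class KeyRecord(NamedTuple):
    """Structured view of one S3 key (hypothetical schema)."""
    service: str
    date: str       # ISO date reconstructed from the path
    filename: str

# Hypothetical key layout: "<service>/<YYYY>/<MM>/<DD>/<filename>"
KEY_RE = re.compile(
    r"^(?P<service>[^/]+)/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<f>.+)$"
)

def parse_key(key: str) -> Optional[KeyRecord]:
    """Parse a raw key name into a record; return None for keys
    that do not match the expected layout."""
    m = KEY_RE.match(key)
    if m is None:
        return None
    return KeyRecord(
        service=m.group("service"),
        date=f"{m.group('y')}-{m.group('m')}-{m.group('d')}",
        filename=m.group("f"),
    )

# Parse a small batch of listed key names into records,
# dropping keys that don't fit the layout.
keys = [
    "photos/2019/05/01/img_001.jpg",
    "logs/2019/05/01/api.log.gz",
    "not-a-matching-key",
]
records = [r for r in (parse_key(k) for k in keys) if r is not None]
```

In a Spark pipeline of the kind the abstract describes, the listed key names would be distributed with something like `sc.parallelize(keys).map(parse_key)`, the resulting RDD of records converted to a DataFrame, and the DataFrame written out with `df.write.parquet(...)` – keeping the memory-hungry work on the executors rather than the driver.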