Cloud Cost Management and Apache Spark

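In this talk, we share our experience in managing cloud costs: the mistakes we made, the data we collected, the lessons we learned, and the solutions we built. We discuss general principles of managing accounts and services, and of assigning budgets and attributing costs to internal teams. Using AWS as a concrete example, with Databricks and Spark as part of the solution, we show how we: (1) make AWS cost and usage data accessible to finance and budget owners; (2) build data products that help budget owners monitor costs and take action by purchasing reserved instances and setting retention policies; (3) use data science techniques to detect changes and make forecasts. The general principles and the solutions we built also apply to other cloud providers.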

1.Cloud Cost Management and Apache Spark Xuan Wang, Databricks #DSSAIS13

3.Introduction
● Goal of this talk
  ○ share our experience in managing cloud costs
  ○ tools and technologies
  ○ lessons learnt and good practices
  ○ go wide rather than go deep
● Why do we care about cloud cost?
  ○ growth in cloud revenue in Q1 2018: Amazon: 49%, Microsoft: 58%

4.Databricks’ Unified Analytics Platform
● COLLABORATIVE NOTEBOOKS: unifies data engineers and data scientists
● DATABRICKS RUNTIME (Delta, SQL, Streaming, XGBoost): unifies data and AI technologies
● CLOUD NATIVE SERVICE: eliminates infrastructure complexity

6.Three paths toward cost control
● Native reporting from cloud providers
  ○ Good general information and support
  ○ Limited options, not scalable as environment grows
● Commercial tools
  ○ More detail and flexibility, connectors to raw data
  ○ Not enough customization, additional charges
● In-house solutions
  ○ Most flexible, deeper understanding of the costs
  ○ Opportunity costs

7.Challenges in cloud cost control
● overwhelming and complex usage details
  ○ need to convert data into insights/actions
● gaps between “hands” and “wallets”
  ○ developers consume resources without realizing the charges
● evolving cloud landscape
  ○ external: new services, new discounts, ...
  ○ internal: new use cases, new architecture, ...

9.Our solutions
● Raw Data: cost and usage, s3 access logs, s3 inventory, ec2/rds snapshot, reserved instances, ...
● DATABRICKS DELTA DATA LAKE
● Analytics: Databricks Notebooks; BI tools (Superset, Tableau, ...); monitors and alerts
● The data problem: ETL and attribute costs
● The process problem: prioritize, optimize, monitor, automate

12.The data problem
● cost and usage report (detailed billing)
  ○ CSV, grouped by month, updated daily
● EC2/RDS snapshots and reserved instances
  ○ JSON, from REST API
● S3 inventory
  ○ CSV/ORC, snapshot, updated daily/weekly
● S3 access logs
  ○ raw logs in text, updated multiple times a day
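
To make the first data source concrete, here is a minimal sketch of rolling up a cost and usage CSV into daily cost per product. It is not the pipeline from the talk, which used Spark; it uses only the Python standard library so that it runs anywhere. The column names are assumed to follow the AWS Cost and Usage Report naming, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# A tiny inline sample standing in for one cost and usage report file.
# Real reports have many more columns; these rows are invented.
SAMPLE_REPORT = """\
lineItem/UsageStartDate,lineItem/ProductCode,lineItem/UnblendedCost
2018-05-01T00:00:00Z,AmazonEC2,12.50
2018-05-01T00:00:00Z,AmazonS3,3.25
2018-05-02T00:00:00Z,AmazonEC2,10.00
"""


def daily_cost_by_product(report_file):
    """Sum unblended cost per (day, product) from a CSV report."""
    totals = defaultdict(float)
    for row in csv.DictReader(report_file):
        day = row["lineItem/UsageStartDate"][:10]  # keep YYYY-MM-DD
        product = row["lineItem/ProductCode"]
        totals[(day, product)] += float(row["lineItem/UnblendedCost"])
    return dict(totals)


totals = daily_cost_by_product(io.StringIO(SAMPLE_REPORT))
for (day, product), cost in sorted(totals.items()):
    print(day, product, cost)
```

Because the report is regrouped by month and rewritten daily, a real job would likely reload the whole current month on each run rather than append to it.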

14.Data pipelines with Spark
Raw Data → (ETL) → Data Lake → (Analytics) → Insight
Challenges
● Data corruption
● Multiple jobs/staging tables
● Reliability and consistency

15.Databricks Delta: Analytics Ready Data
LOTS OF NEW DATA (Customer Data, Click Streams, Sensor data (IoT), Video/Speech, …) → DATABRICKS DELTA DATA LAKE → Reporting, Dashboards, Alerting, Machine Learning
1. Data Reliability: ACID Compliant Transactions; Schema Enforcement & Evolution
2. Query Performance: Very Fast at Scale; Indexing & Caching (10-100x Faster)
3. Simplified Architecture: Unify batch & streaming; Early data availability for analytics

16.ETL: AWS cost and usage

18.ETL: AWS s3 access logs
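
The code shown on this slide did not survive in the transcript. As a rough illustration of what parsing raw S3 access logs involves, the sketch below extracts the leading fields of one log line with a regular expression. The field order is an assumption based on the S3 server access log layout, the sample line is invented, and this is not the ETL job from the talk.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log line (assumed layout):
# owner, bucket, [time], remote IP, requester, request ID, operation,
# key, "request URI", HTTP status, error code, bytes sent.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) (?P<request_uri>"[^"]*"|-) '
    r'(?P<status>\S+) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)


def parse_log_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    # "-" means no bytes were sent for this request.
    sent = record["bytes_sent"]
    record["bytes_sent"] = 0 if sent == "-" else int(sent)
    return record


# Invented sample line; identifiers are placeholders, not real values.
SAMPLE_LINE = (
    'owner-id my-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    'requester-id 3E57427F3EXAMPLE REST.GET.OBJECT images/spark.tar '
    '"GET /my-bucket/images/spark.tar HTTP/1.1" 200 - 2048 2048 70 10 '
    '"-" "aws-cli" -'
)
record = parse_log_line(SAMPLE_LINE)
```

Returning None for unmatched lines keeps corrupt records out of the table, which is one way to handle the data corruption mentioned on slide 14.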

20.Manage Databricks Delta tables
● Create table
  CREATE TABLE s3_access_logs USING delta LOCATION '$path'
● Optimize table
  OPTIMIZE s3_access_logs ZORDER BY bucket
● Query table
  SELECT * FROM s3_access_logs WHERE bucket = 'my-bucket'
● Delta Logs: files layout & statistics
  File1: min='a', max='g'
  File2: min='g', max='n'
  File3: min='o', max='z'
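
The file statistics on this slide explain why the query is cheap: only files whose min/max range can contain the requested bucket need to be read. The sketch below is a simplified model of that pruning step using the slide's numbers. The function name is hypothetical and this is not Delta's actual implementation.

```python
# Per-file min/max statistics for the `bucket` column, as on the slide.
FILE_STATS = {
    "File1": ("a", "g"),
    "File2": ("g", "n"),
    "File3": ("o", "z"),
}


def files_to_scan(stats, value):
    """Return the files whose [min, max] range could contain `value`."""
    return [name for name, (low, high) in stats.items() if low <= value <= high]


# Matches the slide's query: WHERE bucket = 'my-bucket'
candidates = files_to_scan(FILE_STATS, "my-bucket")
print(candidates)
```

With these statistics only File2 is a candidate for 'my-bucket'. Clustering the table by bucket (the ZORDER BY step) is what keeps the per-file ranges narrow enough for this pruning to pay off.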

21.Attributions
● Rule based attributions
  ○ accounts
    ■ dedicated accounts for different teams / use cases
  ○ tagging
    ■ tag resources with budget groups
  ○ manual rules
    ■ should avoid this as much as possible
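
The three rule types above can be sketched as a small lookup chain. Everything concrete here is hypothetical: the account ID, the tag key, the group names, and the order of precedence (account, then tag, then manual rule) are assumptions made for illustration, not details from the talk.

```python
# Hypothetical mappings; account IDs, tag key, and group names are made up.
ACCOUNT_TO_GROUP = {"111111111111": "data-science"}
TAG_KEY = "budget-group"
MANUAL_RULES = [
    # (predicate on the line item, budget group); keep this list short.
    (lambda item: item.get("resource_id", "").startswith("legacy-"), "platform"),
]


def attribute(item):
    """Return the budget group for a line item, trying rules in order."""
    group = ACCOUNT_TO_GROUP.get(item.get("account_id"))
    if group:
        return group
    group = item.get("tags", {}).get(TAG_KEY)
    if group:
        return group
    for predicate, manual_group in MANUAL_RULES:
        if predicate(item):
            return manual_group
    return "unattributed"


print(attribute({"account_id": "222222222222", "tags": {"budget-group": "ml"}}))
```

Tracking the cost that falls through to "unattributed" gives a simple measure of how well the account and tagging conventions are being followed.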

22.The process problem
● Prioritize
  ○ high data transfer cost
● Optimize
  ○ reserved instance purchases
● Monitor
  ○ predictions and alerts
● Automate
  ○ auto-shutdown unused resources
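
The talk does not say which method drives its alerts. As one simple possibility for the "Monitor" step, the sketch below flags a day whose cost exceeds the trailing mean by more than k standard deviations. The threshold k and the sample costs are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, k=3.0):
    """Flag `latest` if it exceeds the trailing mean by more than k std devs."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    return latest > mean(history) + k * stdev(history)


# Made-up daily costs in dollars for one budget group.
daily_costs = [100.0, 104.0, 98.0, 101.0, 97.0]
print(is_anomalous(daily_costs, 150.0))
print(is_anomalous(daily_costs, 105.0))
```

A check like this only catches sudden jumps. Slow growth needs a forecast compared against the budget, which this sketch does not attempt.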

24.Story: high data transfer cost
● Observation
  ○ Cross-region data transfers are expensive
  ○ Two buckets cost about $1k/day
● Root cause
  ○ downloading spark images

26.Story: high data transfer cost
● Actions
  ○ Distribute images to multiple regions
  ○ Monitor cross-region cost
● Results
  ○ Significantly reduced cost
  ○ Faster cluster creation

28.Optimization: reserved instances
● Reserved instances (RI)
  ○ 1-yr/3-yr commitment in exchange for discounts
  ○ risks: underutilized instances, upfront cost
  ○ benefits: significant discounts, availability
● Challenges
  ○ non-trivial to decide how much RI to purchase
  ○ need to predict the future

29.Optimization: reserved instances
● Assign budgets to teams
● Provide tool to compute the optimal RI to buy
● Define process for RI purchase requests and approvals
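
The talk does not describe how its tool computes the optimal purchase. A minimal sketch of the underlying trade-off, assuming a single instance type, a known hourly usage profile, and hypothetical prices, is to try every possible reservation count and keep the cheapest:

```python
def optimal_reserved_count(hourly_usage, on_demand_rate, reserved_rate):
    """Return (count, cost) minimizing total cost over the usage window.

    hourly_usage: number of instances running in each hour.
    reserved_rate: effective hourly price of one reserved instance,
                   paid for every hour whether or not it is used.
    """
    best_count, best_cost = 0, float("inf")
    for count in range(max(hourly_usage, default=0) + 1):
        reserved_cost = count * reserved_rate * len(hourly_usage)
        # Instance-hours above the reservation are billed on demand.
        overflow_hours = sum(max(0, used - count) for used in hourly_usage)
        cost = reserved_cost + overflow_hours * on_demand_rate
        if cost < best_cost:
            best_count, best_cost = count, cost
    return best_count, best_cost


# Made-up usage profile and prices, for illustration only.
usage = [2, 2, 5, 10, 2, 2]
count, cost = optimal_reserved_count(usage, on_demand_rate=1.0, reserved_rate=0.5)
print(count, cost)
```

For this profile the cheapest choice is to reserve the steady baseline of 2 instances and pay on demand for the peaks. In practice the usage profile is a forecast, which is why the previous slide notes the need to predict the future.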