16/07 - Cassandra Netflix by summit16
1 . Capacity Forecast @ Scale. CDE, Cloud Database Engineering, Netflix.
2 . Who are we? ● CDE, Cloud Database Engineering ● Providing data stores as a service ○ Cassandra ○ Dynomite ○ Elasticsearch and RDS ● Ajay Upadhyay, Cloud Data Architect @ Netflix ● Arun Agrawal, Sr. Software Engineer @ Netflix
3 . Agenda ● Cassandra @ Netflix ● Cassandra footprint ● Capacity planning lifecycle ● Forecasting the capacity ● Q and A
4 . Cassandra @ Netflix • 98% of streaming data is stored in Cassandra • Data ranges from customer details to viewing history and streaming bookmarks to billing and payment
5 . Cassandra Footprint Hundreds C*
6 . Cassandra Footprint Thousands
7 . Capacity Planning • Able to predict – Current usage and available capacity – Resources needing upgrade – Life cycle of current configuration – Appropriate configuration for new and existing App/Service • Optimize – Under or over utilized resource – Increased business productivity
8 . Capacity Planning Avoid: • Impact on business • Service or SLA disruption • Unplanned maintenance • Firefighting
9 . Life Cycle (cycle diagram) • Capture Requirement • Requirement Analysis/feasibility • Proxy or Simulate Requirement • Optimization • Monitoring / Trending • New / Increased traffic
10 . Capture Requirement – IOPs and SLA – Maintenance overhead – Failover – Access pattern
11 . IOPs and SLA (Questions: Response)
– Read OPS/sec [avg, peak]: 5k - 10k
– Read Latency requirement: 95th - 20ms, 99th - 100ms
– Write OPS/sec [avg, peak]: 1k - 2k
– Write Latency requirement: 95th - 20ms, 99th - 100ms
– Num Columns / Row: 100
– Avg col size / or avg row size: 64k
– Num of rows: 100 Mil
– TTL [life cycle of data]: 365 Days
(Diagram labels: Gutenberg publisher service, Read, Write, Data store C*)
12 . Maintenance Overhead (Type: Response)
– Repairs / Compactions: Y/N
– Node replacement: Y
– Backup - Full / Incrementals: Y/N
13 . Failover (Questions: Response)
– Region Failover: Y/N
– SLA in case of region failover: Y/N
14 . Access Pattern (Questions: Response)
– Read: Point read, All row readers, Column slices
– Write: Part existing row, New rows
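As a rough illustration of how answers like these could feed a first sizing pass, the sketch below turns the row count and row size into a storage estimate. It assumes the 64k figure is the average row size in KiB. The replication factor of 3 and the 2 TiB of usable disk per node are hypothetical values, not taken from the talk, and headroom for compaction, backups, or TTL expiry is ignored.

```python
KIB = 1024
TIB = 1024 ** 4


def estimate_storage_bytes(num_rows, avg_row_bytes, replication_factor=3):
    """Raw data size multiplied by the number of replicas kept in the cluster."""
    return num_rows * avg_row_bytes * replication_factor


def nodes_for_storage(total_bytes, usable_bytes_per_node):
    """Smallest node count whose combined usable disk covers total_bytes."""
    # Integer ceiling division avoids floating-point rounding on large values.
    return -(-total_bytes // usable_bytes_per_node)


# Figures from the requirement table: 100 million rows, 64k average row size.
# The replication factor and per-node disk are assumptions for illustration.
total_bytes = estimate_storage_bytes(
    num_rows=100_000_000,
    avg_row_bytes=64 * KIB,
    replication_factor=3,
)
node_count = nodes_for_storage(total_bytes, usable_bytes_per_node=2 * TIB)
print(f"replicated data: {total_bytes / TIB:.1f} TiB, nodes needed: {node_count}")
```

A real estimate would also have to account for the read and write rates and latency targets in the table, which this storage-only sketch does not model.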
15 . Proxy/Simulate Traffic – Proxy existing traffic – Simulate traffic – NDBench – Generate actual / synthetic traffic before final deployment using app
16 . Optimization • Cache – Application level – Fronting cache engine before C* • Stagger read and write operations if possible
17 . Cluster Sharding
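One way to picture the "fronting cache before C*" idea is a read-through cache: reads are served from memory while an entry is fresh, and only misses reach the data store. The sketch below is a minimal in-process version. The `ReadThroughCache` class, its `loader` callback (standing in for a Cassandra read), and the 60-second TTL are all hypothetical names and values, not something described in the talk.

```python
import time


class ReadThroughCache:
    """Serve reads from memory and fall back to a loader on miss or expiry."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical stand-in for a C* point read
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit: the data store is not touched
        value = self._loader(key)  # miss or expired: read from the data store
        self._entries[key] = (value, now + self._ttl)
        return value


if __name__ == "__main__":
    backend_reads = []

    def load_bookmark(key):
        backend_reads.append(key)  # record every read that reaches the backend
        return f"bookmark-for-{key}"

    cache = ReadThroughCache(load_bookmark, ttl_seconds=60.0)
    cache.get("user-1")
    cache.get("user-1")
    print(f"backend reads: {len(backend_reads)}")
```

The sketch never evicts entries, so a production cache would need a size bound and an invalidation path for writes.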
18 . Trend Analysis Continuous monitoring / trending on usage pattern
19 . New / Increased Traffic: capacity planning cycle begins (cycle diagram) • Capture Requirement • Requirement Analysis/feasibility • Proxy or Simulate Requirement • Optimization • Monitoring / Trending • New / Increased traffic
20 . Capacity Forecasting
21 . Arun Agrawal, Sr. Software Engineer
22 . Demo
25 . Previous Architecture: Atlas Metrics
26 . Pain Points • No support for complex relationships • A hardware failure could trigger alerts, leading to false positives
27 . Winston • Bridge between Atlas and on-call • Complex relationship modeling between metrics • Reduce false positives • Auto-remediation platform
28 . Lessons Learnt • It might already be too late to fix the system. • Reactive rather than proactive
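To make "relationship modeling between metrics" a little more concrete, here is a small sketch of an alert rule that combines two signals instead of paging on one. It is not Winston's API. The `should_page` function, the metric field names, and the thresholds are hypothetical. The rule ignores nodes already flagged as unhealthy, so a single hardware failure does not page by itself, and pages only when a share of the healthy nodes breaches a latency limit.

```python
def should_page(node_metrics, latency_threshold_ms=100.0, min_fraction=0.5):
    """Decide whether to page, given per-node latency and health flags.

    node_metrics is a list of dicts with the hypothetical keys
    'p99_read_ms' (float) and 'healthy' (bool).
    """
    healthy = [m for m in node_metrics if m["healthy"]]
    if not healthy:
        return True  # no healthy nodes left: always worth a page
    slow = [m for m in healthy if m["p99_read_ms"] > latency_threshold_ms]
    # Page only when enough healthy nodes are slow at the same time.
    return len(slow) / len(healthy) >= min_fraction


if __name__ == "__main__":
    one_bad_disk = [
        {"p99_read_ms": 900.0, "healthy": False},  # failed hardware, ignored
        {"p99_read_ms": 20.0, "healthy": True},
        {"p99_read_ms": 25.0, "healthy": True},
    ]
    print(should_page(one_bad_disk))
```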
29 . Requirements • Show us the trend for the clusters. • Warn us of what is coming if the trend continues. • Give us time to scale the cluster
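These requirements can be illustrated with a simple trend projection: fit a line to recent usage samples and estimate how long until a threshold is crossed. The sketch below uses ordinary least squares on (day, disk-used fraction) pairs. The linear model, the function names, and the sample values are assumptions for illustration, and the talk does not say which forecasting method the team used.

```python
def fit_linear_trend(samples):
    """Least-squares line through (day, used_fraction) pairs.

    Returns (slope, intercept), where slope is the change per day.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(samples, threshold):
    """Days after the last sample until the fitted line reaches threshold.

    Returns None when usage is flat or shrinking, so no crossing is projected.
    """
    slope, intercept = fit_linear_trend(samples)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical daily disk usage for one cluster, as a fraction of capacity.
    usage = [(0, 0.40), (1, 0.42), (2, 0.44), (3, 0.46)]
    remaining = days_until_threshold(usage, threshold=0.60)
    print(f"projected days until 60% full: {remaining:.1f}")
```

A straight line is the simplest possible model. Seasonal traffic or step changes after a launch would call for a different fit, but the warning logic (project forward, compare against a threshold, report lead time) stays the same.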