16/07 - Cassandra Netflix by summit16
1.Capacity Forecast @ Scale CDE, Cloud Database Engineering Netflix.
2. Who are we? ● CDE, Cloud Database Engineering ● Providing data stores as a service ○ Cassandra ○ Dynomite ○ Elasticsearch ○ RDS ● Ajay Upadhyay, Cloud Data Architect @ Netflix ● Arun Agrawal, Sr. Software Engineer @ Netflix
3. Agenda ●Cassandra @ Netflix ●Cassandra footprint ●Capacity planning lifecycle ●Forecasting the capacity ●Q and A
4. Cassandra @ Netflix • 98% of streaming data is stored in Cassandra • Data ranges from customer details to Viewing history / streaming bookmarks to billing and payment
5. Cassandra Footprint Hundreds C*
6. Cassandra Footprint Thousands
7. Capacity Planning • Able to predict – Current usage and available capacity – Resources needing upgrade – Life cycle of current configuration – Appropriate configuration for new and existing App/Service • Optimize – Under or over utilized resource – Increased business productivity
8. Capacity Planning Avoid: • Impact on business • Service or SLA disruption • Unplanned maintenance • Firefighting
9. Life Cycle • Capture Requirement • Requirement Analysis/feasibility • Proxy or Simulate Requirement • Optimization • Monitoring / Trending • New / Increased traffic
10. Capture Requirement – IOPs and SLA – Maintenance overhead – Failover – Access pattern
11. IOPs and SLA
Questions | Response
Read OPS/sec [avg, peak] | 5k - 10k
Read latency requirement | 95th - 20ms, 99th - 100ms
Write OPS/sec [avg, peak] | 1k - 2k
Write latency requirement | 95th - 20ms, 99th - 100ms
Num columns / row | 100
Avg col size / or avg row size | 64k
Num of rows | 100 Mil
TTL [life cycle of data] | 365 Days
Diagram labels: Data store C*; Gutenberg publisher service Read; Gutenberg publisher service Write
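The row count and row size on this slide are enough for a back-of-envelope storage estimate. The sketch below is only illustrative: it assumes "64k" means an average row size of 64 KiB, and it assumes a replication factor of 3, which the slides do not state. Compression, compaction headroom, and TTL expiry are ignored.

```python
# Back-of-envelope storage estimate from the requirement sheet.
# Assumptions (not from the slides): "64k" is read as 64 KiB per row,
# and the replication factor is 3.

ROWS = 100_000_000          # "Num of rows: 100 Mil"
AVG_ROW_BYTES = 64 * 1024   # "64k", read here as 64 KiB per row (assumption)
REPLICATION_FACTOR = 3      # hypothetical; the slides do not state it


def raw_bytes(rows: int, avg_row_bytes: int) -> int:
    """Size of one copy of the data, before replication or compression."""
    return rows * avg_row_bytes


def replicated_bytes(rows: int, avg_row_bytes: int, rf: int) -> int:
    """Total bytes across all replicas."""
    return raw_bytes(rows, avg_row_bytes) * rf


if __name__ == "__main__":
    one_copy = raw_bytes(ROWS, AVG_ROW_BYTES)
    total = replicated_bytes(ROWS, AVG_ROW_BYTES, REPLICATION_FACTOR)
    print(f"one copy: {one_copy / 1e12:.2f} TB")
    print(f"with RF={REPLICATION_FACTOR}: {total / 1e12:.2f} TB")
```

Under these assumptions a single copy is roughly 6.5 TB and three replicas roughly 19.7 TB; a different reading of "64k" (for example, per column rather than per row) changes the result substantially.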
12. Maintenance Overhead
Type | Response
Repairs / Compactions | Y/N
Node replacement | Y
Backup - Full / Incrementals | Y/N
13. Failover
Questions | Response
Region failover | Y/N
SLA in case of region failover | Y/N
14. Access Pattern
Questions | Response
Read | Point read, All row readers, Column slices
Write | Part existing row, New rows
15. Proxy/Simulate Traffic – Proxy existing traffic – Simulate traffic – NDBench – Generate actual / synthetic traffic before final deployment using app
16. Optimization • Cache - Application level - Fronting cache engine before C* - Stagger read/write operations if possible
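As a rough illustration of the application-level cache idea, the sketch below shows a read-through cache placed in front of a slower store. The class name `ReadThroughCache` and the `load` callback are hypothetical, and a plain dict stands in for C*; this is not the caching layer used by the presenters.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """In-process cache that sits in front of a slower data store."""

    def __init__(self, load: Callable[[str], Optional[str]]) -> None:
        self._load = load            # function that reads from the backing store
        self._entries: Dict[str, Optional[str]] = {}
        self.store_reads = 0         # how many reads reached the backing store

    def get(self, key: str) -> Optional[str]:
        if key in self._entries:     # cache hit: the store is not touched
            return self._entries[key]
        self.store_reads += 1        # cache miss: fall through to the store
        value = self._load(key)
        self._entries[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after a write so the next read reloads it."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    fake_store = {"bookmark:42": "00:13:37"}   # stand-in for the real store
    cache = ReadThroughCache(fake_store.get)
    cache.get("bookmark:42")
    cache.get("bookmark:42")
    print(cache.store_reads)
```

Repeated reads of the same key reach the store only once, which is the effect the slide is after: fewer read operations against the cluster for hot keys. A production cache would also need eviction and expiry, which this sketch omits.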
17.Cluster Sharding
18. Trend Analysis Continuous monitoring / trending on usage pattern
19. New / Increased Traffic Capacity planning cycle begins • Capture Requirement • Requirement Analysis/feasibility • Proxy or Simulate Requirement • Optimization • Monitoring / Trending • New / Increased traffic
20.Capacity Forecasting
21.Arun Agrawal Sr. Software Engineer
22.Demo
25.Previous Architecture Atlas Metrics
26. Pain Points • No support for complex relationships • Hardware failures could trip alerts, leading to false positives
27. Winston • Bridge between Atlas and on-call • Complex relationship modeling between metrics • Reduce false positives • Auto remediation platform
28. Lessons Learnt • It might already be too late to fix the system. • Reactive rather than proactive
29. Requirements • Show us the trend for the clusters. • Warn us of what is coming if the trend continues. • Give us time to scale the clusters
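One simple way to meet these requirements is to fit a straight line to recent usage samples and extrapolate to a threshold. The sketch below does that with an ordinary least-squares fit. It is not the forecasting method from the talk; the function names, the sample data, and the 80% threshold are all made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold: float,
               days: Sequence[float],
               usage: Sequence[float]) -> Optional[float]:
    """Days from the last sample until the fitted trend reaches threshold.

    Returns None when usage is flat or shrinking, and 0.0 when the trend
    line is already at or above the threshold at the last sample.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


if __name__ == "__main__":
    sample_days = [0, 1, 2, 3, 4]          # hypothetical daily samples
    disk_pct = [50, 52, 54, 56, 58]        # hypothetical disk usage in percent
    print(days_until(80, sample_days, disk_pct))
```

With the made-up samples above, usage grows two points per day and the trend line reaches 80% eleven days after the last sample, which is the lead time available to scale the cluster. Real usage is rarely this linear, so a sketch like this is a starting point rather than a reliable predictor.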