16/07 - Cassandra Netflix by summit16


1. Capacity Forecast @ Scale. CDE, Cloud Database Engineering, Netflix.

2. Who are we? ● CDE, Cloud Database Engineering ● Providing data stores as a service ○ Cassandra ○ Dynomite ○ Elasticsearch ○ RDS ● Ajay Upadhyay, Cloud Data Architect @ Netflix ● Arun Agrawal, Sr. Software Engineer @ Netflix

3. Agenda ● Cassandra @ Netflix ● Cassandra footprint ● Capacity planning lifecycle ● Forecasting the capacity ● Q and A

4. Cassandra @ Netflix • 98% of streaming data is stored in Cassandra • Data ranges from customer details to viewing history / streaming bookmarks to billing and payment

5. Cassandra Footprint Hundreds C*

6. Cassandra Footprint Thousands

7. Capacity Planning • Able to predict – Current usage and available capacity – Resources needing upgrade – Life cycle of current configuration – Appropriate configuration for new and existing App/Service • Optimize – Under- or over-utilized resources – Increased business productivity

8. Capacity Planning Avoid: • Impact on business • Service or SLA disruption • Unplanned maintenance • Firefighting

9. Life Cycle (cycle diagram): Capture Requirement -> Requirement Analysis/feasibility -> Proxy or Simulate Requirement -> Optimization -> Monitoring / Trending -> New / Increased traffic -> back to Capture Requirement

10. Capture Requirement – IOPs and SLA – Maintenance overhead – Failover – Access pattern

11. IOPs and SLA (Question: Response) • Data store: C* • Read OPS/sec [avg, peak]: 5k - 10k • Read latency requirement: 95th - 20ms, 99th - 100ms • Write OPS/sec [avg, peak]: 1k - 2k • Write latency requirement: 95th - 20ms, 99th - 100ms • Num columns / row: 100 • Avg col size / or avg row size: 64k • Num of rows: 100 Mil • TTL [life cycle of data]: 365 Days • Service: Gutenberg publisher service (Read, Write)
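The figures on the IOPs and SLA slide can be turned into a rough storage estimate. The sketch below is only an illustration: it assumes "64k" means 64 KiB per row, a replication factor of 3, and 1 TB of usable disk per node, none of which is stated on the slide. The names `Requirement`, `raw_data_bytes`, and `nodes_needed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    peak_read_ops: int            # reads per second at peak (slide: 10k)
    peak_write_ops: int           # writes per second at peak (slide: 2k)
    num_rows: int                 # slide: 100 Mil
    avg_row_bytes: int            # slide: 64k, assumed here to be 64 KiB per row
    replication_factor: int = 3   # assumption, not stated on the slide


def raw_data_bytes(req: Requirement) -> int:
    """Size of a single copy of the data, before replication."""
    return req.num_rows * req.avg_row_bytes


def replicated_bytes(req: Requirement) -> int:
    """Total size once every row is stored replication_factor times."""
    return raw_data_bytes(req) * req.replication_factor


def nodes_needed(req: Requirement, usable_bytes_per_node: int) -> int:
    """Minimum node count by disk alone (ceiling division)."""
    return -(-replicated_bytes(req) // usable_bytes_per_node)


req = Requirement(
    peak_read_ops=10_000,
    peak_write_ops=2_000,
    num_rows=100_000_000,
    avg_row_bytes=64 * 1024,
)
print(raw_data_bytes(req))          # bytes for one copy of the data
print(nodes_needed(req, 10**12))    # nodes, assuming 1 TB usable per node
```

This covers disk only. Compaction headroom, throughput per node, and the latency targets would each add their own constraint, and the largest of them would decide the cluster size.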

12. Maintenance Overhead (Type: Response) • Repairs / Compactions: Y/N • Node replacement: Y • Backup - Full / Incrementals: Y/N

13. Failover (Question: Response) • Region failover: Y/N • SLA in case of region failover: Y/N

14. Access Pattern (Question: Response) • Read: point read, all row readers, column slices • Write: part of existing row, new rows

15. Proxy/Simulate Traffic – Proxy existing traffic – Simulate traffic – NDBench – Generate actual / synthetic traffic before final deployment using app

16. Optimization • Cache - Application level - Fronting cache engine before C* • Stagger read and write operations if possible
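The idea of fronting C* with a cache can be sketched as a read-through cache: repeated reads of the same key are served from memory and only the first read reaches the data store. This is a minimal illustration, not the cache engine used at Netflix. `ReadThroughCache` and `load_from_store` are hypothetical names, and the loader stands in for a real Cassandra query.

```python
class ReadThroughCache:
    """Serve repeated reads from memory; go to the store only on a miss."""

    def __init__(self, load_from_store):
        self._load = load_from_store   # callable: key -> value (the slow path)
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._load(key)        # one read against the backing store
        self._data[key] = value
        return value


# Stand-in for the data store: records every read that reaches it.
store_reads = []


def load_from_store(key):
    store_reads.append(key)
    return f"row-for-{key}"


cache = ReadThroughCache(load_from_store)
for _ in range(3):
    cache.get("customer:42")

print(len(store_reads), cache.hits, cache.misses)
```

A real deployment would also need eviction, expiry aligned with the data's TTL, and invalidation on writes, all of which this sketch leaves out.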

17. Cluster Sharding

18. Trend Analysis Continuous monitoring / trending on usage pattern

19. New / Increased Traffic: capacity planning cycle begins again (cycle diagram): Capture Requirement -> Requirement Analysis/feasibility -> Proxy or Simulate Requirement -> Optimization -> Monitoring / Trending -> New / Increased traffic

20. Capacity Forecasting

21. Arun Agrawal, Sr. Software Engineer

25. Previous Architecture: Atlas Metrics

26. Pain Points • No support for complex relationships between metrics • Hardware failures could trigger alerts, leading to false positives

27. Winston • Bridge between Atlas and on-call • Complex relationship modeling between metrics • Reduces false positives • Auto-remediation platform
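One way to picture "relationship modeling between metrics" is a rule that looks at several signals together before deciding whether to page someone. The sketch below is purely illustrative and is not Winston's API. `NodeMetrics`, `decide_action`, and the thresholds are hypothetical, with the 100 ms limit borrowed from the 99th percentile latency requirement earlier in the deck.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    read_latency_p99_ms: float
    disk_used_pct: float
    hardware_healthy: bool


def decide_action(m: NodeMetrics,
                  latency_limit_ms: float = 100.0,
                  disk_limit_pct: float = 80.0) -> str:
    """Combine signals instead of alerting on each metric in isolation."""
    if not m.hardware_healthy:
        # A failing node explains bad latency; remediate instead of paging.
        return "replace_node"
    if m.read_latency_p99_ms > latency_limit_ms and m.disk_used_pct > disk_limit_pct:
        # Slow reads on a nearly full but healthy node look like a capacity problem.
        return "page_oncall"
    return "no_action"


print(decide_action(NodeMetrics(250.0, 40.0, hardware_healthy=False)))
print(decide_action(NodeMetrics(250.0, 90.0, hardware_healthy=True)))
print(decide_action(NodeMetrics(250.0, 40.0, hardware_healthy=True)))
```

The first case is the false positive from the pain points: latency is high, but the cause is hardware, so the rule routes it to automated remediation rather than an on-call page.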

28. Lesson Learnt • By the time an alert fires, it might already be too late to fix the system • Alerting is reactive rather than proactive

29. Requirements • Show us the trend for the clusters • Warn us of what is coming if the trend continues • Give teams time to scale their clusters
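These three requirements can be sketched with a simple linear trend. The example fits a least-squares line to daily disk-usage samples, projects when usage would cross a threshold, and warns if that is within the lead time needed to scale. It is a minimal illustration, not the forecasting method from the talk. The function names, the 80% threshold, and the sample data are all assumptions.

```python
def fit_line(ys):
    """Least-squares slope and intercept for samples taken at days 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(ys, threshold):
    """Days from the latest sample until the trend line reaches the threshold.

    Returns None when usage is flat or shrinking, so no crossing is forecast.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(ys) - 1))


def should_warn(ys, threshold, lead_time_days):
    """True when the projected crossing falls inside the scaling lead time."""
    remaining = days_until(ys, threshold)
    return remaining is not None and remaining <= lead_time_days


# Hypothetical daily disk usage (percent) for one cluster.
disk_used_pct = [50, 51, 52, 53, 54]
print(days_until(disk_used_pct, threshold=80))
print(should_warn(disk_used_pct, threshold=80, lead_time_days=30))
```

A straight line is the simplest possible model. Seasonal traffic or step changes after a launch would need something richer, but the shape of the check (trend, projection, lead time) stays the same.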