16/07 - Cassandra Netflix by summit16

China Cassandra Technical Community

Published 6 years ago

6619 views

#Information technology

●Cassandra @ Netflix ●Cassandra footprint ●Capacity planning lifecycle ●Forecasting the capacity ●Q and A

1 . Capacity Forecast @ Scale
CDE, Cloud Database Engineering, Netflix

2 . Who are we?
● CDE, Cloud Database Engineering
● Providing data stores as a service
  ○ Cassandra
  ○ Dynomite
  ○ Elasticsearch and RDS
Ajay Upadhyay, Cloud Data Architect @ Netflix
Arun Agrawal, Sr. Software Engineer @ Netflix

3 . Agenda
● Cassandra @ Netflix
● Cassandra footprint
● Capacity planning lifecycle
● Forecasting the capacity
● Q and A

4 . Cassandra @ Netflix
• 98% of streaming data is stored in Cassandra
• Data ranges from customer details, to viewing history / streaming bookmarks, to billing and payment

5 . Cassandra Footprint Hundreds C*

6 . Cassandra Footprint Thousands

7 . Capacity Planning
• Able to predict
  – Current usage and available capacity
  – Resources needing upgrade
  – Life cycle of current configuration
  – Appropriate configuration for new and existing App/Service
• Optimize
  – Under or over utilized resource
  – Increased business productivity
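The "current usage and available capacity" and "under or over utilized resource" points can be illustrated with a small sketch. The function names and the 30% / 70% thresholds below are assumptions chosen for illustration; the slides do not state any thresholds.

```python
def headroom(used, capacity):
    """Remaining capacity, in the same unit as the inputs."""
    return capacity - used


def classify_utilization(used, capacity, low=0.30, high=0.70):
    """Label a resource by its utilization ratio.

    The low/high thresholds are illustrative assumptions, not values
    taken from the talk.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio < low:
        return "under-utilized"
    if ratio > high:
        return "over-utilized"
    return "ok"


if __name__ == "__main__":
    # Example: a node with 900 GB used out of 1000 GB.
    print(classify_utilization(900, 1000), headroom(900, 1000))
```

In practice the same check would be run per resource (disk, CPU, network) and per cluster, since a cluster can be over-utilized on one dimension and under-utilized on another.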

8 . Capacity Planning
Avoid:
• Impact on business
• Service or SLA disruption
• Unplanned maintenance
• Firefighting

9 . Life Cycle
Cycle diagram stages:
• Capture Requirement
• Requirement Analysis/feasibility
• Proxy or Simulate Requirement
• Optimization
• Monitoring / Trending
• New / Increased traffic

10 . Capture Requirement
– IOPs and SLA
– Maintenance overhead
– Failover
– Access pattern

11 . IOPs and SLA
Questions                            Response
Read OPS/sec [avg, peak]             5k - 10k
Read latency requirement             95th - 20ms, 99th - 100ms
Write OPS/sec [avg, peak]            1k - 2k
Write latency requirement            95th - 20ms, 99th - 100ms
Num columns / row                    100
Avg col size / or avg row size       64k
Num of rows                          100 Mil
TTL [life cycle of data]             365 Days
Diagram labels: Gutenberg publisher service (Read), Gutenberg publisher service (Write), Data store C*
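The sample responses above are enough for a rough storage estimate. The sketch below is a back-of-the-envelope calculation, not the method used in the talk. It assumes "64k" means 64 KiB per row (the slide leaves open whether it is a column or row size), a replication factor of 3, and 1 TiB of usable disk per node; none of these are stated in the slides. It also ignores compaction overhead and TTL expiry.

```python
from dataclasses import dataclass

TIB = 1024 ** 4


@dataclass
class Requirement:
    """Questionnaire answers; defaults mirror the sample slide."""
    read_ops_peak: int = 10_000
    write_ops_peak: int = 2_000
    num_rows: int = 100_000_000
    avg_row_bytes: int = 64 * 1024  # assumption: "64k" read as 64 KiB per row
    ttl_days: int = 365


def raw_data_bytes(req):
    """Unreplicated data size: rows times average row size."""
    return req.num_rows * req.avg_row_bytes


def storage_nodes(req, replication_factor=3, usable_bytes_per_node=TIB):
    """Nodes needed for storage alone (both keyword defaults are assumptions)."""
    total = raw_data_bytes(req) * replication_factor
    # Ceiling division without floating point.
    return -(-total // usable_bytes_per_node)


if __name__ == "__main__":
    req = Requirement()
    print(f"raw data: {raw_data_bytes(req) / TIB:.2f} TiB")
    print(f"storage nodes: {storage_nodes(req)}")
```

A storage-only number is a lower bound. The peak OPS and latency targets in the same table may call for more nodes than disk alone does, which is what the proxy/simulate step is meant to reveal.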

12 . Maintenance Overhead
Type                             Response
Repairs / Compactions            Y/N
Node replacement                 Y
Backup - Full / Incrementals     Y/N

13 . Failover
Questions                          Response
Region failover                    Y/N
SLA in case of region failover     Y/N

14 . Access Pattern
Questions     Response
Read          Point read; All row readers; Column slices
Write         Part existing row; New rows

15 . Proxy/Simulate Traffic
– Proxy existing traffic
– Simulate traffic
  – NDBench
– Generate actual / synthetic traffic before final deployment using app
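To make "synthetic traffic" concrete, here is a minimal sketch of a generator that produces a weighted mix of read and write operations. It is not NDBench and does not use its API. The function name, the 5:1 read/write weighting (loosely based on the 5k read and 1k write average OPS in the sample requirement), and the key space size are all assumptions for illustration.

```python
import random


def synthetic_ops(n, read_weight=5, write_weight=1, key_space=1_000_000, seed=0):
    """Return n (operation, key) pairs with a weighted read/write mix.

    A fixed seed keeps runs repeatable so that two configurations can be
    compared against the same operation stream.
    """
    rng = random.Random(seed)
    kinds = rng.choices(["read", "write"], weights=[read_weight, write_weight], k=n)
    return [(kind, rng.randrange(key_space)) for kind in kinds]


if __name__ == "__main__":
    ops = synthetic_ops(10_000)
    reads = sum(1 for kind, _ in ops if kind == "read")
    print(f"reads: {reads}, writes: {len(ops) - reads}")
```

A real load test would replay such a stream against a test cluster at the target rate and record latency percentiles to compare against the 95th / 99th targets.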

16 . Optimization
• Cache
  - Application level
  - Fronting cache engine before C*
• Stagger R/W operations if possible

17 .Cluster Sharding

18 . Trend Analysis Continuous monitoring / trending on usage pattern

19 . New / Increased Traffic
Capacity planning cycle begins
Cycle diagram stages:
• Capture Requirement
• Requirement Analysis/feasibility
• Proxy or Simulate Requirement
• Optimization
• Monitoring / Trending
• New / Increased traffic

20 .Capacity Forecasting

21 .Arun Agrawal Sr. Software Engineer

22 .Demo

25 .Previous Architecture Atlas Metrics

26 . Pain Points
• No support for complex relationships
• Hardware failure could lead to false positives

27 . Winston
• Bridge between Atlas and on-call
• Complex relationship modeling between metrics
• Reduce false positives
• Auto remediation platform

28 . Lessons Learnt
• It might already be too late to fix the system.
• Reactive rather than proactive

29 . Requirements
• Show us the trend for the clusters.
• Warn us of what is coming if the trend continues.
• Give us time to scale the cluster.
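One simple way to meet these three requirements is a linear trend fit: draw a least-squares line through recent usage samples and extrapolate to the point where it crosses a limit. The sketch below assumes evenly spaced daily samples and a straight-line trend. The function names and the 80% threshold in the example are illustrative, and the slides do not say which forecasting model the team actually used.

```python
def fit_line(ys):
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until(ys, threshold):
    """Days after the last sample until the fitted line reaches threshold.

    Returns None when usage is flat or shrinking, and 0.0 when the
    fitted line is already at or above the threshold.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


if __name__ == "__main__":
    disk_pct = [50, 52, 54, 56, 58]  # hypothetical daily disk usage, in percent
    print(days_until(disk_pct, threshold=80))
```

The returned number of days is the lead time available to scale the cluster. A warning could be raised whenever it drops below the time a scale-up is expected to take.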
