20_08 How Netflix Debugs And Fixes Apache Cassandra When it Breaks

A detailed look, across problems such as timeouts, drive failures, and high latency, at how Netflix debugs and fixes Apache Cassandra when it breaks.

1.How Netflix Debugs and Fixes Apache Cassandra … when it breaks Joey Lynch

2.Speaker: Joey Lynch, Senior Software Engineer, Cloud Data Engineering at Netflix. Distributed system addict and data wrangler.

3.Debugging Methodology: Debugging is the practice of observing a system, building a mental model of the system, and then rapidly testing your newly formed mental model. Typically while that system is on fire 🔥

4.Building Mental Models

5.Distributed Woes: Distributed systems fail in especially fun ways

6.“We see a large number of timeouts from Cassandra” -Many a developer

7.Initial Theories
- No actual problem
- Degraded replica
- Retry storm
- Load shift
Action Plan

8.Initial Theories
- No actual problem
- Degraded replica
- Retry storm
- Load shift
Action Plan
- Observe metrics
- Read client logs, look for clues of timeouts
- Read server logs, look for errors

9.

10.

11.How many errors are there?
$ grep Exception client.log | wc -l
1213 # not good

12.Which errors precisely are happening?
$ grep Exception client.log | grep "Caused by" | cut -f 2 -d ':' | tail
com.datastax.driver.core.exceptions.OperationTimedOutException
com.datastax.driver.core.exceptions.OperationTimedOutException

13.More context
$ grep OperationTimedOutException client.log -C 5
OperationTimedOutException: [/IP:9042] Timed out waiting for server response
...

14.What is the error distribution?
$ grep Exception client.log | grep "Caused by" | cut -f 2 -d ':' | sort | uniq -c
1211 com.datastax.driver...OperationTimedOutException
2 com.netflix...Exception

15.$ grep <time> server.log | egrep "WARN|ERROR" | less
# … quickly skim … nothing interesting

16.Initial Theories and Reason
- No actual problem: Latency and Exceptions
- Degraded replica: Consistent latencies
- Retry storm: Likely
- Load shift: Plausible

17.What do we know?
- There is a problem
- Spike in C* traffic
- Client side timeouts
- No server timeouts or irregularities
Action Plan
- Retry storm appears most likely, need to determine client timeout value.

18.
1. Timeout catalogs are convenient
2. 40ms is very low for a timeout
3. Datastax driver retries by default … a lot of times
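To see why a low timeout combined with default retries can turn into a retry storm, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each attempt times out independently with a fixed probability; in a real incident timeouts correlate with load, so actual amplification can be worse. The function name and the retry count of 3 are hypothetical and are not the driver's documented defaults.

```python
def expected_attempts(timeout_fraction: float, max_retries: int) -> float:
    """Expected attempts sent to the server per client request.

    Simplifying assumption: every attempt times out independently with
    probability `timeout_fraction`, and the client retries a timed-out
    attempt up to `max_retries` times.
    """
    if not 0.0 <= timeout_fraction <= 1.0:
        raise ValueError("timeout_fraction must be in [0, 1]")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k+1 is only sent if the first k attempts all timed out,
    # which happens with probability timeout_fraction ** k.
    return sum(timeout_fraction ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Hypothetical inputs: the worse the timeout rate, the more load
    # the retries add on top of the original traffic.
    for fraction in (0.01, 0.5, 1.0):
        print(fraction, expected_attempts(fraction, max_retries=3))
```

When every attempt times out, the server sees `max_retries + 1` times the original traffic, which matches the pattern of a traffic spike with client-side timeouts and no server-side errors.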

19.

20.Proposed Experiment
- Turn off retries
- Increase timeout to C* SLO of 100ms
Reasoning
- Retries are almost always wrong (except UnavailableExceptions)
- Concurrency limited speculations are acceptable, but not commonly implemented.

21.

22.Automate: Stop retrying
- Turn off the retries!! 😀
  ◆ Fail fast instead! Inflict chaos! 😀

23.Automate: Stop retrying
- Still want to retry? 😢
  ◆ Concurrency limited p95 speculation instead?

24.Automate: Stop retrying. Or raise the timeout
- Too hard? 😢
  ◆ Can you use exponential backoff?
  ◆ CASSANDRA-15013+ will fix this server side ...
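For the exponential backoff option, a minimal sketch of one common variant ("full jitter") could look like the following. The base delay, cap, and function name are illustrative assumptions and do not come from the talk or from any driver API.

```python
import random


def backoff_delay_ms(attempt: int,
                     base_ms: float = 10.0,
                     cap_ms: float = 1000.0,
                     rng=random) -> float:
    """Delay to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt up to `cap_ms`; the actual
    delay is drawn uniformly below the ceiling so that many clients
    retrying at once do not all hit the server at the same instant.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Hypothetical usage: print the delay chosen for the first few retries.
    for attempt in range(5):
        print(attempt, round(backoff_delay_ms(attempt), 1))
```

Backoff spreads retries out over time but still adds load, which is why the slides list it after turning retries off entirely.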

25.Concurrency limits prevent cascading failures
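One way to picture a client-side concurrency limit is a fixed number of in-flight slots, with requests beyond the limit rejected immediately instead of queued. The sketch below is an assumption-laden illustration: `ConcurrencyLimiter` and `LimitExceeded` are hypothetical names, and real limiters are often adaptive rather than fixed.

```python
import threading
from contextlib import contextmanager


class LimitExceeded(Exception):
    """Raised when no in-flight slot is available."""


class ConcurrencyLimiter:
    """Caps the number of requests in flight and fails fast beyond the cap."""

    def __init__(self, max_in_flight: int):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject instead of queueing, so a slow
        # backend cannot pile up unbounded work in the client.
        if not self._slots.acquire(blocking=False):
            raise LimitExceeded("too many requests in flight")
        try:
            yield
        finally:
            self._slots.release()


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_in_flight=2)
    with limiter.slot():
        print("request admitted")
```

Because excess requests are shed at the client, a slowdown in one replica shows up as fast, bounded errors rather than as a growing backlog that spreads to other services.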

26.What if it wasn’t that easy?

27.

28.Ways Replicas Break: Cassandra replicas degrade for many reasons

29.Lay of Land Tools: Getting a quick glimpse