17/08 - why cassandra

Cassandra was designed to fall in the “AP” corner of the CAP theorem, which states that a distributed system can guarantee at most two of the following three properties at the same time: Consistency, Availability and Partition Tolerance. This makes Cassandra a good fit for solutions that need a highly available distributed database, one that keeps serving requests when nodes in the cluster are offline or cannot reach each other, which is common in distributed systems.

1. tyfs.rocks tayfun.sevimli 26.07.2017 tyfs.rocks 1

2. The History of Cassandra

3. Where is Cassandra?

4. Cassandra Architecture – CAP Theorem
Cassandra was designed to fall in the “AP” corner of the CAP theorem, which states that a distributed system can guarantee at most two of the following three properties at the same time: Consistency, Availability and Partition Tolerance. This makes Cassandra a good fit for solutions that need a highly available distributed database, one that keeps serving requests when nodes in the cluster are offline or cannot reach each other, which is common in distributed systems.

5. Cassandra Architecture – Data Model
Cassandra is classified as a column-oriented (wide-column) database: its basic storage structure is a set of columns, each made up of a column key and a column value. Every row is identified by a unique key, called the partition key, which is a byte string of up to 64 KB. Rows and their columns are grouped into column families, which are similar to tables in a relational database.

6. Cassandra Architecture – Data Model
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
• A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, row keys and column keys can be used for efficient lookups and range scans.
• The number of column keys is unbounded, which means you can have wide rows.
• A column key can itself carry the data, which means you can have a valueless column.
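The nested-map model on this slide can be illustrated with a small in-memory sketch. This is not Cassandra code: the class names `Row` and `ColumnFamily` are hypothetical, and the outer map is a plain dict for brevity, since only the sorted inner map is needed to show column lookups and range scans.

```python
from bisect import bisect_left, bisect_right


class Row:
    """One row: column keys are kept in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of column keys
        self._values = {}  # column key -> column value (None = valueless column)

    def put(self, column_key, value=None):
        if column_key not in self._values:
            # Insert the new key at the position that keeps the list sorted.
            self._keys.insert(bisect_left(self._keys, column_key), column_key)
        self._values[column_key] = value

    def get(self, column_key):
        return self._values[column_key]

    def slice(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


class ColumnFamily:
    """Maps row keys to rows (hypothetical stand-in for a column family)."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column_key, value=None):
        self._rows.setdefault(row_key, Row()).put(column_key, value)

    def row(self, row_key):
        return self._rows[row_key]


readings = ColumnFamily()
readings.put("sensor-1", "2017-07-26T10:02", 21.9)
readings.put("sensor-1", "2017-07-26T10:00", 21.5)
readings.put("sensor-1", "2017-07-26T10:01", 21.7)
readings.put("sensor-1", "offline")  # valueless column: the key is the data

# Range scan over column keys within one row.
window = readings.row("sensor-1").slice("2017-07-26T10:00", "2017-07-26T10:01")
```

A wide row here is simply a `Row` with many column keys; the range scan stays cheap because it is two binary searches plus a contiguous slice.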

7. Cassandra Architecture – Write Path
Cassandra Write Path
• Every node first writes the mutation to the commit log and then writes it to the memtable.
• Writing to the commit log ensures durability of the write, because the memtable is an in-memory structure that reaches disk only when it is flushed. A memtable is flushed to disk when:
  - it reaches its maximum allocated size in memory
  - the number of minutes a memtable can stay in memory elapses
  - it is manually flushed by a user
• A memtable is flushed to an immutable structure called an SSTable (Sorted String Table). The commit log is used for playback purposes in case data from the memtable is lost due to node failure.
• Every SSTable consists of several files on disk, including a bloom filter, a key index and a data file.
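The ordering of the write path (commit log first, then memtable, then flush to an immutable SSTable) can be sketched with a toy model. Everything here is an assumption made for illustration: the `Node` class, the entry-count flush threshold standing in for the memory-size limit, and the in-memory lists standing in for files on disk. The time-based flush trigger is left out.

```python
class Node:
    """Toy model of one node's write path (not Cassandra's implementation)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only; stands in for the on-disk commit log
        self.memtable = {}     # in-memory: key -> (timestamp, value)
        self.sstables = []     # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical size threshold

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # 1. durability first
        self.memtable[key] = (timestamp, value)          # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:    # size-based trigger
            self.flush()

    def flush(self):
        """Also callable directly, like a manual flush by a user."""
        if not self.memtable:
            return
        # An SSTable is sorted by key and never modified after creation.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs to be replayed

    def crash_and_recover(self):
        """Lose the in-memory state, then rebuild it by replaying the log."""
        self.memtable = {}
        for key, value, timestamp in self.commit_log:
            self.memtable[key] = (timestamp, value)


node = Node(memtable_limit=2)
node.write("user:1", "alice", timestamp=1)
node.crash_and_recover()                    # memtable is rebuilt from the log
node.write("user:2", "bob", timestamp=2)    # reaches the limit and flushes
```

After the second write the memtable is empty, the commit log is cleared, and `node.sstables` holds one sorted, immutable table.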

8. Cassandra Architecture – Read Path
Cassandra Read Path
• Every column family stores its data in a number of SSTables, so the data for a particular row can be spread across several SSTables and the memtable. For every read request Cassandra therefore needs to read from all applicable SSTables (potentially all SSTables of the column family) and scan the memtable for applicable data fragments. This data is then merged and returned to the coordinator.
• If the contacted replicas hold different versions of the data, the coordinator returns the latest version to the client and issues a read repair command to the node or nodes with the older version. The read repair operation pushes the newer version of the data to those nodes.
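The two steps of the read path, merging fragments on a replica and repairing stale replicas from the coordinator, can be sketched as follows. This is a simplified model under stated assumptions: each stored value is a hypothetical `(timestamp, value)` tuple, the newest timestamp wins, SSTables and replicas are plain dicts, and the function names are invented for the example.

```python
def read_local(key, memtable, sstables):
    """Merge all fragments of one key on a replica; newest timestamp wins."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:          # every applicable SSTable is consulted
        if key in table:
            candidates.append(table[key])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])


def coordinator_read(key, replicas):
    """replicas maps a replica name to its merged data: key -> (timestamp, value).

    Returns the newest value and pushes it to replicas holding an older
    or missing version (read repair).
    """
    responses = {name: data.get(key) for name, data in replicas.items()}
    found = [r for r in responses.values() if r is not None]
    if not found:
        return None
    newest = max(found, key=lambda r: r[0])
    for name, response in responses.items():
        if response != newest:
            replicas[name][key] = newest   # read repair
    return newest[1]


merged = read_local(
    "user:1",
    memtable={"user:1": (5, "alice@new.example")},
    sstables=[{"user:1": (1, "alice@old.example")}, {"user:2": (3, "bob")}],
)

cluster = {
    "n1": {"user:1": (5, "alice@new.example")},
    "n2": {"user:1": (1, "alice@old.example")},
    "n3": {},
}
value = coordinator_read("user:1", cluster)
```

After the call, all three entries in `cluster` hold the version with timestamp 5.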

9. Cassandra Architecture – Cluster Topology
Cluster Concepts
• a node is a Cassandra instance (in production: one node per machine)
• a partition is one ordered and replicable unit of data on a node
• a rack is a logical set of nodes
• a data center is a logical set of racks
• a cluster is the full set of nodes which map to a single complete token ring
• nodes communicate peer-to-peer using a gossip protocol
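How a partition key maps onto a token ring can be sketched with a simplified model. The assumptions are significant: MD5 is used only as a stand-in hash, each node gets a single token derived from its name (real clusters assign tokens through configuration), and replicas are simply the next nodes clockwise, ignoring racks and data centers. The `Ring` class and `token` function are hypothetical names.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 used as a stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    """Simplified token ring with one token per node."""

    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key, rf):
        """Return the rf nodes that hold the key, walking the ring clockwise."""
        # First node whose token is >= the key's token; wrap past the end.
        start = bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        count = min(rf, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", rf=3)
```

Every node can compute `owners` locally from the same ring description, which is why no master is needed to route a request.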

10. Cassandra Architecture – Data Consistency
Tunable Data Consistency
The consistency level (CL) sets how many replica nodes must acknowledge a read or write request.
• choose anywhere between STRONG and EVENTUAL consistency
• possible CLs: ANY (writes only), ONE, QUORUM (floor(RF/2) + 1, where RF is the replication factor), ALL
• tunable per request
• multi-datacenter support
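The quorum arithmetic can be worked through in a few lines. The sketch below assumes the usual rule that reads are strongly consistent when the read and write acknowledgement counts overlap, that is when R + W > RF. The function names are hypothetical, and ANY is left out because a stored hint can satisfy it without any replica acknowledging.

```python
def quorum(rf: int) -> int:
    """QUORUM = floor(RF / 2) + 1."""
    return rf // 2 + 1


def required_acks(level: str, rf: int) -> int:
    """Number of replicas that must acknowledge at a given consistency level."""
    levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
    return levels[level]


def is_strongly_consistent(write_level: str, read_level: str, rf: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return required_acks(write_level, rf) + required_acks(read_level, rf) > rf


# With RF = 3, QUORUM needs 2 replicas, so QUORUM writes plus QUORUM reads
# always share at least one replica that has the latest write.
quorum_pair = is_strongly_consistent("QUORUM", "QUORUM", rf=3)
one_pair = is_strongly_consistent("ONE", "ONE", rf=3)
```

Here `quorum_pair` is `True` (2 + 2 > 3) and `one_pair` is `False` (1 + 1 is not greater than 3), which is the eventual-consistency end of the range.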

11. Cassandra Architecture – CQL Language
Cassandra Query Language
• syntax very similar to RDBMS SQL
• create objects via DDL
• core DML commands INSERT, UPDATE and DELETE are supported
• query data with SELECT commands

12. Cassandra Architecture – Security
Cassandra Security Features
• Authentication based on internally controlled role names and passwords
• Authorization based on object permission management
• Authentication and authorization based on JMX usernames and passwords
• SSL encryption

13. Why Cassandra?
• Scales linearly with massive writes
  - Cassandra can handle very large amounts of data. It is therefore popular with companies that provide mobile phone and messaging services, which generate huge volumes of data.
• Highly Fault Tolerant
  - Masterless cluster with no single point of failure. In simple terms, users should not notice if a server, an entire rack of servers, or even an entire data center fails. There is also the potential for zero-downtime rolling upgrades.
• Easy Replication / Data Distribution
• Homogeneous Environment
  - No master-slave or sharding setup; all nodes in the ring are equal.
• Ease of Administration
  - Masterless, fault-tolerant, supports temporary loss of nodes with minimal impact on production performance.
• Wide Community

14. Use Cases of Cassandra
• Messaging & Event Sourcing
  - Cassandra can handle very large amounts of data. It is therefore popular with companies that provide mobile phone and messaging services, which generate huge volumes of data.
• IoT & High Speed Applications
  - Cassandra can ingest high-velocity data, which makes it a good fit for applications where data arrives at very high speed from many devices or sensors.
• Product Catalogs and Retail Apps
  - Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.
• Social Media Analytics & Recommendations
  - Many online companies and social media providers use Cassandra for analysis and for recommendations to their customers.

15. Cassandra for Akka Persistence
• Linear scalability
  - Expected Massive Load
• No SPOF
  - Fault-tolerant, Resilient
• Always-On Multi-Data Center
  - Data Distribution & Replication
  - Cluster over Multi-Data Centers
• Akka Persistence
  - CQRS with Event-Sourcing
  - Akka’s supported up to date plugin (Lightbend)
• Akka Streams
  - Batch Processing over Streaming

16. Cassandra Benchmarks
University of Toronto, NoSQL Database Performance Benchmarks, 2012
[Charts: write latency for workload read/write; read latency for workload read/write; throughput for workload read/scan/write; throughput for workload read/write]

17. Cassandra Benchmarks
Netflix, Benchmarking Cassandra Scalability on AWS, 2011

18. Cassandra Benchmarks
EndPoint database and open source consulting company, 2014

19. Cassandra Benchmarks
EndPoint database and open source consulting company, 2014

20. Resources
• Apache Cassandra Web Site
• Planet Cassandra Community
• DataStax Web Site
• The Distributed Architecture Behind Apache Cassandra, Bruno Tinoco
• Introduction to Apache Cassandra's Architecture, Akhil Mehra
• An Overview of Apache Cassandra, DataStax
• NoSQL Performance Benchmarks, DataStax
• Top 10 Reasons to Use Cassandra, Michael Colby
• Security in Cassandra, IBM Developer Works