17/01 Presentation of Apache Cassandra

1. Introduction to NoSQL systems, Extensible Record Stores, Amazon’s Dynamo and Google Bigtable 2. What Cassandra is and how it compares with other similar systems 3. Which applications it supports best: examples and case studies 4. Technical description, architecture, internals 5. How it is used and installed, its requirements, and the platforms it runs on 6. Demo 7. References

1.Apache Cassandra

2.Contents 1. Introduction to NoSQL systems, Extensible Record Stores, Amazon’s Dynamo and Google Bigtable 2. What Cassandra is and how it compares with other similar systems 3. Which applications it supports best: examples and case studies 4. Technical description, architecture, internals 5. How it is used and installed, its requirements, and the platforms it runs on 6. Demo 7. References

3.1. Background NoSQL, Extensible Record Stores, Cassandra’s Parents

4.NoSQL Systems NoSQL or Not-Only-SQL systems: Next Generation Databases. The movement started in 2009 with the goal of creating modern, web-scale databases. Currently, there are more than 225 NoSQL systems. In general, they share the following features: • Schema-free databases • BASE (instead of ACID) • Easy replication support • Huge amounts of data • Simple API • Horizontally scalable • Distributed • Open source

5.Extensible Record Stores (or Wide Column Stores) • Motivated by Google’s Bigtable. • Basic data model: rows and columns. • Basic scalability model: rows and columns are split across nodes. • Rows: split across nodes through sharding on the primary key. • Columns: distributed over multiple nodes by using ‘column groups’. • Other systems that use this technology: Hypertable, HBase.
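The two splitting rules on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the node names, the column groups, and the helpers `node_for_row` and `split_by_column_group` are hypothetical, and plain modulo hashing stands in for whatever placement scheme a real system uses.

```python
import hashlib

# Hypothetical cluster and column groups, for illustration only.
NODES = ["node-a", "node-b", "node-c"]
COLUMN_GROUPS = {
    "profile": {"name", "email"},
    "activity": {"last_login", "clicks"},
}

def node_for_row(row_key: str, nodes=NODES) -> str:
    """Shard a row by hashing its primary key (simplified modulo placement)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def split_by_column_group(row: dict) -> dict:
    """Split one row's columns into the column groups they belong to."""
    parts = {group: {} for group in COLUMN_GROUPS}
    for column, value in row.items():
        for group, columns in COLUMN_GROUPS.items():
            if column in columns:
                parts[group][column] = value
    return parts

row = {"name": "Ada", "email": "ada@example.org", "clicks": 7}
owner = node_for_row("user:42")
parts = split_by_column_group(row)
```

Each column group could then be stored on a different node, while `node_for_row` decides which shard a given primary key lands in.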

6.Cassandra’s Parents - Amazon Dynamo What is it? A highly available and scalable key-value storage system used by Amazon for user shopping carts and other core services. It pioneered the idea of eventual consistency. How does it work? It allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms. It sacrifices consistency for availability and allows customization to meet the desired trade-off. Key techniques: Consistent Hashing, Vector Clocks (not in Cassandra), Gossip Protocol, Hinted Handoff, Read Repair
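Consistent hashing, the first technique listed above, can be illustrated with a minimal sketch. The `HashRing` class, its `vnodes` parameter, and the node names are assumptions made for this example; it is not Dynamo's or Cassandra's actual implementation. Nodes are placed on a ring at hashed positions, and a key is stored on the first distinct nodes found walking clockwise from the key's own position.

```python
import bisect
import hashlib

def _token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy consistent-hashing ring (hypothetical helper, not a real API)."""

    def __init__(self, nodes, vnodes=8):
        # Each node gets several positions ("virtual nodes") on the ring.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key: str, n: int = 3):
        """Walk clockwise from the key's token, collecting n distinct nodes."""
        start = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found

ring = HashRing(["a", "b", "c", "d"])
owners = ring.replicas("cart:1001", n=3)
```

The useful property is that adding a node only moves the keys that now fall on the new node's ring positions; every other key keeps its previous primary owner.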

7.Cassandra’s Parents - Google Bigtable What is it? A high-performance data storage system built on Google File System and other Google technologies. How does it work? It provides both structure and data distribution but relies on a distributed file system for durability. Richer data model than Dynamo: one key, many values. Fast sequential access. Key techniques: Columnar, SSTable Storage, Append-only, Memtable, Compaction
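How memtables, SSTables and compaction fit together can be shown with a toy in-memory model. The `TinyLSM` class and its `memtable_limit` parameter are hypothetical names for this sketch, and it deliberately leaves out the commit log, on-disk files and deletes that a real storage engine needs.

```python
class TinyLSM:
    """Toy write path: memtable -> immutable sorted tables -> compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}            # recent writes, mutable, in memory
        self.sstables = []            # oldest first; each is a sorted tuple of pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable as a new sorted, append-only table."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all tables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:   # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []

db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
```

Writes never modify an existing table; an overwritten value simply becomes unreachable and is dropped when `compact` merges the tables.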

8.Cassandra’s Parents What features does Cassandra use from Google’s Bigtable? 1. Column Families 2. Memtables 3. SSTables What features does Cassandra use from Amazon Dynamo? 1. Consistent hashing 2. Partitioning 3. Replication

9.Cassandra and Parents

10.2. Description and Comparisons What Cassandra is and how it is compared with other similar systems

11.Avinash Lakshman • Inventor, Apache Cassandra • Co-inventor, Amazon Dynamo

12.Prashant Malik • Inventor, Apache Cassandra • Technical Leader, Facebook

13.

14.What is Cassandra?

15.Definition • A distributed NoSQL database system for managing large amounts of structured data across many commodity servers, providing a highly available service with no single point of failure.

16.Timeline with activities • July 2008: Facebook released Cassandra as an open-source project • March 2009: Cassandra became an Apache Incubator project • 17 February 2010: Cassandra graduated to a top-level project • 2012: University of Toronto researchers studying NoSQL systems concluded that “In terms of scalability, there is a clear winner throughout our experiments” • 2010-2015: New releases of Cassandra

17.Strengths • Linear scale performance: The ability to add nodes without failures leads to predictable increases in performance • Supports multiple languages: Python, C#/.NET, C++, Ruby, Java, Go, and many more • Operational and developmental simplicity: There are no complex software tiers to be managed, so administration duties are greatly simplified • Ability to deploy across data centres: Cassandra can be deployed across multiple, geographically dispersed data centres

18.Strengths (1) • Cloud availability: Can be installed in cloud environments • Peer-to-peer architecture: Cassandra follows a peer-to-peer architecture instead of a master-slave architecture • Flexible data model: Supports modern data types with fast writes and reads • Fault tolerance: Nodes that fail can easily be restored or replaced • High performance: Cassandra has demonstrated strong performance on large data sets

19.Strengths (2) • ColumnFamily store: Cassandra stores columns sorted by column name, which makes slicing very quick • Tunable consistency: Support for strong or eventual data consistency across a widely distributed cluster • Schema-free/schema-less: In Cassandra, columns can be created at will within rows. The Cassandra data model is also known as a schema-optional data model • AP-CAP: Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered more important than consistency
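Tunable consistency is often explained with the rule that reads see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below only does that arithmetic; the helper names `quorum` and `is_strongly_consistent` are hypothetical and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n

N = 3
Q = quorum(N)                                            # majority of 3 replicas
quorum_both = is_strongly_consistent(N, w=Q, r=Q)        # quorum writes and reads
one_both = is_strongly_consistent(N, w=1, r=1)           # single-replica writes and reads
write_all_read_one = is_strongly_consistent(N, w=N, r=1)
```

With three replicas, quorum reads plus quorum writes overlap in at least one replica, while writing and reading a single replica each trades that guarantee for lower latency and higher availability.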

20.CAP and Cassandra

21.Variable number of columns per row
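A row with a variable number of columns can be pictured as a map from row key to a list of columns kept sorted by name. The following minimal sketch assumes that picture; the `ColumnFamily` class and its methods are hypothetical and only illustrate why name-ordered columns make slicing cheap.

```python
import bisect

class ColumnFamily:
    """Toy column family: each row holds its own set of name-sorted columns."""

    def __init__(self):
        self._rows = {}  # row key -> list of (column name, value), sorted by name

    def insert(self, row_key, column, value):
        cols = self._rows.setdefault(row_key, [])
        names = [name for name, _ in cols]
        i = bisect.bisect_left(names, column)
        if i < len(cols) and cols[i][0] == column:
            cols[i] = (column, value)       # overwrite an existing column
        else:
            cols.insert(i, (column, value))  # keep columns ordered by name

    def slice(self, row_key, start, end):
        """Return the columns whose names fall in [start, end]."""
        cols = self._rows.get(row_key, [])
        names = [name for name, _ in cols]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return cols[lo:hi]

users = ColumnFamily()
users.insert("alice", "email", "alice@example.org")
users.insert("alice", "age", 30)
users.insert("bob", "email", "bob@example.org")  # bob has fewer columns than alice
```

Here `alice` has two columns and `bob` has one, and a slice over a name range is a binary search followed by a contiguous read.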

22.Weaknesses Use cases where it is better to avoid Cassandra • If too many joins are required to retrieve the data • To store configuration data • During compaction, operations slow down and throughput degrades • Basic features such as aggregation operators are not supported • Range queries on the partition key are not supported • If there is transactional data that requires 100% consistency • Cassandra can update and delete data, but it is not designed for workloads dominated by updates and deletes

23.Business Insider “The basic problem Cassandra solved is that when you have a lot of data sitting on a lot of servers, as Facebook does, you end up with a house of cards. A single server going down can collapse the whole stack.”

24.Cassandra compared to other NoSQL Systems

25.Read & Write latency for workload Read/Write

26.Throughput for workload Read/Write & Read/Scan/Write

27.Insert-mostly Workload

28.Mixed Operational & Analytical Workload

29.Read-Modify-Write Workload