15:06 - Apache cassandra and spark. you got the the lighter, let

15/06 - Apache cassandra and spark. you got the the lighter, let's start the fire

1. Apache Cassandra and Spark You got the lighter, let’s spark the fire Patrick McFadin
 Chief Evangelist for Apache Cassandra @PatrickMcFadin ©2013 DataStax Confidential. Do not distribute without consent. 1

2.6 years. How’s it going?

3.Cassandra 3.0 & 3.1 Spring and Fall

4.Cassandra is… • Shared nothing • Masterless peer-to-peer • Great scaling story • Resilient to failure

5.Cassandra for Applications APACHE CASSANDRA

6. A Data Ocean , Lake or Pond. An In-Memory Database A Key-Value Store A magical database unicorn that farts rainbows

7.Apache Spark

8.Apache Spark • 10x faster on disk,100x faster in memory than Hadoop MR • Works out of the box on EMR • Fault Tolerant Distributed Datasets Up to 100× faster • Batch, iterative and streaming analysis (2-10× on disk) • In Memory Storage and Disk • Integrates with Most File and Storage Options 2-5× less code

9.Spark Components Spark Spark SQL MLlib GraphX Streaming structured machine learning graph real-time Spark Core


11.org.apache.spark.rdd.RDD Resilient Distributed Dataset (RDD) •Created through transformations on data (map,filter..) or other RDDs •Immutable •Partitioned •Reusable

12.RDD Operations •Transformations - Similar to scala collections API •Produce new RDDs • filter, flatmap, map, distinct, groupBy, union, zip, reduceByKey, subtract •Actions •Require materialization of the records to generate a value • collect: Array[T], count, fold, reduce..

13.RDD Operations Analytic Transformation Action Analytic Search

14.Cassandra and Spark

15.Cassandra & Spark: A Great Combo •Both are Easy to Use •Spark Can Help You Bridge Your Hadoop and Cassandra Systems •Use Spark Libraries, Caching on-top of Cassandra-stored Data •Combine Spark Streaming with Cassandra Storage Datastax: spark-cassandra-connector:

16.Spark On Cassandra •Server-Side filters (where clauses) •Cross-table operations (JOIN, UNION, etc.) •Data locality-aware (speed) •Data transformation, aggregation, etc. •Natural Time Series Integration

17.Apache Spark and Cassandra Open Source Stack Cassandra

18.Spark Cassandra Connector 18

19.Spark Cassandra Connector *Cassandra tables exposed as Spark RDDs *Read from and write to Cassandra *Mapping of C* tables and rows to Scala objects *All Cassandra types supported and converted to Scala types *Server side data selection *Virtual Nodes support *Use with Scala or Java *Compatible with, Spark 1.1.0, Cassandra 2.1 & 2.0

20.Type Mapping CQL Type Scala Type ascii String bigint Long boolean Boolean counter Long decimal BigDecimal, java.math.BigDecimal double Double float Float inet java.net.InetAddress int Int list Vector, List, Iterable, Seq, IndexedSeq, java.util.List map Map, TreeMap, java.util.HashMap set Set, TreeSet, java.util.HashSet text, varchar String timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime timeuuid java.util.UUID uuid java.util.UUID varint BigInt, java.math.BigInteger *nullable values Option

21.Connecting to Cassandra // Import Cassandra-specific functions on SparkContext and RDD objects import com.datastax.driver.spark._ // Spark connection options val conf = new SparkConf(true) .setMaster("spark://") .setAppName("cassandra-demo") .set("cassandra.connection.host", "") // initial contact .set("cassandra.username", "cassandra") .set("cassandra.password", "cassandra") val sc = new SparkContext(conf)

22.Accessing Data CREATE TABLE test.words (word text PRIMARY KEY, count int); INSERT INTO test.words (word, count) VALUES ('bar', 30); INSERT INTO test.words (word, count) VALUES ('foo', 20); * Accessing table above as RDD: // Use table as RDD val rdd = sc.cassandraTable("test", "words") // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0] rdd.toArray.foreach(println) // CassandraRow[word: bar, count: 30] // CassandraRow[word: foo, count: 20] rdd.columnNames // Stream(word, count) rdd.size // 2 val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30] firstRow.getInt("count") // Int = 30

23.Saving Data val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50))) // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2] newRdd.saveToCassandra("test", "words", Seq("word", "count")) * RDD above saved to Cassandra: SELECT * FROM test.words; word | count ------+------- bar | 30 foo | 20 cat | 40 fox | 50 (4 rows)

24.Spark Cassandra Connector https://github.com/datastax/spark-­‐cassandra-­‐connector Cassandra Spark RDD[CassandraRow] Keyspace Table RDD[Tuples] Bundled  and  Supported  with  DSE  4.5!

25.Spark Cassandra Connector Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* Spark Each Executor Maintains Executor Spark DataStax C* a connection to the C* Java Driver Cluster Tokens … Tokens 1001 -2000 Full Token RDD’s read into different Range Tokens 1-1000 splits based on sets of tokens

26. Co-locate Spark and C* for Best Performance Spark Worker Running Spark Workers on C* Spark Spark
 the same nodes as your Master Worker C* Cluster will save C* Spark
 C* network hops when Worker reading and writing C*

27.Analytics Workload Isolation Mixed Load Cassandra Cluster Online Cassandra Cassandra Analytical App Only DC + Spark DC App

28.Data Locality

29.Example 1: Weather Station • Weather station collects data • Cassandra stores in sequence • Application reads in sequence

由Apache Cassandra PMC & Committers发起。致力于发布与传播Apache Cassandra技术,生态,最佳实践,前沿信息。