16/07 - An Effective Approach to Migrate Thrift to CQL Cassandra

1 Problem Overview 2 Background Discussion 3 Thrift to CQL Transition 4 C* Storage Engine (pre 3.0) 5 Data Migration 6 Conclusion

1.Yabin Meng An Effective Approach to Migrate Thrift to CQL

2.About Pythian • Founded in 1997, a global leader in Enterprise IT transformation and operational excellence •www.pythian.com •https://www.linkedin.com/company/pythian • Love your Data •6000+ database under management •Over 400 DBA’s, in 35 countries •Top 5% of DBA workforce – 9 Oracle ACE’s, 2 Microsoft MVP’s, 1 Cassandra MVP •Partners with Amazon, Cloudera, Oracle, Microsoft, DataStax, and a lot more • Specialities: Databases, Cloud, DevOps, Big Data Infrastructure, Advanced Analytics, Infrastructure Management • Industry Recognition •DBTA1 100 2016 - The Companies That Matter Most in Data •IAOP2 - GLOBAL OUTSOURCING 100® RISING STAR FOR 2016 •Top 25 Canadian ICT3 Professional Services Companies •… … DBTA : Database Trend and Applications (http://www.dbta.com/) 1 IAOP2: International Association of Outsourcing Professionals (https://www.iaop.org/) ICT3: Information & Communication Technology © Pythian, All Rights Reserved. 2

3.About Myself • 15+ years IT experience - relational database/data warehousing, business intelligence and analytics (BIA), NoSQL database/Big Data • Started my career as a RDBMS client application developer (PowerBuilder, Java) • 5 years as IBM DB2 Kernel Developer, New Development (DB2 v9.5, 9.7, 9.8, 10) • Multi-years data warehousing and BIA experience in insurance and capital market sectors as technical lead/architect (IBM/Cognos, SAS/SAS 9.x, Information Builders/WebFocus) • DataStax Certified Developer, Administrator, and Partner Architect • https://www.linkedin.com/in/yabinmeng • @yabinmeng © Pythian, All Rights Reserved. 3

4. 1 Problem Overview 2 Background Discussion 3 Thrift to CQL Transition 4 C* Storage Engine (pre 3.0) 5 Data Migration 6 Conclusion © Pythian, All Rights Reserved. 4

5.Problem Overview • C* cluster upgrade from version 1.2 to 2.1 • Thrift based application (Hector) •Plan to convert to CQL, but not done yet • Data copied from one C* cluster to another •C* COPY command • Application failed in DEV/TEST •Cannot retrieve some required data •cqlsh output shows NO difference • Schema • Data query result © Pythian, All Rights Reserved. 5

6.Background Disccsion, COPY commands • COPY commands •Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> WITH options •Since 1.1.3 (PostgresSQL COPY commands) •Simple yet with limitation. Continuous improvement •Small datasets (max. a few million rows) •CSV format •No primary key check •New options and better performance in cqlsh copy – 2.1.13, 2.2.5, 3.0.3, 3.2 – ex (common): NUMPROCESSES, CONFIGFILE – ex (TO): BEGINTOKEN / ENDTOKEN – ex (FROM): INGESTRATE (rows/second) •Predefined schema © Pythian, All Rights Reserved. 6

7.Background Discussion, Schema-less? • Schema-less or Schema •Many NoSQL DBs is marked as “schema-less” • A lot of debate •C* is schema optional (Thrift) •C* is schema required (CQL) © Pythian, All Rights Reserved. 7

8.Background Discussion, C* Utilities •cassandra-cli utility SET users['key_val']['column1_name']='some value'; •Command line interface since very early years of C* SET users['key_val']['column2_name']='some value'; •Thrift-based counterpart of “cqlsh” ... •C* data engine format (pre 3.0) •Deprecated since C* 2.1 and will be removed in 3.0 LIST users; •Comparator and Validator GET users['key_val']['column1_name'] •Validator: row key, column value •Comparator: column name + column sorting order •sstable2json utility •Deprecated and will be removed in 3.0

9.Thrift to CQL Transition • What is Thrift and CQL? •Thrift is a general framework (interface definition language and binary communication protocol) • "scalable cross-language services development" •APIs to interact with C* •CQL is also a query language Access CQL query CQL query Abstraction N/A * language language Layer Protocol Thrift CQL Thrift Layer Binary Storage C* Data Engine Layer C* evolvement © Pythian, All Rights Reserved. 9

10.Thrift to CQL Transition, continued • Thrift in C* •Since the beginning as a communication protocol •Low level API •Deprecated • CQL in C* •0.8, CQL as a language •1.2, CQL as a language, version 3 (CQL3) •1.2, CQL as a communication protocol (CQL3 only) •Continuously evolving • Thrift vs. CQL as protocol •Thrift: RPC based; Synchronous •CQL: Server notification (event registration); Asynchronous • CQL as a language •Feature rich: UDF, lightweight transaction, JSON support, materialized view, … • Thrift vs. CQL as client driver •Thrift driver is deprecated and not supported © Pythian, All Rights Reserved. 10

11.Thrift to CQL Transition, continued • What is needed for the transition? •Application •Data Model •Data itself •Understanding C* storage engine * * Pre-3.0 C* © Pythian, All Rights Reserved. 11

12.C* Storage Engine (pre 3.0) • C* ColumnFamily (Thrift) / Table (CQL) data structure •Map of sorted map: Map<RowKey, SortedMap<ColumnKey, ColumnValue>> CREATE TABLE song_tags ( id uuid, tag_name text, PRIMARY KEY (id, tag_name) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 12

13.C* Storage Engine – Thrift vs. CQL Terminology • Table –Thrift: Column Family –CQL: Table • Row –Thrift: Storage Row / Partition –CQL: Logic Representation / Part of a partition • Column –Thrift: Cell (column key + column value) –CQL: Metadata / column key or column value © Pythian, All Rights Reserved. 13

14.C* Storage Engine – Compact Storage CREATE TABLE avg_grade2 ( student_id int, class_id int, grade double, PRIMARY KEY (student_id, class_id) ) WITH COMPACT STORAGE • Storage space vs. table metadata – Backward compatibility – Thrift: always compact – CQL: optional • Caveats: – No adding / dropping column – Single non-key column if compound primary key – Non-frozen collection type © Pythian, All Rights Reserved. 14

15.C* Storage Engine – Static Table Definition create table avg_grade ( Thrift / cassandra_cli key int primary key, create column family avg_grade class_id int, with key_validation_class = Int32Type grade float and comparator = UTF8Type ) with compact storage and column_metadata = [ {column_name: class_id, validation_class: Int32Type}, {column_name: grade, validation_class: FloatType} •Equivalent definition ] •Thrift: no name for key -- default name used in CQL: “key” © Pythian, All Rights Reserved. 15

16.C* Storage Engine – Dynamic Table Definition • Thrift only Thrift / cassandra_cli • No “column_metadata” section create column family sensor_data with key_validation_class = Int32Type CREATE TABLE sensor_data ( and comparator = TimeUUIDType key int, and default_validation_class = DecimalType; column1 timeuuid, value decimal, ------------------- PRIMARY KEY (key, column1) RowKey: 1 ) WITH COMPACT STORAGE => (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000) => (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000) ------------------- •Default “key” for key name RowKey: 2 •Default “column1” for column name/key => (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000) •Default “value” for column value •Alter Table © Pythian, All Rights Reserved. 16

17.C* Storage Engine – Mixed Table Definition Thrift / cassandra_cli create column family blogs CREATE TABLE blogs ( with key_validation_class = Int32Type key int PRIMARY KEY, and comparator = UTF8Type author text and column_metadata = [ ) WITH COMPACT STORAGE {column_name: author, validation_class: UTF8Type} ] key | author ------------------- -----+-------- RowKey: 1 1 | Donald => (name=author, value=Donald, timestamp=1462720696717000) 2 | John => (name=tags:category, value=music, timestamp=1462720526053000) => (name=tags:rating, value=A, timestamp=1462720564590000) => (name=tags:recommend, value=Yes, timestamp=1462720640817000) ------------------- RowKey: 2 => (name=author, value=John, timestamp=1462720690911000) © Pythian, All Rights Reserved. 17

18.Migrate Thrift Data – Overview • Do not fully trust C* schema •“show schema” / “describe keyspace” • Sample your data •“cassandra-cli”, yes •“cqlsh”, no • Review application design and code •Schema-less / schema -optional • Determine CQL schema • Data Migration Method © Pythian, All Rights Reserved. 18

19.Migrate Thrift Data – Determine CQL Schema Thrift / cassandra-cli CQL / cqlsh schema create column family some_tbl_name • Application specific with key_validation_class = BytesType • Could it be a little bit more generic? and comparator = UTF8Type and column_metadata = [ CREATE TABLE some_tbl_name ( {column_name: StaticColName1, validation_class: xxx1}, key blob, {column_name: StaticColName2, validation_class: xxx2} StaticColName1 yyy1, {column_name: StaticColName3, validation_class: xxx3} StaticColName2 yyy2, …… StaticColName3 yyy3, ] … …, DynamicColMapName map<text, blob>; StaticColNameSet = {StaticColName1, StaticColName2, StaticColName3, …} PRIMARY KEY (key) DynamicColNameSet = {DynamicColName1, DynmaicColName2, DyanmicColName3, …} ) © Pythian, All Rights Reserved. 19

20.Migrate Thrift Data - ETL Method • ETL •COPY command •Commercial product (jasper soft, pentaho, …) •Self developed • Natural Approach •Read every record from the source table and write it into the target table •Slow • C* load utility •sstableloader © Pythian, All Rights Reserved. 20

21.Migrate Thrift Data – Methods, continued writer = CQLSSTableWriter.build() • Read source data thrift_rows = thirft_conn.fetchRows() for each row in thrift_rows() { • Thrift API keyValue = row.getKey() dynColMap = new Map() • Writing SSTables Directly thirft_columns = row.getColumns() • SSTableSimpleUnsortedWriter for each column in thirft_columns { • CQLSSTableWriter colName = column.getName() if colName is in DynamicColNameSet { dynColMap.addKeyValuePair(column.name, column.value) } else { staticColVal1 = column.getValue(StaticColName1) staticColVal2 = column.getValue(StaticColName2) ……… }}} writer.addRow(keyValue, staticColVal1, staticColVal2, …, dynColMap) } writer.close()

22.Migrate Thrift Data - Result Fetching result from source table and generate SSTable ... $ sstableloader -d <node_ip> ------------------------- <sstable_root>/<keyspace_name>/<target_cql_table_name>/ > batch_run: 1; batch_size: 20001; batch_runtime: PT5.594S ……… > batch_run: 2; batch_size: 20000; batch_runtime: PT4.368S Summary statistics: > batch_run: 3; batch_size: 20000; batch_runtime: PT3.937S Connections per host: :1 > batch_run: 4; batch_size: 20000; batch_runtime: PT3.838S Total files transferred: : 15 > batch_run: 5; batch_size: 20000; batch_runtime: PT2.015S Total bytes transferred: : 511566477 ……… Total duration (ms): : 47451 > batch_run: 59; batch_size: 20000; batch_runtime: PT1.941S Average transfer rate (MB/s): : 10 > batch_run: 60; batch_size: 12581; batch_runtime: PT2.82S Peak transfer rate (MB/s): : 20 #### Total Records Processed: 1192582; Total Run Time: PT2M35.178S © Pythian, All Rights Reserved. 22

23.Conclusion • C* is moving from Thrift to CQL –Do it ! • C* schema may not tell you all • Effective C* data migration is not easy •C* does have “load” utility (framework) • C* storage engine changes a lot in 3.0 • Pythian Blog Space: •https://www.pythian.com/blog/ •An Effective Approach to Migrate Dynamic Thrift Data to CQL (Part 1, Part 2, Part 3) © DataStax, All Rights Reserved. 23

24.© DataStax, All Rights Reserved. 24