Supercharge Your Analytics with ClickHouse



1. Supercharge Your Analytics with ClickHouse Webinar September 14th, 2017 Vadim Tkachenko CTO, Percona Alexander Zaitsev CTO, Altinity 1 © 2017 Percona

2.Analytic database landscape 2 © 2017 Percona

3. Vertica RedShift Commercial solutions – fast and expensive Teradata • Etc The cost scales with your data 3 © 2017 Percona

4. InfiniDB (now MariaDB ColumnStore) Open Source: InfoBright somewhat slow, sometime buggy. GreenPlum (started as commerical) But free Hadoop systems Apache Spark 4 © 2017 Percona

5.ClickHouse – fast and free! OpenSourced in Jun 2016 5 © 2017 Percona

6.ClickHouse story - Russian search engine Yandex Metrika - Russian “Google Analytics” Interactive Ad Hoc reports at multiple petabytes • 30+ billions of events daily No commercial solution would be cost effective and no OpenSource solution to handle this scale. That’s how ClickHouse was born 6 © 2017 Percona

7.ClickHouse is extremely fast and scalable. "We had no choice, but make it fast" by ClickHouse developers 7 © 2017 Percona

8.Initial Requirements Capable of Data processing Fast. Really fast storing petabytes in real time of data Fault-tolerance in Flexible query terms of language datacenters 8 © 2017 Percona

9.Technical details Vectorized processing Massively Parallel Processing Shared nothing Column store with late materialization (like C-Store and Vertica): • Data compression • Column locality • No random reads (more in details, in Russian, 9 © 2017 Percona

10.Vectorized processing Data is represented as small single-dimensional arrays (vectors), easily accessible for CPUs. The percentage of instructions spent in interpretation logic is reduced by a factor equal to the vector- size The functions that perform work now typically process an array of values in a tight loop Tight loops can be optimized well by compilers, enable compilers to generate SIMD instructions automatically. Modern CPUs also do well on such loops, out-of-order execution in CPUs often takes multiple loop iterations into execution concurrently, exploiting the deeply pipelined resources of modern CPUs. It was shown that vectorized execution can improve data-intensive (OLAP) queries by a factor 50. 10 © 2017 Percona

11.Column-oriented * The image taken from 11 © 2017 Percona

12.Efficient execution SELECT Referer, count(*) AS count FROM hits WHERE CounterID = 1234 AND Date >= today() - 7 GROUP BY Referer ORDER BY count DESC LIMIT 10 (* example from Read only With index Vectorized needed columns: (CounterID, Date) Compression processing CounterID, - fast discard of Referer, Date unneeded blocks 12 © 2017 Percona

13.Single Server - MPP Use multiple CPU cores on the single server Real case: Apache log from the real web site – 1.56 billion records Query: SELECT extract(request_uri,'(w+)$') p,sum(bytes) sm,count(*) c FROM apachelog GROUP BY p ORDER by c DESC limit 100 Query is suited for parallel execution – most time spent in extract function 13 © 2017 Percona

14.Execution on single server 56 threads / 28 cores | Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz Query execution time With 1 thread allowed: 823.646 sec ~ 1.89 mln records/sec With 56 threads allowed: 23.587 sec ~ 66.14 mln records/sec Speedup: 34.9x times 14 © 2017 Percona

15. DATABASE PERFORMANCE Database Performance Matters MATTERS

16.Query 3 SELECT y, request_uri, cnt FROM ( SELECT access_date y, request_uri, count(*) AS cnt FROM apachelog GROUP BY y, request_uri ORDER BY y ASC ) ORDER BY y,cnt DESC LIMIT 1 BY y Less suitable for parallel execution – serialization to build a temporary table for internal subquery Speedup: 6.4x times 16 © 2017 Percona

17. DATABASE PERFORMANCE Database Performance Matters MATTERS

18.More details in the blog post: processing-clickhouse/ 18 © 2017 Percona

19.Data distribution If a single server is not enough 19 © 2017 Percona

20.Distributed query SELECT foo FROM distributed_table SELECT foo FROM local_table GROUP BY col1 • Server 1 SELECT foo FROM local_table GROUP BY col1 • Server 2 SELECT foo FROM local_table GROUP BY col1 • Server 3 20 © 2017 Percona

21.NYC taxi benchmark CSV 227 GB, ~1.3 bln rows SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count N Servers 1 3 140 Time, sec 1.224 0.438 0.043 Speedup x2.8 x28.5 * Taken from 21 © 2017 Percona

22.Reliability Any number of replicas Any replication topology Multi-master Cross-DC Asynchronous (for speed) • è Delayed replicas, possible stale data reads • More on data distribution and replication distribution 22 © 2017 Percona

23.Benchmarks! 23 © 2017 Percona

24.ClickHouse vs Spark vs MariaDB ColumnStore Wikipedia page Counts, loaded full with the year 2008, ~26 billion rows mariadb-columnstore-vs-clickhouse-vs-apache-spark/ 24 © 2017 Percona

25.ClickHouse vs Spark vs MariaDB ColumnStore 25 © 2017 Percona

26.Cloud: ClickHouse vs RedShift 5 queries based on NYC taxi dataset Query 1: SELECT dictGetString('taxi_zones', 'zone', toUInt64(pickup_location_id)) AS zone, count() AS c FROM yellow_tripdata_staging GROUP BY pickup_location_id ORDER BY c DESC LIMIT 10 RedShift 1 instance / 3 instances of ds2.xlarge (4 vCPU / 31 GiB memory) ClickHouse 1 instance r4.xlarge (4 vCPU / 30.5 GiB memory) 26 © 2017 Percona

27.seconds sec query query 27 © 2017 Percona

28.By Yandex, see [2] 28 © 2017 Percona

29.ClickHouse – use cases Adv networks data Web/App analytics Ecommerce/Telecom logs Online games Sensor data Monitoring 29 © 2017 Percona