Low Cost Transactional and Analytics With MySQL and Clickhouse

Do you want MySQL protocol compatibility together with near-real-time dashboard analytics?

We hoard data in the hope of extracting meaningful information from it and making informed business decisions. But we also meet (or fail to meet) that need at ever-increasing cost. While there are many open-source roll-your-own options, they are operationally expensive and can be a maintenance burden, not to mention each one's own drawbacks. If you have been storing your data in MySQL and running your queries against the same set of servers, don't despair: Clickhouse may be exactly what you need. In this talk we discuss how to deploy, design, and maintain a Clickhouse analytics platform that continuously reads data from your MySQL servers in near real time, depending on whether you have transformations or a complex schema.


1. Low Cost Transactional and Analytics with MySQL + Clickhouse: Have Your Cake and Eat It Too

2. Clickhouse Overview ... intentionally left blank ...

3. Why Clickhouse
- Query language is very familiar to MySQL users, which means a low barrier to entry
- Lean design: highly parallel, but not too resource-greedy
- https://clickhouse.yandex/benchmark.html

4. Why Not Clickhouse
- Highly dynamic data, especially data that spans multiple partitions
- Highly complex relationships between multiple large tables
- Multiple JOINs were not possible at the time

5. Our Use Case
- Requirement: <15 min dashboard stats latency
- From MySQL: db.r4.8xlarge (~$3.84/hr)
- Existing dashboard queries take 30s and up
- Heavy caching, async
- 4+ RDS instances

6. Our Use Case
- To Clickhouse: single i3.8xlarge (~$2.496/hr)
- AVG(5s) dashboard queries
- 15 queries, 120 entities
- ~5 min dashboard lag

7. How: Replicating to Clickhouse

8. Method #1
- https://www.altinity.com/blog/2018/6/30/realtime-mysql-clickhouse-replication-in-practice
- UPDATE/DELETE ignored

9. Method #2
- Native MySQL connection from Clickhouse
- INSERT INTO clickhouse_table
      SELECT * FROM mysql('host', 'db', 'table', 'user', 'password')
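
A minimal sketch of Method #2, assuming the Clickhouse server can reach the MySQL host directly; the target table (analytics.reflog), host, and credentials below are placeholders:

    -- Clickhouse side: one-shot copy of a MySQL table through the built-in mysql() table function.
    INSERT INTO analytics.reflog
    SELECT *
    FROM mysql('mysql-host:3306', 'appdb', 'reflog', 'replica_user', 'replica_password');

    -- The same function also works for ad-hoc reads, e.g. to spot-check row counts:
    SELECT count()
    FROM mysql('mysql-host:3306', 'appdb', 'reflog', 'replica_user', 'replica_password');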

10. Method #3: Roll our own
- Heavy UPDATE/DELETE workload; Clickhouse's UPDATE/DELETE implementation was not available at the time and we have not tested it
- Takes most of the concepts from Method #1
- Use partitions to update data (see the sketch below)
- Do not rely on consistent partitions, but partitions can easily be corrected as needed
- Tables independently replicated
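
A minimal sketch of correcting a single stale partition directly, assuming a MergeTree table partitioned by day (PARTITION BY toDate(created_at)) and the placeholder names used above; the production flow on the later slides goes through a dummy table instead:

    -- Drop the partition that may be inconsistent ...
    ALTER TABLE analytics.reflog DROP PARTITION '2018-04-11';

    -- ... then re-pull just that day's rows from the source MySQL table.
    INSERT INTO analytics.reflog
    SELECT *
    FROM mysql('mysql-host:3306', 'appdb', 'reflog', 'replica_user', 'replica_password')
    WHERE created_at BETWEEN '2018-04-11 00:00:00' AND '2018-04-11 23:59:59';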

11. Initial Table Import
Constraints we can live with:
- Map the MySQL table to a Clickhouse definition (see the sketch below)
- New columns will not be included until a rebuild
- No replication, but multiple Clickhouse nodes
- Table rebuild requires a static replica
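
As an illustration of that mapping, a hypothetical reflog table on both sides might look like this; the column names follow the sample queries later in the deck, everything else is assumed:

    -- Source table in MySQL (placeholder definition).
    CREATE TABLE reflog (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        entity_id   INT NOT NULL,
        user_id     BIGINT NOT NULL,
        uri_id      BIGINT NOT NULL,
        status      VARCHAR(16) NOT NULL,
        approved    TINYINT NOT NULL,
        time_amount DOUBLE NOT NULL,
        created_at  DATETIME NOT NULL
    );

    -- Hand-mapped Clickhouse definition: partitioned by day, ordered for the
    -- dashboard access pattern (entity_id plus a time range).
    CREATE TABLE analytics.reflog (
        id          UInt64,
        entity_id   UInt32,
        user_id     UInt64,
        uri_id      UInt64,
        status      String,
        approved    UInt8,
        time_amount Float64,
        created_at  DateTime
    ) ENGINE = MergeTree()
    PARTITION BY toDate(created_at)
    ORDER BY (entity_id, created_at);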

12. Initial Table Import Process, step 1: Initialize the dump worker
- Create a dummy Clickhouse table
- Start dumping the table in chunks (see the sketch below)
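
A minimal sketch of this step, using the hypothetical schema above; chunking by primary-key range and the chunk size are assumptions:

    -- Clickhouse side: dummy table with the same structure and engine as the target.
    CREATE TABLE analytics.reflog_import AS analytics.reflog;

    -- MySQL side: dump the source table in primary-key chunks, e.g. 1M rows at a time.
    SELECT id, entity_id, user_id, uri_id, status, approved, time_amount, created_at
    INTO OUTFILE '/tmp/reflog_chunk_000001.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    FROM reflog
    WHERE id > 0 AND id <= 1000000;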

13. Initial Table Import Process, step 2: Keep dumping chunks until the table is fully exported
- Load CSV chunks as they complete (see the sketch below)
- Once the dump completes, record the binlog coordinates to the metadata server
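
A sketch of the load and checkpoint, assuming chunks are shipped to the import worker and loaded with clickhouse-client; the write to the metadata server itself is not shown:

    -- Clickhouse side: load each finished chunk, run from a shell on the import worker:
    --   clickhouse-client --query="INSERT INTO analytics.reflog_import FORMAT CSV" \
    --       < /tmp/reflog_chunk_000001.csv

    -- MySQL side: when the last chunk is written, record where the binlog is,
    -- so the incremental import knows where to resume.
    SHOW MASTER STATUS;

    -- Clickhouse side: sanity-check the row count before promoting the dummy table.
    SELECT count() FROM analytics.reflog_import;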

14. Incremental Import
- Separate CDC metadata capture and import worker model (see the sketch below)
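
One way to back such a worker model, purely as an assumption, is a small metadata table that the CDC capture side writes and the import side polls; here it lives in MySQL:

    -- Hypothetical metadata table shared by the CDC capture and import workers.
    CREATE TABLE replication_checkpoints (
        table_name      VARCHAR(64) NOT NULL,
        partition_key   DATE        NOT NULL,   -- which partition has pending changes
        binlog_file     VARCHAR(64) NOT NULL,   -- last binlog file seen for this partition
        binlog_position BIGINT      NOT NULL,   -- last binlog position seen
        dirty           TINYINT     NOT NULL DEFAULT 1,  -- 1 = needs re-import
        updated_at      TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
                                    ON UPDATE CURRENT_TIMESTAMP,
        PRIMARY KEY (table_name, partition_key)
    );

    -- The import worker repeatedly asks for the next dirty partition to rebuild.
    SELECT table_name, partition_key
    FROM replication_checkpoints
    WHERE dirty = 1
    ORDER BY updated_at
    LIMIT 1;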

15. Incremental Import
- One partition at a time; import speed per partition may drive how the table is partitioned, i.e. daily vs. weekly (see the sketch below)
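
For illustration, the only difference between the two granularities is the PARTITION BY expression: daily partitions keep each re-import small, weekly partitions mean fewer partitions to track and swap. Table and column names are the hypothetical ones from earlier:

    -- Daily partitions: smaller, faster per-partition re-imports.
    CREATE TABLE analytics.reflog_daily (
        id UInt64, entity_id UInt32, user_id UInt64, uri_id UInt64,
        status String, approved UInt8, time_amount Float64, created_at DateTime
    ) ENGINE = MergeTree()
    PARTITION BY toDate(created_at)
    ORDER BY (entity_id, created_at);

    -- Weekly partitions: fewer, larger partitions.
    CREATE TABLE analytics.reflog_weekly (
        id UInt64, entity_id UInt32, user_id UInt64, uri_id UInt64,
        status String, approved UInt8, time_amount Float64, created_at DateTime
    ) ENGINE = MergeTree()
    PARTITION BY toMonday(created_at)
    ORDER BY (entity_id, created_at);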

16. Incremental Import
- Import the partition into a dummy table, then swap it into the real table (see the sketch below)
- This swap is not atomic
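
A minimal sketch of the swap, assuming day partitions and a Clickhouse version that supports ATTACH PARTITION ... FROM; because the drop and the attach are separate statements, a dashboard query can land between them, which is the phantom-read window mentioned on the next slide:

    -- The dummy table has just been loaded with a fresh copy of the partition.
    -- Swap it in: drop the stale partition from the real table, then attach the
    -- fresh one from the dummy table. Two statements, so the swap is not atomic.
    ALTER TABLE analytics.reflog DROP PARTITION '2018-04-11';
    ALTER TABLE analytics.reflog ATTACH PARTITION '2018-04-11' FROM analytics.reflog_import;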

17. Limitations
- Phantom reads during the partition swap
- Distributed lock: only one table instance is updated at a time (see the sketch below)
- Adds import latency; two CH instances worked well
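
One simple way to implement such a lock, noted here only as an assumption rather than what the authors used, is MySQL's named-lock functions on the metadata server, so that only one import worker swaps a given table at a time:

    -- Import worker, before swapping partitions on analytics.reflog:
    SELECT GET_LOCK('swap_analytics_reflog', 10);   -- returns 1 if acquired within 10s

    -- ... perform the DROP PARTITION / ATTACH PARTITION swap on this Clickhouse node ...

    -- Release the lock so the worker for the other Clickhouse node can proceed.
    SELECT RELEASE_LOCK('swap_analytics_reflog');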

18. Limitations
- max_execution_time being breached
- max_concurrent_queries
- Use async workers to queue and cache dashboard queries (see the note below)
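
For reference, max_execution_time can be set per session or attached to a single query, while max_concurrent_queries is a server-level setting in config.xml; the values below are only examples:

    -- Clickhouse, per-session cap:
    SET max_execution_time = 30;

    -- Or per query, attached to the dashboard query itself:
    SELECT count()
    FROM analytics.reflog
    WHERE entity_id = 396
    SETTINGS max_execution_time = 30;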

19. Sample Queries (1): the DISTINCT clause dilemma (MySQL)

    SELECT SQL_NO_CACHE
        COUNT(*) AS hit_count,
        COUNT(DISTINCT(reflog.user_id)) AS visitors,
        SUM(reflog.time_amount) AS time_spent,
        AVG(time_amount) AS avg_time
    FROM reflog
    INNER JOIN page_uri pg ON pg.uri_id = reflog.uri_id
    WHERE reflog.entity_id = 396
        AND (reflog.created_at BETWEEN '2017-08-13 05:00:00.000000'
            AND '2017-09-14 04:59:59.999999')
        AND reflog.status = 'statusA' AND (reflog.approved = 1)
        AND pg.entity_id = 396 AND pg.admin_id = 3275;

    +-----------+----------+-------------------+--------------------+
    | hit_count | visitors | time_spent        | avg_time           |
    +-----------+----------+-------------------+--------------------+
    |   2827576 |  2077654 | 60283159.65371944 | 21.319730982905302 |
    +-----------+----------+-------------------+--------------------+
    1 row in set (31.57 sec)

20. Sample Queries (1), rewritten for Clickhouse

    SELECT
        COUNT(*) AS hit_count,
        COUNT(DISTINCT user_id) AS visitors,
        SUM(time_amount) AS time_spent,
        AVG(time_amount) AS avg_time
    FROM reflog
    WHERE (approved = 1) AND (status = 'statusA')
        AND ((created_at >= 1502600400) AND (created_at <= 1505365199))
        AND (uri_id IN
        (
            SELECT uri_id FROM page_uri WHERE (admin_id = 3275 AND entity_id = 396)
        ))
        AND entity_id = 396

    ┌─hit_count─┬─visitors─┬────────time_spent─┬──────────avg_time─┐
    │   2827576 │  2077654 │  60283159.1976388 │ 21.31973082160791 │
    └───────────┴──────────┴───────────────────┴───────────────────┘
    1 rows in set. Elapsed: 0.243 sec. Processed 6.09 million rows,
    184.37 MB (25.10 million rows/s., 760.03 MB/s.)

21. Sample Queries (2) (MySQL): large date range (1 month), low-cardinality index constant. We can optimize this to death ... but:

    SELECT sum(`click_amount`) AS `click_amount`
    FROM `reflog`
    WHERE (entity_id = 594 AND created_at >= 1520793000
        AND created_at <= 1523557799
        AND id IN ((
            SELECT `event_id` FROM `user_logs`
            WHERE ((campaign_type = 'thisthatlogtype' AND log_id IN
                ((SELECT id FROM `some_log_types`
                WHERE (entity_id = 594))) AND control_group = 0)
                AND (`event_type` = 'login'))))
        AND approved = 1 AND status = 'statusA');

    ...
    57d1f674f8e75c4319e9a7a88afdd350  -
    1 row in set (12 min 30.40 sec)

22. Sample Queries (2), rewritten for Clickhouse. There is still room for optimization, but this is good enough:

    SELECT sum(click_amount) AS click_amount
    FROM reflog
    WHERE (entity_id = 594) AND (created_at >= 1520793000)
        AND (created_at <= 1523557799) AND (id IN (
            SELECT event_id
            FROM user_logs
            WHERE ((log_type = 'thisthatlogtype') AND (log_id IN
            (
                SELECT id
                FROM some_log_types
                WHERE entity_id = 594
            )) AND (control_group = 0)) AND (event_type = 'login')
        )) AND (approved = 1) AND (status = 'statusA')

    ┌───────click_amount─┐
    │ 4771.6999979019165 │
    └────────────────────┘
    1 rows in set. Elapsed: 5.403 sec. Processed 598.31 million rows,
    22.94 GB (110.74 million rows/s., 4.25 GB/s.)

23. Rate This Talk

24. Thank you for joining us this year! ... and thanks to our sponsors!

25. Questions?