ClickHouse-大数据分析的屠龙刀

下载 13

快召唤伙伴们来围观吧
微博 QQ QQ空间 贴吧
文档嵌入链接
<iframe src="https://www.slidestalk.com/ClickHouse/ClickHouse_Dragon_Slaughter_for_Big_Data_Analysis?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
微信扫一扫分享
已成功复制到剪贴板

ClickHouse

发布于

6年前

4868

人观看

#信息技术

• 快到什什么程度（Benchmark）
• 为什什么快
• 存储
• 计算
• 怎么才能快/最佳实践

展开查看详情

1 . ClickHouse ⼤大数据分析的屠⻰龙⼑刀尚书杰

2 . ⽬目录 • 快到什什么程度（Benchmark） • 为什什么快 • 存储 • 计算 • 怎么才能快/最佳实践

3 . About me • 2010 - 2014 云平台/云存储 • EBS / S3 • CDN / MQ • Cloudfoudry • hadoop / hbase / cassandra • 2014 - ⾄至今⼤大数据 • Kafka • spark(streaming, sql) / Hadoop / hive • Greenplum / HAWQ / Postgres • Druid • ETL • Now, I use CLICKHOUSE

4 .How fast - Benchmark 1 • 1.1 billion taxi rides benchmark • http://tech.marksblogg.com/benchmarks.html

5 .How fast - benchmark 2 • https://clickhouse.yandex/benchmark.html • Select AggregateFunction, fields from Table group by xxx having xxx • 性能甚⾄至好于成熟商业数据库(vertica, greenplum, redshift, …)

6 . How fast • https://www.percona.com/blog/2017/03/17/column-store-database- benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/

7 . Who use it • 成熟度⾼高 • 数百家公司使⽤用 • 中国许多⼩小集群, 遍地开花 • 国外许多集群, 重度使⽤用 • Yandex ~ 500 nodes • 每天百亿⾮非常常⻅见 • https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/ (2018.5) • It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple and handy • Yandex已经上云,对外服务 • 英⽂文 • 代码，注释，⽂文档 • 适⽤用于 • 多维分析 • 事件分析 • 宽表分析

8 . Who use it 2017

9 . Who use it • 2017.10, 142 -> 2018.10, 103

10 . Why fast • StorageEngine • Pure column based storage • LSM-Tree Family • Max-min index • Partition • Sharding / Distributed table • ExecuteEngine • Multi-threads / Multi-nodes • Vector engine • Processed by block, not line • SIMD enhancement • LLVM enhancement • C++

11 . MergeTree：LSM-Tree • MergeTree • Each folder is a part • Partition_StartBlock_EndBlock_MergeTreeLevel • Part Tree (MergeTree) • 2 partitions, 17802, 17803

12 . How to: MergeTree • Data In file

13 . Why fast: MergeTree • LSM-Tree • Buffer + MergeTree • Primary key block index • Chosen primary key • 此主键⾮非mysql主键, 不不唯⼀一，⽤用于排序 • 经常查询的key作为主键 • 多主键联合索引 • 主键不不要太碎, no timestamp, eg. toHour

14 .Why fast: Vector Enging • Vector Engine • data processed by each block, not each line • SIMD加速 • LLVM加速 • set compile_expressions=1

15 . Why fast:极致代码 • ⽅方法⼀一: ‘B’ = (‘b’ - ‘a’) + ‘A’ • ⽅方法⼆二：’B’ = ‘b’ ^ (‘A’ ^ ‘a’) • ⽅方法三：SIMD

16 .Why fast: parallel processing • Multi-threads • Cpu 100% • set max-threads • Multi-nodes • Liner scalability • shard by user_id for event_processing

17 . How to: Select SELECT [DISTINCT] expr_list [FROM [db.]table | (subquery) | table_function] [FINAL] [SAMPLE sample_coeff] [ARRAY JOIN ...] [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list [PREWHERE expr] [WHERE expr] [GROUP BY expr_list] [WITH TOTALS/ROLLUP/CUBE] [HAVING expr] [ORDER BY expr_list] [LIMIT [n, ]m] [UNION ALL ...] [INTO OUTFILE filename] [FORMAT format] [LIMIT n BY columns]

18 . Cluster • Cluster is just a route map, defined in config file • Local table / Distributed table(cluster)

19 .Local table/ Distributed table • Cluster • Like router table • Distributed table • Radom distributed • Hash, hash-key, only work when inserting • On cluster 语句句 • ⽅方便便运维

20 . How to: replica • 主主复制 • Zookeeper • Re-insert • Just retry

21 . Sharding • Hash sharding • Fully copy sharing • Co-locate join • …

22 . How to: 配置 • https://clickhouse.yandex/docs/en/single/#usage- recommendations • Raid • 加速io • 提升容量量 • SSD • ⼤大内存 • 多核

23 . How to: 并⾏行行导⼊入 • CreateTable • String => int, StringWithDictionary • 多个命令⾏行行导⼊入 • Hadoop fs -cat xxxx | clickhouse-client --host=xxx --query="INSERT INTO criteo_log FORMAT TabSeparated” (CSV, JSONEachRow) • input_format_allow_errors_num / input_format_allow_errors_ratio • json ~ 30w/s per node, 约100字段 • Csv ~ 50w/s per node • Default value, visitParamExtractString • TmpTable/Attach Partition / Detach Partition • ReplicateTable: just retry • 压缩率⾼高 • 与原始txt⽂文件相⽐比, 30~50倍以上very easy • 优于hadoop⽂文本⽂文件存储，snappy • 优于parquet

24 . How to: ⼤大宽表模型 • Less join • Depend on ETL • Wide table + dictionary • In/global In • join/global join

25 . 抽样⼤大法好 • ⾃自带抽样，Sample语句句 • 近似计算 • HyperLogLog • Uniq • Quantiles

26 . 物化视图好 • 级联计算 • Rollup • Cube

27 . Working with Others • Kafka Engine • Mysql • ODBC • Oracle, sqlserver, postgres, … • MongoDB

28 . DataFlow Example 1 • Fact table + MV

29 . DataFlow Example 2 • KAFKA + CH

1点赞

2收藏

13下载