Tips and Tricks with ColumnStore

充分利用你的专栏店专栏分析引擎!
深入研究柱状引擎的一般最佳实践,什么是柱状引擎的最佳用例,以及柱状存储和其他分析引擎的提示和技巧。
无数据摄入的分析
索引无索引引擎
如何和何时使用交叉引擎联接
启用低延迟数据接收
行+列混合方法
优化数据加载
拆分列以提高性能

展开查看详情

1.Tips ‘n Tricks with ColumnStore Jim Tommaney Alibaba Cloud 2006-2014 - InfiniDB Chief Architect/CTO

2. Tips ‘n Tricks with ColumnStore about Jim Tommaney • 25+ years data architecture, modeling, tuning • 2006-2014 Chief Architect/CTO for InfiniDB (now ColumnStore) • Production: InfiniDB, MySQL, Oracle, Postgres, Redshift, Snowflake • Verticals: Telecom, Web/Mobile Marketing, Genomics, Retail, Manufacturing • Architectural Understanding of (not SME): Vectorwise, Vertica, Paraccel, Greenplum, InfoBright, Netezza, Teradata, Hive, Spark, Impala, RapidsAI, RocksDB, BlazingDB, Brytlyt, OmniSci, Dremio 2

3. Columnar vs Column Family Redshift is described as “column-oriented” HBase is also described as “column-oriented” Redshift Architecture Hbase Architecture Columnar: Vertica, Redshift, ColumnStore, InfoBright, Vectorwise, Snowflake Column Family: BigTable, Hbase, Cassandra (not this talk) Additional discussion from Daniel Abadi: http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html

4.Short Background on Columnar Click to add text

5.Column Restriction and Projection Column 1 Column 20 |-------------- Column # Four ---------------| |-------- Column # Seventeen -----------| |-------------- Column # Six ---------------| Row 1 Extent # 5 Filter 3 Projection Projection Filter 1 Filter 2 Row 1000000000 Extent # 27 • Automatic Vertical Partitioning + Horizontal Partitioning • Just-In-Time Materialization 5

6.Columnar Optimal Use Case Columnar ~10x worse I/O Columnar ~10x better I/O About 100x slower queries About 100x faster queries About 100x slower inserts About 100x faster inserts 6

7.Physical I/O cost is minimized select l_linenumber, avg(l_quantity), count(*) from lineitem where l_discount < 0.1 group by 1; +--------------+-----------------+-----------+ | l_linenumber | avg(l_quantity) | count(*) | +--------------+-----------------+-----------+ 10.2 seconds PIO vs | 6 | 25.502396 | 31177895 | . . . 7.7 seconds cached | 5 | 25.502795 | 46753111 | Physical I/O +--------------+-----------------+-----------+ 7 rows in set, 1 warning (10.191 sec) 1,171,958 blocks select calgettrace()\G vs 0 blocks PIO *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM lineitem 3000 (l_discount,l_linenumber,l_quantity) 1171958 1173541 0 10.181 6489 select l_linenumber, avg(l_quantity), count(*) from lineitem where l_discount < 0.1 group by 1; +--------------+-----------------+-----------+ | l_linenumber | avg(l_quantity) | count(*) | +--------------+-----------------+-----------+ | 6 | 25.502396 | 31177895 | Cached . . . | 5 | 25.502795 | 46753111 | +--------------+-----------------+-----------+ 7 rows in set, 1 warning (7.747 sec) select calgettrace()\G *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM lineitem 3000 (l_discount,l_linenumber,l_quantity) 0 1171946 0 7.738 6489

8.Internal Column Sizes 1-Byte BOOLEAN, TINYINT, CHAR(1), VARCHAR(1) 2-Byte SMALLINT, CHAR(2), VARCHAR(2), DECIMAL 4-Byte INT, DECIMAL, FLOAT, DATE, CHAR(3 or 4), VARCHAR(3 or 4) BIGINT, DECIMAL, DOUBLE, DATETIME, CHAR(5 - 8), 8-Byte VARCHAR(5 - 8) 8-Byte + Variable CHAR(>8), VARCHAR(>8), TEXT, BLOB, ETC Length

9.Behind the scenes MySQL Processes ColumnStore 1 thread Execution Mgr 1 to 8 threads Parallel Workers 16 threads (or more) Row storage Columnar storage

10.Utility Statements Click to add text

11. calGetTrace (actual metrics) select calSetTrace(1); select l_shipinstruct, count(*) from lineitem group by 1. /*. < your query >. */ select calGetTrace()\G *************************** 1. row *************************** calGetTrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM lineitem 3000 (l_shipinstruct) 0 527380 0 10.901 3708 TAS UM - - - - - - 10.889 4 TNS UM - - - - - - 0.000 4 PM - Performance Module – Distributed, parallel processing UM - User Module – Final aggregation, multi-threaded PIO – Physical I/O - Blocks read from storage LIO – Logical I/O - Blocks touched, from memory PBE – Partition Blocks Eliminated – Blocks skipped with min/max meta-data check

12. select calflushcache(); select l_shipinstruct_code /*CHAR(1)*/ , count(*) c from lineitem group by 1; 4 rows in set, 1 warning (3.113 sec) *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM lineitem 3000 (l_shipinstruct_code) 0 58598 0 3.105 3708 select calflushcache(); /* not useful for production */ select l_shipinstruct_code /*CHAR(1)*/ , count(*) c from lineitem group by 1 ; 4 rows in set, 1 warning (4.895 sec) *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM lineitem 3000 (l_shipinstruct_code) 58598 59314 0 4.887 3708

13.Tips n Tricks Click to add text

14.Desc Lineitem (~1/2 billion rows, scale factor 80 ) +---------------------+---------------+ CREATE TABLE num ( | Field | Type | n tinyint(4) +---------------------+---------------+ ) ENGINE=Columnstore | l_orderkey | int(11) | | l_partkey | int(11) |. select * from num; | l_suppkey | int(11) | +------+ | l_linenumber | int(11) | | n | | l_quantity | decimal(15,2) | +------+ | l_extendedprice | decimal(15,2) | | 1 | | l_discount | decimal(15,2) | +------+ | l_tax | decimal(15,2) | | l_returnflag | char(1) | select * from shipmode_fk_innodb ; | l_linestatus | char(1) | select * from shipmode_fk_columnstore ; | l_shipdate | date | +------------+------------+ | l_commitdate | date | | l_shipmode | l_shipcode | | l_receiptdate | date | +------------+------------+ | l_shipinstruct | char(25) | | AIR | 1 | | l_shipmode | char(10) | | RAIL | 4 | | l_comment | varchar(44) | standard | REG AIR | 5 | -------------------------------------------------------- | MAIL | 3 | | l_shipcode | tinyint(4) | added | TRUCK | 7 | | l_comment_code | char(2) | | FOB | 2 | | l_shipinstruct_code | char(1) | | SHIP | 6 | | l_ship_datetime | datetime | +------------+------------+ | l_shipdate_yy | tinyint(4) | | l_shipdate_mm | tinyint(4) | | l_shipdate_yymm | smallint(6) | +---------------------+---------------+

15.Narrow column with aggregation ( 1-1 mapping ) +-------------------+-----------+ | l_shipinstruct | count(*) | +-------------------+-----------+ | NONE | 120011699 | | TAKE BACK RETURN | 120002249 | | COLLECT COD | 120010978 | | DELIVER IN PERSON | 120000203 | +-------------------+-----------+ | l_shipinstruct | char(25) | standard ------------------------------------------------- | l_shipinstruct_code | char(1) | added

16. Narrow column with aggregation ( 1-1 mapping ) select l_shipinstruct /*char(25)*/ , count(*) from lineitem group by 1; +-------------------+-----------+ | l_shipinstruct | count(*) | +-------------------+-----------+ | NONE | 120011699 | | TAKE BACK RETURN | 120002249 | | l_shipinstruct | char(25) | standard | COLLECT COD | 120010978 | | DELIVER IN PERSON | 120000203 | ------------------------------------------------- +-------------------+-----------+ | l_shipinstruct_code | char(1) | added 4 rows in set, 1 warning (10.942 sec) About 3.5x faster LIO = 527,380 About 9x reduced LIO, PIO (if needed) select case l_shipinstruct_code -> when 'C' then 'COLLECT COD' when 'D' then 'DELIVER IN PERSON' -> when 'N' then 'NONE' when 'T' then 'TAKE BACK RETURN' -> else 9 end instruct_code, c -> from ( select l_shipinstruct_code /*CHAR(1)*/ , count(*) c from lineitem group by 1 ) a ; +-------------------+-----------+ | instruct_code | c | 10.9 seconds vs +-------------------+-----------+ | TAKE BACK RETURN | 120002249 | | DELIVER IN PERSON | 120000203 | 3.1 seconds | NONE | 120011699 | | COLLECT COD | 120010978 | +-------------------+-----------+ 4 rows in set, 1 warning (3.109 sec) 527k LIO vs LIO = 58,598 59k LIO

17. Narrow column as filter ( 1-1 mapping ) select l_shipinstruct /*char(25)*/ , count(*) from lineitem -> where l_shipinstruct = "TAKE BACK RETURN" group by 1; +-------------------+-----------+ | l_shipinstruct | count(*) | +-------------------+-----------+ | NONE | 120011699 | | TAKE BACK RETURN | 120002249 | | l_shipinstruct | char(25) | standard | COLLECT COD | 120010978 | ------------------------------------------------- | DELIVER IN PERSON | 120000203 | | l_shipinstruct_code | char(1) | added +-------------------+-----------+ 4 rows in set, 1 warning (7.204 sec) About 3.5x faster About 17x reduced LIO LIO = 996,155 select case l_shipinstruct_code -> when 'C' then 'COLLECT COD' when 'D' then 'DELIVER IN PERSON' -> when 'N' then 'NONE' when 'T' then 'TAKE BACK RETURN' else 9 end instruct_code, c -> from ( select l_shipinstruct_code /*CHAR(1)*/ , count(*) c from lineitem -> where l_shipinstruct_code = ”T" group by 1 ) a ; +-------------------+-----------+ 7.2 seconds vs | instruct_code | c | +-------------------+-----------+ | TAKE BACK RETURN | 120002249 | 2.1 seconds | DELIVER IN PERSON | 120000203 | | NONE | 120011699 | | COLLECT COD | 120010978 | +-------------------+-----------+ 4 rows in set, 1 warning (2.108 sec) 996k LIO vs LIO = 58,598 59k LIO

18. Deferred case statements Standard: select case l_shipcode when 1 then 'AIR’ when 2 then 'FOB’ when 3 then 'MAIL’ when 4 then 'RAIL’ when 5 then 'REG AIR’ when 6 then 'SHIP’ when 7 then 'TRUCK' else 9 end shipcode, count(*) from lineitem group by 1; Nested Group By: select case l_shipcode when 1 then 'AIR’ when 2 then 'FOB’ when 3 then 'MAIL’ when 4 then 'RAIL’ when 5 then 'REG AIR’ when 6 then 'SHIP’ when 7 then 'TRUCK' else 9 end shipcode, c from ( select l_shipcode, count(*) c from lineitem group by 1 ) a ;

19. Deferred case statements Standard: select case l_shipcode when 1 then 'AIR' when 2 then 'FOB' -> when 3 then 'MAIL' when 4 then 'RAIL' when 5 then 'REG AIR' -> when 6 then 'SHIP' when 7 then 'TRUCK' else 9 end shipcode, -> count(*) from lineitem group by 1; +----------+----------+ | shipcode | count(*) | +----------+----------+ | AIR | 68580321 | . . . | RAIL | 68587640 | +----------+----------+ 7 rows in set, 1 warning (28.043 sec) 480,000,000 case statements Nested Group By: select case l_shipcode when 1 then 'AIR' when 2 then 'FOB' -> when 3 then 'MAIL' when 4 then 'RAIL' when 5 then 'REG AIR' -> when 6 then 'SHIP' when 7 then 'TRUCK' else 9 end shipcode, c -> from ( select l_shipcode, -> count(*) c from lineitem group by 1 ) a ; +----------+----------+ | shipcode | c | +----------+----------+ | RAIL | 68587640 | . . . | FOB | 68575597 | +----------+----------+ 7 rows in set, 1 warning (3.172 sec) 7 case statements

20.Create table, insert few rows CREATE TABLE `shipmode_fk_innodb` ( -> `l_shipmode` char(10) DEFAULT NULL, -> `l_shipcode` tinyint(4) DEFAULT NULL -> ) ENGINE=InnoDB ; InnoDB Query OK, 0 rows affected (0.006 sec) insert into shipmode_fk_innodb select distinct l_shipmode, l_shipcode -> from lineitem where l_orderkey < 1000000 ; Query OK, 7 rows affected, 1 warning (0.163 sec) Records: 7 Duplicates: 0 Warnings: 0 CREATE TABLE `shipmode_fk_columnstore` ( -> `l_shipmode` char(10) DEFAULT NULL, -> `l_shipcode` tinyint(4) DEFAULT NULL -> ) ENGINE=Columnstore ; ColumnStore Query OK, 0 rows affected (0.165 sec) insert into shipmode_fk_columnstore select distinct l_shipmode, l_shipcode -> from lineitem where l_orderkey < 1000000 ; Query OK, 7 rows affected, 1 warning (1.679 sec) Records: 7 Duplicates: 0 Warnings: 0

21. Deferred join Standard: select dim.l_shipmode, count(*) from lineitem join shipmode_fk_columnstore dim using (l_shipcode) group by 1; Nested Group By: select dim.l_shipmode, sum(c) from ( select l_shipcode, count(*) c from lineitem group by 1 ) L join shipmode_fk_columnstore dim using (l_shipcode) group by 1 ;

22. Deferred join Standard: select dim.l_shipmode, count(*) -> from lineitem join shipmode_fk_columnstore dim using (l_shipcode) -> group by 1; . . . 7 rows in set, 1 warning (10.721 sec) 480,000,000 rows joined Nested Group By : select dim.l_shipmode, sum(c) -> from ( select l_shipcode, count(*) c from lineitem group by 1 ) L -> join shipmode_fk_columnstore dim using (l_shipcode) group by 1 ; . . . 7 rows in set, 1 warning (3.239 sec) 7 rows joined

23. Insert select many rows (10 million) CREATE TABLE `lineitem_innodb` ( `l_orderkey` int(11) DEFAULT NULL, . . . `l_shipdate_yymm` smallint(6) DEFAULT NULL, KEY `i_l_shipdate` (`l_shipdate`), KEY `i_l_suppkey_partkey` (`l_partkey`,`l_suppkey`), KEY `i_l_partkey` (`l_partkey`), KEY `i_l_suppkey` (`l_suppkey`), KEY `i_l_receiptdate` (`l_receiptdate`), KEY `i_l_orderkey` (`l_orderkey`), KEY `i_l_orderkey_quantity` (`l_orderkey`,`l_quantity`), KEY `i_l_commitdate` (`l_commitdate`)) InnoDB insert into lineitem_innodb select * from lineitem_source_Innodb; Query OK, 10000000 rows affected (34 min 29.838 sec) Records: 10000000 Duplicates: 0 Warnings: 0 insert into lineitem_Columnstore select * from lineitem_source_Columnstore; ColumnStore Query OK, 10000000 rows affected, 1 warning (1 min 57.115 sec) Records: 10000000 Duplicates: 0 Warnings: 0

24. Load Data Infile, cpimport 10 million load data infile '/home/mysql/lineitem_10m' into table lineitem_Innodb InnoDB FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' ; load data infile Query OK, 10000000 rows affected (33 min 14.988 sec) Records: 10000000 Deleted: 0 Skipped: 0 Warnings: 0 load data infile '/home/mysql/lineitem_10m' into table ColumnStore lineitem_columnstore FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' ; load data infile Query OK, 10000000 rows affected (1 min 21.691 sec) Records: 10000000 Deleted: 0 Skipped: 0 Warnings: 0 ColumnStore cpimport -s "," -E \" -l lineitem_10m tpch80 lineitem_columnstore . . . 2019-05-24 01:49:23 (3164) INFO : No of Read Threads Spawned = 1 cpimport 2019-05-24 01:49:23 (3164) INFO : No of Parse Threads Spawned = 3 2019-05-24 01:49:43 (3164) INFO : For table tpch80.lineitem_columnstore: 10000000 rows processed and 10000000 rows inserted. 2019-05-24 01:49:44 (3164) INFO : Bulk load completed, total run time : 21.1454 seconds

25.Cross-Engine Join select dim.l_shipmode, count(*) InnoDB from lineitem dimension join shipmode_fk_innodb dim using (l_shipcode) group by 1; select dim.l_shipmode, count(*) ColumnStore from lineitem join shipmode_fk_columnstore dim dimension using (l_shipcode) group by 1;

26.Cross-Engine Join: 7 rows x .48 billion select dim.l_shipmode, count(*) -> from lineitem join shipmode_fk_innodb dim using (l_shipcode) InnoDB -> group by 1; . . . (9.899 sec) /* InnoDB dimension is faster */ dimension 7 rows in set, 1 warning select calgettrace()\G *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows CES UM - - - - - - 0.000 7 BPS PM lineitem 3000 (l_shipcode) 0 58598 0 9.883 6489 HJS PM lineitem-dim 3000 - - - - ----- - TAS UM - - - - - - 9.871 7 TNS UM - - - - - - 0.000 7 ColumnStore select dim.l_shipmode, count(*) -> from lineitem join shipmode_fk_columnstore dim using (l_shipcode) -> group by 1; dimension . . . 7 rows in set, 1 warning (10.721 sec) select calgettrace()\G *************************** 1. row *************************** calgettrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM dim 3081 (l_shipcode,l_shipmode) 0 4 0 0.002 7 BPS PM lineitem 3000 (l_shipcode) 0 58598 0 10.708 6489 HJS PM lineitem-dim 3000 - - - - ----- - TAS UM - - - - - - 10.690 7 TNS UM - - - - - - 0.000 7

27. Cross-Engine Join + deferred join select dim.l_shipmode, sum(c) InnoDB from ( select l_shipcode, count(*) c dimension from lineitem group by 1 ) L join shipmode_fk_innodb dim using (l_shipcode) group by 1 ; 7 rows in set, 1 warning (3.223 sec) ColumnStore select dim.l_shipmode, sum(c) from ( select l_shipcode, count(*) c dimension from lineitem group by 1 ) L join shipmode_fk_columnstore dim using (l_shipcode) group by 1 ; 7 rows in set, 1 warning (3.239 sec)

28.Behind the scenes MySQL Processes ColumnStore Execution Mgr Read 7 1 thread Read 7 Rows Rows 1 to 8 threads Parallel Workers 16 threads (or more)

29.Comment, add 2 byte field on leading +---------------------------------------------+----------------+ | l_comment | l_comment_code | +---------------------------------------------+----------------+ | carefully expres | ca | | counts will ha | co | | express deposits nag quickly regular depth | ex | | furiously pending, ev | fu | | mise blithely ironic | mi | | quests. regular, fina | qu | +---------------------------------------------+----------------+ | l_comment | varchar(44) | standard ------------------------------------------------- | l_comment_code | char(2) | added