Zheap: The Next Generation Storage Engine for Postgres

在过去的几年中,Postgres在许多核心系统的功能、性能和可扩展性方面都取得了很大的进步。然而,许多企业仍然抱怨的一个问题是,其规模随着时间的推移而增加,这通常被称为膨胀。EnterpriseDB正在领导社区为Postgres构建一个新的存储系统zheap,这将更好地控制膨胀。本次会议将讨论:
该计划的目标包括更好地控制“bloat”,在大多数情况下消除写放大,并减少Postgres数据库的磁盘存储容量。 新存储系统的技术架构,包括将新设计与目前Postgres中的实现进行对比
*项目的当前状态,包括引入新的存储管理接口、撤消处理以及何时将作为Postgres的一部分发布zheap。

展开查看详情

1. Zheap: The Next Generation Storage Engine for Postgres Ken Rugg, EnterpriseDB May 29, 2019 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

2. The world leader in Enterprise Postgres software and services PROVEN COMMITTED GLOBAL • Recognized RDBMS leader • Founded in 2004 • Customer global base > 4000 by Gartner • Largest PostgreSQL contributor— • 300+ Employees world-wide • 2013-2018 Member of 40% of core team • Offices in 16 countries Gartner Magic Quadrant CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

3.ZHEAP: WHAT IS IT? • New storage format being developed by EnterpriseDB • Work in progress is already released on github under PostgreSQL license • Basics work, but much remains to be done • Goal is to get it integrated into PostgreSQL CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

4.BLOAT: MOTIVATION AND DEFINITION • Original motivation for zheap: In some workloads, PostgreSQL tables tend to bloat, and when they do, it’s hard to get rid of the bloat. • Bloat occurs when the table and indexes grow even though the amount of real data being stored has not increased. • Bloat is caused mainly by updates, because we must keep both the old and new row versions for a period of time. • Bloat can be a concern because of increased disk consumption, but typically a bigger concern is performance loss – if a table is twice as big as it “should be”, scanning it takes twice as long. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

5.BLOAT: WHY A NEW STORAGE FORMAT? • All systems that use MVCC must deal with multiple row versions, but they store them in different places. • PostgreSQL and Firebird put all row versions in the table. • Oracle and MySQL put old row versions in the undo log. • SQL Server puts old row versions in tempdb. • Leaving old row versions in the table makes cleanup harder – sometimes need to use CLUSTER or VACUUM FULL. • Improving VACUUM helps contain bloat, but can’t prevent it completely. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

6. ZHEAP: HIGH-LEVEL GOALS • Better Bloat Control • Perform updates “in place” to avoid creating bloat (when possible) • Reuse space right after COMMIT or ABORT – little or no need for VACUUM • Fewer Writes • Eliminate hint-bits, freezing and anything else that could dirty a page (other than an update) • Allowing in-place updates when index column is updated by providing delete- marking in index • Indexes are not touched if the indexed columns are not changed • Smaller in Size • Narrower tuple headers – most transactional info not stored on the page itself • Eliminate most alignment padding 6 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

7. ZHEAP: TRANSACTION SLOTS • Each zheap page has fixed set of transaction slots containing transaction info (transaction id, epoch and the latest undo record pointer of that transaction). • As of now, the number of slots are configurable and default value of same is four. • Each transaction slot occupies 16 bytes. • We allow the transaction slots to be reused after the transaction becomes too old to matter (older than oldest xid having undo), committed or rolled back. This allows us to operate without having too many slots. 7 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

8.ZHEAP: PAGE FORMAT Page Header Item Item Item Tuple Tuple Tuple Slot Slot Slot Slot Transaction Slots TPD entry location CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

9.ZHEAP: TUPLE FORMAT Xmin – inserting transaction id infomask2 – number of attrs Xmax – deleting transaction id and transaction slot id infomask – tuple flags t_cid – inserting or deleting Tuple Header command id, or both hoff – length of tuple header incl. bitmaps t_ctid – tuple id (page/item) bits – bitmap representing NULLs infomask2 – number of attrs and flags OID – object id of tuple (optional) infomask – tuple flags hoff – length of tuple header incl. bitmaps bits – bitmap representing NULLs ..... OID – object id of tuple (optional) Attributes ..... ..... Attributes ..... heap Tuple zheap Tuple CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

10.ZHEAP: HOW DOES IT WORK? • Whenever a transaction performs an operation on a tuple, the operation is also recorded in an undo log. • If the transaction aborts, the undo log can be used to reverse all of the operations performed by the transaction. • The undo log also contains most of the data we need for MVCC purposes, so little transactional data needs to be stored in the table itself. This, and a reduction in alignment padding, mean that zheap is smaller on disk. • We avoid dirtying the page except when the data has been modified, or after an abort. Long-term goal: No VACUUM or other routine maintenance of any kind. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

11.PURPOSE OF UNDO • Undo is responsible for reversing the effects of aborted transactions. • When a transaction performs an operation, it also writes it to the write-ahead log (REDO) and records the information needed to reverse it (UNDO). If the transaction aborts, UNDO is used to reverse all changes made by the transaction. • It stores old versions of rows required for MVCC. • Independent of avoiding bloat, having undo can provide systematic framework for cleaning. • For example, if a transaction creates a table and, while that transaction is still in progress, there is an operating system or PostgreSQL crash, the storage used by the table is permanently leaked. This could be fixed by undo. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

12.ZHEAP: BASIC OPERATIONS • Requires an UNDO record for each INSERT, DELETE or UPDATE • INSERT – Same as current heap + undo • Insert the new row • Emit an UNDO to remove it • Delete - Same as current heap + undo • Emit an UNDO to put back the row • Update – Very different than current heap • And slightly more complicated… CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

13. ZHEAP: BASIC OPERATIONS • Update can be done in 2 ways • In place update – whenever possible! •Old row placed in UNDO •New row replaces it in heap • When in-place update is not possible… •Perform DELETE combined with INSERT •Emit an UNDO to reverse both the operations • In place updates can not be done when… • New row is bigger than old row and can't fit in same page • Index column is updated • In place vs HOT • HOT requires 1) space on the page for the new row and 2) no index changes 13 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

14.UNDO CHAINS AND VISIBILITY • The undo chain is formed at page level for each transaction • Each undo record header contains the location of previous undo record of the transaction on the same page • When the current tuple is not visible to the scan snapshot, we can traverse undo chain to find the version which is visible (if any). CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

15.UNDO CHAINS Undo log zHeap page-1 1, abc T-1 (1,abc) zHeap page-1 (100,xyz) (1,def) 1, def T-1 zHeap page-2 100, xyz T-1 zHeap page-1 1, ghi T-1 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

16.PERFORMANCE DATA (TEST SETUP) • Size and TPS comparison of heap and zheap • We have used pgbench ➢ Initialize the data (at scale factor 1000) ➢ Simple-update test (which comprises of one-update, one-select, one-insert) • Machine details: • x86_64 architecture, 2-sockets, 14-cores per socket, 2-threads per-core, 64-GB RAM • Non-default parameters: • shared_buffers=32GB, min_wal_size=15GB, max_wal_size=20GB, checkpoint_timeout=1200, maintenance_work_mem=1GB, checkpoint_completion_target=0.9, synchoronous_commit = off; CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

17.PRELIMINARY TESTING ● The Initial size of accounts table is 13GB in heap and 11GB in zheap. ● The size in heap grows to 19GB at 8-client count test and to 26GB at 64-client count test ● The size in zheap remains at 11GB for both the client-counts at the end of test ● All the undo generated during test gets discarded within a few seconds after the open transaction is ended ● The TPS of zheap is ~40% higher than heap in above tests at 8 client-count CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

18.BENEFITS • Because zheap is smaller on-disk, we get a small performance boost. • No worries about VACUUM kicking in unexpectedly. • Undo bloat is self-healing – good for cloud or other “unattended” workloads. • In workloads where the heap bloats and zheap only bloats the undo, we get a massive performance boost • Discarding undo happens in the background and is cheaper than HOT pruning; that helps, too! CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

19.DRAWBACKS • Transaction abort will be more expensive. • Deletes might not perform as well. • Could be slow if most/all indexed columns are updated at the same time. • On small tables, contention for transaction slots can limit performance. • Huge amount of development work. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

20.PLUGGABLE STORAGE: PLAN • Allow PostgreSQL to support pluggable storage formats. • Allows innovation – major changes to the heap are impossible because everyone relies on it. Can’t go backwards for any use case! • Allows for user choice – if there are multiple storage formats available, pick the one that is best for your use case. CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

21.PLUGGABLE STORAGE: EXAMPLES • Columnar storage • Most queries don’t need all columns • Write-once read-many (WORM) • No support UPDATE, DELETE, or SELECT FOR UPDATE/SHARE • Index-organized storage • One index is more important than all of the others • In-memory storage • No need to spill to disk • Non-transactional storage • No MVCC CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

22.WHEN? • Pluggable Storage in PostgreSQL 12 • Undo in PostgreSQL 12 • First version of zheap targeted for PostgreSQL 13 • There will still be much more to do for “v2” CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

23.WHO? zheap & undo Pluggable Storage • Amit Kapila (development lead) • Haribabu Kommi • Dilip Kumar (Fujitsu) • Kuntal Ghosh • Alexander Korotkov • Mithun CY (Postgres Pro) • Ashutosh Sharma • Andres Freund • Amit Khandekar • Rafia Sabih • Beena Emerson • Amit Khandekar • Thomas Munro • Neha Sharma (QA) CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

24. QUESTIONS & DISCUSSION 24 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

25.25 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.

26. THANK YOU info@enterprisedb.com www.enterprisedb.com 26 CONFIDENTIAL © Copyright EnterpriseDB Corporation, 2019. All rights reserved.