18/12 - SASI, Cassandra on the full text search ride - Apache Bi

主要介绍 SASI的全文索引

展开查看详情

1.SASI, Cassandra on the full text search ride DuyHai DOAN – Apache Cassandra™ Evangelist

2.1 5 minutes introduction to Apache Cassandra™ 2 SASI introduction 3 SASI cluster-wide 4 SASI local read/write path 5 Query planner 6 Some benchmarks 7 Take away 2 @doanduyhai

3.Trademark Policy From now on … Cassandra ⩵ Apache Cassandra™ 3 @doanduyhai

4.5 minutes introduction to Apache Cassandra™ @doanduyhai

5.The tokens Random hash of #partition à token = hash(#p) C* C* Hash: ] –x, x ] C* C* hash range: 264 values x = 264/2 C* C* C* C* 5 @doanduyhai

6. Token ranges ⎤ 3x ⎤⎥ ⎤ x ⎤⎥ B C A: ⎥−x,− ⎥ ⎥ E : ⎥0, ⎥ ⎥ ⎦ 4⎦ ⎦ 4⎦ ⎤ 3x 2x ⎤⎥ ⎤ x 2x ⎤⎥ B: ⎥⎥− ,− F :⎥ , ⎥ ⎥ A D ⎦ 4 4 ⎥⎦ ⎦4 4 ⎦ ⎤ 2x x ⎤ ⎤ 2x 3x ⎤ C : ⎥⎥− ,− ⎥⎥ G : ⎥⎥ , ⎥⎥ ⎦ 4 4⎦ ⎦ 4 4 ⎦ H E ⎤ x ⎤⎥ ⎤ 3x ⎤⎥ D: ⎥− ,0 ⎥ ⎥ H : ⎥ ,x ⎥ ⎥ ⎦ 4 ⎦ ⎦ 4 ⎦ G F 6 @doanduyhai

7.Distributed tables CREATE TABLE users( B C user_id int, …, PRIMARY KEY(user_id) A D ), user_id1 user_id2 H E user_id3 user_id4 user_id5 G F 7 @doanduyhai

8.Distributed tables B C user_id3 user_id1 A D H user_id4 E user_id2 G F user_id5 8 @doanduyhai

9.Coordinator node 2 3 Responsible for handling requests (read/write) B C Every node can be coordinator 1 A D •masterless •no SPOF •proxy role H E request coordinator G F 9 @doanduyhai

10.!" Q&A 10 @doanduyhai

11.SASI introduction @doanduyhai

12.What is SASI ? • SSTable-Attached Secondary Index à new 2nd index impl that follows SSTable life-cycle • Objective: provide more performant & capable 2nd index 12 @doanduyhai

13.Who created it ? Open-source contribution by an engineers team 13 @doanduyhai

14.Why is it better than native 2nd index ? • follow SSTable life-cycle (flush, compaction, rebuild …) à more optimized • new data-strutures • range query (<, ≤, >, ≥) possible • full text search options 14 @doanduyhai

15.Demo 15 @doanduyhai

16.SASI cluster-wide @doanduyhai

17. Distributed index On cluster level, SASI works exactly like native 2nd index B C UK user87 user176 … user987 A D UK user1 user102 … user493 US user54 user483 … user938 UK user17 user409 … user787 H E G F 17 @doanduyhai

18.Distributed search algorithm B C A D 1st round Concurrency factor = 1 H E coordinator G F 18 @doanduyhai

19.Distributed search algorithm B C A D Not enough results ? H E coordinator G F 19 @doanduyhai

20.Distributed search algorithm B C 2nd round Concurrency factor = 2 A D H E coordinator G F 20 @doanduyhai

21.Distributed search algorithm B C A D Still not enough results ? H E coordinator G F 21 @doanduyhai

22.Distributed search algorithm B C A D 3rd round Concurrency factor = 4 H E coordinator G F 22 @doanduyhai

23.Concurrency factor formula • more details at: http://www.doanduyhai.com/blog/?p=13191 23 @doanduyhai

24.Caveat 1: non restrictive filters B C Hit all nodes A D eventually L H E coordinator G F 24 @doanduyhai

25.Caveat 1 solution : always use LIMIT B C SELECT * FROM … A D WHERE ... LIMIT 1000 H E coordinator G F 25 @doanduyhai

26.Caveat 2: 1-to-1 index (user_email) B C A D Not found WHERE user_email = ‘xxx' H E coordinator G F 26 @doanduyhai

27.Caveat 2: 1-to-1 index (user_email) B C A D WHERE user_email = ‘xxx' H E Still no result coordinator G F 27 @doanduyhai

28.Caveat 2: 1-to-1 index (user_email) B C A D WHERE user_email = ‘xxx' At best 1 user found At worst 0 user found H E coordinator G F 28 @doanduyhai

29.Caveat 2 solution: materialized views For 1-to-1 index/relationship, use materialized views instead CREATE MATERIALIZED VIEW user_by_email AS SELECT * FROM users WHERE user_id IS NOT NULL and user_email IS NOT NULL PRIMARY KEY (user_email, user_id) But range queries ( <, >, ≤, ≥) not possible … 29 @doanduyhai