基于空间优化的TopK补全算法

对于自动补全查询而言,倘若待查的数据巨大,那么自动补全技术设计的算法优化(空间和时间)就显得务必重要了,本文介绍RMQ TRIE(RT)的算法原理。
展开查看详情

1.Space-Efficient Data Structures for Top-k Completion Giuseppe Ottaviano Università di Pisa Bo-June (Paul) Hsu Microsoft Research WWW 2013

2.String auto-completion

3.Scored string sets Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p=“ tr ”, k=2 (triangle, 9), ( trie , 5) three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

4.Space-Efficiency Scored string sets can be very large Hundreds of millions of queries for web search auto-suggest Must fit in RAM for fast access Need space-efficient solutions! We compare three solutions RMQ Trie , based on Range Minimum Queries Completion Trie , based on a modified trie with variable-sized pointers Score-Decomposed Trie , based on succinct data structures

5.RMQ Trie (RT)

6.RMQ Trie Lexicographic order → strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges , can proceed recursively using a heap to retrieve top-k three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

7.RMQ Trie To store the strings we can use any data structure that keeps the strings sorted We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.

8.Completion trie (CT)

9.t i ree ε ε l ε ε ε gle h r e p a l n Node label Branching character (Scored) compacted tries y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9

10.t i ree ε ε l ε ε ε gle h r e p a l n Completion Trie y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9 4 9 9 9

11.Completion Trie t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1 t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 4 9 9 9

12.Completion Trie Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1

13.Score-Decomposed Trie (SDT)

14.Trees as balanced parentheses () () () () (()()()) (()(()()())) 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits

15.Score-decomposed trie Builds on compressed path-decomposed tries [ Grossi -Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels

16.Score-decomposed trie t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 t r i an gle h,2 e,5 p,4 l,1 9 L : t 1 ri 2 a 1 ngle BP: ( ((( ) B : h epl R : 2 541

17.three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

18.Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Data structure to store scores in RT and SDT Packed-blocks array “Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta ,… Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets

19.Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Can be unlucky Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score

20.Space Dataset gzip CT SDT RT QueriesA 27% 57% 30% 31% QueriesB 25% 48% 26% 27% URLs 24% 57% 26% 27% Unigrams 39% 43% 35% 37%

21.Space

22.Time Time per returned completion on a top-10 query

23.Thanks for your attention! Questions?