- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
基于空间优化的TopK补全算法
展开查看详情
1 .Space-Efficient Data Structures for Top-k Completion Giuseppe Ottaviano Università di Pisa Bo-June (Paul) Hsu Microsoft Research WWW 2013
2 .String auto-completion
3 .Scored string sets Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p=“ tr ”, k=2 (triangle, 9), ( trie , 5) three 2 trial 1 triangle 9 trie 5 triple 4 triply 3
4 .Space-Efficiency Scored string sets can be very large Hundreds of millions of queries for web search auto-suggest Must fit in RAM for fast access Need space-efficient solutions! We compare three solutions RMQ Trie , based on Range Minimum Queries Completion Trie , based on a modified trie with variable-sized pointers Score-Decomposed Trie , based on succinct data structures
5 .RMQ Trie (RT)
6 .RMQ Trie Lexicographic order → strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges , can proceed recursively using a heap to retrieve top-k three 2 trial 1 triangle 9 trie 5 triple 4 triply 3
7 .RMQ Trie To store the strings we can use any data structure that keeps the strings sorted We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.
8 .Completion trie (CT)
9 .t i ree ε ε l ε ε ε gle h r e p a l n Node label Branching character (Scored) compacted tries y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9
10 .t i ree ε ε l ε ε ε gle h r e p a l n Completion Trie y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9 4 9 9 9
11 .Completion Trie t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1 t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 4 9 9 9
12 .Completion Trie Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1
13 .Score-Decomposed Trie (SDT)
14 .Trees as balanced parentheses () () () () (()()()) (()(()()())) 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits
15 .Score-decomposed trie Builds on compressed path-decomposed tries [ Grossi -Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels
16 .Score-decomposed trie t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 t r i an gle h,2 e,5 p,4 l,1 9 L : t 1 ri 2 a 1 ngle BP: ( ((( ) B : h epl R : 2 541
17 .three 2 trial 1 triangle 9 trie 5 triple 4 triply 3
18 .Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Data structure to store scores in RT and SDT Packed-blocks array “Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta ,… Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets
19 .Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Can be unlucky Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score
20 .Space Dataset gzip CT SDT RT QueriesA 27% 57% 30% 31% QueriesB 25% 48% 26% 27% URLs 24% 57% 26% 27% Unigrams 39% 43% 35% 37%
21 .Space
22 .Time Time per returned completion on a top-10 query
23 .Thanks for your attention! Questions?