基于空间优化的TopK补全算法

下载 0

快召唤伙伴们来围观吧
微博 QQ QQ空间 贴吧
文档嵌入链接
<iframe src="https://www.slidestalk.com/u70/TopKCompletion?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
微信扫一扫分享
已成功复制到剪贴板

献良

发布于

6年前

2920

人观看

#信息技术

对于自动补全查询而言，倘若待查的数据巨大，那么自动补全技术设计的算法优化（空间和时间）就显得务必重要了，本文介绍RMQ TRIE(RT)的算法原理。

展开查看详情

1 .Space-Efficient Data Structures for Top-k Completion Giuseppe Ottaviano Università di Pisa Bo-June (Paul) Hsu Microsoft Research WWW 2013

2 .String auto-completion

3 .Scored string sets Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p=“ tr ”, k=2 (triangle, 9), ( trie , 5) three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

4 .Space-Efficiency Scored string sets can be very large Hundreds of millions of queries for web search auto-suggest Must fit in RAM for fast access Need space-efficient solutions! We compare three solutions RMQ Trie , based on Range Minimum Queries Completion Trie , based on a modified trie with variable-sized pointers Score-Decomposed Trie , based on succinct data structures

5 .RMQ Trie (RT)

6 .RMQ Trie Lexicographic order → strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges , can proceed recursively using a heap to retrieve top-k three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

7 .RMQ Trie To store the strings we can use any data structure that keeps the strings sorted We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.

8 .Completion trie (CT)

9 .t i ree ε ε l ε ε ε gle h r e p a l n Node label Branching character (Scored) compacted tries y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9

10 .t i ree ε ε l ε ε ε gle h r e p a l n Completion Trie y e three 2 trial 1 triangle 9 trie 5 triple 4 triply 3 2 1 4 3 5 9 4 9 9 9

11 .Completion Trie t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1 t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 4 9 9 9

12 .Completion Trie Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency t 9 hree 2 ri 9 a 9 e 5 pl 4 e 4 y 3 ngle 9 l 1

13 .Score-Decomposed Trie (SDT)

14 .Trees as balanced parentheses () () () () (()()()) (()(()()())) 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits

15 .Score-decomposed trie Builds on compressed path-decomposed tries [ Grossi -Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels

16 .Score-decomposed trie t i ree ε ε l ε ε ε gle h r e p a l n y e 2 1 4 3 5 9 t r i an gle h,2 e,5 p,4 l,1 9 L : t 1 ri 2 a 1 ngle BP: ( ((( ) B : h epl R : 2 541

17 .three 2 trial 1 triangle 9 trie 5 triple 4 triply 3

18 .Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Data structure to store scores in RT and SDT Packed-blocks array “Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta ,… Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets

19 .Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Can be unlucky Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score

20 .Space Dataset gzip CT SDT RT QueriesA 27% 57% 30% 31% QueriesB 25% 48% 26% 27% URLs 24% 57% 26% 27% Unigrams 39% 43% 35% 37%

21 .Space

22 .Time Time per returned completion on a top-10 query

23 .Thanks for your attention! Questions?

1点赞

1收藏

0下载