Randomized / Hashing Algorithms

1. Randomized / Hashing Algorithms. Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)

2. Outline: Bloom filters, locality-sensitive hashing, stochastic gradient descent, stochastic SVD (some already covered, the rest in next Wednesday's lecture)

3. Hash Trick: Insights
- Save memory: don't store hash keys.
- Allow collisions, even though they distort your data somewhat.
- Let the learner (downstream) take up the slack.
Here's another famous trick that exploits these insights…

4. Bloom filters. Interface to a Bloom filter:

BloomFilter(int maxSize, double p);
void bf.add(String s);         // insert s
bool bf.contains(String s);
  // If s was added, return true;
  // else with probability at least 1-p, return false;
  // else with probability at most p, return true

I.e., a noisy "set" where you can test membership (and that's it).

5. One possible implementation:

BloomFilter(int maxSize, double p) {
  set up an empty length-m array bits[];
}
void bf.add(String s) {
  bits[hash(s) % m] = 1;
}
bool bf.contains(String s) {
  return bits[hash(s) % m];
}
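Below is a minimal runnable Python sketch of this one-hash version. The class name and the md5-based hash are my own illustrative choices; the slide's pseudocode does not pin down a hash function.

import hashlib

class OneHashBloomFilter:
    """Single-hash Bloom 'filter': one bit array, one hash function."""
    def __init__(self, m):
        self.m = m              # number of bits
        self.bits = [0] * m

    def _hash(self, s):
        # Stable hash of the string, reduced modulo m
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % self.m

    def add(self, s):
        self.bits[self._hash(s)] = 1

    def contains(self, s):
        return self.bits[self._hash(s)] == 1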

6. How well does this work? [Plot: m = 1,000, x ≈ 200, y ≈ 0.18]

7. How well does this work? [Plot: m = 10,000, x ≈ 1,000, y ≈ 0.10]

8. A better(?) implementation:

BloomFilter(int maxSize, double p) {
  set up an empty length-m array bits[];
}
void bf.add(String s) {
  bits[hash1(s) % m] = 1;
  bits[hash2(s) % m] = 1;
}
bool bf.contains(String s) {
  return bits[hash1(s) % m] && bits[hash2(s) % m];
}
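A sketch of this variant in Python, generalized to k salted hashes (k = 2 matches the slide). Salting the input with the hash index to simulate independent hash functions is my assumption, not something the slide specifies.

import hashlib

class KHashBloomFilter:
    """Bloom filter with k salted hash functions."""
    def __init__(self, m, k=2):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, s):
        # Salt the input with the hash index to get k (roughly) independent hashes
        for i in range(self.k):
            h = hashlib.md5(f"{i}|{s}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos] = 1

    def contains(self, s):
        return all(self.bits[pos] for pos in self._positions(s))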

9. How well does this work? [Plots: m = 1,000, x ≈ 200, y ≈ 0.18; and m = 1,000, x ≈ 13,000, y ≈ 0.01]

10. Bloom filters: an example application. Finding items in "sharded" data:
- Easy if you know the sharding rule.
- Harder if you don't (as with the Google n-grams).
Simple idea:
- Build a BF of the contents of each shard.
- To look for a key, load the BFs one by one, and search only the shards that probably contain the key (see the sketch below).
Analysis: you won't miss anything, but you might look in some extra shards. You'll hit O(1) extra shards if you set p = 1/#shards.
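A hedged sketch of the lookup loop; load_filter and search_shard are hypothetical I/O callbacks, not part of any real API:

def find_key(key, shard_ids, load_filter, search_shard):
    """Search only the shards whose Bloom filter claims to contain key."""
    for shard_id in shard_ids:
        bf = load_filter(shard_id)               # small BF, cheap to load
        if bf.contains(key):                     # may be a false positive
            value = search_shard(shard_id, key)  # expensive full lookup
            if value is not None:
                return value
    return None  # no false negatives: if we get here, no shard holds key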

11. Bloom filters: an example application. Discarding rare features from a classifier seldom hurts much and can speed up experiments. Scan through the data once and check each word w:

if bf1.contains(w):
  if bf2.contains(w):
    bf3.add(w)
  else:
    bf2.add(w)
else:
  bf1.add(w)

Now:
- bf2.contains(w) ⇒ w appears >= 2x
- bf3.contains(w) ⇒ w appears >= 3x
Then train, ignoring words not in bf3.
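The cascade translates directly to Python; here is a runnable sketch reusing the KHashBloomFilter from the slide-8 sketch (the size m is illustrative):

def words_seen_3x(stream, m=1_000_000):
    # bf1: seen >= 1x, bf2: seen >= 2x, bf3: seen >= 3x (modulo false positives)
    bf1, bf2, bf3 = (KHashBloomFilter(m) for _ in range(3))
    for w in stream:
        if bf1.contains(w):
            if bf2.contains(w):
                bf3.add(w)
            else:
                bf2.add(w)
        else:
            bf1.add(w)
    return bf3

bf3 = words_seen_3x(["a", "b", "a", "a", "b"])
print(bf3.contains("a"), bf3.contains("b"))  # True False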

12. Bloom filters. Analysis (m bits, k hash functions), assuming hash(i, s) is a random function:
- Pr(bit j is unset after n adds) = (1 − 1/m)^(kn) ≈ e^(−kn/m)
- Pr(collision), i.e., the false-positive rate: p = (1 − e^(−kn/m))^k
Fixing m and n, p is minimized at k = (m/n) ln 2, which gives p = (1/2)^k ≈ 0.6185^(m/n).

13. Bloom filters. Analysis: plug the optimal k = (m/n) ln 2 back into Pr(collision):
p = (1/2)^((m/n) ln 2) ≈ 0.6185^(m/n)
Now we can fix any two of p, n, m and solve for the third, e.g., the value of m in terms of n and p:
m = −n ln p / (ln 2)^2
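These two formulas make a handy sizing helper; a minimal sketch, assuming the standard analysis above:

import math

def bloom_parameters(n, p):
    """Given expected item count n and target false-positive rate p,
    return (m, k): bits needed and the optimal number of hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

print(bloom_parameters(1_000_000, 0.01))  # ~9.6M bits, k = 7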

14. Bloom filters demo: http://www.jasondavies.com/bloomfilter/

15. Locality Sensitive Hashing (LSH). Two main approaches: random projection and minhashing.

16. LSH: key ideas. Goal: map a feature vector x to a bit vector bx, ensuring that bx preserves "similarity".

17. Random Projections

18. Random projections. [Figure: '+' and '−' points separated by a margin γ, with two nearby points u and u2.] To make those points "close", we need to project onto a direction orthogonal to the line between them.

19. Random projections. [Figure: same points.] Any other direction will keep the distant points distant. So if I pick a random r, and r·x and r·x′ are closer than γ, then probably x and x′ were close to start with.

20. LSH: key ideas. Goal: map a feature vector x to a bit vector bx, ensuring that bx preserves "similarity". Basic idea: use random projections of x. Repeat many times:
- Pick a random hyperplane r.
- Compute the inner product of r with x.
- Record whether x is "close to" r (r·x >= 0) as the next bit in bx.
Theory says that if x and x′ have small cosine distance, then bx and bx′ will have small Hamming distance; a sketch follows below.
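A minimal NumPy sketch of this construction (dimensions and bit counts are illustrative); the near pair should disagree on far fewer bits than the unrelated pair:

import numpy as np

def lsh_signature(x, R):
    # One bit per random hyperplane: the sign of the inner product r.x
    return (R @ x >= 0).astype(int)

rng = np.random.default_rng(0)
d, output_bits = 100, 32
R = rng.standard_normal((output_bits, d))   # one N(0,1) hyperplane per bit

x = rng.standard_normal(d)
x_near = x + 0.1 * rng.standard_normal(d)   # small cosine distance to x
x_far = rng.standard_normal(d)              # unrelated direction

bx, b_near, b_far = (lsh_signature(v, R) for v in (x, x_near, x_far))
print((bx != b_near).sum(), (bx != b_far).sum())  # small vs. large Hamming distance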

21. LSH: key ideas. Naïve algorithm:

Initialization:
  For i = 1 to outputBits:
    For each feature f:
      Draw r(f, i) ~ Normal(0, 1)

Given an instance x:
  For i = 1 to outputBits:
    LSH[i] = sum(x[f] * r[i, f] for f with non-zero weight in x) > 0 ? 1 : 0
  Return the bit-vector LSH

Problem: the array of r's is very large.
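One common workaround (my addition; the slide only states the problem) is to never materialize r: regenerate each r(f, i) on demand from a seed derived from (f, i), trading compute for O(1) memory in the number of features. A sketch:

import hashlib
import numpy as np

def lsh_bit(x, i):
    """Bit i of the signature for sparse x = {feature: weight}.
    r(f, i) ~ Normal(0, 1) is redrawn deterministically from a
    per-(feature, bit) seed instead of being stored."""
    total = 0.0
    for f, weight in x.items():
        seed = int(hashlib.md5(f"{f}|{i}".encode()).hexdigest()[:8], 16)
        total += weight * np.random.default_rng(seed).standard_normal()
    return 1 if total >= 0 else 0

x = {"dog": 1.5, "cat": -0.3, "fish": 2.0}
signature = [lsh_bit(x, i) for i in range(16)]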


29. Distance Measures. Goal: find near-neighbors in high-dimensional space.
- We formally define "near neighbors" as points that are a "small distance" apart.
- For each application, we first need to define what "distance" means.
Today: Jaccard distance/similarity.
- The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
- Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
Example: 3 elements in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8.
[J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org]
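The worked example in a few lines of Python:

def jaccard_similarity(c1, c2):
    return len(c1 & c2) / len(c1 | c2)

# 3 elements in the intersection, 8 in the union:
c1 = {1, 2, 3, 4, 5, 6}
c2 = {4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))      # 3/8 = 0.375
print(1 - jaccard_similarity(c1, c2))  # Jaccard distance = 5/8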