Image Similarity Detection at Scale Using LSH and Tensorflow

Learning from images and understanding the quality of content play an important role at Pinterest. This talk presents a Spark-based system responsible for detecting near- (and far-) duplicate images. The system is used to improve the accuracy of recommendations and search results across many of Pinterest's production surfaces.

1.

2.Image Similarity Detection Using LSH and Tensorflow Andrey Gusev June 6, 2018

3.Help you discover and do what you love.

4. 200m+ people on Pinterest each month, 100b+ Pins, 2b+ Boards, 10b+ recommendations/day

5.Agenda 1 Neardup, clustering and LSH 2 Candidate generation 3 Deep dive 4 Candidate selection 5 TF on Spark

6.Neardup

7.

8.Not Neardup

9.Unrelated Duplicate Neardup

10.Clustering

11.Not An Equivalence Class Formulation: for each image, find a canonical image which represents an equivalence class. Problem: neardup is not an equivalence relation because it is not transitive. This means we cannot find a perfect partition such that all images within a cluster are closer to each other than to images in other clusters.

12.Incremental approximate K-Cut Incrementally: 1. Generate candidates via batch LSH search 2. Select candidates via a TF model 3. Take a transitive closure over selected candidates 4. Pass over clusters and greedily select sub-clusters (K-Cut).
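A minimal sketch of step 3 above, the transitive closure over selected candidate pairs, assuming image ids fit into GraphX Long vertex ids; connected components stand in for the closure, and the greedy K-Cut pass is out of scope here.

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Selected candidate pairs from the TF model, e.g. (imgA, imgB).
def transitiveClosure(pairs: RDD[(VertexId, VertexId)]): RDD[(VertexId, VertexId)] = {
  // Build a graph from the pairs and compute connected components; every image
  // ends up labeled with the smallest image id in its component, which is the
  // transitive closure over the selected candidates.
  Graph.fromEdgeTuples(pairs, defaultValue = 0)
    .connectedComponents()
    .vertices // RDD[(imageId, clusterId)]
}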

13.LSH

14.Embeddings and LSH - Visual Embeddings are high-dimensional vector representations of entities (in our case images) which capture semantic similarity. - Produced via Neural Networks like VGG16, Inception, etc. - Locality-sensitive hashing or LSH is a modern technique used to reduce dimensionality of high-dimensional data while preserving pairwise distances between individual points.

15.LSH: Locality Sensitive Hashing - Pick random projection vectors - For each embedding vector, determine on which side of each hyperplane it lands - Same side as the projection vector: set the bit to 1 - Opposite side: set the bit to 0 - Example signatures: Result 1: <1 1 0>, Result 2: <1 0 1>
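A minimal sketch of the sign-random-projection step described above, assuming embeddings and projection vectors are dense double arrays; the names randomPlanes and signBits are illustrative, not from the deck.

import scala.util.Random

object SimHashSketch {
  // Draw numBits random hyperplanes (one projection vector per output bit).
  def randomPlanes(dim: Int, numBits: Int, seed: Long = 42L): Array[Array[Double]] = {
    val rnd = new Random(seed)
    Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
  }

  // For each hyperplane, emit 1 if the embedding lands on its positive side, else 0.
  def signBits(embedding: Array[Double], planes: Array[Array[Double]]): Array[Int] =
    planes.map { plane =>
      val dot = embedding.zip(plane).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 else 0
    }
}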

16.LSH terms Pick an optimal number of terms and bits per term, e.g. 1001110001011000 -> [00]1001, [01]1100, [10]0101, [11]1000, where each [x] is a term index prefix.
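A small sketch of the term construction above, assuming the LSH signature arrives as a bit string and each term packs its index together with its bits into an Int; the splitting follows the slide, the exact packing layout is an assumption.

// Split a bit-string signature into fixed-width terms and prefix each with its index,
// e.g. "1001110001011000" with bitsPerTerm = 4 -> terms [00]1001, [01]1100, [10]0101, [11]1000.
def toTerms(signature: String, bitsPerTerm: Int): Seq[Int] =
  signature.grouped(bitsPerTerm).zipWithIndex.map { case (bits, idx) =>
    // term index in the high bits, term bits in the low bits
    (idx << bitsPerTerm) | Integer.parseInt(bits, 2)
  }.toSeq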

17.Candidate Generation

18.Neardup Candidate Generation - Input Data: RDD[(ImgId, List[LSHTerm])] // billions - Goal: RDD[(ImgId, TopK[(ImgId, Overlap)])] - A k-nearest-neighbor (KNN) problem formulation

19.Neardup Candidate Generation Given a set of documents, each described by LSH terms, for example:
A → (1, 2, 3)
B → (1, 3, 10)
C → (2, 10)
More generally, Di → [tj], where each Di is a document and [tj] is its list of LSH terms (assume each term is a 4-byte integer).
Results (term overlap counts):
A → (B, 2), (C, 1)
B → (A, 2), (C, 1)
C → (A, 1), (B, 1)
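A tiny local check of the toy example above (plain Scala, no Spark), just to make the overlap counting concrete.

val docs = Map("A" -> Set(1, 2, 3), "B" -> Set(1, 3, 10), "C" -> Set(2, 10))

// For each document, count shared LSH terms with every other document.
val overlaps = docs.map { case (id, terms) =>
  id -> docs.collect {
    case (other, otherTerms) if other != id && (terms & otherTerms).nonEmpty =>
      (other, (terms & otherTerms).size)
  }.toList.sortBy(-_._2)
}
// overlaps("A") == List(("B", 2), ("C", 1))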

20.Spark Candidate Generation
1. Input RDD[(ImgId, List[LSHTerm])] ← both index and query sets
2. flatMap, groupBy the index set into RDD[(LSHTerm, PostingList)] ← an inverted index
3. flatMap, groupBy the query set into RDD[(LSHTerm, PostingList)] ← a query list
4. Join (2) and (3), flatMap over the query posting lists, and groupBy query ImgId: RDD[(ImgId, List[PostingList])] ← search results by query
5. Merge List[List[ImgId]] into TopK[(ImgId, Overlap)] by counting the number of times each ImgId is seen → RDD[(ImgId, TopK[(ImgId, Overlap)])]
* PostingList = List[ImgId]
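A condensed sketch of the five steps above, assuming compact integer ids are already in place; top-k selection here is a plain sort-and-take, standing in for the low-level counting collections mentioned later in the deck.

import org.apache.spark.rdd.RDD

object CandidateGen {
  type ImgId = Int
  type LSHTerm = Int

  def candidates(index: RDD[(ImgId, List[LSHTerm])],
                 queries: RDD[(ImgId, List[LSHTerm])],
                 k: Int): RDD[(ImgId, List[(ImgId, Int)])] = {
    // (2) inverted index: term -> posting list of indexed images
    val postings = index.flatMap { case (img, terms) => terms.map(t => (t, img)) }.groupByKey()
    // (3) query list: term -> images querying with that term
    val queryTerms = queries.flatMap { case (img, terms) => terms.map(t => (t, img)) }.groupByKey()
    // (4) join on the term and regroup all matches by query image
    val hits = postings.join(queryTerms)
      .flatMap { case (_, (posting, qs)) => qs.flatMap(q => posting.map(p => (q, p))) }
      .groupByKey()
    // (5) overlap = number of shared terms; keep the top-k candidates per query
    hits.mapValues { seen =>
      seen.groupBy(identity).mapValues(_.size).toList.sortBy(-_._2).take(k)
    }
  }
}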

21.Orders of magnitude too slow.

22.Deep Dive

23.Dictionary encoding
def mapDocToInt(termIndexRaw: RDD[(String, List[TermId])]): RDD[(String, DocId)] = {
  // ensure that mapping between string and id is stable by sorting
  // this allows attempts to re-use partial stage completions
  termIndexRaw.keys.distinct().sortBy(x => x).zipWithIndex()
}
val stringArray = (for (ind <- 0 to 1000) yield randomString(32)).toArray
val intArray = (for (ind <- 0 to 1000) yield ind).toArray
stringArray: 108128 bytes* vs intArray: 4024 bytes* (~25x smaller)
* measured with https://www.javamex.com/classmexer/
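A short usage sketch (assumed, not on the slide) of how the dictionary might be applied to swap string document ids for compact integer ids before building the inverted index.

val docIdMap = mapDocToInt(termIndexRaw)              // RDD[(String, DocId)]
val termIndexCompact = termIndexRaw.join(docIdMap)    // join on the string id
  .map { case (_, (terms, docId)) => (docId, terms) } // RDD[(DocId, List[TermId])]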

24.Variable Byte Encoding - One bit of each byte is a continuation bit (some overhead) - int → byte in the best case (4x) - Combined with dictionary encoding, a 32-char string compresses by up to 25 x 4 = 100x https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html
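A minimal variable-byte codec sketch following the Stanford IR description linked above; the encode/decode names and the object wrapper are illustrative.

object VByte {
  // Encode a non-negative Int: 7 payload bits per byte, most significant group first;
  // the high (continuation) bit is set only on the last byte of each value.
  def encode(n: Int): Array[Byte] = {
    var v = n
    var bytes = List(v % 128)
    while (v >= 128) {
      v /= 128
      bytes = (v % 128) :: bytes
    }
    (bytes.init :+ (bytes.last + 128)).map(_.toByte).toArray
  }

  // Decode a concatenated stream of variable-byte values back into Ints.
  def decode(stream: Array[Byte]): Seq[Int] = {
    val out = scala.collection.mutable.ArrayBuffer[Int]()
    var n = 0
    for (b <- stream) {
      val u = b & 0xff
      if (u < 128) n = 128 * n + u
      else { out += 128 * n + (u - 128); n = 0 }
    }
    out.toSeq
  }
}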

25.Inverted Index Partitioning The inverted index is skewed.
/**
 * Build partitioned inverted index by taking docId modulo the number of term partitions.
 */
def buildPartitionedInvertedIndex(flatTermIndexAndFreq: RDD[(TermId, (DocId, TermFreq))]): RDD[((TermId, TermPartition), Iterable[DocId])] = {
  flatTermIndexAndFreq.map { case (termId, (docId, _)) =>
    // partition documents within the same term to improve balance
    // and reduce the posting list length
    ((termId, (Math.abs(docId) % TERM_PARTITIONING).toByte), docId)
  }.groupByKey()
}

26.Packing (Int, Byte) => Long
Before: Unsorted: 128.77 MB in 549ms; Sort+Limit: 4.41 KB in 7511ms
After: Unsorted: 38.83 MB in 219ms; Sort+Limit: 4.41 KB in 467ms
def packDocIdAndByteIntoLong(docId: DocId, docFreq: DocFreq): Long = {
  (docFreq.toLong << 32) | (docId & 0xffffffffL)
}
def unpackDocIdAndByteFromLong(packed: Long): (DocId, DocFreq) = {
  (packed.toInt, (packed >> 32).toByte)
}

27.Slicing Split the query set into slices to reduce spill and the size of the "widest" portion of the computation; union the per-slice results at the end.
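A minimal sketch of slicing, assuming queries can be split by hashing the image id; numSlices and candidatesForSlice are illustrative stand-ins for the deck's parameters and search step.

import org.apache.spark.rdd.RDD

// Run the batch search one query slice at a time and union the per-slice results,
// so the widest (join/groupBy) stage only ever sees a fraction of the query set.
def searchInSlices[Q, R](queries: RDD[(Int, Q)], numSlices: Int)(
    candidatesForSlice: RDD[(Int, Q)] => RDD[R]): RDD[R] = {
  (0 until numSlices)
    .map { slice =>
      candidatesForSlice(queries.filter { case (imgId, _) => Math.abs(imgId) % numSlices == slice })
    }
    .reduce(_ union _)
}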

28.Additional Optimizations - Cost-based optimizer: significant improvements to runtime can be realized by analyzing the input data sets and setting performance parameters automatically. - Counting: Jaccard overlap counting is done via low-level, high-performance collections. - Off-heap serialization when possible (spark.kryo.unsafe).
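A small configuration sketch for the last point; both settings are standard Spark options, and enabling the unsafe IO path assumes Kryo is the active serializer.

import org.apache.spark.SparkConf

// Use Kryo and its unsafe IO path for faster, more compact serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.unsafe", "true")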

29.Generic Batch LSH Search - Can be applied generically to KNN; embedding agnostic. - Can work on arbitrarily large query sets via slicing.