Building a similar-article recall service with doc2vec and Milvus

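The popular approaches to similar-article recall today include bag of words, average word vectors, and tf-idf-weighted word vectors. All of them can produce article vectors, but they fall short in representing an article in semantic space, mainly because they cannot learn word order or sentence-level semantics. Doc2vec, also known as Paragraph Vector, was proposed by Quoc Le and Tomas Mikolov as an extension of the word2vec model. Compared with the conventional word2vec-based approaches, doc2vec takes the order of words in an article into account, so it represents an article's semantics in vector space more accurately. Compared with neural network language models, Doc2vec takes less time and effort, which makes it better suited to production use.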

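Once articles have been converted to semantic vectors, Milvus is used to run similarity search over the feature vectors. This greatly speeds up similar-article recall and makes real-time recall possible. Finally, the similarity scores of the recalled articles are weighted by strategies that fit the business scenario, and the results are sorted to output the similar articles that suit the current business.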

1. Milvus: Building a similar-article recall service with doc2vec and Milvus. Speaker: 松鼠. 2020.05.30

2. Zilliz
• Open source software company based in Shanghai
• Mission: Reinvent data science
• Main contributor of Milvus project
© 2020 Zilliz. All rights reserved.

3. Unlock the treasure of unstructured data
AI algorithms transform images, video, voice, and natural language into vectors, and enable understanding and utilization of unstructured data at scale.
[Diagram: Unstructured data → Deep learning models → Vectors → Knowledge, insight, $]

4. Philosophy of a vector search engine
Ballast of an unstructured database.
[Diagram labels: Unstructured Data (image, video, voice, natural language); store; input; Information Extraction; Object; output; Result; AI Models; Storage; Milvus (Search, Index); query; insert; Knowledge Base; Feature Vectors]

5. Milvus: The journey
Among the most active AI projects in the Linux Foundation.
• 2018.10: The idea
• 2019.04: Milvus 0.1
• 2019.06: 1st seed user
• 2019.10: Open Source
• 2020.03: Joined LF AI

6. Progress
Unstoppable momentum since its debut.
• 5.3K commits
• 3.4K GitHub stars
• 104 contributors
• 14 releases
• 200+ users
• 19 patents filed

7. Users
200+ community users in the first 6 months, and growing rapidly.

8. Useful Links
• https://milvus.io
• https://github.com/milvus-io/milvus
• https://milvusio.slack.com
• https://twitter.com/milvusio
• https://www.facebook.com/io.milvus.5
• https://zhuanlan.zhihu.com/ai-search
• https://medium.com/@milvusio
Live demo: https://milvus.io/scenarios
• Content-based image retrieval system
• Q&A chatbot powered by NLP
• Molecular analysis

9. Common document vector representations: Bag of words
The model converts each document into a fixed-length vector of integers.
Given the sentences:
• John likes to watch movies. Mary likes movies too.
• John also likes to watch football games. Mary hates football.
Vocabulary: ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]
Output vectors of the model (one word count per vocabulary entry):
• [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
• [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
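The bag-of-words example on this slide can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library; the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are made up for illustration and are not from the talk.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(docs):
    """Collect the distinct words of all documents in order of first appearance."""
    vocab, seen = [], set()
    for doc in docs:
        for word in tokenize(doc):
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games. Mary hates football.",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Note that the vectors record only how often each word occurs; shuffling the words of a sentence yields the same vector, which is the word-order limitation the abstract mentions.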

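10. Common document vector representations: Word2vec
The model uses a shallow neural network to embed words into a low-dimensional vector space. The result is a set of word vectors in which vectors that lie close together have similar meanings according to context, while vectors far apart have different meanings.
Word vector: King - Man ≈ Queen - Woman, so King - Man + Woman ≈ Queen
[Figure: the word vectors King, Man, Queen, and Woman plotted in vector space]
Document vector:
• Average word vectors
• tf-idf-weighted word vectors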
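The two ways of turning word vectors into a document vector can be sketched as follows. This is an illustration only: the 2-dimensional word vectors and the tiny corpus are invented, a real system would load vectors from a trained word2vec model, and the smoothed idf formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter

# Toy 2-dimensional word vectors, invented for illustration only.
word_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.2],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
}


def average_vector(tokens, vectors):
    """Unweighted mean of the vectors of all known tokens."""
    dim = len(next(iter(vectors.values())))
    known = [t for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in known) / len(known) for i in range(dim)]


def idf_table(corpus):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_weighted_vector(tokens, vectors, idf):
    """Mean of the token vectors, each weighted by term frequency times idf."""
    dim = len(next(iter(vectors.values())))
    counts = Counter(t for t in tokens if t in vectors and t in idf)
    total = sum(counts[t] * idf[t] for t in counts)
    if total == 0:
        return [0.0] * dim
    return [sum(counts[t] * idf[t] * vectors[t][i] for t in counts) / total
            for i in range(dim)]


# "cat" occurs in every document, so it gets the lowest idf weight.
corpus = [["cat", "dog"], ["cat", "stock"], ["cat", "market"]]
idf = idf_table(corpus)

doc = ["cat", "stock"]
avg = average_vector(doc, word_vectors)
weighted = tfidf_weighted_vector(doc, word_vectors, idf)
print(avg)
print(weighted)
```

In this toy corpus the tf-idf-weighted vector leans toward "stock", the rarer word, while the plain average treats both words equally. Neither variant sees the order of the words.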

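11. Common document vector representations: Doc2vec
The model converts each document into a fixed-length vector.
[Figure, DM model: the paragraph id is looked up in Paragraph Matrix D and the context words ("the", "cute") in word matrix W; the vectors are combined by Average/Concatenate and a Classifier predicts the next word ("cat")]
[Figure, DBOW model: the paragraph id is looked up in Paragraph Matrix D and a Classifier predicts words from the paragraph ("the", "cat", "sat", "on")]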

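12. Similar-article recall: ranking the recall results
• Article features: add similarity scores for title, head, picture, and so on
• Context: the article's geolocation attributes, time attributes, network, weather, and so on
• Operations strategy: certain categories of articles need more exposure
• User features: user interests, age, gender, and so on, used for scoring and ranking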
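One simple way to combine the four groups of signals on this slide is a weighted sum followed by a sort. The sketch below is an assumption about how such a strategy could look, not the speaker's implementation; the field names, the weights, and the additive `boost` for the operations strategy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    article_id: int
    content_sim: float    # similarity score returned by vector recall
    title_sim: float      # article feature: title/head/picture similarity
    context_score: float  # context: location, time, network, weather
    user_score: float     # user features: interests, age, gender
    boost: float = 0.0    # operations strategy: extra exposure for some articles


# Hypothetical weights; in practice they are tuned per business scenario.
WEIGHTS = {"content": 0.5, "title": 0.2, "context": 0.1, "user": 0.2}


def final_score(c, weights=WEIGHTS):
    """Weighted sum of all signals plus the operations boost."""
    return (weights["content"] * c.content_sim
            + weights["title"] * c.title_sim
            + weights["context"] * c.context_score
            + weights["user"] * c.user_score
            + c.boost)


def rank(candidates, weights=WEIGHTS):
    """Sort recalled candidates by final score, best first."""
    return sorted(candidates, key=lambda c: final_score(c, weights), reverse=True)


recalled = [
    Candidate(1, content_sim=0.9, title_sim=0.5, context_score=0.5, user_score=0.5),
    Candidate(2, content_sim=0.8, title_sim=0.5, context_score=0.5, user_score=0.5, boost=0.2),
    Candidate(3, content_sim=0.6, title_sim=0.5, context_score=0.5, user_score=0.5),
]
ranked_ids = [c.article_id for c in rank(recalled)]
print(ranked_ids)
```

Here article 2 overtakes article 1 despite a lower content similarity, because the operations boost raises its final score.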

13. Similar-article recall: flow chart
[Flow chart, indexing: Paragraph → Doc2vec model → vector → Milvus (Partition 1, Partition 2, …, Partition n)]
[Flow chart, query: Query paragraph → Doc2vec model → vector → Milvus → Id → Recall server → result → Rank server → Result]
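The recall step of this flow can be illustrated without a running Milvus instance. The sketch below uses a brute-force, in-memory store as a stand-in for the vector search engine; it is not the Milvus API, and the class name, partition names, and vectors are hypothetical. The query vector is assumed to come from Doc2vec inference on the query paragraph.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force stand-in for a vector search engine; NOT the Milvus API."""

    def __init__(self):
        self.partitions = {}  # partition name -> {article_id: vector}

    def insert(self, partition, article_id, vector):
        self.partitions.setdefault(partition, {})[article_id] = vector

    def search(self, query_vector, top_k, partitions=None):
        """Return the top_k (article_id, similarity) pairs, best first."""
        names = self.partitions if partitions is None else partitions
        scored = (
            (cosine(query_vector, vector), article_id)
            for name in names
            for article_id, vector in self.partitions.get(name, {}).items()
        )
        best = heapq.nlargest(top_k, scored)
        return [(article_id, score) for score, article_id in best]


# Indexing: article vectors (assumed to come from a Doc2vec model) go into partitions.
store = InMemoryVectorStore()
store.insert("sports", 1, [1.0, 0.0])
store.insert("sports", 2, [0.9, 0.1])
store.insert("finance", 3, [0.0, 1.0])

# Query: the vector is assumed to be inferred from the query paragraph.
query_vector = [1.0, 0.0]
hits = store.search(query_vector, top_k=2)
finance_hits = store.search(query_vector, top_k=2, partitions=["finance"])
print(hits)
print(finance_hits)
```

The returned ids and similarity scores are what the recall server would hand to the rank server. Restricting a search to selected partitions narrows the candidate set, which is presumably why the flow chart splits the vectors across partitions.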