Office Depot's Practical Experience Building an Intelligent Recommendation System with Analytics Zoo (2020-03-26, Kai Huang)


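Software engineer on the Intel data analytics team. He develops deep learning frameworks based on Apache Spark and helps enterprise customers build end-to-end deep learning applications on big data platforms. He is one of the core contributors to Analytics Zoo and BigDL.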

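Extensive experiments show that deep learning can better help merchants recommend products that match each user's interests. Office Depot introduced the Analytics Zoo toolkit into its recommendation system and trained a variety of recommendation models in a distributed fashion on Spark clusters; the results improved significantly over traditional recommendation algorithms. This talk presents Office Depot's practical experience building an intelligent recommendation system with Analytics Zoo. Those interested can look at the open-source project in advance: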

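The Alibaba open-source big data EMR technical team founded the Apache Spark China technical community, which regularly organizes online and offline Spark events in China. Stay tuned.
WeChat official account: Apache Spark技术交流社区 (Apache Spark Technical Exchange Community)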


1. Use Analytics Zoo to Build an Intelligent Recommendation System at Office Depot
Kai Huang, Mar 26th, 2020

2. Outline
▪ Background and use case overview
▪ Introduction to Analytics Zoo
▪ Recommenders on Analytics Zoo
▪ Performance and deployment by Office Depot
▪ Conclusion

3. Why Recommendation Systems?
▪ Help customers choose from a variety of products.
▪ Maintain user satisfaction and loyalty.
▪ Turn ordinary users into potential customers.
▪ Increase revenue per user visit.
▪ …

4. Big Data Journey for Recommendation
Stage I: Office Depot tried to build intelligent models for product recommendation using Python/SAS/R.
Challenge: this amount of data cannot be processed on a single machine:
▪ Over 100,000,000 distinct sessions monthly.
▪ More than 300,000 active products selling online.
▪ Training data often exceeds 10 GB.

5. Big Data Journey for Recommendation
Stage II: Office Depot incorporated Spark into their workflow.
Challenge: Deep learning libraries such as TensorFlow/Keras/PyTorch cannot run directly on Spark clusters.
Why deep learning?
▪ Better performance on larger data.
▪ Less manual feature engineering needed.
▪ Easier to involve complex functions and combine different architectures.

6. Collaborative Filtering (ALS)
▪ The collaborative filtering approach works by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices.
▪ The Spark ALS (Alternating Least Squares) implementation runs matrix factorization in parallel and therefore scales and performs well.
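
To make the alternating step concrete, here is a minimal pure-Python sketch of alternating least squares restricted to rank-1 factors, so that each update has a closed form. It is an illustration only and not Spark's ALS implementation; the function name `als_rank1`, the regularization value, and the toy ratings are assumptions made for this example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=30):
    """Fit r[u, i] ~= user_f[u] * item_f[i] on the observed entries only.

    ratings maps (user, item) -> observed value; missing pairs are skipped.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Hold item factors fixed; each user factor is a 1-D ridge regression.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in obs)
            den = sum(item_f[i] ** 2 for i, r in obs) + reg
            user_f[u] = num / den
        # Hold user factors fixed; update each item factor the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in obs)
            den = sum(user_f[u] ** 2 for u, r in obs) + reg
            item_f[i] = num / den
    return user_f, item_f


# Toy 3x3 interaction matrix with one unobserved entry: (user 2, item 2).
true_user = [1.0, 2.0, 3.0]
true_item = [2.0, 1.0, 0.5]
ratings = {(u, i): true_user[u] * true_item[i]
           for u in range(3) for i in range(3) if (u, i) != (2, 2)}

user_f, item_f = als_rank1(ratings, n_users=3, n_items=3)
predicted_missing = user_f[2] * item_f[2]  # estimate for the unobserved pair
```

Because each user update depends only on that user's observed items (and likewise for items), the inner loops are independent of one another, which is what makes the method easy to parallelize.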

7. Collaborative Filtering (ALS)
Limitations of matrix factorization:
▪ A simple choice of interaction function hinders performance.
▪ Data sparsity problem.
▪ Not able to do incremental training.
▪ Cold start problem.
▪ Not able to capture the latest purchase intent.
▪ …

8. AI on Big Data
▪ BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark
▪ Analytics Zoo: A unified analytics and AI platform for distributed TensorFlow, Keras, PyTorch and Ray on Apache Spark
Accelerating Data Analytics + AI Solutions At Scale

9. Analytics Zoo: Unified Big Data Analytics and AI Platform
▪ Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
▪ ML Workflow: AutoML for Time Series, Automatic Cluster Serving
▪ Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, RayOnSpark, Spark Dataframes & ML Pipelines for DL, Model Serving
▪ Library & Framework: Distributions (Cloudera/Databricks/…), Distributed Analytics (Spark/Flink/Ray/…), DL Frameworks (TF/PyTorch/…), Python Libraries (Numpy/Pandas/…)

10. Unified Big Data Analytics and AI Platform: Seamless Scaling from Laptop to Production
Production data pipeline stages: Prototype on laptop using sample data; Experiment on clusters with history data; Production deployment w/ distributed data pipeline
• Easily prototype the integrated data analytics & AI solution
• “Zero” code change from laptop to distributed cluster
• Directly access production data (Hadoop/Hive/HBase) without data copy
• Seamlessly deployed on production big data clusters

11. Real-World Applications
▪ NLP Based Customer Service Chatbot for Microsoft Azure*: platforms-on-microsoft-azure-part-1
▪ Industrial Product Defect Detection in Midea*: distributed-tensorflow-on-analytics
▪ Unsupervised Time Series Anomaly Detection for Baosight*: zoo-for-apache-spark-and-bigdl
▪ And many more…

12. Neural Collaborative Filtering (NCF)
▪ NCF simulates matrix factorization using a DNN approach and serves as a guideline for deep learning methods in recommendation services.
▪ It combines GMF (generalized matrix factorization) with an MLP (multi-layer perceptron) to model user-item interactions.
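
The combination of GMF and MLP can be sketched as a single forward pass. The code below is a minimal pure-Python illustration with small random, untrained weights; it is not the Analytics Zoo model, and the names `init_ncf` and `ncf_score`, the embedding sizes, and the single hidden layer are assumptions chosen to keep the example short.

```python
import math
import random


def init_ncf(n_users, n_items, dim=4, hidden=3, seed=0):
    """Create small random (untrained) parameters for the two branches."""
    rng = random.Random(seed)

    def table(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "gmf_user": table(n_users, dim), "gmf_item": table(n_items, dim),
        "mlp_user": table(n_users, dim), "mlp_item": table(n_items, dim),
        "mlp_w": table(hidden, 2 * dim), "mlp_b": [0.0] * hidden,
        "out_w": [rng.uniform(-0.5, 0.5) for _ in range(dim + hidden)],
    }


def ncf_score(p, user, item):
    # GMF branch: element-wise product of user and item embeddings.
    gmf = [a * b for a, b in zip(p["gmf_user"][user], p["gmf_item"][item])]
    # MLP branch: concatenate separate embeddings, then one ReLU layer.
    x = p["mlp_user"][user] + p["mlp_item"][item]
    mlp = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
           for row, b in zip(p["mlp_w"], p["mlp_b"])]
    # Fusion: concatenate both branches and apply a sigmoid output unit.
    logit = sum(w * v for w, v in zip(p["out_w"], gmf + mlp))
    return 1.0 / (1.0 + math.exp(-logit))


params = init_ncf(n_users=5, n_items=7)
score = ncf_score(params, user=2, item=3)  # interaction probability in (0, 1)
```

If the MLP branch were removed and the output weights fixed to one, the score would reduce to a sigmoid of the plain dot product, which is the matrix factorization case.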

13. Wide & Deep Learning
▪ The Wide and Deep learning model can take rich data as input.
▪ The wide part can effectively memorize sparse feature interactions using cross-product feature transformations.
▪ The deep part can generalize to previously unseen feature interactions through low-dimensional user and item embeddings, similar to NCF.
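
A cross-product feature transformation can be illustrated in a few lines. This sketch covers only the wide part; the column names, feature values, and weights are hypothetical and were invented for the example, and the helper names `cross_product` and `wide_logit` are not taken from any library.

```python
def cross_product(example, columns):
    """Return one sparse feature that fires only for this exact value combination."""
    return "&".join(f"{c}={example[c]}" for c in columns)


def wide_logit(example, crosses, weights, bias=0.0):
    """Linear ("wide") score: sum the learned weight of every active cross feature."""
    active = [cross_product(example, cols) for cols in crosses]
    return bias + sum(weights.get(name, 0.0) for name in active)


# Hypothetical columns and weights, for illustration only.
crosses = [("segment", "category"), ("category", "brand")]
weights = {"segment=business&category=chairs": 1.2,
           "category=chairs&brand=acme": 0.3}

seen = {"segment": "business", "category": "chairs", "brand": "acme"}
unseen = {"segment": "student", "category": "chairs", "brand": "other"}

seen_logit = wide_logit(seen, crosses, weights)      # both memorized crosses fire
unseen_logit = wide_logit(unseen, crosses, weights)  # no memorized cross fires
```

The second example contributes nothing in the wide part because its combinations were never observed; covering such cases is the role of the embeddings in the deep part.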

14. Session Recommender
▪ Each user session in an e-commerce system could be modeled as a sequence of web pages.
▪ A deep RNN could track how users browse the website using multiple hidden layers.
Example sequence from the slide figure: “Mouse”, “Monitor”, “Office Chair”, “Desk”, “Mouse”, “Monitor”

15. Session Recommender
The Good:
▪ Can catch the latest purchase intent from current session behavior and adjust its product recommendation in real time.
▪ Can work with both anonymous and identified customers.
▪ No pre-filtering mechanism required, so the serving architecture is simpler.
The Bad:
▪ Sequence window size is hard to set.
▪ Online inference requires lots of resources.
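
The window size trade-off is easier to see with data. The sketch below shows one common way to turn a session into fixed-length training samples for a sequence model; it is an assumption about the preprocessing, not a description of Office Depot's pipeline, and `make_training_pairs` and the `<pad>` token are names invented for the example.

```python
def make_training_pairs(session, window, pad="<pad>"):
    """Turn one session into (input window, next item) pairs."""
    pairs = []
    for t in range(1, len(session)):
        history = session[max(0, t - window):t]
        # Left-pad short histories so every input has the same length.
        history = [pad] * (window - len(history)) + history
        pairs.append((history, session[t]))
    return pairs


session = ["Mouse", "Monitor", "Office Chair", "Desk"]
pairs = make_training_pairs(session, window=2)
# pairs == [(["<pad>", "Mouse"], "Monitor"),
#           (["Mouse", "Monitor"], "Office Chair"),
#           (["Monitor", "Office Chair"], "Desk")]
```

A small window drops older context (here "Mouse" is no longer visible when predicting "Desk"), while a large window means mostly padding for short sessions.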

16. Performance Comparison
Offline measurement:
  Method               Top 5 Accuracy
  Session Recommender  52.3%
  Wide & Deep          45.2%
  NCF                  46.7%
  ALS                  16.2%
Online measurement: Online A/B testing shows that the test group using the Session Recommender lifted sales by 1% and average order value by 1.6% compared to the control group.
*Tested by Office Depot. Note: test data provided by Office Depot.
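
The slide does not define the metric. Assuming "Top 5 Accuracy" means the fraction of test cases in which the true item appears among the five highest-scored recommendations, it could be computed as in the sketch below; the function name and the toy scores are illustrative and are not Office Depot's data.

```python
def top_k_accuracy(predictions, targets, k=5):
    """predictions: one dict {item: score} per test case; targets: true items."""
    hits = 0
    for scores, target in zip(predictions, targets):
        # Keep the k items with the highest predicted score.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy scores for two test cases (made up for the example).
predictions = [
    {"Mouse": 0.9, "Desk": 0.5, "Monitor": 0.1},
    {"Mouse": 0.2, "Desk": 0.7, "Monitor": 0.6},
]
targets = ["Desk", "Mouse"]

acc_at_2 = top_k_accuracy(predictions, targets, k=2)  # 1 hit out of 2 cases
acc_at_3 = top_k_accuracy(predictions, targets, k=3)  # every item is in the top 3
```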

17. Recommendation System In Production
▪ Automated model deployment pipeline.
▪ No downtime when updating the model in production.
▪ Ability to scale up / down according to the current workload using Kubernetes.
Diagram labels: Maintain training code; Model Training (Yarn Cluster); Output model files; Model storage; Load/update model files; Model Serving (Kubernetes Cluster) with LocalPredictor; Request; Real-time prediction with post-filtering rules; Collect new clickstream data.
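
One way to update a model without downtime is to swap the model reference while requests keep being served. The following is a hedged sketch of that idea only: it does not use the Analytics Zoo LocalPredictor API, and `ModelHolder`, the stand-in models, and the stock rule are hypothetical names created for this example.

```python
import threading


class ModelHolder:
    """Serve predictions while allowing the model to be replaced in place."""

    def __init__(self, model, post_filter):
        self._model = model
        self._post_filter = post_filter
        self._lock = threading.Lock()

    def update_model(self, new_model):
        # Swap the reference under a lock; requests never see a missing model.
        with self._lock:
            self._model = new_model

    def recommend(self, session, k=5):
        with self._lock:
            model = self._model  # take a stable reference for this request
        ranked = model(session)
        # Apply the post-filtering rule, then keep the first k items.
        return [item for item in ranked if self._post_filter(item)][:k]


# Hypothetical stand-ins for trained models: each returns a ranked item list.
model_v1 = lambda session: ["Desk", "Mouse", "Monitor"]
model_v2 = lambda session: ["Office Chair", "Desk", "Mouse"]
in_stock = lambda item: item != "Desk"  # example post-filtering rule

holder = ModelHolder(model_v1, in_stock)
before = holder.recommend(["Mouse"], k=2)
holder.update_model(model_v2)
after = holder.recommend(["Mouse"], k=2)
```

Requests already in flight finish with the old model, and later requests pick up the new one, so no request is rejected during the update.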

18. Conclusion and Takeaways
▪ Analytics Zoo integrates deep learning well into existing big data pipelines.
▪ Analytics Zoo provides a model serving API for high-performance real-time inference.
▪ Deep learning based recommendation provides more flexibility to combine different model architectures for different use cases.
▪ Lots of NLP algorithms (for example, transformers) can be utilized for recommendation.
▪ Check out the joint blog for more information: office-depot-using-apache-spark-and-analytics-zoo-on

19. Analytics Zoo on Ali EMR
▪ Analytics Zoo is already available out of the box on Ali EMR.
▪ For more information and support, contact Wesley: Email: DingTalk:
* Version upgrade for Analytics Zoo is ongoing.

20. More Information on Analytics Zoo
• Project websites
• Tutorials
  • CVPR 2018:
  • AAAI 2019:
• “BigDL: A Distributed Deep Learning Framework for Big Data”, in Proceedings of the ACM Symposium on Cloud Computing 2019 (SoCC’19)
• Use cases: Microsoft Azure, CERN, MasterCard, Baosight, Tencent, Midea, etc.

21. Unified Analytics + AI Platform
Distributed TensorFlow, Keras and BigDL on Apache Spark


23. Legal Disclaimers
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at, or from the OEM or retailer.
• No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
• Intel, the Intel logo, Xeon, Xeon Phi, Lake Crest, etc. are trademarks of Intel Corporation in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• © 2019 Intel Corporation
