Working with 1 Million Time Series a Day: How to Scale Up a Predictive Analytics
Most predictive analytics projects no longer rely on a single machine learning model. Instead, they leverage a collection of different algorithms that are periodically evaluated against new data, because the algorithm that performs best today may no longer be the preferable one in the future. To deal with such ever-evolving requirements, we can create architectures that include several different algorithms, which are run and compared automatically every time a decision must be made.

We present a platform built with Apache Spark that predicts the evolution of the prices of about 150 thousand goods tracked in real time. The requirement was to analyze these time series and predict the expected price of each object over the five subsequent days. Our platform leverages Spark in two significant ways:

1. Computational effort: every model, with its related parameter tweaks, needs to be run on every object. For each object our infrastructure identifies the optimal algorithm, and the related prediction is published. The process repeats every day.
2. Storage capabilities: these are pivotal if we want to scale up to handle ever-growing data streams.

Compared to the original single-machine code, switching to parallel computing allowed us to run and compare the models faster, which also opened up the possibility of further experimenting with different parameters and additional exogenous variables.

Questions you’ll be able to confidently answer after the session:

– When does it make sense to set up a model based on a pool of different algorithms?
– When is it time to switch to parallel computing?
– What should I do if I want to scale up my model?
– How complicated is it to turn an already-written, sequential model into its parallel computing version?
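The per-object selection step described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the three forecasters and helper names are hypothetical stand-ins for the pool of algorithms, and the scoring uses a simple holdout window with mean absolute error. The key idea is that, for each of the ~150k series, every candidate model is fit and scored, and only the winner's five-day-ahead prediction is published.

```python
# Hypothetical sketch of per-series model selection: fit several simple
# forecasters, score each on a holdout window, and let the best one
# produce the 5-day-ahead prediction. In the platform described, this
# per-object routine is what gets repeated daily across every series.

def naive_forecast(history, horizon):
    # Repeat the last observed price.
    return [history[-1]] * horizon

def moving_average_forecast(history, horizon, window=3):
    # Forecast the mean of the most recent `window` prices.
    avg = sum(history[-window:]) / window
    return [avg] * horizon

def linear_trend_forecast(history, horizon):
    # Ordinary least-squares line through the history, extrapolated forward.
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in enumerate(history)) / denom
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n - 1 + h) for h in range(1, horizon + 1)]

MODELS = {
    "naive": naive_forecast,
    "moving_average": moving_average_forecast,
    "linear_trend": linear_trend_forecast,
}

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def select_and_predict(series, horizon=5, holdout=5):
    # Score every candidate model on the last `holdout` points,
    # then refit the winner on the full series for the final forecast.
    train, valid = series[:-holdout], series[-holdout:]
    scores = {name: mae(valid, model(train, holdout))
              for name, model in MODELS.items()}
    best = min(scores, key=scores.get)
    return best, MODELS[best](series, horizon)
```

On a steadily trending price series the trend model wins; on a flat, noisy one the naive or moving-average model typically takes over — which is exactly why re-running the comparison every day matters.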
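On the last question — turning sequential code into its parallel version — the structural change is often smaller than expected. The sketch below is an analogy, not Spark code: a thread pool stands in for the cluster (Spark would distribute the same per-object function over an RDD or DataFrame), and the per-series "model" is a deliberately trivial placeholder. The point is that the per-object function stays unchanged; only the driver loop around it does.

```python
# Hypothetical illustration of sequential -> parallel conversion.
# The per-series function is untouched; only the loop that drives it changes.
from concurrent.futures import ThreadPoolExecutor

def forecast_one(item):
    # Placeholder "model": repeat the last observed price for 5 days.
    object_id, history = item
    return object_id, [history[-1]] * 5

series = {
    "good_a": [10.0, 10.5, 11.0],
    "good_b": [4.0, 3.9, 3.8],
}

# Sequential version: one object at a time.
sequential = dict(forecast_one(item) for item in series.items())

# Parallel version: the same function, mapped over a pool of workers.
# With Spark, this map would instead run across the cluster's executors.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = dict(pool.map(forecast_one, series.items()))

assert sequential == parallel  # identical results, different execution
```

Because each series is processed independently, the problem is embarrassingly parallel — which is what made the per-object loop a natural fit for Spark in the first place.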