Driver Location Intelligence at Scale using Apache Spark, Delta Lake, and MLflow on Databricks

TomTom's mission is to create a world free of congestion with a better driving experience. To do that, we need to understand the driving behaviour of end users while optimizing the operational costs of our services. However, because of the large scale of our vehicle probe data, providing insights and performing advanced analytics can be quite challenging.

During this talk I will showcase two use cases where Databricks, Delta Lake and MLflow have enabled us to accelerate innovation. The first is the IQMaps use case. IQMaps is a system designed specifically for in-dash systems, taking the same up-to-date user experience you expect from navigation apps and bringing it to reliable in-car navigation. IQMaps learns the driver's driving patterns and updates the map regions that are most relevant to the user, using Wi-Fi or 4G. However, keeping the map up to date for the best driving experience while limiting network data consumption, which can be costly, requires complex simulations over millions of location traces from vehicles. Apache Spark has been our key instrument for finding the best balance in this trade-off.

The second use case is Destination Prediction. For many years, our navigation products have offered a personalized feature that predicts the driver's next destination with high accuracy. With the exponential increase in available data and access to more sophisticated machine learning models, we have revisited this feature to take it to the next level. Both use cases take advantage of the latest frameworks and tools available on Databricks. With MLflow and Delta Lake we have been able to find the best models for predicting the destination of each individual driver, and to track each of the KPIs.


2.Driver Location Intelligence at Scale using Apache Spark, Delta Lake and MLflow on Databricks. Sergio Ballesteros, TomTom; Kia Eisinga, TomTom. #UnifiedDataAnalytics #SparkAISummit

3.Our vision: A safe, connected, autonomous world that is free of congestion and emissions.

4.Big data drives our business, but data privacy always comes first

5.Data • Anonymous location (GPS) traces

6.Every day: 742,000,000 km, 18,000 x


8.Data • Anonymous location (GPS) traces • Community inputs • User events • Journalistic data • Car sensor data

9.Data flow • ~80 billion data points per day • ~150 trillion data points

10.Data flow

11.Data flow

12.Use case 1: IQMaps analytics. In-dash systems are outperformed by smartphones. The embedded system is expected to be up to date, with no user interaction, and its most visible component is the map.

13.Drivers do not update their maps. Today’s solutions provide manual updates, often requiring a drive to the dealer. This is far too complex and inefficient.


15.OEMs require data-efficient solutions. While drivers expect an up-to-date system, carmakers are usually concerned about the data cost of map management.


17.When radius is 0 km • User drives within 2 regions every weekday • Radius of 0 km • Download and install just the home regions • Cellular data usage kept to a minimum

18.When radius is 150 km • User drives within 2 update regions every weekday • Radius of 150 km • Home region: 6 update regions • Cellular data usage increased
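
The two radius slides can be read as a region-selection rule: the larger the radius around the regions a driver visits regularly, the more update regions are kept fresh over the cellular connection. The following is a minimal sketch of that idea, not the IQMaps implementation. It assumes each update region is represented by a centre point and is selected when that centre lies within the radius of any home region centre. The function names, the region identifiers and the coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def regions_to_update(region_centres, home_region_ids, radius_km):
    """Return the ids of all regions whose centre is within radius_km of a home region.

    Assumption: centre-to-centre distance stands in for the real region geometry.
    """
    homes = [region_centres[r] for r in home_region_ids]
    return {
        region_id
        for region_id, centre in region_centres.items()
        if any(haversine_km(centre, home) <= radius_km for home in homes)
    }


# Illustrative (made-up) region centres; "A" and "B" are the driver's home regions.
REGION_CENTRES = {
    "A": (52.37, 4.90),
    "B": (52.09, 5.12),
    "C": (51.92, 4.48),  # roughly 50-60 km from the home regions
    "D": (48.86, 2.35),  # several hundred km away
}
HOME_REGIONS = ["A", "B"]

for radius in (0, 10, 150):
    selected = regions_to_update(REGION_CENTRES, HOME_REGIONS, radius)
    print(radius, sorted(selected))
```

With a radius of 0 km only the home regions themselves are selected, which matches the "download and install just the home regions" case. Sweeping the radius over many simulated drivers is the kind of experiment where each radius could be logged as a parameter and the resulting data usage as a metric.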

19.IQMaps demo with MLflow


21.Real results using 0.5M trips: “This insight has led me to the conclusion that a default radius of 150 km is unnecessary, and a small radius of ~10 km would already satisfy most drivers while keeping cellular data usage low for OEMs.” - Rolf Dorland, PM at TomTom

22.Going on holidays • User goes on holiday (to a less frequently updated region) • Once the user starts driving, updates for all update regions the route passes through are downloaded and installed.

23.Destination prediction


25.Opportunity • Past: Rule-based solution • Delta Lake pipelines • Present: Machine Learning

26.Data • Original trace data from 1 source: 227K device serials • Filtering out invalid trips: 143K device serials • Users with at least 50 trips: 3.6K device serials • Devices feasible for modelling: 2.5K device serials
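
The first steps of this funnel (drop invalid trips, then keep devices with at least 50 trips) can be sketched in a few lines. This is only an illustration in plain Python: the record layout with `device_serial` and `is_valid` fields is an assumption, the slides do not say how a trip is judged invalid, and the final "feasible for modelling" step is left out because its criteria are not given. On Spark the same logic would typically be a filter followed by a group-by count.

```python
from collections import Counter

MIN_TRIPS = 50  # threshold stated on the slide


def devices_with_enough_trips(trips, min_trips=MIN_TRIPS):
    """Return device serials that have at least min_trips valid trips.

    Each trip is a dict with hypothetical keys 'device_serial' and 'is_valid'.
    """
    valid_trips = (t for t in trips if t["is_valid"])
    counts = Counter(t["device_serial"] for t in valid_trips)
    return {serial for serial, n in counts.items() if n >= min_trips}


# Synthetic example: only device "a" reaches 50 valid trips.
example_trips = (
    [{"device_serial": "a", "is_valid": True} for _ in range(60)]
    + [{"device_serial": "b", "is_valid": True} for _ in range(49)]
    + [{"device_serial": "c", "is_valid": i % 2 == 0} for i in range(80)]
)
print(devices_with_enough_trips(example_trips))
```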

27.Features What do we use in the end? For each trip, we have the following information: • Where did the trip start? • At what speed were you driving when the trip started? • What was the time of day (morning/afternoon/evening) when the trip started? • Was it rush hour when the trip started? • What day of the week was it? • Was it a weekend day? • What was the season? • Which driver profile do you belong to? Historical information: • Which destination did you go to on your last trip? And the one before that? And the one before that? • If it is, say, a Monday, where did you go the last Monday you made a trip? (repeated for every weekday) To predict: • To which destination are you going?
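
Most of the features listed above can be derived from the trip start timestamp and the driver's trip history. The sketch below shows one possible way to build them. The time-of-day boundaries, the rush-hour windows and the season mapping (Northern-hemisphere, by calendar month) are assumptions made for illustration, as are all function and field names. The driver profile feature is omitted because the slides do not describe how it is computed.

```python
from datetime import datetime

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def time_of_day(hour):
    # Assumed boundaries: morning 05-12, afternoon 12-18, evening otherwise.
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"


def is_rush_hour(ts):
    # Assumed rush-hour windows on working days: 07-09 and 16-19.
    return ts.weekday() < 5 and (7 <= ts.hour < 9 or 16 <= ts.hour < 19)


def season(month):
    # Assumed mapping: Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov autumn.
    return ["winter", "spring", "summer", "autumn"][(month % 12) // 3]


def trip_features(start_time, start_lat, start_lon, start_speed_kmh, history):
    """Build one feature row for a trip.

    history: earlier trips of the same driver as (start_time, destination_id)
    tuples, ordered oldest first.
    """
    features = {
        "start_lat": start_lat,
        "start_lon": start_lon,
        "start_speed_kmh": start_speed_kmh,
        "time_of_day": time_of_day(start_time.hour),
        "is_rush_hour": is_rush_hour(start_time),
        "day_of_week": start_time.weekday(),  # 0 = Monday
        "is_weekend": start_time.weekday() >= 5,
        "season": season(start_time.month),
    }
    # Destinations of the last three trips, most recent first.
    recent = [dest for _, dest in reversed(history)]
    for k in range(3):
        features[f"prev_dest_{k + 1}"] = recent[k] if k < len(recent) else None
    # Most recent destination for every weekday; later trips overwrite earlier ones.
    for name in WEEKDAYS:
        features[f"last_dest_{name}"] = None
    for ts, dest in history:
        features[f"last_dest_{WEEKDAYS[ts.weekday()]}"] = dest
    return features


example_history = [
    (datetime(2019, 10, 9, 8, 0), "work"),
    (datetime(2019, 10, 14, 18, 0), "gym"),
    (datetime(2019, 10, 15, 8, 0), "work"),
    (datetime(2019, 10, 15, 17, 30), "home"),
]
example_row = trip_features(datetime(2019, 10, 16, 8, 30), 52.37, 4.90, 12.0, example_history)
print(example_row)
```

The destination identifiers in the history would come from the labelling step, so that each past trip refers to a destination cluster rather than a raw coordinate.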

28.Labels How do we define where you are going? • We are given the latitude and longitude of each trip's destination. • To find out which latitude/longitude pairs belong to the same destination, we apply a clustering algorithm called DBSCAN. • DBSCAN clusters together destinations that are within 500 meters of each other. A destination needs at least 5 trips to count as a cluster.
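
To make the labelling step concrete, here is a small, dependency-free sketch of DBSCAN over trip end points, using the two parameters from the slide: a neighbourhood of 500 meters and a minimum of 5 trips. It is an illustration and not the code used in the talk. The haversine distance, the brute-force neighbour search and the sample coordinates are assumptions, and the minimum count includes the point itself. In practice a library implementation, such as scikit-learn's `DBSCAN` with the haversine metric, would be the more usual choice.

```python
import math

EPS_M = 500.0        # neighbourhood radius from the slide
MIN_TRIPS = 5        # minimum trips per destination from the slide
EARTH_RADIUS_M = 6_371_000.0
NOISE = -1


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m=EPS_M, min_samples=MIN_TRIPS):
    """Return one cluster label per point; NOISE (-1) means no destination."""
    n = len(points)
    # Brute-force neighbour lists (O(n^2)); each point is its own neighbour.
    neighbours = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
        cluster += 1
    return labels


# Made-up trip end points: two frequent destinations, one place visited
# only 3 times, and one isolated point.
offsets = [0.0, 0.0005, -0.0005, 0.001, -0.001]
destinations = (
    [(52.3700 + d, 4.9000 + d) for d in offsets]
    + [(52.3400 + d, 4.8500 - d) for d in offsets]
    + [(52.4000 + d, 4.9500) for d in offsets[:3]]
    + [(52.5000, 5.2000)]
)
labels = dbscan(destinations)
print(labels)
```

Trips labelled as noise have no destination cluster, so they cannot serve as prediction targets in this scheme.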