Analyzing Large-Scale Telematics Data with Apache Spark

In recent years, the growing ubiquity of mobile phones with embedded GPS and motion sensors has driven the adoption of vehicle telematics. Telematics provides detailed, continuous information about a vehicle, such as its location, speed, and movement. Vehicle telematics data can further be linked with other spatial data to provide context for understanding driving behavior. Collecting telematics data at high frequency produces a large volume of data that must be processed efficiently. We present a solution that uses Apache Spark to load and transform large-scale telematics data, and then describe how machine learning can be applied to telematics data to gain insights into driving safety.

1. Large-Scale Telematics Analytics in Apache Spark (Wayne Zhang, Uber; Neil Parker, Uber) #DS3SAIS

2. Agenda
• Telematics introduction
• Eng pipeline

3. Telematics
- Wide availability
- Cheap
- Short upgrade cycle
- Lower quality
- Measures phone motion
Source: Smartphone-based Vehicle Telematics: A Ten-Year Anniversary

4. Core Pipeline: Sensor Data Collection → Preprocessing & Transformation → Vehicle Movement Inference → Driving Behavior Inference

5. Phone Sensor Data
● GPS
  ○ Absolute location, velocity and time
  ○ Low frequency (<= 1 point per second)
● IMU
  ○ Relative motion of phone
    ■ Accelerometer: 3D linear acceleration
    ■ Gyroscope: 3D angular velocity
  ○ High frequency (> 20 points per second)
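Since GPS fixes are absolute positions arriving at low frequency, a common derived signal is the ground speed between consecutive fixes. A minimal sketch of that computation (all names here are illustrative, not from the talk):

```java
// Minimal sketch: ground speed between two consecutive GPS fixes via the
// haversine formula. Class and method names are illustrative.
public class GpsSpeed {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Great-circle distance in meters between two lat/lng points (degrees).
    static double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // Speed in m/s between two fixes taken dtSeconds apart.
    static double speedMps(double lat1, double lng1, double lat2, double lng2, double dtSeconds) {
        return haversineMeters(lat1, lng1, lat2, lng2) / dtSeconds;
    }
}
```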

6. GPS Map-Matching

7. Phone Re-Orientation
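One simple way to think about re-orientation: estimate the gravity direction in the phone's frame (e.g., by averaging accelerometer readings while the vehicle is at rest), then split each reading into components along and perpendicular to it. The sketch below shows that projection step; it is an assumed illustration, not the talk's actual method.

```java
// Sketch of one re-orientation step: given an estimate of the gravity
// direction in the phone's frame, split an accelerometer reading into
// its vertical component and horizontal magnitude.
public class Reorient {
    // a: accelerometer reading (3-vector, phone frame).
    // g: estimated gravity direction (3-vector, phone frame, any length).
    // Returns {verticalComponent, horizontalMagnitude}.
    static double[] split(double[] a, double[] g) {
        double norm = Math.sqrt(g[0]*g[0] + g[1]*g[1] + g[2]*g[2]);
        double[] u = {g[0]/norm, g[1]/norm, g[2]/norm};      // unit gravity axis
        double vertical = a[0]*u[0] + a[1]*u[1] + a[2]*u[2]; // projection on gravity
        double hx = a[0] - vertical*u[0];                    // remove vertical part
        double hy = a[1] - vertical*u[1];
        double hz = a[2] - vertical*u[2];
        double horizontal = Math.sqrt(hx*hx + hy*hy + hz*hz);
        return new double[] {vertical, horizontal};
    }
}
```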

8. Long Stop Detection
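A threshold-based sketch of the idea: flag a long stop wherever speed stays near zero for a sustained duration. The thresholds, names, and exact rule below are assumptions for illustration, not the talk's algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a long-stop detector: a "long stop" is a run of consecutive
// GPS samples whose speed stays below minSpeedMps for at least
// minDurationSec.
public class LongStops {
    // timesSec[i] and speedsMps[i] describe sample i, times ascending.
    // Returns {startTime, endTime} pairs of detected long stops.
    static List<double[]> detect(double[] timesSec, double[] speedsMps,
                                 double minSpeedMps, double minDurationSec) {
        List<double[]> stops = new ArrayList<>();
        int runStart = -1;
        for (int i = 0; i <= timesSec.length; i++) {
            boolean slow = i < timesSec.length && speedsMps[i] < minSpeedMps;
            if (slow && runStart < 0) runStart = i;   // low-speed run begins
            if (!slow && runStart >= 0) {             // run ended at sample i-1
                double duration = timesSec[i - 1] - timesSec[runStart];
                if (duration >= minDurationSec)
                    stops.add(new double[] {timesSec[runStart], timesSec[i - 1]});
                runStart = -1;
            }
        }
        return stops;
    }
}
```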

9. Vehicle Movement

10. Phone Mounting

11. Engineering
• Pipeline
  – Past
  – Present
• Problems
How big is our data?
– xPbs per year
– xTbs of sensor data per day
– millions of trips per day

12. Data Pipeline - Past (Streaming)
Input: Schema Topic (Kafka) → Transform: Samza Job → Output: Schema Topic (Kafka)
• Realtime

13. Data Pipeline - Present (Batch)
Select logic in SparkSQL
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
• Flexible
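The select-style transform might look like the following SparkSQL; the table and column names are hypothetical, not Uber's actual schema.

```sql
-- Hypothetical select-style transform: filter and project raw sensor
-- rows from one Hive table into a derived Hive table.
INSERT OVERWRITE TABLE telematics.gps_clean
SELECT trip_id, ts, lat, lng, speed
FROM telematics.gps_raw
WHERE lat IS NOT NULL AND lng IS NOT NULL;
```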

14. Data Pipeline - Actuality (λ)
Streaming: Input: Schema Topic (Kafka) → Transform: Samza Job → Output: Schema Topic (Kafka)
Batch: Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Shared: Business Logic (JVM)

15. Data Pipeline - Present
Select logic in SparkSQL
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)

16. Data Pipeline
Join logic in SparkSQL
Input: Hive Table (HDFS) + Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
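The two-table join step might look like the following SparkSQL; again, table and column names are hypothetical.

```sql
-- Hypothetical SparkSQL join of map-matched GPS points with trip
-- metadata into one output table.
INSERT OVERWRITE TABLE telematics.trip_gps
SELECT t.trip_id, t.driver_id, g.ts, g.lat, g.lng, g.speed
FROM telematics.trips t
JOIN telematics.gps_matched g
  ON t.trip_id = g.trip_id;
```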

17. Data Pipeline
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Hive Table (HDFS) → Output: Hive Table (HDFS)

18. Data Pipeline
Scheduler (every 24 hrs)
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Hive Table (HDFS) → Output: Hive Table (HDFS)

19. Data Pipeline
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)

20. Data Pipeline
Input: Hive Table (HDFS) → Transform: Spark Job → Output/Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)

21. Data Pipeline - Actuality

22. Eng Problems
• Data Sources
• OOM Errors
• Too many Namenode files

23. Eng Problems - Data Sources
Join logic in SparkSQL doesn't work:
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Thrift Binaries (S3)

24. Eng Problems - Data Sources
Input: Thrift Binaries (S3) → Transform: Spark Job → Output: Hive Table (HDFS)
github.com/airbnb/airbnb-spark-thrift

25. Aside: Encode Decode Invariant
Java Thrift class instance ⇄ Spark SQL Row

26. Aside: Encode Decode Invariant
Java Thrift class instance ⇄ Spark SQL Row
Generate random data to test (ScalaCheck library)
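The invariant can be expressed as a property test: for any generated instance, decoding its encoded row must return an equal instance. The talk uses ScalaCheck to generate the random data; the plain-`java.util.Random` stand-in below illustrates the same idea with a hypothetical `Trip` class and row layout.

```java
import java.util.Objects;
import java.util.Random;

// Sketch of the encode/decode invariant: object -> row -> object must be
// the identity. Trip and its row layout are hypothetical examples.
public class RoundTrip {
    static class Trip {
        final String id; final double speedMps;
        Trip(String id, double speedMps) { this.id = id; this.speedMps = speedMps; }
        @Override public boolean equals(Object o) {
            return o instanceof Trip && ((Trip) o).id.equals(id)
                && ((Trip) o).speedMps == speedMps;
        }
        @Override public int hashCode() { return Objects.hash(id, speedMps); }
    }

    static Object[] encode(Trip t) { return new Object[] { t.id, t.speedMps }; }
    static Trip decode(Object[] row) { return new Trip((String) row[0], (Double) row[1]); }

    // Property: decode(encode(t)) equals t for randomly generated trips.
    static boolean holdsFor(long seed, int cases) {
        Random rnd = new Random(seed);
        for (int i = 0; i < cases; i++) {
            Trip t = new Trip("trip-" + rnd.nextInt(1_000_000), rnd.nextDouble() * 50);
            if (!decode(encode(t)).equals(t)) return false;
        }
        return true;
    }
}
```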

27. Eng Problems - Data Sources
Join logic in SparkSQL works:
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Thrift Binaries (S3) → Transform: Spark Job → Output: Hive Table (HDFS)

28. Eng Problems - OOM Errors
• If hitting OOM-related issues, increasing the number of shuffle partitions usually helps:
  – `spark.sql.shuffle.partitions=X`, where X > 200 (the default)
• Might also tune executor-cores and memory settings
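Put together, such settings might be passed to a job roughly as follows; the values, class name, and jar are illustrative, not recommendations.

```shell
# Hypothetical spark-submit showing the knobs mentioned above.
spark-submit \
  --conf spark.sql.shuffle.partitions=2000 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.TelematicsJob telematics-job.jar
```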

29. Eng Problems - # Namenode
• More partitions means more output files
• Solution: merge files after the job runs
  – github.com/apache/parquet-mr/tree/master/parquet-tools
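The merge step might be invoked roughly as follows; the jar name and paths are illustrative.

```shell
# Hypothetical use of the parquet-tools merge command to combine many
# small output files into one.
hadoop jar parquet-tools.jar merge \
  hdfs:///telematics/output/part-00000.parquet \
  hdfs:///telematics/output/part-00001.parquet \
  hdfs:///telematics/output/merged.parquet
```

One caveat worth knowing: `merge` concatenates row groups rather than rewriting them, so merging many tiny files still leaves many tiny row groups.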