Geospatial Analytics at Scale with Deep Learning and Apache Spark

“Deep Learning is now the standard in object detection, but it is not easy to analyze large amounts of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images. In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code because of recent developments in Apache Spark. Thanks to Spark’s unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark.”

1.Geospatial Analytics at Scale with Deep Learning and Apache Spark - Raela Wang, Databricks #UnifiedAnalytics #SparkAISummit

2.About Me • Raela Wang • Solutions Architect @ Databricks • Specialist in Machine Learning solutions

3.What this talk covers
- Image Processing with Apache Spark
- Object Detection with Transfer Learning
  - Deep Learning Pipelines
- Run geolocated queries to analyze results
  - Magellan
- Demo

4.Mapping the world • One of the most ancient big data activities in the world • Critical for navigation, warfare, commercial exploitation

5.Mapping the world with images - A lot of tools and companies now provide geospatial solutions - Increasingly done with a combination of satellites, airplanes and drones (image © Wired / Planet Labs)

6.Mapping the world with images
Large range of new applications
- Disaster recovery: flood survey, fallen trees
- Infrastructure management: road damage
- Economic intelligence: roof inclination for solar panels

7.New Challenges
- Increasing amounts of rich data
  - Cost-effective solutions for acquiring data at scale (drones, CubeSats)
- Difficulty scaling
  - Traditional tools not designed for scalability: how to work at the scale of a country or a continent?
- Pipelining challenges
  - Geospatial work combines many tools and problems: alignment, image corrections, object detection, …
  - All these technologies need to communicate data in a timely fashion

9.vehicle_classes = {18: ('car', 'red'), 23: ('truck', 'orange'), 19: ('bus', 'white', 0.0)}
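
A lookup like this is typically used to keep only the vehicle detections and give each one a label and a display color. The sketch below shows one way to do that. The `(class_id, score)` detection format, the `label_vehicles` helper, and the 0.5 score threshold are assumptions for illustration, not taken from the talk.

```python
# Class-id lookup copied from the slide; the third value on the 'bus' entry
# is kept verbatim because the slide does not say what it means.
vehicle_classes = {
    18: ('car', 'red'),
    23: ('truck', 'orange'),
    19: ('bus', 'white', 0.0),
}


def label_vehicles(detections, min_score=0.5):
    """Keep vehicle detections and attach a label and a display color.

    `detections` is a hypothetical list of (class_id, score) pairs.
    """
    labeled = []
    for class_id, score in detections:
        if class_id in vehicle_classes and score >= min_score:
            # Only the first two fields (label, color) are used here.
            label, color = vehicle_classes[class_id][:2]
            labeled.append({'label': label, 'color': color, 'score': score})
    return labeled


# Made-up detections: class 73 is not a vehicle class, 0.42 is below threshold.
detections = [(18, 0.91), (23, 0.42), (19, 0.77), (73, 0.88)]
vehicles = label_vehicles(detections)
```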

10.Apache Spark: the glue of big data
- Technologies exist in isolation
  - OpenCV: image manipulation
  - TensorFlow, Keras, PyTorch: deep learning
  - PostGIS, GeoMesa, Magellan: geospatial analytics
  - Leaflet.js/OpenStreetMap: visualization
- Apache Spark ties all these libraries together
  - At scale
  - Allows pipelining
  - Easily move data from one technology to another without having to think about data representation

11.High-level View of the Pipeline (diagram labels: map data, XML metadata, UDFs, Transfer Learning with Deep Learning Pipelines, Geospatial Analytics with Magellan, Analyze and Visualize)

12.Parsing Image Data

13.Ingesting Images
Spark 2.3: ImageSchema to read/write image data
- Use the same schema across packages
  - Scikit-image, MMLSpark, OpenCV, PIL, Deep Learning Pipelines, ...
from pyspark.ml.image import ImageSchema
images = ImageSchema.readImages(img_dir, recursive=True, sampleRatio=0.1)
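
A shared schema helps because any package can decode the same record. In Spark's ImageSchema each image is a struct with `origin`, `height`, `width`, `nChannels`, `mode` and `data`, where `data` holds the pixels row by row in OpenCV-compatible order (BGR for 3-channel images). The sketch below reads one pixel from such a record in plain Python. A dict stands in for a DataFrame row, and the `pixel_at` helper and the sample values are illustrative assumptions.

```python
def pixel_at(image, row, col):
    """Return the channel values of one pixel from an ImageSchema-style record.

    `image` is any mapping with the ImageSchema fields `width`, `nChannels`
    and `data`; `data` holds the pixels row by row, channels interleaved
    (BGR order for 3-channel images, following OpenCV).
    """
    n = image['nChannels']
    start = (row * image['width'] + col) * n
    return tuple(image['data'][start:start + n])


# A hand-made 2x2, 3-channel image standing in for one row of the DataFrame.
sample = {
    'origin': 'file:/tmp/example.png',   # hypothetical path
    'height': 2,
    'width': 2,
    'nChannels': 3,
    'mode': 16,                          # OpenCV type code for CV_8UC3
    'data': bytes([
        255, 0, 0,    0, 255, 0,         # row 0: pure blue, pure green
        0, 0, 255,    10, 20, 30,        # row 1: pure red, arbitrary pixel
    ]),
}

blue, green, red = pixel_at(sample, 1, 1)
```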

14.Image Transformations with Spark
- Spark Joins
  - Combine images with XML metadata
- Spark UDFs
  - Eastings and Northings → Latitudes and Longitudes
  - Creating image chips and their respective coordinates
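
The chip-and-coordinates step can be sketched in plain Python, assuming the image bounds are already in latitude/longitude (the Eastings/Northings conversion is not shown). The `chip_bounds` helper, the north-up orientation, and the linear interpolation are assumptions for illustration, not the code from the demo. In Spark, a function like this would typically be wrapped in a UDF.

```python
def chip_bounds(width, height, chip_size, west, south, east, north):
    """Yield (x, y, w, h, chip_west, chip_south, chip_east, chip_north).

    Pixel (0, 0) is the top-left corner, so y grows southward. The image is
    assumed to be north-up and small enough that longitude and latitude vary
    linearly with pixel position.
    """
    lon_per_px = (east - west) / width
    lat_per_px = (north - south) / height
    for y in range(0, height, chip_size):
        for x in range(0, width, chip_size):
            w = min(chip_size, width - x)    # edge chips may be narrower
            h = min(chip_size, height - y)   # or shorter
            yield (
                x, y, w, h,
                west + x * lon_per_px,           # chip west edge
                north - (y + h) * lat_per_px,    # chip south edge
                west + (x + w) * lon_per_px,     # chip east edge
                north - y * lat_per_px,          # chip north edge
            )


# Hypothetical 1000x600 image covering a 0.01 x 0.006 degree box.
chips = list(chip_bounds(1000, 600, 400,
                         west=-1.0, south=50.0, east=-0.99, north=50.006))
```

Each tuple gives the pixel window to crop plus the geographic box that the chip covers, so detections inside a chip can later be mapped back to coordinates.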

15.(Figure: image annotated with (lat, long) coordinates)

16.Deep Learning

17.Success of Deep Learning • Tremendous success of image-based applications • Increased availability of pre-trained models • Quickly building domain-specific models using transfer learning

18.Existing frameworks ● Mostly Python ● Google's TensorFlow is the most popular (easy to install/use) ● PyTorch popular in research ● Others: MXNet, Theano, Caffe, Keras, DeepLearning4J (Java)

19.Spark Deep Learning Pipelines: Deep Learning with Simplicity • Open-source Databricks library • Focuses on ease of use and integration without sacrificing performance • Primary language: Python • Includes APIs to transform images
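
For reference, the transfer-learning pattern shown in the Deep Learning Pipelines (`sparkdl`) README pairs a pre-trained featurizer with a small Spark ML classifier. The sketch below follows that pattern. It does image classification, not the object-detection model used in this talk, and the wrapper function `build_transfer_learning_pipeline` and the `image`/`label` column names are assumptions.

```python
def build_transfer_learning_pipeline(model_name="InceptionV3"):
    """Return an unfitted Spark ML Pipeline: pre-trained featurizer + classifier.

    Imports live inside the function so that merely loading this file does
    not require Spark or sparkdl to be installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from sparkdl import DeepImageFeaturizer

    # Pre-trained network with its final layer removed; emits a feature vector.
    featurizer = DeepImageFeaturizer(
        inputCol="image", outputCol="features", modelName=model_name)
    # Small classifier trained on those features ("label" column is assumed).
    classifier = LogisticRegression(
        maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
    return Pipeline(stages=[featurizer, classifier])
```

With a hypothetical training DataFrame `train_df` holding `image` and `label` columns, usage would look like `model = build_transfer_learning_pipeline().fit(train_df)`.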

20.Geospatial Analytics with Magellan

21.Common geospatial tasks ● Find all objects within an area ● Build geometries ● Cluster and aggregate similar objects ● Infer geometries (roads, buildings, etc.)

22.Magellan • Open-source library for geospatial analytics with Spark • Understands various formats (GeoJSON, …) • Performs basic geometric operations at scale (polygon intersection, joining, …) • Integrates into Spark SQL engine and builds indices for high performance
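
The "find all objects within an area" task comes down to a point-in-polygon test applied to many points. The sketch below shows the classic ray-casting test in plain Python to make the operation concrete. It is not Magellan's implementation or API, and the area and points are made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside `polygon`?

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    joined back to the first. Points exactly on an edge are not handled
    specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # each crossing to the right flips the state
    return inside


# Hypothetical rectangular area and candidate points, as (lon, lat) pairs.
area = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
points = [(1.0, 1.0), (5.0, 1.0), (3.5, 2.5), (2.0, -1.0)]
within = [p for p in points if point_in_polygon(p[0], p[1], area)]
```

At scale, testing every point against every polygon is too slow, which is why a spatial index is used to rule out most pairs before running the exact test.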

23.Demo

24.Recap
1) Read images with Spark
2) Parse image data with OpenCV and Spark UDFs
   a) Slice images into smaller image chips
   b) Generate respective coordinates for image chips
3) Pass data into a pre-trained TensorFlow model and extract predictions with Spark Deep Learning Pipelines
   a) Model was trained on the xView dataset
   b) Model classifies objects identified in images
4) Visualize identified vehicles on a heatmap
5) Cross-check with Magellan
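
Step 4 can be approximated by counting detections per grid cell before handing the counts to a mapping library. The sketch below does that in plain Python. The `heatmap_counts` helper, the 0.001-degree cell size, and the sample coordinates are assumptions, not values from the demo.

```python
from collections import Counter
import math


def heatmap_counts(points, cell_deg=0.001):
    """Count points per grid cell; keys are (lat_index, lon_index) pairs.

    `points` is an iterable of (lat, lon) pairs; `cell_deg` is the cell
    size in degrees (an arbitrary choice here).
    """
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


# Made-up vehicle positions: two fall in the same cell, one in another.
vehicle_points = [(50.0012, -0.9991), (50.0018, -0.9996), (50.0031, -0.9991)]
counts = heatmap_counts(vehicle_points)
```

Cells with higher counts would be drawn as hotter areas on the map.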
