Accelerating Real Time Video Analytics on a Heterogeneous CPU + FPGA Platform

The steady growth in computational power has led to a more mature ecosystem for image processing and video analytics. By using deep neural networks for image recognition and object detection, we can achieve better-than-human accuracy. Industrial sectors led by retail and finance want to take advantage of these latest developments in real-time analysis of video content for fraud detection, surveillance and many other applications.

There are two key challenges in the real-world implementation of a video analytics solution:
1) Most video analytics use cases are effective only when response times are in milliseconds. This very low latency requirement creates a need for software and hardware acceleration.
2) Such solutions need widespread deployment and are expected to have a low TCO.

To address these two key challenges, we propose a video analytics solution that leverages Spark Structured Streaming and a DL framework (such as Intel’s Analytics-Zoo and TensorFlow), built on a heterogeneous CPU + FPGA hardware platform.

The proposed solution delivers more than 3x acceleration of a video analytics pipeline compared to a CPU-only implementation, requires zero code changes on the application side, and achieves a more than 2x decrease in TCO. Our video analytics pipeline comprises video stream ingestion, H.264 decoding to image frames, image transformation, and image inference using a deep neural network. The FPGA-based solution offloads the entire pipeline computation to the FPGA, while the CPU-only solution implements the pipeline using OpenCV, Spark Structured Streaming, and Intel’s Analytics-Zoo DL library.

Key Takeaways:

  1. Optimizing the performance of a Spark Streaming + DL pipeline.
  2. Acceleration of the video analytics pipeline using FPGA to deliver high throughput at low latency and reduced TCO.
  3. Performance data for benchmarking CPU and CPU + FPGA based solutions.


2.Accelerating Real Time Video Analytics Using Heterogeneous CPU+FPGA Environment Bhoomika Sharma, Megh Computing #UnifiedDataAnalytics #SparkAISummit

3.Megh Computing
• A startup based in Portland, Oregon, USA with a development office in Bangalore, India
• Vision of enabling the third wave of computing in the data center
• Mission of accelerating real-time analytics using FPGA

4.Agenda
1. Introduction to real-time analytics.
2. Existing software-based real-time video analytics solutions.
3. Video analytics pipeline acceleration using CPU+FPGA platform.
4. Benchmarking between CPU and CPU+FPGA based solutions.

5.Real-Time Analytics

6.Why Real-Time? [Figure: Information Half-Life in Decision Making. Vertical axis: Value of Data to Decision Making. Horizontal axis: Time (Real Time, Secs, Mins, Hours, Days, Months). Curve labels: Predictive / Preventive, Actionable, Reactive, Historical. Regions: Time Critical Decisions; Traditional “Batch” Business Intelligence.]

7.Real-Time Insights
• Hard Real Time: < 1 μs
• Regular Trading: 100 μs
• Fraud Prevention: ms
• Edge Computing (Inference): 10s ms
• Dashboard: 100s ms
• Operational Insights: seconds

8.Existing Real-Time Analytics Solution *ETL = Extract Transform Load

9.Real-Time Video Analytics: extracting value from video to impact business
• Object Detection
• Fraud Detection
Image Source: Pinterest, Towards Data Science

10.Main Phases of Video Analytics Pipeline: Ingest → Transform → Infer

11.CPU Software-Based Solutions

12.Architecture of CPU based Pipeline RTSP = Real Time Streaming Protocol

13.Transform Phase: Image Extraction and Transformation
Video Stream (RTSP) → JavaCV → Image Frame → Persistent Megh Microservice
JavaCV step:
• FFmpeg Library
• H.264 Decoding
• Extracting Image

14.Inference Phase
Image Frame → Persistent Megh Microservice → Spark Structured Streaming (reads from custom data source) → Deep Learning Inference → Processed Image

15.From DStreams to Structured Streaming
Ease of Use
• Based on simple DataFrame API
• Handles backpressure
Consistency
• Output tables are always consistent with all the records

16.Custom Connector: reading from custom data source [Diagram: on the Spark Driver, a Micro Batch Reader plans Input Partitions (Input Partition 1, Input Partition 2, ..., Input Partition n) and exchanges Reader Config and Commit messages with the workers. On each Spark Worker, an Input Partition Reader reads data of size n from a Megh Micro Service.]
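The planning step in the diagram can be illustrated without Spark. The following is a minimal sketch, assuming each micro-batch covers a contiguous range of record offsets that the driver splits into near-equal partitions, one per reader. The names `InputPartition`, `planInputPartitions` and `readPartition` are hypothetical and only mirror the labels on the slide; they are not the Spark data source API or Megh's connector.

```scala
// Hypothetical sketch: plain Scala, no Spark dependency. All names are illustrative.
object MicroBatchPlanSketch {

  /** A half-open range of record offsets [start, end) assigned to one reader. */
  final case class InputPartition(id: Int, start: Long, end: Long) {
    def size: Long = end - start
  }

  /** Driver side: split [start, end) into at most `numPartitions` contiguous,
    * near-equal ranges. Empty ranges are dropped. */
  def planInputPartitions(start: Long, end: Long, numPartitions: Int): Seq[InputPartition] = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(end >= start, "end must not precede start")
    val total = end - start
    val base  = total / numPartitions   // records every partition receives
    val extra = total % numPartitions   // the first `extra` partitions receive one more
    val bounds = (0 to numPartitions).map(i => start + i * base + math.min(i.toLong, extra))
    bounds.zip(bounds.tail).zipWithIndex.collect {
      case ((s, e), id) if e > s => InputPartition(id, s, e)
    }
  }

  /** Worker side: read every record of one partition through a caller-supplied fetch function. */
  def readPartition[A](p: InputPartition, fetch: Long => A): Seq[A] =
    (p.start until p.end).map(fetch)
}
```

For example, `MicroBatchPlanSketch.planInputPartitions(0L, 10L, 3)` yields three partitions of sizes 4, 3 and 3, and reading them in order visits every offset exactly once.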

17.Code Snippet

    val streamData = SQLContext
      .getOrCreate(sc)
      .sparkSession
      .readStream
      // Reads data from custom data source
      .format("com.meghcomputing.videoanalytics.spark.receivers.MeghImageSourceV2")
      // Load properties for custom source
      .options(Map(
        "MEGH_RPC_HOST" -> prop.getProperty(""),
        "MEGH_RPC_PORT" -> prop.getProperty("rpc.server.port"),
        "MEGH_MAX_RECORD" -> prop.getProperty("rpc.max.record")
      ))
      .load()
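Because `java.util.Properties.getProperty` returns `null` for an absent key, a missing entry in the properties file can go unnoticed until the source starts. The sketch below shows one way to resolve the option map up front and report missing keys. `resolveOptions` is a hypothetical helper, not part of the talk's code, and the example uses only the two property keys that are legible on the slide.

```scala
import java.util.Properties

// Hypothetical helper: resolves "option name -> property key" into
// "option name -> value", or reports which property keys are missing or blank.
object SourceOptionsSketch {

  def resolveOptions(
      prop: Properties,
      keys: Map[String, String]
  ): Either[Seq[String], Map[String, String]] = {
    // Option(...) turns the null returned for an absent key into None.
    val resolved = keys.map { case (option, key) =>
      option -> (key, Option(prop.getProperty(key)).map(_.trim).filter(_.nonEmpty))
    }
    val missing = resolved.collect { case (_, (key, None)) => key }.toSeq.sorted
    if (missing.nonEmpty) Left(missing)
    else Right(resolved.collect { case (option, (_, Some(value))) => option -> value })
  }
}
```

A `Right` result can be passed to `.options(...)` as in the snippet above; a `Left` lists the property keys to add to the configuration file.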

18.Deep Learning Inference [Figure: an unlabeled input image (“??”) passes through a Deep Learning Topology and is classified as “CAT”.]

19.Deep Learning Inference
- Unified Analytics + AI Platform
- BigDL for Deep Learning
- Pretrained Squeezenet Quantized Model

20.Code Snippet

    import java.util.Base64
    import org.apache.spark.sql.functions.udf

    val predictImageUDF = udf(
      (uri: String, data: Array[Byte], latency: String) => {
        val st = System.nanoTime()

        // Model, labels and transformation steps are broadcast to all workers
        val featureSteps = featureTransformersBC.value.clonePreprocessing()
        val localModel = modelBroadCast.value
        val labels = labelBroadcast.value

        // Transform image data to the Analytics Zoo default ImageFeature type
        val bytesData = Base64.getDecoder.decode(data)
        val imf = ImageFeature(bytesData, uri = uri)
        val imgSet: ImageSet = ImageSet.array(Array(imf))
        var inputTensor = featureSteps(imgSet.toLocal().array.iterator).next()
        inputTensor = inputTensor.reshape(Array(1) ++ inputTensor.size())

        // Classify the image into its category
        val prediction = localModel
          .doPredict(inputTensor)
          .toTensor[Float]
          .squeeze()
          .toArray()

        // Predict the label: index of the highest score, or "unknown" if out of range
        val predictClass = prediction.zipWithIndex.maxBy(_._1)._2
        if (predictClass < 0 || predictClass > (labels.length - 1)) {
          "unknown"
        } else {
          labels(predictClass).toString()
        }
      }
    )
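The final step of the UDF, picking the class with the highest score and mapping it to a label, can be checked in isolation. This is a minimal sketch in plain Scala with no Analytics Zoo dependency; `topLabel` is a hypothetical name, and the guard for an empty score array is an added assumption, since `maxBy` throws on an empty collection.

```scala
// Hypothetical sketch of the argmax-and-lookup step, independent of Spark and Analytics Zoo.
object LabelLookupSketch {

  /** Returns the label at the index of the highest score, or "unknown" when
    * there are no scores or the index falls outside the label list. */
  def topLabel(scores: Array[Float], labels: IndexedSeq[String]): String =
    if (scores.isEmpty) "unknown"
    else {
      val predictClass = scores.zipWithIndex.maxBy(_._1)._2
      if (predictClass < 0 || predictClass > labels.length - 1) "unknown"
      else labels(predictClass)
    }
}
```

With scores `Array(0.1f, 0.7f, 0.2f)` and labels `Vector("cat", "dog", "bird")` the result is `"dog"`. With a label list shorter than the score array, an out-of-range index falls back to `"unknown"` instead of throwing.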

21.Performance of CPU based solution
• Infrastructure: Cluster with one worker node with Xeon Bronze Processor
• Video Specification: 1080p Resolution
• Throughput: ~22 FPS
• Latency: > 250 ms
*FPS = Frames Per Second

22.Challenges with existing software-based solutions
• Latency: Does not always meet real-time requirement
• Throughput: Non-linear relation with number of nodes
• TCO: Increases with an increase in input feeds
*TCO = Total Cost of Ownership

23.Hardware Accelerators – Alternate Solution
1st Wave: CPU
• General Purpose Architecture
• Non-Deterministic Latency
• Sub-optimal Resource Utilization
2nd Wave: GPU
• Suitable for High Batch
• Non-Deterministic Latency
• Sub-optimal Resource Utilization
3rd Wave: FPGA
• Direct I/O for Ingestion
• Deterministic Latency
• Efficient Resource Utilization

24.FPGA In Brief
• Field Programmable Gate Array
• Customizable Hardware
• Direct I/O for ingestion
• Processing at line rates
• Support for parallel processing

25.Heterogeneous CPU+FPGA Solution

26.Heterogeneous CPU+FPGA based pipeline *FU = Functional Unit

27.Performance of CPU+FPGA based solution
• Infrastructure: Cluster with one worker node with 2 Arria10 FPGAs and Xeon Bronze Processor
• Video Specification: 1080p Resolution
• Throughput: ~240 FPS
• Latency: < 100 ms
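The two performance slides can be compared with simple arithmetic. The sketch below uses only the figures quoted on those slides (~22 FPS and > 250 ms for CPU only, ~240 FPS and < 100 ms for CPU+FPGA). The object and method names are hypothetical, and the results are rough ratios of approximate figures, not additional measurements.

```scala
// Hypothetical sketch: ratios derived from the approximate figures on the two performance slides.
object BenchmarkRatiosSketch {
  val cpuFps: Double = 22.0              // "~22 FPS", CPU-only solution
  val fpgaFps: Double = 240.0            // "~240 FPS", CPU + 2 Arria10 FPGAs
  val cpuLatencyFloorMs: Double = 250.0  // CPU-only latency is "> 250 ms"
  val fpgaLatencyCeilMs: Double = 100.0  // CPU+FPGA latency is "< 100 ms"

  /** Throughput of the CPU+FPGA setup relative to the CPU-only setup. */
  def throughputRatio: Double = fpgaFps / cpuFps

  /** Lower bound on the latency improvement implied by the two bounds. */
  def minLatencyRatio: Double = cpuLatencyFloorMs / fpgaLatencyCeilMs

  /** Average time between output frames at a given throughput, in milliseconds. */
  def frameIntervalMs(fps: Double): Double = 1000.0 / fps

  def main(args: Array[String]): Unit = {
    println(f"throughput ratio: ${throughputRatio}%.1fx")
    println(f"latency improvement: more than ${minLatencyRatio}%.1fx")
    println(f"frame interval: ${frameIntervalMs(cpuFps)}%.1f ms (CPU) vs ${frameIntervalMs(fpgaFps)}%.1f ms (CPU+FPGA)")
  }
}
```

On these figures the throughput ratio is about 10.9x and the latency improvement is more than 2.5x. The frame interval drops from about 45.5 ms to about 4.2 ms.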

28.Distributed System Configuration

29.Megh Solution Stack: reduces complexity of programming FPGA