AI on Spark for Malware Analysis and Anomalous Threat Detection

Demonstrate how Avast leverages AI and big data to burn malware.

  1. Identify - threat researcher
  2. Block - operator
  3. Analyze and automate - data / AI researcher + engineers

2. Jakub Sanojca & João Da Silva, Avast (Researcher, Data Engineer)

4. Goal: Demonstrate how Avast leverages AI and big data to burn malware.

6. Agenda
  • What Avast does
  • Malware research
  • Structured Streaming
  • AI anomaly detection
  • Demo

8. Thank you
  • Big Data Systems
  • AI team - especially Yura, Olga and Dmitry
  • Threat researchers and analysts

9. Avast is dedicated to creating a world that provides safety and privacy for all, no matter who you are, where you are, or how you connect.

10. Global reach: Portfolio of security, privacy and utility applications #UnifiedDataAnalytics #SparkAISummit

11. World’s Largest Detection Network
  • 200B+ URLs
  • 300M+ new files monthly
  • 10,000+ globally distributed servers

12. Training the Avast Machine Learning Engine: Purpose-built approach that takes < 12 hours to add new features, train, and deploy into production

13. Malware classification
  Data
  ● >500 features handcrafted by our experts from binary files
  Task
  ● Classification of files as clean, malware, or PUP (potentially unwanted program)
  Two-step ML pipeline:
  ● Cluster data with custom k-means
  ● Classification inside each cluster is done by Random Forest
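
The two-step pipeline on this slide can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: it uses textbook Lloyd's k-means rather than Avast's custom variant, a per-cluster majority vote stands in for the Random Forest, and the 2-D points and labels are made-up toy data rather than the >500 real features. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Textbook Lloyd's k-means (assumption: stand-in for the custom k-means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for i in range(k):
            members = groups.get(i)
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Keep the centroid of an empty cluster where it is.
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return centroids


def train(points, labels, k):
    """Step 1: cluster. Step 2: fit one tiny 'model' per cluster."""
    centroids = kmeans(points, k)
    votes = defaultdict(Counter)
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    # Assumption: majority label per cluster replaces the Random Forest.
    models = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return centroids, models


def predict(point, centroids, models, default=None):
    """Route the sample to its cluster, then ask that cluster's model."""
    return models.get(nearest(point, centroids), default)


# Toy data: two well-separated groups of 2-D feature vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["clean", "clean", "clean", "malware", "malware", "malware"]

centroids, models = train(points, labels, k=2)
print(predict((0.1, 0.1), centroids, models))
print(predict((5.1, 5.0), centroids, models))
```

The point of the routing step is that each per-cluster classifier only has to separate files that already look alike, which is presumably why the slide splits the task in two.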

14. Infrastructure: Underlying data lake - Burger

15. Architecture: Malware classification
  [Diagram: stages labeled Features, Clustering, Training, Validation, Production, Data; durations shown: 3h, 4.5h, 24h; second row Clustering, Training, Validation with durations 24h, 6h, 24h]
  ● ~700TB of binary files
  ● patented tailor-made solution

16. Custom application vs. Spark
  Custom application
  • optimised & performant
  • takes months to develop
  • not that easy to change
  Spark
  • slower
  • easy to experiment with
  • very fast development

17. Threat Detections Streaming

18. 3-step threat approach
  1. Identify - threat researcher
  2. Block - operator
  3. Analyze and automate - data / AI researcher + engineers

21. Time series of detections
  • Thousands of detection time series
  • Where should the operator focus?
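
One simple way to answer "where should the operator focus?" is to score every series and show only the most unusual ones. The sketch below is an assumption for illustration, not the model from the talk: it uses a plain z-score of the latest count against recent history, and the detection names and counts are invented.

```python
from statistics import mean, pstdev


def anomaly_score(series, history=24):
    """Z-score of the latest value against the preceding `history` points."""
    if len(series) < 3:
        return 0.0  # not enough history to judge
    *past, latest = series[-(history + 1):]
    sigma = pstdev(past)
    if sigma == 0:
        # Flat history: any change is notable, no change is not.
        return 0.0 if latest == past[-1] else float("inf")
    return abs(latest - mean(past)) / sigma


def rank_detections(series_by_name, top=3):
    """Names of the `top` series with the highest anomaly score."""
    scored = {name: anomaly_score(values) for name, values in series_by_name.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top]


# Hypothetical hourly detection counts per detection name.
detections = {
    "Win32:Trojan-A": [100, 102, 98, 101, 99, 100],   # stable
    "JS:Miner-B": [10, 12, 11, 9, 10, 60],            # sudden spike
    "Android:Adware-C": [50, 55, 45, 52, 48, 51],     # noisy but normal
}

ranking = rank_detections(detections)
print(ranking)
```

With thousands of series, the operator would look only at the head of such a ranking instead of scanning every chart.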

23. Short response time is necessary

28. First idea - custom streaming app
  • Python because of ML models
  • A big part of the code deals with already solved problems
  • POC written by researchers
  • Gets the job done, but not easy to maintain or experiment with
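
The slides do not show the POC itself. As a purely illustrative assumption, the following sketch shows the kind of "already solved" plumbing a custom streaming app tends to reimplement: bucketing detection events into fixed, non-overlapping (tumbling) time windows to build the time series. The event tuples and names are hypothetical.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=60):
    """Count events per (detection name, window start) in fixed windows.

    `events` is an iterable of (unix_timestamp_seconds, detection_name).
    """
    counts = Counter()
    for timestamp, name in events:
        # Align the timestamp to the start of its window.
        window_start = timestamp - timestamp % window_seconds
        counts[(name, window_start)] += 1
    return counts


# Hypothetical detection events: (timestamp in seconds, detection name).
events = [(0, "A"), (30, "A"), (61, "A"), (65, "B"), (119, "B")]

counts = tumbling_counts(events)
print(sorted(counts.items()))
```

Even this toy version ignores late or out-of-order events, state recovery after a crash, and scaling across machines. A streaming engine already handles these; in PySpark, for example, windowed counts are typically expressed with `pyspark.sql.functions.window` inside a `groupBy`, with `withWatermark` bounding how late data may arrive.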

29. Adopted solution: Spark Structured Streaming