AI on Spark for Malware Analysis and Anomalous Threat Detection
Demonstrate how Avast leverages AI and big data to burn malware.
- Identify - threat researcher
- Block - operator
- Analyze and automate - data / AI researcher + engineers
3 .AI on Spark for Malware Analysis and Anomalous Threat Detection Jakub Sanojca & João Da Silva, Avast (Researcher, Data Engineer)
4 .Goal Demonstrate how Avast leverages AI and big data to burn malware.
6 .Agenda • What Avast does • Malware research • Structured Streaming • AI anomaly detection • Demo
8 .Thank you • Big Data Systems • AI team - especially Yura, Olga and Dmitry • Threat researchers and analysts
9 .Avast is dedicated to creating a world that provides safety and privacy for all, no matter who you are, where you are, or how you connect.
10 .Global reach: portfolio of security, privacy and utility applications #UnifiedDataAnalytics #SparkAISummit
11 .World's Largest Detection Network • 200B+ URLs • 300M+ new files monthly • 10,000+ globally distributed servers
12 .Training the Avast Machine Learning Engine • Purpose-built approach that takes < 12 hours to add new features, train, and deploy into production
13 .Malware classification Data: ● >500 features handcrafted by our experts from binary files Task: ● Classify files as clean, malware, or PUP (potentially unwanted program) Two-step ML pipeline: ● Cluster data with custom k-means ● Classify inside each cluster with Random Forest
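The two-step pipeline on this slide can be sketched roughly as follows. This is a minimal illustration and not Avast's implementation: the slides do not describe the custom k-means, so plain Lloyd's k-means is used instead, and a trivial majority-label model (`MajorityLabel`, a hypothetical name) stands in for the per-cluster Random Forest. Feature vectors are assumed to be tuples of numbers.

```python
import math
import random
from collections import Counter, defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (assumption: stands in for the custom k-means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if it has none.
        centroids = [
            tuple(sum(dim) / len(groups[i]) for dim in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(k)
        ]
    return centroids


class MajorityLabel:
    """Placeholder per-cluster model (assumption: a Random Forest would go here)."""

    def fit(self, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, _features):
        return self.label


def train(points, labels, k):
    """Step 1: cluster the data. Step 2: fit one model per cluster."""
    centroids = kmeans(points, k)
    labels_by_cluster = defaultdict(list)
    for p, y in zip(points, labels):
        labels_by_cluster[nearest(p, centroids)].append(y)
    models = {c: MajorityLabel().fit(ys) for c, ys in labels_by_cluster.items()}
    return centroids, models


def classify(point, centroids, models):
    """Route the sample to its cluster, then ask that cluster's model."""
    return models[nearest(point, centroids)].predict(point)
```

The point of the structure is the routing: each sample is first assigned to a cluster, and only that cluster's model decides between clean, malware, and PUP, so each model only has to separate files that already look alike.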
14 .Infrastructure: Underlying data lake - Burger
15 .Architecture: Malware classification ● ~700TB of binary files ● patented tailor-made solution [Pipeline diagram. Labels: Features, Clustering, Training, Validation, Production, Data. Durations shown: 3 h, 4.5 h, 24 h; 24 h, 6 h, 24 h]
16 .Custom application: • optimised & performant • takes months to develop • not that easy to change Spark: • slower • easy to experiment with • very fast development
17 .Threat Detections Streaming
18 .3 step threat approach 1. Identify - threat researcher 2. Block - operator 3. Analyze and automate - data / AI researcher + engineers
22 .Time series of detections • Thousands of detection time series • Where should operator focus?
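One way to answer "where should the operator focus?" is to score every detection time series and surface only the most unusual ones. The sketch below is an assumption for illustration, not the model used in the talk: it scores the latest count of each series with a robust z-score (median and median absolute deviation) and returns the top-ranked names. The function names and the floor of 1.0 on the scale are hypothetical choices.

```python
from statistics import median


def anomaly_score(counts):
    """Robust z-score of the latest value against the earlier history."""
    history, latest = counts[:-1], counts[-1]
    center = median(history)
    mad = median(abs(x - center) for x in history)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data;
    # the floor of 1.0 (an assumption) avoids dividing by zero on flat series.
    return abs(latest - center) / max(1.4826 * mad, 1.0)


def rank_series(series_by_name, top=3):
    """Names of the `top` series whose latest value deviates most from history."""
    scores = {name: anomaly_score(counts) for name, counts in series_by_name.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

With thousands of series, a ranking like this lets the operator look at a handful of candidates instead of every chart.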
23 .Short response time is necessary
28 .First idea - custom streaming app • Python because of ML models • Large part of the code deals with already-solved problems • POC written by researchers • Gets the job done, but not easy to maintain or experiment with
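To illustrate what "already-solved problems" can mean in a custom streaming app, here is a minimal sketch of the bookkeeping such an app has to carry itself: tumbling event-time windows, a watermark, and dropping of late events. It is a hypothetical example, not code from the talk; the class name, its parameters, and the use of integer seconds for event time are all assumptions.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per (window, key); emits a window once the watermark passes it."""

    def __init__(self, window_seconds, allowed_lateness):
        self.window = window_seconds
        self.lateness = allowed_lateness
        self.counts = defaultdict(int)  # (window_start, key) -> count
        self.max_event_time = 0

    def add(self, event_time, key):
        """Record one event and return finalized (window_start, key, count) rows."""
        watermark = self.max_event_time - self.lateness
        start = event_time - event_time % self.window
        if start + self.window <= watermark:
            return []  # too late: this window has already been emitted
        self.counts[(start, key)] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return self._flush()

    def _flush(self):
        """Emit and forget every window that ends at or before the watermark."""
        watermark = self.max_event_time - self.lateness
        done = sorted(k for k in self.counts if k[0] + self.window <= watermark)
        return [(start, key, self.counts.pop((start, key))) for start, key in done]
```

None of this is specific to malware detection, and a production version would still need fault-tolerant state and recovery on top of it. Spark Structured Streaming provides event-time windows and watermarks out of the box (`window` and `withWatermark`).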
29 .Adopted solution: Spark Structured Streaming