Detecting Mobile Malware with Apache Spark

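We will talk about training data and why we use Spark, the features we extract from mobile phone apps, and how we obtain a high accuracy score in the cloud. At Wandera, we have successfully implemented a large-scale malware detection (and classification) ML model using Apache Spark (MLlib) and PMML, through the OpenScoring paradigm.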

1.Detecting Mobile Malware with Apache Spark David Pryce, Wandera #DSSAIS12

2.Summary
• The problem: Mobile-first malware detection
• The data and features
• The Machine Learning (ML) model
• Why Apache Spark?
• Making it production ready
• Data Science @ Wandera

3.The power of enterprise mobility
• Seamless internal communication
• Added flexibility to working hours
• Access to more apps and productivity tools
• E-mail and other services available anywhere
• Devices are prone to security threats
• Concerns around appropriate usage
• Data usage costs are opaque and spiraling
• Potentially exposing sensitive data

4.Happy hunting ground for attackers
• 435%: High severity threats (CVSS) growth in 2016
• 80% of organizations experienced mobile phishing attack
• 38% of hackers bypass endpoint defense using social engineering
• 100%: Mobile malware growth in 2016
“Mobile threats can no longer be ignored” - AUGUST 2017 - GARTNER MARKET GUIDE TO MOBILE THREAT DEFENSE

5.Introducing the Secure Mobile Gateway
• IN-NETWORK PROTECTION
• ON-DEVICE DETECTION

6.The rise of mobile malware Credit: GData 2017

7.Our objectives: Identify and Classify MALWARE TYPES Ransomware Spyware Banker Trojan SMS Rooting Adware

8.Why is this a novel problem?
• Mobile malware is on the rise
• Signature based detection is no longer scalable or effective
• We needed a solution that could
  ◦ work across both known and unknown threats;
  ◦ effectively protect our customers; and
  ◦ enable threat research to quickly identify new outbreaks
• First solution = signatures and lists
• Our solution = machine learning!

9.The data…
Good and bad apps
• Source 1: official app stores
• Source 2: seen in our devices
• Source 3: seen by our gateway + 3rd-party threat intelligence
External input verified for labels (supervised learning)
Currently storing: ~2 million labelled apps

10.… and the features Baidu 2016

11.Feature extraction
Direct metadata extraction
• Total unique fields for all apps ~ 500,000
• A typical app ~ 10+ fields
• SPARSE VECTOR
Solution:
• Hashing function (vector to indices)
• Allows for fast retrieval
• With big enough map (2^20) to avoid clashes
• DENSE VECTOR
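
The hashing step on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the MD5-based hash, the helper names `bucket_index` and `hash_features`, and the sample metadata fields are all invented for the example. Only the map size of 2^20 comes from the slide. Spark ships a `HashingTF` transformer for the same job; the sketch avoids it so that it runs without a cluster.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # map size quoted on the slide


def bucket_index(field: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a metadata field to a stable index in [0, num_buckets)."""
    digest = hashlib.md5(field.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(fields, num_buckets: int = NUM_BUCKETS) -> dict:
    """Return a sparse {index: count} vector for one app's metadata fields."""
    vector = {}
    for field in fields:
        idx = bucket_index(field, num_buckets)
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


# Hypothetical metadata fields for a single app (invented for illustration).
app_fields = [
    "permission=android.permission.SEND_SMS",
    "permission=android.permission.INTERNET",
    "activity=com.example.MainActivity",
]
app_vector = hash_features(app_fields)
```

Because the index is computed directly from the field string, no dictionary of the ~500,000 known fields has to be stored or looked up, and previously unseen fields still land in a valid bucket.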

12.The Machine Learning model
• Selected model = Logistic Regression
  ◦ Models tried = (LogReg, SVM, Decision Tree)
• K-fold cross validation to select best parameters
• Accuracy: 0.96
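
As a rough illustration of K-fold cross validation used for parameter selection, the sketch below trains a small logistic regression with plain SGD and picks a regularisation strength by mean held-out accuracy. Everything in it is an assumption made for the example: the toy data, the parameter grid, the learning rate and the helper names. It is a pure Python stand-in and does not use the Spark MLlib implementation from the talk.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Clamp to avoid overflow in math.exp for extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, reg, epochs=50, lr=0.5):
    """Fit L2-regularised logistic regression with plain SGD.

    rows: list of (sparse_features: dict[int, float], label: 0 or 1).
    The penalty is applied only to features present in the current row.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in rows:
            z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
            error = sigmoid(z) - label
            bias -= lr * error
            for i, v in features.items():
                w = weights.get(i, 0.0)
                weights[i] = w - lr * (error * v + reg * w)
    return weights, bias


def predict(model, features) -> int:
    weights, bias = model
    z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if sigmoid(z) >= 0.5 else 0


def k_fold_accuracy(rows, reg, k=5):
    """Mean held-out accuracy over k folds (fold = every k-th row)."""
    scores = []
    for fold in range(k):
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        model = train_logreg(train, reg)
        correct = sum(predict(model, f) == y for f, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k


# Toy data (invented): feature 0 marks malicious apps, feature 1 benign ones,
# and feature 2 is noise shared by both classes.
rng = random.Random(0)
data = []
for n in range(100):
    label = n % 2
    features = {0 if label else 1: 1.0}
    if rng.random() < 0.5:
        features[2] = 1.0
    data.append((features, label))
rng.shuffle(data)

grid = [0.0, 0.01, 0.1]  # hypothetical regularisation strengths
cv_scores = {reg: k_fold_accuracy(data, reg) for reg in grid}
best_reg = max(cv_scores, key=cv_scores.get)
```

Each candidate value is scored only on rows it never saw during training, so the selected parameter is not simply the one that fits the training set best.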

13.Why Apache Spark?
• Truly big data: Millions of data points, millions of fields
• Ease of use: Fast, easy and iterative. From EDA to app in days. Scala and Python API.
• Deployment and Scale: From local to cluster is easy!
• Model persistence: PMML paradigm already integrated

14.Production ready? Wandera 2018

15.P.M.M.L
• Predictive Model Markup Language
• Industry standard
• Pro: Language agnostic, REST API, good algo coverage
• Con: large file size

16.Production ready?
• Saving to PMML (ML vs MLlib / DF vs RDD)
• DataFrame API - doesn’t have PMML functionality (yet)
• Hacked PMML to get probabilities for predictions
• Size of model ~ 20 MB (compressed)
• Overall time to train: less than 2 hours on a big enough cluster
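
To give a feel for what a PMML file for a binary logistic regression can look like, here is a small sketch that writes one with the Python standard library. It is not the file Spark exports and not the modification described on the slide. The function name `logreg_to_pmml`, the field names and the coefficients are hypothetical. The element layout follows my reading of the PMML 4.2 `RegressionModel` schema, with an `Output` element that asks a scoring engine to return the probability of class "1" alongside the predicted label.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_2"


def logreg_to_pmml(coefficients: dict, intercept: float) -> str:
    """Serialise a binary logistic regression as a minimal PMML document."""
    pmml = ET.Element("PMML", version="4.2", xmlns=PMML_NS)
    ET.SubElement(pmml, "Header", description="logistic regression sketch")

    # Declare the input fields and the categorical target.
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in coefficients:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")
    target = ET.SubElement(dictionary, "DataField", name="target",
                           optype="categorical", dataType="string")
    for category in ("1", "0"):
        ET.SubElement(target, "Value", value=category)

    model = ET.SubElement(pmml, "RegressionModel",
                          functionName="classification",
                          normalizationMethod="logit")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name, usageType="active")
    ET.SubElement(schema, "MiningField", name="target", usageType="target")

    # Request the class-1 probability as an explicit output field.
    output = ET.SubElement(model, "Output")
    ET.SubElement(output, "OutputField", name="probability_1",
                  feature="probability", value="1",
                  optype="continuous", dataType="double")

    # One table carries the fitted terms, the other is the reference class.
    table = ET.SubElement(model, "RegressionTable",
                          intercept=str(intercept), targetCategory="1")
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      name=name, coefficient=str(coef))
    ET.SubElement(model, "RegressionTable", intercept="0.0", targetCategory="0")
    return ET.tostring(pmml, encoding="unicode")


# Hypothetical coefficients for two hashed features.
pmml_text = logreg_to_pmml({"f0": 1.5, "f1": -2.0}, intercept=-0.25)
```

With one `NumericPredictor` per non-zero coefficient, a model over a 2^20-wide feature space produces a long XML file, which is consistent with the "large file size" drawback noted for PMML.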

17.Live scoring
1 User installs new app
2 Extracts features & scores app
3 If score > 0.9 INVESTIGATE / NOTIFY
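
The three steps above can be sketched as a scoring function followed by a threshold check. The 0.9 cut-off and the "INVESTIGATE / NOTIFY" action come from the slide. The weights, the bias, the feature indices and the "NO ACTION" label are hypothetical values chosen only to make the example run. In the deployment described in the talk the scoring is done through PMML and OpenScoring rather than a hand-written function.

```python
import math

THRESHOLD = 0.9  # cut-off quoted on the slide


def score_app(features: dict, weights: dict, bias: float) -> float:
    """Logistic-regression probability that an app is malicious."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(-z))


def triage(score: float, threshold: float = THRESHOLD) -> str:
    """Turn a score into an action; "NO ACTION" is an invented label."""
    return "INVESTIGATE / NOTIFY" if score > threshold else "NO ACTION"


# Hypothetical model: feature 0 pushes towards malware, feature 1 away from it.
weights = {0: 4.0, 1: -3.0}
bias = -1.0

suspicious = score_app({0: 1.0}, weights, bias)  # z = 3.0
harmless = score_app({1: 1.0}, weights, bias)    # z = -4.0
actions = [triage(suspicious), triage(harmless)]
```

Thresholding on a probability is why the exported model has to return probabilities and not only class labels.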

18.Data Science @ Wandera = Innovative Research + Scalable Architecture + Efficient Feature Delivery
• Cross-disciplinary team of scientists, analysts & developers
• Focus on solving real-world problems in a real-time, distributed network
• Global team with presence in the USA, London (UK) and the Czech Republic

19.Thanks for listening

20.Appendix 1: model testing results Wandera 2018