CyberMLToolkit - Anomaly Detection as a Scalable Generic Service

Cybercrime is one of the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to cybercrime is estimated to reach around $6 trillion by 2021. Security professionals are struggling to cope with the threat, so powerful and easy-to-use tools are needed to aid in this battle. For this purpose we created a security-focused anomaly detection framework that identifies anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants. This allows the model to be trained on the data of thousands of customers on a Databricks cluster in less than an hour. The model leverages proven technologies from recommendation engines to produce high-quality anomalies. We thoroughly evaluated the model’s ability to identify actual anomalies, both by using synthetically generated data and by staging an actual attack and showing that the model clearly identifies it as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.

2. CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache Spark Roy Levin, Microsoft #UnifiedDataAnalytics #SparkAISummit

3. Session goals • Present an easy-to-use framework that produces cyber-security anomalies • Explain how recommendation systems are used to find anomalous resource access • Show how we evaluated the framework to demonstrate its usefulness

4. Agenda: Motivation • Formulation & Models • Scalability for Large Datasets • Evaluation • Summary

5. Centralized cloud-native Security Information & Event Management system • Build Your Own ML (BYOML): 1. Log data from cloud resources 2. Process logs from Azure Databricks cluster 3. Author custom security analytics

6. General Anomaly Detector: fault detection, system health, dataset monitoring, …, security incidents • We would like to capture only security-related anomalies

8. Anomalous access • Train and apply on a simple-to-construct dataset – Avoid writing and maintaining complex rules and logic – Avoid the need to analyze multiple complex datasets such as org charts, RBAC tables, and cloud architectures

11. • Given a user and resource pair (u, r) • Provide an anomaly score for user u accessing resource r • If the anomaly score is above some threshold, surface the event

12. The straightforward approach • But users access new resources quite often, so this is just not good enough

13. Create a profile per user and resource and see if access deviates from that profile

14. Intuition: • Take a recommendation system and use it for anti-recommendations

15. Recommendation Engines

16. Movie Recommendations Model: Training Phase ("-" marks a movie the user has not rated)

Movie                   Roy 1    Inbal 2  Hasan 3  Lior 4   Anat 5   Arnon 6
1 The Godfather         -        4        -        5        -        -
2 The Dark Knight       3        -        -        -        2        5
3 Pulp Fiction          5        3        5        4        4        5
4 40 Year Old Virgin    2        4        -        -        3        3
5 Analyze That          3        5        4        -        4        -
6 Anger Management      3        5        -        -        -        5
7 Black Hawk Down       5        -        -        4        -        -

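17. Movie Recommendations Model: Training Phase, with latent features f1 (Romance), f2 (Action), f3 (Comedy). "?" marks an unknown value.

Ratings and movie feature vectors (x1, x2, …, xm, one per movie row):

Movie                   Roy 1    Inbal 2  Hasan 3  Lior 4   Anat 5   Arnon 6  f1   f2   f3
1 The Godfather         ?        4        ?        5        ?        ?        ?    ?    ?
2 The Dark Knight       3        ?        ?        ?        2        5        ?    ?    ?
3 Pulp Fiction          5        3        5        4        4        5        ?    ?    ?
4 40 Year Old Virgin    2        4        ?        ?        3        3        ?    ?    ?
5 Analyze That          3        5        4        ?        4        ?        ?    ?    ?
6 Anger Management      3        5        ?        ?        ?        5        ?    ?    ?
7 Black Hawk Down       5        ?        ?        4        ?        ?        ?    ?    ?

User feature vectors (θ, one per user column):

Feature                 Roy 1    Inbal 2  Hasan 3  Lior 4   Anat 5   Arnon 6
Romance f1              ?        ?        ?        ?        ?        ?
Action f2               ?        ?        ?        ?        ?        ?
Comedy f3               ?        ?        ?        ?        ?        ?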

19. Movie Recommendations Model: Apply Phase (same rating matrix, movie feature vectors x, and user feature vectors θ as in the training-phase slide, with "?" marking the unknown entries)
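The training and apply phases above can be illustrated with a small matrix factorization sketch. This is not the speaker's implementation: it is plain stochastic gradient descent in standard-library Python, the hyperparameters (epochs, lr, reg, seed) are assumptions picked for illustration, and the three latent features are learned rather than tied to Romance, Action, or Comedy. It fits one feature vector per movie and one per user to the known ratings, then estimates a rating the table does not contain.

```python
import random

# Known ratings from the movie table: (movie_index, user_index) -> rating.
# Any (movie, user) pair not listed here is unknown.
RATINGS = {
    (0, 1): 4, (0, 3): 5,
    (1, 0): 3, (1, 4): 2, (1, 5): 5,
    (2, 0): 5, (2, 1): 3, (2, 2): 5, (2, 3): 4, (2, 4): 4, (2, 5): 5,
    (3, 0): 2, (3, 1): 4, (3, 4): 3, (3, 5): 3,
    (4, 0): 3, (4, 1): 5, (4, 2): 4, (4, 4): 4,
    (5, 0): 3, (5, 1): 5, (5, 5): 5,
    (6, 0): 5, (6, 3): 4,
}
N_MOVIES, N_USERS, N_FACTORS = 7, 6, 3


def predict(x, theta, movie, user):
    """Predicted rating: dot product of the movie and user feature vectors."""
    return sum(a * b for a, b in zip(x[movie], theta[user]))


def rmse(x, theta):
    """Root-mean-square error over the known ratings only."""
    sq = [(r - predict(x, theta, m, u)) ** 2 for (m, u), r in RATINGS.items()]
    return (sum(sq) / len(sq)) ** 0.5


def train(epochs=500, lr=0.01, reg=0.05, seed=0):
    """Training phase: learn movie vectors x and user vectors theta by SGD."""
    rng = random.Random(seed)
    x = [[rng.uniform(0.0, 1.0) for _ in range(N_FACTORS)] for _ in range(N_MOVIES)]
    theta = [[rng.uniform(0.0, 1.0) for _ in range(N_FACTORS)] for _ in range(N_USERS)]
    for _ in range(epochs):
        for (m, u), r in RATINGS.items():
            err = r - predict(x, theta, m, u)
            for k in range(N_FACTORS):
                xk, tk = x[m][k], theta[u][k]
                # Gradient step with L2 regularization on both vectors.
                x[m][k] += lr * (err * tk - reg * xk)
                theta[u][k] += lr * (err * xk - reg * tk)
    return x, theta


x, theta = train()
# Apply phase: estimate a pair with no rating, The Godfather (0) for Roy (0).
unknown_estimate = predict(x, theta, 0, 0)
```

For anti-recommendation, the same estimate would be read the other way around: a user-resource pair with a low predicted score is a candidate anomaly.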

20. Back to Anomalous Resource Access

21. • Let us re-examine our data: – User-resource pairs with the number of times accessed • The standard collaborative filtering (CF) model assumes explicit item ratings, which raises some problems: – A rating is not really what we have in the input (although more accesses by a user to a resource likely mean the user should be allowed access) – We do not really have negative rating indications either, i.e., there is no explicit indicator saying that a user should not have access to some resource; what we do have is missing access

22. Linear Scaling: raw access counts per resource and the corresponding scaled scores, listed in user order for the users with recorded access

resource1: 1200, 1500 -> 9, 10
resource2: 900, 301, 1 -> 8, 6, 5
resource3: 1500, 599, 1, 902, 1205, 1500 -> 10, 7, 5, 8, 9, 10
resource4: 299, 1200, 895, 901 -> 6, 9, 8, 8
resource5: 601, 1500, 1200, 1203 -> 7, 10, 9, 9
resource6: 603, 1499, 1495 -> 7, 10, 10
resource7: 1499, 1200 -> 10, 9
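The numbers on the Linear Scaling slide are consistent with mapping the observed count range [1, 1500] linearly onto [5, 10] and rounding. The slide does not state the formula, so the sketch below is an assumption that happens to reproduce the displayed values; the function name and its low/high parameters are hypothetical.

```python
def linear_scale(count, min_count, max_count, low=5.0, high=10.0):
    """Map count from [min_count, max_count] linearly onto [low, high]."""
    if max_count == min_count:
        # Degenerate range: every observed count is the same.
        return high
    return low + (high - low) * (count - min_count) / (max_count - min_count)


# A few raw access counts taken from the slide.
counts = [1, 301, 599, 900, 1200, 1500]
scaled = [round(linear_scale(c, 1, 1500)) for c in counts]
```

Keeping the lowest observed score well above 1 leaves room below it for the scores assigned to pairs with no recorded access.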

23. Random Negative Samples: scaled scores per resource, in user order

resource1: 9, 10
resource2: 8, 6, 5
resource3: 10, 7, 5, 8, 9, 10
resource4: 6, 9, 8, 8
resource5: 7, 10, 9, 9
resource6: 7, 10, 10
resource7: 10, 9

24. Random Negative Samples: scaled scores per resource, in user order, with sampled entries of 1 added

resource1: 1, 9, 10
resource2: 8, 1, 6, 5
resource3: 10, 7, 5, 8, 9, 10
resource4: 6, 9, 1, 8, 8
resource5: 7, 10, 9, 9
resource6: 7, 10, 10
resource7: 10, 9, 1
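A minimal sketch of random negative sampling, assuming that negatives are drawn per user from the resources that user has never accessed and that each receives the score 1 seen on the slide. The sampling policy (one negative per user by default) and all names here are illustrative assumptions, not the speaker's code.

```python
import random


def add_negative_samples(scores, users, resources, per_user=1,
                         negative_score=1, seed=0):
    """Return a copy of scores with random unseen (user, resource) pairs added.

    scores maps (user, resource) -> scaled score for observed accesses.
    """
    rng = random.Random(seed)
    result = dict(scores)
    for user in users:
        # Candidate negatives: resources this user has no recorded access to.
        unseen = [r for r in resources if (user, r) not in scores]
        for resource in rng.sample(unseen, min(per_user, len(unseen))):
            result[(user, resource)] = negative_score
    return result


observed = {("u1", "r1"): 9, ("u2", "r1"): 10, ("u2", "r2"): 8, ("u3", "r3"): 10}
with_negatives = add_negative_samples(
    observed, users=["u1", "u2", "u3"], resources=["r1", "r2", "r3"]
)
```

The observed scores are left untouched; only previously missing pairs receive the low score, which gives the model explicit examples of access that should look unusual.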

25. Adjusting for user & resource bias and creating an anomaly score: scaled scores per resource, in user order, including the sampled entries of 1

resource1: 1, 9, 10
resource2: 8, 1, 6, 5
resource3: 10, 7, 5, 8, 9, 10
resource4: 6, 9, 1, 8, 8
resource5: 7, 10, 9, 9
resource6: 7, 10, 10
resource7: 10, 9, 1

27. • Actually, we are given a tenant-id, user, resource triplet (tid, u, r) • Provide an anomaly score for user u accessing resource r, per tenant • Note: access within each tenant is isolated • Goals: – Process tenants in parallel – Cope with data from large tenants

28. • Create a Pandas UDF (PUDF) that uses the Surprise Python library to run the CF algorithm locally on each worker node • The PUDF operates on the Pandas DataFrames that are created per group when apply is called • The method is applied as follows: – df.groupBy(tid_colname).apply(my_pudf) * SurPRISE: Simple Python RecommendatIon System Engine, http://surpriselib.com/

29. • Problem: the data from some tenants may be too large to fit into the memory of a single worker node • Solution: before applying, count the number of entries per tenant – If the entries fit in memory, apply the PUDF method – If not, apply Spark CF per tenant, one by one
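The routing decision on this slide can be sketched in plain Python. The function name, the max_in_memory_rows threshold, and the sample rows are hypothetical; in Spark the per-tenant counts would more likely come from something like df.groupBy(tid_colname).count(), and the slide does not say how the memory limit is chosen.

```python
from collections import Counter


def plan_tenants(rows, max_in_memory_rows):
    """Split tenant ids into those small enough for the per-group PUDF path
    and those that need the distributed per-tenant path.

    rows is an iterable of (tenant_id, user, resource) triplets.
    """
    counts = Counter(tid for tid, _user, _resource in rows)
    small = sorted(t for t, n in counts.items() if n <= max_in_memory_rows)
    large = sorted(t for t, n in counts.items() if n > max_in_memory_rows)
    return small, large


rows = [
    ("tenantA", "u1", "r1"), ("tenantA", "u2", "r1"),
    ("tenantB", "u1", "r1"), ("tenantB", "u1", "r2"),
    ("tenantB", "u2", "r2"), ("tenantB", "u3", "r3"),
    ("tenantC", "u1", "r1"),
]
small_tenants, large_tenants = plan_tenants(rows, max_in_memory_rows=3)
```

Tenants in the first list can be processed together in a single grouped apply, while those in the second are handled one at a time with the distributed algorithm.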