Large-Scale Malicious Domain Detection with Spark AI

Malicious domains are one of the main resources used to mount attacks over the Internet. It is important to detect such activities by mining the large-scale network traffic data and identifying malicious URLs, domains or IPs. The attackers often take advantages of vulnerabilities in DNS and commit activities such as stealing private information, spamming, phishing, and DDoS attacks, and tend to by-pass botnet detection by generating domain clusters from Domain Generation Algorithms (DGA). We have billions of DNS records per day. Spark AI platform hence serves as an efficient distributed platform for the processing and mining of this huge amount of data. We work on the following two cybersecurity use cases. 1. Detect DGA, Porn, and Gambling domains Each malware-compromised host machine will have a large amount of DNS request in sequential order. The domain names are either generated by DGAs or preserve particular string patterns by design. We use spark to generate DNS request domain sequences and use Word2Vec to estimate the embedding of the domains. We then estimate the similarity and the most similar domains in the embedding space are discovered as the potential malicious domains. 2. Detect cryptocurrency mining pool domains The attackers are interested in accessing computing resources to mine cryptocurrency. The malware infected computers will be directed to attacker-controlled mining pool domains. This type of DNS request does not preserve sequential order and is relatively random. Since each mining pool domain cluster is visited by a wide range of different host machines, we used LSH to evaluate the similarity among sets of hosts. As a result, LSH generates domain-bucket bipartite graph and FastUnfolding algorithm is used to discover the domain clusters. We leveraged spark AI for large scale DNS data analysis and discovered hundreds of thousands of malicious domains each day at high precision.

1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

2.Large-scale Malicious Domain Detection with Spark Hao Guo, Tencent Ting Chen, Tencent #UnifiedAnalytics #SparkAISummit

3.About The Speakers Hao Guo • Applied Research Scientist @ Tencent Security • Master degree in Computer Science from HIT with research interest in NLP, deep learning and large-scaled machine learning Ting Chen • Director, Applied Machine Learning @ Tencent Jarvis Lab • PhD degree in Computer Science from UFL with research interest in computer vision and machine learning • Previously, Senior ML engineer and DS manager at Uber #UnifiedAnalytics #SparkAISummit 3

4. Agenda • DDoS Attack & Advance Persistent Threat • Sequence based detection • Crypto Mining Malware • Locality Sensitivity Hashing based detection • Conclusion #UnifiedAnalytics #SparkAISummit 4

5.What is DDoS Attack #UnifiedAnalytics #SparkAISummit 5

6.DDoS Attack Trend 2018 #UnifiedAnalytics #SparkAISummit 6

7.DDoS Attack Trend #UnifiedAnalytics #SparkAISummit 7

8.What is APT Symantec APT white paper #UnifiedAnalytics #SparkAISummit 8

9.APT Activities #UnifiedAnalytics #SparkAISummit 9

10.Behind The Attacks: C&C #UnifiedAnalytics #SparkAISummit 10

11.Behind The Attacks: DGA #UnifiedAnalytics #SparkAISummit 11

12. Malicious Domain Detection Scenario 1: DGA Victim hosts Sequences of domains accessed for each host #UnifiedAnalytics #SparkAISummit 12

13.Crypto Mining Malware #UnifiedAnalytics #SparkAISummit 13

14. Malicious Domain Detection Scenario 2: Crypto mining domains Victim hosts #UnifiedAnalytics #SparkAISummit 14

15.#UnifiedAnalytics #SparkAISummit Data Scale @ Tencent DNS Records Domains IPs/Hosts Storage Tens Billions Billions TB (day) (day) Millions (day) Spark enables large-scale data analysis. #UnifiedAnalytics #SparkAISummit 15

16.Sequence Based Detection #UnifiedAnalytics #SparkAISummit 16

17.Sequence Based Detection Victim hosts sentence word #UnifiedAnalytics #SparkAISummit 17

18.Sequence Based Detection Victim hosts document #UnifiedAnalytics #SparkAISummit 18

19. Domain2Vec Representation Key Idea • Estimate the domain2vec (work2vec) Victim hosts representation of each domain with CBOW CBOW framework i-2 i-1 i i+1 i+2 Example domain2vec representation • -0.10 0.58 -0.04 … 0.29 0.26 -0.17 #UnifiedAnalytics #SparkAISummit 19

20.Domain Clustering • Start with seed domains (known malicious domains) • Find most similar domains 𝑑𝑜𝑚𝑎𝑖𝑛𝑉𝑒𝑐4 5 𝑑𝑜𝑚𝑎𝑖𝑛𝑉𝑒𝑐6 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 = cos θ = ||𝑑𝑜𝑚𝑎𝑖𝑛𝑉𝑒𝑐4 ||||𝑑𝑜𝑚𝑎𝑖𝑛𝑉𝑒𝑐6 || #UnifiedAnalytics #SparkAISummit 20

21.Domain Clustering #UnifiedAnalytics #SparkAISummit 21

22.Example Clustering Results Virus Domains Nymaim Domains Conficker Domains ... ... ... .... ... ... #UnifiedAnalytics #SparkAISummit 22

23. Implementation Framework billions/day #UnifiedAnalytics #SparkAISummit 23

24. Key Functions from pyspark.sql import SparkSession, Row from import Word2Vec, Word2VecModel Domain Sequence Generation domain_sequence_rdd = \ input_rdd.combineByKey(to_list, append, extend, numPartitions = N).mapValues(sortbytime) Domain2Vec Calculation domain_sequence_rdd = domain_sequence lambda r:Row(r) ) domain_sequence = spark.createDataFrame(domain_sequence_rdd,[”domainSequence”]) domain2vec = Word2Vec(vectorsize=100, minCount=3, numPartitions =N, seed=42, \ inputCol=”domainSequce”, outputCol=”model”, windowSize=8, maxSentenceLength=1000) model = Similarity Domain Clustering model.findSynonyms(domain_seed, M).select("word", fmt("similarity", m).alias("similarity")) #UnifiedAnalytics #SparkAISummit 24

25.LSH Based Detection #UnifiedAnalytics #SparkAISummit 25

26. LSH based detection Victim hosts Jaccard similarity host IP sets for accessing domain1: 𝑆9 = {𝐼𝑃0, 𝐼𝑃1, 𝐼𝑃2, 𝐼𝑃3} host IP sets for accessing domain2: 𝑆C = {𝐼𝑃1, 𝐼𝑃2, 𝐼𝑃3, 𝐼𝑃4} Similarity of domain1 and domain2 is: | EF9,EFC,EFG | 𝑠𝑖𝑚 𝑑𝑜𝑚𝑎𝑖𝑛1, 𝑑𝑜𝑚𝑎𝑖𝑛2 = | EFH,EF9,EFC,EFG,EFI | = 3/5 #UnifiedAnalytics #SparkAISummit 26

27. Why LSH • High dimensional and sparse – tens of millions hosts • O(N*N) comparisons – million unique domains • Spark provides Locality Sensitivity Hashing for fast near-duplicate detection #UnifiedAnalytics #SparkAISummit 27

28. LSH With hight probablity domain1 and doman2 are hashed into the same buckets . With high probablity domain1 and domain2 are hashed into the different buckets. #UnifiedAnalytics #SparkAISummit 28

29.Minhash and Jaccard Similarity • There is a suitable hash function for the Jaccard similarity : minhash • The probability that minhash(domain1) = minhash(domain2) is equal to the similarity of Jaccard(domain1, domain2) #UnifiedAnalytics #SparkAISummit 29