1.Bayesian Data Mining University of Belgrade School of Electrical Engineering Department of Computer Engineering and Information Theory Marko Stupar 11/3370 sm113370m@student.etf.rs 1 /40

2.Data Mining problem Too many attributes in training set (colons of table) Existing algorithms need too much time to find solution We need to classify, estimate, predict in real time Marko Stupar 11/3370 sm113370m@student.etf.rs Target Value 1 Value 2 . . . Value 100000 2 /40

3.Problem importance Find relation between: All Diseases, All Medications, All Symptoms, Marko Stupar 11/3370 sm113370m@student.etf.rs 3 /40

4.Existing solutions Marko Stupar 11/3370 sm113370m@student.etf.rs CART, C4.5 Too many iterations Continuous arguments need binning Rule induction Continuous arguments need binning Neural networks High computational time K-nearest neighbor Output depends only on distance based close values 4 /40

5.Classification, Estimation, Prediction Used for large data set Very easy to construct Not using complicated iterative parameter estimations Often does surprisingly well May not be the best possible classifier Robust, Fast, it can usually be relied on to Marko Stupar 11/3370 sm113370m@student.etf.rs Naïve Bayes Algorithm 5 /40

6.Marko Stupar 11/3370 sm113370m@student.etf.rs Naïve Bayes algorithm Reasoning New information arrived How to classify Target? Target Attribute 1 Attribute 2 Attribute n … … … .. … … .. .. .. .. .. .. … .. … Training Set Target Attribute 1 Attribute 2 Attribute n a1 a2 an Missing 6 /40

7.Target can be one of discrete values: t1, t2, …, tn Marko Stupar 11/3370 sm113370m@student.etf.rs Naïve Bayes algorithm Reasoning How to calculate? Did not helped very much? Naïve Bayes assumes that A1…An are independent, So: 7 /40

8.Age Income Student Credit Target Buys Computer 1 Youth High No Fair No 2 Youth High No Excellent No 3 Middle High No Fair Yes 4 Senior Medium No Fair Yes 5 Senior Low Yes Fair Yes 6 Senior Low Yes Excellent No 7 Middle Low Yes Excellent Yes 8 Youth Medium No Fair No 9 Youth Low Yes Fair Yes 10 Senior Medium Yes Fair Yes 11 Youth Medium Yes Excellent Yes 12 Middle Medium No Excellent Yes 13 Middle High Yes Fair Yes 14 Senior Medium No Excellent No Attributes = (Age= youth , Income= medium , Student= yes , Credit_rating = fair ) P(Attributes, Buys_Computer =Yes) = P(Age= youth|Buys_Computer =yes) * P(Income= medium|Buys_Computer =yes) * P(Student= yes|Buys_Computer =yes) * P( Credit_rating = fair|Buys_Computer =yes) * P( Buys_Computer =yes) =2/9 * 4/9 * 6/9 * 6/9 * 9/14 = 0.028 Attributes = (Age= youth , Income= medium , Student= yes , Credit_rating = fair ) P(Attributes, Buys_Computer =No) = P(Age= youth|Buys_Computer =no) * P(Income= medium|Buys_Computer =no) * P(Student= yes|Buys_Computer =no) * P( Credit_rating = fair|Buys_Computer =no) * P( Buys_Computer =no) =3/5 * 2/5 * 1/5 * 2/5 * 5/14 = 0.007 Naïve Bayes Discrete Target Example Marko Stupar 11/3370 sm113370m@student.etf.rs 8 /40

9.Naïve Bayes Discrete Target - Example Attributes = (Age= youth , Income= medium , Student= yes , Credit_rating = fair ) Target = Buys_Computer = [ Yes | No ] ? P(Attributes, Buys_Computer =Yes) = P(Age= youth|Buys_Computer =yes) * P(Income= medium|Buys_Computer =yes) * P(Student= yes|Buys_Computer =yes) * P( Credit_rating = fair|Buys_Computer =yes) * P( Buys_Computer =yes) =2/9 * 4/9 * 6/9 * 6/9 * 9/14 = 0.028 P(Attributes, Buys_Computer =No) = P(Age= youth|Buys_Computer =no) * P(Income= medium|Buys_Computer =no) * P(Student= yes|Buys_Computer =no) * P( Credit_rating = fair|Buys_Computer =no) * P( Buys_Computer =no) =3/5 * 2/5 * 1/5 * 2/5 * 5/14 = 0.007 P( Buys_Computer =Yes | Attributes) > P( Buys_Computer =No| Attributes) Therefore, the naïve Bayesian classifier predicts Buys_Computer = Yes for previously given Attributes Marko Stupar 11/3370 sm113370m@student.etf.rs 9 /40

10.Naïve Bayes Discrete Target – Spam filter Attributes = Text Document = w1,w2,w3… Array of words Target = Spam = [Yes | No] ? - probability that the i-th word of a given document occurs in documents, in training set, that are classified as Spam - probability that all words of document occur in Spam documents in training set Marko Stupar 11/3370 sm113370m@student.etf.rs 10 /40

11.Naïve Bayes Discrete Target – Spam filter - Bayes factor Sample correction – if there is a word in document that never occurred in training set the whole will be zero. Sample correction solution – put some low value for that Marko Stupar 11/3370 sm113370m@student.etf.rs 11 /40

12.Gaussian Naïve Bayes Continuous Attributes Continuous attributes do not need binning (like CART and C4.5) Choose adequate PDF for each Attribute in training set Gaussian PDF is most likely to be used to estimate the attribute probability density function (PDF) Calculate PDF parameters by using Maximum Likelihood Method Naïve Bayes assumption - each attribute is independent of other, so joint PDF of all attributes is result of multiplication of single attributes PDFs Marko Stupar 11/3370 sm113370m@student.etf.rs 12 /40

13.Gaussian Naïve Bayes Continuous Attributes - Example Training set Validation set sex height (feet) weight (lbs) foot size (inches) male 6 180 12 male 5.92 190 11 male 5.58 170 12 male 5.92 165 10 female 5 100 6 female 5.5 150 8 female 5.42 130 7 female 5.75 150 9 Target = male Target = female height (feet) 5.885 0.027175 5.4175 0.07291875 weight (lbs) 176.25 126.5625 132.5 418.75 foot size(inches) 11.25 0.6875 7.5 1.25 sex height (feet) weight (lbs) foot size (inches) Target 6 130 8 Marko Stupar 11/3370 sm113370m@student.etf.rs 13 /40

14.Naïve Bayes - Extensions Easy to extend Gaussian Bayes – sample of extension Estimate Target – If Target is real number, but in training set has only few acceptable discrete values t1… tn , we can estimate Target by: A large number of modifications have been introduced, by the statistical, data mining, machine learning, and pattern recognition communities, in an attempt to make it more flexible Modifications are necessarily complications, which detract from its basic simplicity Marko Stupar 11/3370 sm113370m@student.etf.rs 14 /40

15.Naïve Bayes - Extensions Are Attributes always really independent? A1 = Weight, A2 = Height, A3 = Shoe Size, Target = [ male|female ]? How can that influence our Naïve Bayes data mining? Assumption ? Bayesian network Marko Stupar 11/3370 sm113370m@student.etf.rs 15 /40

16.Marko Stupar 11/3370 sm113370m@student.etf.rs Bayesian Network Bayesian network is a directed acyclic graph (DAG) with a probability table for each node. Bayesian network contains: Nodes and Arcs between them Nodes represent arguments from database Arcs between nodes represent their probabilistic dependencies Target A2 A1 A3 P(A2|Target) A6 A4 A5 P(A6|A4A5) A7 P(A3|Target,A4A6) 16 /40

17.Bayesian Network What to do Read Network Construct Network Training Set Continue with naïve Bayes Joint Probability Distribution Naïve Bayes Marko Stupar 11/3370 sm113370m@student.etf.rs 17 /40

18.Marko Stupar 11/3370 sm113370m@student.etf.rs Bayesian Network Read Network Chain rule of probability Bayesian network - Uses Markov Assumption A7 A2 A5 Joint Probability Distribution A7 depends only on A2 and A5 18 /40

19.Bayesian Network Read Network - Example How to get P(N|B), P(B|M,T)? Expert knowledge From Data(relative frequency estimates) Or a combination of both Medication Blood Cloth Trauma Heart Attack Nothing Stroke P(M) P(!M) 0.2 0.8 P(T) P(!T) 0.05 0.95 M T P(B) P(!B) T T 0.95 0.05 T F 0.3 0.7 F T 0.6 0.4 F F 0.9 0.1 B P(H) P(!H) T 0.4 0.6 F 0.15 0.85 B P(N) P(!N) T 0.25 0.75 F 0.75 0.25 B P(S) P(!S) T 0.35 0.65 F 0.1 0.9 Marko Stupar 11/3370 sm113370m@student.etf.rs 19 /40

20.Manually From Database – Automatically Heuristic algorithms 1. heuristic search method to construct a model 2.evaluates model using a scoring method Bayesian scoring method entropy based method minimum description length method 3. go to 1 if score of new model is not significantly better Algorithms that analyze dependency among nodes Measure dependency by conditional independence (CI) test Bayesian Network Construct Network Marko Stupar 11/3370 sm113370m@student.etf.rs 20 /40

21.Bayesian Network Construct Network Heuristic algorithms Advantages less time complexity in worst case Disadvantage May not find the best solution due to heuristic nature Algorithms that analyze dependency among nodes Advantages usually asymptotically correct Disadvantage CI tests with large condition-sets may be unreliable unless the volume of data is enormous. Marko Stupar 11/3370 sm113370m@student.etf.rs 21 /40

22.Bayesian Network Construct Network - Example 1. Choose an ordering of variables X 1 , … , X n 2. For i = 1 to n add X i to the network select parents from X 1 , … ,X i-1 such that P (X i | Parents(X i )) = P (X i | X 1 , ... X i-1 ) Marko Stupar 11/3370 sm113370m@student.etf.rs Marry Calls John Calls Alarm Burglary Earthquake P (J | M) = P (J)? No P (A | J, M) = P (A | J) ? P (A | J, M) = P (A) ? No P (B | A, J, M) = P (B | A) ? Yes P (B | A, J, M) = P (B) ? No P (E | B, A ,J, M) = P (E | A) ? No P (E | B, A, J, M) = P (E | A, B) ? Yes 22 /40

23.Create Network – from database d(Directional)-Separation d-Separation is graphical criterion for deciding, from a given causal graph(DAG), whether a disjoint sets of nodes X-set, Y-set are independent, when we know realization of third Z-set Z-set - is instantiated(values of it’s nodes are known) before we try to determine d-Separation(independence) between X-set and Y-set X-set and Y-set are d-Separated by given Z-set if all paths between them are blocked Example of Path : N1 <- N2 -> N3 -> N4 -> N5 <- N6 <- N7 N5 – “head-to-head” node Path is not blocked if every “head-to-head” node is in Z-set or has descendant in Z-set, and all other nodes are not in Z-set. Marko Stupar 11/3370 sm113370m@student.etf.rs 23 /40

24.Create Network – from database d-Separation Example 1. Does D d -separate C and F? There are two undirected paths from C to F: ( i ) C -B -E –F This blocked given D by the node E, since E is not one of the given nodes and has both arrows on the path going into it. (ii) C - B - A - D - E - F. This path is also blocked by E (and D as well). So, D does d-separate C and F 2. Do D and E d-separate C and F? The path C - B - A - D - E - F is blocked by the node D given {D,E}. However, E no longer blocks C - B - E - F path since it “given” node. So, D and E do not d-separate C and F 3. Write down all pairs of nodes which are independent of each other. Nodes which are independent are those that are d-separated by the empty set of nodes. This means every path between them must contain at least one node with both path arrows going into it, which is E in current context. We find that F is independent of A, of B, of C and of D. All other pairs of nodes are dependent on each other. 4. Which pairs of nodes are independent of each other given B? We need to find which nodes are d-separated by B. A, C and D are all d-separated from F because of the node E. C is d-separated from all the other nodes (except B) given B. The independent pairs given B are hence: AF, AC, CD, CE, CF, DF. 5. Do we have that: P(AF|E) = P(A|E)P(F|E)? (are A and F independent given E?) A and F are NOT independent given E, since E does not d-separate A and F Marko Stupar 11/3370 sm113370m@student.etf.rs 24 /40

25.Create Network – from database Markov Blanket MB(A) - set of nodes composed of A’s parents, its children and their other parents When given MB(A) every other set of nodes in network is conditionally independent or d-Separated of A MB(A) - The only knowledge needed to predict the behavior of A – Pearl 1988. Marko Stupar 11/3370 sm113370m@student.etf.rs 25 /40

26.Mutual information Conditional mutual information Used to quantify dependence between nodes X and Y If we say that X and Y are d-Separated by condition set Z, and that they are conditionally independent Marko Stupar 11/3370 sm113370m@student.etf.rs Create Network – from database Conditional independence (CI) Test 26 /40

27. Marko Stupar 11/3370 sm113370m@student.etf.rs Backup Slides (Not needed) 27 /40

28.Create Network – from database Naïve Bayes Very fast Very robust Target node is the father of all other nodes The low number of probabilities to be estimated Knowing the value of the target makes each node independent Marko Stupar 11/3370 sm113370m@student.etf.rs 28 /40

29.Create Network – from database Augmented Naïve Bayes Naive structure + relations among son nodes | knowing the value of the target node More precise results than with the naive architecture Costs more in time Models: Pruned Naive Bayes (Naive Bayes Build) Simplified decision tree (Single Feature Build) Boosted (Multi Feature Build) Marko Stupar 11/3370 sm113370m@student.etf.rs 29 /40

30.Create Network – from database Augmented Naïve Bayes Tree Augmented Naive Bayes (TAN) Model (a) Compute I(Ai, Aj|Target) between each pair of attributes, i≠j (b) Build a complete undirected graph in which the vertices are the attributes A1, A2, … The weight of an edge connecting Ai and Aj is I(Ai, Aj|Target) (c) Build a maximum weighted spanning tree. (d) Transform the resulting undirected tree to a directed one by choosing a root variable and setting the direction of all edges to be outward from it. (e) Construct a tree augmented naive Bayes model by adding a vertex labeled by C and adding an directional edge from C to each Ai. Marko Stupar 11/3370 sm113370m@student.etf.rs 30 /40

31.Create Network – from database Sons and Spouses Target node is the father of a subset of nodes possibly having other relationships Showing the set of nodes being indirectly linked to the target Time cost of the same order as for the augmented naive Bayes Marko Stupar 11/3370 sm113370m@student.etf.rs 31 /40

32.Create Network – from database Markov Blanket Get relevant nodes on time frame lower than with the other two algorithms Augmented Naïve Bayes and Sons & Spouses Good tool for analyzing one variable Searches for the nodes that belong to the Markov Blanket The observation of the nodes belonging to the Markov Blanket makes the target node independent of all the other nodes. Marko Stupar 11/3370 sm113370m@student.etf.rs 32 /40

33.Create Network – from database Augmented Markov Blanket Marko Stupar 11/3370 sm113370m@student.etf.rs 33 /40

34.An Algorithm for Bayesian Belief Network Construction from Data Jie Cheng, David A. Bell, Weiru Liu School of Information and Software Engineering University of Ulster at Jordanstown Northern Ireland, UK, BT37 0QB e-mail: { j.cheng , da.bell , w.liu}@ ulst.ac.uk Phase I: (Drafting) 1. Initiate a graph G(V, E) where V={all the nodes of a data set}, E={ }. Initiate two empty ordered set S, R. 2. For each pair of nodes ( v , v ) i j where v v V i j , Î , compute mutual information I v v i j ( , ) using equation (1). For the pairs of nodes that have mutual information greater than a certain small value e , sort them by their mutual information from large to small and put them into an ordered set S. 3. Get the first two pairs of nodes in S and remove them from S. Add the corresponding arcs to E. (the direction of the arcs in this algorithm is determined by the previously available nodes ordering.) 4. Get the first pair of nodes remained in S and remove it from S. If there is no open path between the two nodes (these two nodes are d-separated given empty set), add the corresponding arc to E; Otherwise, add the pair of nodes to the end of an ordered set R. 5. Repeat step 4 until S is empty. Phase II: (Thickening) 6. Get the first pair of nodes in R and remove it from R. 7. Find a block set that blocks each open path between these two nodes by a set of minimum number of nodes. (This procedure find_block_set (current graph, node1, node2) is given at the end of this subsection.) Conduct a CI test. If these two nodes are still dependent on each other given the block set, connect them by an arc. 8. go to step 6 until R is empty. Phase III: (Thinning) 9. For each arc in E, if there are open paths between the two nodes besides this arc, remove this arc from E temporarily and call procedure find_block_set (current graph, node1, node2). Conduct a CI test on the condition of the block set. If the two nodes are dependent, add this arc back to E; otherwise remove the arc permanently. Marko Stupar 11/3370 sm113370m@student.etf.rs Create Network – from database Construction Algorithm - Example 34 /40

35.An Algorithm for Bayesian Belief Network Construction from Data Jie Cheng, David A. Bell, Weiru Liu School of Information and Software Engineering University of Ulster at Jordanstown Northern Ireland, UK, BT37 0QB e-mail: { j.cheng , da.bell , w.liu}@ ulst.ac.uk Phase I: (Drafting) 1. Initiate a graph G(V, E) where V={all the nodes of a data set}, E={ }. Initiate two empty ordered set S, R. 2. For each pair of nodes ( v , v ) i j where v v V i j , Î , compute mutual information I v v i j ( , ) using equation (1). For the pairs of nodes that have mutual information greater than a certain small value e , sort them by their mutual information from large to small and put them into an ordered set S. 3. Get the first two pairs of nodes in S and remove them from S. Add the corresponding arcs to E. (the direction of the arcs in this algorithm is determined by the previously available nodes ordering.) 4. Get the first pair of nodes remained in S and remove it from S. If there is no open path between the two nodes (these two nodes are d-separated given empty set), add the corresponding arc to E; Otherwise, add the pair of nodes to the end of an ordered set R. 5. Repeat step 4 until S is empty. Phase II: (Thickening) 6. Get the first pair of nodes in R and remove it from R. 7. Find a block set that blocks each open path between these two nodes by a set of minimum number of nodes. (This procedure find_block_set (current graph, node1, node2) is given at the end of this subsection.) Conduct a CI test. If these two nodes are still dependent on each other given the block set, connect them by an arc. 8. go to step 6 until R is empty. Phase III: (Thinning) 9. For each arc in E, if there are open paths between the two nodes besides this arc, remove this arc from E temporarily and call procedure find_block_set (current graph, node1, node2). Conduct a CI test on the condition of the block set. If the two nodes are dependent, add this arc back to E; otherwise remove the arc permanently. Marko Stupar 11/3370 sm113370m@student.etf.rs Create Network – from database Construction Algorithm - Example 34 /40

36.An Algorithm for Bayesian Belief Network Construction from Data Jie Cheng, David A. Bell, Weiru Liu School of Information and Software Engineering University of Ulster at Jordanstown Northern Ireland, UK, BT37 0QB e-mail: { j.cheng , da.bell , w.liu}@ ulst.ac.uk Phase I: (Drafting) 1. Initiate a graph G(V, E) where V={all the nodes of a data set}, E={ }. Initiate two empty ordered set S, R. 2. For each pair of nodes ( v , v ) i j where v v V i j , Î , compute mutual information I v v i j ( , ) using equation (1). For the pairs of nodes that have mutual information greater than a certain small value e , sort them by their mutual information from large to small and put them into an ordered set S. 3. Get the first two pairs of nodes in S and remove them from S. Add the corresponding arcs to E. (the direction of the arcs in this algorithm is determined by the previously available nodes ordering.) 4. Get the first pair of nodes remained in S and remove it from S. If there is no open path between the two nodes (these two nodes are d-separated given empty set), add the corresponding arc to E; Otherwise, add the pair of nodes to the end of an ordered set R. 5. Repeat step 4 until S is empty. Phase II: (Thickening) 6. Get the first pair of nodes in R and remove it from R. 7. Find a block set that blocks each open path between these two nodes by a set of minimum number of nodes. (This procedure find_block_set (current graph, node1, node2) is given at the end of this subsection.) Conduct a CI test. If these two nodes are still dependent on each other given the block set, connect them by an arc. 8. go to step 6 until R is empty. Phase III: (Thinning) 9. For each arc in E, if there are open paths between the two nodes besides this arc, remove this arc from E temporarily and call procedure find_block_set (current graph, node1, node2). Conduct a CI test on the condition of the block set. If the two nodes are dependent, add this arc back to E; otherwise remove the arc permanently. Marko Stupar 11/3370 sm113370m@student.etf.rs Create Network – from database Construction Algorithm - Example 34 /40

37.Bayesian Network Software Bayesia Lab Weka - Machine Learning Software in Java AgenaRisk , Analytica , Banjo, Bassist, Bayda , BayesBuilder , Bayesware Discoverer , B-course, Belief net power constructor, BNT, BNJ, BucketElim , BUGS, Business Navigator 5, CABeN , Causal discoverer , CoCo+Xlisp , Cispace , DBNbox , Deal, DeriveIt , Ergo , GDAGsim , Genie, GMRFsim , GMTk , gR , Grappa, Hugin Expert, Hydra, Ideal, Java Bayes , KBaseAI , LibB , MIM, MSBNx , Netica , Optimal Reinsertion, PMT Marko Stupar 11/3370 sm113370m@student.etf.rs 37 /40

38.Problem Trend Marko Stupar 11/3370 sm113370m@student.etf.rs History The term "Bayesian networks" was coined by Judea Pearl in 1985 In the late 1980s the seminal texts Probabilistic Reasoning in Intelligent Systems and Probabilistic Reasoning in Expert Systems summarized the properties of Bayesian networks Fields of Expansion Naïve Bayes Choose optimal PDF Bayesian Networks Find new way to construct network 38 /40

39.Bibliography – borrowed parts Naïve Bayes Classifiers, Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 http://en.wikipedia.org/wiki/Bayesian_network Bayesian Measurement of Associations in Adverse Drug Reaction Databases William DuMouchel Shannon Laboratory, AT&T Labs –Research dumouchel@research.att.com DIMACS Tutorial on Statistical Surveillance Methods Rutgers University June 20, 2003 http://download.oracle.com/docs/cd/B13789_01/datamine.101/b10698/3predict.htm#1005771 CS/CNS/EE 155: Probabilistic Graphical Models Problem Set 2 Handed out: 21 Oct 2009 Due: 4 Nov 2009 Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory Jie Cheng Dept. of Computing Science University of Alberta Alberta , T6G 2H1 Email: jcheng@cs.ualberta.ca David Bell, Weiru Liu Faculty of Informatics, University of Ulster, UK BT37 0QB Email: {w.liu, da.bell }@ ulst.ac.uk http://www.bayesia.com/en/products/bayesialab/tutorial.php ISyE8843A, Brani Vidakovic Handout 17 1 Bayesian Networks Bayesian networks Chapter 14 Section 1 – 2 Naive- Bayes Classification Algorithm Lab4-NaiveBayes.pdf Top 10 algorithms in data mining XindongWu · Vipin Kumar · J. Ross Quinlan · Joydeep Ghosh · Qiang Yang · Hiroshi Motoda · Geoffrey J. McLachlan · Angus Ng · Bing Liu · Philip S. Yu · Zhi-Hua Zhou · Michael Steinbach · David J. Hand · Dan Steinberg Received: 9 July 2007 / Revised: 28 September 2007 / Accepted: 8 October 2007 Published online: 4 December 2007 © Springer- Verlag London Limited 2007 Causality Computational Systems Biology Lab Arizona State University Michael Verdicchio With some slides and slide content from: Judea Pearl, Chitta Baral , Xin Zhang Marko Stupar 11/3370 sm113370m@student.etf.rs 39 /40

40. Marko Stupar 11/3370 sm113370m@student.etf.rs Questions? 40 /40