Semi-supervised Learning
1. Semi-supervised Learning
2. Introduction
(Figure: labelled data, images tagged "cat" or "dog"; unlabeled data, images of cats and dogs without any labels.)
3. Introduction
• Supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$
  • E.g. $x^r$: image, $\hat{y}^r$: class label
• Semi-supervised learning: $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$ together with a set of unlabeled data $\{x^u\}_{u=R+1}^{R+U}$, usually with U >> R
  • Transductive learning: the unlabeled data is the testing data
  • Inductive learning: the unlabeled data is not the testing data
• Why semi-supervised learning?
  • Collecting data is easy, but collecting "labelled" data is expensive
  • We do semi-supervised learning in our own lives
4. Why does semi-supervised learning help? Who knows? The distribution of the unlabeled data tells us something, usually with the help of some assumptions.
5. Outline
• Semi-supervised Learning for Generative Model
• Low-density Separation Assumption
• Smoothness Assumption
• Better Representation
6. Semi-supervised Learning for Generative Model
7. Supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$
• Look for the most likely prior probability $P(C_i)$ and class-dependent probability $P(x|C_i)$
• $P(x|C_i)$ is a Gaussian parameterized by mean $\mu^i$ and a shared covariance $\Sigma$
• With $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$ estimated, classify by the posterior, which determines the decision boundary:
$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}$$
8. Semi-supervised Generative Model
• Given labelled training examples $x^r \in C_1, C_2$, look for the most likely $P(C_i)$ and $P(x|C_i)$ as before, with $P(x|C_i)$ a Gaussian parameterized by $\mu^i$ and a shared $\Sigma$
• The unlabeled data $x^u$ help re-estimate $P(C_1)$, $P(C_2)$, $\mu^1$, $\mu^2$, $\Sigma$, which in turn moves the decision boundary
9. Semi-supervised Generative Model
The algorithm converges eventually, but the initialization influences the result.
• Initialization: $\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$
• Step 1 (E-step): compute the posterior probability of each unlabeled example, $P_\theta(C_1|x^u)$, which depends on the current model $\theta$
• Step 2 (M-step): update the model:
$$P(C_1) = \frac{N_1 + \sum_{x^u} P(C_1|x^u)}{N}$$
($N$: total number of examples; $N_1$: number of labelled examples belonging to $C_1$)
$$\mu^1 = \frac{\sum_{x^r \in C_1} x^r + \sum_{x^u} P(C_1|x^u)\, x^u}{N_1 + \sum_{x^u} P(C_1|x^u)} \quad \cdots\cdots$$
• Go back to Step 1.
10. Why?
$\theta = \{P(C_1), P(C_2), \mu^1, \mu^2, \Sigma\}$
• Maximum likelihood with labelled data (closed-form solution):
$$\log L(\theta) = \sum_{(x^r, \hat{y}^r)} \log P_\theta(x^r, \hat{y}^r)$$
• Maximum likelihood with labelled + unlabeled data (solved iteratively):
$$\log L(\theta) = \sum_{(x^r, \hat{y}^r)} \log P_\theta(x^r, \hat{y}^r) + \sum_{x^u} \log P_\theta(x^u)$$
where $P_\theta(x^u) = P_\theta(x^u|C_1)P(C_1) + P_\theta(x^u|C_2)P(C_2)$, since $x^u$ can come from either $C_1$ or $C_2$.
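The E-step/M-step updates above can be sketched directly in numpy. This is a minimal illustration under stated assumptions (two classes, a shared covariance, labelled points kept at their hard labels; the function name and the small ridge added to the covariance are my choices, not from the slides):

```python
import numpy as np

def semi_supervised_gaussian_em(x_lab, y_lab, x_unlab, n_iter=20):
    """EM for a two-class Gaussian generative model with shared covariance.

    x_lab:   (R, d) labelled inputs, y_lab: (R,) labels in {0, 1}
    x_unlab: (U, d) unlabeled inputs
    """
    X = np.vstack([x_lab, x_unlab])
    N, d = X.shape
    R = len(x_lab)
    # Initialization from the labelled data only.
    prior = np.array([np.mean(y_lab == 0), np.mean(y_lab == 1)])
    mu = np.stack([x_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
    cov = np.cov(x_lab.T) + 1e-6 * np.eye(d)

    for _ in range(n_iter):
        # E-step: posterior P(C_c | x^u) for the unlabeled data;
        # labelled points keep their hard labels.
        resp = np.zeros((N, 2))
        resp[np.arange(R), y_lab] = 1.0
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        for c in (0, 1):
            diff = x_unlab - mu[c]
            maha = np.einsum('nd,dk,nk->n', diff, inv, diff)
            loglik = -0.5 * (maha + logdet + d * np.log(2 * np.pi))
            resp[R:, c] = prior[c] * np.exp(loglik)
        resp[R:] /= resp[R:].sum(axis=1, keepdims=True)

        # M-step: soft-count updates, matching the slide's formulas
        # (Nc plays the role of N_1 + sum_u P(C_1|x^u)).
        Nc = resp.sum(axis=0)
        prior = Nc / N
        mu = (resp.T @ X) / Nc[:, None]
        cov = sum((resp[:, c, None] * (X - mu[c])).T @ (X - mu[c])
                  for c in (0, 1)) / N + 1e-6 * np.eye(d)
    return prior, mu, cov
```

Because the labelled rows of `resp` never change, the labelled data anchor the solution while the unlabeled data refine it.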
11. Semi-supervised Learning: Low-density Separation
非黑即白 ("black-or-white")
12. Self-training
• Given: labelled data set $\{(x^r, \hat{y}^r)\}_{r=1}^{R}$, unlabeled data set $\{x^u\}_{u=1}^{U}$
• Repeat:
  • Train a model $f^*$ from the labelled data set (you can use any model here, even regression)
  • Apply $f^*$ to the unlabeled data set to obtain pseudo-labels $\{(x^u, y^u)\}_{u=1}^{U}$
  • Remove a subset of the data from the unlabeled set and add it to the labelled set (how to choose this subset remains open; you can also assign a weight to each example)
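The loop above can be sketched as follows. A minimal sketch, assuming a nearest-centroid base model as a stand-in for $f^*$ and a confidence threshold as the (open) selection rule; the function name, the softmax confidence, and the threshold value are illustrative assumptions:

```python
import numpy as np

def self_training(x_lab, y_lab, x_unlab, threshold=0.9, max_rounds=10):
    """Grow the labelled set with confident pseudo-labels.

    The base model here is nearest-centroid; any classifier that
    outputs a per-class confidence could be substituted for f*.
    """
    x_lab, y_lab = x_lab.copy(), y_lab.copy()
    pool = x_unlab.copy()
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # Train f* on the current labelled set: one centroid per class.
        classes = np.unique(y_lab)
        centroids = np.stack([x_lab[y_lab == c].mean(axis=0) for c in classes])
        # Apply f* to the unlabeled pool; a softmax over negative
        # distances serves as a confidence score.
        dist = np.linalg.norm(pool[:, None, :] - centroids[None, :, :], axis=-1)
        conf = np.exp(-dist) / np.exp(-dist).sum(axis=1, keepdims=True)
        pseudo = classes[conf.argmax(axis=1)]
        keep = conf.max(axis=1) >= threshold
        if not keep.any():
            break  # nothing confident enough; stop
        # Move the confident pseudo-labelled points into the labelled set.
        x_lab = np.vstack([x_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        pool = pool[~keep]
    return x_lab, y_lab
```

Weighting each pseudo-labelled example by its confidence, instead of the hard threshold used here, is the "weight" variant mentioned on the slide.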
13. Self-training
• Similar to semi-supervised learning for the generative model
• Hard label vs. soft label: consider a neural network with parameters $\theta^*$ trained on the labelled data, and suppose its output for $x^u$ is $[0.7, 0.3]$ (70% class 1, 30% class 2)
  • Hard label: it looks like class 1, so treat it as class 1; the new target for $x^u$ is $[1, 0]$
  • Soft label: the new target for $x^u$ is $[0.7, 0.3]$, but this doesn't work: the network already produces exactly that output, so there is nothing to learn from
14. Entropy-based Regularization
Feed $x^u$ into the network $\theta^*$ to obtain a distribution $y^u$. The entropy of $y^u$ evaluates how concentrated the distribution is, and we want it to be as small as possible:
$$E(y^u) = -\sum_{m=1}^{5} y^u_m \ln y^u_m$$
• A distribution concentrated on one of the 5 classes is good: $E(y^u) = 0$
• A uniform distribution over the 5 classes is bad: $E(y^u) = -\sum_{m=1}^{5} \frac{1}{5}\ln\frac{1}{5} = \ln 5$
The training loss becomes
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda \sum_{x^u} E(y^u)$$
cross-entropy on the labelled data plus the entropy of the unlabeled predictions as a regularization-like term.
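The two terms of this loss can be written down directly. A numpy sketch (the function names, the clipping constant, and averaging rather than summing over examples are my choices):

```python
import numpy as np

def entropy(y):
    """E(y^u) = -sum_m y_m ln y_m for each row of predicted distributions."""
    y = np.clip(y, 1e-12, 1.0)  # avoid log(0)
    return -(y * np.log(y)).sum(axis=-1)

def semi_supervised_loss(y_lab_pred, y_lab_true, y_unlab_pred, lam=0.1):
    """Cross-entropy on labelled data plus lam * entropy on unlabeled data."""
    ce = -(y_lab_true * np.log(np.clip(y_lab_pred, 1e-12, 1.0))).sum(axis=-1)
    return ce.mean() + lam * entropy(y_unlab_pred).mean()
```

A one-hot prediction gives entropy 0 and the uniform distribution over 5 classes gives ln 5, matching the two cases on the slide.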
15. Outlook: Semi-supervised SVM
• Enumerate all possible labels for the unlabeled data
• Find the boundary that provides the largest margin and the least error
Thorsten Joachims, "Transductive Inference for Text Classification using Support Vector Machines", ICML, 1999
16. Semi-supervised Learning: Smoothness Assumption
近朱者赤，近墨者黑 ("You are known by the company you keep")
17. Smoothness Assumption
• Assumption: "similar" $x$ has the same $\hat{y}$
• More precisely:
  • $x$ is not uniformly distributed
  • If $x^1$ and $x^2$ are close in a high-density region, i.e. connected by a high-density path, then $\hat{y}^1$ and $\hat{y}^2$ are the same
(Analogy: 公館 vs. 台北車站, 公館 vs. 科技大樓. Source of image: http://hips.seas.harvard.edu/files/pinwheel.png)
18. Smoothness Assumption
• Same assumption, illustrated on the pinwheel data: $x^1$ and $x^2$ are connected by a high-density path, so they have the same label; $x^2$ and $x^3$ are not, so they have different labels
(Source of image: http://hips.seas.harvard.edu/files/pinwheel.png)
19. Smoothness Assumption
• Two images that are not similar directly can still be "indirectly" similar through stepping stones, a chain of intermediate examples each similar to the next
(The example is from the tutorial slides of Xiaojin Zhu. Source of image: http://www.moehui.com/5833.html/5/)
20. Smoothness Assumption
• Classify astronomy vs. travel articles (the example is from the tutorial slides of Xiaojin Zhu)
21. Smoothness Assumption
• Two articles that share no words directly can still be connected through intermediate articles with overlapping vocabulary (the example is from the tutorial slides of Xiaojin Zhu)
22. Cluster and then Label
• Cluster all the data, labelled and unlabeled, then give every point in a cluster the label of the labelled points it contains (in the figure: Cluster 1 becomes Class 1; Cluster 2 and Cluster 3 become Class 2)
• Then use all the data to learn a classifier as usual
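The labelling step can be sketched once a clustering is available. A minimal sketch (the clustering itself, e.g. k-means, is assumed to have been run already; the function name and the majority-vote rule for clusters with mixed labels are assumptions):

```python
import numpy as np

def cluster_then_label(cluster_ids, y_lab):
    """Give every point the majority label of the labelled points
    in its cluster.

    cluster_ids: (N,) cluster index for each point; the first
    R = len(y_lab) points are the labelled ones.
    Returns (N,) labels; clusters with no labelled point get -1.
    """
    R = len(y_lab)
    y_all = np.full(len(cluster_ids), -1)
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        labelled = members[members < R]
        if len(labelled) == 0:
            continue  # no labelled point in this cluster
        vals, counts = np.unique(y_lab[labelled], return_counts=True)
        y_all[members] = vals[counts.argmax()]
    return y_all
```

The resulting labels are then used to train a classifier on all the data as usual.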
23. Graph-based Approach
• How do we know that $x^1$ and $x^2$ are connected by a high-density path? Represent the data points as a graph.
• The graph representation is sometimes natural, e.g. hyperlinks between webpages or citations between papers. Sometimes you have to construct the graph yourself.
24. Graph-based Approach: Graph Construction
(The images are from the tutorial slides of Amarnag Subramanya and Partha Pratim Talukdar)
• Define the similarity $s(x^i, x^j)$ between $x^i$ and $x^j$
• Add edges by:
  • K nearest neighbors
  • e-neighborhood
• Edge weight is proportional to $s(x^i, x^j)$, e.g. the Gaussian radial basis function:
$$s(x^i, x^j) = \exp\left(-\gamma \left\| x^i - x^j \right\|^2\right)$$
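KNN-based graph construction with RBF weights can be sketched as follows (the symmetrization rule, keeping an edge if either endpoint selects it, is one common choice; the slide does not specify it):

```python
import numpy as np

def build_knn_graph(X, k=2, gamma=1.0):
    """Weighted KNN graph with Gaussian RBF edge weights
    s(x^i, x^j) = exp(-gamma * ||x^i - x^j||^2)."""
    n = len(X)
    # All pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of i (position 0 is i itself, so skip it).
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-gamma * d2[i, nbrs])
    return np.maximum(W, W.T)  # symmetrize
```

For an e-neighborhood graph one would instead keep every pair with `d2 < e**2`; the weight function is the same.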
25. Graph-based Approach
• The labelled data influence their neighbors, and the labels propagate through the graph: a single labelled Class-1 point can spread "Class 1" along the connected region around it.
26. Graph-based Approach
• Define the smoothness of the labels on the graph, over all data (labelled or not):
$$S = \frac{1}{2} \sum_{i,j} w_{i,j} \left( y^i - y^j \right)^2$$
Smaller means smoother.
• Example (4-node graph with edge weights $w_{1,2}=2$, $w_{1,3}=3$, $w_{2,3}=1$, $w_{3,4}=1$): the labelling $y^1{=}1, y^2{=}1, y^3{=}1, y^4{=}0$ gives $S = 0.5$, while $y^1{=}0, y^2{=}1, y^3{=}1, y^4{=}0$ gives $S = 3$; the first is smoother.
27. Graph-based Approach
• The smoothness can be written with the Graph Laplacian:
$$S = \frac{1}{2} \sum_{i,j} w_{i,j} \left( y^i - y^j \right)^2 = \mathbf{y}^T L \mathbf{y}$$
where $\mathbf{y} = [\cdots\, y^i \,\cdots\, y^j \,\cdots]^T$ is an (R+U)-dim vector over all data and $L = D - W$ is the (R+U) x (R+U) Graph Laplacian. For the example graph:
$$W = \begin{bmatrix} 0&2&3&0 \\ 2&0&1&0 \\ 3&1&0&1 \\ 0&0&1&0 \end{bmatrix} \qquad D = \begin{bmatrix} 5&0&0&0 \\ 0&3&0&0 \\ 0&0&5&0 \\ 0&0&0&1 \end{bmatrix}$$
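The identity $S = \mathbf{y}^T L \mathbf{y}$ can be checked numerically on the example graph. (Note: the slide's worked example counts each edge once, which gives half of $\mathbf{y}^T L \mathbf{y}$; the constant factor is absorbed into $\lambda$ and does not change which labelling is smoother.)

```python
import numpy as np

def label_smoothness(W, y):
    """S = y^T L y with L = D - W; equal to 1/2 * sum over all ordered
    pairs (i, j) of w_ij * (y_i - y_j)^2."""
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian D - W
    return y @ L @ y

# The 4-node example graph from the slide.
W = np.array([[0, 2, 3, 0],
              [2, 0, 1, 0],
              [3, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
```

`label_smoothness(W, [1, 1, 1, 0])` is smaller than `label_smoothness(W, [0, 1, 1, 0])`, matching the slide's "smaller means smoother".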
28. Graph-based Approach
• Since the predicted labels $y$ depend on the model parameters, the smoothness $S = \mathbf{y}^T L \mathbf{y}$ can be added to the loss as a regularization term:
$$L = \sum_{x^r} C(y^r, \hat{y}^r) + \lambda S$$
• The smoothness can be evaluated not only at the output but also at any hidden layer of the network.
J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," ICML, 2008
29. Semi-supervised Learning: Better Representation
去蕪存菁，化繁為簡 ("Remove the dross and keep the essence; make the complex simple")