Model-based RL
1. MODEL-BASED REINFORCEMENT LEARNING
2. REFERENCES The material for this talk is drawn from the slides, notes, and lectures of several offerings of the CS181 course at Harvard University: CS181 Intelligent Machines: Perception, Learning and Uncertainty, Sarah Finney, Spring 2009; CS181 Intelligent Machines: Perception, Learning and Uncertainty, Prof. David C. Brooks, Spring 2011; CS181 – Machine Learning, Prof. Ryan P. Adams, Spring 2014, https://github.com/wihl/cs181spring2014; CS181 – Machine Learning, Prof. David Parkes, Spring 2017, https://harvard-ml-courses.github.io/cs181-web-2017/. As well as notes and lectures from the Stanford course CS229: CS229 – Machine Learning, Andrew Ng, https://see.stanford.edu/Course/CS229
3. OVERVIEW
1. Overview
o Types of Machine Learning
o Markov Decision Processes
o Reinforcement Learning
o Applications
2. Review of MDP Algorithms
o The Bellman equations
o Expectimax (finite horizon)
o Value Iteration (finite horizon)
o Value Iteration (infinite horizon)
o Policy Iteration (infinite horizon)
3. Reinforcement Learning
o The basic idea
o Model-based RL
o Learning the reward and transition probabilities
o Credit assignment
o Exploration vs. exploitation
4. Next Time
o Q-learning
4. TYPES OF MACHINE LEARNING There are (at least) 3 broad categories of machine learning problems: Supervised Learning: Data = {(x₁, y₁), …, (xₙ, yₙ)} e.g., linear regression, decision trees, SVMs. Unsupervised Learning: Data = {x₁, …, xₙ} e.g., K-means, HAC, Gaussian mixture models. Reinforcement Learning: Data = (s₁, a₁, r₁), (s₂, a₂, r₂), … an agent learns to act in an uncertain environment by training on data that are sequences of state, action, reward.
5. MARKOV DECISION PROCESSES • Markov Decision Processes provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. • The initial analysis of MDPs assumes complete knowledge of states, actions, rewards, transitions, and discounts. <https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=855934986> [accessed 11 September 2018]
6. MARKOV DECISION PROCESSES • States: s₁, …, sₙ • Actions: a₁, …, aₘ • Reward function: r(s, a) ∈ ℝ • Transition model: p(s′ | s, a) • Discount factor: γ ∈ [0, 1] <https://en.wikipedia.org/w/index.php?title=Markov_decision_process&oldid=855934986> [accessed 11 September 2018]
7. MDP GOAL: FIND AN OPTIMAL POLICY π GOAL: find a policy π that tells you what action to take in each state. We want to find 'rewarding' policies. An n-state policy: π(s₁) = a₁, π(s₂) = a₂, π(s₃) = a₃, …, π(sₙ) = aₙ
8. REINFORCEMENT LEARNING • The agent knows the current state s, takes action a, and gets reward r. • There is no access to the reward model r(s, a) or the transition model p(s′ | s, a). • The agent only sees the outcome reward r and the next state s′. • Under these conditions, learning π is a very challenging problem.
9. REVIEW OF MDP ALGORITHMS
10. REVIEW OF MDP ALGORITHMS There are 4 standard MDP algorithms that assume complete knowledge: • Expectimax (finite horizon, 𝛾 = 1) • Value Iteration (finite horizon, 𝛾 = 1) • Value Iteration (infinite horizon, 𝛾 ∈ [0,1)) • Policy Iteration (infinite horizon , 𝛾 ∈ [0,1))
11. BELLMAN EQUATIONS • The planning problem for an MDP is: π* ∈ argmax_π V^π(s) • Bellman equations: V*(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V*(s′) ], ∀s • All MDP algorithms use some variant of the Bellman equations.
12. EXPECTIMAX: TOP-DOWN, RECURSIVE • Build out a lookahead tree to the decision horizon; take the max over actions and the expectation over next states. • Solve from the leaves, backing up the expectimax values. • Problems: (1) computation is exponential in the horizon; (2) the same subtree may be expanded multiple times.
13. EXPECTIMAX: A GAME AGAINST NATURE • Expectimax is like a game-playing algorithm, except the opponent is nature. • Expectimax is closely related to the minimax algorithm used in game theory, but nature's response is probabilistic rather than adversarial. • Nodes where you move: S • Nodes where nature moves: ⟨S, A⟩
14. EXPECTIMAX (Finite Horizon T)
function EXPECTIMAX(s, t)
    if t = 0 or s is terminal then return 0
    else return max_{a∈A} [ R(s, a) + Σ_{s′} P(s′ | s, a) EXPECTIMAX(s′, t−1) ]
π_T(s) = argmax_{a∈A} [ R(s, a) + Σ_{s′} P(s′ | s, a) EXPECTIMAX(s′, T−1) ]
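The recursion above can be sketched in a few lines of Python. The 2-state, 2-action MDP below is a made-up toy for illustration, not one from the slides; `R[s][a]` is the reward and `P[s][a][s2]` the transition probability.

```python
# Hypothetical toy MDP: 2 states, 2 actions (illustrative only)
R = [[1.0, 0.0],   # rewards in state 0 for actions 0/1
     [0.0, 2.0]]   # rewards in state 1 for actions 0/1
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s=0][a][s'] transition probabilities
     [[0.5, 0.5], [0.1, 0.9]]]   # P[s=1][a][s']

def expectimax(s, steps):
    """Optimal value of state s with `steps` decisions remaining."""
    if steps == 0:                       # horizon reached: no more reward
        return 0.0
    # max over actions of immediate reward plus expected future value
    return max(
        R[s][a] + sum(P[s][a][s2] * expectimax(s2, steps - 1)
                      for s2 in range(len(R)))
        for a in range(len(R[s])))
```

Note that `expectimax(0, T)` recomputes the value of the same state at the same depth many times, which is exactly the redundant-subtree problem the slide mentions; memoizing on `(s, steps)` turns this into finite-horizon value iteration.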
15. VALUE ITERATION (Finite Horizon T) • Value iteration with a finite horizon works bottom-up using dynamic programming. • The idea is to break up the problem by the number of steps to go. • Start with the base-case values with no timesteps to go. • Given the optimal values for k−1 steps to go, compute the values for k steps to go.
16. VALUE ITERATION (Finite Horizon T)
1. For each state s, initialize V₀(s) = 0
2. For k ← 1 … T { For every state s: V_k(s) = max_{a∈A} [ R(s, a) + Σ_{s′} P(s′ | s, a) V_{k−1}(s′) ] }
3. π_T(s) = argmax_{a∈A} [ R(s, a) + Σ_{s′∈S} P(s′ | s, a) V_{T−1}(s′) ]
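The bottom-up computation above can be sketched as follows, again on a hypothetical 2-state, 2-action toy MDP (not from the slides). Each pass replaces the (k−1)-steps-to-go values with the k-steps-to-go values.

```python
# Hypothetical toy MDP (illustrative only)
R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def value_iteration_finite(R, P, T):
    """Finite-horizon value iteration: returns V_T for every state."""
    n = len(R)
    V = [0.0] * n                        # base case V_0: no steps to go
    for _ in range(T):
        # one Bellman backup: V_k from V_{k-1}
        V = [max(R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(n))
                 for a in range(len(R[s])))
             for s in range(n)]
    return V
```

Unlike the recursive expectimax, each (state, steps-to-go) value is computed exactly once, so the cost is O(T · |S|² · |A|) rather than exponential in the horizon.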
17. VALUE ITERATION (∞ Horizon, γ ∈ [0,1)) • Value iteration with an infinite horizon is a generalization of value iteration with a finite horizon. • The main change is that the discount factor γ is set to less than one, discounting the value of states farther in the future. • It can be shown that V will converge over time to V*. • Iterate the update equation until the changes in V fall below some ε.
18. VALUE ITERATION (∞ Horizon, γ ∈ [0,1))
1. For each state s, initialize V(s) = 0
2. Repeat until convergence { For every state s: V′(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V(s′) ]; then V ← V′ }
3. π*(s) = argmax_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V(s′) ]
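A minimal sketch of the infinite-horizon version, with the ε-based stopping test and the greedy policy extraction from step 3. The toy MDP is again hypothetical, not from the slides.

```python
def value_iteration(R, P, gamma=0.9, eps=1e-8):
    """Infinite-horizon value iteration; returns (values, greedy policy)."""
    n = len(R)
    V = [0.0] * n
    while True:
        # one Bellman optimality backup
        Vn = [max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                  for a in range(len(R[s])))
              for s in range(n)]
        delta = max(abs(x - y) for x, y in zip(Vn, V))
        V = Vn
        if delta < eps:                  # changes in V fell below epsilon
            break
    # step 3: extract the greedy policy from the converged values
    policy = [max(range(len(R[s])),
                  key=lambda a: R[s][a] + gamma * sum(P[s][a][t] * V[t]
                                                      for t in range(n)))
              for s in range(n)]
    return V, policy

# Hypothetical toy MDP (illustrative only)
R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
V, pi = value_iteration(R, P)
```

On this toy model the greedy policy takes action 1 in both states: state 1 pays reward 2 and mostly stays put, and action 1 in state 0 is the fastest route there.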
19. POLICY ITERATION (∞ Horizon, γ ∈ [0,1)) • In value iteration, the policy often stops changing long before the values converge. • Policy iteration iterates on the policy rather than on V. • Policy iteration can converge in many fewer iterations than value iteration. • However, each loop iteration of policy iteration takes much longer than one of value iteration. • Which is faster in practice is debated: the Harvard CS181 notes say policy iteration is faster, while Andrew Ng says value iteration is faster.
20. POLICY ITERATION (∞ Horizon, γ ∈ [0,1))
1. Initialize policy π randomly
2. Repeat until policy π does not change {
a) Solve V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} P(s′ | s, π(s)) V^π(s′) for all s
b) For every state s: π′(s) = argmax_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V^π(s′) ]
c) π ← π′ }
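The evaluate-then-improve loop can be sketched as below, on the same kind of hypothetical toy MDP. One liberty taken: step (a) is a linear system, and here it is solved by fixed-point iteration on the Bellman expectation equation rather than by a direct linear solve, to keep the sketch dependency-free.

```python
def policy_iteration(R, P, gamma=0.9, eps=1e-10):
    """Policy iteration; returns (optimal policy, its values)."""
    n = len(R)
    pi = [0] * n                         # arbitrary initial policy
    while True:
        # (a) policy evaluation: fixed-point iteration on V^pi
        # (a direct linear solve of the |S| equations would also work)
        V = [0.0] * n
        while True:
            Vn = [R[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * V[t]
                                            for t in range(n))
                  for s in range(n)]
            if max(abs(x - y) for x, y in zip(Vn, V)) < eps:
                break
            V = Vn
        # (b) policy improvement: greedy one-step lookahead on V^pi
        new_pi = [max(range(len(R[s])),
                      key=lambda a: R[s][a] + gamma * sum(P[s][a][t] * V[t]
                                                          for t in range(n)))
                  for s in range(n)]
        if new_pi == pi:                 # (c) stop when the policy is stable
            return pi, V
        pi = new_pi

# Hypothetical toy MDP (illustrative only)
R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
pi_star, V_star = policy_iteration(R, P)
```

On this toy problem the outer loop stabilizes after only a handful of policy changes, illustrating the slide's point that the policy settles long before the values would converge under plain value iteration.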
21. REINFORCEMENT LEARNING
22. REINFORCEMENT LEARNING: THE BASIC IDEA • Select an action • If action leads to reward, reinforce that action • If action leads to punishment, avoid that action • Basically, a computational form of Behaviorism (Pavlov, B. F. Skinner)
23. THE LEARNING FRAMEWORK • Learning is performed online: we learn as we interact with the world. • In contrast with supervised learning, there are no training or test sets; reward is accumulated over interactions with the environment. • The data are not fixed: more information is acquired as you go. • The training distribution can be influenced by the agent's action decisions.
24. CHALLENGES • Credit assignment problem: how do you know which actions were responsible for success or failure? • Exploration vs. exploitation: should you use your current model to collect rewards, or should you risk lower rewards now for a better model and higher rewards later?
25. MODEL-BASED REINFORCEMENT LEARNING o The underlying mechanism is a Markov Decision Process  assume states and actions are known  assume states are fully observable o Approach:  learn the MDP reward and transition parameters  solve the MDP to determine an optimal policy o Appropriate when the model is unknown but small enough to store and solve.
26. LEARN THE REWARD AND TRANSITION DISTRIBUTIONS • Try every action in each state a number of times • RTotal(a,s) = total reward received for action a in state s • N(a,s) = number of times action a was taken in state s • N(a,s,s′) = number of transitions from s to s′ on action a • R(s,a) = RTotal(a,s) / N(a,s) • T(a,s,s′) = N(a,s,s′) / N(a,s)
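These count-based estimates can be sketched as a small tabular estimator. The class name and method names below are invented for illustration; the counters correspond directly to RTotal(a,s), N(a,s), and N(a,s,s′) from the slide.

```python
from collections import defaultdict

class ModelEstimator:
    """Maximum-likelihood estimates of R(s,a) and T(a,s,s')
    from observed (s, a, r, s') transitions (hypothetical helper)."""

    def __init__(self):
        self.r_total = defaultdict(float)   # RTotal(a,s)
        self.n_sa = defaultdict(int)        # N(a,s)
        self.n_sas = defaultdict(int)       # N(a,s,s')

    def observe(self, s, a, r, s2):
        """Record one experienced transition."""
        self.r_total[(s, a)] += r
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1

    def R(self, s, a):
        """Average observed reward for (s, a)."""
        return self.r_total[(s, a)] / self.n_sa[(s, a)]

    def T(self, s, a, s2):
        """Empirical transition probability N(a,s,s') / N(a,s)."""
        return self.n_sas[(s, a, s2)] / self.n_sa[(s, a)]
```

Dividing the transition counts by the visit count N(a,s) guarantees that, for each visited (s, a), the estimated next-state probabilities sum to 1.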
27. REWARD PARAMETER TABLE (rows: states, columns: actions)
     a0         a1         a2
s0   R(s0,a0)   R(s0,a1)   R(s0,a2)
s1   R(s1,a0)   R(s1,a1)   R(s1,a2)
s2   R(s2,a0)   R(s2,a1)   R(s2,a2)
s3   R(s3,a0)   R(s3,a1)   R(s3,a2)
28. TRANSITION PARAMETER TABLE (for a fixed action a; rows: current state, columns: next state)
     s0           s1           s2           s3
s0   T(a,s0,s0)   T(a,s0,s1)   T(a,s0,s2)   T(a,s0,s3)
s1   T(a,s1,s0)   T(a,s1,s1)   T(a,s1,s2)   T(a,s1,s3)
s2   T(a,s2,s0)   T(a,s2,s1)   T(a,s2,s2)   T(a,s2,s3)
s3   T(a,s3,s0)   T(a,s3,s1)   T(a,s3,s2)   T(a,s3,s3)
29. ITERATIVE IMPROVEMENT o Alternate between learning the model and solving the model to determine the optimal policy o Why? • A poor policy may be expensive to keep following • We might want to avoid learning a perfect model everywhere o How often should one solve the MDP? • It depends on the relative costs of solving versus acting.
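The whole model-based loop, learn counts, periodically re-solve, act ε-greedily, can be sketched end to end. Everything here is a hypothetical toy: the 2-state environment, the ε of 0.1, and the choice to re-solve every 500 steps are illustrative assumptions, not values from the slides.

```python
import random

# Hypothetical true environment the agent CANNOT inspect directly;
# it only samples transitions from it.
TRUE_R = [[1.0, 0.0], [0.0, 2.0]]
TRUE_P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]

def step(s, a, rng):
    """Sample one (reward, next state) from the true environment."""
    s2 = rng.choices([0, 1], weights=TRUE_P[s][a])[0]
    return TRUE_R[s][a], s2

def solve(R, P, gamma=0.9, iters=500):
    """Value iteration on the *estimated* model; returns a greedy policy."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                 for a in range(2)) for s in range(n)]
    return [max(range(2), key=lambda a: R[s][a] + gamma *
                sum(P[s][a][t] * V[t] for t in range(n)))
            for s in range(n)]

rng = random.Random(0)
n_sa = [[0] * 2 for _ in range(2)]                      # N(a,s)
n_sas = [[[0] * 2 for _ in range(2)] for _ in range(2)]  # N(a,s,s')
r_tot = [[0.0] * 2 for _ in range(2)]                   # RTotal(a,s)
policy, s = [0, 0], 0
for t in range(2000):
    # epsilon-greedy: mostly exploit the current policy, sometimes explore
    a = rng.randrange(2) if rng.random() < 0.1 else policy[s]
    r, s2 = step(s, a, rng)
    r_tot[s][a] += r
    n_sa[s][a] += 1
    n_sas[s][a][s2] += 1
    s = s2
    if (t + 1) % 500 == 0:   # periodically re-solve the estimated MDP
        R_hat = [[r_tot[i][j] / max(n_sa[i][j], 1) for j in range(2)]
                 for i in range(2)]
        P_hat = [[[n_sas[i][j][k] / max(n_sa[i][j], 1) for k in range(2)]
                  for j in range(2)] for i in range(2)]
        policy = solve(R_hat, P_hat)
```

Re-solving every 500 steps rather than every step is one answer to the slide's "how often" question: solving is expensive relative to acting here, so the model is batched up between solves.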