Building Deep Reinforcement Learning Applications for Apache Spark with BigDL and Analytics Zoo

Deep reinforcement learning (DRL) is a thriving field on today's artificial intelligence battleground. DeepMind's AlphaGo, a highly successful DRL application, has drawn worldwide attention. Beyond game playing, DRL also has many practical industrial applications, such as autonomous driving, chatbots, financial investment, inventory management, and even recommender systems. Although DRL applications share some common ground with supervised computer vision or natural language processing tasks, they are unique in many ways.

1.Building Deep Reinforcement Learning Applications on Apache Spark with Analytics Zoo using BigDL. Yuhao Yang, Intel Data Analytics Technologies

2.Agenda
• Analytics Zoo overview
• Reinforcement learning overview
• Reinforcement learning with Analytics Zoo
• Future directions

3.Analytics Zoo
• Analytics + AI platform for Apache Spark and BigDL
• Open source, Scala/Python, Spark 1.6 and 2.x
• Software stack: Analytics Zoo (high-level APIs, industry pipelines, app demos & utilities) on top of BigDL (MKL, tensors, layers, optim methods, all-reduce) on top of Apache Spark (RDD, DataFrame, Scala/Python)
• https://github.com/intel-analytics/analytics-zoo

4.Analytics Zoo high-level pipeline APIs
• nnframes: Spark DataFrames and ML Pipelines for DL
• Keras-style API
• autograd: custom layer/loss using auto differentiation
• Transfer learning
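As a rough sketch of what the nnframes API enables, the snippet below fits a small BigDL model directly on a Spark DataFrame. This is only an illustration under assumptions: the exact NNClassifier constructor arguments and setter names have varied across Analytics Zoo releases, and the DataFrame df (with "features" and "label" columns) is hypothetical.

    # Hedged sketch: fit a BigDL model on a Spark DataFrame via nnframes.
    # The NNClassifier(model, criterion, feature_size) signature is an assumption
    # that may differ between Analytics Zoo versions; df is a hypothetical
    # DataFrame with "features" and "label" columns.
    from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
    from bigdl.nn.criterion import ClassNLLCriterion
    from zoo.pipeline.nnframes import NNClassifier

    model = Sequential().add(Linear(10, 32)).add(ReLU()).add(Linear(32, 2)).add(LogSoftMax())
    classifier = (NNClassifier(model, ClassNLLCriterion(), [10])
                  .setLearningRate(0.01)
                  .setBatchSize(64)
                  .setMaxEpoch(5))
    nn_model = classifier.fit(df)           # returns a fitted model (a Spark Transformer)
    predictions = nn_model.transform(df)    # adds a "prediction" column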

5.Analytics Zoo built-in deep learning pipelines & models
• Object detection: API and pre-trained SSD and Faster-RCNN
• Image classification: API and pre-trained VGG, Inception, ResNet, MobileNet, etc.
• Text classification: API with CNN, LSTM and GRU
• Recommendation: API with NCF, Wide & Deep, etc.

6.Analytics Zoo end-to-end reference use cases
• reinforcement learning
• anomaly detection
• sentiment analysis
• fraud detection
• image augmentation
• object detection
• variational autoencoder
• …

7.Reinforcement Learning (RL) • RL is for Decision-making

8.Examples of RL applications
• Play: Atari, poker, Go, …
• Interact with users: recommendation, healthcare, chatbots, personalization, …
• Control: autonomous driving, robotics, finance, …

9.Deep Reinforcement Learning (DRL)
• An agent takes actions (a) in state (s) and receives rewards (R)
• The goal is to find the policy (π) that maximizes future rewards
http://people.csail.mit.edu/hongzi/content/publications/DeepRM-HotNets16.pdf
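For concreteness, the objective this refers to is the standard discounted return (standard textbook form, added here rather than taken from the slide): with discount factor γ ∈ (0, 1], the agent looks for the policy that maximizes the expected return

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},
\qquad
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[ G_t \right].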

10.Cartpole

11.Approaches to Reinforcement Learning • Value-based RL • Estimate the optimal value function Q*(S,A) • Output of the Neural network is the value for Q(S, A) • Policy-based RL • Search directly for the optimal policy π* • Output of the neural network is the probability of each action. • Model-based RL
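Written out (standard definitions added for clarity, not copied from the slide), the two model-free approaches target different objects:

% Value-based: learn the optimal action-value function and act greedily on it
Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}\!\left[ G_t \mid s_t = s,\; a_t = a,\; \pi \right],
\qquad
a^{*} = \arg\max_{a} Q^{*}(s, a)

% Policy-based: parameterize the policy itself and optimize its parameters directly
\pi_{\theta}(a \mid s) \;\xrightarrow{\;\text{optimize } \theta\;}\; \pi^{*}(a \mid s)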

12.DRL algo

13.Examples • 1. Simple DQN to demonstrate the API and training with Spark RDDs. • 2. Distributed REINFORCE

14.Q-network https://ai.intel.com/demystifying-deep-reinforcement-learning/

15.Bellman Equation http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf
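The slide body is an image; the equation it points to is the standard Bellman optimality equation for Q* (standard form, not copied from the slide). It also yields the squared temporal-difference error that the DQN replay step on the following slides minimizes, using the same network (no separate target network) for the bootstrap term:

Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right],
\qquad
L(\theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^{2}.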

16.DQN critical routines

    # Outer loop: play episodes, store transitions in replay memory, and train.
    for e in range(EPISODES):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        for time in range(500):
            action = agent.act(state)                        # epsilon-greedy action
            next_state, reward, done, _ = env.step(action)
            reward = reward if not done else -10             # penalize ending the episode
            next_state = np.reshape(next_state, [1, state_size])
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)                     # train on a sampled minibatch
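The loop calls agent.act and agent.remember, which the slides do not show. A minimal sketch of what such an agent could look like (plain Python/NumPy, an assumption rather than the talk's actual code; replay is the method shown on the next slide):

    # Hedged sketch of the agent referenced above (assumed, not from the slides).
    import random
    from collections import deque
    import numpy as np

    class DQNAgent:
        def __init__(self, state_size, action_size, model):
            self.state_size = state_size
            self.action_size = action_size
            self.model = model                 # Keras-style model (see the later slides)
            self.memory = deque(maxlen=2000)   # replay memory of (s, a, r, s', done)
            self.gamma = 0.95                  # discount factor
            self.epsilon = 1.0                 # exploration rate
            self.epsilon_min = 0.01
            self.epsilon_decay = 0.995

        def act(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise pick argmax Q.
            if np.random.rand() <= self.epsilon:
                return random.randrange(self.action_size)
            q_values = self.model.predict_local(state)   # driver-side prediction
            return int(np.argmax(q_values[0]))

        def remember(self, state, action, reward, next_state, done):
            # Store the transition and decay exploration over time.
            self.memory.append((state, action, reward, next_state, done))
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay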

17.Parallelize the neural network training

    def replay(self, batch_size):
        # Seed rows for np.vstack (4-dim state, 2 Q-values), as on the slide.
        X_batch = np.array([0, 0, 0, 0])
        y_batch = np.array([0, 0])
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # Q-learning target: r + gamma * max_a' Q(s', a') for non-terminal steps.
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict_local(next_state)[0]))
            target_f = self.model.predict_local(state)
            target_f[0][action] = target
            X_batch = np.vstack((X_batch, state))
            y_batch = np.vstack((y_batch, target_f))
        # Distribute the minibatch as a Spark RDD and train with BigDL on the cluster.
        rdd_sample = to_RDD(X_batch, y_batch)
        self.model.fit(rdd_sample, None, nb_epoch=10, batch_size=batch_size)
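to_RDD is not shown in the slides; one possible implementation (a sketch under assumptions: the model is trained on an RDD of BigDL Sample objects, and a SparkContext sc is in scope) is:

    # Hedged sketch of the to_RDD helper used above (assumed, not from the talk).
    import numpy as np
    from bigdl.util.common import Sample

    def to_RDD(X_batch, y_batch):
        # Drop the seed rows used to start np.vstack, pair each state with its
        # target Q-values as a BigDL Sample, and parallelize into a Spark RDD.
        samples = [Sample.from_ndarray(np.asarray(x), np.asarray(y))
                   for x, y in zip(X_batch[1:], y_batch[1:])]
        return sc.parallelize(samples)    # sc: existing SparkContext (assumed in scope)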

18.Analytics Zoo Keras-style Model
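The slide shows the model as an image; a plausible reconstruction of the CartPole Q-network with the Analytics Zoo Keras-style API (a sketch under assumptions: import paths and layer arguments follow the Keras 1.2-style API exposed by Analytics Zoo, and the 4-dimensional input / 2 outputs come from CartPole):

    # Hedged sketch of the CartPole Q-network in the Analytics Zoo Keras-style API.
    from zoo.pipeline.api.keras.models import Sequential
    from zoo.pipeline.api.keras.layers import Dense

    model = Sequential()
    model.add(Dense(24, activation="relu", input_shape=(4,)))   # 4-dim CartPole state
    model.add(Dense(24, activation="relu"))
    model.add(Dense(2))                                         # one Q-value per action
    model.compile(optimizer="adam", loss="mse")                 # squared TD error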

19.Vanilla DQN

20.Policy gradients • In policy gradients, we usually use a neural network (or another function approximator) to directly model the action probabilities. • We tweak the parameters θ of the neural network so that "good" actions become more likely to be sampled in the future.
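Written out (standard REINFORCE form, not copied from the slide), the gradient that makes "good" actions more likely weights the log-probability of each sampled action by the return that followed it:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\; G_t \right],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_{\theta} J(\theta).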

21.REINFORCE

22.Time breakdown • Game playing takes the most time in each iteration

23. Distributed REINFORCE

    # Create and cache several agents on each partition, as specified by parallelism.
    with DistributedAgents(sc, create_agent=create_agent, parallelism=parallelism) as a:
        agents = a.agents                                    # a.agents is an RDD[Agent]
        optimizer = None
        num_trajs_per_part = int(math.ceil(15.0 / parallelism))
        mean_std = []
        for i in range(60):
            # Play games in parallel on the executors and collect the trajectories.
            with SampledTrajs(sc, agents, model,
                              num_trajs_per_part=num_trajs_per_part) as trajs:
                # trajs.samples is an RDD[Trajectory]
                trajs = trajs.samples \
                    .map(lambda traj: (traj.data["observations"],
                                       traj.data["actions"],
                                       traj.data["rewards"]))
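The snippet stops after collecting (observations, actions, rewards) per trajectory; the usual next REINFORCE step, sketched below in plain NumPy (an addition for clarity, not the talk's code), turns the raw per-step rewards into discounted, normalized returns that weight each (state, action) training sample:

    # Hedged sketch: convert raw rewards into discounted, normalized returns.
    import numpy as np

    def discounted_returns(rewards, gamma=0.99):
        returns = np.zeros(len(rewards), dtype=np.float32)
        running = 0.0
        # Walk backwards so each step accumulates its discounted future reward.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        # Normalizing across the trajectory reduces gradient variance.
        return (returns - returns.mean()) / (returns.std() + 1e-8)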

24. REINFORCE algorithm
• Workflow of a PG program: play N games and collect samples and targets; prepare training samples (X, Y); train and update the model; loop for N updates and exit.
• X = state/observation: a 4-dimensional vector in the CartPole game (for other games the input may be an arbitrary image). Y = (action, reward) pair.
• Network: Linear(4, 24), ReLU, Linear(24, 24), ReLU, Linear(24, 1), Sigmoid. The output (prob) is a single node because there are only 2 actions in the CartPole game.
• VanillaPGCriterion: minimize −1 ∗ reward ∗ (y − prob).
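One way to read the criterion (my interpretation, not taken from the Analytics Zoo source): minimizing the reward-weighted cross-entropy between the taken action y and the predicted probability prob produces exactly the slide's quantity as the gradient with respect to the pre-sigmoid output z, so high-reward actions pull prob toward themselves:

L = -\,\mathrm{reward} \cdot \bigl[\, y \log(\mathrm{prob}) + (1 - y)\,\log(1 - \mathrm{prob}) \,\bigr],
\qquad
\frac{\partial L}{\partial z} = -\,\mathrm{reward} \cdot (y - \mathrm{prob}),
\quad \mathrm{prob} = \sigma(z).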

25.Distributed REINFORCE

26.Other RL algorithms • Flappy bird with DQN • Discrete and continuous PPO • A2C (in roadmap)

27.Q&A
• Analytics Zoo: high-level APIs, industry pipelines, app demos & utilities
• https://github.com/intel-analytics/analytics-zoo
• Thanks to Shane Huang and Yang Wang for working on the RL implementations.