Introduction to Deep Reinforcement Learning (DRL)

Table of Contents

  • Background: Machine Learning Method
  • Methodology
  • Example: DQL Implementation
  • Comment
  • Reference

Background: Machine Learning Method

There are three typical types of machine learning methods:

  1. Supervised Learning: given labeled data, train the model to predict the correct result.

*(Figure: Supervised Learning)*

  2. Unsupervised Learning: given unlabeled data, train the model to find underlying patterns in the data.

*(Figure: Unsupervised Learning)*

  3. Reinforcement Learning: get feedback (state and reward) from interacting with the environment, and adjust the next action to the environment to maximize the expected reward.

*(Figure: Reinforcement Learning)*

Method comparison:

| Method | Input Data | Output Result | Types of Problem | Application |
| --- | --- | --- | --- | --- |
| Supervised Learning | Labeled data | Prediction result | Classification; Regression | Risk evaluation; Forecasting |
| Unsupervised Learning | Unlabeled data | Underlying pattern | Clustering | Recommendation; Anomaly detection |
| Reinforcement Learning | Feedback from the environment | Action to the environment | Exploration and exploitation | Self-driving cars; Gaming |

Methodology

Reinforcement Learning

As mentioned previously, Reinforcement Learning gets feedback from interacting with the environment, without requiring predefined data. It is a goal-oriented method in which an agent tries to come up with the best action given a state. One of the most important issues in Reinforcement Learning is the design of the reward function, which influences how fast the agent learns from interacting with the environment.

For example, the ultimate goal for a dog (agent) is to catch a frisbee thrown by a kid. The closer the dog is to the frisbee, the more reward it gets. This reward function affects the dog's subsequent actions. The dog knows where it is (state) and how much reward it got for its previous action. All of these results are saved as the dog's experience for deciding the next action.
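
Formally, the goal the agent optimizes is usually written as the expected discounted return. This formula is not spelled out in the original text, but it is the standard formulation and is consistent with the discount factor gamma used in the code later:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1

where gamma is the reward discount factor: the closer gamma is to 1, the more the agent cares about long-term reward.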

Q-Learning

Q-learning is a model-free Reinforcement Learning algorithm. In Reinforcement Learning, the agent learns from experience. In Q-Learning, each state and action are viewed as inputs to a Q-function, which outputs the corresponding Q-value (the expected future reward). Besides, these experiences are saved to a Q-table as a reference for the agent to decide the best action.

*(Figure: Q-table Mechanism)*
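
As a concrete illustration, the classic tabular update rule, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)), takes only a few lines of Python. The state/action counts and hyperparameters below are made up for the example:

import numpy as np

n_states, n_actions = 10, 3          # hypothetical small, discrete problem
alpha, gamma = 0.1, 0.9              # learning rate, reward discount factor
Q = np.zeros((n_states, n_actions))  # the Q-table

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])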

Deep Q-Learning (DQL)

In Q-Learning, the experience learned by the agent is saved to the Q-table; however, when the problem scale grows, the Q-table becomes inefficient. Take playing games as an example: the action space and the state space are too large to handle. To deal with this problem, a neural network is used to approximate the Q-value for each action given a state.

*(Figure: Deep Q-Learning)*

Example: DQL Implementation

Environment: OpenAI Gym: MountainCar-v0
Description: The agent (a car) starts at the bottom of a valley. For any given state, the agent may choose to accelerate to the left, accelerate to the right, or cease any acceleration.

*(Figure: Mountain Car)*
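
Before training, it helps to inspect the environment's spaces. The comments below describe what the spaces contain in MountainCar-v0; the exact printed representation depends on your gym version:

import gym

env = gym.make('MountainCar-v0')
print(env.observation_space)  # 2-dimensional Box: [car position, car velocity]
print(env.action_space)       # Discrete(3): 0 = push left, 1 = no push, 2 = push right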

The code shown below is adapted from Reinforcement Learning 進階篇:Deep Q-Learning.

Import modules: PyTorch is used to build the neural network.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import matplotlib.pyplot as plt

Neural Network Structure:

class Net(nn.Module):
    def __init__(self, n_states, n_actions, n_hidden):
        super(Net, self).__init__()

        # fc1 maps the input state to the hidden layer; out maps the hidden layer to action values
        self.fc1 = nn.Linear(n_states, n_hidden)
        self.out = nn.Linear(n_hidden, n_actions)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x) # ReLU activation
        actions_value = self.out(x)
        return actions_value
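
A quick sanity check of this network, reusing the torch import above and the same dimensions as the MountainCar setting below (2 state variables, 3 actions, 20 hidden units):

net = Net(n_states=2, n_actions=3, n_hidden=20)
state = torch.randn(1, 2)   # a batch containing one 2-dimensional state
print(net(state).shape)     # torch.Size([1, 3]): one Q-value per action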

Deep Q-Network module (skeleton version); for more detail, see Deep Q-Learning:

class DQN(object):
    def __init__(self, n_states, n_actions, n_hidden, batch_size, lr,
                 epsilon, gamma, target_replace_iter, memory_capacity):
        # Create the evaluation network, the target network, and the replay memory
        pass

    def choose_action(self, state):
        # Choose an action (epsilon-greedy) according to the state
        pass

    def store_transition(self, state, action, reward, next_state):
        # Store the experience (state, action, reward, next_state) to memory
        pass

    def learn(self):
        # Sample a minibatch from memory, update the evaluation network,
        # and periodically sync the target network
        pass
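
For completeness, here is a minimal sketch of how these four methods could be filled in, following the general structure of the referenced tutorial and reusing the Net class and imports above. Treat it as one possible implementation, not the original author's exact code:

class DQN(object):
    def __init__(self, n_states, n_actions, n_hidden, batch_size, lr,
                 epsilon, gamma, target_replace_iter, memory_capacity):
        self.eval_net = Net(n_states, n_actions, n_hidden)
        self.target_net = Net(n_states, n_actions, n_hidden)
        # Each memory row stores one transition: (state, action, reward, next_state)
        self.memory = np.zeros((memory_capacity, n_states * 2 + 2))
        self.memory_counter = 0
        self.learn_step_counter = 0
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=lr)
        self.loss_func = nn.MSELoss()
        self.n_states, self.n_actions = n_states, n_actions
        self.batch_size, self.epsilon, self.gamma = batch_size, epsilon, gamma
        self.target_replace_iter = target_replace_iter
        self.memory_capacity = memory_capacity

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.uniform() < self.epsilon:
            return np.random.randint(0, self.n_actions)
        x = torch.unsqueeze(torch.FloatTensor(state), 0)
        return self.eval_net(x).argmax(dim=1).item()

    def store_transition(self, state, action, reward, next_state):
        transition = np.hstack((state, [action, reward], next_state))
        index = self.memory_counter % self.memory_capacity  # overwrite the oldest
        self.memory[index, :] = transition
        self.memory_counter += 1

    def learn(self):
        # Sample a random minibatch of stored transitions
        sample_index = np.random.choice(self.memory_capacity, self.batch_size)
        b_memory = self.memory[sample_index, :]
        b_state = torch.FloatTensor(b_memory[:, :self.n_states])
        b_action = torch.LongTensor(b_memory[:, self.n_states:self.n_states + 1].astype(int))
        b_reward = torch.FloatTensor(b_memory[:, self.n_states + 1:self.n_states + 2])
        b_next_state = torch.FloatTensor(b_memory[:, -self.n_states:])

        # Q(s, a) from the evaluation network vs. the bootstrapped target
        # r + gamma * max_a' Q_target(s', a') from the (frozen) target network
        q_eval = self.eval_net(b_state).gather(1, b_action)
        q_next = self.target_net(b_next_state).detach()
        q_target = b_reward + self.gamma * q_next.max(1)[0].view(self.batch_size, 1)
        loss = self.loss_func(q_eval, q_target)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Periodically copy evaluation-network weights into the target network
        self.learn_step_counter += 1
        if self.learn_step_counter % self.target_replace_iter == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())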

Since the default reward is too sparse to guide learning, I revise it to accelerate the training process.

  • Default reward:
    Agent reached the flag (position = 0.5): 0
    Position of the agent is less than 0.5: -1
  • Adjusted reward:
    Distance between the agent and the flag = pos - 0.5 (negative)
    Velocity of the agent = vel (positive if the agent moves toward the flag)
    Reward = (pos - 0.5) + vel

if __name__ == '__main__':
    env = gym.make('MountainCar-v0')
    
    # Environment parameters
    n_actions = env.action_space.n
    n_states = env.observation_space.shape[0]
    
    # Hyper parameters
    n_hidden = 20
    batch_size = 32
    lr = 0.1                 # learning rate
    epsilon = 0.1             # epsilon-greedy
    gamma = 0.9               # reward discount factor
    target_replace_iter = 100 # target network update frequency
    memory_capacity = 2000
    n_episodes = 200
    
    # create DQN
    dqn = DQN(n_states, n_actions, n_hidden, batch_size, lr, epsilon, gamma, target_replace_iter, memory_capacity)
    pos_his, reward_his = [], []
    
    # train DQN
    for i_episode in range(n_episodes):
        t = 0
        rewards = 0
        best_pos = -1.2 # min position defined in 'MountainCar-v0'
        state = env.reset()
        while True:
            env.render()
    
            # choose action
            action = dqn.choose_action(state)
            next_state, reward, done, info = env.step(action)
            
            # revise reward to accelerate training process
            pos, vel = next_state
            r1 = pos-0.5 # better to make the car closer to the flag
            r2 = vel
            reward = r1+r2
            
            # save experience
            dqn.store_transition(state, action, reward, next_state)
            
            # record best position happened during steps
            best_pos = pos if (pos > best_pos) else best_pos
    
            # accumulate reward
            rewards += reward
    
            # train the model after gathering enough experience
            if dqn.memory_counter > memory_capacity:
                dqn.learn()
    
            # go to next state
            state = next_state
    
            if done:
                pos_his.append(best_pos)
                reward_his.append(rewards)
                print(f'{i_episode+1} Episode finished after {t+1} timesteps, total rewards {rewards}')
                break
    
            t += 1
            
    env.close()
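
Note that this script targets the classic gym API (pre-0.26), matching the referenced tutorial. If you run it on a newer gym release or on gymnasium, the reset/step signatures differ; roughly:

# Classic gym (used above):
#   state = env.reset()
#   next_state, reward, done, info = env.step(action)
#
# gymnasium / gym >= 0.26:
#   state, info = env.reset()
#   next_state, reward, terminated, truncated, info = env.step(action)
#   done = terminated or truncated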

Result Plots:

  • Default Reward

*(Figure: Best Car Position in each Episode, Default Reward)*

Since the reward takes only two values (-1 and 0), the total reward in each episode is the same whenever the car fails to reach the flag.

*(Figure: Total Reward in each Episode, Default Reward)*

  • Adjusted Reward

*(Figure: Best Car Position in each Episode, Adjusted Reward)*

*(Figure: Total Reward in each Episode, Adjusted Reward)*

Notice that because the reward functions differ between the two cases, their total reward values are not directly comparable.

Comment

  1. According to the results of the DQL implementation, it is clear that the reward function has a great effect on the agent's actions. A well-designed reward function lets an agent learn from the environment more efficiently.
  2. In Deep Q-Learning, a neural network is used to deal with large-scale problems. However, the model becomes more complex, which reduces its interpretability. Besides, there are more hyperparameters to tune in a Deep Q-Network model.

Reference

  • Machine Learning Method: Supervised vs. Unsupervised vs. Reinforcement
  • Reinforcement Learning: Reinforcement Learning 健身房:OpenAI Gym
  • Deep Reinforcement Learning: A Beginner's Guide to Deep Reinforcement Learning
  • Deep Q-Learning: Reinforcement Learning 進階篇:Deep Q-Learning
  • OpenAI Gym: MountainCarEnv