Jaewoo Song


It has been a long time since I posted a tech post!

I’ve been so busy with my graduate studies, job hunting, and Ph.D. applications that I haven’t had the chance to invest enough time in writing a technical post.

Since this is my last semester and I finally have a bit of time to write, I’m going to start a new chapter with a new topic: Reinforcement Learning.

In this post, I’m going to work through the basic concept of Deep Q-Learning[1] and how to implement a DQN in TensorFlow as an agent for a simple Atari game.

This post is based on HuggingFace’s Deep RL course[2], which I’m taking right now.

If you want to learn Deep Reinforcement Learning step-by-step, I highly recommend this course! (Super helpful!)



Background: Q-learning

First, let’s start with the basics of Q-learning. (I’m assuming you already know the fundamentals of Reinforcement Learning.)

As you already know, Reinforcement Learning is a learning process in which an agent continuously interacts with an environment, modeled as a Markov Decision Process, and searches for the optimal policy that maximizes the sum of rewards it collects by sequentially taking actions.

In this context, a policy refers to a function that determines which action to take in a given state, aiming to maximize the total sum of rewards to be obtained in the future.

If we express the expected sum of rewards with the Bellman Equation, the optimal policy can be represented as follows:


\[\pi^{*}(s) = \underset{a}{\operatorname{arg max}} \ \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^{*}(s')] = \underset{a}{\operatorname{arg max}} \ Q^{*}(s,a)\]


Each component represents:

  • $T(s,a,s')$: The transition probability of reaching the state $s'$ when the agent takes an action $a$ at the state $s$.
  • $R(s,a,s')$: The reward obtained when the agent takes an action $a$ at the state $s$ and moves to the next state $s'$.
    • For simplicity, it is sometimes written as $R(s')$.
  • $\gamma V^*(s')$: The optimal result (sum of rewards) expected from the state $s'$, multiplied by the discount factor $\gamma$.
    • $V^*(s')$ is obtained one step later, so it should be discounted once.


According to these components, the optimal policy $\pi^*$ is a function that returns the action $a$ maximizing the expected value produced by taking $a$ at the state $s$.
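
To make this concrete, here is a minimal NumPy sketch for a made-up MDP with 2 states and 2 actions; the numbers in `T`, `R`, `V`, and `gamma` below are purely illustrative, not from any real environment:

```python
import numpy as np

gamma = 0.9  # discount factor (illustrative value)

# Made-up MDP with 2 states and 2 actions.
# T[s, a, s_next]: probability of reaching s_next after taking a in s
# R[s, a, s_next]: reward for the transition (s, a) -> s_next
# V[s_next]      : optimal value V*(s_next) of the next state
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
V = np.array([3.0, 5.0])

# Q*(s, a) = sum_{s'} T(s, a, s') * [R(s, a, s') + gamma * V*(s')]
Q = np.sum(T * (R + gamma * V), axis=-1)  # shape: (num_states, num_actions)

# pi*(s) = argmax_a Q*(s, a)
pi = np.argmax(Q, axis=-1)
print(Q, pi)
```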

In addition, the Q-value is the expected return from taking $a$ at each state $s$, so $\pi^*$ can also be interpreted as a function that finds the action $a$ maximizing the Q-value at the agent’s current state $s$, assuming that we know the optimal Q-value table for every pair of $s$ and $a$.
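
In other words, once such a table is known, acting optimally at any state is just a row lookup followed by an argmax over the actions. A tiny sketch with a made-up Q-table (3 states, 2 actions):

```python
import numpy as np

# Suppose the optimal Q-table is already known (all numbers are made up).
# Rows correspond to states and columns to actions.
Q_table = np.array([[1.2, 0.4],
                    [0.3, 2.1],
                    [0.9, 0.8]])

current_state = 2
best_action = np.argmax(Q_table[current_state])  # pi*(s) for the current state
print(best_action)  # -> 0, since Q(2, 0) = 0.9 > Q(2, 1) = 0.8
```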



Background: Deep Q-Learning



Implementation



Results & Analysis





[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. https://arxiv.org/pdf/1312.5602.pdf