Reinforcement Learning with Predator/Prey


Large parts of the content for this post were drawn from Reinforcement Learning: An Introduction by Sutton and Barto, as well as Arthur Juliani's multi-part series on Medium

The point of this project is to explore reinforcement learning in an analogue of real-life situations. Short of robots actually hunting other entities, this problem involves multiple RL actors either cooperating or competing, as well as a partially observable environment.

See the project on GitHub. Note that this repository is active, subject to change, and not exactly cleaned up yet.
See release version v1.0 to get the code at the time of this writing.

A brief introduction to Q learning

Reinforcement learning, in a nutshell, is about making a program learn to maximize the total reward gained from an environment. The environment is some black box defined by a state, a set of actions, and rewards. Each action takes you from one state to another and returns some reward. Let's denote these properties of the environment at time step \(t\) as \(s_t\), \(a_t\), and \(r_t\).

What we want is to construct a policy, denoted \(\pi(s, a) \mapsto [0, 1]\), which gives the probability of taking an action in a given state so as to maximize the total reward extracted from the environment. We start in \(s_0\) and follow the choices given by \(\pi(s_t,a_t)\) at each time step \(t\). See the appendix for an example of why probabilities can outperform simply picking the single "best" action, as well as for a counter-argument.

Q learning approaches this problem by defining the following function:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma ~ \mbox{max}_a ~ Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$ To examine this, let's take the case with unit step size \(\alpha = 1\): $$Q(s_t,a_t) \leftarrow r_{t+1} + \gamma ~ \mbox{max}_a ~ Q(s_{t+1}, a)$$

Ignoring convergence concerns for now, we can see that \(Q\) takes on the value of the reward for performing the action \(a_t\) in state \(s_t\) plus the discounted reward for taking the highest-value action in the next state. Thus if we were to expand this formula we would see that \(Q(s,a)\) is equal to the total discounted reward for taking the action \(a\) in state \(s\) and then performing optimally thereafter. Note that since \(Q\) is bootstrapping off of itself it will not immediately take on the actual value described above; instead it will approach that value if the convergence conditions are met.

Once we have access to the function \(Q(s_t, a)\), we can construct \(\pi(s_t, a_t)\) by comparing the value of \(Q(s_t, a_t)\) to that of all other actions in any way we see fit. In its simplest form we could have \(\pi(s_t, a_t) = 1\) for \(a_t = \mbox{argmax}_a~Q(s_t, a)\) and \(0\) otherwise (meaning we always take the "best" action).
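To make the update rule concrete, here is tabular Q-learning on a tiny made-up chain environment. The environment, constants, and the \(\epsilon\)-greedy exploration scheme are purely illustrative and have nothing to do with the project itself.

import numpy as np

# Toy chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode; everything else rewards 0.
N_STATES, N_ACTIONS = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

for episode in range(300):
    s = int(rng.integers(N_STATES - 1))   # random non-terminal start state
    for _ in range(200):                  # cap the episode length
        # epsilon-greedy exploration around the greedy choice argmax_a Q(s, a)
        a = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

# The greedy policy derived from Q should prefer "right" (1) in every non-terminal state
print(np.argmax(Q[:-1], axis=1))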

Pred/Prey Environment

In this project I explore the application of Q learning to the following environment: a square map filled with many copies of predator and prey agents. Agents have a limited energy supply and die after running out of energy, falling off the map, or being eaten by another agent. I am trying to use Q learning to have the predator agents learn how to pursue and catch the prey. The prey use a manually defined heuristic policy which simply avoids falling off the edge of the map and replenishes energy from the environment when needed.

The world offers a state to each agent which includes a few internal variables, such as energy relative to the maximum (0 implies imminent death), as well as vision in the form of an array of gray-scale color and distance values. Predators appear as +1, prey appear as -1, and the map wall appears as 0. Distance is relative to the maximum view distance, defaulting to 1 when nothing is in view. We can refer to the \(i^{th}\) pixel as a tuple \((c_i,d_i)\) giving color and distance.
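As a rough sketch of how one such vision pixel might be encoded (the function and its argument names are hypothetical, not taken from the project's code):

# Hypothetical encoding of a single vision pixel (c_i, d_i) as described above.
def encode_pixel(seen, distance, max_view_distance):
    """seen is 'predator', 'prey', 'wall', or None when nothing is in view."""
    if seen is None:
        return 0.0, 1.0   # distance defaults to 1 when nothing is in view
    color = {"predator": +1.0, "prey": -1.0, "wall": 0.0}[seen]
    return color, distance / max_view_distance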

Agents may choose to move forward (and optionally left or right), stay still, graze, attack, or mate.

Rewards are also in the process of being tuned. What is consistent, though, is that predators are modestly rewarded for attacking other predators, strongly rewarded for attacking prey or mating, and harshly punished for dying in any way.
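Purely to illustrate that ordering, here is a hypothetical reward table; every number is a placeholder, since the real values are still being tuned:

# Hypothetical reward magnitudes reflecting the ordering above; not the tuned values.
PREDATOR_REWARDS = {
    "attack_predator": +0.2,   # modest
    "attack_prey":     +1.0,   # strong
    "mate":            +1.0,   # strong
    "death":           -2.0,   # harsh, regardless of cause
}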

If you are familiar with reinforcement learning you will immediately notice that this is not a Markov Decision Process. What this means is that it does not satisfy the Markov property, which states that all relevant information for making future choices can be determined from the given state and action alone:

If the state signal has the Markov property [...] then the environment's response at \(t+1\) depends only on the state and action representations at \(t\), in which case the environment's dynamics can be defined by specifying only $$\mathbb{P}(s_{t+1} = s', r_{t+1} = r ~ \vert ~ s_t, a_t)$$ for all \(s'\), \(r\), \(s_t\), and \(a_t\).

- Reinforcement Learning: An Introduction

This is not the case in our environment: an agent may simply pass out of the view cone of another agent and yet still be able to affect its future actions.

I point this out because there isn't a single correct answer for the action to take in a given state; the reward may depend on hidden variables. Allowing the policy to select a probability of actions can help mitigate this problem. For a more detailed example see the appendix referenced before.

Implementation

The execution of the above design is admittedly a bit sloppy, but we will focus on the implementation of the TensorFlow model as opposed to the environment.

For my first attempt, the predator agents have a simple dense neural network connecting the 13 inputs (3 internal senses + 5 color pixels + 5 distance pixels) to two hidden layers of width 15, and then to an output layer of size 10, one output per action.
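A minimal sketch of that network in tf.keras; the layer sizes come from the description above, while the choice of activation function is my own assumption:

import tensorflow as tf

def build_q_network():
    # 13 inputs -> two hidden layers of width 15 -> 10 outputs, one Q value per action
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(13,)),
        tf.keras.layers.Dense(15, activation="relu"),
        tf.keras.layers.Dense(15, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

Q = build_q_network()
Q_target = build_q_network()            # the Q' network described below
Q_target.set_weights(Q.get_weights())   # start as an exact copy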

I create two such networks, called \(Q\) and \(Q'\). When training the \(Q\) network we base the loss function on \(Q(s_t,a_t) \leftarrow r_{t+1} + \gamma ~ \mbox{max}_a ~ Q'(s_{t+1}, a)\), then copy the weights of \(Q\) into \(Q'\) after a fixed number of cycles. Doing this makes the training more stable. Only one network is used during evaluation of the environment, however. Below is the pseudocode for this first version:

model = Environment()
buffer = ReplayBuffer()
Q, Q' = NN(), NN()

while True:
    π = Q.get_policy()
    for running_iters:                      # collect experience from the environment
        s = model.get_all_agent_senses()
        a = choose_from_probs( π(s,:) )
        r, s' = model.act_all_agents(a)
        buffer.save(s, a, r, s')

    for training_iters:
        Q' = Q                              # refresh the target network
        for (s, a, r, s') in buffer.batches():
            Q(s,a) ← r + γ · max Q'(s',:)   # applied via backpropagation

For choose_from_probs(π(s)) I used tf.multinomial to pick one of the actions based on the probabilities assigned to each by the policy. The ← operator in the training loop is implemented as normal backpropagation, with the right-hand side acting as the regression target for \(Q(s,a)\).
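As a hedged sketch of what that backpropagation step could look like in TensorFlow 2, reusing the Q and Q_target networks from the earlier snippet (the optimizer, learning rate, and the handling of terminal states are my assumptions, not necessarily what the repository does):

import tensorflow as tf

GAMMA = 0.9                                   # assumed discount factor
optimizer = tf.keras.optimizers.Adam(1e-3)    # assumed optimizer and learning rate

def train_step(Q, Q_target, s, a, r, s_next):
    # TD target r + gamma * max_a' Q'(s', a'), computed with the frozen target network
    target = r + GAMMA * tf.reduce_max(Q_target(s_next), axis=1)
    with tf.GradientTape() as tape:
        q_taken = tf.gather(Q(s), a, batch_dims=1)        # Q(s_t, a_t) for the actions taken
        loss = tf.reduce_mean(tf.square(target - q_taken))
    grads = tape.gradient(loss, Q.trainable_variables)
    optimizer.apply_gradients(zip(grads, Q.trainable_variables))
    return loss

# After a fixed number of such steps, refresh the target network:
# Q_target.set_weights(Q.get_weights())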

Results

Initially the predators don't display any hunting skills. They sometimes randomly attack prey upon encountering them.

After a few training iterations, predators are able to more consistently catch prey, following other bots (including other predators). They sometimes avoid the world boundaries.

After many training iterations, the predators are worse at catching prey, and worse at avoiding boundaries - often pausing at the border until exploration pushes them past the edge. See appendix for a possible explanation as to why this is.

Future Work

It isn't satisfying to have a golden region wherein peak performance is achieved. We would much prefer to have training consistently improve reward until the network achieves peak performance. While this model could certainly be improved with various Q learning techniques (see most old DeepMind papers for details) I would prefer to move on and explore new architectures later.

This has been a brief introduction to reinforcement learning. Instead of using a toy model, here I used something which can be developed upon to hopefully create interesting behavior. In future posts I will introduce Recurrent Neural Networks (and LSTMs) for the purpose of solving several of the issues with \(\mathbb{P}\) and \(\pi\) we have seen so far. They can be thought of as a sort of "memory", but more on that later.


Appendix

When is a probabilistic \(\pi(s,a)\) better?

This is an example from Reinforcement Learning: An Introduction but I cannot for the life of me find the page number.

Consider the following environment: you have a grid, as detailed below, where the yellow ball indicates a goal which, upon entering, gives some positive reward. All other movement rewards zero. We consider the experiment terminated once the agent is rewarded.

The agent, when in a given square, can only observe whether it is possible to move in each direction, which we might represent as the array \([m_{north}, m_{south}, m_{east}, m_{west}]\); thus from the grid position \((2,0)\) we have the state \([0,1,1,1]\).

[Figure: a grid with a top row of cells (0,0) through (4,0) and three cells (0,1), (2,1), and (4,1) hanging below it; the goal (yellow ball) sits in the lower-middle cell (2,1).]

Imagine the policy one could create for this system. Notice that \((1,0)\) has the same state as \((3,0)\). Since the two are indistinguishable, a simple policy choosing only the single best action for a given state would either choose west for both or east for both; thus, for three of the \(8\) possible starting positions, it would never be able to reach the goal.

[Figure: the same grid showing the deterministic policy's arrows; with east chosen in the aliased states, the agent bounces between (3,0) and (4,0) forever.]

As you can see, there is an infinite loop between \((3,0)\) and \((4,0)\). If our policy is allowed to output probability distributions, however, \((1,0)\) and \((3,0)\) would both have a 50/50 chance of going east or west, preventing an infinite loop in which no reward is ever reached.
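Below is a small self-contained simulation of this example. The grid layout follows the coordinates above, while placing the goal at (2,1) is my reconstruction from the loop just described.

import random

# The aliased gridworld: a top row (0,0)-(4,0) with cells hanging below at x = 0, 2, 4.
CELLS = {(x, 0) for x in range(5)} | {(0, 1), (2, 1), (4, 1)}
GOAL = (2, 1)                     # assumed goal position
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

def observe(cell):
    # What the agent sees: which of north/south/east/west are open.
    x, y = cell
    return tuple(int((x + dx, y + dy) in CELLS) for dx, dy in MOVES.values())

def run(policy, start, max_steps=100):
    cell = start
    for t in range(max_steps):
        if cell == GOAL:
            return t              # number of steps taken to reach the reward
        dx, dy = MOVES[policy(observe(cell))]
        nxt = (cell[0] + dx, cell[1] + dy)
        cell = nxt if nxt in CELLS else cell
    return None                   # never reached the goal

# A "single best action per observation" policy that happens to pick east
# for the aliased observation shared by (1,0) and (3,0).
def greedy(obs):
    return {(0, 1, 1, 1): "S", (0, 0, 1, 1): "E", (0, 1, 1, 0): "E",
            (0, 1, 0, 1): "W", (1, 0, 0, 0): "N"}[obs]

# The same policy, but 50/50 east/west in the aliased state.
def stochastic(obs):
    return random.choice(["E", "W"]) if obs == (0, 0, 1, 1) else greedy(obs)

print(run(greedy, (4, 0)))        # None: stuck bouncing between (3,0) and (4,0)
print(run(stochastic, (4, 0)))    # a small finite number of steps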

When is a greedy \(\pi(s,a)\) better?

Now let's consider a more difficult problem than the one detailed above. Think of an agent in our pred/prey model weighing its options upon seeing prey directly ahead: without context, this prey could be moving leftwards, rightwards, away from, or towards the predator. Assuming these all occur with roughly equal probability, \(\pi(s,a)\) should output about \(33\%\) for each of the actions left, right, and forward. Unfortunately, if your policy chooses one of these three at each time step with no memory of past states, then you will keep randomly picking one until your prey inevitably escapes. Here it would almost certainly be better to choose one of the three more consistently with a greedy approach. This option decreases exploration of the environment, since the agent does not try new actions until one fails enough for gradient descent to decrease its value substantially. As such, this approach is generally employed as an \(\epsilon\)-greedy approach, where with probability \(\epsilon\) a random action is taken instead of the action with the highest \(Q\) value.

Another concern is that since about \(100\) timesteps elapse between a creature seeing prey and getting the reward for catching it, the reward has been discounted significantly. As such, the difference between immediately pursuing the prey and doing something else for one timestep is insignificant: it is \(\gamma^n\) versus \(\gamma^{n+1}\), which is negligible for large \(n\). If you construct your policy as a softmax over the \(Q\) values, you will find that almost all actions are predicted as "best" with almost equal probability. This is because softmax cares about the absolute difference between values, not the relative difference. Consider the example below:

[Figure: the discounted values along a path to a reward \(R\): cells near the reward have values \(R, \gamma R, \gamma^2 R, \ldots\), while at a cell \(n\) steps away the available actions have values on the order of \(\gamma^{n-1}R\), \(\gamma^{n}R\), \(\gamma^{n+1}R\), and \(\gamma^{n+2}R\).]

For any value of \(R\) we can choose an \(n\) large enough such that the difference in \(Q\) value between the up/down and forward actions is less than any desired small value \(\epsilon\). As such any fixed probabilistic policy \(\pi\) will choose each action with about equal probability. Using this \(\pi\) will therefore result in a random walk for some initial set of timesteps. In our environment this may be long enough to allow the goal (prey) to escape.
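A quick numeric check of this claim, with arbitrary values for \(R\), \(\gamma\), and \(n\):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

R, gamma, n = 10.0, 0.95, 100
# Q values for "forward" versus two actions that end up one step further from the prey
q = np.array([gamma**n * R, gamma**(n + 1) * R, gamma**(n + 1) * R])
print(softmax(q))   # roughly [0.334, 0.333, 0.333]: nearly indistinguishable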

Greedy versus Probabilistic

Experimenting with both policies, I found that the probabilistic one provided greatly improved exploration and training speed compared to \(\epsilon\)-greedy, but eventually performed suboptimally once all the \(Q\) values became relatively close and the predators became indecisive. Greedy was slow to learn, but performed better in the late stage. Mixing the two may prove to be the best choice, for example: $$ \pi(s,a) = \alpha_t \pi_g(s,a) + (1-\alpha_t) \pi_p(s,a) $$ Where \(\pi_g\) is the greedy policy, \(\pi_p\) the probabilistic one, and \(\alpha_t\) increases with time.
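A tiny sketch of that mixture, assuming the probabilistic component \(\pi_p\) is a softmax over the \(Q\) values:

import numpy as np

def mixed_policy(q_values, alpha):
    # pi = alpha * greedy + (1 - alpha) * softmax(Q); alpha rises from 0 towards 1 over training
    greedy = np.zeros_like(q_values)
    greedy[np.argmax(q_values)] = 1.0
    soft = np.exp(q_values - np.max(q_values))
    soft /= soft.sum()
    return alpha * greedy + (1.0 - alpha) * soft

print(mixed_policy(np.array([0.1, 0.5, 0.2]), alpha=0.8))   # mostly greedy, a little exploration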

However, if you're familiar with ML you may also be familiar with the common technique of using a temperature to adjust the confidence placed in the highest-scored class or action. Softmax is given by: $$ softmax(a)_i = \frac{e^{a_i}}{\sum_j e^{a_j}} $$ where \(a\) is your array of action scores. If we introduce a temperature \(T\) and adjust our formula to be: $$ softmax(a)_i = \frac{e^{\frac{a_i}{T}}}{\sum_j e^{\frac{a_j}{T}}} $$ then, given that \( e^{\frac{x}{T}} = {\left(e^{\frac{1}{T}}\right)}^{x} \), we can choose a base \(b = e^{\frac{1}{T}}\) (equivalently, a temperature \(T = \frac{1}{\mbox{log}(b)}\)) and rewrite softmax as: $$ softmax(a)_i = \frac{b^{a_i}}{\sum_j b^{a_j}} $$ Values of \(b\) closer to \(1\) make the output probability distribution of softmax closer to uniform; in the extreme case of \(b = 1\) we get a uniform probability distribution over the actions. Thus a high temperature (\(b\) near \(1\)) emphasizes exploration of new states, whereas a low temperature (large \(b\)) emphasizes exploitation of known rewards.
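And a quick illustration of the rewritten softmax, with arbitrary action scores:

import numpy as np

def softmax_base(scores, b):
    # Softmax with an explicit base b = e^(1/T); subtracting the max keeps it numerically stable.
    z = np.power(float(b), scores - np.max(scores))
    return z / z.sum()

q = np.array([1.0, 2.0, 4.0])
print(softmax_base(q, 1.001))   # b near 1 (high temperature): nearly uniform, favors exploration
print(softmax_base(q, 10.0))    # large b (low temperature): sharply peaked on the best action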