An Introduction to Reinforcement Learning Agents with the Unity Game Engine

Reinforcement Learning – learning from observations and rewards – is a method very similar to how humans (and animals) learn.
Despite this similarity, it remains a complex and vexing domain of modern machine learning. To quote the famous Andrej Karpathy:
"Reinforcement learning is bad. It just so happens that everything else we have is much worse."
To help you understand the method, I will build a step-by-step example of an agent that learns to navigate the environment using Q-Learning. The text will start with the basics and end with a fully working example that you can use in the Unity game engine.
For this article, basic knowledge of the C# programming language is required. If you're not familiar with the Unity game engine, just think of each object as an agent which:
- executes Start() once at the beginning of the program,
- executes Update() repeatedly, in parallel with all other agents.
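As a minimal sketch (the class name ExampleAgent is just a placeholder), such an object looks like this in Unity:

using UnityEngine;

public class ExampleAgent : MonoBehaviour
{
    private void Start()
    {
        // Runs once, when the object enters the scene
    }

    private void Update()
    {
        // Runs every frame, interleaved with all other active objects
    }
}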
The repository associated with this article is on GitHub.
What is Reinforcement Learning?
In Reinforcement Learning (RL), we have an agent that is able to take actions, observe the consequences of these actions, and learn from the rewards/punishments of these actions.
How an agent decides to act in a given state is determined by its policy. The policy π is a function that describes the behavior of an agent, mapping states to actions. Given a set of states S and a set of actions A, a deterministic policy is a direct map: π: S → A.
Alternatively, if we want the agent to be able to choose between multiple actions, we can use a stochastic policy. Instead of a single action, the policy then gives the probability of taking each action in a given state: π: S × A → [0, 1].
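As a rough illustration in C# (the delegate names and the State type are illustrative, not part of the article's code):

// A deterministic policy maps a state directly to an action: π: S → A
public delegate ActionEnum DeterministicPolicy(State state);

// A stochastic policy maps a (state, action) pair to a probability: π: S × A → [0, 1]
public delegate float StochasticPolicy(State state, ActionEnum action);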
An example of a navigation robot
To demonstrate the learning process, we will create an example of a robot that navigates a 2D space using one of four actions, A = {Left, Right, Up, Down}. The robot needs to find its way to the prize from any point on the map without falling into the water.

Rewards and tile types will be encoded using an enum:
public enum TileEnum { Water = -1, Grass = 0, Award = 1 }
A state is given by the robot's position on the grid, meaning we have 40 possible states: S = [0…7] × [0…4] (a grid of 8 × 5 tiles), which we encode using a 2D array:
private readonly int[,] _map =
{
    { -1, -1, -1, -1, -1, -1, -1, -1 }, // all-water border
    { -1,  0,  0,  0, -1,  0,  1, -1 }, // 1 = Award (trophy)
    { -1,  0,  0,  0, -1,  0,  0, -1 },
    { -1,  0,  0,  0,  0,  0,  0, -1 },
    { -1, -1, -1, -1, -1, -1, -1, -1 }, // all-water border
};
We store the map in a TileGrid class with the following helper functions:
// Obtain a tile at a coordinate
public T GetTileByCoords(int x, int y);
// Given a tile and an action, obtain the next tile
public T GetTargetTile(T source, ActionEnum action);
// Create a tile grid from the map
public void GenerateTiles();
We will use different tile types, hence the generic parameter T. Each tile has a TileType given by TileEnum, so its reward can be obtained as (int)TileType.
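The article only shows the signatures; as an illustration, GetTargetTile might be implemented roughly like this (assuming each tile exposes its grid coordinates X and Y, which the article does not show):

// Move one tile in the direction of the action, clamping at the map edges
public T GetTargetTile(T source, ActionEnum action)
{
    int x = source.X;
    int y = source.Y;
    switch (action)
    {
        case ActionEnum.Left:  x -= 1; break;
        case ActionEnum.Right: x += 1; break;
        case ActionEnum.Up:    y -= 1; break;
        case ActionEnum.Down:  y += 1; break;
    }
    // Clamp to the board so moves at the border stay on the map
    x = Mathf.Clamp(x, 0, BOARD_WIDTH - 1);
    y = Mathf.Clamp(y, 0, BOARD_HEIGHT - 1);
    return GetTileByCoords(x, y);
}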
The Bellman Equation
The problem of finding the optimal policy can be solved iteratively using the Bellman Equation. The Bellman Equation states that the long-term reward for an action is equal to the immediate reward for that action plus the expected reward for all future actions.
It can be computed recursively for systems with discrete states and deterministic state transitions. Let:
- s – the current state,
- A – the set of all actions,
- s' – the state reached by taking action a in state s,
- γ – the discount factor (the further in the future a reward is, the lower its value),
- R(s, a) – the immediate reward for taking action a in state s.
The Bellman Equation then gives the value V(s) of the state s as:
V(s) = max_{a ∈ A} [ R(s, a) + γ · V(s') ]
Solving the Bellman Equation Iteratively
Solving the Bellman Equation iteratively is a dynamic programming problem. At each iteration n, we calculate, for every tile, the expected future reward achievable within n+1 steps. Each tile stores this in a Value field.
The reward is based on the target tile: 1 if the prize is reached, -1 if the robot falls into the water, and 0 otherwise. Once the prize or the water is reached, no further actions can be taken, so the value of those states remains at the initial value of 0.
We create a manager that generates the grid and runs the iterations:
private void Start()
{
    tileGrid.GenerateTiles();
}

private void Update()
{
    CalculateValues();
    Step();
}
To track values, we use a VTile class holding a Value. To avoid reading freshly updated values within the same iteration, we first write each result to NextValue and then commit all values at once in the Step() method.
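The article does not list VTile itself; a minimal sketch consistent with the code below (the members TileType, Reward, Value, NextValue, and Step are used by the article, the rest is assumed) could be:

public class VTile
{
    public TileEnum TileType;                 // Water, Grass, or Award
    public int Reward => (int)TileType;       // Reward read directly from the tile type
    public double Value { get; private set; } // V(s) from the previous iteration
    public double NextValue { get; set; }     // V(s) computed in the current iteration

    // Commit the freshly computed value (double-buffering step)
    public void Step()
    {
        Value = NextValue;
    }
}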
private float gamma = 0.9f; // Discount factor

// The Bellman equation
private double GetNewValue(VTile tile)
{
    return Agent.Actions
        .Select(a => tileGrid.GetTargetTile(tile, a))
        .Select(t => t.Reward + gamma * t.Value) // Reward in {-1, 0, 1}
        .Max();
}
// Get next values for all tiles
private void CalculateValues()
{
    for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
    {
        for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
        {
            var tile = tileGrid.GetTileByCoords(x, y);
            if (tile.TileType == TileEnum.Grass)
            {
                tile.NextValue = GetNewValue(tile);
            }
        }
    }
}
// Copy next values to current values (iteration step)
private void Step()
{
    for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
    {
        for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
        {
            tileGrid.GetTileByCoords(x, y).Step();
        }
    }
}
At every step, the value V(s) of each tile is updated to the maximum, over all actions, of the immediate reward plus the discounted value of the resulting tile. The future reward thus propagates outwards from the Award tile, with diminishing returns controlled by γ = 0.9.
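As a worked example (with γ = 0.9):
- Iteration 1: a grass tile next to the Award can reach it directly, so V = 1 + 0.9 · 0 = 1.
- Iteration 2: a tile two steps away gets V = 0 + 0.9 · 1 = 0.9.
- Iteration 3: a tile three steps away gets V = 0 + 0.9 · 0.9 = 0.81.
Each additional step away from the prize multiplies the propagated value by γ.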

Quality of Action (Q-Values)
We have found a way to assign values to states, which is enough for this routing problem. However, this focuses on the environment and ignores the agent. For an agent, we usually want to know which action would be good in a given state.
In Q-Learning, this action value is called the quality (Q-value). Each (state, action) pair is assigned one Q-value, which is updated after every action:
Q(s, a) ← Q(s, a) + α · D(s, a)
Here the new hyperparameter α describes the learning rate: how quickly new information overrides old information. This is analogous to the learning rate in conventional machine learning, and typical values are similar; here we use 0.005. The benefit of taking an action is measured by the temporal difference D(s, a):
D(s, a) = R(s, a) + γ · max_{a' ∈ A} Q(s', a') − Q(s, a)
Since we no longer evaluate the current state as a whole but the quality of each action separately, we do not maximize over the actions available in the current state. Instead, we maximize over the actions available in the state s' reached by taking the action whose quality we are updating, and combine that with the immediate reward for taking it. Substituting D(s, a) gives the full update rule:
Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max_{a' ∈ A} Q(s', a') − Q(s, a) ]
The temporal difference term combines the immediate reward with the best possible future reward, so it is directly derived from the Bellman Equation (see the Q-learning Wikipedia page for details).
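The QTile class is not listed in the article either; a minimal sketch providing the GetQValue/SetQValue members used below (the dictionary-based storage is an assumption) could be:

using System.Collections.Generic;

public class QTile
{
    public TileEnum TileType;
    public int Reward => (int)TileType;

    // One Q-value per action, implicitly initialized to 0
    private readonly Dictionary<ActionEnum, double> _qValues =
        new Dictionary<ActionEnum, double>();

    public double GetQValue(ActionEnum action) =>
        _qValues.TryGetValue(action, out var q) ? q : 0.0;

    public void SetQValue(ActionEnum action, double value) =>
        _qValues[action] = value;
}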
To train the agent, we again generate the grid, but this time we also create an instance of the agent and place it at (2, 2).
[SerializeField] private Agent agentPrefab; // Assigned in the Inspector
private Agent _agent;
private float gamma = 0.9f;   // Discount factor
private float alpha = 0.005f; // Learning rate (value from the text)

private void ResetAgentPos()
{
    _agent.State = tileGrid.GetTileByCoords(2, 2);
}

private void Start()
{
    tileGrid.GenerateTiles();
    _agent = Instantiate(agentPrefab, transform);
    ResetAgentPos();
}

private void Update()
{
    Step();
}
The Agent object has a current State, which is a QTile. Each QTile stores a Q-value for each available action. At each step, the agent updates the quality of every action available in its current state:
private void Step()
{
    if (_agent.State.TileType != TileEnum.Grass)
    {
        ResetAgentPos();
    }
    else
    {
        QTile s = _agent.State;

        // Update Q-values for ALL actions from the current state
        foreach (var a in Agent.Actions)
        {
            double q = s.GetQValue(a);
            QTile sPrime = tileGrid.GetTargetTile(s, a);
            double r = sPrime.Reward;
            double qMax = Agent.Actions.Select(sPrime.GetQValue).Max();
            double td = r + gamma * qMax - q;
            s.SetQValue(a, q + alpha * td);
        }

        // Take the best available action
        ActionEnum chosen = PickAction(s);
        _agent.State = tileGrid.GetTargetTile(s, chosen);
    }
}
The Agent has a set of possible actions in each state and will take the best available one. If several actions are tied for the best Q-value, one of them is effectively chosen at random, because we shuffle the actions beforehand. Due to this randomness, each training run proceeds differently, but it usually stabilizes within 500–1000 steps.
This is the essence of Q-Learning. Unlike plain state values, action qualities can also be used in situations where:
- observability is incomplete (e.g., limited to the agent's field of view),
- the environment changes over time (e.g., objects move around).
Exploration vs. Exploitation (ε-Greedy)
So far, the agent has taken the best action every time; however, this can cause the agent to quickly get stuck in a local optimum. A major challenge in Q-Learning is the trade-off between exploration and exploitation:
- Exploit – choose the action with the highest known Q-value (greedy).
- Explore – choose a random action to discover possible alternatives.
The ε-Greedy Policy
We draw a random value r ∈ [0, 1] and, given the parameter epsilon, there are two options:
- if r > epsilon, choose the best action (exploitation),
- otherwise, choose a random action (exploration).
Epsilon Decay
Often we want to explore early and exploit later. This is accomplished by decaying epsilon over time:
epsilon = max(epsilonMin, epsilon − epsilonDecay)
After enough steps, the agent's policy converges to almost always choosing the highest-quality action.
private float epsilon = 1f; // Initial value: fully exploratory (assumed)
private float epsilonMin = 0.05f;
private float epsilonDecay = 0.005f;

private ActionEnum PickAction(QTile state)
{
    ActionEnum action = Random.Range(0f, 1f) > epsilon
        ? Agent.Actions.Shuffle().OrderBy(state.GetQValue).Last() // exploit: best Q-value, ties broken by the shuffle
        : Agent.RndAction(); // explore: random action
    epsilon = Mathf.Max(epsilonMin, epsilon - epsilonDecay); // decay epsilon
    return action;
}
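Note that Shuffle() is not part of standard LINQ; the repository presumably defines a small extension method for it. A minimal Fisher–Yates sketch (an assumed helper, not shown in the article) could look like this:

using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    private static readonly System.Random _rng = new System.Random();

    // Fisher–Yates shuffle: returns the items in a uniformly random order
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source)
    {
        var items = source.ToArray();
        for (int i = items.Length - 1; i > 0; i--)
        {
            int j = _rng.Next(i + 1);
            (items[i], items[j]) = (items[j], items[i]);
        }
        return items;
    }
}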
Broader RL Ecosystem
Q-Learning is one algorithm in a large family of Reinforcement Learning (RL) methods. Algorithms can be divided along several axes:
- State space: Discrete (e.g., board games) | Continuous (e.g., FPS games)
- Action space: Discrete (e.g., strategy games) | Continuous (e.g., driving)
- Policy type: Off-policy (Q-Learning: a' is always the maximizing action) | On-policy (SARSA: a' is selected by the agent's current policy)
- Learned quantity: Value V(s) | Quality Q(s, a) | Advantage A(s, a) = Q(s, a) − V(s)
For a comprehensive list of RL algorithms, see the Reinforcement Learning Wikipedia page. Additional methods, such as behavioral cloning, are not listed there but are also used in practice. Real-world solutions often use an extended variant or a combination of the above.
Q-Learning is an off-policy, discrete method. Extending it to continuous state/action spaces leads to methods such as Deep Q-Networks (DQN), which replace the Q-table with a neural network.
In the grid-world example, the Q-table has |S| × |A| = 40 × 4 = 160 entries, which is completely manageable. But in a game like chess, the state space exceeds 10⁴⁴ positions, making an explicit table impossible to store or fill. In such cases, neural networks can be used to compress the information.

Instead of storing a separate value for every (s, a) pair, the network takes a state as input and outputs the Q-values of all actions, generalizing even to states it has never seen before.
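As a rough sketch of the difference in interface (illustrative only, not a real DQN implementation):

// Tabular Q-Learning: one stored entry per (state, action) pair
double[,] qTable = new double[40, 4]; // |S| × |A| = 160 entries

// Function approximation (DQN-style): a network maps a state
// to the Q-values of all actions in a single forward pass
double[] EstimateQValues(float[] stateFeatures)
{
    // ...forward pass through a neural network would go here...
    return new double[4]; // one Q-value per action
}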


