Revisiting Tabular Reinforcement Learning Methods

After publishing my previous posts benchmarking tabular reinforcement learning methods, I was not fully satisfied with how the results turned out.
Nevertheless, I continued with the post series, moving on to multi-player games and approximate solution methods. To support this, I refactored the original framework. The new version is cleaner, more consistent, and easier to use. In the process, it also helped uncover a few bugs and edge cases in some of the earlier algorithms (more on that later).
In this post, I will introduce the new framework, highlight the mistakes I made, share the updated results together with the lessons learned, and set the stage for the next round of experiments.
The updated code can be found on GitHub.
Framework
A major change from the previous version of the code is that RL solution methods are now implemented as classes. These classes expose common methods such as act() (for choosing an action) and update() (for adjusting model parameters).
Complementing this, a shared training script orchestrates the environment: it generates episodes and feeds them to the respective learning method via the common interface provided by those classes.
This makes training much simpler and more uniform. Previously, each method came with its own messy training loop. Now, training is centralized, and the role of each method is clearly defined and encapsulated.
Before going through the individual methods in detail, let's first look at the training loop for single-player environments:
def train_single_player(
    env: ParametrizedEnv,
    method: RLMethod,
    max_steps: int = 100,
    callback: Callable | None = None,
) -> tuple[bool, int]:
    """Trains a method on single-player environments.

    Args:
        env: env to use
        method: method to use
        max_steps: maximal number of update steps
        callback: callback to determine if method already solves the given problem

    Returns:
        tuple of success and number of update steps
    """
    for step in range(max_steps):
        observation, _ = env.env.reset()
        terminated = truncated = False
        episode = []
        cur_episode_len = 0
        while not terminated and not truncated:
            action = method.act(observation, step)
            observation_new, reward, terminated, truncated, _ = env.step(
                action, observation
            )
            episode.append(ReplayItem(observation, action, reward))
            method.update(episode, step)
            observation = observation_new
            # NOTE: this is highly dependent on environment size
            cur_episode_len += 1
            if cur_episode_len > env.get_max_num_steps():
                break
        episode.append(ReplayItem(observation_new, -1, reward, []))
        method.finalize(episode, step)
        if callback and callback(method, step):
            return True, step
    env.env.close()
    return False, step
Let's walk through what an episode looks like – and when the update() and finalize() methods are called during the process:
After each step, a ReplayItem – containing the current state, the action taken, and the reward received – is appended to the episode, and update() is called to adjust the model's internal parameters. What exactly this call does depends on the specific algorithm.
To give a concrete example, let's look at how this works for Q-learning.
Recall the Q-learning update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

When update() is called for the second time, we have s_t = s_1, a_t = a_1, and r_{t+1} = r_2.
Using this information, the Q-learning agent can adjust the corresponding Q-value accordingly.
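As an illustrative sketch (toy numbers and names, not the framework's actual code), this update looks as follows on a small Q-table:

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9


def q_learning_update(s: int, a: int, r: float, s_next: int) -> None:
    # The TD target uses the greedy value of the successor state.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])


# Second update() call of the episode: s_t = s_1, a_t = a_1, r_{t+1} = r_2.
q_learning_update(s=1, a=1, r=1.0, s_next=2)
print(Q[1, 1])  # 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```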
Supported Methods
Dynamic Programming (DP) methods do not fit the structure presented above, because they rely on iterating over all possible states. For that reason, we keep their implementation separate and treat them differently.
In addition, we completely drop support for Prioritized Sweeping. Here, too, we would need to enumerate all states in advance, which becomes infeasible as the state space grows.
Since this approach did not show convincing results anyway, we focus on the remaining methods. Note: similar reasoning applies to the DP methods – they cannot easily be extended to multi-player games, and are therefore of less interest going forward.
Bugs
Bugs happen – everywhere, and this project is no exception. In this section, I highlight a bug that had a noticeable impact on previously reported results, as well as some smaller fixes and improvements, and explain how the earlier results were affected.
Action Probability Calculation
Some methods require the probability with which the chosen action was selected during the update step. In the previous version of the code, we had:
def _get_action_prob(Q: np.ndarray) -> float:
    return (
        Q[observation_new, a] / sum(Q[observation_new, :])
        if sum(Q[observation_new, :])
        else 1
    )
This works only for non-negative Q-values – but breaks as soon as Q-values turn negative, producing invalid probabilities.
The fixed version handles both positive and negative Q-values by applying a softmax:
def _get_action_prob(self, observation: int, action: int) -> float:
    probs = [self.Q[observation, a] for a in range(self.env.get_action_space_len())]
    probs = np.exp(probs - np.max(probs))
    return probs[action] / sum(probs)
This bug mainly affected Expected SARSA and n-step Tree Backup, as their updates rely on action probabilities.
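To illustrate (toy example, not the repository's code): naive normalization yields invalid "probabilities" once Q-values turn negative, while the softmax always produces a valid distribution:

```python
import numpy as np

q = np.array([-1.0, -2.0, 0.5])

# Naive normalization: sums to 1, but entries can fall outside [0, 1].
naive = q / q.sum()

# Numerically stable softmax: non-negative entries that sum to 1.
shifted = np.exp(q - q.max())
softmax = shifted / shifted.sum()

print(naive)
print(softmax)
```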
Tie-Breaking in Greedy Action Selection
Previously, when generating episodes, we picked either a greedy or a random action with the usual ε-greedy logic:
def get_eps_greedy_action(q_values: np.ndarray, eps: float = 0.05) -> int:
    if random.uniform(0, 1) < eps or np.all(q_values == q_values[0]):
        return int(np.random.choice([a for a in range(len(q_values))]))
    else:
        return int(np.argmax(q_values))
However, this did not handle ties well, i.e., situations where multiple actions share the same maximal Q-value. The refactored act() method now breaks ties properly:
def act(
    self, state: int, step: int | None = None, mask: np.ndarray | None = None
) -> int:
    allowed_actions = self.get_allowed_actions(mask)
    if self._train and step and random.uniform(0, 1) < self.env.eps(step):
        return random.choice(allowed_actions)
    else:
        q_values = [self.Q[state, a] for a in allowed_actions]
        max_q = max(q_values)
        max_actions = [a for a, q in zip(allowed_actions, q_values) if q == max_q]
        return random.choice(max_actions)
A small change, but a meaningful one – especially at the beginning of training, when all Q-values are still equal. It encourages more diverse exploration during this critical early phase.
As discussed earlier – and as we will see in the results below – RL exhibits high variance, which makes the impact of such changes hard to measure precisely. Still, this fix seemed to slightly improve the performance of several methods: SARSA, Q-learning, Double Q-learning, and n-step SARSA.
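The effect can be seen on toy values (illustrative snippet, not the repository code): np.argmax deterministically returns the first maximal index, while sampling among all maximizers restores uniform exploration over tied actions:

```python
import random
import numpy as np

q_values = np.array([1.0, 1.0, 0.2, 1.0])

# np.argmax always picks the first maximal index among ties.
always_first = int(np.argmax(q_values))  # deterministically 0

# Explicit tie-breaking: sample uniformly among all maximizing actions.
max_q = q_values.max()
max_actions = [a for a, q in enumerate(q_values) if q == max_q]
sampled = random.choice(max_actions)  # uniformly one of 0, 1, 3

print(always_first, max_actions)
```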
Updated Results
Let's now examine the updated results – in full, covering all methods, not only the improved ones.
But first, a quick reminder of the task we are solving: we use Gymnasium's gridworld environment [2] – essentially the classic frozen lake navigation task:

The agent must navigate from the top-left to the bottom-right of the grid while avoiding the holes in the frozen lake.
To evaluate each method, we scale the gridworld size and measure the number of update steps needed until convergence.
Monte Carlo Methods
These methods were not affected by the recent implementation changes, so we see results consistent with our previous findings:
- Both variants are able to solve environments up to 25 × 25 in size.
- On-policy MC performs better than off-policy MC.

Temporal Difference Methods
Here, we obtain the following results:

We immediately notice that Expected SARSA now performs much better, thanks to the fix in the action probability computation mentioned above.
But the other methods also improve: as mentioned, this is likely a result of the smaller fixes, in particular the tie-breaking during action selection.
TD-n
For the n-step TD methods, the results look quite different:

n-step SARSA also improved, probably for the same reasons discussed in the previous section – but n-step Tree Backup truly shines, proving that the expected-update approach is a powerful solution method.
Planning
In the planning category, Dyna-Q remains – and it, too, seems to perform better:

Comparing the Best Solution Methods on Larger Environments
To wrap up, let's visualize how all methods compare in a single chart. Having dropped alternatives such as DP, I selected on-policy MC, SARSA, Q-learning, Dyna-Q, n-step SARSA, and n-step Tree Backup.
We begin by showing results for a gridworld of size 50 × 50:

We see that on-policy MC performs surprisingly well, forming a strong baseline. Its strength may lie in its simple, unbiased estimates, which work well when episodes are of moderate length.
However, unlike in the previous post, n-step Tree Backup clearly emerges as the most effective method. This makes sense: its expected backups provide smooth value propagation and stability, combined with the benefits of off-policy learning.
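The benefit of expected backups can be sketched with toy numbers (illustrative only): instead of backing up the value of a single sampled action, we back up the probability-weighted value over all actions, removing the sampling variance of that step:

```python
import numpy as np

gamma = 0.9
r = 1.0
q_next = np.array([1.0, 0.0, 2.0])  # Q-values of the successor state
pi = np.array([0.2, 0.3, 0.5])      # target policy's action probabilities

# Sample backup (SARSA-style): the target depends on which action was taken.
sample_targets = r + gamma * q_next

# Expected backup (Expected SARSA / tree backup): one deterministic target.
expected_target = r + gamma * np.dot(pi, q_next)
print(expected_target)  # 1 + 0.9 * (0.2*1 + 0.3*0 + 0.5*2) = 2.08
```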
Next comes the middle cluster: SARSA, Q-learning, and Dyna-Q – with SARSA slightly ahead.
It is surprising that the model-based updates in Dyna-Q do not lead to better performance. This may point to limitations in the learned model or in the number of planning steps. Q-learning likely falls slightly behind due to the overestimation introduced by its greedy off-policy target.
The worst performer in this test is n-step SARSA, consistent with earlier observations. We suspect the degraded performance stems from the variance and bias introduced by n-step sampling without taking expectations over actions.
It is still surprising that the MC methods beat the TD methods in this setting – traditionally, TD methods are expected to do better on large state spaces. However, this is mitigated in our setup by the reward shaping we apply: we grant a small positive reward for each step that moves the agent closer to the goal. This alleviates one of MC's main weaknesses – sparse reward settings.
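The shaping idea can be sketched as follows (hypothetical helper, the actual shaping in the repository may differ): a small bonus is added whenever a step reduces the Manhattan distance to the goal:

```python
def shaped_reward(
    base_reward: float,
    pos: tuple[int, int],
    new_pos: tuple[int, int],
    goal: tuple[int, int],
    bonus: float = 0.01,
) -> float:
    def dist(p: tuple[int, int]) -> int:
        # Manhattan distance to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    # Grant the bonus only when the step moves the agent closer to the goal.
    return base_reward + (bonus if dist(new_pos) < dist(pos) else 0.0)


print(shaped_reward(0.0, (0, 0), (0, 1), (4, 4)))  # moving closer earns the bonus
```

Dense per-step feedback like this shortens the effective horizon MC methods have to bridge, which helps explain their strong showing here.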
TODO: potentially incorporate new results up to 100 × 100 (to be done before publication)
Conclusion and Outlook
In this post, we introduced the refactored RL framework used in this series. Alongside various improvements, we also fixed some bugs – which improved the performance of several algorithms.
We then applied the updated methods to increasingly large gridworld environments, with the following findings:
- n-step Tree Backup emerged as the best method, thanks to its expected updates, which combine learning stability with the benefits of off-policy learning.
- Monte Carlo methods followed, showing surprisingly strong performance due to their unbiased estimates and their ability to learn from intermediate rewards.
- The cluster of TD methods – Q-learning, SARSA, and Dyna-Q – came next. Despite its model-based updates, Dyna-Q did not outperform the model-free methods.
- n-step SARSA performed worst, possibly because of the combined variance and bias introduced by its sampled n-step returns.
Thanks for reading this update! Stay tuned for more content – next up, we'll cover multi-player games and approximate solution methods.
Other posts in this series
References
[1]
[2]



