How Epsilon-Greedy Q-Learning, UCB, and MCTS Agents Learn Problem-Solving Strategies in Grid-Based Environments

In this tutorial, we explore how exploration strategies shape intelligent decision-making in agent-based problem solving. We build and train three agents, epsilon-greedy Q-learning, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS), to navigate a grid world and reach a goal while avoiding obstacles. We also experiment with different ways of balancing exploration and exploitation, visualize the learning curves, and compare how each agent adapts and performs under uncertainty.
import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict
class GridWorld:
    """Grid world with random obstacles: start at (0, 0), goal at (size-1, size-1)."""
    # Four moves: up, down, left, right
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        obstacles = set()
        while len(obstacles) < n_obstacles:
            obs = (random.randint(0, size - 1), random.randint(0, size - 1))
            if obs not in [self.start, self.goal]:
                obstacles.add(obs)
                self.grid[obs] = 1
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # Move the agent if the target cell is inside the grid and obstacle-free.
        move = self.MOVES[action]
        new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                and self.grid[new_pos] == 0):
            self.agent_pos = new_pos
        if self.agent_pos == self.goal:
            reward, done = 100, True
        else:
            reward, done = -1, False
        return self.agent_pos, reward, done

    def get_valid_actions(self, state):
        valid = []
        for i, move in enumerate(self.MOVES):
            new_pos = (state[0] + move[0], state[1] + move[1])
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                    and self.grid[new_pos] == 0):
                valid.append(i)
        return valid
We start by creating a grid-world environment that challenges the agent to reach a goal while avoiding obstacles. We define its layout, movement rules, and reward logic to simulate the problem-solving space. This environment is the foundation on which all of our exploration agents act and learn.
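Before wiring up any agents, a quick sanity check helps confirm that moves, rewards, and valid-action masking behave as expected. The sketch below assumes only the GridWorld class above; the size, obstacle count, and variable names are illustrative.
# Minimal sanity-check sketch (assumes the GridWorld class defined above).
demo_env = GridWorld(size=6, n_obstacles=5)
state = demo_env.reset()
valid = demo_env.get_valid_actions(state)
print("start:", state, "valid actions:", valid)
if valid:
    next_state, reward, done = demo_env.step(valid[0])
    print("after one step:", next_state, "reward:", reward, "done:", done)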
class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        else:
            q_values = self.q_table[state]
            valid_q = [(a, q_values[a]) for a in valid_actions]
            return max(valid_q, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        current_q = self.q_table[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

    def decay_epsilon(self, decay_rate=0.995):
        self.epsilon = max(0.01, self.epsilon * decay_rate)
We build a Q-learning agent that learns from experience, guided by an epsilon-greedy policy. We watch it explore actions randomly early on and gradually focus on the most rewarding ones. Through repeated updates, it learns to balance exploration and exploitation.
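To see how the exploration schedule behaves over training, the small sketch below traces epsilon across episodes. It assumes only the decay rule from decay_epsilon above (factor 0.995, floor 0.01), not the full agent.
# Minimal sketch: how epsilon-greedy exploration shrinks over episodes.
# Assumes the same decay rule as QLearningAgent.decay_epsilon (factor 0.995, floor 0.01).
epsilon, decay_rate = 1.0, 0.995
for episode in range(1, 301):
    epsilon = max(0.01, epsilon * decay_rate)
    if episode % 100 == 0:
        print(f"episode {episode}: epsilon = {epsilon:.3f}")
# Roughly: episode 100 -> 0.606, episode 200 -> 0.367, episode 300 -> 0.222
With this schedule the agent is almost fully random at the start and still explores about a fifth of the time after 300 episodes, which is why we keep decaying epsilon every episode rather than fixing it.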
class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c    # exploration coefficient
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                return action    # always try untested actions first
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count
We develop a UCB agent that uses upper confidence bounds to guide its exploration decisions. We observe how it keeps sampling actions it is still uncertain about while prioritizing those that yield higher rewards, giving us a statistically grounded exploration strategy.
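To make the selection rule concrete, here is a small standalone sketch of the score used in get_action, q + c * sqrt(ln(total_visits) / action_count); the example values for q, c, and the counts are arbitrary. It shows the exploration bonus shrinking as an action is tried more often, so heavily sampled actions must compete on their estimated value alone.
# Minimal sketch of the UCB score used above: q + c * sqrt(ln(total_visits) / action_count).
import math

c, q_estimate, total_visits = 2.0, 5.0, 100   # illustrative numbers only
for action_count in (1, 5, 25, 50):
    bonus = c * math.sqrt(math.log(total_visits) / action_count)
    print(f"count={action_count:2d}: bonus={bonus:.2f}, ucb score={q_estimate + bonus:.2f}")
# The bonus falls from about 4.29 at count=1 to about 0.61 at count=50.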
class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        # UCT: average value plus an exploration term based on visit counts.
        choices = [(action, child.value / child.visits +
                    c * math.sqrt(2 * math.log(self.visits) / child.visits))
                   for action, child in self.children.items()]
        return max(choices, key=lambda x: x[1])


class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            # Simulate on a copy so planning never disturbs the real environment.
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: follow the best child while the node is fully expanded.
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: add one untried action as a new child.
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Rollout: random playout for a bounded number of steps.
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                action = random.choice(valid)
                _, reward, done = sim_env.step(action)
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: push the rollout return up to the root.
            while node:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        if root.children:
            return max(root.children.items(), key=lambda x: x[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))
We develop a Monte Carlo Tree Search (MCTS) agent that simulates and plans over many possible futures. We see how it builds a search tree, expands promising branches, and backs up the result of each rollout into its decision statistics. This lets the agent plan intelligently before acting.
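As a quick usage sketch, a single call to search runs the full select-expand-rollout-backpropagate cycle and returns the most visited root action. The snippet below assumes the GridWorld and MCTSAgent classes defined above; the environment parameters and variable names are illustrative.
# Minimal usage sketch (assumes GridWorld and MCTSAgent as defined above).
demo_env = GridWorld(size=8, n_obstacles=10)
demo_planner = MCTSAgent(demo_env, n_simulations=50)
state = demo_env.reset()
action = demo_planner.search(state)   # 50 simulated rollouts from the current position
print("valid actions:", demo_env.get_valid_actions(state), "-> MCTS picks:", action)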
def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history
if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        # Train every agent on the same environment so the comparison is fair
        # (the MCTS agent plans against this environment's obstacle layout).
        rewards = train_agent(agent, env, episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20)/20, mode="valid")
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha="right")
    plt.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)
We train all three agents in our grid world and visualize their learning curves and final performance. We analyze how each strategy, Q-learning, UCB, and MCTS, improves over time. Finally, we compare the results and see which exploration method leads to faster, more reliable problem solving.
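If you want a quick numeric summary alongside the plots, one option is to print each agent's mean reward over the last 100 episodes, the same statistic shown in the bar chart. This small follow-up sketch assumes the results dictionary built in the main block above; the best-episode line is an extra illustration, not part of the original script.
# Minimal follow-up sketch (assumes the `results` dict built in the main block above).
for name, rewards in results.items():
    last_100 = np.mean(rewards[-100:])        # same statistic as the bar chart
    best_episode = int(np.argmax(rewards))    # episode with the highest raw return
    print(f"{name}: last-100 avg = {last_100:.2f}, best episode = {best_episode}")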
In conclusion, we build and compare three exploration agents, each demonstrating a different strategy for solving the same navigation challenge. We see that epsilon-greedy enables simple stochastic exploration, UCB balances curiosity against confidence in its value estimates, and MCTS uses simulated lookahead to plan before acting. This exercise helps us appreciate how different exploration methods shape adaptability and effectiveness in reinforcement learning.



