
How ε-Greedy Q-Learning, UCB, and MCTS Agents Learn Problem-Solving Strategies in Grid-Based Environments

In this tutorial, we explore how exploration strategies shape intelligent decision-making in agent-based problem solving. We build and train three agents, an epsilon-greedy Q-learning agent, an Upper Confidence Bound (UCB) agent, and a Monte Carlo Tree Search (MCTS) agent, to navigate a grid world and reach a goal while avoiding obstacles. We also experiment with different ways of balancing exploration and exploitation, visualize the learning curves, and compare how each agent adapts and performs under uncertainty.

import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict


class GridWorld:
   def __init__(self, size=10, n_obstacles=15):
       self.size = size
       self.grid = np.zeros((size, size))
       self.start = (0, 0)
       self.goal = (size-1, size-1)
       obstacles = set()
       while len(obstacles) < n_obstacles:
           obs = (random.randint(0, size-1), random.randint(0, size-1))
           if obs not in [self.start, self.goal]:
               obstacles.add(obs)
               self.grid[obs] = 1
       self.reset()
   def reset(self):
       self.agent_pos = self.start
       return self.agent_pos
   def step(self, action):
       # Apply the move; stay in place if it would leave the grid or hit an obstacle.
       move = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
       new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
       if 0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and self.grid[new_pos] == 0:
           self.agent_pos = new_pos
       if self.agent_pos == self.goal:
           reward, done = 100, True
       else:
           reward, done = -1, False
       return self.agent_pos, reward, done
   def get_valid_actions(self, state):
       # Up, down, left, right; keep only moves that stay on the grid and avoid obstacles.
       moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
       valid = []
       for i, move in enumerate(moves):
           new_pos = (state[0] + move[0], state[1] + move[1])
           if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
               and self.grid[new_pos] == 0):
               valid.append(i)
       return valid

We start by creating a grid world environment that challenges the agent to reach a goal while avoiding obstacles. We define its structure, the movement rules, and the reward logic so it behaves as a simple problem-solving space. This forms the foundation on which our exploration agents will act and learn.
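As a quick sanity check, here is a minimal illustrative sketch (not part of the original tutorial) that instantiates the environment, resets it, and takes a few random valid steps:

# Illustrative usage sketch: exercise the GridWorld API defined above.
env = GridWorld(size=6, n_obstacles=5)
state = env.reset()
for _ in range(5):
    valid = env.get_valid_actions(state)
    if not valid:
        break                       # the agent is boxed in by obstacles
    action = random.choice(valid)   # pick any legal move
    state, reward, done = env.step(action)
    print(f"state={state}, reward={reward}, done={done}")
    if done:
        break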

class QLearningAgent:
   def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
       self.n_actions = n_actions
       self.alpha = alpha
       self.gamma = gamma
       self.epsilon = epsilon
       self.q_table = defaultdict(lambda: np.zeros(n_actions))
   def get_action(self, state, valid_actions):
       if random.random() < self.epsilon:
           return random.choice(valid_actions)
       else:
           q_values = self.q_table[state]
           valid_q = [(a, q_values[a]) for a in valid_actions]
           return max(valid_q, key=lambda x: x[1])[0]
   def update(self, state, action, reward, next_state, valid_next_actions):
       current_q = self.q_table[state][action]
       if valid_next_actions:
           max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
       else:
           max_next_q = 0
       new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
       self.q_table[state][action] = new_q
   def decay_epsilon(self, decay_rate=0.995):
       self.epsilon = max(0.01, self.epsilon * decay_rate)

We implement a Q-learning agent that learns from experience, guided by an epsilon-greedy policy. We watch it explore actions randomly in early episodes and gradually focus on the most rewarding ones as epsilon decays. Through iterative updates, it learns to balance exploration and exploitation.
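To make the update rule concrete, here is a small illustrative sketch (with made-up numbers, not from the original script) that applies the agent's Q-learning update, Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)], to a single hypothetical transition:

# Illustrative only: one Q-learning update on a hypothetical transition.
agent = QLearningAgent(alpha=0.1, gamma=0.95, epsilon=0.0)
state, action, reward, next_state = (0, 0), 1, -1, (1, 0)
agent.q_table[next_state][:] = [0.0, 2.0, 0.0, 1.0]   # pretend these values were learned earlier
agent.update(state, action, reward, next_state, valid_next_actions=[0, 1, 2, 3])
# new Q = 0 + 0.1 * (-1 + 0.95 * 2.0 - 0) = 0.09
print(agent.q_table[state][action])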

class UCBAgent:
   def __init__(self, n_actions=4, c=2.0, gamma=0.95):
       self.n_actions = n_actions
       self.c = c
       self.gamma = gamma
       self.q_values = defaultdict(lambda: np.zeros(n_actions))
       self.action_counts = defaultdict(lambda: np.zeros(n_actions))
       self.total_counts = defaultdict(int)
   def get_action(self, state, valid_actions):
       self.total_counts[state] += 1
       ucb_values = []
       for action in valid_actions:
           q = self.q_values[state][action]
           count = self.action_counts[state][action]
           if count == 0:
               return action
           exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
           ucb_values.append((action, q + exploration_bonus))
       return max(ucb_values, key=lambda x: x[1])[0]
   def update(self, state, action, reward, next_state, valid_next_actions):
       self.action_counts[state][action] += 1
       count = self.action_counts[state][action]
       current_q = self.q_values[state][action]
       if valid_next_actions:
           max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
       else:
           max_next_q = 0
       target = reward + self.gamma * max_next_q
       self.q_values[state][action] += (target - current_q) / count

We build a UCB agent that uses upper confidence bounds to guide its exploration decisions. We observe how it keeps trying under-visited actions while increasingly favoring those that yield higher rewards. This illustrates a statistically grounded exploration strategy.
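The exploration bonus in get_action follows the standard UCB1 form, Q(s,a) + c·sqrt(ln N(s) / n(s,a)). The short illustrative sketch below (with arbitrary visit counts) shows how that bonus shrinks as an action is tried more often:

# Illustrative only: the UCB1 bonus decays as an action's visit count grows.
c, total_visits = 2.0, 100
for count in (1, 10, 50, 100):
    bonus = c * math.sqrt(math.log(total_visits) / count)
    print(f"n(s,a)={count:3d} -> exploration bonus = {bonus:.3f}")
# Rarely tried actions keep a large bonus, so they are revisited before the
# agent commits to the action with the best empirical value.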

class MCTSNode:
   def __init__(self, state, parent=None):
       self.state = state
       self.parent = parent
       self.children = {}
       self.visits = 0
       self.value = 0.0
   def is_fully_expanded(self, valid_actions):
       return len(self.children) == len(valid_actions)
   def best_child(self, c=1.4):
       choices = [(action, child.value / child.visits +
                   c * math.sqrt(2 * math.log(self.visits) / child.visits))
                  for action, child in self.children.items()]
       return max(choices, key=lambda x: x[1])


class MCTSAgent:
   def __init__(self, env, n_simulations=50):
       self.env = env
       self.n_simulations = n_simulations
   def search(self, state):
       root = MCTSNode(state)
       for _ in range(self.n_simulations):
           node = root
           sim_env = GridWorld(size=self.env.size)
           sim_env.grid = self.env.grid.copy()
           sim_env.agent_pos = state
           while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
               action, _ = node.best_child()
               node = node.children[action]
               sim_env.agent_pos = node.state
           valid_actions = sim_env.get_valid_actions(node.state)
           if valid_actions and not node.is_fully_expanded(valid_actions):
               untried = [a for a in valid_actions if a not in node.children]
               action = random.choice(untried)
               next_state, _, _ = sim_env.step(action)
               child = MCTSNode(next_state, parent=node)
               node.children[action] = child
               node = child
           total_reward = 0
           depth = 0
           while depth < 20:
               valid = sim_env.get_valid_actions(sim_env.agent_pos)
               if not valid:
                   break
               action = random.choice(valid)
               _, reward, done = sim_env.step(action)
               total_reward += reward
               depth += 1
               if done:
                   break
           while node:
               node.visits += 1
               node.value += total_reward
               node = node.parent
       if root.children:
           return max(root.children.items(), key=lambda x: x[1].visits)[0]
       return random.choice(self.env.get_valid_actions(state))

We implement a Monte Carlo Tree Search (MCTS) agent that simulates and plans over many possible futures. We see how it builds a search tree, expands promising branches, runs random rollouts, and backs the results up through the tree. This allows the agent to plan ahead before acting.
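As a quick check of the planner (an illustrative addition, not in the original script), we can ask the MCTS agent for a single action from the start state and roll the environment forward with it:

# Illustrative only: query the MCTS planner for one action from the start state.
env = GridWorld(size=8, n_obstacles=10)
planner = MCTSAgent(env, n_simulations=30)
state = env.reset()
action = planner.search(state)          # runs 30 simulated rollouts internally
next_state, reward, done = env.step(action)
print(f"MCTS chose action {action}; moved to {next_state} (reward={reward})")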

def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
   rewards_history = []
   for episode in range(episodes):
       state = env.reset()
       total_reward = 0
       for step in range(max_steps):
           valid_actions = env.get_valid_actions(state)
           if agent_type == "mcts":
               action = agent.search(state)
           else:
               action = agent.get_action(state, valid_actions)
           next_state, reward, done = env.step(action)
           total_reward += reward
           if agent_type != "mcts":
               valid_next = env.get_valid_actions(next_state)
               agent.update(state, action, reward, next_state, valid_next)
           state = next_state
           if done:
               break
       rewards_history.append(total_reward)
       if hasattr(agent, 'decay_epsilon'):
           agent.decay_epsilon()
       if (episode + 1) % 100 == 0:
           avg_reward = np.mean(rewards_history[-100:])
           print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
   return rewards_history


if __name__ == "__main__":
   print("=" * 70)
   print("Problem Solving via Exploration Agents Tutorial")
   print("=" * 70)
   env = GridWorld(size=8, n_obstacles=10)
   agents_config = {
       'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
       'UCB Agent': (UCBAgent(), 'standard'),
       'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
   }
   results = {}
   for name, (agent, agent_type) in agents_config.items():
       print(f"nTraining {name}...")
       # Train every agent on the same grid so the comparison is fair and the MCTS
       # planner's internal model matches the environment it acts in.
       rewards = train_agent(agent, env, episodes=300, agent_type=agent_type)
       results[name] = rewards
   plt.figure(figsize=(12, 5))
   plt.subplot(1, 2, 1)
   for name, rewards in results.items():
       smoothed = np.convolve(rewards, np.ones(20)/20, mode="valid")
       plt.plot(smoothed, label=name, linewidth=2)
   plt.xlabel('Episode')
   plt.ylabel('Reward (smoothed)')
   plt.title('Agent Performance Comparison')
   plt.legend()
   plt.grid(alpha=0.3)
   plt.subplot(1, 2, 2)
   for name, rewards in results.items():
       avg_last_100 = np.mean(rewards[-100:])
       plt.bar(name, avg_last_100, alpha=0.7)
   plt.ylabel('Average Reward (Last 100 Episodes)')
   plt.title('Final Performance')
   plt.xticks(rotation=15, ha="right")
   plt.grid(axis="y", alpha=0.3)
   plt.tight_layout()
   plt.show()
   print("=" * 70)
   print("Tutorial Complete!")
   print("Key Concepts Demonstrated:")
   print("1. Epsilon-Greedy exploration")
   print("2. UCB strategy")
   print("3. MCTS-based planning")
   print("=" * 70)

We train all three agents in our grid world and visualize their learning progress and final performance. We analyze how each strategy, Q-learning, UCB, and MCTS, improves over time. Finally, we compare the results to see which exploration method leads to faster, more reliable problem solving.
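If we want a quick numeric summary alongside the plots, a small sketch (assuming the results dictionary produced by the script above) could report when each agent first finished an episode with a positive total reward, a rough proxy for first reaching the goal:

# Illustrative only: first episode with positive total reward per agent.
for name, rewards in results.items():
    first_success = next((i for i, r in enumerate(rewards) if r > 0), None)
    print(f"{name:25s} first positive-reward episode: {first_success}")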

In conclusion, we built and compared three exploration-driven agents, each demonstrating a different strategy for solving the same navigation challenge. We see that epsilon-greedy enables stochastic exploration, UCB balances exploitation with a principled measure of uncertainty, and MCTS leverages simulated rollouts for forward planning. This exercise helps us appreciate how different exploration methods shape adaptability and effectiveness in reinforcement learning.

