Reinforcement Learning Made Simple: Build a Q-Learning Agent in Python

In 2016, Go world champion Lee Sedol faced an opponent that was not made of flesh and blood, but of lines of code.
It soon became clear that the human was losing.
In the end, Lee Sedol lost 4:1.
Last week I watched the AlphaGo documentary again – and once more found it fascinating.
What is remarkable? AlphaGo did not learn its style of play from databases, rules, or handcrafted strategies.
Instead, it had played against itself millions of times – and learned from that process how to win.
Move 37 in game 2 was the moment the whole world understood: this AI does not play like a human – it plays better.
AlphaGo combines supervised learning, reinforcement learning, and tree search. The fascinating part is this: its strategy emerged from playing against itself – using reinforcement learning to improve over time.
Today we use reinforcement learning not only in games, but also in robotics (e.g. gripper arms or household robots), in energy optimization (e.g. to reduce the power consumption of data centers), and in traffic control (e.g. controlling traffic lights).
And above all, we now use reinforcement learning for large language models (e.g. Reinforcement Learning from Human Feedback) to improve the answers of ChatGPT or Gemini, for example.
In this article, I will show you exactly how this works – and how we can understand it best with a simple game: Tic Tac Toe.
What is reinforcement learning?
When we watch a child learning to walk, we see: it gets up, falls down, tries again – and at some point takes its first steps.
No teacher shows the child how to do it. Instead, the child tries out different movements through trial and error.
When it stands up or manages a few steps, that is its reward. After all, its goal is to be able to walk. When it falls down, there is no reward.
This process of trial, error, and reward is the basic idea behind reinforcement learning (RL).
Reinforcement learning is a learning approach in which an agent learns, through interaction with its environment, which actions lead to rewards.
Its goal: to collect as much long-term reward as possible.
- Unlike supervised learning, there are no “correct answers” or labels. The agent has to find out for itself which decisions are good.
- Unlike unsupervised learning, the goal is not to find hidden patterns in the data, but to perform the actions that maximize the reward.
How an RL agent thinks, decides – and learns
For an agent to learn, it needs a sense of the situation it is currently in (state), of what it can do (actions), and of how well its strategy has worked so far (value).
The agent acts, receives feedback, and improves.
For this to work, four components are required:
1) Policy / Strategy
This is the rule or strategy by which the agent decides which action to take in a given situation. In simple cases, this is a lookup table. In more complex applications (e.g. with neural networks), it is a function.
2) Reward signal
The reward comes from the environment. For example, this could be +1 for a win, 0 for a draw, and -1 for a loss. The agent's goal is to collect as much reward as possible over the course of many steps.
3) Value function
This function estimates the expected future reward of a state. The reward tells the agent whether an action was “good” or “bad” in the short term. The value function, in contrast, estimates how good a state is – not just right now, but taking into account the future rewards the agent can expect from that state onwards. The value function therefore estimates the long-term benefit of a state (see the formula right after this list).
4) Model of the environment
The model tells the agent: “If I take action A in state S, I will end up in state S' and receive reward R.”
In model-free methods such as Q-learning, however, this is not necessary.
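For those who like formulas: the value function is usually written in its textbook form as

V(s) = E[ R_{t+1} + γ · R_{t+2} + γ² · R_{t+3} + … | S_t = s ]

where γ (the discount factor) weights rewards that lie further in the future less heavily. We will meet γ again later as the parameter gamma of our Q-learning agent.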
Exploitation vs. exploration: Move 37 – and what we can learn from it
You may remember Move 37 from game 2 between AlphaGo and Lee Sedol:
The unusual move looked like a mistake to us humans – but was later celebrated as a stroke of genius.
Why did the algorithm do that?
The system tried something new. This is called exploration.
Reinforcement learning needs both: the agent has to find a balance between exploitation and exploration.
- Exploitation means that the agent uses an action it already knows to be good.
- Exploration, on the other hand, means trying out actions for the first time. This helps the agent, because such actions can turn out to be better than the ones it already knows.
Through trial and error, the agent tries to find the best strategy. In code, this balance often looks like the small sketch below.
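Here is a minimal, self-contained sketch of the ε-greedy rule (the action names and Q-values are made up purely for illustration – later we will implement the same idea in the choose_action() method of our agent):

import random

epsilon = 0.1
q_values = {"left": 0.2, "middle": 0.8, "right": 0.1}  # made-up values for three possible actions

if random.random() < epsilon:
    action = random.choice(list(q_values))    # exploration: try a random action
else:
    action = max(q_values, key=q_values.get)  # exploitation: pick the best known action

print(action)  # "middle" in roughly 93% of runs, a random action otherwise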
Tic Tac Toe with reinforcement learning
Let's look at reinforcement learning using one of the simplest games there is.
You probably played it as a child: Tic Tac Toe.

The game is perfect as an introductory example because it does not require a neural network, the rules are clear, and we can implement the game with just a little Python:
- Our agent starts with zero knowledge of the game – like a person seeing the game for the first time.
- The agent gradually learns to evaluate each game state: a value of 0.5 means “I don't yet know whether I will win from here”, while 1.0 means “this state almost certainly leads to a win”.
- By playing many games, the agent observes what works – and adapts its strategy.
The goal? With every move, the agent should choose the action that leads to the highest possible long-term reward.
In this section, we will build the RL program step by step and create the file TicTacToeRL.py.
→ You can find the complete code in the GitHub repository.
1. Building the game environment
In reinforcement learning, the agent learns by interacting with an environment. The environment defines what the state is (e.g. the current board), which actions are allowed (e.g. where a mark may be placed), and what feedback follows an action (e.g. a reward).
Formally, we describe this setup as a Markov decision process: the model consists of states, actions, and rewards.
First, we create a TicTacToe class. It represents the game board, which is built as a 3×3 NumPy array, and handles the game logic:
- The method reset(self) starts a new game.
- The method available_actions() returns all free cells.
- The method step(action, player) executes a game move. It returns the new state, the reward (1 = win, 0.5 = draw, -10 = invalid move), and whether the game is over. We penalize invalid moves in this example with -10 so that the agent learns to avoid them more quickly – a common practice in small RL environments.
- The method check_winner() checks whether a player has three X's or O's in a row and has therefore won.
- The method render_gui() displays the current board with Matplotlib as an X and O graphic.
import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import random
from collections import defaultdict

# Tic Tac Toe game environment
class TicTacToe:
    def __init__(self):
        self.board = np.zeros((3, 3), dtype=int)
        self.done = False
        self.winner = None

    def reset(self):
        self.board[:] = 0
        self.done = False
        self.winner = None
        return self.get_state()

    def get_state(self):
        return tuple(self.board.flatten())

    def available_actions(self):
        return [(i, j) for i in range(3) for j in range(3) if self.board[i, j] == 0]

    def step(self, action, player):
        if self.done:
            raise ValueError("Game is over")
        i, j = action
        if self.board[i, j] != 0:
            return self.get_state(), -10, True  # invalid move is penalized
        self.board[i, j] = player
        if self.check_winner(player):
            self.done = True
            self.winner = player
            return self.get_state(), 1, True
        elif not self.available_actions():
            self.done = True
            return self.get_state(), 0.5, True  # board is full: draw
        return self.get_state(), 0, False

    def check_winner(self, player):
        for i in range(3):
            if all(self.board[i, :] == player) or all(self.board[:, i] == player):
                return True
        if all(np.diag(self.board) == player) or all(np.diag(np.fliplr(self.board)) == player):
            return True
        return False

    def render_gui(self):
        fig, ax = plt.subplots()
        ax.set_xticks([0.5, 1.5], minor=False)
        ax.set_yticks([0.5, 1.5], minor=False)
        ax.set_xticks([], minor=True)
        ax.set_yticks([], minor=True)
        ax.set_xlim(-0.5, 2.5)
        ax.set_ylim(-0.5, 2.5)
        ax.grid(True, which='major', color='black', linewidth=2)
        for i in range(3):
            for j in range(3):
                value = self.board[i, j]
                if value == 1:
                    ax.plot(j, 2 - i, 'x', markersize=20, markeredgewidth=2, color='blue')
                elif value == -1:
                    circle = plt.Circle((j, 2 - i), 0.3, fill=False, color='red', linewidth=2)
                    ax.add_patch(circle)
        ax.set_aspect('equal')
        plt.axis('off')
        plt.show()
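Before we add the agent, we can quickly try out the environment by hand. This short snippet is not part of the final script, just a sanity check:

env = TicTacToe()
state = env.reset()
print(env.available_actions())                     # all 9 cells are still free
state, reward, done = env.step((0, 0), player=1)   # X plays the top-left corner
state, reward, done = env.step((1, 1), player=-1)  # O plays the center
print(reward, done)                                # 0 False – nobody has won yet
env.render_gui()                                   # opens a window showing the X and the O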
2. Building the Q-learning agent
Next, we define the learning part: our agent.
It decides which action to take in a given state in order to receive the highest possible reward.
The agent uses the classic RL method Q-learning. For every combination of state and action, a Q-value is stored – the expected long-term benefit of that action.
The most important methods are:
- With choose_action(self, state, actions), the agent decides at every turn whether to pick the action it already knows to be good (exploitation) or to try out a new action (exploration). This decision is based on the epsilon-greedy method: with probability ε = 0.1 the agent chooses a random action (exploration), and with probability 1 – ε (i.e. 90%) it chooses the best known action according to the Q-table (exploitation).
- With update(state, action, reward, next_state, next_actions), we adjust the Q-value depending on how good the action was and what happened afterwards. This is the central learning step.
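The update method implements the standard Q-learning update rule:

Q(s, a) ← Q(s, a) + α · ( r + γ · max_a' Q(s', a') − Q(s, a) )

Here α (alpha) is the learning rate, γ (gamma) the discount factor, r the reward, and max_a' Q(s', a') the best Q-value reachable from the next state s'. You will find exactly this line again in the code below.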
# Q-learning agent
class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = defaultdict(float)  # Q-values, default 0.0 for unseen state-action pairs
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate

    def get_q(self, state, action):
        return self.q_table[(state, action)]

    def choose_action(self, state, actions):
        if random.random() < self.epsilon:
            return random.choice(actions)  # exploration: random move
        else:
            q_values = [self.get_q(state, a) for a in actions]
            max_q = max(q_values)
            best_actions = [a for a, q in zip(actions, q_values) if q == max_q]
            return random.choice(best_actions)  # exploitation: best known move, ties broken randomly

    def update(self, state, action, reward, next_state, next_actions):
        max_q_next = max([self.get_q(next_state, a) for a in next_actions], default=0)
        old_value = self.q_table[(state, action)]
        new_value = old_value + self.alpha * (reward + self.gamma * max_q_next - old_value)
        self.q_table[(state, action)] = new_value
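To see how environment and agent fit together, here is a single learning step in isolation (again, just an illustrative snippet, not part of the final script):

env = TicTacToe()
agent = QLearningAgent()

state = env.reset()
actions = env.available_actions()
action = agent.choose_action(state, actions)            # epsilon-greedy decision
next_state, reward, done = env.step(action, player=1)   # the environment reacts
agent.update(state, action, reward, next_state, env.available_actions())
print(agent.get_q(state, action))                       # the stored Q-value for this state-action pair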
3. Training the agent
The actual learning process begins in this step. During training, the agent learns through trial and error: it plays many games, remembers which actions work well – and adapts its strategy.
During training, the agent learns how its actions are rewarded, how its behavior influences the states, and how better strategies develop over time.
- The function train(agent, episodes=10000) lets the agent play 10,000 games against a simple random opponent. In each episode, the agent (player 1) makes a move, followed by the opponent (player 2). After every move, the agent learns via update().
- Every 1,000 games we record how many wins, draws, and losses there were.
- Finally, we plot the learning curve with Matplotlib. It shows how the agent improves over time.
# Training with learning curve
def train(agent, episodes=10000):
    env = TicTacToe()
    results = {"win": 0, "draw": 0, "loss": 0}
    win_rates = []
    draw_rates = []
    loss_rates = []
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Agent (player 1) moves
            actions = env.available_actions()
            action = agent.choose_action(state, actions)
            next_state, reward, done = env.step(action, player=1)
            if done:
                agent.update(state, action, reward, next_state, [])
                if reward == 1:
                    results["win"] += 1
                elif reward == 0.5:
                    results["draw"] += 1
                else:
                    results["loss"] += 1
                break
            # Random opponent (player 2) moves
            opp_actions = env.available_actions()
            opp_action = random.choice(opp_actions)
            next_state2, reward2, done = env.step(opp_action, player=-1)
            if done:
                # Opponent ended the game: the agent learns from the negated reward
                agent.update(state, action, -1 * reward2, next_state2, [])
                if reward2 == 1:
                    results["loss"] += 1
                elif reward2 == 0.5:
                    results["draw"] += 1
                else:
                    results["win"] += 1
                break
            # Game continues: learn from this transition and keep playing
            next_actions = env.available_actions()
            agent.update(state, action, reward, next_state2, next_actions)
            state = next_state2
        if (episode + 1) % 1000 == 0:
            total = sum(results.values())
            win_rates.append(results["win"] / total)
            draw_rates.append(results["draw"] / total)
            loss_rates.append(results["loss"] / total)
            print(f"Episode {episode+1}: Wins {results['win']}, Draws {results['draw']}, Losses {results['loss']}")
            results = {"win": 0, "draw": 0, "loss": 0}
    x = [i * 1000 for i in range(1, len(win_rates) + 1)]
    plt.plot(x, win_rates, label="Win Rate")
    plt.plot(x, draw_rates, label="Draw Rate")
    plt.plot(x, loss_rates, label="Loss Rate")
    plt.xlabel("Episodes")
    plt.ylabel("Rate")
    plt.title("Learning Curve of the Q-Learning Agent")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
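The hyperparameters alpha (learning rate), gamma (discount factor), and epsilon (exploration rate) use common default values here. If you want to experiment, you can simply pass other values when creating the agent – for example more exploration and a longer training run (an optional variation, not required for the results shown below):

agent = QLearningAgent(alpha=0.2, gamma=0.95, epsilon=0.2)  # larger learning steps, more exploration
train(agent, episodes=20000)                                # train for twice as many games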
4. Visualizing the board
With the main program if __name__ == "__main__": we define the entry point. It ensures that the agent's training starts automatically when we run the script. In addition, we use render_gui() to display an example Tic Tac Toe board as an image.
# Main program
if __name__ == "__main__":
    agent = QLearningAgent()
    train(agent, episodes=10000)

    # Visualization of an example board
    env = TicTacToe()
    env.board[0, 0] = 1
    env.board[1, 1] = -1
    env.render_gui()
Execution in the terminal
We save the code in the file TicTacToeRL.py.
In the terminal, we now navigate to the directory where TicTacToeRL.py is saved and run the file with the command “python TicTacToeRL.py”.
While it runs, we see in the terminal how many games our agent has won, drawn, and lost after every 1,000 episodes:

And the visualization of the learning curve:

Final thoughts
With Tic Tac Toe we used a very simple game implemented in Python – but even this small example clearly shows how reinforcement learning works:
- The agent starts without any prior knowledge.
- It improves its strategy through feedback and experience.
- Its decisions gradually get better – not because it knows the rules, but because it learns.
In our example, the opponent was a random agent. As a next step, we could let our Q-learning agent play against another learning agent – or against ourselves.
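If you want to try the last idea, a small helper like the following would do. This function is not part of the script above, just a sketch of how a human-vs-agent loop could look with the classes we built:

def play_against_agent(agent, env):
    # The trained agent plays X (1) greedily, the human plays O (-1) via terminal input.
    agent.epsilon = 0.0                     # no more exploration when playing "for real"
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state, env.available_actions())
        state, reward, done = env.step(action, player=1)
        env.render_gui()                    # close the plot window to continue
        if done:
            break
        row, col = map(int, input("Your move (row col, 0-2): ").split())
        state, reward, done = env.step((row, col), player=-1)
    print("Winner:", env.winner)            # None means a draw (or an invalid human move)

# Example call after training:
# play_against_agent(agent, TicTacToe())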
Reinforcement learning shows us that machine intelligence is not created through information or knowledge alone – but through experience, feedback, and adaptation.