Surviving High Uncertainty in Transportation with MARL

This article is part of a series on optimizing transport efficiency with multi-agent reinforcement learning (MARL). Here, I focus on how generalization was achieved. I recommend reading Part 1 first if you want the architectural and business context.
The goal was for the model to handle middle-mile operations and stay robust even under changing conditions. I realized this idea using three basic concepts:
- A hybrid architecture abstracts away the physical complexity
- A scale-invariant view creates inputs for a universal model
- MARL keeps agents flexible
Spoiler alert: the first two concepts let us easily transfer agents between tasks, while the third lets agents stay flexible within a task. Let's look at each one.
Hybrid Architecture
How can you engineer a system that delivers robust solutions even when it is ported to completely new situations? You make it solve not some special case, but something general – a problem at a higher level of abstraction.
But how do we bring this to life? Let's break the problem down into layers and solve it with a hybrid approach: RL dictates the high-level strategy, while LP handles the low-level execution. By doing so, we let RL synthesize domain-wide information, while LP solves the specific packing cases.
action = [num_vehicles_1, ..., num_vehicles_n]
See Part 1 for more details on the hybrid method and the action design.
Because of this "separation of duties," the RL agent is not responsible for the minute technical details of where the parcels go or how they are packed. It acts like a manager removed from execution details.
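To make this concrete, here is a made-up action for a node with four neighbors (the numbers are purely illustrative):

# Illustrative only: dispatch 2 vehicles toward neighbor 1 and 1 vehicle
# toward neighbor 3, and nothing toward the others. Which parcels end up
# inside those vehicles is decided by the LP solver.
action = [0, 2, 0, 1]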
Finally, the RL agent affects the environment indirectly – its non-zero dispatch decisions are processed by the LP solver, which then updates the environment.
Here is how we process the action of the RL agent and pass it to the LP solver.
def decide_send_LP(self, action: np.ndarray):
    # Parse the RL agent's action array into a dictionary of active destinations
    neighb_action = {v_id: num_v for v_id, num_v in enumerate(action) if num_v > 0}
    if not neighb_action:
        return 0, 0  # No vehicles dispatched

    # Get warehouse inventory for parcels that can actually go to the chosen destinations
    available_parcels = self.get_available_parcels(destinations=neighb_action.keys())
    if available_parcels.empty:
        return 0, 0  # No parcels to send

    # The LP decides which parcels go into the vehicles to maximize volume/profit
    av_vehicles = self.get_available_vehicles()
    parcels_result, edges_result = send_veh(neighb_action, available_parcels, av_vehicles)

    # Update the environment state based on the LP's physical execution
    self.process_sent(parcels_result)

    # Return costs to the environment (for reward calculation)
    shipment_cost = sum(edges_result.c_cost * edges_result.v_varr_value)
    num_vehicles_sent = edges_result.v_varr_value.sum()
    return shipment_cost, num_vehicles_sent
What's going on here? First, we translate the agent's action into a digestible format and make sure the agent has requested at least one dispatch. Then, we check whether there are any parcels in the warehouse that can be sent to the chosen destinations.
Next, we call the LP solver, which packs the available parcels into the available vehicles, choosing not only the vehicle type but the specific vehicle, as well as where each parcel will go.
And finally, we update the environment based on the LP's output, calculate the shipping cost, and return it for the reward calculation.
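For intuition, here is a minimal sketch of the kind of packing problem the LP layer solves. It is not the actual send_veh solver from Part 1 – the function name, the PuLP dependency, and the single shared capacity are simplifying assumptions; the real formulation also handles costs, routes, and vehicle types.

import pulp

def pack_parcels(parcel_volumes, num_vehicles, vehicle_capacity):
    # Toy MIP: choose which parcels go into which vehicle so that the
    # shipped volume is maximized without exceeding any vehicle's capacity.
    prob = pulp.LpProblem("parcel_packing", pulp.LpMaximize)

    # x[p][v] = 1 if parcel p is loaded into vehicle v
    x = [
        [pulp.LpVariable(f"x_{p}_{v}", cat="Binary") for v in range(num_vehicles)]
        for p in range(len(parcel_volumes))
    ]

    # Objective: maximize the total shipped volume
    prob += pulp.lpSum(
        parcel_volumes[p] * x[p][v]
        for p in range(len(parcel_volumes))
        for v in range(num_vehicles)
    )

    # Each parcel is loaded into at most one vehicle
    for p in range(len(parcel_volumes)):
        prob += pulp.lpSum(x[p][v] for v in range(num_vehicles)) <= 1

    # Each vehicle respects its capacity
    for v in range(num_vehicles):
        prob += pulp.lpSum(
            parcel_volumes[p] * x[p][v] for p in range(len(parcel_volumes))
        ) <= vehicle_capacity

    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    return [
        (p, v)
        for p in range(len(parcel_volumes))
        for v in range(num_vehicles)
        if x[p][v].value() > 0.5
    ]

# Example: 5 parcels, the agent dispatched 2 vehicles of capacity 10 each
print(pack_parcels([4, 6, 3, 8, 5], num_vehicles=2, vehicle_capacity=10))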
So we get portability – as long as the problem structure is the same, the system can adapt to any task within the same category.
A Scale-Invariant View
Let's say we have the hybrid structure. But how can we make it work in different situations if the RL agents' observation and action spaces are technically fixed at the start?
I achieved that by changing the representation – I normalized the observation space to make it scale-invariant. Instead of tracking raw statistics (e.g., "how many parcels were sent"), we track relative measures (e.g., "what percentage of the total backlog was sent").
This technical trick gives you a "free" transfer of an agent from one task to another, because the agent works at a higher level of abstraction where absolute numbers don't matter.
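As a toy illustration of why this helps (the numbers are made up), two warehouses of very different absolute size produce exactly the same observation when they are in the same relative state:

# Made-up numbers: a small and a large warehouse in the same relative state
small = {"backlog": 200, "sent": 50}
large = {"backlog": 20_000, "sent": 5_000}

def obs(raw):
    # Track the share of the backlog that was sent, not the raw count
    return raw["sent"] / raw["backlog"]

print(obs(small), obs(large))  # 0.25 0.25 -> the agent sees the same state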
Let's discuss some examples.
Observations
Local inventory (perc_piles_wh) – the number of parcels currently at the warehouse.
def upd_perc_piles_wh(env):
    # Share of the episode's total parcel volume currently sitting at this warehouse
    piles_wh = env.metrics['piles_wh']
    return np.array([piles_wh / env.num_piles])
Here, to keep the observation scale-invariant, I divide the current inventory piles_wh by the total number of parcels that will pass through the warehouse during the episode (env.num_piles). By doing so, the agent learns to prioritize based on the percentage of the daily workload it is currently handling.
Local inventory by direction – shows exactly where the current load needs to go. This is the basis of the routing decision.
def upd_warehouse_loading_level_by_directions(env):
    # Get the current physical inventory at this specific node
    parcels = env.get_current_warehouse_parcels()
    if parcels.empty:
        return np.zeros(env.num_vertices)

    # Prepare the destinations array
    destinations = parcels['destination'].values.astype(int)

    # Get the counts for the destinations
    counts = np.bincount(destinations, minlength=env.num_vertices)
    return counts / len(parcels)
First, we pull the current stock of parcels for this particular node and handle the case where it is empty. Next, we extract the 'destination' column as an array of integers representing the target warehouse IDs. Then np.bincount counts how the parcels are distributed across all destinations. By dividing these counts by the total number of parcels currently in this local warehouse, we convert absolute volumes into shares. The result is a scale-invariant vector of floats, where each index holds the exact fraction of the local stock headed toward that particular vertex.
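A quick toy example (made-up numbers, a 5-vertex graph) of what this normalization produces:

import numpy as np

# 4 parcels at this warehouse: two headed to vertex 2, one to vertex 4, one to vertex 0
destinations = np.array([2, 2, 4, 0])
counts = np.bincount(destinations, minlength=5)  # array([1, 0, 2, 0, 1])
print(counts / len(destinations))                # [0.25 0.   0.5  0.   0.25]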
Nearest deadline by direction (deadlines_min_dist) – the distribution of the nearest deadlines for the current stock.
def upd_deadlines_min_dist(env):
    parcels = env.get_current_warehouse_parcels()
    deadlines = np.ones(env.num_vertices)  # 1.0 means no urgency or no parcels
    if not parcels.empty:
        # Group by destination and find the actual minimum time left
        min_times = parcels.groupby('destination')['time_left'].min() / env.max_time_left
        # Assign the calculated minimums to their respective destination indices
        deadlines[min_times.index.astype(int)] = min_times.values
    return np.clip(deadlines, env.config.OBS_BOX_LOW, env.config.OBS_BOX_HIGH)
Here, we also pull the current local inventory. We initialize a deadline vector of the size of the graph and fill it with ones (1.0 means no urgency, while values closer to 0.0 mean the deadline is imminent).
Next, we group the parcels by destination and find the minimum time_left for each route, then divide it by the maximum possible remaining time to convert an absolute time into a relative one (the same normalization trick again).
Because the grouped result only contains entries for destinations that currently have parcels, it is sparse and does not match the size of our observation vector. So we map these urgency values onto the correct topological destinations, using the destination IDs as integer indices.
As a final touch, we clip the array so that it stays strictly between 0 and 1. This is an important safety measure: overdue parcels produce negative time values, which could violate the observation-space bounds of the policy network.
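A tiny illustration of that clipping step, assuming the observation bounds are [0, 1]:

import numpy as np

# One overdue parcel yields a negative relative deadline; clipping pins it
# back inside the assumed [0, 1] observation bounds.
deadlines = np.array([1.0, 0.3, -0.2, 1.0])
print(np.clip(deadlines, 0.0, 1.0))  # [1.  0.3 0.  1. ]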
So a new task usually means a new observation design. With this approach, however, that is not the case: agents can be transferred from warehouse to warehouse by design, regardless of the number of parcels, vehicles, or neighbors.
Zero-Padding, or Maximum-Node Padding
In the current version, the only remaining constraint is the total number of warehouses in the network (the graph order). It must be known in advance, since transfer is only possible between graphs with the same maximum size.
We handle this limitation with zero-padding. We define a maximum graph size (e.g., 100 vertices), and for any smaller graph we mask the missing nodes with zeros. If your maximum graph size is 100 vertices, you simply place the agents on the active vertices and pad the rest with zeros. The same concept applies to neighbor observations: the size of the vector always matches the order of the transport graph, but only existing (visible) neighbors have non-zero values.
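Here is a minimal sketch of that padding step. The function and constant names (pad_observation, MAX_VERTICES) are mine, not from the project code:

import numpy as np

MAX_VERTICES = 100  # assumed maximum graph size; smaller graphs are padded up to it

def pad_observation(obs_vector: np.ndarray, max_vertices: int = MAX_VERTICES) -> np.ndarray:
    # Zero-pad a per-vertex observation so its length is always max_vertices
    padded = np.zeros(max_vertices, dtype=np.float32)
    padded[: len(obs_vector)] = obs_vector
    return padded

# A 4-vertex graph produces a 4-element vector; the agent always receives 100
# values, where the 96 non-existent vertices stay at zero.
small_graph_obs = np.array([0.25, 0.0, 0.5, 0.25])
print(pad_observation(small_graph_obs).shape)  # (100,)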
MARL
Good solutions in a changing context
Now let's face another problem: reality is volatile.
A sudden snowstorm hits, 3PL prices triple, or there's a big spike in orders right before the holidays. The company needs to be flexible to survive this. Note that the formal rules of the game (vehicle sizes, the map) remain the same, but the context changes completely.
Static heuristics (e.g., a hard-coded rule of thumb to "send at 85% capacity") quickly start producing large losses in these situations. A major advantage of the MARL approach is that it reads the situation through its observations and dynamically shifts its decision thresholds "on the fly" in response to them.
Another major advantage of MARL is that the problem is divided into small parts, which are solved independently by the agents. The multi-agent architecture saves us from having to solve every network problem with a single "mega-agent". I will cover that in more detail in a later article.
MARL Implementation
A few words about how the multi-agent setup is implemented. I faced two challenges:
- Because the agents' actions are interdependent, they can easily co-adapt to each other's sub-optimal behavior. As a result, traditional MARL can be quite unstable in the early stages of training.
- I wanted to stay within the OpenAI Gym + Stable-Baselines stack, which doesn't support native MARL training out of the box.
At the same time, going back to a single-agent solution was not an option because of the large number of warehouses, and the "mega-single-agent" approach had already been ruled out at the architecture stage (see Part 1 for details).
As a result, I designed the following training pipeline:
- Instead of training all agents at once, we train only one – the "current" agent – per episode.
- While the "current" agent is training, the others run in frozen inference mode.
- A global environment "step" consists of a sequential rollout of all agents: the training agent takes its action, followed by the inference agents.
Here's how it looks in code:
# Initialize environment and load the current best weights for all agents
env.env_method('prepare_env', best_agent_paths)

for i in range(NUM_MARL_LOOPS):
    for training_ag_id in agents.keys():
        # Shift the environment's perspective to the current active agent
        env.env_method('set_cur_training_agent', training_ag_id)

        # Fetch the active agent's policy model
        agent_obj = agents.get(training_ag_id)

        # Train ONLY this agent
        # (This will call env.step() under the hood
        # and will run the other agents in frozen inference mode)
        agent_obj = agent_obj.learn(
            TS_PER_AGENT,
            reset_num_timesteps=False,
            tb_log_name=f"Agent_{training_ag_id}",
            callback=callbacks,
        )

        # Save the updated weights and push them to the live models cache
        agent_obj.save(last_agent_paths[training_ag_id])
        agents[training_ag_id] = agent_obj
First, prepare_env() is invoked; it sets the defaults and loads the current best weights for all agents. Then comes the main loop, which runs NUM_MARL_LOOPS training passes over the whole network.
Inside it, we manage the training of one "current" agent at a time. The agents live in a dictionary: keys are agent IDs, values are model objects. The set_cur_training_agent() method switches the environment's perspective. Then we take the current agent's model and call .learn(). After that, it's straightforward: we save the model and update the agents dictionary.
Now, let's take a brief look at how this step works in the environment:
def step(self, action) -> tuple[dict, float, bool, dict]:
    # Training Agent executes its action
    reward = self.process_packages(action)
    self.process_inflow()  # Localized to the active agent's node
    self.update_state_and_metrics(reward)
    self.save_current_act_agent()

    # Inference Loop: Other agents take their turns sequentially
    for ag_id in self.inference_agents.keys():
        if ag_id == self.cur_training_agent:
            continue  # Skip the training agent (it already acted)

        # Switch environment context to the current inference agent
        self.current_origin = ag_id
        self.load_act_agent()

        # Load model and get masked prediction
        agent_obj = self.inference_agents.get(ag_id)
        action_mask = self.valid_action_mask()
        ag_action, _ = agent_obj.predict(self.state, action_masks=action_mask)

        # Execute inference agent's action
        sub_reward = self.process_packages(ag_action)
        self.update_state_and_metrics(sub_reward)
        self.save_current_act_agent()

    # Restore environment state to the Training Agent's perspective
    self.current_origin = self.cur_training_agent
    self.load_act_agent()

    # Check terminal conditions
    done = self.check_if_done()
    self.step_n += 1
    return self.state, reward, done, self.info
First, we execute the action of the "current" training agent. We start by processing the parcels currently in the system with self.process_packages(action), where the agent's action is translated into environment logic. In other words, if the agent decides to send some trucks to other warehouses, this is where the LP solver is used.
After that, we receive new incoming packages in self.process_inflow(), update the state and metrics in self.update_state_and_metrics(), and save the agent context in save_current_act_agent().
Now the fun part begins. Since the current training agent has already taken its action, we need to consider the actions of the entire network. So we start a loop over our available agents, skipping the training one. Inside this loop, we change the “current” agent context, load its model, and generate predictions by feeding the current state and action mask to agent_obj.predict().
From there, the flow is the same as for the training agent: we process the generated action (now from the inference agent) and update the environment. Finally, after the loop, we switch the context back to the current training agent and return the final results from the step.
In the following episodes
So, we now have a fully functional training loop. The code runs and the MARL environment works, but how can we make sure that this training process actually:
- finishes in a reasonable time?
- converges?
- produces "good enough" routing strategies?
That is what I will explain in the following articles. Stay tuned!



