Value Iteration
Description
ValueIteration is a model-based, tabular planning algorithm for discrete Markov Decision Processes (MDPs). It learns an empirical transition model P(s′|s,a) and a mean reward model R(s,a,s′) directly from environment interaction, then applies the Bellman optimality operator to convergence to obtain the optimal state-value function V*(s).
Core Bellman update (applied at every iteration):
Once V* is known, the optimal policy is recovered by a one-step lookahead:
Q(s,a) is never stored — it is computed transiently during planning and action selection. This keeps memory usage at O(|S|) instead of O(|S||A|), making ValueIteration slightly more memory-efficient than QValueIteration for large state spaces.
When to use ``ValueIteration``?
You want a simple, memory-efficient planner for small to medium discrete MDPs.
You need a reference baseline before moving to Q-learning or deep-RL.
The environment is small enough that a table fits in memory and planning is fast.
Inherits from BaseIteration, which handles empirical model construction, random exploration, and episode rollouts.
Constructor
ValueIteration(
env: gym.Env,
gamma: float = 0.9,
num_test_episodes: int = 20,
device: Optional[torch.device] = None,
dtype: Optional[torch.dtype] = torch.float32,
)
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
— |
Gymnasium / qrl-qai environment with discrete obs & action spaces. |
|
|
|
Discount factor in [0, 1). Higher → more far-sighted. |
|
|
|
Informational; used by external training loops for evaluation. |
|
|
auto (CUDA/CPU) |
Compute device for all tensors. |
|
|
|
Floating-point precision for all tensors. |
The constructor calls env.reset() internally to initialise the agent’s current state.
Typical Training Loop
1. agent.play_n_random_steps(N) ← seed the model
2. agent.value_iteration() ← run Bellman updates to convergence
3. Evaluate agent.play_episode() ← measure quality
4. Repeat 1-3 until solved
Each outer iteration adds more environment experience (step 1), then re-solves the updated empirical MDP (step 2). As the empirical model improves, V* converges to the true optimal value function.
Implementation Details
Exploration / model-building
Method / Property |
Description |
|---|---|
|
Execute n uniformly-random steps to seed the empirical model. |
|
Run one greedy episode on env, updating the model. Returns total reward. |
Planning
Method |
Description |
|---|---|
|
Apply Bellman updates until ‖V_new − V‖_∞ < |
Action selection & inspection
Method / Property |
Description |
|---|---|
|
Greedy action via one-step lookahead on V. Returns |
|
Current state-value tensor, shape |
|
Greedy policy tensor of shape |
Internal model (inherited from BaseIteration)
Attribute |
Shape |
Description |
|---|---|---|
|
|
Visitation counts C[s,a,s’]. |
|
|
Accumulated reward per transition. |
|
|
Current value-function tensor. |
Example 1 — FrozenLake-v1 (stochastic environment)
FrozenLake-v1 with is_slippery=True is a classic stochastic benchmark: the agent tries to cross a frozen lake (4×4 grid) from start to goal without falling through holes. Transitions are stochastic — each intended action has only a 1/3 chance of succeeding; the remaining probability is split between the two perpendicular directions.
This is an ideal showcase for ValueIteration because:
The state and action spaces are small (16 states, 4 actions).
Stochasticity means the empirical model must accumulate enough samples before the value function converges to a useful policy.
A success rate of > 80 % is considered solved.
[1]:
import gymnasium as gym
from qrl.algorithms.classical import ValueIteration
# Two separate env instances:
# env → used by the agent for exploration and model-building
# test_env → used for clean evaluation (no model updates during testing)
env = gym.make("FrozenLake-v1", is_slippery=True)
test_env = gym.make("FrozenLake-v1", is_slippery=True)
agent = ValueIteration(env=env, gamma=0.9)
print(f"State space : {agent.n_states} (one per grid cell)")
print(f"Action space : {agent.n_actions} (left=0, down=1, right=2, up=3)")
print(f"Device : {agent.device}")
print(f"V shape : {agent.V.shape}")
State space : 16 (one per grid cell)
Action space : 4 (left=0, down=1, right=2, up=3)
Device : cpu
V shape : torch.Size([16])
[2]:
TEST_EPISODES = 20
iter_no = 0
best_reward = 0.0
while True:
iter_no += 1
# Step 1 — explore: collect 100 random transitions to update the model
agent.play_n_random_steps(100)
# Step 2 — plan: solve the empirical MDP with Bellman updates
n_bellman = agent.value_iteration()
# Step 3 — evaluate: run TEST_EPISODES greedy episodes
reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))
reward /= TEST_EPISODES
if reward > best_reward:
print(f" ▶ Best reward updated {best_reward:.3f} → {reward:.3f} "
f"(iter {iter_no}, Bellman sweeps: {n_bellman})")
best_reward = reward
if best_reward > 0.80:
print(f"\nSolved in {iter_no} outer iterations!")
break
▶ Best reward updated 0.000 → 0.250 (iter 16, Bellman sweeps: 77)
▶ Best reward updated 0.250 → 0.500 (iter 17, Bellman sweeps: 71)
▶ Best reward updated 0.500 → 0.550 (iter 18, Bellman sweeps: 60)
▶ Best reward updated 0.550 → 0.600 (iter 23, Bellman sweeps: 43)
▶ Best reward updated 0.600 → 0.700 (iter 43, Bellman sweeps: 42)
▶ Best reward updated 0.700 → 0.750 (iter 48, Bellman sweeps: 54)
▶ Best reward updated 0.750 → 0.800 (iter 77, Bellman sweeps: 43)
▶ Best reward updated 0.800 → 0.850 (iter 91, Bellman sweeps: 34)
Solved in 91 outer iterations!
[3]:
import torch
policy = agent.get_policy() # shape (16,) — one greedy action per state
action_symbols = {0: "←", 1: "↓", 2: "→", 3: "↑"}
print("Greedy policy (raw tensor):")
print(policy)
print("\nGreedy policy grid (4×4):")
grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]
for row in grid:
print(" ", " ".join(row))
print("\nState-value function V (reshaped to 4×4):")
print(agent.V.reshape(4, 4).round(decimals=3))
Greedy policy (raw tensor):
tensor([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 1, 0, 0, 2, 1, 0])
Greedy policy grid (4×4):
← ↑ ← ↑
← ← ← ←
↑ ↓ ↓ ←
← → ↓ ←
State-value function V (reshaped to 4×4):
tensor([[0.0640, 0.0520, 0.0570, 0.0420],
[0.0860, 0.0000, 0.0880, 0.0000],
[0.1350, 0.2320, 0.2640, 0.0000],
[0.0000, 0.3690, 0.6460, 0.0000]])
Example 2 — BlochSphereV1 (quantum discrete MDP)
BlochSphereV1 models a single-qubit Bloch sphere as a deterministic finite graph:
6 states — the six canonical pure states:
|0⟩,|1⟩,|+⟩,|-⟩,|+i⟩,|-i⟩.4 actions — unitary gates H, X, Z, S.
Reward — sparse binary:
1.0on reaching the target state,0.0otherwise.
Because every gate acts deterministically, ValueIteration converges in very few outer iterations (often just one), demonstrating how model-based planning trivially solves small deterministic MDPs.
After training, the environment’s built-in 2D graph renderer produces an animated GIF showing:
Left panel — true state-transition graph with the episode trajectory overlaid.
Right panel — the agent’s learned V(s) values (warm colormap) and greedy policy (bold teal edges).
[4]:
from qrl.algorithms.classical import ValueIteration
from qrl.env import BlochSphereV1
# Target state index 4 → |+i⟩
# set ffmpeg=True if you have ffmpeg installed to save as mp4
env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
agent = ValueIteration(env=env, gamma=0.9)
print(f"State space : {agent.n_states} (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)")
print(f"Action space : {agent.n_actions} (H=0, X=1, Z=2, S=3)")
print(f"Target state : {env.target_state_index} → |+i⟩")
print()
print("Transition table T[s, a] = s':")
print(BlochSphereV1.transition_table())
State space : 6 (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)
Action space : 4 (H=0, X=1, Z=2, S=3)
Target state : 4 → |+i⟩
Transition table T[s, a] = s':
[[2 1 0 0]
[3 0 1 1]
[0 2 3 4]
[1 3 2 5]
[5 5 5 3]
[4 4 4 2]]
[5]:
TEST_EPISODES = 20
iter_no, best_reward = 0, 0.0
while True:
iter_no += 1
# Explore: seed the empirical model
agent.play_n_random_steps(50)
# Plan: solve with Value Iteration
agent.value_iteration()
# Collect a graph snapshot (appended to env.fig_array_list)
env._render_graph(agent=agent)
# Evaluate: measure success rate over TEST_EPISODES greedy rollouts
reward = 0.0
for _ in range(TEST_EPISODES):
obs, _ = test_env.reset()
while True:
action = agent.select_action(int(obs))
obs, _, terminated, truncated, _ = test_env.step(action)
if terminated or truncated:
reward += float(terminated) # 1.0 on success, 0.0 on truncation
break
reward /= TEST_EPISODES
print(f"Iteration {iter_no:>3} | Success rate: {reward:.3f}")
if reward > best_reward:
print(f" ▶ Best reward updated {best_reward:.3f} → {reward:.3f}")
best_reward = reward
if best_reward >= 1.0: # 100 % success rate → solved
print(f"\nSolved in {iter_no} iteration(s)!")
break
Iteration 1 | Success rate: 1.000
▶ Best reward updated 0.000 → 1.000
Solved in 1 iteration(s)!
[6]:
state_labels = ["|0⟩", "|1⟩", "|+⟩", "|-⟩", "|+i⟩", "|-i⟩"]
action_names = ["H", "X", "Z", "S"]
policy = agent.get_policy()
print("Greedy policy:")
for s, (label, a) in enumerate(zip(state_labels, policy)):
marker = " ← target" if s == env.target_state_index else ""
print(f" s={s} {label:<6} → {action_names[int(a)]}{marker}")
print("\nLearned V(s):")
for s, (label, v) in enumerate(zip(state_labels, agent.V)):
print(f" s={s} {label:<6} V = {v:.4f}")
# Save the learning animation
env.render(
save_path_without_extension="bloch_sphere_value_iteration",
interval=600,
ffmpeg=False,
)
print("\nAnimation saved → bloch_sphere_value_iteration.gif")
Greedy policy:
s=0 |0⟩ → H
s=1 |1⟩ → H
s=2 |+⟩ → S
s=3 |-⟩ → Z
s=4 |+i⟩ → H ← target
s=5 |-i⟩ → H
Learned V(s):
s=0 |0⟩ V = 0.9000
s=1 |1⟩ V = 0.8100
s=2 |+⟩ V = 1.0000
s=3 |-⟩ V = 0.9000
s=4 |+i⟩ V = 0.0000
s=5 |-i⟩ V = 1.0000
Animation saved → bloch_sphere_value_iteration.gif
Implementation Notes & Extensions
Memory: Only V[s] (shape
(n_states,)) is stored. Q(s,a) is computed transiently duringvalue_iteration()andselect_action(). This is more memory-efficient thanQValueIterationwhich stores the full Q-table.Action-selection cost:
select_action()performs a full one-step lookahead (O(|S||A|)) every time it is called. For large action spaces, preferQValueIterationwhoseselect_action()is a simple table lookup (O(|A|)).Discount factor ``gamma``: Values close to 1.0 make the agent more far-sighted but can slow convergence. For sparse-reward environments, try
gamma=0.99.Convergence tolerance ``tol``: The default
1e-6is tight; you can loosen it to1e-4for faster (but slightly sub-optimal) convergence in large environments.Empirical model quality: Call
play_n_random_steps(N)with larger N before the firstvalue_iteration()call to improve model accuracy, especially for stochastic environments likeFrozenLake-v1.Extension — model-free bridge: Replace the empirical model with a neural network approximator to transition from Value Iteration to DQN-style algorithms.
Version History
v0: Initial implementation. Tabular model-based Value Iteration over discrete MDPs. PyTorch tensors for P, R, and V. Empirical model built via random exploration. Compatible with any Gymnasium environment with
Discreteobservation and action spaces, including qrl-qai quantum environments such asBlochSphereV1.
References (Suggested Reading)
Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), Chapter 4 — Dynamic Programming (Value Iteration, Policy Iteration).
Bellman, R., Dynamic Programming (1957) — the original formulation.
Gymnasium documentation —
FrozenLake-v1environment.Nielsen, M. A., & Chuang, I. L., Quantum Computation and Quantum Information — for the single-qubit gate definitions used in
BlochSphereV1.