Value Iteration

Description

ValueIteration is a model-based, tabular planning algorithm for discrete Markov Decision Processes (MDPs). It learns an empirical transition model P(s′|s,a) and a mean reward model R(s,a,s′) directly from environment interaction, then applies the Bellman optimality operator to convergence to obtain the optimal state-value function V*(s).

Core Bellman update (applied at every iteration):

\[V[s] \leftarrow \max_{a} \sum_{s'} P(s'|s,a) \cdot \left(R(s,a,s') + \gamma \cdot V[s']\right)\]

Once V* is known, the optimal policy is recovered by a one-step lookahead:

\[a^*(s) = \arg\max_{a} \sum_{s'} P(s'|s,a) \cdot \left(R(s,a,s') + \gamma \cdot V[s']\right)\]

Q(s,a) is never stored — it is computed transiently during planning and action selection. This keeps memory usage at O(|S|) instead of O(|S||A|), making ValueIteration slightly more memory-efficient than QValueIteration for large state spaces.

When to use ``ValueIteration``?

You want a simple, memory-efficient planner for small to medium discrete MDPs.
You need a reference baseline before moving to Q-learning or deep-RL.
The environment is small enough that a table fits in memory and planning is fast.

Inherits from BaseIteration, which handles empirical model construction, random exploration, and episode rollouts.

Constructor

ValueIteration(
    env: gym.Env,
    gamma: float = 0.9,
    num_test_episodes: int = 20,
    device: Optional[torch.device] = None,
    dtype: Optional[torch.dtype] = torch.float32,
)

Parameter	Type	Default	Description
`env`	`gym.Env`	—	Gymnasium / qrl-qai environment with discrete obs & action spaces.
`gamma`	`float`	`0.9`	Discount factor in [0, 1). Higher → more far-sighted.
`num_test_episodes`	`int`	`20`	Informational; used by external training loops for evaluation.
`device`	`torch.device`	auto (CUDA/CPU)	Compute device for all tensors.
`dtype`	`torch.dtype`	`float32`	Floating-point precision for all tensors.

The constructor calls env.reset() internally to initialise the agent’s current state.

Typical Training Loop

agent.play_n_random_steps(N)   ← seed the model
agent.value_iteration()        ← run Bellman updates to convergence
Evaluate agent.play_episode()  ← measure quality
Repeat 1-3 until solved

Each outer iteration adds more environment experience (step 1), then re-solves the updated empirical MDP (step 2). As the empirical model improves, V* converges to the true optimal value function.

Implementation Details

Exploration / model-building

Method / Property	Description
`play_n_random_steps(n)`	Execute n uniformly-random steps to seed the empirical model.
`play_episode(env)`	Run one greedy episode on env, updating the model. Returns total reward.

Planning

Method	Description
`value_iteration(max_iters, tol)`	Apply Bellman updates until ‖V_new − V‖_∞ < `tol` (or `max_iters` steps). Returns iteration count.

Action selection & inspection

Method / Property	Description
`select_action(s)`	Greedy action via one-step lookahead on V. Returns `int`.
`V`	Current state-value tensor, shape `(n_states,)`.
`get_policy()`	Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state.

Internal model (inherited from `BaseIteration`)

Attribute	Shape	Description
`_counts`	`(n_s, n_a, n_s)`	Visitation counts C[s,a,s’].
`_reward_sum`	`(n_s, n_a, n_s)`	Accumulated reward per transition.
`_V`	`(n_states,)`	Current value-function tensor.

Example 1 — FrozenLake-v1 (stochastic environment)

FrozenLake-v1 with is_slippery=True is a classic stochastic benchmark: the agent tries to cross a frozen lake (4×4 grid) from start to goal without falling through holes. Transitions are stochastic — each intended action has only a 1/3 chance of succeeding; the remaining probability is split between the two perpendicular directions.

This is an ideal showcase for ValueIteration because:

The state and action spaces are small (16 states, 4 actions).
Stochasticity means the empirical model must accumulate enough samples before the value function converges to a useful policy.
A success rate of > 80 % is considered solved.

[1]:

import gymnasium as gym
from qrl.algorithms.classical import ValueIteration

# Two separate env instances:
#   env       → used by the agent for exploration and model-building
#   test_env  → used for clean evaluation (no model updates during testing)
env      = gym.make("FrozenLake-v1", is_slippery=True)
test_env = gym.make("FrozenLake-v1", is_slippery=True)

agent = ValueIteration(env=env, gamma=0.9)

print(f"State space  : {agent.n_states}   (one per grid cell)")
print(f"Action space : {agent.n_actions}   (left=0, down=1, right=2, up=3)")
print(f"Device       : {agent.device}")
print(f"V shape      : {agent.V.shape}")

State space  : 16   (one per grid cell)
Action space : 4   (left=0, down=1, right=2, up=3)
Device       : cpu
V shape      : torch.Size([16])

[2]:

TEST_EPISODES = 20

iter_no     = 0
best_reward = 0.0

while True:
    iter_no += 1

    # Step 1 — explore: collect 100 random transitions to update the model
    agent.play_n_random_steps(100)

    # Step 2 — plan: solve the empirical MDP with Bellman updates
    n_bellman = agent.value_iteration()

    # Step 3 — evaluate: run TEST_EPISODES greedy episodes
    reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))
    reward /= TEST_EPISODES

    if reward > best_reward:
        print(f"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}  "
              f"(iter {iter_no}, Bellman sweeps: {n_bellman})")
        best_reward = reward

    if best_reward > 0.80:
        print(f"\nSolved in {iter_no} outer iterations!")
        break

  ▶ Best reward updated 0.000 → 0.250  (iter 16, Bellman sweeps: 77)
  ▶ Best reward updated 0.250 → 0.500  (iter 17, Bellman sweeps: 71)
  ▶ Best reward updated 0.500 → 0.550  (iter 18, Bellman sweeps: 60)
  ▶ Best reward updated 0.550 → 0.600  (iter 23, Bellman sweeps: 43)
  ▶ Best reward updated 0.600 → 0.700  (iter 43, Bellman sweeps: 42)
  ▶ Best reward updated 0.700 → 0.750  (iter 48, Bellman sweeps: 54)
  ▶ Best reward updated 0.750 → 0.800  (iter 77, Bellman sweeps: 43)
  ▶ Best reward updated 0.800 → 0.850  (iter 91, Bellman sweeps: 34)

Solved in 91 outer iterations!

[3]:

import torch

policy = agent.get_policy()   # shape (16,) — one greedy action per state
action_symbols = {0: "←", 1: "↓", 2: "→", 3: "↑"}

print("Greedy policy (raw tensor):")
print(policy)

print("\nGreedy policy grid (4×4):")
grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]
for row in grid:
    print("  ", " ".join(row))

print("\nState-value function V (reshaped to 4×4):")
print(agent.V.reshape(4, 4).round(decimals=3))

Greedy policy (raw tensor):
tensor([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 1, 0, 0, 2, 1, 0])

Greedy policy grid (4×4):
   ← ↑ ← ↑
   ← ← ← ←
   ↑ ↓ ↓ ←
   ← → ↓ ←

State-value function V (reshaped to 4×4):
tensor([[0.0640, 0.0520, 0.0570, 0.0420],
        [0.0860, 0.0000, 0.0880, 0.0000],
        [0.1350, 0.2320, 0.2640, 0.0000],
        [0.0000, 0.3690, 0.6460, 0.0000]])

Example 2 — BlochSphereV1 (quantum discrete MDP)

BlochSphereV1 models a single-qubit Bloch sphere as a deterministic finite graph:

6 states — the six canonical pure states: |0⟩, |1⟩, |+⟩, |-⟩, |+i⟩, |-i⟩.
4 actions — unitary gates H, X, Z, S.
Reward — sparse binary: 1.0 on reaching the target state, 0.0 otherwise.

Because every gate acts deterministically, ValueIteration converges in very few outer iterations (often just one), demonstrating how model-based planning trivially solves small deterministic MDPs.

After training, the environment’s built-in 2D graph renderer produces an animated GIF showing:

Left panel — true state-transition graph with the episode trajectory overlaid.
Right panel — the agent’s learned V(s) values (warm colormap) and greedy policy (bold teal edges).

[4]:

from qrl.algorithms.classical import ValueIteration
from qrl.env import BlochSphereV1

# Target state index 4 → |+i⟩
# set ffmpeg=True if you have ffmpeg installed to save as mp4
env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)

agent = ValueIteration(env=env, gamma=0.9)

print(f"State space  : {agent.n_states}  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)")
print(f"Action space : {agent.n_actions}  (H=0, X=1, Z=2, S=3)")
print(f"Target state : {env.target_state_index}  → |+i⟩")
print()
print("Transition table T[s, a] = s':")
print(BlochSphereV1.transition_table())

State space  : 6  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)
Action space : 4  (H=0, X=1, Z=2, S=3)
Target state : 4  → |+i⟩

Transition table T[s, a] = s':
[[2 1 0 0]
 [3 0 1 1]
 [0 2 3 4]
 [1 3 2 5]
 [5 5 5 3]
 [4 4 4 2]]

[5]:

TEST_EPISODES = 20
iter_no, best_reward = 0, 0.0

while True:
    iter_no += 1

    # Explore: seed the empirical model
    agent.play_n_random_steps(50)

    # Plan: solve with Value Iteration
    agent.value_iteration()

    # Collect a graph snapshot (appended to env.fig_array_list)
    env._render_graph(agent=agent)

    # Evaluate: measure success rate over TEST_EPISODES greedy rollouts
    reward = 0.0
    for _ in range(TEST_EPISODES):
        obs, _ = test_env.reset()
        while True:
            action = agent.select_action(int(obs))
            obs, _, terminated, truncated, _ = test_env.step(action)
            if terminated or truncated:
                reward += float(terminated)   # 1.0 on success, 0.0 on truncation
                break
    reward /= TEST_EPISODES

    print(f"Iteration {iter_no:>3} | Success rate: {reward:.3f}")
    if reward > best_reward:
        print(f"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}")
        best_reward = reward
    if best_reward >= 1.0:          # 100 % success rate → solved
        print(f"\nSolved in {iter_no} iteration(s)!")
        break

Iteration   1 | Success rate: 1.000
  ▶ Best reward updated 0.000 → 1.000

Solved in 1 iteration(s)!

[6]:

state_labels = ["|0⟩", "|1⟩", "|+⟩", "|-⟩", "|+i⟩", "|-i⟩"]
action_names  = ["H", "X", "Z", "S"]

policy = agent.get_policy()
print("Greedy policy:")
for s, (label, a) in enumerate(zip(state_labels, policy)):
    marker = " ← target" if s == env.target_state_index else ""
    print(f"  s={s}  {label:<6}  →  {action_names[int(a)]}{marker}")

print("\nLearned V(s):")
for s, (label, v) in enumerate(zip(state_labels, agent.V)):
    print(f"  s={s}  {label:<6}  V = {v:.4f}")

# Save the learning animation
env.render(
    save_path_without_extension="bloch_sphere_value_iteration",
    interval=600,
    ffmpeg=False,
)
print("\nAnimation saved → bloch_sphere_value_iteration.gif")

Greedy policy:
  s=0  |0⟩     →  H
  s=1  |1⟩     →  H
  s=2  |+⟩     →  S
  s=3  |-⟩     →  Z
  s=4  |+i⟩    →  H ← target
  s=5  |-i⟩    →  H

Learned V(s):
  s=0  |0⟩     V = 0.9000
  s=1  |1⟩     V = 0.8100
  s=2  |+⟩     V = 1.0000
  s=3  |-⟩     V = 0.9000
  s=4  |+i⟩    V = 0.0000
  s=5  |-i⟩    V = 1.0000

Animation saved → bloch_sphere_value_iteration.gif

Implementation Notes & Extensions

Memory: Only V[s] (shape (n_states,)) is stored. Q(s,a) is computed transiently during value_iteration() and select_action(). This is more memory-efficient than QValueIteration which stores the full Q-table.
Action-selection cost: select_action() performs a full one-step lookahead (O(|S||A|)) every time it is called. For large action spaces, prefer QValueIteration whose select_action() is a simple table lookup (O(|A|)).
Discount factor ``gamma``: Values close to 1.0 make the agent more far-sighted but can slow convergence. For sparse-reward environments, try gamma=0.99.
Convergence tolerance ``tol``: The default 1e-6 is tight; you can loosen it to 1e-4 for faster (but slightly sub-optimal) convergence in large environments.
Empirical model quality: Call play_n_random_steps(N) with larger N before the first value_iteration() call to improve model accuracy, especially for stochastic environments like FrozenLake-v1.
Extension — model-free bridge: Replace the empirical model with a neural network approximator to transition from Value Iteration to DQN-style algorithms.

Version History

v0: Initial implementation. Tabular model-based Value Iteration over discrete MDPs. PyTorch tensors for P, R, and V. Empirical model built via random exploration. Compatible with any Gymnasium environment with Discrete observation and action spaces, including qrl-qai quantum environments such as BlochSphereV1.

References (Suggested Reading)

Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), Chapter 4 — Dynamic Programming (Value Iteration, Policy Iteration).
Bellman, R., Dynamic Programming (1957) — the original formulation.
Gymnasium documentation — FrozenLake-v1 environment.
Nielsen, M. A., & Chuang, I. L., Quantum Computation and Quantum Information — for the single-qubit gate definitions used in BlochSphereV1.