QValue Iteration
Description
QValueIteration is a model-based, tabular planning algorithm for discrete Markov Decision Processes (MDPs). Like ValueIteration, it builds an empirical transition model P(s′|s,a) and mean reward model R(s,a,s′) from environment interaction. The key difference is that it directly maintains and updates the state-action value function Q(s,a) rather than the state-value function V(s).
Core Bellman update (applied at every iteration):
The state-value function is recoverable at any time as a derived quantity:
Action selection is a direct table lookup — no recomputation required:
Why Q instead of V?
select_action()is O(|A|) per call vs. O(|S||A|) forValueIteration’s one-step lookahead.Storing Q(s,a) is the natural precursor to Q-learning, deep Q-networks (DQN), and quantum RL agents, making
QValueIterationthe more forward-compatible algorithmic choice.The Q-table is directly visualised in
BlochSphereV1’s right-panel renderer (value colormap uses max_a Q(s,a), greedy policy uses argmax_a Q(s,a)).
Inherits from BaseIteration, which handles empirical model construction, random exploration, and episode rollouts.
Constructor
QValueIteration(
env: gym.Env,
gamma: float = 0.9,
num_test_episodes: int = 20,
device: Optional[torch.device] = None,
dtype: Optional[torch.dtype] = torch.float32,
)
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
— |
Gymnasium / qrl-qai environment with discrete obs & action spaces. |
|
|
|
Discount factor in [0, 1). Higher → more far-sighted. |
|
|
|
Informational; used by external training loops for evaluation. |
|
|
auto (CUDA/CPU) |
Compute device for all tensors. |
|
|
|
Floating-point precision for all tensors. |
The constructor calls env.reset() internally and initialises Q[s,a] = 0 for all (s,a).
Implementation Details
Exploration / model-building
Method / Property |
Description |
|---|---|
|
Execute n uniformly-random steps to seed the empirical model. |
|
Run one greedy episode on env, updating the model. Returns total reward. |
Planning
Method |
Description |
|---|---|
|
Apply Q Bellman updates until ‖Q_new − Q‖_∞ < |
Action selection & inspection
Method / Property |
Description |
|---|---|
|
Greedy action: |
|
Current Q-table tensor, shape |
|
Derived state-value tensor |
|
Greedy policy tensor of shape |
Internal model (inherited from BaseIteration)
Attribute |
Shape |
Description |
|---|---|---|
|
|
Visitation counts C[s,a,s’]. |
|
|
Accumulated reward per transition. |
|
|
Current Q-function tensor. |
QValueIteration vs. ValueIteration
Aspect |
|
|
|---|---|---|
Stored quantity |
V[s] — shape |
Q[s,a] — shape |
Memory |
O(|S|) |
O(|S||A|) |
|
O(|S||A|) — one-step lookahead |
O(|A|) — direct table lookup |
Extension to Q-learning |
Requires additional step |
Direct (Q-table is already maintained) |
Planning call |
|
|
|
Detects via |
Detects via |
Rule of thumb: prefer QValueIteration when you need fast per-step action selection, when you plan to extend to Q-learning, or when working with BlochSphereV1 (the renderer shows richer information from the Q-table — individual V(s) and argmax_a Q(s,a)).
Example 1 — FrozenLake-v1 (stochastic environment)
FrozenLake-v1 with is_slippery=True is a classic stochastic 4×4 grid benchmark. The agent must navigate from the start cell to the goal without falling into holes; each intended action succeeds with probability 1/3 (the rest is split between two perpendicular directions).
Here QValueIteration builds an empirical model through random exploration, then applies the Q-Bellman operator to convergence. Note how the Q-table allows select_action() to be a fast O(4) lookup rather than the O(64) lookahead required by ValueIteration.select_action().
[1]:
import gymnasium as gym
from qrl.algorithms.classical import QValueIteration
env = gym.make("FrozenLake-v1", is_slippery=True)
test_env = gym.make("FrozenLake-v1", is_slippery=True)
agent = QValueIteration(env=env, gamma=0.9)
print(f"State space : {agent.n_states} (one per grid cell)")
print(f"Action space : {agent.n_actions} (left=0, down=1, right=2, up=3)")
print(f"Device : {agent.device}")
print(f"Q shape : {agent.Q.shape}")
print(f"V shape : {agent.V.shape} (derived from Q)")
State space : 16 (one per grid cell)
Action space : 4 (left=0, down=1, right=2, up=3)
Device : cpu
Q shape : torch.Size([16, 4])
V shape : torch.Size([16]) (derived from Q)
[2]:
TEST_EPISODES = 20
iter_no = 0
best_reward = 0.0
while True:
iter_no += 1
# Step 1 — explore: collect 100 random transitions to update the model
agent.play_n_random_steps(100)
# Step 2 — plan: solve the empirical MDP with Q-Bellman updates
n_bellman = agent.qvalue_iteration()
# Step 3 — evaluate: run TEST_EPISODES greedy episodes
reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))
reward /= TEST_EPISODES
if reward > best_reward:
print(f" ▶ Best reward updated {best_reward:.3f} → {reward:.3f} "
f"(iter {iter_no}, Q-Bellman sweeps: {n_bellman})")
best_reward = reward
if best_reward > 0.80:
print(f"\nSolved in {iter_no} outer iterations!")
break
▶ Best reward updated 0.000 → 0.400 (iter 3, Q-Bellman sweeps: 59)
▶ Best reward updated 0.400 → 0.450 (iter 15, Q-Bellman sweeps: 59)
▶ Best reward updated 0.450 → 0.650 (iter 19, Q-Bellman sweeps: 66)
▶ Best reward updated 0.650 → 0.800 (iter 23, Q-Bellman sweeps: 60)
▶ Best reward updated 0.800 → 0.850 (iter 35, Q-Bellman sweeps: 44)
Solved in 35 outer iterations!
[3]:
policy = agent.get_policy() # shape (16,)
action_symbols = {0: "←", 1: "↓", 2: "→", 3: "↑"}
print("Greedy policy (raw tensor):")
print(policy)
print("\nGreedy policy grid (4×4):")
grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]
for row in grid:
print(" ", " ".join(row))
print("\nDerived V(s) = max_a Q(s,a) (reshaped to 4×4):")
print(agent.V.reshape(4, 4).round(decimals=3))
print("\nQ-table shape:", agent.Q.shape)
print("Q-table (states as rows, actions as columns):")
import torch
print(agent.Q.round(decimals=3))
Greedy policy (raw tensor):
tensor([0, 3, 0, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])
Greedy policy grid (4×4):
← ↑ ← ↑
← ← → ←
↑ ↓ ← ←
← → ↓ ←
Derived V(s) = max_a Q(s,a) (reshaped to 4×4):
tensor([[0.0670, 0.0560, 0.0660, 0.0480],
[0.0900, 0.0000, 0.0970, 0.0000],
[0.1420, 0.2440, 0.2870, 0.0000],
[0.0000, 0.3730, 0.6290, 0.0000]])
Q-table shape: torch.Size([16, 4])
Q-table (states as rows, actions as columns):
tensor([[0.0670, 0.0650, 0.0640, 0.0570],
[0.0400, 0.0440, 0.0380, 0.0560],
[0.0660, 0.0610, 0.0640, 0.0510],
[0.0300, 0.0390, 0.0300, 0.0480],
[0.0900, 0.0670, 0.0620, 0.0470],
[0.0000, 0.0000, 0.0000, 0.0000],
[0.0800, 0.0460, 0.0970, 0.0180],
[0.0000, 0.0000, 0.0000, 0.0000],
[0.0740, 0.1060, 0.1110, 0.1420],
[0.1550, 0.2440, 0.1960, 0.1560],
[0.2870, 0.2600, 0.1480, 0.0920],
[0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000],
[0.2380, 0.2100, 0.3730, 0.3580],
[0.3270, 0.6290, 0.5660, 0.5410],
[0.0000, 0.0000, 0.0000, 0.0000]])
Example 2 — BlochSphereV1 (quantum discrete MDP)
BlochSphereV1 is a deterministic finite-graph MDP over the six canonical Bloch sphere states. With only 6 states and 4 actions the Q-table is tiny (6 × 4 = 24 entries), and QValueIteration converges in a single outer iteration.
The BlochSphereV1 renderer has first-class support for Q-tables: it detects agent._Q and visualises:
Node colours (right panel) — warm scale encoding
max_a Q(s,a)per state.Bold teal edges — the greedy policy
argmax_a Q(s,a)for each state.Edge opacity — proportional to empirical visit counts from
_counts.Colorbar — min and max Q-value states are labelled.
This makes the Q-table visualisation richer than the V-table variant.
[4]:
from qrl.algorithms.classical import QValueIteration
from qrl.env import BlochSphereV1
# Target state index 4 → |+i⟩
# set ffmpeg=True if you have ffmpeg installed to save as mp4
env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
agent = QValueIteration(env=env, gamma=0.9)
print(f"State space : {agent.n_states} (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)")
print(f"Action space : {agent.n_actions} (H=0, X=1, Z=2, S=3)")
print(f"Target state : {env.target_state_index} → |+i⟩")
print(f"Q shape : {agent.Q.shape} (6 states × 4 actions)")
print()
print("Transition table T[s, a] = s':")
print(BlochSphereV1.transition_table())
State space : 6 (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)
Action space : 4 (H=0, X=1, Z=2, S=3)
Target state : 4 → |+i⟩
Q shape : torch.Size([6, 4]) (6 states × 4 actions)
Transition table T[s, a] = s':
[[2 1 0 0]
[3 0 1 1]
[0 2 3 4]
[1 3 2 5]
[5 5 5 3]
[4 4 4 2]]
[5]:
TEST_EPISODES = 20
iter_no, best_reward = 0, 0.0
while True:
iter_no += 1
# Explore: seed the empirical model
agent.play_n_random_steps(50)
# Plan: solve with Q-Value Iteration
agent.qvalue_iteration()
# Collect a graph snapshot for the animation
env._render_graph(agent=agent)
# Evaluate: measure success rate
reward = 0.0
for _ in range(TEST_EPISODES):
obs, _ = test_env.reset()
while True:
action = agent.select_action(int(obs))
obs, _, terminated, truncated, _ = test_env.step(action)
if terminated or truncated:
reward += float(terminated) # 1.0 on success
break
reward /= TEST_EPISODES
print(f"Iteration {iter_no:>3} | Success rate: {reward:.3f}")
if reward > best_reward:
print(f" ▶ Best reward updated {best_reward:.3f} → {reward:.3f}")
best_reward = reward
if best_reward >= 1.0:
print(f"\nSolved in {iter_no} iteration(s)!")
break
Iteration 1 | Success rate: 1.000
▶ Best reward updated 0.000 → 1.000
Solved in 1 iteration(s)!
[6]:
state_labels = ["|0⟩", "|1⟩", "|+⟩", "|-⟩", "|+i⟩", "|-i⟩"]
action_names = ["H", "X", "Z", "S"]
policy = agent.get_policy()
print("Greedy policy (argmax_a Q(s,a)):")
for s, (label, a) in enumerate(zip(state_labels, policy)):
marker = " ← target" if s == env.target_state_index else ""
print(f" s={s} {label:<6} → {action_names[int(a)]}{marker}")
print("\nDerived V(s) = max_a Q(s,a):")
for s, (label, v) in enumerate(zip(state_labels, agent.V)):
print(f" s={s} {label:<6} V = {v:.4f}")
print("\nFull Q-table Q[s, a] (states as rows, actions as columns):")
header = " " + " ".join(f"{a:>5}" for a in action_names)
print(header)
for s, label in enumerate(state_labels):
row = " ".join(f"{agent.Q[s, a].item():>5.3f}" for a in range(agent.n_actions))
print(f" {label:<6} {row}")
# Save the learning animation
env.render(
save_path_without_extension="bloch_sphere_qvalue_iteration",
interval=600,
ffmpeg=False,
)
print("\nAnimation saved → bloch_sphere_qvalue_iteration.gif")
Greedy policy (argmax_a Q(s,a)):
s=0 |0⟩ → H
s=1 |1⟩ → H
s=2 |+⟩ → S
s=3 |-⟩ → Z
s=4 |+i⟩ → H ← target
s=5 |-i⟩ → Z
Derived V(s) = max_a Q(s,a):
s=0 |0⟩ V = 0.9000
s=1 |1⟩ V = 0.8100
s=2 |+⟩ V = 1.0000
s=3 |-⟩ V = 0.9000
s=4 |+i⟩ V = 0.0000
s=5 |-i⟩ V = 1.0000
Full Q-table Q[s, a] (states as rows, actions as columns):
H X Z S
|0⟩ 0.900 0.729 0.810 0.810
|1⟩ 0.810 0.810 0.729 0.729
|+⟩ 0.810 0.900 0.810 1.000
|-⟩ 0.729 0.810 0.900 0.900
|+i⟩ 0.000 0.000 0.000 0.000
|-i⟩ 0.000 0.000 1.000 0.000
Animation saved → bloch_sphere_qvalue_iteration.gif
Implementation Notes & Extensions
Memory vs. speed tradeoff:
QValueIterationuses O(|S||A|) memory to store the full Q-table, but gains O(|S||A|) speedup perselect_action()call vs.ValueIteration. For environments with millions of states, use function approximation (DQN) instead.Discount factor ``gamma``: The Q-Bellman operator converges for any
γ ∈ [0, 1). Larger values make the agent more far-sighted but increase the number of Bellman sweeps needed.Convergence tolerance ``tol``: Default
1e-6; relaxing to1e-4reduces planning time with minimal quality loss for most practical MDPs.``BlochSphereV1`` renderer integration: The renderer auto-detects
agent._Q(Q-table) vs.agent._V(value table) and adjusts the visualisation accordingly. To use a custom agent class, ensure it exposes_Qor_Vand_countsas attributes.Extension to Q-learning: Replace
qvalue_iteration()with a TD update rule:Q[s,a] += α · (r + γ · max_a' Q[s',a'] − Q[s,a]). This removes the need for the explicit empirical model P and R, transitioning to fully model-free learning.Extension to quantum RL: The Q-table can be replaced by a variational quantum circuit (VQC) whose output approximates Q(s,a). This is the bridge to quantum Q-learning agents available elsewhere in the
qrllibrary.
Version History
v0: Initial implementation. Tabular model-based Q-Value Iteration over discrete MDPs. PyTorch tensors for P, R, and Q. Empirical model built via random exploration. Compatible with any Gymnasium environment with
Discreteobservation and action spaces, including qrl-qai quantum environments such asBlochSphereV1.BlochSphereV1renderer detects_Qattribute for richer visualisation.
References (Suggested Reading)
Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), Chapter 4 (Dynamic Programming) and Chapter 6 (TD / Q-learning).
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature. (DQN — the deep extension of Q-Value Iteration).
Gymnasium documentation —
FrozenLake-v1environment.Nielsen, M. A., & Chuang, I. L., Quantum Computation and Quantum Information — for the single-qubit gate definitions used in
BlochSphereV1.