QValue Iteration

Description

QValueIteration is a model-based, tabular planning algorithm for discrete Markov Decision Processes (MDPs). Like ValueIteration, it builds an empirical transition model P(s′|s,a) and mean reward model R(s,a,s′) from environment interaction. The key difference is that it directly maintains and updates the state-action value function Q(s,a) rather than the state-value function V(s).

Core Bellman update (applied at every iteration):

\[Q[s,a] \leftarrow \sum_{s'} P(s'|s,a) \cdot \left(R(s,a,s') + \gamma \cdot \max_{a'} Q[s',a']\right)\]

The state-value function is recoverable at any time as a derived quantity:

\[V(s) = \max_{a} Q(s, a)\]

Action selection is a direct table lookup — no recomputation required:

\[a^*(s) = \arg\max_{a} Q(s, a)\]

Why Q instead of V?

select_action() is O(|A|) per call vs. O(|S||A|) for ValueIteration’s one-step lookahead.
Storing Q(s,a) is the natural precursor to Q-learning, deep Q-networks (DQN), and quantum RL agents, making QValueIteration the more forward-compatible algorithmic choice.
The Q-table is directly visualised in BlochSphereV1’s right-panel renderer (value colormap uses max_a Q(s,a), greedy policy uses argmax_a Q(s,a)).

Inherits from BaseIteration, which handles empirical model construction, random exploration, and episode rollouts.

Constructor

QValueIteration(
    env: gym.Env,
    gamma: float = 0.9,
    num_test_episodes: int = 20,
    device: Optional[torch.device] = None,
    dtype: Optional[torch.dtype] = torch.float32,
)

Parameter	Type	Default	Description
`env`	`gym.Env`	—	Gymnasium / qrl-qai environment with discrete obs & action spaces.
`gamma`	`float`	`0.9`	Discount factor in [0, 1). Higher → more far-sighted.
`num_test_episodes`	`int`	`20`	Informational; used by external training loops for evaluation.
`device`	`torch.device`	auto (CUDA/CPU)	Compute device for all tensors.
`dtype`	`torch.dtype`	`float32`	Floating-point precision for all tensors.

The constructor calls env.reset() internally and initialises Q[s,a] = 0 for all (s,a).

Implementation Details

Exploration / model-building

Method / Property	Description
`play_n_random_steps(n)`	Execute n uniformly-random steps to seed the empirical model.
`play_episode(env)`	Run one greedy episode on env, updating the model. Returns total reward.

Planning

Method	Description
`qvalue_iteration(max_iters, tol)`	Apply Q Bellman updates until ‖Q_new − Q‖_∞ < `tol` (or `max_iters` steps). Returns iteration count.

Action selection & inspection

Method / Property	Description
`select_action(s)`	Greedy action: `argmax_a Q(s,a)`. O(
`Q`	Current Q-table tensor, shape `(n_states, n_actions)`.
`V`	Derived state-value tensor `max_a Q(s,a)`, shape `(n_states,)`. Not stored.
`get_policy()`	Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state.

Internal model (inherited from `BaseIteration`)

Attribute	Shape	Description
`_counts`	`(n_s, n_a, n_s)`	Visitation counts C[s,a,s’].
`_reward_sum`	`(n_s, n_a, n_s)`	Accumulated reward per transition.
`_Q`	`(n_states, n_actions)`	Current Q-function tensor.

QValueIteration vs. ValueIteration

Aspect	`ValueIteration`	`QValueIteration`
Stored quantity	V[s] — shape `(\|S\|,)`	Q[s,a] — shape `(\|S\|, \|A\|)`
Memory	O(\|S\|)	O(\|S\|\|A\|)
`select_action()` cost	O(\|S\|\|A\|) — one-step lookahead	O(\|A\|) — direct table lookup
Extension to Q-learning	Requires additional step	Direct (Q-table is already maintained)
Planning call	`value_iteration()`	`qvalue_iteration()`
`BlochSphereV1` render	Detects via `agent._V`	Detects via `agent._Q`

Rule of thumb: prefer QValueIteration when you need fast per-step action selection, when you plan to extend to Q-learning, or when working with BlochSphereV1 (the renderer shows richer information from the Q-table — individual V(s) and argmax_a Q(s,a)).

Example 1 — FrozenLake-v1 (stochastic environment)

FrozenLake-v1 with is_slippery=True is a classic stochastic 4×4 grid benchmark. The agent must navigate from the start cell to the goal without falling into holes; each intended action succeeds with probability 1/3 (the rest is split between two perpendicular directions).

Here QValueIteration builds an empirical model through random exploration, then applies the Q-Bellman operator to convergence. Note how the Q-table allows select_action() to be a fast O(4) lookup rather than the O(64) lookahead required by ValueIteration.select_action().

[1]:

import gymnasium as gym
from qrl.algorithms.classical import QValueIteration

env      = gym.make("FrozenLake-v1", is_slippery=True)
test_env = gym.make("FrozenLake-v1", is_slippery=True)

agent = QValueIteration(env=env, gamma=0.9)

print(f"State space  : {agent.n_states}   (one per grid cell)")
print(f"Action space : {agent.n_actions}   (left=0, down=1, right=2, up=3)")
print(f"Device       : {agent.device}")
print(f"Q shape      : {agent.Q.shape}")
print(f"V shape      : {agent.V.shape}   (derived from Q)")

State space  : 16   (one per grid cell)
Action space : 4   (left=0, down=1, right=2, up=3)
Device       : cpu
Q shape      : torch.Size([16, 4])
V shape      : torch.Size([16])   (derived from Q)

[2]:

TEST_EPISODES = 20

iter_no     = 0
best_reward = 0.0

while True:
    iter_no += 1

    # Step 1 — explore: collect 100 random transitions to update the model
    agent.play_n_random_steps(100)

    # Step 2 — plan: solve the empirical MDP with Q-Bellman updates
    n_bellman = agent.qvalue_iteration()

    # Step 3 — evaluate: run TEST_EPISODES greedy episodes
    reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))
    reward /= TEST_EPISODES

    if reward > best_reward:
        print(f"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}  "
              f"(iter {iter_no}, Q-Bellman sweeps: {n_bellman})")
        best_reward = reward

    if best_reward > 0.80:
        print(f"\nSolved in {iter_no} outer iterations!")
        break

  ▶ Best reward updated 0.000 → 0.400  (iter 3, Q-Bellman sweeps: 59)
  ▶ Best reward updated 0.400 → 0.450  (iter 15, Q-Bellman sweeps: 59)
  ▶ Best reward updated 0.450 → 0.650  (iter 19, Q-Bellman sweeps: 66)
  ▶ Best reward updated 0.650 → 0.800  (iter 23, Q-Bellman sweeps: 60)
  ▶ Best reward updated 0.800 → 0.850  (iter 35, Q-Bellman sweeps: 44)

Solved in 35 outer iterations!

[3]:

policy = agent.get_policy()   # shape (16,)
action_symbols = {0: "←", 1: "↓", 2: "→", 3: "↑"}

print("Greedy policy (raw tensor):")
print(policy)

print("\nGreedy policy grid (4×4):")
grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]
for row in grid:
    print("  ", " ".join(row))

print("\nDerived V(s) = max_a Q(s,a) (reshaped to 4×4):")
print(agent.V.reshape(4, 4).round(decimals=3))

print("\nQ-table shape:", agent.Q.shape)
print("Q-table (states as rows, actions as columns):")
import torch
print(agent.Q.round(decimals=3))

Greedy policy (raw tensor):
tensor([0, 3, 0, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])

Greedy policy grid (4×4):
   ← ↑ ← ↑
   ← ← → ←
   ↑ ↓ ← ←
   ← → ↓ ←

Derived V(s) = max_a Q(s,a) (reshaped to 4×4):
tensor([[0.0670, 0.0560, 0.0660, 0.0480],
        [0.0900, 0.0000, 0.0970, 0.0000],
        [0.1420, 0.2440, 0.2870, 0.0000],
        [0.0000, 0.3730, 0.6290, 0.0000]])

Q-table shape: torch.Size([16, 4])
Q-table (states as rows, actions as columns):
tensor([[0.0670, 0.0650, 0.0640, 0.0570],
        [0.0400, 0.0440, 0.0380, 0.0560],
        [0.0660, 0.0610, 0.0640, 0.0510],
        [0.0300, 0.0390, 0.0300, 0.0480],
        [0.0900, 0.0670, 0.0620, 0.0470],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0800, 0.0460, 0.0970, 0.0180],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0740, 0.1060, 0.1110, 0.1420],
        [0.1550, 0.2440, 0.1960, 0.1560],
        [0.2870, 0.2600, 0.1480, 0.0920],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.2380, 0.2100, 0.3730, 0.3580],
        [0.3270, 0.6290, 0.5660, 0.5410],
        [0.0000, 0.0000, 0.0000, 0.0000]])

Example 2 — BlochSphereV1 (quantum discrete MDP)

BlochSphereV1 is a deterministic finite-graph MDP over the six canonical Bloch sphere states. With only 6 states and 4 actions the Q-table is tiny (6 × 4 = 24 entries), and QValueIteration converges in a single outer iteration.

The BlochSphereV1 renderer has first-class support for Q-tables: it detects agent._Q and visualises:

Node colours (right panel) — warm scale encoding max_a Q(s,a) per state.
Bold teal edges — the greedy policy argmax_a Q(s,a) for each state.
Edge opacity — proportional to empirical visit counts from _counts.
Colorbar — min and max Q-value states are labelled.

This makes the Q-table visualisation richer than the V-table variant.

[4]:

from qrl.algorithms.classical import QValueIteration
from qrl.env import BlochSphereV1

# Target state index 4 → |+i⟩
# set ffmpeg=True if you have ffmpeg installed to save as mp4
env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)
test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)

agent = QValueIteration(env=env, gamma=0.9)

print(f"State space  : {agent.n_states}  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)")
print(f"Action space : {agent.n_actions}  (H=0, X=1, Z=2, S=3)")
print(f"Target state : {env.target_state_index}  → |+i⟩")
print(f"Q shape      : {agent.Q.shape}   (6 states × 4 actions)")
print()
print("Transition table T[s, a] = s':")
print(BlochSphereV1.transition_table())

State space  : 6  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)
Action space : 4  (H=0, X=1, Z=2, S=3)
Target state : 4  → |+i⟩
Q shape      : torch.Size([6, 4])   (6 states × 4 actions)

Transition table T[s, a] = s':
[[2 1 0 0]
 [3 0 1 1]
 [0 2 3 4]
 [1 3 2 5]
 [5 5 5 3]
 [4 4 4 2]]

[5]:

TEST_EPISODES = 20
iter_no, best_reward = 0, 0.0

while True:
    iter_no += 1

    # Explore: seed the empirical model
    agent.play_n_random_steps(50)

    # Plan: solve with Q-Value Iteration
    agent.qvalue_iteration()

    # Collect a graph snapshot for the animation
    env._render_graph(agent=agent)

    # Evaluate: measure success rate
    reward = 0.0
    for _ in range(TEST_EPISODES):
        obs, _ = test_env.reset()
        while True:
            action = agent.select_action(int(obs))
            obs, _, terminated, truncated, _ = test_env.step(action)
            if terminated or truncated:
                reward += float(terminated)   # 1.0 on success
                break
    reward /= TEST_EPISODES

    print(f"Iteration {iter_no:>3} | Success rate: {reward:.3f}")
    if reward > best_reward:
        print(f"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}")
        best_reward = reward
    if best_reward >= 1.0:
        print(f"\nSolved in {iter_no} iteration(s)!")
        break

Iteration   1 | Success rate: 1.000
  ▶ Best reward updated 0.000 → 1.000

Solved in 1 iteration(s)!

[6]:

state_labels = ["|0⟩", "|1⟩", "|+⟩", "|-⟩", "|+i⟩", "|-i⟩"]
action_names  = ["H", "X", "Z", "S"]

policy = agent.get_policy()
print("Greedy policy (argmax_a Q(s,a)):")
for s, (label, a) in enumerate(zip(state_labels, policy)):
    marker = " ← target" if s == env.target_state_index else ""
    print(f"  s={s}  {label:<6}  →  {action_names[int(a)]}{marker}")

print("\nDerived V(s) = max_a Q(s,a):")
for s, (label, v) in enumerate(zip(state_labels, agent.V)):
    print(f"  s={s}  {label:<6}  V = {v:.4f}")

print("\nFull Q-table Q[s, a] (states as rows, actions as columns):")
header = "       " + "   ".join(f"{a:>5}" for a in action_names)
print(header)
for s, label in enumerate(state_labels):
    row = "   ".join(f"{agent.Q[s, a].item():>5.3f}" for a in range(agent.n_actions))
    print(f"  {label:<6} {row}")

# Save the learning animation
env.render(
    save_path_without_extension="bloch_sphere_qvalue_iteration",
    interval=600,
    ffmpeg=False,
)
print("\nAnimation saved → bloch_sphere_qvalue_iteration.gif")

Greedy policy (argmax_a Q(s,a)):
  s=0  |0⟩     →  H
  s=1  |1⟩     →  H
  s=2  |+⟩     →  S
  s=3  |-⟩     →  Z
  s=4  |+i⟩    →  H ← target
  s=5  |-i⟩    →  Z

Derived V(s) = max_a Q(s,a):
  s=0  |0⟩     V = 0.9000
  s=1  |1⟩     V = 0.8100
  s=2  |+⟩     V = 1.0000
  s=3  |-⟩     V = 0.9000
  s=4  |+i⟩    V = 0.0000
  s=5  |-i⟩    V = 1.0000

Full Q-table Q[s, a] (states as rows, actions as columns):
           H       X       Z       S
  |0⟩    0.900   0.729   0.810   0.810
  |1⟩    0.810   0.810   0.729   0.729
  |+⟩    0.810   0.900   0.810   1.000
  |-⟩    0.729   0.810   0.900   0.900
  |+i⟩   0.000   0.000   0.000   0.000
  |-i⟩   0.000   0.000   1.000   0.000

Animation saved → bloch_sphere_qvalue_iteration.gif

Implementation Notes & Extensions

Memory vs. speed tradeoff: QValueIteration uses O(|S||A|) memory to store the full Q-table, but gains O(|S||A|) speedup per select_action() call vs. ValueIteration. For environments with millions of states, use function approximation (DQN) instead.
Discount factor ``gamma``: The Q-Bellman operator converges for any γ ∈ [0, 1). Larger values make the agent more far-sighted but increase the number of Bellman sweeps needed.
Convergence tolerance ``tol``: Default 1e-6; relaxing to 1e-4 reduces planning time with minimal quality loss for most practical MDPs.
``BlochSphereV1`` renderer integration: The renderer auto-detects agent._Q (Q-table) vs. agent._V (value table) and adjusts the visualisation accordingly. To use a custom agent class, ensure it exposes _Q or _V and _counts as attributes.
Extension to Q-learning: Replace qvalue_iteration() with a TD update rule: Q[s,a] += α · (r + γ · max_a' Q[s',a'] − Q[s,a]). This removes the need for the explicit empirical model P and R, transitioning to fully model-free learning.
Extension to quantum RL: The Q-table can be replaced by a variational quantum circuit (VQC) whose output approximates Q(s,a). This is the bridge to quantum Q-learning agents available elsewhere in the qrl library.

Version History

v0: Initial implementation. Tabular model-based Q-Value Iteration over discrete MDPs. PyTorch tensors for P, R, and Q. Empirical model built via random exploration. Compatible with any Gymnasium environment with Discrete observation and action spaces, including qrl-qai quantum environments such as BlochSphereV1. BlochSphereV1 renderer detects _Q attribute for richer visualisation.

References (Suggested Reading)

Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), Chapter 4 (Dynamic Programming) and Chapter 6 (TD / Q-learning).
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature. (DQN — the deep extension of Q-Value Iteration).
Gymnasium documentation — FrozenLake-v1 environment.
Nielsen, M. A., & Chuang, I. L., Quantum Computation and Quantum Information — for the single-qubit gate definitions used in BlochSphereV1.