Q-Value Iteration

Implementation of Q-Value Iteration for discrete MDPs, provided as the QValueIteration class and built on PyTorch.

Author: Jay Shah (@Jayshah25)

License: Apache-2.0

class qrl.algorithms.classical.qvalue_iteration.QValueIteration(env, gamma=0.9, num_test_episodes=20, device=None, dtype=torch.float32)[source]

Bases: BaseIteration

Q-Value Iteration for tabular, model-based RL over discrete MDPs.

Maintains a state-action value function Q(s,a) and applies the Bellman optimality operator until convergence:

Q[s,a] <- Σ_s' P(s'|s,a) · (R(s,a,s') + γ · max_a' Q(s',a'))
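The update above can be sketched as one synchronous sweep over a small hand-built MDP. This is a minimal illustration in plain Python, not the class's PyTorch code; the `P[s][a]` layout of `(prob, next_state, reward)` triples and the `bellman_sweep` helper are assumptions made for the example.

```python
# Hypothetical 3-state chain: action 1 moves right, action 0 moves left (or stays).
# P[s][a] is a list of (prob, next_state, reward) triples. This layout is an
# assumption for illustration, not the library's internal representation.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing, zero reward
}


def bellman_sweep(Q, P, gamma):
    """Return a new table with one Bellman optimality backup applied to every (s, a)."""
    new_Q = [[0.0] * len(row) for row in Q]
    for s, actions in P.items():
        for a, transitions in actions.items():
            # Expectation over successor states of reward plus discounted best value.
            new_Q[s][a] = sum(
                prob * (reward + gamma * max(Q[s_next]))
                for prob, s_next, reward in transitions
            )
    return new_Q


Q0 = [[0.0, 0.0] for _ in range(3)]
Q1 = bellman_sweep(Q0, P, gamma=0.9)  # only the rewarding transition is non-zero
Q2 = bellman_sweep(Q1, P, gamma=0.9)  # the reward has propagated one step back
```

Each sweep reads only the previous table, so the order in which (s, a) pairs are visited does not affect the result.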

Action selection reads directly from Q with no recomputation. V(s) = max_a Q(s,a) is available as a derived property.

Compared to ValueIteration:

- Q(s,a) is stored persistently rather than computed transiently.
- Action selection is O(n_actions) per state rather than O(n_actions × n_states).
- Q(s,a) is the natural precursor to Q-learning and function approximation (e.g. DQN, quantum RL agents), making this the more forward-compatible choice.
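The difference in action-selection cost can be seen in a small sketch. With only V available, choosing an action needs a one-step lookahead through the transition model; with Q stored, it is an argmax over one row. The tables below are hand-computed converged values for a hypothetical 3-state chain, and the `P[s][a]` layout and both helper names are assumptions for illustration only.

```python
GAMMA = 0.9

# Hypothetical model: P[s][a] -> list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}

# Converged values for this chain, worked out by hand for gamma = 0.9.
Q = [[0.81, 0.9], [0.81, 1.0], [0.0, 0.0]]
V = [0.9, 1.0, 0.0]


def greedy_from_v(state):
    """Needs the model: each action is scored by summing over its successor states."""
    def score(a):
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[state][a])
    return max(P[state], key=score)


def greedy_from_q(state):
    """No model needed at decision time: argmax over a single stored row."""
    row = Q[state]
    return max(range(len(row)), key=row.__getitem__)


actions_v = [greedy_from_v(s) for s in range(3)]
actions_q = [greedy_from_q(s) for s in range(3)]
```

Both routes pick the same actions here; the Q route simply avoids touching `P` when acting.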

Parameters:

env (gym.Env) – Gymnasium or qrl-qai environment with discrete observation and action spaces.
gamma (float) – Discount factor in [0, 1).
num_test_episodes (int) – Informational; used by external training loops for evaluation.
device (torch.device, optional) – Defaults to CUDA if available, else CPU.
dtype (torch.dtype, optional) – Defaults to float32.

property Q: torch.Tensor

Current state-action value function, shape (n_states, n_actions).

property V: torch.Tensor

State-value function derived from Q: V(s) = max_a Q(s,a). Shape (n_states,). Not stored; recomputed on access.
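The "recomputed on access" behaviour follows the usual derived-property pattern, sketched here with plain Python lists. The `QTable` class is hypothetical and exists only for this example; with a PyTorch tensor the same reduction could be written as `Q.max(dim=1).values`, though the class may implement it differently.

```python
class QTable:
    """Hypothetical container; only the derived-property pattern matters here."""

    def __init__(self, n_states, n_actions):
        self.Q = [[0.0] * n_actions for _ in range(n_states)]

    @property
    def V(self):
        # Recomputed from Q on every access, so it cannot go stale.
        return [max(row) for row in self.Q]


table = QTable(n_states=3, n_actions=2)
before = table.V      # all zeros
table.Q[1][1] = 1.0   # update one state-action value
after = table.V       # reflects the update without any explicit refresh
```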

get_policy()[source]

Greedy policy: pi[s] = argmax_a Q(s,a).

Returns:

Long tensor of shape (n_states,).

Return type:

torch.Tensor
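As a rough illustration of the greedy policy and how it relates to V, the sketch below uses plain Python lists and a hand-computed table for a hypothetical 3-state, 2-action chain. The `greedy_policy` helper is an assumption for the example; on a PyTorch tensor the corresponding call would be `Q.argmax(dim=1)`, which returns a long tensor.

```python
def greedy_policy(Q):
    """pi[s] = argmax_a Q[s][a]. max() keeps the first maximum, so ties go to the lowest index."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Hand-computed converged table for a hypothetical 3-state, 2-action chain.
Q = [[0.81, 0.9], [0.81, 1.0], [0.0, 0.0]]

pi = greedy_policy(Q)
V = [max(row) for row in Q]

# Sanity check: the greedy action attains V(s) in every state.
consistent = all(Q[s][pi[s]] == V[s] for s in range(len(Q)))
```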

qvalue_iteration(max_iters=None, tol=1e-06)[source]

Run Q-Value Iteration to convergence (or max_iters).

Parameters:

max_iters (int, optional) – Hard cap on the number of Bellman updates. If None, iteration continues until |Q_new - Q|_inf < tol.
tol (float) – Convergence threshold on the sup-norm of the Q-value change.

Returns:

Number of iterations performed.

Return type:

int
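The stopping rule can be sketched as a loop that sweeps until the sup-norm change falls below tol or the cap is hit. This is a plain-Python approximation under stated assumptions, not the method's source: the `P[s][a]` layout of `(prob, next_state, reward)` triples, the standalone `q_value_iteration` function, and the choice to count the final converged sweep as an iteration are all hypothetical.

```python
def q_value_iteration(P, n_states, n_actions, gamma=0.9, max_iters=None, tol=1e-6):
    """Sweep until the sup-norm change drops below tol or max_iters is reached.

    Returns (Q, iterations). Counting the final converged sweep as an
    iteration is an assumption of this sketch.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    iterations = 0
    while max_iters is None or iterations < max_iters:
        # Synchronous Bellman optimality backup over every (s, a) pair.
        new_Q = [
            [
                sum(p * (r + gamma * max(Q[s2])) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # Sup-norm: largest absolute change over all entries.
        delta = max(
            abs(new_Q[s][a] - Q[s][a])
            for s in range(n_states)
            for a in range(n_actions)
        )
        Q = new_Q
        iterations += 1
        if delta < tol:
            break
    return Q, iterations


# Hypothetical 3-state chain; state 2 is absorbing with zero reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}

Q_star, n_iters = q_value_iteration(P, n_states=3, n_actions=2)
Q_capped, n_capped = q_value_iteration(P, n_states=3, n_actions=2, max_iters=2)
```

On this chain the uncapped run stops after four sweeps: three to propagate the reward back to state 0 and one more to observe a zero change. Because gamma is below 1, the update is a contraction and the uncapped loop terminates.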

select_action(state)[source]

Greedy action: argmax_a Q(s, a). Direct table lookup — no recomputation.

Parameters:

state (int or 0-d Tensor) – Current state index.

Returns:

Greedy action.

Return type:

int