Q-Value Iteration
Implementation of Q-Value Iteration for discrete MDPs as the QValueIteration class, using PyTorch.
Author: Jay Shah (@Jayshah25)
Contact: jay.shah@qrlqai.com
License: Apache-2.0
- class qrl.algorithms.classical.qvalue_iteration.QValueIteration(env, gamma=0.9, num_test_episodes=20, device=None, dtype=torch.float32)
Bases: BaseIteration
Q-Value Iteration for tabular, model-based RL over discrete MDPs.
Maintains a state-action value function Q(s,a) and applies the Bellman optimality operator until convergence:
Q[s,a] <- Σ_s' P(s'|s,a) · (R(s,a,s') + γ · max_a' Q(s',a'))
Action selection reads directly from Q with no recomputation. V(s) = max_a Q(s,a) is available as a derived property.
Compared to ValueIteration:
- Q(s,a) is stored persistently rather than computed transiently.
- Action selection is O(n_actions) per state rather than O(n_actions × n_states).
- Q(s,a) is the natural precursor to Q-learning and function approximation (e.g. DQN, quantum RL agents), making this the more forward-compatible choice.
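The update rule above can be sketched as a single tensor operation. This is a minimal illustration and not the library's internal code: the function name `bellman_backup` and the dense `(n_states, n_actions, n_states)` layout assumed for `P` and `R` are hypothetical choices made for this example.

```python
import torch

def bellman_backup(Q, P, R, gamma):
    """One synchronous Bellman optimality backup on a tabular Q.

    Assumed (hypothetical) layouts:
      Q: (n_states, n_actions)
      P: (n_states, n_actions, n_states), P[s, a, s2] = P(s2 | s, a)
      R: (n_states, n_actions, n_states), R[s, a, s2] = R(s, a, s2)
    """
    v_next = Q.max(dim=1).values     # max_a' Q(s', a'), shape (n_states,)
    target = R + gamma * v_next      # v_next broadcasts along the trailing s' axis
    return (P * target).sum(dim=2)   # expectation over s'

# Toy MDP: in state 0, action 0 stays put (reward 0) and action 1 moves
# to state 1 (reward 1). State 1 is absorbing with reward 0.
P = torch.zeros(2, 2, 2)
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 1] = 1.0
R = torch.zeros(2, 2, 2)
R[0, 1, 1] = 1.0

Q0 = torch.zeros(2, 2)
Q1 = bellman_backup(Q0, P, R, gamma=0.9)
Q2 = bellman_backup(Q1, P, R, gamma=0.9)
print(Q1)
print(Q2)
```

After the first backup only the immediately rewarding pair (state 0, action 1) is non-zero. The second backup propagates that value, discounted by gamma, to the "stay" action in state 0.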
- Parameters:
env (gym.Env) – Gymnasium or qrl-qai environment with discrete observation and action spaces.
gamma (float) – Discount factor in [0, 1).
num_test_episodes (int) – Informational; used by external training loops for evaluation.
device (torch.device, optional) – Defaults to CUDA if available, else CPU.
dtype (torch.dtype, optional) – Defaults to float32.
- property Q: torch.Tensor
Current state-action value function, shape (n_states, n_actions).
- property V: torch.Tensor
State-value function derived from Q: V(s) = max_a Q(s,a). Shape (n_states,). Not stored; recomputed on access.
- get_policy()
Greedy policy: pi[s] = argmax_a Q(s,a).
- Returns:
Long tensor of shape (n_states,).
- Return type:
torch.Tensor
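As a rough illustration of what the greedy policy and the derived V amount to, both can be read off a Q tensor with a single reduction over the action axis. The Q values below are made up for the example, and the snippet uses plain PyTorch rather than the class itself.

```python
import torch

# Hypothetical Q table for 3 states and 2 actions (illustrative values only).
Q = torch.tensor([[0.9, 1.0],
                  [0.2, 0.1],
                  [0.0, 0.5]])

policy = Q.argmax(dim=1)    # pi[s] = argmax_a Q(s, a), dtype torch.int64
V = Q.max(dim=1).values     # V(s) = max_a Q(s, a)

print(policy)
print(V)
```

Both reductions cost O(n_actions) per state, which is the action-selection cost quoted for Q-Value Iteration.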
- qvalue_iteration(max_iters=None, tol=1e-06)
Run Q-Value Iteration to convergence (or max_iters).
- Parameters:
max_iters (int, optional) – Hard cap on the number of Bellman updates. If None, runs until ||Q_new - Q||_inf < tol.
tol (float) – Convergence threshold on the sup-norm of the Q-value change.
- Returns:
Number of iterations performed.
- Return type:
int
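The stopping rule described by max_iters and tol could look roughly like the loop below. This is a standalone sketch under stated assumptions, not the method's actual source: `run_q_iteration` and the dense `P` / `R` tensors of shape `(n_states, n_actions, n_states)` are hypothetical, and whether the library counts the final converged sweep in its return value is not confirmed by the documentation.

```python
import torch

def run_q_iteration(P, R, gamma=0.9, max_iters=None, tol=1e-6):
    """Iterate Bellman optimality backups until the sup-norm change is below tol.

    P, R: hypothetical dense tensors of shape (n_states, n_actions, n_states).
    Returns the final Q and the number of sweeps performed.
    """
    n_states, n_actions, _ = P.shape
    Q = torch.zeros(n_states, n_actions, dtype=P.dtype)
    iters = 0
    while max_iters is None or iters < max_iters:
        v_next = Q.max(dim=1).values                    # max_a' Q(s', a')
        Q_new = (P * (R + gamma * v_next)).sum(dim=2)   # expectation over s'
        delta = (Q_new - Q).abs().max().item()          # sup-norm of the change
        Q = Q_new
        iters += 1
        if delta < tol:
            break
    return Q, iters

# Toy MDP: state 0 can stay (reward 0) or move to absorbing state 1 (reward 1).
P = torch.zeros(2, 2, 2)
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 1] = 1.0
R = torch.zeros(2, 2, 2)
R[0, 1, 1] = 1.0

Q_star, n_iters = run_q_iteration(P, R, gamma=0.9)
Q_capped, n_capped = run_q_iteration(P, R, gamma=0.9, max_iters=1)
print(Q_star, n_iters)
```

Because gamma is below 1 the backup is a contraction in the sup-norm, so the uncapped loop terminates for any positive tol. Passing max_iters bounds the work when only a rough estimate is needed.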