Value Iteration
Value Iteration for discrete MDPs.
- class qrl.algorithms.classical.value_iteration.ValueIteration(env, gamma=0.9, num_test_episodes=20, device=None, dtype=torch.float32)
Bases: BaseIteration
Value Iteration for tabular, model-based RL over discrete MDPs.
Maintains a state-value function V(s) and applies the Bellman optimality operator until convergence:
V(s) <- max_a Σ_s' P(s'|s,a) · (R(s,a,s') + γ · V(s'))
Action selection is greedy with respect to V via a one-step lookahead. Q(s,a) is never stored; it is computed transiently during planning and action selection.
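As a rough illustration of the update rule above, the following sketch applies one synchronous Bellman optimality sweep to a tiny hand-written MDP. It is not the library's implementation: the transition table `P` (a dict of `(probability, next_state, reward)` triples), the helper name `bellman_backup`, and the use of plain Python lists instead of tensors are all assumptions made to keep the example dependency-free.

```python
# Hypothetical 2-state, 2-action MDP (illustrative only, not part of qrl).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 2.0)]},
}
GAMMA = 0.9


def bellman_backup(V, P, gamma):
    """Return a new value list after one synchronous Bellman optimality sweep."""
    return [
        max(
            # Expected one-step return of action a in state s under the old V.
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in sorted(P)
    ]


V0 = [0.0, 0.0]
V1 = bellman_backup(V0, P, GAMMA)  # [1.0, 2.0]: the best immediate rewards
V2 = bellman_backup(V1, P, GAMMA)  # approximately [2.8, 2.9]
```

Each sweep reads only the previous value list, so the per-action expectations (the transient Q-values) can be discarded as soon as the maximum is taken.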
- Parameters:
env (gym.Env) – Gymnasium or qrl-qai environment with discrete observation and action spaces.
gamma (float) – Discount factor in [0, 1).
num_test_episodes (int) – Number of evaluation episodes. Informational only; it is read by external training loops for evaluation.
device (torch.device, optional) – Defaults to CUDA if available, else CPU.
dtype (torch.dtype, optional) – Defaults to float32.
- property V: torch.Tensor
Current state-value function, shape (n_states,).
- get_policy()
Greedy policy derived from V.
- Returns:
Long tensor of shape (n_states,) where entry s is argmax_a Q(s,a).
- Return type:
torch.Tensor
- select_action(state)
Greedy action w.r.t. V via one-step lookahead.
a* = argmax_a Σ_s' P(s'|s,a) · (R(s,a,s') + γ · V(s'))
- Parameters:
state (int or 0-d Tensor) – Current state index.
- Returns:
Greedy action.
- Return type:
int
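The one-step lookahead behind greedy action selection can be sketched as below. This is a minimal illustration under stated assumptions, not the method's actual code: the transition table `P`, the value list `V`, and the helpers `q_values` and `greedy_action` are hypothetical names, and the tie-breaking rule shown (first maximal action wins) is a choice made here, not something the documentation specifies.

```python
# Hypothetical 3-state MDP; P[s][a] lists (probability, next_state, reward).
P = {
    0: {0: [(0.5, 1, 0.0), (0.5, 2, 0.0)], 1: [(1.0, 0, 1.0)]},
    1: {0: [(1.0, 1, 1.0)], 1: [(1.0, 2, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
V = [0.0, 10.0, 0.0]  # assumed value estimates, chosen by hand
GAMMA = 0.9


def q_values(state, V, P, gamma):
    """Compute Q(state, a) for every action on demand; nothing is cached."""
    state = int(state)  # int() also accepts a 0-d tensor
    return {
        a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
        for a, outcomes in P[state].items()
    }


def greedy_action(state, V, P, gamma):
    """Return argmax_a Q(state, a); ties go to the first action in dict order."""
    q = q_values(state, V, P, gamma)
    return max(q, key=q.get)


q0 = q_values(0, V, P, GAMMA)        # Q(0,0) is about 4.5, Q(0,1) is 1.0
best = greedy_action(0, V, P, GAMMA)  # action 0
```

In state 0 the risky action 0 wins because half of its probability mass leads to the high-value state 1, which outweighs the certain reward of 1.0 from action 1. Applying `greedy_action` to every state index would give a greedy policy in the sense of `get_policy()`.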
- value_iteration(max_iters=None, tol=1e-06)
Run Value Iteration to convergence (or max_iters).
- Parameters:
max_iters (int, optional) – Hard cap on Bellman updates. If None, runs until ||V_new - V||_inf < tol.
tol (float) – Convergence threshold on the sup-norm of the value change.
- Returns:
Number of iterations performed.
- Return type:
int
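The interaction between `max_iters` and `tol` described above can be sketched as follows. This is an assumed reading of the stopping rule (stop when the sup-norm change drops below `tol`, or when the cap is reached, whichever comes first), written with plain Python lists; `run_value_iteration` and the transition table `P` are hypothetical and do not reflect the library's internals.

```python
# Hypothetical 1-state MDP: action 0 loops with reward 1, action 1 with reward 0.
# Its optimal value has the closed form 1 / (1 - gamma).
P = {0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 0, 0.0)]}}
GAMMA = 0.9


def run_value_iteration(P, gamma, max_iters=None, tol=1e-6):
    """Iterate Bellman optimality sweeps; return (V, iterations performed)."""
    n_states = len(P)
    V = [0.0] * n_states
    iters = 0
    while max_iters is None or iters < max_iters:
        V_new = [
            max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in range(n_states)
        ]
        # Sup-norm of the change between consecutive value functions.
        delta = max(abs(new - old) for new, old in zip(V_new, V))
        V = V_new
        iters += 1
        if delta < tol:
            break
    return V, iters


V_star, n_iters = run_value_iteration(P, GAMMA)             # V_star[0] near 10.0
V_capped, n_capped = run_value_iteration(P, GAMMA, max_iters=5)
```

With `gamma = 0.9` the uncapped run approaches the closed-form value 10.0, while the capped run stops after exactly five sweeps at the partial geometric sum of about 4.0951. Because the Bellman optimality operator is a contraction in the sup-norm with factor γ, the change per sweep shrinks geometrically, which is why a tolerance on that change is a usable stopping criterion.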