Base Iteration

Shared model-based agent base for tabular RL (discrete MDPs).

class qrl.algorithms._base.BaseIteration(env, gamma=0.9, num_test_episodes=20, device=None, dtype=torch.float32)

Bases: object

Shared base class for tabular model-based RL agents (Value Iteration, QValueIteration).

Maintains empirical estimates of the transition probability P(s'|s,a) and mean reward R(s,a,s') from environment interaction. Subclasses implement the specific Bellman update and action-selection strategy.
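One common way to keep such estimates is to count observed transitions and accumulate rewards, then normalize on lookup. The sketch below illustrates that idea in plain Python. The class name TabularModel and its methods are hypothetical and are not taken from qrl; the actual class may store these quantities differently (for example, as torch tensors).

```python
from collections import defaultdict


class TabularModel:
    """Hypothetical count-based model; not the library's actual implementation."""

    def __init__(self):
        # counts[(s, a)][s_next] = number of times the transition was observed
        self.counts = defaultdict(lambda: defaultdict(int))
        # reward_sums[(s, a, s_next)] = sum of rewards seen on that transition
        self.reward_sums = defaultdict(float)

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s_next)."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a, s_next)] += r

    def transition_prob(self, s, a, s_next):
        """Empirical P(s'|s,a); 0.0 if (s, a) was never visited."""
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return 0.0
        return self.counts[(s, a)].get(s_next, 0) / total

    def mean_reward(self, s, a, s_next):
        """Empirical mean R(s,a,s'); 0.0 if the transition was never seen."""
        n = self.counts[(s, a)].get(s_next, 0)
        return self.reward_sums[(s, a, s_next)] / n if n else 0.0


model = TabularModel()
model.update(0, 1, 1.0, 2)
model.update(0, 1, 0.0, 2)
model.update(0, 1, 2.0, 2)
model.update(0, 1, 0.0, 3)

print(model.transition_prob(0, 1, 2))  # 0.75
print(model.mean_reward(0, 1, 2))      # 1.0
```

Returning 0.0 for unvisited pairs is a design choice of this sketch; a planner built on top of it would treat unseen state-action pairs as contributing nothing to the backup.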

Parameters:

env (gym.Env) – A Gymnasium or qrl-qai environment with discrete observation and action spaces.
gamma (float) – Discount factor in [0, 1).
num_test_episodes (int) – Number of episodes used for evaluation (informational; used by training loops).
device (torch.device, optional) – Compute device. Defaults to CUDA if available, else CPU.
dtype (torch.dtype, optional) – Floating-point dtype for all tensors. Defaults to float32.

play_episode(env)

Run one full episode with the current policy, updating the model on the fly.

Parameters: env (gym.Env) – A separate environment instance to avoid interfering with self.env.
Returns: Total undiscounted reward accumulated over the episode.
Return type: float
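The behaviour described for play_episode can be sketched as a loop that acts, records each transition in the model, and sums the raw rewards. This is an illustration only: ChainEnv is a toy stand-in that mimics the Gymnasium reset/step return shapes, the policy is passed in as a plain function, and the max_steps cap is an assumption of this sketch rather than a documented parameter.

```python
from collections import defaultdict


class ChainEnv:
    """Toy 4-state chain with a Gymnasium-style API (hypothetical stand-in)."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        # Action 1 moves right, action 0 moves left (clamped at state 0).
        if action == 1:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
        terminated = self.state == 3
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}


def play_episode(env, select_action, counts, reward_sums, max_steps=100):
    """Run one episode, update the model tables, return the undiscounted return."""
    total_reward = 0.0
    state, _ = env.reset()
    for _ in range(max_steps):  # max_steps is an assumed safety cap
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Model update happens inside the episode loop.
        counts[(state, action)][next_state] += 1
        reward_sums[(state, action, next_state)] += reward
        total_reward += reward
        if terminated or truncated:
            break
        state = next_state
    return total_reward


counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
episode_return = play_episode(ChainEnv(), lambda s: 1, counts, reward_sums)
print(episode_return)  # 1.0
```

Passing a separate environment instance, as the parameter description recommends, keeps evaluation episodes from resetting or advancing the environment used for data collection.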

play_n_random_steps(n)

Collect n random environment steps to seed the transition/reward model. Should be called before the first planning update.

Return type: None
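Seeding the model with random steps might look like the following sketch. It is written as a free function over explicit tables rather than as a method, and ChainEnv (including its sample_action helper) is a hypothetical stand-in for an environment with discrete states and actions. Resetting the environment whenever an episode ends is an assumption of this sketch.

```python
import random
from collections import defaultdict


class ChainEnv:
    """Toy 4-state chain with a Gymnasium-style API (hypothetical stand-in)."""

    n_actions = 2

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        # Action 1 moves right, action 0 moves left (clamped at state 0).
        if action == 1:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
        terminated = self.state == 3
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}

    def sample_action(self):
        return self.rng.randrange(self.n_actions)


def play_n_random_steps(env, n, counts, reward_sums):
    """Take n uniformly random actions and record every transition."""
    state, _ = env.reset()
    for _ in range(n):
        action = env.sample_action()
        next_state, reward, terminated, truncated, _ = env.step(action)
        counts[(state, action)][next_state] += 1
        reward_sums[(state, action, next_state)] += reward
        # Start a fresh episode when the current one ends (assumed behaviour).
        if terminated or truncated:
            state, _ = env.reset()
        else:
            state = next_state


counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
play_n_random_steps(ChainEnv(seed=0), 200, counts, reward_sums)
total_steps = sum(sum(d.values()) for d in counts.values())
print(total_steps)  # 200
```

Without this warm-up, the first planning update would run against empty tables, which is presumably why the method should be called before it.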

select_action(state)

Return type: int