{ "cells": [ { "cell_type": "markdown", "id": "vi_01", "metadata": {}, "source": [ "# Value Iteration" ] }, { "cell_type": "markdown", "id": "vi_02", "metadata": {}, "source": [ "## Description\n", "\n", "`ValueIteration` is a **model-based, tabular planning algorithm** for discrete Markov Decision\n", "Processes (MDPs). It learns an empirical transition model P(s′|s,a) and a mean reward model\n", "R(s,a,s′) directly from environment interaction, then applies the Bellman optimality operator\n", "to convergence to obtain the optimal state-value function V*(s).\n", "\n", "**Core Bellman update (applied at every iteration):**\n", "\n", "$$V[s] \\leftarrow \\max_{a} \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot V[s']\\right)$$\n", "\n", "Once V* is known, the optimal policy is recovered by a one-step lookahead:\n", "\n", "$$a^*(s) = \\arg\\max_{a} \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot V[s']\\right)$$\n", "\n", "Q(s,a) is **never stored** — it is computed transiently during planning and action selection.\n", "This keeps memory usage at O(|S|) instead of O(|S||A|), making `ValueIteration` slightly\n", "more memory-efficient than `QValueIteration` for large state spaces.\n", "\n", "**When to use `ValueIteration`?**\n", "\n", "- You want a simple, memory-efficient planner for small to medium discrete MDPs.\n", "- You need a reference baseline before moving to Q-learning or deep-RL.\n", "- The environment is small enough that a table fits in memory and planning is fast.\n", "\n", "**Inherits from** `BaseIteration`, which handles empirical model construction, random\n", "exploration, and episode rollouts.\n", "\n" ] }, { "cell_type": "markdown", "id": "vi_03", "metadata": {}, "source": [ "## Constructor\n", "\n", "```python\n", "ValueIteration(\n", " env: gym.Env,\n", " gamma: float = 0.9,\n", " num_test_episodes: int = 20,\n", " device: Optional[torch.device] = None,\n", " dtype: Optional[torch.dtype] = torch.float32,\n", ")\n", "```\n", "\n", "| Parameter | Type | Default | Description |\n", "| ------------------- | ------------------- | ------------- | ---------------------------------------------------------------------- |\n", "| `env` | `gym.Env` | — | Gymnasium / qrl-qai environment with **discrete** obs & action spaces. |\n", "| `gamma` | `float` | `0.9` | Discount factor in [0, 1). Higher → more far-sighted. |\n", "| `num_test_episodes` | `int` | `20` | Informational; used by external training loops for evaluation. |\n", "| `device` | `torch.device` | auto (CUDA/CPU) | Compute device for all tensors. |\n", "| `dtype` | `torch.dtype` | `float32` | Floating-point precision for all tensors. |\n", "\n", "The constructor calls `env.reset()` internally to initialise the agent's current state.\n", "\n" ] }, { "cell_type": "markdown", "id": "vi_05", "metadata": {}, "source": [ "## Typical Training Loop\n", "\n", "```\n", "1. agent.play_n_random_steps(N) ← seed the model\n", "2. agent.value_iteration() ← run Bellman updates to convergence\n", "3. Evaluate agent.play_episode() ← measure quality\n", "4. Repeat 1-3 until solved\n", "```\n", "\n", "Each outer iteration adds more environment experience (step 1), then re-solves the\n", "updated empirical MDP (step 2). As the empirical model improves, V* converges to the\n", "true optimal value function.\n", "\n" ] }, { "cell_type": "markdown", "id": "99558074", "metadata": {}, "source": [ "## Implementation Details\n", "\n", "### Exploration / model-building\n", "\n", "| Method / Property | Description |\n", "| ----------------------------------- | --------------------------------------------------------------------------- |\n", "| `play_n_random_steps(n)` | Execute *n* uniformly-random steps to seed the empirical model. |\n", "| `play_episode(env)` | Run one greedy episode on *env*, updating the model. Returns total reward. |\n", "\n", "### Planning\n", "\n", "| Method | Description |\n", "| ----------------------------------- | --------------------------------------------------------------------------- |\n", "| `value_iteration(max_iters, tol)` | Apply Bellman updates until ‖V_new − V‖_∞ < `tol` (or `max_iters` steps). Returns iteration count. |\n", "\n", "### Action selection & inspection\n", "\n", "| Method / Property | Description |\n", "| ------------------ | ------------------------------------------------------------------------------------ |\n", "| `select_action(s)` | Greedy action via one-step lookahead on V. Returns `int`. |\n", "| `V` | Current state-value tensor, shape `(n_states,)`. |\n", "| `get_policy()` | Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state. |\n", "\n", "### Internal model (inherited from `BaseIteration`)\n", "\n", "| Attribute | Shape | Description |\n", "| ----------------- | ---------------------------- | ------------------------------------ |\n", "| `_counts` | `(n_s, n_a, n_s)` | Visitation counts C[s,a,s']. |\n", "| `_reward_sum` | `(n_s, n_a, n_s)` | Accumulated reward per transition. |\n", "| `_V` | `(n_states,)` | Current value-function tensor. |\n", "\n" ] }, { "cell_type": "markdown", "id": "vi_06", "metadata": {}, "source": [ "---\n", "## Example 1 — FrozenLake-v1 (stochastic environment)\n", "\n", "`FrozenLake-v1` with `is_slippery=True` is a classic stochastic benchmark:\n", "the agent tries to cross a frozen lake (4×4 grid) from start to goal without\n", "falling through holes. Transitions are stochastic — each intended action has\n", "only a 1/3 chance of succeeding; the remaining probability is split between\n", "the two perpendicular directions.\n", "\n", "This is an ideal showcase for `ValueIteration` because:\n", "- The state and action spaces are small (16 states, 4 actions).\n", "- Stochasticity means the empirical model must accumulate enough samples\n", " before the value function converges to a useful policy.\n", "- A success rate of > 80 % is considered solved.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "vi_07", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State space : 16 (one per grid cell)\n", "Action space : 4 (left=0, down=1, right=2, up=3)\n", "Device : cpu\n", "V shape : torch.Size([16])\n" ] } ], "source": [ "import gymnasium as gym\n", "from qrl.algorithms.classical import ValueIteration\n", "\n", "# Two separate env instances:\n", "# env → used by the agent for exploration and model-building\n", "# test_env → used for clean evaluation (no model updates during testing)\n", "env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n", "test_env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n", "\n", "agent = ValueIteration(env=env, gamma=0.9)\n", "\n", "print(f\"State space : {agent.n_states} (one per grid cell)\")\n", "print(f\"Action space : {agent.n_actions} (left=0, down=1, right=2, up=3)\")\n", "print(f\"Device : {agent.device}\")\n", "print(f\"V shape : {agent.V.shape}\")\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "vi_08", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ▶ Best reward updated 0.000 → 0.250 (iter 16, Bellman sweeps: 77)\n", " ▶ Best reward updated 0.250 → 0.500 (iter 17, Bellman sweeps: 71)\n", " ▶ Best reward updated 0.500 → 0.550 (iter 18, Bellman sweeps: 60)\n", " ▶ Best reward updated 0.550 → 0.600 (iter 23, Bellman sweeps: 43)\n", " ▶ Best reward updated 0.600 → 0.700 (iter 43, Bellman sweeps: 42)\n", " ▶ Best reward updated 0.700 → 0.750 (iter 48, Bellman sweeps: 54)\n", " ▶ Best reward updated 0.750 → 0.800 (iter 77, Bellman sweeps: 43)\n", " ▶ Best reward updated 0.800 → 0.850 (iter 91, Bellman sweeps: 34)\n", "\n", "Solved in 91 outer iterations!\n" ] } ], "source": [ "TEST_EPISODES = 20\n", "\n", "iter_no = 0\n", "best_reward = 0.0\n", "\n", "while True:\n", " iter_no += 1\n", "\n", " # Step 1 — explore: collect 100 random transitions to update the model\n", " agent.play_n_random_steps(100)\n", "\n", " # Step 2 — plan: solve the empirical MDP with Bellman updates\n", " n_bellman = agent.value_iteration()\n", "\n", " # Step 3 — evaluate: run TEST_EPISODES greedy episodes\n", " reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))\n", " reward /= TEST_EPISODES\n", "\n", " if reward > best_reward:\n", " print(f\" ▶ Best reward updated {best_reward:.3f} → {reward:.3f} \"\n", " f\"(iter {iter_no}, Bellman sweeps: {n_bellman})\")\n", " best_reward = reward\n", "\n", " if best_reward > 0.80:\n", " print(f\"\\nSolved in {iter_no} outer iterations!\")\n", " break\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "vi_09", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Greedy policy (raw tensor):\n", "tensor([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 1, 0, 0, 2, 1, 0])\n", "\n", "Greedy policy grid (4×4):\n", " ← ↑ ← ↑\n", " ← ← ← ←\n", " ↑ ↓ ↓ ←\n", " ← → ↓ ←\n", "\n", "State-value function V (reshaped to 4×4):\n", "tensor([[0.0640, 0.0520, 0.0570, 0.0420],\n", " [0.0860, 0.0000, 0.0880, 0.0000],\n", " [0.1350, 0.2320, 0.2640, 0.0000],\n", " [0.0000, 0.3690, 0.6460, 0.0000]])\n" ] } ], "source": [ "import torch\n", "\n", "policy = agent.get_policy() # shape (16,) — one greedy action per state\n", "action_symbols = {0: \"←\", 1: \"↓\", 2: \"→\", 3: \"↑\"}\n", "\n", "print(\"Greedy policy (raw tensor):\")\n", "print(policy)\n", "\n", "print(\"\\nGreedy policy grid (4×4):\")\n", "grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]\n", "for row in grid:\n", " print(\" \", \" \".join(row))\n", "\n", "print(\"\\nState-value function V (reshaped to 4×4):\")\n", "print(agent.V.reshape(4, 4).round(decimals=3))\n" ] }, { "cell_type": "markdown", "id": "vi_10", "metadata": {}, "source": [ "---\n", "## Example 2 — BlochSphereV1 (quantum discrete MDP)\n", "\n", "`BlochSphereV1` models a single-qubit Bloch sphere as a **deterministic** finite graph:\n", "- **6 states** — the six canonical pure states: `|0⟩`, `|1⟩`, `|+⟩`, `|-⟩`, `|+i⟩`, `|-i⟩`.\n", "- **4 actions** — unitary gates H, X, Z, S.\n", "- **Reward** — sparse binary: `1.0` on reaching the target state, `0.0` otherwise.\n", "\n", "Because every gate acts deterministically, `ValueIteration` converges in very few outer\n", "iterations (often just one), demonstrating how model-based planning trivially solves\n", "small deterministic MDPs.\n", "\n", "After training, the environment's built-in 2D graph renderer produces an animated GIF\n", "showing:\n", "- **Left panel** — true state-transition graph with the episode trajectory overlaid.\n", "- **Right panel** — the agent's learned V(s) values (warm colormap) and greedy policy\n", " (bold teal edges).\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "vi_11", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State space : 6 (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\n", "Action space : 4 (H=0, X=1, Z=2, S=3)\n", "Target state : 4 → |+i⟩\n", "\n", "Transition table T[s, a] = s':\n", "[[2 1 0 0]\n", " [3 0 1 1]\n", " [0 2 3 4]\n", " [1 3 2 5]\n", " [5 5 5 3]\n", " [4 4 4 2]]\n" ] } ], "source": [ "from qrl.algorithms.classical import ValueIteration\n", "from qrl.env import BlochSphereV1\n", "\n", "# Target state index 4 → |+i⟩\n", "# set ffmpeg=True if you have ffmpeg installed to save as mp4\n", "env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n", "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n", "\n", "agent = ValueIteration(env=env, gamma=0.9)\n", "\n", "print(f\"State space : {agent.n_states} (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\")\n", "print(f\"Action space : {agent.n_actions} (H=0, X=1, Z=2, S=3)\")\n", "print(f\"Target state : {env.target_state_index} → |+i⟩\")\n", "print()\n", "print(\"Transition table T[s, a] = s':\")\n", "print(BlochSphereV1.transition_table())\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "vi_12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 1 | Success rate: 1.000\n", " ▶ Best reward updated 0.000 → 1.000\n", "\n", "Solved in 1 iteration(s)!\n" ] } ], "source": [ "TEST_EPISODES = 20\n", "iter_no, best_reward = 0, 0.0\n", "\n", "while True:\n", " iter_no += 1\n", "\n", " # Explore: seed the empirical model\n", " agent.play_n_random_steps(50)\n", "\n", " # Plan: solve with Value Iteration\n", " agent.value_iteration()\n", "\n", " # Collect a graph snapshot (appended to env.fig_array_list)\n", " env._render_graph(agent=agent)\n", "\n", " # Evaluate: measure success rate over TEST_EPISODES greedy rollouts\n", " reward = 0.0\n", " for _ in range(TEST_EPISODES):\n", " obs, _ = test_env.reset()\n", " while True:\n", " action = agent.select_action(int(obs))\n", " obs, _, terminated, truncated, _ = test_env.step(action)\n", " if terminated or truncated:\n", " reward += float(terminated) # 1.0 on success, 0.0 on truncation\n", " break\n", " reward /= TEST_EPISODES\n", "\n", " print(f\"Iteration {iter_no:>3} | Success rate: {reward:.3f}\")\n", " if reward > best_reward:\n", " print(f\" ▶ Best reward updated {best_reward:.3f} → {reward:.3f}\")\n", " best_reward = reward\n", " if best_reward >= 1.0: # 100 % success rate → solved\n", " print(f\"\\nSolved in {iter_no} iteration(s)!\")\n", " break\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "vi_13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Greedy policy:\n", " s=0 |0⟩ → H\n", " s=1 |1⟩ → H\n", " s=2 |+⟩ → S\n", " s=3 |-⟩ → Z\n", " s=4 |+i⟩ → H ← target\n", " s=5 |-i⟩ → H\n", "\n", "Learned V(s):\n", " s=0 |0⟩ V = 0.9000\n", " s=1 |1⟩ V = 0.8100\n", " s=2 |+⟩ V = 1.0000\n", " s=3 |-⟩ V = 0.9000\n", " s=4 |+i⟩ V = 0.0000\n", " s=5 |-i⟩ V = 1.0000\n", "\n", "Animation saved → bloch_sphere_value_iteration.gif\n" ] } ], "source": [ "state_labels = [\"|0⟩\", \"|1⟩\", \"|+⟩\", \"|-⟩\", \"|+i⟩\", \"|-i⟩\"]\n", "action_names = [\"H\", \"X\", \"Z\", \"S\"]\n", "\n", "policy = agent.get_policy()\n", "print(\"Greedy policy:\")\n", "for s, (label, a) in enumerate(zip(state_labels, policy)):\n", " marker = \" ← target\" if s == env.target_state_index else \"\"\n", " print(f\" s={s} {label:<6} → {action_names[int(a)]}{marker}\")\n", "\n", "print(\"\\nLearned V(s):\")\n", "for s, (label, v) in enumerate(zip(state_labels, agent.V)):\n", " print(f\" s={s} {label:<6} V = {v:.4f}\")\n", "\n", "# Save the learning animation\n", "env.render(\n", " save_path_without_extension=\"bloch_sphere_value_iteration\",\n", " interval=600,\n", " ffmpeg=False,\n", ")\n", "print(\"\\nAnimation saved → bloch_sphere_value_iteration.gif\")\n" ] }, { "cell_type": "markdown", "id": "vi_14", "metadata": {}, "source": [ "## Implementation Notes & Extensions\n", "\n", "* **Memory**: Only V[s] (shape `(n_states,)`) is stored. Q(s,a) is computed transiently\n", " during `value_iteration()` and `select_action()`. This is more memory-efficient than\n", " `QValueIteration` which stores the full Q-table.\n", "\n", "* **Action-selection cost**: `select_action()` performs a full one-step lookahead (O(|S||A|))\n", " every time it is called. For large action spaces, prefer `QValueIteration` whose\n", " `select_action()` is a simple table lookup (O(|A|)).\n", "\n", "* **Discount factor `gamma`**: Values close to 1.0 make the agent more far-sighted but\n", " can slow convergence. For sparse-reward environments, try `gamma=0.99`.\n", "\n", "* **Convergence tolerance `tol`**: The default `1e-6` is tight; you can loosen it to\n", " `1e-4` for faster (but slightly sub-optimal) convergence in large environments.\n", "\n", "* **Empirical model quality**: Call `play_n_random_steps(N)` with larger *N* before\n", " the first `value_iteration()` call to improve model accuracy, especially for\n", " stochastic environments like `FrozenLake-v1`.\n", "\n", "* **Extension — model-free bridge**: Replace the empirical model with a neural network\n", " approximator to transition from Value Iteration to DQN-style algorithms.\n", "\n" ] }, { "cell_type": "markdown", "id": "vi_15", "metadata": {}, "source": [ "## Version History\n", "\n", "* **v0**: Initial implementation. Tabular model-based Value Iteration over discrete MDPs.\n", " PyTorch tensors for P, R, and V. Empirical model built via random exploration.\n", " Compatible with any Gymnasium environment with `Discrete` observation and action spaces,\n", " including qrl-qai quantum environments such as `BlochSphereV1`.\n" ] }, { "cell_type": "markdown", "id": "vi_16", "metadata": {}, "source": [ "## References (Suggested Reading)\n", "\n", "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.),\n", " Chapter 4 — Dynamic Programming (Value Iteration, Policy Iteration).\n", "* Bellman, R., *Dynamic Programming* (1957) — the original formulation.\n", "* Gymnasium documentation — `FrozenLake-v1` environment.\n", "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information* —\n", " for the single-qubit gate definitions used in `BlochSphereV1`.\n" ] } ], "metadata": { "kernelspec": { "display_name": "qrl_env (3.12.3)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }