{ "cells": [ { "cell_type": "markdown", "id": "qvi_01", "metadata": {}, "source": [ "# QValue Iteration" ] }, { "cell_type": "markdown", "id": "qvi_02", "metadata": {}, "source": [ "## Description\n", "\n", "`QValueIteration` is a **model-based, tabular planning algorithm** for discrete Markov Decision\n", "Processes (MDPs). Like `ValueIteration`, it builds an empirical transition model P(s′|s,a) and\n", "mean reward model R(s,a,s′) from environment interaction. The key difference is that it directly\n", "maintains and updates the **state-action value function Q(s,a)** rather than the state-value\n", "function V(s).\n", "\n", "**Core Bellman update (applied at every iteration):**\n", "\n", "$$Q[s,a] \\leftarrow \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot \\max_{a'} Q[s',a']\\right)$$\n", "\n", "\n", "The state-value function is recoverable at any time as a derived quantity:\n", "\n", "$$V(s) = \\max_{a} Q(s, a)$$\n", "\n", "\n", "Action selection is a **direct table lookup** — no recomputation required:\n", "\n", "$$a^*(s) = \\arg\\max_{a} Q(s, a)$$\n", "\n", "**Why Q instead of V?**\n", "\n", "- `select_action()` is O(|A|) per call vs. O(|S||A|) for `ValueIteration`'s one-step lookahead.\n", "- Storing Q(s,a) is the natural precursor to Q-learning, deep Q-networks (DQN), and quantum RL\n", " agents, making `QValueIteration` the more **forward-compatible** algorithmic choice.\n", "- The Q-table is directly visualised in `BlochSphereV1`'s right-panel renderer (value colormap\n", " uses max_a Q(s,a), greedy policy uses argmax_a Q(s,a)).\n", "\n", "**Inherits from** `BaseIteration`, which handles empirical model construction, random\n", "exploration, and episode rollouts.\n", "\n" ] }, { "cell_type": "markdown", "id": "qvi_03", "metadata": {}, "source": [ "## Constructor\n", "\n", "```python\n", "QValueIteration(\n", " env: gym.Env,\n", " gamma: float = 0.9,\n", " num_test_episodes: int = 20,\n", " device: Optional[torch.device] = None,\n", " dtype: Optional[torch.dtype] = torch.float32,\n", ")\n", "```\n", "\n", "| Parameter | Type | Default | Description |\n", "| ------------------- | ------------------- | --------------- | ---------------------------------------------------------------------- |\n", "| `env` | `gym.Env` | — | Gymnasium / qrl-qai environment with **discrete** obs & action spaces. |\n", "| `gamma` | `float` | `0.9` | Discount factor in [0, 1). Higher → more far-sighted. |\n", "| `num_test_episodes` | `int` | `20` | Informational; used by external training loops for evaluation. |\n", "| `device` | `torch.device` | auto (CUDA/CPU) | Compute device for all tensors. |\n", "| `dtype` | `torch.dtype` | `float32` | Floating-point precision for all tensors. |\n", "\n", "The constructor calls `env.reset()` internally and initialises Q[s,a] = 0 for all (s,a).\n", "\n" ] }, { "cell_type": "markdown", "id": "qvi_04", "metadata": {}, "source": [ "## Implementation Details\n", "\n", "### Exploration / model-building\n", "\n", "| Method / Property | Description |\n", "| ------------------------------------- | --------------------------------------------------------------------------- |\n", "| `play_n_random_steps(n)` | Execute *n* uniformly-random steps to seed the empirical model. |\n", "| `play_episode(env)` | Run one greedy episode on *env*, updating the model. Returns total reward. |\n", "\n", "### Planning\n", "\n", "| Method | Description |\n", "| ------------------------------------- | --------------------------------------------------------------------------- |\n", "| `qvalue_iteration(max_iters, tol)` | Apply Q Bellman updates until ‖Q_new − Q‖_∞ < `tol` (or `max_iters` steps). Returns iteration count. |\n", "\n", "### Action selection & inspection\n", "\n", "| Method / Property | Description |\n", "| ------------------- | ------------------------------------------------------------------------------------ |\n", "| `select_action(s)` | Greedy action: `argmax_a Q(s,a)`. O(|A|) direct table lookup. Returns `int`. |\n", "| `Q` | Current Q-table tensor, shape `(n_states, n_actions)`. |\n", "| `V` | Derived state-value tensor `max_a Q(s,a)`, shape `(n_states,)`. Not stored. |\n", "| `get_policy()` | Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state. |\n", "\n", "### Internal model (inherited from `BaseIteration`)\n", "\n", "| Attribute | Shape | Description |\n", "| ----------------- | ---------------------------- | ------------------------------------ |\n", "| `_counts` | `(n_s, n_a, n_s)` | Visitation counts C[s,a,s']. |\n", "| `_reward_sum` | `(n_s, n_a, n_s)` | Accumulated reward per transition. |\n", "| `_Q` | `(n_states, n_actions)` | Current Q-function tensor. |\n", "\n" ] }, { "cell_type": "markdown", "id": "qvi_05", "metadata": {}, "source": [ "## QValueIteration vs. ValueIteration\n", "\n", "| Aspect | `ValueIteration` | `QValueIteration` |\n", "| ------------------------- | --------------------------------------- | ------------------------------------------ |\n", "| Stored quantity | V[s] — shape `(|S|,)` | Q[s,a] — shape `(|S|, |A|)` |\n", "| Memory | O(\\|S\\|) | O(\\|S\\|\\|A\\|) |\n", "| `select_action()` cost | O(\\|S\\|\\|A\\|) — one-step lookahead | O(\\|A\\|) — direct table lookup |\n", "| Extension to Q-learning | Requires additional step | Direct (Q-table is already maintained) |\n", "| Planning call | `value_iteration()` | `qvalue_iteration()` |\n", "| `BlochSphereV1` render | Detects via `agent._V` | Detects via `agent._Q` |\n", "\n", "**Rule of thumb:** prefer `QValueIteration` when you need fast per-step action selection,\n", "when you plan to extend to Q-learning, or when working with `BlochSphereV1` (the renderer\n", "shows richer information from the Q-table — individual V(s) and argmax_a Q(s,a)).\n", "\n" ] }, { "cell_type": "markdown", "id": "qvi_06", "metadata": {}, "source": [ "---\n", "## Example 1 — FrozenLake-v1 (stochastic environment)\n", "\n", "`FrozenLake-v1` with `is_slippery=True` is a classic stochastic 4×4 grid benchmark.\n", "The agent must navigate from the start cell to the goal without falling into holes;\n", "each intended action succeeds with probability 1/3 (the rest is split between\n", "two perpendicular directions).\n", "\n", "Here `QValueIteration` builds an empirical model through random exploration, then\n", "applies the Q-Bellman operator to convergence. Note how the Q-table allows\n", "`select_action()` to be a fast O(4) lookup rather than the O(64) lookahead\n", "required by `ValueIteration.select_action()`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "qvi_07", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State space : 16 (one per grid cell)\n", "Action space : 4 (left=0, down=1, right=2, up=3)\n", "Device : cpu\n", "Q shape : torch.Size([16, 4])\n", "V shape : torch.Size([16]) (derived from Q)\n" ] } ], "source": [ "import gymnasium as gym\n", "from qrl.algorithms.classical import QValueIteration\n", "\n", "env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n", "test_env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n", "\n", "agent = QValueIteration(env=env, gamma=0.9)\n", "\n", "print(f\"State space : {agent.n_states} (one per grid cell)\")\n", "print(f\"Action space : {agent.n_actions} (left=0, down=1, right=2, up=3)\")\n", "print(f\"Device : {agent.device}\")\n", "print(f\"Q shape : {agent.Q.shape}\")\n", "print(f\"V shape : {agent.V.shape} (derived from Q)\")\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "qvi_08", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ▶ Best reward updated 0.000 → 0.400 (iter 3, Q-Bellman sweeps: 59)\n", " ▶ Best reward updated 0.400 → 0.450 (iter 15, Q-Bellman sweeps: 59)\n", " ▶ Best reward updated 0.450 → 0.650 (iter 19, Q-Bellman sweeps: 66)\n", " ▶ Best reward updated 0.650 → 0.800 (iter 23, Q-Bellman sweeps: 60)\n", " ▶ Best reward updated 0.800 → 0.850 (iter 35, Q-Bellman sweeps: 44)\n", "\n", "Solved in 35 outer iterations!\n" ] } ], "source": [ "TEST_EPISODES = 20\n", "\n", "iter_no = 0\n", "best_reward = 0.0\n", "\n", "while True:\n", " iter_no += 1\n", "\n", " # Step 1 — explore: collect 100 random transitions to update the model\n", " agent.play_n_random_steps(100)\n", "\n", " # Step 2 — plan: solve the empirical MDP with Q-Bellman updates\n", " n_bellman = agent.qvalue_iteration()\n", "\n", " # Step 3 — evaluate: run TEST_EPISODES greedy episodes\n", " reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))\n", " reward /= TEST_EPISODES\n", "\n", " if reward > best_reward:\n", " print(f\" ▶ Best reward updated {best_reward:.3f} → {reward:.3f} \"\n", " f\"(iter {iter_no}, Q-Bellman sweeps: {n_bellman})\")\n", " best_reward = reward\n", "\n", " if best_reward > 0.80:\n", " print(f\"\\nSolved in {iter_no} outer iterations!\")\n", " break\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "qvi_09", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Greedy policy (raw tensor):\n", "tensor([0, 3, 0, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])\n", "\n", "Greedy policy grid (4×4):\n", " ← ↑ ← ↑\n", " ← ← → ←\n", " ↑ ↓ ← ←\n", " ← → ↓ ←\n", "\n", "Derived V(s) = max_a Q(s,a) (reshaped to 4×4):\n", "tensor([[0.0670, 0.0560, 0.0660, 0.0480],\n", " [0.0900, 0.0000, 0.0970, 0.0000],\n", " [0.1420, 0.2440, 0.2870, 0.0000],\n", " [0.0000, 0.3730, 0.6290, 0.0000]])\n", "\n", "Q-table shape: torch.Size([16, 4])\n", "Q-table (states as rows, actions as columns):\n", "tensor([[0.0670, 0.0650, 0.0640, 0.0570],\n", " [0.0400, 0.0440, 0.0380, 0.0560],\n", " [0.0660, 0.0610, 0.0640, 0.0510],\n", " [0.0300, 0.0390, 0.0300, 0.0480],\n", " [0.0900, 0.0670, 0.0620, 0.0470],\n", " [0.0000, 0.0000, 0.0000, 0.0000],\n", " [0.0800, 0.0460, 0.0970, 0.0180],\n", " [0.0000, 0.0000, 0.0000, 0.0000],\n", " [0.0740, 0.1060, 0.1110, 0.1420],\n", " [0.1550, 0.2440, 0.1960, 0.1560],\n", " [0.2870, 0.2600, 0.1480, 0.0920],\n", " [0.0000, 0.0000, 0.0000, 0.0000],\n", " [0.0000, 0.0000, 0.0000, 0.0000],\n", " [0.2380, 0.2100, 0.3730, 0.3580],\n", " [0.3270, 0.6290, 0.5660, 0.5410],\n", " [0.0000, 0.0000, 0.0000, 0.0000]])\n" ] } ], "source": [ "policy = agent.get_policy() # shape (16,)\n", "action_symbols = {0: \"←\", 1: \"↓\", 2: \"→\", 3: \"↑\"}\n", "\n", "print(\"Greedy policy (raw tensor):\")\n", "print(policy)\n", "\n", "print(\"\\nGreedy policy grid (4×4):\")\n", "grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]\n", "for row in grid:\n", " print(\" \", \" \".join(row))\n", "\n", "print(\"\\nDerived V(s) = max_a Q(s,a) (reshaped to 4×4):\")\n", "print(agent.V.reshape(4, 4).round(decimals=3))\n", "\n", "print(\"\\nQ-table shape:\", agent.Q.shape)\n", "print(\"Q-table (states as rows, actions as columns):\")\n", "import torch\n", "print(agent.Q.round(decimals=3))\n" ] }, { "cell_type": "markdown", "id": "qvi_10", "metadata": {}, "source": [ "---\n", "## Example 2 — BlochSphereV1 (quantum discrete MDP)\n", "\n", "`BlochSphereV1` is a deterministic finite-graph MDP over the six canonical Bloch sphere states.\n", "With only 6 states and 4 actions the Q-table is tiny (6 × 4 = 24 entries), and\n", "`QValueIteration` converges in a single outer iteration.\n", "\n", "The `BlochSphereV1` renderer has **first-class support for Q-tables**: it detects `agent._Q`\n", "and visualises:\n", "\n", "- **Node colours** (right panel) — warm scale encoding `max_a Q(s,a)` per state.\n", "- **Bold teal edges** — the greedy policy `argmax_a Q(s,a)` for each state.\n", "- **Edge opacity** — proportional to empirical visit counts from `_counts`.\n", "- **Colorbar** — min and max Q-value states are labelled.\n", "\n", "This makes the Q-table visualisation richer than the V-table variant.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "qvi_11", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State space : 6 (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\n", "Action space : 4 (H=0, X=1, Z=2, S=3)\n", "Target state : 4 → |+i⟩\n", "Q shape : torch.Size([6, 4]) (6 states × 4 actions)\n", "\n", "Transition table T[s, a] = s':\n", "[[2 1 0 0]\n", " [3 0 1 1]\n", " [0 2 3 4]\n", " [1 3 2 5]\n", " [5 5 5 3]\n", " [4 4 4 2]]\n" ] } ], "source": [ "from qrl.algorithms.classical import QValueIteration\n", "from qrl.env import BlochSphereV1\n", "\n", "# Target state index 4 → |+i⟩\n", "# set ffmpeg=True if you have ffmpeg installed to save as mp4\n", "env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n", "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n", "\n", "agent = QValueIteration(env=env, gamma=0.9)\n", "\n", "print(f\"State space : {agent.n_states} (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\")\n", "print(f\"Action space : {agent.n_actions} (H=0, X=1, Z=2, S=3)\")\n", "print(f\"Target state : {env.target_state_index} → |+i⟩\")\n", "print(f\"Q shape : {agent.Q.shape} (6 states × 4 actions)\")\n", "print()\n", "print(\"Transition table T[s, a] = s':\")\n", "print(BlochSphereV1.transition_table())\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "qvi_12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 1 | Success rate: 1.000\n", " ▶ Best reward updated 0.000 → 1.000\n", "\n", "Solved in 1 iteration(s)!\n" ] } ], "source": [ "TEST_EPISODES = 20\n", "iter_no, best_reward = 0, 0.0\n", "\n", "while True:\n", " iter_no += 1\n", "\n", " # Explore: seed the empirical model\n", " agent.play_n_random_steps(50)\n", "\n", " # Plan: solve with Q-Value Iteration\n", " agent.qvalue_iteration()\n", "\n", " # Collect a graph snapshot for the animation\n", " env._render_graph(agent=agent)\n", "\n", " # Evaluate: measure success rate\n", " reward = 0.0\n", " for _ in range(TEST_EPISODES):\n", " obs, _ = test_env.reset()\n", " while True:\n", " action = agent.select_action(int(obs))\n", " obs, _, terminated, truncated, _ = test_env.step(action)\n", " if terminated or truncated:\n", " reward += float(terminated) # 1.0 on success\n", " break\n", " reward /= TEST_EPISODES\n", "\n", " print(f\"Iteration {iter_no:>3} | Success rate: {reward:.3f}\")\n", " if reward > best_reward:\n", " print(f\" ▶ Best reward updated {best_reward:.3f} → {reward:.3f}\")\n", " best_reward = reward\n", " if best_reward >= 1.0:\n", " print(f\"\\nSolved in {iter_no} iteration(s)!\")\n", " break\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "qvi_13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Greedy policy (argmax_a Q(s,a)):\n", " s=0 |0⟩ → H\n", " s=1 |1⟩ → H\n", " s=2 |+⟩ → S\n", " s=3 |-⟩ → Z\n", " s=4 |+i⟩ → H ← target\n", " s=5 |-i⟩ → Z\n", "\n", "Derived V(s) = max_a Q(s,a):\n", " s=0 |0⟩ V = 0.9000\n", " s=1 |1⟩ V = 0.8100\n", " s=2 |+⟩ V = 1.0000\n", " s=3 |-⟩ V = 0.9000\n", " s=4 |+i⟩ V = 0.0000\n", " s=5 |-i⟩ V = 1.0000\n", "\n", "Full Q-table Q[s, a] (states as rows, actions as columns):\n", " H X Z S\n", " |0⟩ 0.900 0.729 0.810 0.810\n", " |1⟩ 0.810 0.810 0.729 0.729\n", " |+⟩ 0.810 0.900 0.810 1.000\n", " |-⟩ 0.729 0.810 0.900 0.900\n", " |+i⟩ 0.000 0.000 0.000 0.000\n", " |-i⟩ 0.000 0.000 1.000 0.000\n", "\n", "Animation saved → bloch_sphere_qvalue_iteration.gif\n" ] } ], "source": [ "state_labels = [\"|0⟩\", \"|1⟩\", \"|+⟩\", \"|-⟩\", \"|+i⟩\", \"|-i⟩\"]\n", "action_names = [\"H\", \"X\", \"Z\", \"S\"]\n", "\n", "policy = agent.get_policy()\n", "print(\"Greedy policy (argmax_a Q(s,a)):\")\n", "for s, (label, a) in enumerate(zip(state_labels, policy)):\n", " marker = \" ← target\" if s == env.target_state_index else \"\"\n", " print(f\" s={s} {label:<6} → {action_names[int(a)]}{marker}\")\n", "\n", "print(\"\\nDerived V(s) = max_a Q(s,a):\")\n", "for s, (label, v) in enumerate(zip(state_labels, agent.V)):\n", " print(f\" s={s} {label:<6} V = {v:.4f}\")\n", "\n", "print(\"\\nFull Q-table Q[s, a] (states as rows, actions as columns):\")\n", "header = \" \" + \" \".join(f\"{a:>5}\" for a in action_names)\n", "print(header)\n", "for s, label in enumerate(state_labels):\n", " row = \" \".join(f\"{agent.Q[s, a].item():>5.3f}\" for a in range(agent.n_actions))\n", " print(f\" {label:<6} {row}\")\n", "\n", "# Save the learning animation\n", "env.render(\n", " save_path_without_extension=\"bloch_sphere_qvalue_iteration\",\n", " interval=600,\n", " ffmpeg=False,\n", ")\n", "print(\"\\nAnimation saved → bloch_sphere_qvalue_iteration.gif\")\n" ] }, { "cell_type": "markdown", "id": "qvi_14", "metadata": {}, "source": [ "## Implementation Notes & Extensions\n", "\n", "* **Memory vs. speed tradeoff**: `QValueIteration` uses O(|S||A|) memory to store the full\n", " Q-table, but gains O(|S||A|) speedup per `select_action()` call vs. `ValueIteration`.\n", " For environments with millions of states, use function approximation (DQN) instead.\n", "\n", "* **Discount factor `gamma`**: The Q-Bellman operator converges for any `γ ∈ [0, 1)`. Larger\n", " values make the agent more far-sighted but increase the number of Bellman sweeps needed.\n", "\n", "* **Convergence tolerance `tol`**: Default `1e-6`; relaxing to `1e-4` reduces planning time\n", " with minimal quality loss for most practical MDPs.\n", "\n", "* **`BlochSphereV1` renderer integration**: The renderer auto-detects `agent._Q` (Q-table)\n", " vs. `agent._V` (value table) and adjusts the visualisation accordingly. To use a custom\n", " agent class, ensure it exposes `_Q` or `_V` and `_counts` as attributes.\n", "\n", "* **Extension to Q-learning**: Replace `qvalue_iteration()` with a TD update rule:\n", " `Q[s,a] += α · (r + γ · max_a' Q[s',a'] − Q[s,a])`. This removes the need for the\n", " explicit empirical model P and R, transitioning to fully model-free learning.\n", "\n", "* **Extension to quantum RL**: The Q-table can be replaced by a variational quantum circuit\n", " (VQC) whose output approximates Q(s,a). This is the bridge to quantum Q-learning agents\n", " available elsewhere in the `qrl` library.\n", "\n" ] }, { "cell_type": "markdown", "id": "qvi_15", "metadata": {}, "source": [ "## Version History\n", "\n", "* **v0**: Initial implementation. Tabular model-based Q-Value Iteration over discrete MDPs.\n", " PyTorch tensors for P, R, and Q. Empirical model built via random exploration.\n", " Compatible with any Gymnasium environment with `Discrete` observation and action spaces,\n", " including qrl-qai quantum environments such as `BlochSphereV1`.\n", " `BlochSphereV1` renderer detects `_Q` attribute for richer visualisation.\n" ] }, { "cell_type": "markdown", "id": "qvi_16", "metadata": {}, "source": [ "## References (Suggested Reading)\n", "\n", "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.),\n", " Chapter 4 (Dynamic Programming) and Chapter 6 (TD / Q-learning).\n", "* Watkins, C. J. C. H., & Dayan, P. (1992). *Q-learning.* Machine Learning, 8(3-4), 279-292.\n", "* Mnih, V. et al. (2015). *Human-level control through deep reinforcement learning.* Nature.\n", " (DQN — the deep extension of Q-Value Iteration).\n", "* Gymnasium documentation — `FrozenLake-v1` environment.\n", "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information* —\n", " for the single-qubit gate definitions used in `BlochSphereV1`.\n" ] } ], "metadata": { "kernelspec": { "display_name": "qrl_env (3.12.3)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }