{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "qvi_01",
   "metadata": {},
   "source": [
    "# QValue Iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_02",
   "metadata": {},
   "source": [
    "## Description\n",
    "\n",
    "`QValueIteration` is a **model-based, tabular planning algorithm** for discrete Markov Decision\n",
    "Processes (MDPs). Like `ValueIteration`, it builds an empirical transition model P(s′|s,a) and\n",
    "mean reward model R(s,a,s′) from environment interaction. The key difference is that it directly\n",
    "maintains and updates the **state-action value function Q(s,a)** rather than the state-value\n",
    "function V(s).\n",
    "\n",
    "**Core Bellman update (applied at every iteration):**\n",
    "\n",
    "$$Q[s,a] \\leftarrow \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot \\max_{a'} Q[s',a']\\right)$$\n",
    "\n",
    "\n",
    "The state-value function is recoverable at any time as a derived quantity:\n",
    "\n",
    "$$V(s) = \\max_{a} Q(s, a)$$\n",
    "\n",
    "\n",
    "Action selection is a **direct table lookup** — no recomputation required:\n",
    "\n",
    "$$a^*(s) = \\arg\\max_{a} Q(s, a)$$\n",
    "\n",
    "**Why Q instead of V?**\n",
    "\n",
    "- `select_action()` is O(|A|) per call vs. O(|S||A|) for `ValueIteration`'s one-step lookahead.\n",
    "- Storing Q(s,a) is the natural precursor to Q-learning, deep Q-networks (DQN), and quantum RL\n",
    "  agents, making `QValueIteration` the more **forward-compatible** algorithmic choice.\n",
    "- The Q-table is directly visualised in `BlochSphereV1`'s right-panel renderer (value colormap\n",
    "  uses max_a Q(s,a), greedy policy uses argmax_a Q(s,a)).\n",
    "\n",
    "**Inherits from** `BaseIteration`, which handles empirical model construction, random\n",
    "exploration, and episode rollouts.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_03",
   "metadata": {},
   "source": [
    "## Constructor\n",
    "\n",
    "```python\n",
    "QValueIteration(\n",
    "    env: gym.Env,\n",
    "    gamma: float = 0.9,\n",
    "    num_test_episodes: int = 20,\n",
    "    device: Optional[torch.device] = None,\n",
    "    dtype: Optional[torch.dtype] = torch.float32,\n",
    ")\n",
    "```\n",
    "\n",
    "| Parameter           | Type                | Default         | Description                                                            |\n",
    "| ------------------- | ------------------- | --------------- | ---------------------------------------------------------------------- |\n",
    "| `env`               | `gym.Env`           | —               | Gymnasium / qrl-qai environment with **discrete** obs & action spaces. |\n",
    "| `gamma`             | `float`             | `0.9`           | Discount factor in [0, 1). Higher → more far-sighted.                  |\n",
    "| `num_test_episodes` | `int`               | `20`            | Informational; used by external training loops for evaluation.         |\n",
    "| `device`            | `torch.device`      | auto (CUDA/CPU) | Compute device for all tensors.                                        |\n",
    "| `dtype`             | `torch.dtype`       | `float32`       | Floating-point precision for all tensors.                              |\n",
    "\n",
    "The constructor calls `env.reset()` internally and initialises Q[s,a] = 0 for all (s,a).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_04",
   "metadata": {},
   "source": [
    "## Implementation Details\n",
    "\n",
    "### Exploration / model-building\n",
    "\n",
    "| Method / Property                     | Description                                                                 |\n",
    "| ------------------------------------- | --------------------------------------------------------------------------- |\n",
    "| `play_n_random_steps(n)`              | Execute *n* uniformly-random steps to seed the empirical model.             |\n",
    "| `play_episode(env)`                   | Run one greedy episode on *env*, updating the model. Returns total reward.  |\n",
    "\n",
    "### Planning\n",
    "\n",
    "| Method                                | Description                                                                 |\n",
    "| ------------------------------------- | --------------------------------------------------------------------------- |\n",
    "| `qvalue_iteration(max_iters, tol)`    | Apply Q Bellman updates until ‖Q_new − Q‖_∞ < `tol` (or `max_iters` steps). Returns iteration count. |\n",
    "\n",
    "### Action selection & inspection\n",
    "\n",
    "| Method / Property   | Description                                                                          |\n",
    "| ------------------- | ------------------------------------------------------------------------------------ |\n",
    "| `select_action(s)`  | Greedy action: `argmax_a Q(s,a)`. O(|A|) direct table lookup. Returns `int`.        |\n",
    "| `Q`                 | Current Q-table tensor, shape `(n_states, n_actions)`.                               |\n",
    "| `V`                 | Derived state-value tensor `max_a Q(s,a)`, shape `(n_states,)`. Not stored.          |\n",
    "| `get_policy()`      | Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state.    |\n",
    "\n",
    "### Internal model (inherited from `BaseIteration`)\n",
    "\n",
    "| Attribute         | Shape                        | Description                          |\n",
    "| ----------------- | ---------------------------- | ------------------------------------ |\n",
    "| `_counts`         | `(n_s, n_a, n_s)`            | Visitation counts C[s,a,s'].         |\n",
    "| `_reward_sum`     | `(n_s, n_a, n_s)`            | Accumulated reward per transition.   |\n",
    "| `_Q`              | `(n_states, n_actions)`      | Current Q-function tensor.           |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_05",
   "metadata": {},
   "source": [
    "## QValueIteration vs. ValueIteration\n",
    "\n",
    "| Aspect                    | `ValueIteration`                        | `QValueIteration`                          |\n",
    "| ------------------------- | --------------------------------------- | ------------------------------------------ |\n",
    "| Stored quantity           | V[s] — shape `(|S|,)`                  | Q[s,a] — shape `(|S|, |A|)`               |\n",
    "| Memory                    | O(\\|S\\|)                                | O(\\|S\\|\\|A\\|)                              |\n",
    "| `select_action()` cost    | O(\\|S\\|\\|A\\|) — one-step lookahead      | O(\\|A\\|) — direct table lookup             |\n",
    "| Extension to Q-learning   | Requires additional step                | Direct (Q-table is already maintained)     |\n",
    "| Planning call             | `value_iteration()`                     | `qvalue_iteration()`                       |\n",
    "| `BlochSphereV1` render    | Detects via `agent._V`                  | Detects via `agent._Q`                     |\n",
    "\n",
    "**Rule of thumb:** prefer `QValueIteration` when you need fast per-step action selection,\n",
    "when you plan to extend to Q-learning, or when working with `BlochSphereV1` (the renderer\n",
    "shows richer information from the Q-table — individual V(s) and argmax_a Q(s,a)).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_06",
   "metadata": {},
   "source": [
    "---\n",
    "## Example 1 — FrozenLake-v1 (stochastic environment)\n",
    "\n",
    "`FrozenLake-v1` with `is_slippery=True` is a classic stochastic 4×4 grid benchmark.\n",
    "The agent must navigate from the start cell to the goal without falling into holes;\n",
    "each intended action succeeds with probability 1/3 (the rest is split between\n",
    "two perpendicular directions).\n",
    "\n",
    "Here `QValueIteration` builds an empirical model through random exploration, then\n",
    "applies the Q-Bellman operator to convergence. Note how the Q-table allows\n",
    "`select_action()` to be a fast O(4) lookup rather than the O(64) lookahead\n",
    "required by `ValueIteration.select_action()`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "qvi_07",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "State space  : 16   (one per grid cell)\n",
      "Action space : 4   (left=0, down=1, right=2, up=3)\n",
      "Device       : cpu\n",
      "Q shape      : torch.Size([16, 4])\n",
      "V shape      : torch.Size([16])   (derived from Q)\n"
     ]
    }
   ],
   "source": [
    "import gymnasium as gym\n",
    "from qrl.algorithms.classical import QValueIteration\n",
    "\n",
    "env      = gym.make(\"FrozenLake-v1\", is_slippery=True)\n",
    "test_env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n",
    "\n",
    "agent = QValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "print(f\"State space  : {agent.n_states}   (one per grid cell)\")\n",
    "print(f\"Action space : {agent.n_actions}   (left=0, down=1, right=2, up=3)\")\n",
    "print(f\"Device       : {agent.device}\")\n",
    "print(f\"Q shape      : {agent.Q.shape}\")\n",
    "print(f\"V shape      : {agent.V.shape}   (derived from Q)\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "qvi_08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ▶ Best reward updated 0.000 → 0.400  (iter 3, Q-Bellman sweeps: 59)\n",
      "  ▶ Best reward updated 0.400 → 0.450  (iter 15, Q-Bellman sweeps: 59)\n",
      "  ▶ Best reward updated 0.450 → 0.650  (iter 19, Q-Bellman sweeps: 66)\n",
      "  ▶ Best reward updated 0.650 → 0.800  (iter 23, Q-Bellman sweeps: 60)\n",
      "  ▶ Best reward updated 0.800 → 0.850  (iter 35, Q-Bellman sweeps: 44)\n",
      "\n",
      "Solved in 35 outer iterations!\n"
     ]
    }
   ],
   "source": [
    "TEST_EPISODES = 20\n",
    "\n",
    "iter_no     = 0\n",
    "best_reward = 0.0\n",
    "\n",
    "while True:\n",
    "    iter_no += 1\n",
    "\n",
    "    # Step 1 — explore: collect 100 random transitions to update the model\n",
    "    agent.play_n_random_steps(100)\n",
    "\n",
    "    # Step 2 — plan: solve the empirical MDP with Q-Bellman updates\n",
    "    n_bellman = agent.qvalue_iteration()\n",
    "\n",
    "    # Step 3 — evaluate: run TEST_EPISODES greedy episodes\n",
    "    reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    if reward > best_reward:\n",
    "        print(f\"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}  \"\n",
    "              f\"(iter {iter_no}, Q-Bellman sweeps: {n_bellman})\")\n",
    "        best_reward = reward\n",
    "\n",
    "    if best_reward > 0.80:\n",
    "        print(f\"\\nSolved in {iter_no} outer iterations!\")\n",
    "        break\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "qvi_09",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Greedy policy (raw tensor):\n",
      "tensor([0, 3, 0, 3, 0, 0, 2, 0, 3, 1, 0, 0, 0, 2, 1, 0])\n",
      "\n",
      "Greedy policy grid (4×4):\n",
      "   ← ↑ ← ↑\n",
      "   ← ← → ←\n",
      "   ↑ ↓ ← ←\n",
      "   ← → ↓ ←\n",
      "\n",
      "Derived V(s) = max_a Q(s,a) (reshaped to 4×4):\n",
      "tensor([[0.0670, 0.0560, 0.0660, 0.0480],\n",
      "        [0.0900, 0.0000, 0.0970, 0.0000],\n",
      "        [0.1420, 0.2440, 0.2870, 0.0000],\n",
      "        [0.0000, 0.3730, 0.6290, 0.0000]])\n",
      "\n",
      "Q-table shape: torch.Size([16, 4])\n",
      "Q-table (states as rows, actions as columns):\n",
      "tensor([[0.0670, 0.0650, 0.0640, 0.0570],\n",
      "        [0.0400, 0.0440, 0.0380, 0.0560],\n",
      "        [0.0660, 0.0610, 0.0640, 0.0510],\n",
      "        [0.0300, 0.0390, 0.0300, 0.0480],\n",
      "        [0.0900, 0.0670, 0.0620, 0.0470],\n",
      "        [0.0000, 0.0000, 0.0000, 0.0000],\n",
      "        [0.0800, 0.0460, 0.0970, 0.0180],\n",
      "        [0.0000, 0.0000, 0.0000, 0.0000],\n",
      "        [0.0740, 0.1060, 0.1110, 0.1420],\n",
      "        [0.1550, 0.2440, 0.1960, 0.1560],\n",
      "        [0.2870, 0.2600, 0.1480, 0.0920],\n",
      "        [0.0000, 0.0000, 0.0000, 0.0000],\n",
      "        [0.0000, 0.0000, 0.0000, 0.0000],\n",
      "        [0.2380, 0.2100, 0.3730, 0.3580],\n",
      "        [0.3270, 0.6290, 0.5660, 0.5410],\n",
      "        [0.0000, 0.0000, 0.0000, 0.0000]])\n"
     ]
    }
   ],
   "source": [
    "policy = agent.get_policy()   # shape (16,)\n",
    "action_symbols = {0: \"←\", 1: \"↓\", 2: \"→\", 3: \"↑\"}\n",
    "\n",
    "print(\"Greedy policy (raw tensor):\")\n",
    "print(policy)\n",
    "\n",
    "print(\"\\nGreedy policy grid (4×4):\")\n",
    "grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]\n",
    "for row in grid:\n",
    "    print(\"  \", \" \".join(row))\n",
    "\n",
    "print(\"\\nDerived V(s) = max_a Q(s,a) (reshaped to 4×4):\")\n",
    "print(agent.V.reshape(4, 4).round(decimals=3))\n",
    "\n",
    "print(\"\\nQ-table shape:\", agent.Q.shape)\n",
    "print(\"Q-table (states as rows, actions as columns):\")\n",
    "import torch\n",
    "print(agent.Q.round(decimals=3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_10",
   "metadata": {},
   "source": [
    "---\n",
    "## Example 2 — BlochSphereV1 (quantum discrete MDP)\n",
    "\n",
    "`BlochSphereV1` is a deterministic finite-graph MDP over the six canonical Bloch sphere states.\n",
    "With only 6 states and 4 actions the Q-table is tiny (6 × 4 = 24 entries), and\n",
    "`QValueIteration` converges in a single outer iteration.\n",
    "\n",
    "The `BlochSphereV1` renderer has **first-class support for Q-tables**: it detects `agent._Q`\n",
    "and visualises:\n",
    "\n",
    "- **Node colours** (right panel) — warm scale encoding `max_a Q(s,a)` per state.\n",
    "- **Bold teal edges** — the greedy policy `argmax_a Q(s,a)` for each state.\n",
    "- **Edge opacity** — proportional to empirical visit counts from `_counts`.\n",
    "- **Colorbar** — min and max Q-value states are labelled.\n",
    "\n",
    "This makes the Q-table visualisation richer than the V-table variant.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "qvi_11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "State space  : 6  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\n",
      "Action space : 4  (H=0, X=1, Z=2, S=3)\n",
      "Target state : 4  → |+i⟩\n",
      "Q shape      : torch.Size([6, 4])   (6 states × 4 actions)\n",
      "\n",
      "Transition table T[s, a] = s':\n",
      "[[2 1 0 0]\n",
      " [3 0 1 1]\n",
      " [0 2 3 4]\n",
      " [1 3 2 5]\n",
      " [5 5 5 3]\n",
      " [4 4 4 2]]\n"
     ]
    }
   ],
   "source": [
    "from qrl.algorithms.classical import QValueIteration\n",
    "from qrl.env import BlochSphereV1\n",
    "\n",
    "# Target state index 4 → |+i⟩\n",
    "# set ffmpeg=True if you have ffmpeg installed to save as mp4\n",
    "env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n",
    "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n",
    "\n",
    "agent = QValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "print(f\"State space  : {agent.n_states}  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\")\n",
    "print(f\"Action space : {agent.n_actions}  (H=0, X=1, Z=2, S=3)\")\n",
    "print(f\"Target state : {env.target_state_index}  → |+i⟩\")\n",
    "print(f\"Q shape      : {agent.Q.shape}   (6 states × 4 actions)\")\n",
    "print()\n",
    "print(\"Transition table T[s, a] = s':\")\n",
    "print(BlochSphereV1.transition_table())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "qvi_12",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration   1 | Success rate: 1.000\n",
      "  ▶ Best reward updated 0.000 → 1.000\n",
      "\n",
      "Solved in 1 iteration(s)!\n"
     ]
    }
   ],
   "source": [
    "TEST_EPISODES = 20\n",
    "iter_no, best_reward = 0, 0.0\n",
    "\n",
    "while True:\n",
    "    iter_no += 1\n",
    "\n",
    "    # Explore: seed the empirical model\n",
    "    agent.play_n_random_steps(50)\n",
    "\n",
    "    # Plan: solve with Q-Value Iteration\n",
    "    agent.qvalue_iteration()\n",
    "\n",
    "    # Collect a graph snapshot for the animation\n",
    "    env._render_graph(agent=agent)\n",
    "\n",
    "    # Evaluate: measure success rate\n",
    "    reward = 0.0\n",
    "    for _ in range(TEST_EPISODES):\n",
    "        obs, _ = test_env.reset()\n",
    "        while True:\n",
    "            action = agent.select_action(int(obs))\n",
    "            obs, _, terminated, truncated, _ = test_env.step(action)\n",
    "            if terminated or truncated:\n",
    "                reward += float(terminated)   # 1.0 on success\n",
    "                break\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    print(f\"Iteration {iter_no:>3} | Success rate: {reward:.3f}\")\n",
    "    if reward > best_reward:\n",
    "        print(f\"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}\")\n",
    "        best_reward = reward\n",
    "    if best_reward >= 1.0:\n",
    "        print(f\"\\nSolved in {iter_no} iteration(s)!\")\n",
    "        break\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "qvi_13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Greedy policy (argmax_a Q(s,a)):\n",
      "  s=0  |0⟩     →  H\n",
      "  s=1  |1⟩     →  H\n",
      "  s=2  |+⟩     →  S\n",
      "  s=3  |-⟩     →  Z\n",
      "  s=4  |+i⟩    →  H ← target\n",
      "  s=5  |-i⟩    →  Z\n",
      "\n",
      "Derived V(s) = max_a Q(s,a):\n",
      "  s=0  |0⟩     V = 0.9000\n",
      "  s=1  |1⟩     V = 0.8100\n",
      "  s=2  |+⟩     V = 1.0000\n",
      "  s=3  |-⟩     V = 0.9000\n",
      "  s=4  |+i⟩    V = 0.0000\n",
      "  s=5  |-i⟩    V = 1.0000\n",
      "\n",
      "Full Q-table Q[s, a] (states as rows, actions as columns):\n",
      "           H       X       Z       S\n",
      "  |0⟩    0.900   0.729   0.810   0.810\n",
      "  |1⟩    0.810   0.810   0.729   0.729\n",
      "  |+⟩    0.810   0.900   0.810   1.000\n",
      "  |-⟩    0.729   0.810   0.900   0.900\n",
      "  |+i⟩   0.000   0.000   0.000   0.000\n",
      "  |-i⟩   0.000   0.000   1.000   0.000\n",
      "\n",
      "Animation saved → bloch_sphere_qvalue_iteration.gif\n"
     ]
    }
   ],
   "source": [
    "state_labels = [\"|0⟩\", \"|1⟩\", \"|+⟩\", \"|-⟩\", \"|+i⟩\", \"|-i⟩\"]\n",
    "action_names  = [\"H\", \"X\", \"Z\", \"S\"]\n",
    "\n",
    "policy = agent.get_policy()\n",
    "print(\"Greedy policy (argmax_a Q(s,a)):\")\n",
    "for s, (label, a) in enumerate(zip(state_labels, policy)):\n",
    "    marker = \" ← target\" if s == env.target_state_index else \"\"\n",
    "    print(f\"  s={s}  {label:<6}  →  {action_names[int(a)]}{marker}\")\n",
    "\n",
    "print(\"\\nDerived V(s) = max_a Q(s,a):\")\n",
    "for s, (label, v) in enumerate(zip(state_labels, agent.V)):\n",
    "    print(f\"  s={s}  {label:<6}  V = {v:.4f}\")\n",
    "\n",
    "print(\"\\nFull Q-table Q[s, a] (states as rows, actions as columns):\")\n",
    "header = \"       \" + \"   \".join(f\"{a:>5}\" for a in action_names)\n",
    "print(header)\n",
    "for s, label in enumerate(state_labels):\n",
    "    row = \"   \".join(f\"{agent.Q[s, a].item():>5.3f}\" for a in range(agent.n_actions))\n",
    "    print(f\"  {label:<6} {row}\")\n",
    "\n",
    "# Save the learning animation\n",
    "env.render(\n",
    "    save_path_without_extension=\"bloch_sphere_qvalue_iteration\",\n",
    "    interval=600,\n",
    "    ffmpeg=False,\n",
    ")\n",
    "print(\"\\nAnimation saved → bloch_sphere_qvalue_iteration.gif\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_14",
   "metadata": {},
   "source": [
    "## Implementation Notes & Extensions\n",
    "\n",
    "* **Memory vs. speed tradeoff**: `QValueIteration` uses O(|S||A|) memory to store the full\n",
    "  Q-table, but gains O(|S||A|) speedup per `select_action()` call vs. `ValueIteration`.\n",
    "  For environments with millions of states, use function approximation (DQN) instead.\n",
    "\n",
    "* **Discount factor `gamma`**: The Q-Bellman operator converges for any `γ ∈ [0, 1)`. Larger\n",
    "  values make the agent more far-sighted but increase the number of Bellman sweeps needed.\n",
    "\n",
    "* **Convergence tolerance `tol`**: Default `1e-6`; relaxing to `1e-4` reduces planning time\n",
    "  with minimal quality loss for most practical MDPs.\n",
    "\n",
    "* **`BlochSphereV1` renderer integration**: The renderer auto-detects `agent._Q` (Q-table)\n",
    "  vs. `agent._V` (value table) and adjusts the visualisation accordingly. To use a custom\n",
    "  agent class, ensure it exposes `_Q` or `_V` and `_counts` as attributes.\n",
    "\n",
    "* **Extension to Q-learning**: Replace `qvalue_iteration()` with a TD update rule:\n",
    "  `Q[s,a] += α · (r + γ · max_a' Q[s',a'] − Q[s,a])`. This removes the need for the\n",
    "  explicit empirical model P and R, transitioning to fully model-free learning.\n",
    "\n",
    "* **Extension to quantum RL**: The Q-table can be replaced by a variational quantum circuit\n",
    "  (VQC) whose output approximates Q(s,a). This is the bridge to quantum Q-learning agents\n",
    "  available elsewhere in the `qrl` library.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_15",
   "metadata": {},
   "source": [
    "## Version History\n",
    "\n",
    "* **v0**: Initial implementation. Tabular model-based Q-Value Iteration over discrete MDPs.\n",
    "  PyTorch tensors for P, R, and Q. Empirical model built via random exploration.\n",
    "  Compatible with any Gymnasium environment with `Discrete` observation and action spaces,\n",
    "  including qrl-qai quantum environments such as `BlochSphereV1`.\n",
    "  `BlochSphereV1` renderer detects `_Q` attribute for richer visualisation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "qvi_16",
   "metadata": {},
   "source": [
    "## References (Suggested Reading)\n",
    "\n",
    "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.),\n",
    "  Chapter 4 (Dynamic Programming) and Chapter 6 (TD / Q-learning).\n",
    "* Watkins, C. J. C. H., & Dayan, P. (1992). *Q-learning.* Machine Learning, 8(3-4), 279-292.\n",
    "* Mnih, V. et al. (2015). *Human-level control through deep reinforcement learning.* Nature.\n",
    "  (DQN — the deep extension of Q-Value Iteration).\n",
    "* Gymnasium documentation — `FrozenLake-v1` environment.\n",
    "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information* —\n",
    "  for the single-qubit gate definitions used in `BlochSphereV1`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "qrl_env (3.12.3)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}