{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "vi_01",
   "metadata": {},
   "source": [
    "# Value Iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_02",
   "metadata": {},
   "source": [
    "## Description\n",
    "\n",
    "`ValueIteration` is a **model-based, tabular planning algorithm** for discrete Markov Decision\n",
    "Processes (MDPs). It learns an empirical transition model P(s′|s,a) and a mean reward model\n",
    "R(s,a,s′) directly from environment interaction, then applies the Bellman optimality operator\n",
    "to convergence to obtain the optimal state-value function V*(s).\n",
    "\n",
    "**Core Bellman update (applied at every iteration):**\n",
    "\n",
    "$$V[s] \\leftarrow \\max_{a} \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot V[s']\\right)$$\n",
    "\n",
    "Once V* is known, the optimal policy is recovered by a one-step lookahead:\n",
    "\n",
    "$$a^*(s) = \\arg\\max_{a} \\sum_{s'} P(s'|s,a) \\cdot \\left(R(s,a,s') + \\gamma \\cdot V[s']\\right)$$\n",
    "\n",
    "Q(s,a) is **never stored** — it is computed transiently during planning and action selection.\n",
    "This keeps memory usage at O(|S|) instead of O(|S||A|), making `ValueIteration` slightly\n",
    "more memory-efficient than `QValueIteration` for large state spaces.\n",
    "\n",
    "**When to use `ValueIteration`?**\n",
    "\n",
    "- You want a simple, memory-efficient planner for small to medium discrete MDPs.\n",
    "- You need a reference baseline before moving to Q-learning or deep-RL.\n",
    "- The environment is small enough that a table fits in memory and planning is fast.\n",
    "\n",
    "**Inherits from** `BaseIteration`, which handles empirical model construction, random\n",
    "exploration, and episode rollouts.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_03",
   "metadata": {},
   "source": [
    "## Constructor\n",
    "\n",
    "```python\n",
    "ValueIteration(\n",
    "    env: gym.Env,\n",
    "    gamma: float = 0.9,\n",
    "    num_test_episodes: int = 20,\n",
    "    device: Optional[torch.device] = None,\n",
    "    dtype: Optional[torch.dtype] = torch.float32,\n",
    ")\n",
    "```\n",
    "\n",
    "| Parameter           | Type                | Default       | Description                                                            |\n",
    "| ------------------- | ------------------- | ------------- | ---------------------------------------------------------------------- |\n",
    "| `env`               | `gym.Env`           | —             | Gymnasium / qrl-qai environment with **discrete** obs & action spaces. |\n",
    "| `gamma`             | `float`             | `0.9`         | Discount factor in [0, 1). Higher → more far-sighted.                  |\n",
    "| `num_test_episodes` | `int`               | `20`          | Informational; used by external training loops for evaluation.         |\n",
    "| `device`            | `torch.device`      | auto (CUDA/CPU) | Compute device for all tensors.                                      |\n",
    "| `dtype`             | `torch.dtype`       | `float32`     | Floating-point precision for all tensors.                              |\n",
    "\n",
    "The constructor calls `env.reset()` internally to initialise the agent's current state.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_05",
   "metadata": {},
   "source": [
    "## Typical Training Loop\n",
    "\n",
    "```\n",
    "1. agent.play_n_random_steps(N)   ← seed the model\n",
    "2. agent.value_iteration()        ← run Bellman updates to convergence\n",
    "3. Evaluate agent.play_episode()  ← measure quality\n",
    "4. Repeat 1-3 until solved\n",
    "```\n",
    "\n",
    "Each outer iteration adds more environment experience (step 1), then re-solves the\n",
    "updated empirical MDP (step 2). As the empirical model improves, V* converges to the\n",
    "true optimal value function.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99558074",
   "metadata": {},
   "source": [
    "## Implementation Details\n",
    "\n",
    "### Exploration / model-building\n",
    "\n",
    "| Method / Property                   | Description                                                                 |\n",
    "| ----------------------------------- | --------------------------------------------------------------------------- |\n",
    "| `play_n_random_steps(n)`            | Execute *n* uniformly-random steps to seed the empirical model.             |\n",
    "| `play_episode(env)`                 | Run one greedy episode on *env*, updating the model. Returns total reward.  |\n",
    "\n",
    "### Planning\n",
    "\n",
    "| Method                              | Description                                                                 |\n",
    "| ----------------------------------- | --------------------------------------------------------------------------- |\n",
    "| `value_iteration(max_iters, tol)`   | Apply Bellman updates until ‖V_new − V‖_∞ < `tol` (or `max_iters` steps). Returns iteration count. |\n",
    "\n",
    "### Action selection & inspection\n",
    "\n",
    "| Method / Property  | Description                                                                          |\n",
    "| ------------------ | ------------------------------------------------------------------------------------ |\n",
    "| `select_action(s)` | Greedy action via one-step lookahead on V. Returns `int`.                            |\n",
    "| `V`                | Current state-value tensor, shape `(n_states,)`.                                    |\n",
    "| `get_policy()`     | Greedy policy tensor of shape `(n_states,)` — `argmax_a Q(s,a)` for every state.    |\n",
    "\n",
    "### Internal model (inherited from `BaseIteration`)\n",
    "\n",
    "| Attribute         | Shape                        | Description                          |\n",
    "| ----------------- | ---------------------------- | ------------------------------------ |\n",
    "| `_counts`         | `(n_s, n_a, n_s)`            | Visitation counts C[s,a,s'].         |\n",
    "| `_reward_sum`     | `(n_s, n_a, n_s)`            | Accumulated reward per transition.   |\n",
    "| `_V`              | `(n_states,)`                | Current value-function tensor.       |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_06",
   "metadata": {},
   "source": [
    "---\n",
    "## Example 1 — FrozenLake-v1 (stochastic environment)\n",
    "\n",
    "`FrozenLake-v1` with `is_slippery=True` is a classic stochastic benchmark:\n",
    "the agent tries to cross a frozen lake (4×4 grid) from start to goal without\n",
    "falling through holes. Transitions are stochastic — each intended action has\n",
    "only a 1/3 chance of succeeding; the remaining probability is split between\n",
    "the two perpendicular directions.\n",
    "\n",
    "This is an ideal showcase for `ValueIteration` because:\n",
    "- The state and action spaces are small (16 states, 4 actions).\n",
    "- Stochasticity means the empirical model must accumulate enough samples\n",
    "  before the value function converges to a useful policy.\n",
    "- A success rate of > 80 % is considered solved.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "vi_07",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "State space  : 16   (one per grid cell)\n",
      "Action space : 4   (left=0, down=1, right=2, up=3)\n",
      "Device       : cpu\n",
      "V shape      : torch.Size([16])\n"
     ]
    }
   ],
   "source": [
    "import gymnasium as gym\n",
    "from qrl.algorithms.classical import ValueIteration\n",
    "\n",
    "# Two separate env instances:\n",
    "#   env       → used by the agent for exploration and model-building\n",
    "#   test_env  → used for clean evaluation (no model updates during testing)\n",
    "env      = gym.make(\"FrozenLake-v1\", is_slippery=True)\n",
    "test_env = gym.make(\"FrozenLake-v1\", is_slippery=True)\n",
    "\n",
    "agent = ValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "print(f\"State space  : {agent.n_states}   (one per grid cell)\")\n",
    "print(f\"Action space : {agent.n_actions}   (left=0, down=1, right=2, up=3)\")\n",
    "print(f\"Device       : {agent.device}\")\n",
    "print(f\"V shape      : {agent.V.shape}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "vi_08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ▶ Best reward updated 0.000 → 0.250  (iter 16, Bellman sweeps: 77)\n",
      "  ▶ Best reward updated 0.250 → 0.500  (iter 17, Bellman sweeps: 71)\n",
      "  ▶ Best reward updated 0.500 → 0.550  (iter 18, Bellman sweeps: 60)\n",
      "  ▶ Best reward updated 0.550 → 0.600  (iter 23, Bellman sweeps: 43)\n",
      "  ▶ Best reward updated 0.600 → 0.700  (iter 43, Bellman sweeps: 42)\n",
      "  ▶ Best reward updated 0.700 → 0.750  (iter 48, Bellman sweeps: 54)\n",
      "  ▶ Best reward updated 0.750 → 0.800  (iter 77, Bellman sweeps: 43)\n",
      "  ▶ Best reward updated 0.800 → 0.850  (iter 91, Bellman sweeps: 34)\n",
      "\n",
      "Solved in 91 outer iterations!\n"
     ]
    }
   ],
   "source": [
    "TEST_EPISODES = 20\n",
    "\n",
    "iter_no     = 0\n",
    "best_reward = 0.0\n",
    "\n",
    "while True:\n",
    "    iter_no += 1\n",
    "\n",
    "    # Step 1 — explore: collect 100 random transitions to update the model\n",
    "    agent.play_n_random_steps(100)\n",
    "\n",
    "    # Step 2 — plan: solve the empirical MDP with Bellman updates\n",
    "    n_bellman = agent.value_iteration()\n",
    "\n",
    "    # Step 3 — evaluate: run TEST_EPISODES greedy episodes\n",
    "    reward = sum(agent.play_episode(test_env) for _ in range(TEST_EPISODES))\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    if reward > best_reward:\n",
    "        print(f\"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}  \"\n",
    "              f\"(iter {iter_no}, Bellman sweeps: {n_bellman})\")\n",
    "        best_reward = reward\n",
    "\n",
    "    if best_reward > 0.80:\n",
    "        print(f\"\\nSolved in {iter_no} outer iterations!\")\n",
    "        break\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "vi_09",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Greedy policy (raw tensor):\n",
      "tensor([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 1, 0, 0, 2, 1, 0])\n",
      "\n",
      "Greedy policy grid (4×4):\n",
      "   ← ↑ ← ↑\n",
      "   ← ← ← ←\n",
      "   ↑ ↓ ↓ ←\n",
      "   ← → ↓ ←\n",
      "\n",
      "State-value function V (reshaped to 4×4):\n",
      "tensor([[0.0640, 0.0520, 0.0570, 0.0420],\n",
      "        [0.0860, 0.0000, 0.0880, 0.0000],\n",
      "        [0.1350, 0.2320, 0.2640, 0.0000],\n",
      "        [0.0000, 0.3690, 0.6460, 0.0000]])\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "policy = agent.get_policy()   # shape (16,) — one greedy action per state\n",
    "action_symbols = {0: \"←\", 1: \"↓\", 2: \"→\", 3: \"↑\"}\n",
    "\n",
    "print(\"Greedy policy (raw tensor):\")\n",
    "print(policy)\n",
    "\n",
    "print(\"\\nGreedy policy grid (4×4):\")\n",
    "grid = [[action_symbols[int(policy[r * 4 + c])] for c in range(4)] for r in range(4)]\n",
    "for row in grid:\n",
    "    print(\"  \", \" \".join(row))\n",
    "\n",
    "print(\"\\nState-value function V (reshaped to 4×4):\")\n",
    "print(agent.V.reshape(4, 4).round(decimals=3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_10",
   "metadata": {},
   "source": [
    "---\n",
    "## Example 2 — BlochSphereV1 (quantum discrete MDP)\n",
    "\n",
    "`BlochSphereV1` models a single-qubit Bloch sphere as a **deterministic** finite graph:\n",
    "- **6 states** — the six canonical pure states: `|0⟩`, `|1⟩`, `|+⟩`, `|-⟩`, `|+i⟩`, `|-i⟩`.\n",
    "- **4 actions** — unitary gates H, X, Z, S.\n",
    "- **Reward** — sparse binary: `1.0` on reaching the target state, `0.0` otherwise.\n",
    "\n",
    "Because every gate acts deterministically, `ValueIteration` converges in very few outer\n",
    "iterations (often just one), demonstrating how model-based planning trivially solves\n",
    "small deterministic MDPs.\n",
    "\n",
    "After training, the environment's built-in 2D graph renderer produces an animated GIF\n",
    "showing:\n",
    "- **Left panel** — true state-transition graph with the episode trajectory overlaid.\n",
    "- **Right panel** — the agent's learned V(s) values (warm colormap) and greedy policy\n",
    "  (bold teal edges).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "vi_11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "State space  : 6  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\n",
      "Action space : 4  (H=0, X=1, Z=2, S=3)\n",
      "Target state : 4  → |+i⟩\n",
      "\n",
      "Transition table T[s, a] = s':\n",
      "[[2 1 0 0]\n",
      " [3 0 1 1]\n",
      " [0 2 3 4]\n",
      " [1 3 2 5]\n",
      " [5 5 5 3]\n",
      " [4 4 4 2]]\n"
     ]
    }
   ],
   "source": [
    "from qrl.algorithms.classical import ValueIteration\n",
    "from qrl.env import BlochSphereV1\n",
    "\n",
    "# Target state index 4 → |+i⟩\n",
    "# set ffmpeg=True if you have ffmpeg installed to save as mp4\n",
    "env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n",
    "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n",
    "\n",
    "agent = ValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "print(f\"State space  : {agent.n_states}  (|0⟩ |1⟩ |+⟩ |-⟩ |+i⟩ |-i⟩)\")\n",
    "print(f\"Action space : {agent.n_actions}  (H=0, X=1, Z=2, S=3)\")\n",
    "print(f\"Target state : {env.target_state_index}  → |+i⟩\")\n",
    "print()\n",
    "print(\"Transition table T[s, a] = s':\")\n",
    "print(BlochSphereV1.transition_table())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "vi_12",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration   1 | Success rate: 1.000\n",
      "  ▶ Best reward updated 0.000 → 1.000\n",
      "\n",
      "Solved in 1 iteration(s)!\n"
     ]
    }
   ],
   "source": [
    "TEST_EPISODES = 20\n",
    "iter_no, best_reward = 0, 0.0\n",
    "\n",
    "while True:\n",
    "    iter_no += 1\n",
    "\n",
    "    # Explore: seed the empirical model\n",
    "    agent.play_n_random_steps(50)\n",
    "\n",
    "    # Plan: solve with Value Iteration\n",
    "    agent.value_iteration()\n",
    "\n",
    "    # Collect a graph snapshot (appended to env.fig_array_list)\n",
    "    env._render_graph(agent=agent)\n",
    "\n",
    "    # Evaluate: measure success rate over TEST_EPISODES greedy rollouts\n",
    "    reward = 0.0\n",
    "    for _ in range(TEST_EPISODES):\n",
    "        obs, _ = test_env.reset()\n",
    "        while True:\n",
    "            action = agent.select_action(int(obs))\n",
    "            obs, _, terminated, truncated, _ = test_env.step(action)\n",
    "            if terminated or truncated:\n",
    "                reward += float(terminated)   # 1.0 on success, 0.0 on truncation\n",
    "                break\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    print(f\"Iteration {iter_no:>3} | Success rate: {reward:.3f}\")\n",
    "    if reward > best_reward:\n",
    "        print(f\"  ▶ Best reward updated {best_reward:.3f} → {reward:.3f}\")\n",
    "        best_reward = reward\n",
    "    if best_reward >= 1.0:          # 100 % success rate → solved\n",
    "        print(f\"\\nSolved in {iter_no} iteration(s)!\")\n",
    "        break\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "vi_13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Greedy policy:\n",
      "  s=0  |0⟩     →  H\n",
      "  s=1  |1⟩     →  H\n",
      "  s=2  |+⟩     →  S\n",
      "  s=3  |-⟩     →  Z\n",
      "  s=4  |+i⟩    →  H ← target\n",
      "  s=5  |-i⟩    →  H\n",
      "\n",
      "Learned V(s):\n",
      "  s=0  |0⟩     V = 0.9000\n",
      "  s=1  |1⟩     V = 0.8100\n",
      "  s=2  |+⟩     V = 1.0000\n",
      "  s=3  |-⟩     V = 0.9000\n",
      "  s=4  |+i⟩    V = 0.0000\n",
      "  s=5  |-i⟩    V = 1.0000\n",
      "\n",
      "Animation saved → bloch_sphere_value_iteration.gif\n"
     ]
    }
   ],
   "source": [
    "state_labels = [\"|0⟩\", \"|1⟩\", \"|+⟩\", \"|-⟩\", \"|+i⟩\", \"|-i⟩\"]\n",
    "action_names  = [\"H\", \"X\", \"Z\", \"S\"]\n",
    "\n",
    "policy = agent.get_policy()\n",
    "print(\"Greedy policy:\")\n",
    "for s, (label, a) in enumerate(zip(state_labels, policy)):\n",
    "    marker = \" ← target\" if s == env.target_state_index else \"\"\n",
    "    print(f\"  s={s}  {label:<6}  →  {action_names[int(a)]}{marker}\")\n",
    "\n",
    "print(\"\\nLearned V(s):\")\n",
    "for s, (label, v) in enumerate(zip(state_labels, agent.V)):\n",
    "    print(f\"  s={s}  {label:<6}  V = {v:.4f}\")\n",
    "\n",
    "# Save the learning animation\n",
    "env.render(\n",
    "    save_path_without_extension=\"bloch_sphere_value_iteration\",\n",
    "    interval=600,\n",
    "    ffmpeg=False,\n",
    ")\n",
    "print(\"\\nAnimation saved → bloch_sphere_value_iteration.gif\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_14",
   "metadata": {},
   "source": [
    "## Implementation Notes & Extensions\n",
    "\n",
    "* **Memory**: Only V[s] (shape `(n_states,)`) is stored. Q(s,a) is computed transiently\n",
    "  during `value_iteration()` and `select_action()`. This is more memory-efficient than\n",
    "  `QValueIteration` which stores the full Q-table.\n",
    "\n",
    "* **Action-selection cost**: `select_action()` performs a full one-step lookahead (O(|S||A|))\n",
    "  every time it is called. For large action spaces, prefer `QValueIteration` whose\n",
    "  `select_action()` is a simple table lookup (O(|A|)).\n",
    "\n",
    "* **Discount factor `gamma`**: Values close to 1.0 make the agent more far-sighted but\n",
    "  can slow convergence. For sparse-reward environments, try `gamma=0.99`.\n",
    "\n",
    "* **Convergence tolerance `tol`**: The default `1e-6` is tight; you can loosen it to\n",
    "  `1e-4` for faster (but slightly sub-optimal) convergence in large environments.\n",
    "\n",
    "* **Empirical model quality**: Call `play_n_random_steps(N)` with larger *N* before\n",
    "  the first `value_iteration()` call to improve model accuracy, especially for\n",
    "  stochastic environments like `FrozenLake-v1`.\n",
    "\n",
    "* **Extension — model-free bridge**: Replace the empirical model with a neural network\n",
    "  approximator to transition from Value Iteration to DQN-style algorithms.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_15",
   "metadata": {},
   "source": [
    "## Version History\n",
    "\n",
    "* **v0**: Initial implementation. Tabular model-based Value Iteration over discrete MDPs.\n",
    "  PyTorch tensors for P, R, and V. Empirical model built via random exploration.\n",
    "  Compatible with any Gymnasium environment with `Discrete` observation and action spaces,\n",
    "  including qrl-qai quantum environments such as `BlochSphereV1`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vi_16",
   "metadata": {},
   "source": [
    "## References (Suggested Reading)\n",
    "\n",
    "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.),\n",
    "  Chapter 4 — Dynamic Programming (Value Iteration, Policy Iteration).\n",
    "* Bellman, R., *Dynamic Programming* (1957) — the original formulation.\n",
    "* Gymnasium documentation — `FrozenLake-v1` environment.\n",
    "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information* —\n",
    "  for the single-qubit gate definitions used in `BlochSphereV1`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "qrl_env (3.12.3)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}