{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4",
   "metadata": {},
   "source": [
    "# BlochSphereV1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26161d04",
   "metadata": {},
   "source": [
    "![image](../images/bloch_spherev1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2c3d4e5",
   "metadata": {},
   "source": [
    "## Description\n",
    "\n",
    "`BlochSphereV1` models the single-qubit Bloch sphere as a **finite graph problem** for reinforcement learning.\n",
    "Rather than tracking a continuous statevector, the environment works with a discrete set of six canonical pure\n",
    "states — `|0⟩`, `|1⟩`, `|+⟩`, `|-⟩`, `|+i⟩`, `|-i⟩` — and four deterministic gate actions (H, X, Z, S).\n",
    "\n",
    "The qubit lives on a directed graph with 6 nodes. Each node is one of the canonical states and each edge\n",
    "represents a gate action that maps one state to another. Because this is a proper finite Markov Decision\n",
    "Process (MDP), `BlochSphereV1` is **fully compatible with tabular planning algorithms** such as\n",
    "`ValueIteration` and `QValueIteration` from `qrl.algorithms` — no wrapper required.\n",
    "\n",
    "The objective is to steer the qubit from the fixed initial state `|0⟩` (index 0) to a user-specified\n",
    "target pure state (default `|+⟩`, index 2) within a limited number of steps.\n",
    "\n",
    "Key details\n",
    "-----------\n",
    "- **Action space**: Discrete(4) — gates H, X, Z, S.\n",
    "- **Observation space**: Discrete(6) — integer index ∈ {0,1,2,3,4,5} for the six canonical states.\n",
    "- **Reward**: Binary sparse — `1.0` if fidelity ≥ `reward_tolerance`, `0.0` otherwise.\n",
    "- **Termination**: Success when fidelity ≥ `reward_tolerance`, truncation at `max_steps`.\n",
    "- **Rendering**: 2D state-transition graph with optional agent-model panel showing learned value function\n",
    "  and greedy policy.\n",
    "\n",
    "**When to prefer V1 over V0?**  Use `BlochSphereV1` when you want to apply classical RL planning methods\n",
    "(value iteration, Q-value iteration, policy iteration) or when a discrete-state formulation suffices\n",
    "for your experiment. Use `BlochSphereV0` for continuous-action experiments or deep-RL training that\n",
    "requires a richer, continuous Bloch-vector observation.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3d4e5f6",
   "metadata": {},
   "source": [
    "## Action Space\n",
    "\n",
    "The action space is `gymnasium.spaces.Discrete(4)`. Each integer index selects a unitary gate\n",
    "to apply to the current statevector. Gate transitions are **deterministic** — the same action\n",
    "from the same state always leads to the same next state.\n",
    "\n",
    "| Num | Action | Description          |\n",
    "| --- | ------ | -------------------- |\n",
    "| 0   | `H`    | Hadamard gate        |\n",
    "| 1   | `X`    | Pauli-X (NOT)        |\n",
    "| 2   | `Z`    | Pauli-Z              |\n",
    "| 3   | `S`    | Phase gate (S)       |\n",
    "\n",
    "The full (6 × 4) transition table `T[s, a] = s'` can be retrieved via the static method\n",
    "`BlochSphereV1.transition_table()`. This table is also used internally for the graph rendering.\n",
    "\n",
    "> **Note:** `BlochSphereV1` uses only 4 gates vs. the 17 available in `BlochSphereV0`.\n",
    "> The four gates are chosen because they suffice to reach all six canonical states from any\n",
    "> starting state and form a clean finite MDP over the canonical Bloch-sphere axes.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4e5f6g7",
   "metadata": {},
   "source": [
    "## Observation Space\n",
    "\n",
    "The observation is a **single integer** representing the current discrete state index.\n",
    "Space type: `gymnasium.spaces.Discrete(6)`.\n",
    "\n",
    "| Index | State   | Description                          |\n",
    "| ----- | ------- | ------------------------------------ |\n",
    "| 0     | `\\|0⟩`  | Computational basis state zero       |\n",
    "| 1     | `\\|1⟩`  | Computational basis state one        |\n",
    "| 2     | `\\|+⟩`  | Equal superposition (positive phase) |\n",
    "| 3     | `\\|-⟩`  | Equal superposition (negative phase) |\n",
    "| 4     | `\\|+i⟩` | Y-axis positive pole                 |\n",
    "| 5     | `\\|-i⟩` | Y-axis negative pole                 |\n",
    "\n",
    "**Auxiliary properties** (not part of `observation_space` but available on the env object):\n",
    "\n",
    "- `env.state_index` — current state index (same as the returned observation).\n",
    "- `env.bloch_vector` — `(x, y, z)` float32 array for the current canonical state.\n",
    "- `BlochSphereV1.transition_table()` — static (6, 4) integer array.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5f6g7h8",
   "metadata": {},
   "source": [
    "## Rewards\n",
    "\n",
    "The reward signal is **binary and sparse**:\n",
    "\n",
    "```\n",
    "reward = 1.0   if  fidelity(current_state, target_state) ≥ reward_tolerance\n",
    "reward = 0.0   otherwise\n",
    "```\n",
    "\n",
    "where fidelity is `|⟨target | current⟩|²` and `reward_tolerance` defaults to `0.99`.\n",
    "\n",
    "This sparse 0/1 signal is well-suited for **tabular planning algorithms** (e.g. Value Iteration)\n",
    "that propagate rewards backwards through the known transition table. For policy-gradient or\n",
    "Q-learning agents that benefit from denser feedback, consider wrapping the environment and\n",
    "shaping the reward (e.g. by adding a small negative step penalty).\n",
    "\n",
    "> **Contrast with V0:** `BlochSphereV0` returns a continuous fidelity reward in `[0, 1]` at\n",
    "> every step, making it more informative for gradient-based agents.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6g7h8i9",
   "metadata": {},
   "source": [
    "## Starting State\n",
    "\n",
    "On `reset()` the environment:\n",
    "\n",
    "- Sets `_state_index = 0` (corresponding to `|0⟩`).\n",
    "- Resets the internal statevector to `STATE_VECTORS[0]`.\n",
    "- Clears `history` to `[0]` (only the initial state index).\n",
    "- Sets `steps = 0`, `terminated = False`, `truncated = False`.\n",
    "\n",
    "The **target state** is fixed at construction time via the `target_state` argument (integer index,\n",
    "default `2` = `|+⟩`). It does **not** change between episodes.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "g7h8i9j0",
   "metadata": {},
   "source": [
    "## Episode End\n",
    "\n",
    "An episode ends when either of the following conditions is met:\n",
    "\n",
    "1. **Termination (success):** Fidelity between the current state and the target state meets or\n",
    "   exceeds `reward_tolerance` (default `0.99`). `terminated = True`.\n",
    "2. **Truncation:** The number of steps reaches `max_steps` (default `10`). `truncated = True`.\n",
    "\n",
    "`step()` returns the 5-tuple `(observation, reward, terminated, truncated, info)` in compliance\n",
    "with the modern Gymnasium API. The `info` dict contains:\n",
    "\n",
    "| Key            | Type        | Description                                   |\n",
    "| -------------- | ----------- | --------------------------------------------- |\n",
    "| `fidelity`     | `float`     | Current fidelity with the target state.       |\n",
    "| `gate`         | `str`       | Name of the gate applied in this step.        |\n",
    "| `state_index`  | `int`       | Current state index (same as `observation`).  |\n",
    "\n",
    "> **Note:** Unlike `BlochSphereV0`, `BlochSphereV1` returns a **5-tuple** from `step()`\n",
    "> (terminated and truncated are separate) and also returns `info` from `reset()`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "h8i9j0k1",
   "metadata": {},
   "source": [
    "## Render\n",
    "\n",
    "`BlochSphereV1` uses a **two-step rendering workflow**:\n",
    "\n",
    "### Step 1 — Collect frames: `env._render_graph(agent, show_true_dynamics=True)`\n",
    "\n",
    "Call this at the end of each training episode. It draws a snapshot of the current state and\n",
    "appends it to `env.fig_array_list`. The snapshot contains up to two panels:\n",
    "\n",
    "- **Left panel** (`show_true_dynamics=True`, default): True environment dynamics graph.\n",
    "  - Nodes are coloured: red = current, blue = target, yellow = current=target, grey = other.\n",
    "  - Episode trajectory is overlaid in bold orange.\n",
    "- **Right panel** (requires `agent` argument): Agent's learned model.\n",
    "  - Node colour encodes V(s) or max_a Q(s,a) on a warm colormap (warm = high value).\n",
    "  - Edge opacity is proportional to visit counts.\n",
    "  - Bold teal edges show the greedy policy (argmax_a Q(s,a)).\n",
    "  - A colorbar labels the minimum/maximum value states.\n",
    "\n",
    "Raises `ValueError` if no `agent` is provided.\n",
    "\n",
    "### Step 2 — Save animation: `env.render(save_path_without_extension, interval=600, ffmpeg=False)`\n",
    "\n",
    "Assembles all collected frames into a single `.gif` (or `.mp4` if `ffmpeg=True`) animation\n",
    "and saves it to `<save_path_without_extension>.gif`.\n",
    "\n",
    "Raises `ValueError` if `_render_graph()` was never called (no frames collected).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "i9j0k1l2",
   "metadata": {},
   "source": [
    "## Arguments (Constructor & Reset Options)\n",
    "\n",
    "Constructor signature:\n",
    "\n",
    "```python\n",
    "BlochSphereV1(\n",
    "    target_state: int = 2,\n",
    "    max_steps: int = 10,\n",
    "    reward_tolerance: float = 0.99,\n",
    "    ffmpeg: bool = False,\n",
    ")\n",
    "```\n",
    "\n",
    "- `target_state`: Target state index in `[0, 5]`. Default `2` (|+⟩). Mapping:\n",
    "  `0→|0⟩, 1→|1⟩, 2→|+⟩, 3→|-⟩, 4→|+i⟩, 5→|-i⟩`.\n",
    "- `max_steps`: Maximum actions per episode (truncation threshold). Default `10`.\n",
    "- `reward_tolerance`: Fidelity threshold for success. Must be in `(0, 1]`. Default `0.99`.\n",
    "- `ffmpeg`: If `True`, animations are saved as `.mp4` (requires ffmpeg). Default `False` (GIF).\n",
    "\n",
    "`reset(seed=None, options=None)` accepts an optional seed passed to the Gymnasium base class\n",
    "and always returns `(0, info_dict)` — the fixed initial state index and its info.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "j0k1l2m3",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Example 1 — Standalone random agent\n",
    "\n",
    "from qrl.env import BlochSphereV1\n",
    "\n",
    "# Target state is |+⟩ (index 2)\n",
    "# set ffmpeg=True if you have ffmpeg installed to save as mp4, or ffmpeg=False to save as gif\n",
    "env = BlochSphereV1(target_state=2, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n",
    "\n",
    "# Reset\n",
    "obs, info = env.reset()\n",
    "print(\"Initial state index:\", obs)          # always 0 → |0⟩\n",
    "print(\"Initial info:\", info)\n",
    "print(\"Bloch vector:\", env.bloch_vector)\n",
    "\n",
    "# Transition table (6 states × 4 actions)\n",
    "print(\"\\nTransition table (T[s, a] = s'):\\n\", BlochSphereV1.transition_table())\n",
    "\n",
    "# Random rollout\n",
    "action_names = ['H', 'X', 'Z', 'S']\n",
    "for step in range(env.max_steps):\n",
    "    action = env.action_space.sample()\n",
    "    obs, reward, terminated, truncated, info = env.step(action)\n",
    "    print(f\"Step {step+1} | Gate: {action_names[action]:<2} | \"\n",
    "          f\"State: {obs} | Reward: {reward:.1f} | \"\n",
    "          f\"Fidelity: {info['fidelity']:.4f}\")\n",
    "    if terminated or truncated:\n",
    "        break\n",
    "\n",
    "print(\"\\nEpisode finished. Env repr:\", repr(env))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "k1l2m3n4",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Example 2 — Training with ValueIteration + animated graph rendering\n",
    "\n",
    "from qrl.algorithms.classical import ValueIteration\n",
    "from qrl.env import BlochSphereV1\n",
    "\n",
    "# ---- Training and Testing Environment ------\n",
    "env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n",
    "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n",
    "\n",
    "# ---- Training Agent (ValueIteration Algorithm) ------\n",
    "agent    = ValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "# ---- Training Loop ------\n",
    "TEST_EPISODES = 20\n",
    "iter_no, best_reward = 0, 0.0\n",
    "while True:\n",
    "    iter_no += 1\n",
    "    agent.play_n_random_steps(50)\n",
    "    agent.value_iteration()\n",
    "\n",
    "    # ---- Save the training progress ------\n",
    "    env._render_graph(agent=agent)\n",
    "\n",
    "    reward = 0.0\n",
    "    for _ in range(TEST_EPISODES):\n",
    "        obs, _ = test_env.reset()\n",
    "        while True:\n",
    "            action = agent.select_action(int(obs))\n",
    "            obs, _, terminated, truncated, info = test_env.step(action)\n",
    "            if terminated or truncated:\n",
    "                reward += float(terminated)   # success rate\n",
    "                break\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    print(f\"Iteration {iter_no} reward: {reward:.3f}\")\n",
    "    if reward > best_reward:\n",
    "        print(\"Best reward updated %.3f -> %.3f\" % (best_reward, reward))\n",
    "        best_reward = reward\n",
    "    if reward >= 1.0:                         # 100% success rate\n",
    "        print(\"Solved in %d iterations!\" % iter_no)\n",
    "        break\n",
    "\n",
    "# ---- Render the training progress ------\n",
    "env.render(save_path_without_extension=\"bloch_sphere_value_iteration\",\n",
    "           interval=600, ffmpeg=False)\n",
    "print(\"Animation saved.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "l2m3n4o5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 1 reward: 1.000\n",
      "Best reward updated 0.000 -> 1.000\n",
      "Solved in 1 iterations!\n",
      "Animation saved.\n"
     ]
    }
   ],
   "source": [
    "### Example 3 — Training with QValueIteration\n",
    "\n",
    "from qrl.algorithms.classical import QValueIteration\n",
    "from qrl.env import BlochSphereV1\n",
    "\n",
    "# ---- Training and Testing Environment ------\n",
    "env      = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n",
    "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n",
    "\n",
    "# ---- Training Agent (QValueIteration Algorithm) ------\n",
    "agent    = QValueIteration(env=env, gamma=0.9)\n",
    "\n",
    "# ---- Training Loop ------\n",
    "TEST_EPISODES = 20\n",
    "iter_no, best_reward = 0, 0.0\n",
    "while True:\n",
    "    iter_no += 1\n",
    "    agent.play_n_random_steps(50)\n",
    "    agent.qvalue_iteration()\n",
    "\n",
    "    # ---- Save the training progress ------\n",
    "    env._render_graph(agent=agent,show_true_dynamics=True)\n",
    "\n",
    "    reward = 0.0\n",
    "    for _ in range(TEST_EPISODES):\n",
    "        obs, _ = test_env.reset()\n",
    "        while True:\n",
    "            action = agent.select_action(int(obs))\n",
    "            obs, _, terminated, truncated, info = test_env.step(action)\n",
    "            if terminated or truncated:\n",
    "                reward += float(terminated)   # success rate\n",
    "                break\n",
    "    reward /= TEST_EPISODES\n",
    "\n",
    "    print(f\"Iteration {iter_no} reward: {reward:.3f}\")\n",
    "    if reward > best_reward:\n",
    "        print(\"Best reward updated %.3f -> %.3f\" % (best_reward, reward))\n",
    "        best_reward = reward\n",
    "    if reward >= 1.0:                         # 100% success rate\n",
    "        print(\"Solved in %d iterations!\" % iter_no)\n",
    "        break\n",
    "\n",
    "# ---- Render the training progress ------\n",
    "env.render(save_path_without_extension=\"bloch_sphere_qvalue_iteration\",\n",
    "           interval=600, ffmpeg=False)\n",
    "print(\"Animation saved.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "m3n4o5p6",
   "metadata": {},
   "source": [
    "## Implementation Notes & Extensions\n",
    "\n",
    "* **Transition table**: The environment pre-computes a `(6, 4)` integer array `_TRANSITIONS`\n",
    "  (imported from `qrl.env.utils`) where `_TRANSITIONS[s, a] = s'`. This table is also\n",
    "  used for graph construction in the render methods. To inspect it:\n",
    "  `BlochSphereV1.transition_table()`.\n",
    "\n",
    "* **Algorithm compatibility**: `ValueIteration` detects the value function via the attribute\n",
    "  `agent._V`, while `QValueIteration` is detected via `agent._Q`. The render panel uses\n",
    "  `max_a Q(s, a)` when a Q-table is present and `V(s)` otherwise.\n",
    "\n",
    "* **Adding more states / actions**: Extending V1 to a richer gate set (e.g. adding T or\n",
    "  Y gates) requires updating `ACTION_NAMES`, `GATES`, and `_TRANSITIONS` in `utils.py`,\n",
    "  and adjusting `action_space = spaces.Discrete(N)` accordingly.\n",
    "\n",
    "* **Reward shaping**: For model-free agents, add a small step penalty (e.g. `−0.01`)\n",
    "  inside `get_reward()` to encourage shorter paths. The commented-out `STEP_PENALTY`\n",
    "  and `SUCCESS_BONUS` constants in the source are a natural starting point.\n",
    "\n",
    "* **Rendering only the agent panel**: Pass `show_true_dynamics=False` to `_render_graph()`\n",
    "  to omit the left panel and render only the agent's learned model.\n",
    "\n",
    "* **Multiple target states**: Wrap the environment in a meta-loop that instantiates\n",
    "  `BlochSphereV1` with different `target_state` values to train a goal-conditioned agent.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "n4o5p6q7",
   "metadata": {},
   "source": [
    "## Version History\n",
    "\n",
    "* **v1**: Discrete graph formulation of the Bloch sphere.\n",
    "  Six canonical states, four deterministic gate actions (H, X, Z, S).\n",
    "  Integer observation, binary sparse reward.\n",
    "  Fully compatible with `ValueIteration` and `QValueIteration` from `qrl.algorithms`.\n",
    "  Two-step 2D graph-based renderer with agent-model panel (value colormap + greedy policy overlay).\n",
    "  Returns 5-tuple `(obs, reward, terminated, truncated, info)` from `step()` in line with modern Gymnasium API.\n",
    "\n",
    "* **v0**: Initial design and implementation. Single-qubit pure-state environment with fixed initial state `|0⟩`, discrete gate set, fidelity reward, Matplotlib-based Bloch sphere renderer, and history tracking.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "o5p6q7r8",
   "metadata": {},
   "source": [
    "## References (Suggested Reading)\n",
    "\n",
    "* Bloch sphere — standard geometric representation for a single qubit.\n",
    "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information*\n",
    "  (for unitary gate definitions and single-qubit geometry).\n",
    "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.)\n",
    "  (for Value Iteration, Q-Value Iteration, and finite MDP fundamentals).\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "qrl_env (3.12.3)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}