{ "cells": [ { "cell_type": "markdown", "id": "a1b2c3d4", "metadata": {}, "source": [ "# BlochSphereV1" ] }, { "cell_type": "markdown", "id": "26161d04", "metadata": {}, "source": [ "![image](../images/bloch_spherev1.png)" ] }, { "cell_type": "markdown", "id": "b2c3d4e5", "metadata": {}, "source": [ "## Description\n", "\n", "`BlochSphereV1` models the single-qubit Bloch sphere as a **finite graph problem** for reinforcement learning.\n", "Rather than tracking a continuous statevector, the environment works with a discrete set of six canonical pure\n", "states — `|0⟩`, `|1⟩`, `|+⟩`, `|-⟩`, `|+i⟩`, `|-i⟩` — and four deterministic gate actions (H, X, Z, S).\n", "\n", "The qubit lives on a directed graph with 6 nodes. Each node is one of the canonical states and each edge\n", "represents a gate action that maps one state to another. Because this is a proper finite Markov Decision\n", "Process (MDP), `BlochSphereV1` is **fully compatible with tabular planning algorithms** such as\n", "`ValueIteration` and `QValueIteration` from `qrl.algorithms` — no wrapper required.\n", "\n", "The objective is to steer the qubit from the fixed initial state `|0⟩` (index 0) to a user-specified\n", "target pure state (default `|+⟩`, index 2) within a limited number of steps.\n", "\n", "Key details\n", "-----------\n", "- **Action space**: Discrete(4) — gates H, X, Z, S.\n", "- **Observation space**: Discrete(6) — integer index ∈ {0,1,2,3,4,5} for the six canonical states.\n", "- **Reward**: Binary sparse — `1.0` if fidelity ≥ `reward_tolerance`, `0.0` otherwise.\n", "- **Termination**: Success when fidelity ≥ `reward_tolerance`, truncation at `max_steps`.\n", "- **Rendering**: 2D state-transition graph with optional agent-model panel showing learned value function\n", " and greedy policy.\n", "\n", "**When to prefer V1 over V0?** Use `BlochSphereV1` when you want to apply classical RL planning methods\n", "(value iteration, Q-value iteration, policy iteration) or when a discrete-state 
formulation suffices\n", "for your experiment. Use `BlochSphereV0` for continuous-action experiments or deep-RL training that\n", "requires a richer, continuous Bloch-vector observation.\n", "\n" ] }, { "cell_type": "markdown", "id": "c3d4e5f6", "metadata": {}, "source": [ "## Action Space\n", "\n", "The action space is `gymnasium.spaces.Discrete(4)`. Each integer index selects a unitary gate\n", "to apply to the current statevector. Gate transitions are **deterministic** — the same action\n", "from the same state always leads to the same next state.\n", "\n", "| Num | Action | Description |\n", "| --- | ------ | -------------------- |\n", "| 0 | `H` | Hadamard gate |\n", "| 1 | `X` | Pauli-X (NOT) |\n", "| 2 | `Z` | Pauli-Z |\n", "| 3 | `S` | Phase gate (S) |\n", "\n", "The full (6 × 4) transition table `T[s, a] = s'` can be retrieved via the static method\n", "`BlochSphereV1.transition_table()`. This table is also used internally for the graph rendering.\n", "\n", "> **Note:** `BlochSphereV1` uses only 4 gates vs. 
the 17 available in `BlochSphereV0`.\n", "> The four gates are chosen because they suffice to reach all six canonical states from any\n", "> starting state and form a clean finite MDP over the canonical Bloch-sphere axes.\n", "\n" ] }, { "cell_type": "markdown", "id": "d4e5f6g7", "metadata": {}, "source": [ "## Observation Space\n", "\n", "The observation is a **single integer** representing the current discrete state index.\n", "Space type: `gymnasium.spaces.Discrete(6)`.\n", "\n", "| Index | State | Description |\n", "| ----- | ------- | ------------------------------------ |\n", "| 0 | `\\|0⟩` | Computational basis state zero |\n", "| 1 | `\\|1⟩` | Computational basis state one |\n", "| 2 | `\\|+⟩` | Equal superposition (positive phase) |\n", "| 3 | `\\|-⟩` | Equal superposition (negative phase) |\n", "| 4 | `\\|+i⟩` | Y-axis positive pole |\n", "| 5 | `\\|-i⟩` | Y-axis negative pole |\n", "\n", "**Auxiliary properties** (not part of `observation_space` but available on the env object):\n", "\n", "- `env.state_index` — current state index (same as the returned observation).\n", "- `env.bloch_vector` — `(x, y, z)` float32 array for the current canonical state.\n", "- `BlochSphereV1.transition_table()` — static (6, 4) integer array.\n", "\n" ] }, { "cell_type": "markdown", "id": "e5f6g7h8", "metadata": {}, "source": [ "## Rewards\n", "\n", "The reward signal is **binary and sparse**:\n", "\n", "```\n", "reward = 1.0 if fidelity(current_state, target_state) ≥ reward_tolerance\n", "reward = 0.0 otherwise\n", "```\n", "\n", "where fidelity is `|⟨target | current⟩|²` and `reward_tolerance` defaults to `0.99`.\n", "\n", "This sparse 0/1 signal is well-suited for **tabular planning algorithms** (e.g. Value Iteration)\n", "that propagate rewards backwards through the known transition table. For policy-gradient or\n", "Q-learning agents that benefit from denser feedback, consider wrapping the environment and\n", "shaping the reward (e.g. 
by adding a small negative step penalty).\n", "\n", "> **Contrast with V0:** `BlochSphereV0` returns a continuous fidelity reward in `[0, 1]` at\n", "> every step, making it more informative for gradient-based agents.\n", "\n" ] }, { "cell_type": "markdown", "id": "f6g7h8i9", "metadata": {}, "source": [ "## Starting State\n", "\n", "On `reset()` the environment:\n", "\n", "- Sets `_state_index = 0` (corresponding to `|0⟩`).\n", "- Resets the internal statevector to `STATE_VECTORS[0]`.\n", "- Clears `history` to `[0]` (only the initial state index).\n", "- Sets `steps = 0`, `terminated = False`, `truncated = False`.\n", "\n", "The **target state** is fixed at construction time via the `target_state` argument (integer index,\n", "default `2` = `|+⟩`). It does **not** change between episodes.\n", "\n" ] }, { "cell_type": "markdown", "id": "g7h8i9j0", "metadata": {}, "source": [ "## Episode End\n", "\n", "An episode ends when either of the following conditions is met:\n", "\n", "1. **Termination (success):** Fidelity between the current state and the target state meets or\n", " exceeds `reward_tolerance` (default `0.99`). `terminated = True`.\n", "2. **Truncation:** The number of steps reaches `max_steps` (default `10`). `truncated = True`.\n", "\n", "`step()` returns the 5-tuple `(observation, reward, terminated, truncated, info)` in compliance\n", "with the modern Gymnasium API. The `info` dict contains:\n", "\n", "| Key | Type | Description |\n", "| -------------- | ----------- | --------------------------------------------- |\n", "| `fidelity` | `float` | Current fidelity with the target state. |\n", "| `gate` | `str` | Name of the gate applied in this step. |\n", "| `state_index` | `int` | Current state index (same as `observation`). 
|\n", "\n", "> **Note:** Unlike `BlochSphereV0`, `BlochSphereV1` returns a **5-tuple** from `step()`\n", "> (terminated and truncated are separate) and also returns `info` from `reset()`.\n", "\n" ] }, { "cell_type": "markdown", "id": "h8i9j0k1", "metadata": {}, "source": [ "## Render\n", "\n", "`BlochSphereV1` uses a **two-step rendering workflow**:\n", "\n", "### Step 1 — Collect frames: `env._render_graph(agent, show_true_dynamics=True)`\n", "\n", "Call this at the end of each training episode. It draws a snapshot of the current state and\n", "appends it to `env.fig_array_list`. The snapshot contains up to two panels:\n", "\n", "- **Left panel** (`show_true_dynamics=True`, default): True environment dynamics graph.\n", "  - Nodes are coloured: red = current, blue = target, yellow = current and target coincide, grey = other.\n", "  - Episode trajectory is overlaid in bold orange.\n", "- **Right panel** (requires `agent` argument): Agent's learned model.\n", "  - Node colour encodes V(s) or max_a Q(s,a) on a warm colormap (warm = high value).\n", "  - Edge opacity is proportional to visit counts.\n", "  - Bold teal edges show the greedy policy (argmax_a Q(s,a)).\n", "  - A colorbar labels the minimum/maximum value states.\n", "\n", "Raises `ValueError` if no `agent` is provided.\n", "\n", "### Step 2 — Save animation: `env.render(save_path_without_extension, interval=600, ffmpeg=False)`\n", "\n", "Assembles all collected frames into a single animation and saves it at\n", "`save_path_without_extension` with the extension appended: `.gif` by default, or `.mp4` if `ffmpeg=True`.\n", "\n", "Raises `ValueError` if `_render_graph()` was never called (no frames collected).\n", "\n" ] }, { "cell_type": "markdown", "id": "i9j0k1l2", "metadata": {}, "source": [ "## Arguments (Constructor & Reset Options)\n", "\n", "Constructor signature:\n", "\n", "```python\n", "BlochSphereV1(\n", "    target_state: int = 2,\n", "    max_steps: int = 10,\n", "    reward_tolerance: float = 0.99,\n", "    ffmpeg: bool = False,\n", ")\n", "```\n", "\n", "- `target_state`: Target 
state index in `[0, 5]`. Default `2` (`|+⟩`). Mapping:\n", "  `0→|0⟩, 1→|1⟩, 2→|+⟩, 3→|-⟩, 4→|+i⟩, 5→|-i⟩`.\n", "- `max_steps`: Maximum actions per episode (truncation threshold). Default `10`.\n", "- `reward_tolerance`: Fidelity threshold for success. Must be in `(0, 1]`. Default `0.99`.\n", "- `ffmpeg`: If `True`, animations are saved as `.mp4` (requires ffmpeg). Default `False` (GIF).\n", "\n", "`reset(seed=None, options=None)` accepts an optional seed passed to the Gymnasium base class\n", "and always returns `(0, info_dict)` — the fixed initial state index and its info.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "j0k1l2m3", "metadata": {}, "outputs": [], "source": [ "### Example 1 — Standalone random agent\n", "\n", "from qrl.env import BlochSphereV1\n", "\n", "# Target state is |+⟩ (index 2)\n", "# set ffmpeg=True if you have ffmpeg installed to save as mp4, or ffmpeg=False to save as gif\n", "env = BlochSphereV1(target_state=2, max_steps=10, reward_tolerance=0.99, ffmpeg=False)\n", "\n", "# Reset\n", "obs, info = env.reset()\n", "print(\"Initial state index:\", obs)  # always 0 → |0⟩\n", "print(\"Initial info:\", info)\n", "print(\"Bloch vector:\", env.bloch_vector)\n", "\n", "# Transition table (6 states × 4 actions)\n", "print(\"\\nTransition table (T[s, a] = s'):\\n\", BlochSphereV1.transition_table())\n", "\n", "# Random rollout\n", "action_names = ['H', 'X', 'Z', 'S']\n", "for step in range(env.max_steps):\n", "    action = env.action_space.sample()\n", "    obs, reward, terminated, truncated, info = env.step(action)\n", "    print(f\"Step {step+1} | Gate: {action_names[action]:<2} | \"\n", "          f\"State: {obs} | Reward: {reward:.1f} | \"\n", "          f\"Fidelity: {info['fidelity']:.4f}\")\n", "    if terminated or truncated:\n", "        break\n", "\n", "print(\"\\nEpisode finished. 
Env repr:\", repr(env))\n" ] }, { "cell_type": "code", "execution_count": null, "id": "k1l2m3n4", "metadata": {}, "outputs": [], "source": [ "### Example 2 — Training with ValueIteration + animated graph rendering\n", "\n", "from qrl.algorithms.classical import ValueIteration\n", "from qrl.env import BlochSphereV1\n", "\n", "# ---- Training and Testing Environment ------\n", "env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n", "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n", "\n", "# ---- Training Agent (ValueIteration Algorithm) ------\n", "agent = ValueIteration(env=env, gamma=0.9)\n", "\n", "# ---- Training Loop ------\n", "TEST_EPISODES = 20\n", "iter_no, best_reward = 0, 0.0\n", "while True:\n", "    iter_no += 1\n", "    agent.play_n_random_steps(50)\n", "    agent.value_iteration()\n", "\n", "    # ---- Save the training progress ------\n", "    env._render_graph(agent=agent)\n", "\n", "    reward = 0.0\n", "    for _ in range(TEST_EPISODES):\n", "        obs, _ = test_env.reset()\n", "        while True:\n", "            action = agent.select_action(int(obs))\n", "            obs, _, terminated, truncated, info = test_env.step(action)\n", "            if terminated or truncated:\n", "                reward += float(terminated)  # success rate\n", "                break\n", "    reward /= TEST_EPISODES\n", "\n", "    print(f\"Iteration {iter_no} reward: {reward:.3f}\")\n", "    if reward > best_reward:\n", "        print(\"Best reward updated %.3f -> %.3f\" % (best_reward, reward))\n", "        best_reward = reward\n", "    if reward >= 1.0:  # 100% success rate\n", "        print(\"Solved in %d iterations!\" % iter_no)\n", "        break\n", "\n", "# ---- Render the training progress ------\n", "env.render(save_path_without_extension=\"bloch_sphere_value_iteration\",\n", "           interval=600, ffmpeg=False)\n", "print(\"Animation saved.\")" ] }, { "cell_type": "code", "execution_count": 1, "id": "l2m3n4o5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 1 reward: 1.000\n", "Best reward updated 
0.000 -> 1.000\n", "Solved in 1 iterations!\n", "Animation saved.\n" ] } ], "source": [ "### Example 3 — Training with QValueIteration\n", "\n", "from qrl.algorithms.classical import QValueIteration\n", "from qrl.env import BlochSphereV1\n", "\n", "# ---- Training and Testing Environment ------\n", "env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n", "test_env = BlochSphereV1(target_state=4, max_steps=10, reward_tolerance=0.99)\n", "\n", "# ---- Training Agent (QValueIteration Algorithm) ------\n", "agent = QValueIteration(env=env, gamma=0.9)\n", "\n", "# ---- Training Loop ------\n", "TEST_EPISODES = 20\n", "iter_no, best_reward = 0, 0.0\n", "while True:\n", "    iter_no += 1\n", "    agent.play_n_random_steps(50)\n", "    agent.qvalue_iteration()\n", "\n", "    # ---- Save the training progress ------\n", "    env._render_graph(agent=agent, show_true_dynamics=True)\n", "\n", "    reward = 0.0\n", "    for _ in range(TEST_EPISODES):\n", "        obs, _ = test_env.reset()\n", "        while True:\n", "            action = agent.select_action(int(obs))\n", "            obs, _, terminated, truncated, info = test_env.step(action)\n", "            if terminated or truncated:\n", "                reward += float(terminated)  # success rate\n", "                break\n", "    reward /= TEST_EPISODES\n", "\n", "    print(f\"Iteration {iter_no} reward: {reward:.3f}\")\n", "    if reward > best_reward:\n", "        print(\"Best reward updated %.3f -> %.3f\" % (best_reward, reward))\n", "        best_reward = reward\n", "    if reward >= 1.0:  # 100% success rate\n", "        print(\"Solved in %d iterations!\" % iter_no)\n", "        break\n", "\n", "# ---- Render the training progress ------\n", "env.render(save_path_without_extension=\"bloch_sphere_qvalue_iteration\",\n", "           interval=600, ffmpeg=False)\n", "print(\"Animation saved.\")" ] }, { "cell_type": "markdown", "id": "m3n4o5p6", "metadata": {}, "source": [ "## Implementation Notes & Extensions\n", "\n", "* **Transition table**: The environment pre-computes a `(6, 4)` integer array `_TRANSITIONS`\n", "  (imported from 
`qrl.env.utils`) where `_TRANSITIONS[s, a] = s'`. This table is also\n", "  used for graph construction in the render methods. To inspect it, call\n", "  `BlochSphereV1.transition_table()`.\n", "\n", "* **Algorithm compatibility**: The renderer detects a `ValueIteration` agent via its `agent._V`\n", "  attribute and a `QValueIteration` agent via `agent._Q`. The agent panel shows\n", "  `max_a Q(s, a)` when a Q-table is present and `V(s)` otherwise.\n", "\n", "* **Adding more states / actions**: Extending V1 to a richer gate set (e.g. adding T or\n", "  Y gates) requires updating `ACTION_NAMES`, `GATES`, and `_TRANSITIONS` in `utils.py`,\n", "  and adjusting `action_space = spaces.Discrete(N)` accordingly.\n", "\n", "* **Reward shaping**: For model-free agents, add a small step penalty (e.g. `-0.01`)\n", "  inside `get_reward()` to encourage shorter paths. The commented-out `STEP_PENALTY`\n", "  and `SUCCESS_BONUS` constants in the source are a natural starting point.\n", "\n", "* **Rendering only the agent panel**: Pass `show_true_dynamics=False` to `_render_graph()`\n", "  to omit the left panel and render only the agent's learned model.\n", "\n", "* **Multiple target states**: Wrap the environment in a meta-loop that instantiates\n", "  `BlochSphereV1` with different `target_state` values to train a goal-conditioned agent.\n", "\n" ] }, { "cell_type": "markdown", "id": "n4o5p6q7", "metadata": {}, "source": [ "## Version History\n", "\n", "* **v1**: Discrete graph formulation of the Bloch sphere.\n", "  Six canonical states, four deterministic gate actions (H, X, Z, S).\n", "  Integer observation, binary sparse reward.\n", "  Fully compatible with `ValueIteration` and `QValueIteration` from `qrl.algorithms`.\n", "  Two-step 2D graph-based renderer with agent-model panel (value colormap + greedy policy overlay).\n", "  Returns 5-tuple `(obs, reward, terminated, truncated, info)` from `step()` in line with modern Gymnasium API.\n", "\n", "* **v0**: Initial design and 
implementation. Single-qubit pure-state environment with fixed initial state `|0⟩`, discrete gate set, fidelity reward, Matplotlib-based Bloch sphere renderer, and history tracking.\n" ] }, { "cell_type": "markdown", "id": "o5p6q7r8", "metadata": {}, "source": [ "## References (Suggested Reading)\n", "\n", "* Bloch sphere — standard geometric representation for a single qubit.\n", "* Nielsen, M. A., & Chuang, I. L., *Quantum Computation and Quantum Information*\n", " (for unitary gate definitions and single-qubit geometry).\n", "* Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction* (2nd ed.)\n", " (for Value Iteration, Q-Value Iteration, and finite MDP fundamentals).\n" ] }, { "cell_type": "markdown", "id": "e026b150", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "qrl_env (3.12.3)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }