qrl.env package
Subpackages
- qrl.env.core package
- Submodules
- Module contents
Module contents
- class qrl.env.BlochSphereV0(*args: Any, **kwargs: Any)[source]
Bases: QuantumEnv
Single-qubit Bloch sphere environment for reinforcement learning.
BlochSphereV0 is a gymnasium.Env-compatible environment where an agent controls a single qubit via a discrete set of quantum gates. The qubit state is represented internally as a statevector and exposed to the agent as a 3D Bloch vector (x, y, z).
The objective is to steer the qubit from the fixed initial state |0⟩ to a target pure state (default |+⟩) within a limited number of steps by applying unitary gate actions.
Key details
Action space: Discrete set of single-qubit gates (Clifford + common rotations).
Observation space: Bloch vector (x, y, z), each component in [-1, 1].
Reward: Fidelity |⟨target | state⟩|² in [0, 1].
Termination: Success when reward exceeds reward_tolerance, or truncation at max_steps.
Rendering
The render() method visualizes the Bloch sphere and the agent’s trajectory, showing the current state and target state as arrows in 3D.
Input Parameters
target_state: Target pure state as a NumPy complex 2-vector, defaults to |+⟩.
max_steps: Maximum number of steps per episode.
reward_tolerance: Fidelity threshold for successful termination.
ffmpeg: If set to True, animations are saved as MP4 videos, else as GIFs. Default is False.
See also
tutorials/bloch_sphere
- get_reward(action)[source]
Apply a quantum gate action and compute the resulting reward.
This method evolves the internal qubit state by applying the unitary corresponding to the selected action and evaluates the fidelity with respect to the target state.
- Parameters:
action (int) – Index of the selected action in self.actions.
- Returns:
Fidelity between the current state and the target state, defined as |⟨target | state⟩|² and bounded in [0, 1].
- Return type:
float
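The Bloch-vector observation and the fidelity reward described for BlochSphereV0 can be reproduced with plain NumPy. The following is a minimal sketch under that assumption, not the package's implementation; the helper names bloch_vector and fidelity are hypothetical and exist only for illustration.

```python
import numpy as np

# Pauli matrices, used to read off the Bloch vector components.
PAULI_X = np.array([[0, 1], [1, 0]], dtype=complex)
PAULI_Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)
HADAMARD = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)


def bloch_vector(state):
    """Hypothetical helper: (x, y, z) = (<X>, <Y>, <Z>) for a normalized statevector."""
    return np.array(
        [np.real(np.vdot(state, pauli @ state)) for pauli in (PAULI_X, PAULI_Y, PAULI_Z)]
    )


def fidelity(target, state):
    """Hypothetical helper: |<target|state>|^2 for two normalized pure states."""
    return float(np.abs(np.vdot(target, state)) ** 2)


ket0 = np.array([1, 0], dtype=complex)                 # initial state |0>
ket_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)  # default target |+>

# Applying a Hadamard to |0> reaches |+>, so the fidelity with the target becomes 1.
state = HADAMARD @ ket0
reward_before = fidelity(ket_plus, ket0)
reward_after = fidelity(ket_plus, state)
```

Here reward_before is the fidelity of the initial state with the default target, and reward_after is the fidelity once a single Hadamard has been applied.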
- render(save_path_without_extension=None, interval=800)[source]
Render the Bloch sphere trajectory as a 3D animation.
The visualization shows:
- A translucent Bloch sphere with labeled basis states,
- The target Bloch vector (green, static),
- The evolving qubit state trajectory (red, dynamic).
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset()[source]
Reset the environment to the initial state.
The qubit is initialized to the computational basis state |0⟩. Episode step count and history are cleared.
- Returns:
observation (np.ndarray) – Initial Bloch vector corresponding to |0⟩, shape (3,).
info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.
- step(action)[source]
Execute one environment step.
Applies the selected quantum gate, updates the internal state and history, computes the reward, and checks termination conditions.
- Parameters:
action (int) – Index of the selected action in self.actions.
- Returns:
observation (np.ndarray) – Updated Bloch vector of the qubit state, shape (3,).
reward (float) – Fidelity-based reward after applying the action.
done (bool) – True if the episode has terminated due to success or truncation.
info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.
- class qrl.env.BlochSphereV1(target_state=2, max_steps=10, reward_tolerance=0.99, ffmpeg=False)[source]
Bases: QuantumEnv
Single-qubit Bloch sphere environment as a graph problem for reinforcement learning.
BlochSphereV1 is a gymnasium.Env-compatible environment where an agent controls a single qubit via a discrete set of quantum gates. The qubit state is exposed to the agent as an integer index corresponding to the discrete states |0⟩, |1⟩, |+⟩, |-⟩, |+i⟩, |-i⟩.
The objective is to steer the qubit from the fixed initial state |0⟩ to a user-defined target pure state (default |+⟩) within a limited number of steps by applying unitary gate actions.
The environment is fully compatible with ValueIteration and QValueIteration from qrl.algorithms.
Key details
Action space: Discrete set of single-qubit gates (H, X, Z, S).
Observation space: Integer index corresponding to the discrete states |0⟩, |1⟩, |+⟩, |-⟩, |+i⟩, |-i⟩.
Reward: Fidelity |⟨target | state⟩|² in [0, 1].
Termination: Success when reward exceeds reward_tolerance, or truncation at max_steps.
- Parameters:
target_state (int, optional) – Target state index in [0, 5]. Defaults to 2 (|+⟩). The mapping is: 0 → |0⟩, 1 → |1⟩, 2 → |+⟩, 3 → |-⟩, 4 → |+i⟩, 5 → |-i⟩.
max_steps (int, optional) – Maximum number of steps per episode. Default is 10.
reward_tolerance (float, optional) – Fidelity threshold for successful termination. Must be in (0, 1]. Default is 0.99.
ffmpeg (bool, optional) – If True, animations are saved as MP4 via ffmpeg, else as GIFs. Default is False.
- Raises:
ValueError – If target_state is not in [0, 5].
ValueError – If reward_tolerance is not in (0, 1].
ValueError – If ffmpeg=True but ffmpeg is not installed on the system.
- property bloch_vector: pennylane.numpy.ndarray
Current Bloch vector (x, y, z).
- get_reward()[source]
Compute the reward for the current state and update termination flags.
Evaluates the fidelity between the current statevector and the target statevector. Sets self.terminated if fidelity meets or exceeds reward_tolerance, and self.truncated if the step limit is reached.
- Returns:
1.0 if the current state matches the target within reward_tolerance, 0.0 otherwise.
- Return type:
float
- render(save_path_without_extension, interval=600, ffmpeg=False)[source]
Render accumulated graph frames as an animation and save to disk.
Assembles the list of graph snapshots captured by _render_graph() into a single animation. Each frame corresponds to one call to _render_graph(), producing a visual record of the agent’s learning progression over episodes.
- Parameters:
save_path_without_extension (str) – File path (without extension) where the animation will be saved. The appropriate extension (.mp4 or .gif) is appended automatically based on the ffmpeg argument.
interval (int, optional) – Delay between frames in milliseconds. Default is 600.
ffmpeg (bool, optional) – If True, saves the animation as an MP4 using ffmpeg. If False, saves as a GIF using Pillow. Default is False.
- Raises:
ValueError – If _render_graph() has not been called and no frames are available.
- Returns:
This method produces an animation file but does not return a value.
- Return type:
None
- reset(*, seed=None, options=None)[source]
Reset the environment to the initial state.
The qubit is placed at state index 0 (|0⟩). Episode step count, history, and termination flags are cleared.
- Parameters:
seed (int or None, optional) – Random seed passed to the base gymnasium.Env reset. Default is None.
options (dict or None, optional) – Additional options passed to the base reset. Default is None.
- Returns:
observation (int) – Initial state index (always 0, corresponding to |0⟩).
info (dict) – Dictionary containing fidelity, gate ("reset"), and bloch_vector of the initial state.
- property state_index: int
Current state index (0-5).
- step(action)[source]
Apply a gate action and advance the episode by one step.
Applies the unitary gate corresponding to action to the current statevector, updates the discrete state index via the transition table, increments the step counter, and appends the new state to history.
- Parameters:
action (int) – Index into ACTION_NAMES selecting the gate to apply. 0 → H, 1 → X, 2 → Z, 3 → S.
- Returns:
observation (int) – New discrete state index after applying the gate.
reward (float) – 1.0 if the target is reached within tolerance, 0.0 otherwise.
terminated (bool) – True if fidelity ≥ reward_tolerance.
truncated (bool) – True if steps ≥ max_steps.
info (dict) – Dictionary containing fidelity, gate (name of the applied gate), and bloch_vector of the resulting state.
- static transition_table()[source]
Return the deterministic state-transition table for the environment.
Each entry T[s, a] gives the next state index when action a is taken from state s. Rows correspond to the 6 Bloch sphere states and columns to the 4 gate actions (H, X, Z, S).
- Returns:
Integer array of shape (6, 4) where T[s, a] = s'.
- Return type:
np.ndarray
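A table of this shape can be derived from the gate matrices alone. The sketch below is an independent NumPy reconstruction, assuming the state ordering 0 → |0⟩, 1 → |1⟩, 2 → |+⟩, 3 → |-⟩, 4 → |+i⟩, 5 → |-i⟩ and the action ordering H, X, Z, S given above; build_transition_table is a hypothetical name, and the result is not guaranteed to match the array returned by transition_table() entry for entry.

```python
import numpy as np

SQRT2 = np.sqrt(2)

# The six states, in the index order documented for BlochSphereV1.
STATES = [
    np.array([1, 0], dtype=complex),            # 0: |0>
    np.array([0, 1], dtype=complex),            # 1: |1>
    np.array([1, 1], dtype=complex) / SQRT2,    # 2: |+>
    np.array([1, -1], dtype=complex) / SQRT2,   # 3: |->
    np.array([1, 1j], dtype=complex) / SQRT2,   # 4: |+i>
    np.array([1, -1j], dtype=complex) / SQRT2,  # 5: |-i>
]

# Gate matrices in the documented action order: H, X, Z, S.
GATES = [
    np.array([[1, 1], [1, -1]], dtype=complex) / SQRT2,
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
    np.array([[1, 0], [0, 1j]], dtype=complex),
]


def build_transition_table():
    """Hypothetical helper: T[s, a] is the index of the state reached from s via gate a."""
    table = np.zeros((len(STATES), len(GATES)), dtype=int)
    for s, psi in enumerate(STATES):
        for a, gate in enumerate(GATES):
            new_state = gate @ psi
            # The matching state has overlap 1 (global phase is ignored by abs).
            overlaps = [abs(np.vdot(phi, new_state)) ** 2 for phi in STATES]
            table[s, a] = int(np.argmax(overlaps))
    return table


table = build_transition_table()
```

Because each of these gates maps the six-state set onto itself, every transition is deterministic, which is what makes tabular methods applicable.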
- class qrl.env.CompilerV0(*args: Any, **kwargs: Any)[source]
Bases: QuantumEnv
Single-qubit quantum gate compilation environment.
CompilerV0 is a gymnasium.Env-compatible environment that models the problem of compiling a target single-qubit unitary using a fixed, discrete gate set. The agent incrementally applies quantum gates to build a circuit whose resulting unitary approximates a given target operation in SU(2).
At each step, the agent selects a gate action that left-multiplies the current circuit unitary. The episode reward is based on the average gate fidelity between the current unitary and the target unitary, encouraging the agent to discover short, high-fidelity gate sequences.
Key properties
Action space: Discrete set of single-qubit gates (Clifford + rotations).
Observation space: Flattened real and imaginary parts of the current 2×2 unitary (shape (8,)).
Reward: Average gate fidelity with respect to the target unitary.
Termination: Success when fidelity exceeds reward_tolerance, or truncation at max_steps.
Rendering
The render() method visualizes the compilation process by displaying a heatmap of the magnitude of the difference matrix |U_target − U| over time, annotated with the current step, last applied gate, and reward.
Input Parameters
- target : np.ndarray
Target 2×2 unitary matrix in SU(2) to compile towards.
- max_steps : int
Maximum number of gate applications per episode.
- reward_tolerance : float
Fidelity threshold for early termination.
- ffmpeg : bool
Whether to use FFmpeg when saving animations.
See also
- tutorials/compiler
Step-by-step tutorial on compiling SU(2) unitaries using CompilerV0.
- get_reward(action)[source]
Apply a quantum gate action and compute the compilation reward.
This method left-multiplies the current circuit unitary by the unitary corresponding to the selected action and evaluates the average gate fidelity with respect to the target unitary.
- Parameters:
action (int) – Index of the selected action in self.actions.
- Returns:
Average gate fidelity between the current unitary and the target unitary, defined as 0.5 * |Tr(U_target† · U)| for a single-qubit system.
- Return type:
float
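The left-multiplication update and the reward formula 0.5 * |Tr(U_target† · U)| stated above can be illustrated with a few lines of NumPy. This is a sketch rather than the package's code: trace_fidelity and to_observation are hypothetical names, and the ordering of real parts before imaginary parts in the 8-element observation is an assumption.

```python
import numpy as np

# Two example gates; T applied twice equals S.
T_GATE = np.diag([1, np.exp(1j * np.pi / 4)])
S_GATE = np.diag([1, 1j])


def trace_fidelity(target, unitary):
    """Hypothetical helper: 0.5 * |Tr(target^dagger @ unitary)| for 2x2 unitaries."""
    return float(0.5 * abs(np.trace(target.conj().T @ unitary)))


def to_observation(unitary):
    """Hypothetical helper: flatten real then imaginary parts into shape (8,).

    The real-before-imaginary ordering is an assumption.
    """
    return np.concatenate([unitary.real.ravel(), unitary.imag.ravel()]).astype(np.float32)


# Start from the identity, as after reset(), and left-multiply each chosen gate.
circuit = np.eye(2, dtype=complex)
reward_at_reset = trace_fidelity(S_GATE, circuit)
for gate in (T_GATE, T_GATE):
    circuit = gate @ circuit
reward_after_two_steps = trace_fidelity(S_GATE, circuit)
observation = to_observation(circuit)
```

With S as the target, the reward starts below 1 at the identity and reaches 1 once two T gates have been applied.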
- render(save_path_without_extension=None, interval=800)[source]
Render the compilation process as an animation of the difference matrix.
The visualization shows the magnitude of the element-wise difference |U_target - U| as a heatmap that evolves over time, along with annotations indicating the current step, applied action, and reward.
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset()[source]
Reset the environment to the initial compilation state.
The circuit unitary is reset to the identity matrix, the step counter is cleared, and the history buffer is reinitialized.
- Returns:
observation (np.ndarray) – Flattened observation corresponding to the identity unitary, shape (8,).
info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.
- step(action)[source]
Execute one compilation step.
Applies the selected gate, updates the internal circuit unitary and history, computes the reward, and checks termination conditions.
- Parameters:
action (int) – Index of the selected action in self.actions.
- Returns:
observation (np.ndarray) – Updated flattened unitary observation, shape (8,).
reward (float) – Average gate fidelity after applying the action.
done (bool) – True if the episode has terminated due to reaching the fidelity threshold or the maximum number of steps.
info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.
- class qrl.env.ErrorChannelV0(n_qubits=3, faulty_qubits=None, max_steps=10, seed=None, ffmpeg=False)[source]
Bases: QuantumEnv
Multi-qubit error mitigation environment with bit-flip noise.
ErrorChannelV0 is a gymnasium.Env-compatible environment that models a noisy multi-qubit quantum system affected by independent bit-flip error channels. Each qubit may experience noise with a different probability, and the agent’s task is to apply corrective Pauli-X operations to recover the target computational basis state |0…0⟩.
The environment captures a simplified quantum error mitigation scenario, where the agent sequentially selects qubits on which to apply corrections based on observed measurement probabilities.
Key properties
Action space: Discrete choice of qubit index on which to apply an X gate.
Observation space: Probability distribution over all computational basis states (shape (2**n_qubits,)).
Reward: Negative mean-squared error between the corrected distribution and the ideal |0…0⟩ distribution.
Termination: Success when perfect correction is achieved, or truncation at max_steps.
Rendering
The render() method visualizes the mitigation process using a side-by-side animation that compares ideal, noisy, and corrected probability distributions, along with a dynamically updated circuit diagram showing applied corrections.
Input Parameters
- n_qubits : int
Number of qubits in the system.
- faulty_qubits : dict[int, float] or None
Mapping from qubit indices to bit-flip probabilities.
- max_steps : int
Maximum number of correction steps per episode.
- seed : int or None
Random seed for reproducibility.
- ffmpeg : bool
Whether to save animations as MP4 using FFmpeg (True) or as GIFs using Pillow (False).
See also
- tutorials/error_channel
Tutorial on multi-qubit error mitigation with bit-flip noise.
- get_reward(action)[source]
Apply a correction action and compute the reward.
The selected qubit index is appended to the correction list, the noisy and corrected circuits are evaluated, and the reward is computed as the negative mean-squared error between the corrected and target probability distributions.
- Parameters:
action (int) – Index of the qubit on which a Pauli-X correction is applied.
- Returns:
Reward value defined as the negative mean-squared error between the corrected probability distribution and the target distribution.
- Return type:
float
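The negative mean-squared-error reward can be worked through on a small analytic model. The sketch below assumes independent bit flips acting on |0…0⟩, treats a Pauli-X correction as swapping the 0 and 1 outcomes of one qubit, and takes qubit 0 as the most significant bit; the function names are hypothetical, and the environment itself evaluates circuits rather than this closed form.

```python
from itertools import product

import numpy as np


def bitflip_distribution(n_qubits, flip_probs, corrections=()):
    """Hypothetical helper: basis-state probabilities after bit flips and X corrections.

    flip_probs maps qubit index -> bit-flip probability (missing qubits are noiseless).
    corrections lists qubit indices on which an X gate is applied afterwards.
    """
    # Probability that each qubit is measured as 1.
    p_one = [flip_probs.get(q, 0.0) for q in range(n_qubits)]
    for q in corrections:
        p_one[q] = 1.0 - p_one[q]  # X exchanges the outcomes 0 and 1
    probs = [
        np.prod([p_one[q] if bit else 1.0 - p_one[q] for q, bit in enumerate(bits)])
        for bits in product((0, 1), repeat=n_qubits)
    ]
    return np.array(probs, dtype=np.float32)


def negative_mse(corrected, ideal):
    """Hypothetical helper: reward as the negative mean-squared error."""
    return -float(np.mean((corrected - ideal) ** 2))


ideal = np.array([1, 0, 0, 0], dtype=np.float32)  # all weight on |00>
faulty = {1: 0.9}                                  # qubit 1 flips with probability 0.9

noisy = bitflip_distribution(2, faulty)
corrected = bitflip_distribution(2, faulty, corrections=[1])
reward_noisy = negative_mse(noisy, ideal)
reward_corrected = negative_mse(corrected, ideal)
```

Correcting the qubit that is more likely flipped than not moves the distribution toward the ideal one, so reward_corrected is closer to zero than reward_noisy.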
- render(save_path_without_extension=None, interval_ms=600)[source]
Render the error-mitigation process as an animated visualization.
The animation consists of:
- A bar chart comparing ideal, noisy, and corrected probability distributions for each computational basis state.
- A dynamically updated ASCII-style circuit diagram showing the applied correction operations.
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
interval_ms (int, optional) – Time between animation frames in milliseconds. Default is 600.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset(*, seed=None)[source]
Reset the environment to the initial noisy state.
Clears the correction history, resets the step counter, and evaluates the noisy circuit without any corrective operations.
- Parameters:
seed (int or None, optional) – Random seed for reproducibility. If provided, reinitializes the internal random number generator.
- Returns:
observation – Initial corrected probability distribution over computational basis states (identical to the noisy distribution at reset), with dtype float32.
- Return type:
np.ndarray
- step(action)[source]
Execute one environment step.
Applies a correction action, updates internal state and history, computes the reward, and checks termination conditions.
- Parameters:
action (int) – Index of the qubit on which a Pauli-X correction is applied.
- Returns:
observation (np.ndarray) – Corrected probability distribution over computational basis states, with dtype float32.
reward (float) – Negative mean-squared error between the corrected and target distributions.
done (bool) – True if the episode has terminated due to reaching the maximum number of steps or achieving perfect correction.
info (dict) – Dictionary containing metadata about the environment, including the mapping of faulty qubits.
- class qrl.env.ExpressibilityV0(n_qubits=4, max_blocks=12, max_steps=20, n_pairs_eval=120, bins=50, lambda_depth=0.002, lambda_2q=0.002, terminate_bonus=0.1, device_name='default.qubit', seed=None, allow_all_to_all=False, ffmpeg=False)[source]
Bases: QuantumEnv
Parameterized circuit expressibility optimization environment.
ExpressibilityV0 is a gymnasium.Env-compatible environment that models the construction of parameterized quantum circuits with high expressibility. In the context of variational quantum algorithms, expressibility measures how well an ansatz can explore the Hilbert space of quantum states relative to the Haar-random distribution.
The agent incrementally builds a circuit by adding or removing predefined rotation and entangling blocks, or by explicitly terminating construction. Rewards encourage circuits whose fidelity distribution closely matches the Haar distribution, while penalizing excessive circuit depth and two-qubit gate usage.
Key properties
Action space: Discrete set of architectural edits (add/remove blocks or terminate construction).
Observation space: Vector of circuit statistics summarizing depth, parameter count, entanglement, and recent expressibility estimates (shape (7,)).
Reward: Negative KL divergence to the Haar distribution with regularization penalties for depth and two-qubit gates.
Termination: Explicit termination by the agent, or truncation at max_steps.
Rendering
The render() method visualizes expressibility optimization via a two-panel animation showing the circuit’s fidelity distribution compared to the Haar-random distribution alongside a block-level diagram of the evolving circuit architecture.
Input Parameters
- n_qubits : int
Number of qubits in the circuit.
- max_blocks : int
Maximum number of blocks allowed in the circuit.
- max_steps : int
Maximum number of construction steps per episode.
- n_pairs_eval : int
Number of random state pairs used to estimate expressibility.
- bins : int
Number of histogram bins for fidelity distributions.
- lambda_depth : float
Penalty weight for circuit depth.
- lambda_2q : float
Penalty weight for two-qubit gate usage.
- terminate_bonus : float
Bonus reward for explicit termination.
- device_name : str
PennyLane device backend used for simulation.
- seed : int or None
Random seed for reproducibility.
- allow_all_to_all : bool
Whether to allow all-to-all entangling blocks.
- ffmpeg : bool
Whether to use FFmpeg when saving animations.
See also
- tutorials/expressibility
Tutorial on optimizing ansatz expressibility with block-based circuits.
- action_meanings()[source]
Return a mapping from action indices to action names.
- Returns:
Dictionary mapping integer action indices to human-readable architectural action names.
- Return type:
dict
- get_reward(action)[source]
The selected action modifies the circuit architecture by adding or removing a block, or terminates construction. Expressibility is evaluated after the update, and a reward is computed based on the deviation of the circuit’s fidelity distribution from the Haar distribution and on architectural penalties.
- Parameters:
action (int) – Index of the selected architectural action.
- Returns:
reward – Reward value combining expressibility and architectural penalties.
- Return type:
float
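The expressibility term can be illustrated by comparing a histogram of pairwise state fidelities with the Haar fidelity density, which for Hilbert-space dimension N is (N − 1)(1 − F)^(N − 2). The sketch below is one plausible reading of the description above, not the package's implementation: the function names are hypothetical, the exact way the penalties enter the reward is an assumption, and the terminate bonus is omitted.

```python
import numpy as np


def haar_bin_probabilities(n_qubits, bins):
    """Probability mass of the Haar fidelity density in each bin of [0, 1]."""
    dim = 2 ** n_qubits
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Integrating (dim - 1) * (1 - F)**(dim - 2) over a bin [a, b] gives this difference.
    return (1 - edges[:-1]) ** (dim - 1) - (1 - edges[1:]) ** (dim - 1)


def kl_to_haar(fidelities, n_qubits, bins):
    """Hypothetical helper: KL(empirical fidelity histogram || Haar histogram)."""
    clipped = np.clip(fidelities, 0.0, 1.0)  # guard against rounding slightly above 1
    counts, _ = np.histogram(clipped, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    q = haar_bin_probabilities(n_qubits, bins)
    mask = p > 0  # empty bins contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def shaped_reward(kl, depth, n_two_qubit, lambda_depth=0.002, lambda_2q=0.002):
    """Assumed reward shape: negative KL minus linear depth and two-qubit penalties."""
    return -kl - lambda_depth * depth - lambda_2q * n_two_qubit


def random_states(rng, n_states, n_qubits):
    """Haar-random pure states from normalized complex Gaussian vectors."""
    dim = 2 ** n_qubits
    raw = rng.normal(size=(n_states, dim)) + 1j * rng.normal(size=(n_states, dim))
    return raw / np.linalg.norm(raw, axis=1, keepdims=True)


rng = np.random.default_rng(0)
left = random_states(rng, 2000, n_qubits=2)
right = random_states(rng, 2000, n_qubits=2)
haar_fidelities = np.abs(np.sum(left.conj() * right, axis=1)) ** 2

# A maximally expressive sampler gives a small divergence; a circuit that always
# prepares the same state (all fidelities equal to 1) gives a large one.
kl_haar = kl_to_haar(haar_fidelities, n_qubits=2, bins=20)
kl_trivial = kl_to_haar(np.ones(2000), n_qubits=2, bins=20)
```

In this picture an agent is pushed toward circuits whose sampled fidelities look like haar_fidelities, while the two penalty weights discourage reaching that by depth alone.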
- render(save_path_without_extension=None, interval=800)[source]
Render the expressibility optimization process as an animation.
The animation shows:
1. A histogram of circuit fidelity distribution compared to the Haar-random distribution.
2. A block-diagram visualization of the evolving circuit architecture.
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset(*, seed=None, options=None)[source]
Reset the environment to an empty circuit.
Clears the current circuit architecture, resets internal counters, and initializes the observation vector corresponding to an empty ansatz.
- Parameters:
seed (int or None, optional) – Random seed for reproducibility. If provided, reinitializes the internal random number generator.
options (dict or None, optional) – Additional reset options (currently unused, included for Gymnasium compatibility).
- Returns:
observation (np.ndarray) – Initial observation vector describing an empty circuit, shape (7,).
info (dict) – Empty dictionary provided for Gymnasium API compatibility.
- step(action)[source]
Execute one architecture-modification step by calling the get_reward method.
- Parameters:
action (int) – Index of the selected architectural action.
- Returns:
observation (np.ndarray) – Updated observation vector summarizing circuit statistics, shape (7,).
reward (float) – Reward value combining expressibility and architectural penalties.
done (bool) – True if the episode ended due to termination by agent or truncation, False otherwise.
info (dict) – Diagnostic information including expressibility, KL divergence, depth, parameter count, current block sequence, and terminated (true if agent explicitly terminated, false if episode ended due to max steps).
- class qrl.env.ProbabilityV0(n_qubits, target_distribution, ansatz=None, **kwargs)[source]
Bases: QuantumEnv
Probability distribution matching environment for variational quantum circuits.
ProbabilityV0 is a gymnasium.Env-compatible environment that trains a parameterized quantum circuit to approximate a target probability distribution over computational basis states. The agent optimizes continuous circuit parameters so that the measurement statistics of the circuit match a specified target distribution.
This environment is suitable for distribution learning, quantum generative modeling, and variational circuit optimization tasks.
Key properties
Action space: Continuous parameter updates applied to the circuit ansatz.
Observation space: Probability distribution over 2**n_qubits basis states produced by the current circuit.
Reward: Negative weighted cost combining KL divergence and L2 distance to the target distribution, with an additional step penalty.
Termination: Success when the reward exceeds the specified tolerance, or truncation at max_steps.
Visualization
The render() method animates the evolution of the learned probability distribution relative to the target distribution, along with the reward trajectory over training steps.
Input Parameters
- n_qubits : int
Number of qubits in the circuit.
- target_distribution : np.ndarray
Target probability distribution over computational basis states.
- ansatz : callable or None
Custom parameterized circuit ansatz. If None, a default RY-based ansatz is used.
- max_steps : int
Maximum number of optimization steps per episode.
- tolerance : float
Reward threshold for early termination.
- alpha : float
Weight balancing KL divergence and L2 distance.
- beta : float
Penalty weight for step count.
- ffmpeg : bool
Whether to use FFmpeg when saving animations.
See also
- tutorials/probability
Tutorial on probability distribution learning with variational circuits.
- get_reward(params)[source]
Compute the reward for a given set of circuit parameters.
The reward is based on a weighted combination of:
- Kullback–Leibler (KL) divergence between the target distribution and the circuit output distribution.
- L2 distance between the target and circuit distributions.
- Parameters:
params (np.ndarray) – Vector of variational circuit parameters.
- Returns:
Scalar reward value encouraging the circuit output distribution to match the target distribution.
- Return type:
float
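A reward of this kind can be sketched without a quantum simulator by using a product of single-qubit RY rotations, for which the output distribution has a closed form. Everything below is illustrative: the function names, the convex combination alpha * KL + (1 − alpha) * L2, and the values alpha = 0.5 and beta = 0.01 are assumptions, and the package's default ansatz may differ from this unentangled one.

```python
import numpy as np


def ry_product_distribution(params):
    """Distribution from one RY(theta) per qubit applied to |0...0> (no entanglement).

    The first parameter is taken to act on the most significant qubit.
    """
    dist = np.array([1.0])
    for theta in params:
        single = np.array([np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2])
        dist = np.kron(dist, single)
    return dist


def reward(target, predicted, step, alpha=0.5, beta=0.01, eps=1e-12):
    """Assumed reward: -(alpha * KL(target || predicted) + (1 - alpha) * L2) - beta * step."""
    predicted = np.clip(predicted, eps, 1.0)  # avoid log of zero
    mask = target > 0                          # zero-probability targets add nothing to KL
    kl = np.sum(target[mask] * np.log(target[mask] / predicted[mask]))
    l2 = np.linalg.norm(target - predicted)
    return -float(alpha * kl + (1 - alpha) * l2) - beta * step


# Target: first qubit uniform, second qubit fixed to 0.
target = np.array([0.5, 0.0, 0.5, 0.0])

matching = ry_product_distribution([np.pi / 2, 0.0])
untrained = ry_product_distribution([0.0, 0.0])

reward_matching = reward(target, matching, step=0)
reward_untrained = reward(target, untrained, step=0)
reward_late = reward(target, matching, step=10)
```

The matching parameters give a reward near zero, the untrained circuit a strongly negative one, and the step penalty lowers the reward for reaching the same distribution later in the episode.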
- render(save_path_without_extension=None)[source]
Render the evolution of the probability distribution over training steps.
The animation shows a bar plot comparing the target probability distribution with the circuit’s predicted distribution at each step. Reward values are displayed in the plot title.
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset()[source]
Reset the environment to a random initial parameter configuration.
Initializes the circuit parameters randomly, clears episode history, and resets the step counter.
- Returns:
observation (np.ndarray) – Initial circuit parameter vector.
info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.
- step(action)[source]
Execute one optimization step.
Updates the circuit parameters using the provided action, evaluates the resulting probability distribution, computes the reward, and checks termination conditions.
- Parameters:
action (np.ndarray) – Parameter update vector applied additively to the current circuit parameters.
- Returns:
observation (np.ndarray) – Probability distribution over computational basis states produced by the circuit after the parameter update.
reward (float) – Reward value after applying the action.
done (bool) – True if the episode has terminated due to reaching the reward tolerance or the maximum number of steps.
info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.