qrl.env package

Subpackages

Module contents

class qrl.env.BlochSphereV0(*args: Any, **kwargs: Any)[source]

Bases: QuantumEnv

Single-qubit Bloch sphere environment for reinforcement learning.

BlochSphereV0 is a gymnasium.Env-compatible environment where an agent controls a single qubit via a discrete set of quantum gates. The qubit state is represented internally as a statevector and exposed to the agent as a 3D Bloch vector (x, y, z).

The objective is to steer the qubit from the fixed initial state |0⟩ to a target pure state (default |+⟩) within a limited number of steps by applying unitary gate actions.

Key details

  • Action space: Discrete set of single-qubit gates (Clifford + common rotations).

  • Observation space: Bloch vector (x, y, z), each component in [-1, 1].

  • Reward: Fidelity |⟨target | state⟩|² in [0, 1].

  • Termination: Success when reward exceeds reward_tolerance, or truncation at max_steps.

Rendering

The render() method visualizes the Bloch sphere and the agent’s trajectory, showing the current state and target state as arrows in 3D.

Input Parameters

  • target_state: Target pure state as a NumPy complex 2-vector. Defaults to |+⟩.

  • max_steps: Maximum number of steps per episode.

  • reward_tolerance: Fidelity threshold for successful termination.

  • ffmpeg: If set to True, animations are saved as mp4 videos, else as GIFs. Default is False.

See also

tutorials/bloch_sphere

get_reward(action)[source]

Apply a quantum gate action and compute the resulting reward.

This method evolves the internal qubit state by applying the unitary corresponding to the selected action and evaluates the fidelity with respect to the target state.

Parameters:

action (int) – Index of the selected action in self.actions.

Returns:

Fidelity between the current state and the target state, defined as |⟨target | state⟩|² and bounded in [0, 1].

Return type:

float
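The relation between the statevector, the Bloch vector observation, and the fidelity reward can be sketched in plain NumPy. This is a minimal illustration under textbook conventions, not the library's implementation; the helper names bloch_vector and fidelity are hypothetical.

```python
import numpy as np

# Textbook Hadamard gate (an assumption about the gate set's conventions).
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def bloch_vector(state):
    """Return (x, y, z) for a normalized single-qubit statevector a|0> + b|1>."""
    a, b = state
    x = 2 * np.real(np.conj(a) * b)
    y = 2 * np.imag(np.conj(a) * b)
    z = np.abs(a) ** 2 - np.abs(b) ** 2
    return np.array([x, y, z], dtype=float)

def fidelity(target, state):
    """Pure-state fidelity |<target|state>|^2 (np.vdot conjugates its first argument)."""
    return float(np.abs(np.vdot(target, state)) ** 2)

ket0 = np.array([1, 0], dtype=complex)                   # initial state |0>
ket_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)  # default target |+>

state = H @ ket0                          # apply one gate action
observation = bloch_vector(state)         # what the agent would observe
reward = fidelity(ket_plus, state)        # fidelity with the target
```

Starting from |0⟩, a single Hadamard reaches the default target |+⟩, so the fidelity rises from 0.5 to 1 in one step.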

render(save_path_without_extension=None, interval=800)[source]

Render the Bloch sphere trajectory as a 3D animation.

The visualization shows:

  • A translucent Bloch sphere with labeled basis states.

  • The target Bloch vector (green, static).

  • The evolving qubit state trajectory (red, dynamic).

Parameters:
  • save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

  • interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset()[source]

Reset the environment to the initial state.

The qubit is initialized to the computational basis state |0⟩. Episode step count and history are cleared.

Returns:

  • observation (np.ndarray) – Initial Bloch vector corresponding to |0⟩, shape (3,).

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium API.

step(action)[source]

Execute one environment step.

Applies the selected quantum gate, updates the internal state and history, computes the reward, and checks termination conditions.

Parameters:

action (int) – Index of the selected action in self.actions.

Returns:

  • observation (np.ndarray) – Updated Bloch vector of the qubit state, shape (3,).

  • reward (float) – Fidelity-based reward after applying the action.

  • done (bool) – True if the episode has terminated due to success or truncation.

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium API.

class qrl.env.BlochSphereV1(target_state=2, max_steps=10, reward_tolerance=0.99, ffmpeg=False)[source]

Bases: QuantumEnv

Single-qubit Bloch sphere environment as a graph problem for reinforcement learning.

BlochSphereV1 is a gymnasium.Env-compatible environment where an agent controls a single qubit via a discrete set of quantum gates. The qubit state is exposed to the agent as an integer index corresponding to the discrete states |0⟩, |1⟩, |+⟩, |-⟩, |+i⟩, |-i⟩.

The objective is to steer the qubit from the fixed initial state |0⟩ to a user-defined target pure state (default |+⟩) within a limited number of steps by applying unitary gate actions.

The environment is fully compatible with ValueIteration and QValueIteration from qrl.algorithms.
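Because the six states and four gates form a small deterministic graph, tabular planning applies directly. The following is a minimal value-iteration sketch in plain NumPy, not the library's ValueIteration class. The table T is an assumption derived from the textbook action of H, X, Z, S on the six states (it may differ from what transition_table() returns), and gamma and sweeps are hypothetical settings.

```python
import numpy as np

# Assumed transition table: rows are states 0..5 (|0>, |1>, |+>, |->, |+i>, |-i>),
# columns are actions 0..3 (H, X, Z, S). Derived from textbook gate definitions.
T = np.array([
    [2, 1, 0, 0],
    [3, 0, 1, 1],
    [0, 2, 3, 4],
    [1, 3, 2, 5],
    [5, 5, 5, 3],
    [4, 4, 4, 2],
])

def value_iteration(T, target, gamma=0.9, sweeps=50):
    """Tabular value iteration with reward 1 for entering the target state."""
    V = np.zeros(T.shape[0])
    for _ in range(sweeps):
        reward = (T == target).astype(float)       # 1.0 when the action hits the target
        future = gamma * V[T] * (T != target)      # no bootstrapping past the terminal state
        Q = reward + future
        V = Q.max(axis=1)
        V[target] = 0.0                            # the target is terminal
    return V, Q.argmax(axis=1)

V, policy = value_iteration(T, target=2)           # target index 2 is |+>
```

Under these assumptions the greedy action in state 0 is H, which reaches |+⟩ in one step, while |1⟩ needs two gates.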

Key details

  • Action space: Discrete set of single-qubit gates (H, X, Z, S).

  • Observation space: Integer index corresponding to the discrete states |0⟩, |1⟩, |+⟩, |-⟩, |+i⟩, |-i⟩.

  • Reward: Fidelity |⟨target | state⟩|² in [0, 1].

  • Termination: Success when reward exceeds reward_tolerance, or truncation at max_steps.

Parameters:
  • target_state (int, optional) – Target state index in [0, 5]. Defaults to 2 (|+⟩). The mapping is: 0 → |0⟩, 1 → |1⟩, 2 → |+⟩, 3 → |-⟩, 4 → |+i⟩, 5 → |-i⟩.

  • max_steps (int, optional) – Maximum number of steps per episode. Default is 10.

  • reward_tolerance (float, optional) – Fidelity threshold for successful termination. Must be in (0, 1]. Default is 0.99.

  • ffmpeg (bool, optional) – If True, animations are saved as MP4 via ffmpeg, else as GIFs. Default is False.

Raises:

  • ValueError – If target_state is not in [0, 5].

  • ValueError – If reward_tolerance is not in (0, 1].

  • ValueError – If ffmpeg=True but ffmpeg is not installed on the system.

property bloch_vector: pennylane.numpy.ndarray

Current Bloch vector (x, y, z).

get_reward()[source]

Compute the reward for the current state and update termination flags.

Evaluates the fidelity between the current statevector and the target statevector. Sets self.terminated if fidelity meets or exceeds reward_tolerance, and self.truncated if the step limit is reached.

Returns:

1.0 if the current state matches the target within reward_tolerance, 0.0 otherwise.

Return type:

float

render(save_path_without_extension, interval=600, ffmpeg=False)[source]

Render accumulated graph frames as an animation and save to disk.

Assembles the list of graph snapshots captured by _render_graph() into a single animation. Each frame corresponds to one call to _render_graph(), producing a visual record of the agent’s learning progression over episodes.

Parameters:
  • save_path_without_extension (str) – File path (without extension) where the animation will be saved. The appropriate extension (.mp4 or .gif) is appended automatically based on the ffmpeg argument.

  • interval (int, optional) – Delay between frames in milliseconds. Default is 600.

  • ffmpeg (bool, optional) – If True, saves the animation as an MP4 using ffmpeg. If False, saves as a GIF using Pillow. Default is False.

Raises:

ValueError – If _render_graph() has not been called and no frames are available.

Returns:

This method produces an animation file but does not return a value.

Return type:

None

reset(*, seed=None, options=None)[source]

Reset the environment to the initial state.

The qubit is placed at state index 0 (|0⟩). Episode step count, history, and termination flags are cleared.

Parameters:
  • seed (int or None, optional) – Random seed passed to the base gymnasium.Env reset. Default is None.

  • options (dict or None, optional) – Additional options passed to the base reset. Default is None.

Returns:

  • observation (int) – Initial state index (always 0, corresponding to |0⟩).

  • info (dict) – Dictionary containing fidelity, gate ("reset"), and bloch_vector of the initial state.

property state_index: int

Current state index (0-5).

step(action)[source]

Apply a gate action and advance the episode by one step.

Applies the unitary gate corresponding to action to the current statevector, updates the discrete state index via the transition table, increments the step counter, and appends the new state to history.

Parameters:

action (int) – Index into ACTION_NAMES selecting the gate to apply. 0 → H, 1 → X, 2 → Z, 3 → S.

Returns:

  • observation (int) – New discrete state index after applying the gate.

  • reward (float) – 1.0 if the target is reached within tolerance, 0.0 otherwise.

  • terminated (bool) – True if fidelity ≥ reward_tolerance.

  • truncated (bool) – True if the step limit max_steps has been reached.

  • info (dict) – Dictionary containing fidelity, gate name applied, and bloch_vector of the resulting state.

static transition_table()[source]

Return the deterministic state-transition table for the environment.

Each entry T[s, a] gives the next state index when action a is taken from state s. Rows correspond to the 6 Bloch sphere states and columns to the 4 gate actions (H, X, Z, S).

Returns:

Integer array of shape (6, 4) where T[s, a] = s'.

Return type:

np.ndarray
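A table of this shape can be derived by applying each gate to each of the six states and matching the result against the state list up to a global phase. The sketch below does this with textbook gate matrices; it is an illustration under those conventions, not the library's code, and the names STATES, GATES, and build_transition_table are hypothetical.

```python
import numpy as np

r = 1 / np.sqrt(2)

# Index order follows the documented mapping: |0>, |1>, |+>, |->, |+i>, |-i>.
STATES = np.array(
    [[1, 0], [0, 1], [r, r], [r, -r], [r, 1j * r], [r, -1j * r]],
    dtype=complex,
)

# Column order follows the documented action order: H, X, Z, S.
GATES = [
    np.array([[1, 1], [1, -1]], dtype=complex) * r,   # H
    np.array([[0, 1], [1, 0]], dtype=complex),        # X
    np.array([[1, 0], [0, -1]], dtype=complex),       # Z
    np.array([[1, 0], [0, 1j]], dtype=complex),       # S
]

def build_transition_table():
    """Return T with T[s, a] = index of the state reached from s under gate a."""
    table = np.zeros((len(STATES), len(GATES)), dtype=int)
    for s, psi in enumerate(STATES):
        for a, gate in enumerate(GATES):
            new_state = gate @ psi
            # |<state_j|new_state>|^2 is 1 for the matching state, ignoring global phase.
            overlaps = np.abs(STATES.conj() @ new_state) ** 2
            table[s, a] = int(np.argmax(overlaps))
    return table

table = build_transition_table()
```

Matching on the squared overlap rather than on raw amplitudes is what makes the lookup insensitive to global phase, for example X|-⟩ = -|-⟩ still maps to index 3.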

class qrl.env.CompilerV0(*args: Any, **kwargs: Any)[source]

Bases: QuantumEnv

Single-qubit quantum gate compilation environment.

CompilerV0 is a gymnasium.Env-compatible environment that models the problem of compiling a target single-qubit unitary using a fixed, discrete gate set. The agent incrementally applies quantum gates to build a circuit whose resulting unitary approximates a given target operation in SU(2).

At each step, the agent selects a gate action that left-multiplies the current circuit unitary. The episode reward is based on the average gate fidelity between the current unitary and the target unitary, encouraging the agent to discover short, high-fidelity gate sequences.

Key properties

  • Action space: Discrete set of single-qubit gates (Clifford + rotations).

  • Observation space: Flattened real and imaginary parts of the current 2×2 unitary (shape (8,)).

  • Reward: Average gate fidelity with respect to the target unitary.

  • Termination: Success when fidelity exceeds reward_tolerance or truncation at max_steps.

Rendering

The render() method visualizes the compilation process by displaying a heatmap of the magnitude of the difference matrix |U_target - U| over time, annotated with the current step, last applied gate, and reward.

Input Parameters

  • target (np.ndarray) – Target 2×2 unitary matrix in SU(2) to compile towards.

  • max_steps (int) – Maximum number of gate applications per episode.

  • reward_tolerance (float) – Fidelity threshold for early termination.

  • ffmpeg (bool) – Whether to use FFmpeg when saving animations.

See also

CompilerV0

Step-by-step tutorial on compiling SU(2) unitaries using CompilerV0.

get_reward(action)[source]

Apply a quantum gate action and compute the compilation reward.

This method left-multiplies the current circuit unitary by the unitary corresponding to the selected action and evaluates the average gate fidelity with respect to the target unitary.

Parameters:

action (int) – Index of the selected action in self.actions.

Returns:

Average gate fidelity between the current unitary and the target unitary, defined as 0.5 * |Tr(U_target† · U)| for a single-qubit system.

Return type:

float
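The update and reward described above can be sketched in NumPy as follows. This is a minimal illustration of the documented formula 0.5 * |Tr(U_target† · U)| and of an 8-element observation, not the library's implementation. The helper names are hypothetical, and the ordering of the flattened vector (real parts first, then imaginary parts) is an assumption.

```python
import numpy as np

def gate_fidelity(u_target, u):
    """Documented reward formula for one qubit: 0.5 * |Tr(U_target^dagger U)|."""
    return float(0.5 * np.abs(np.trace(u_target.conj().T @ u)))

def flatten_unitary(u):
    """Flatten a 2x2 unitary into 8 reals (assumed order: real parts, then imaginary parts)."""
    return np.concatenate([u.real.ravel(), u.imag.ravel()]).astype(np.float32)

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

target = H                                   # example target unitary
u = np.eye(2, dtype=complex)                 # circuit unitary after reset
fidelity_start = gate_fidelity(target, u)

u = H @ u                                    # a gate action left-multiplies the circuit unitary
fidelity_after = gate_fidelity(target, u)
observation = flatten_unitary(u)
```

Taking the absolute value of the trace makes the reward insensitive to a global phase between the compiled circuit and the target.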

render(save_path_without_extension=None, interval=800)[source]

Render the compilation process as an animation of the difference matrix.

The visualization shows the magnitude of the element-wise difference |U_target - U| as a heatmap that evolves over time, along with annotations indicating the current step, applied action, and reward.

Parameters:
  • save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

  • interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset()[source]

Reset the environment to the initial compilation state.

The circuit unitary is reset to the identity matrix, the step counter is cleared, and the history buffer is reinitialized.

Returns:

  • observation (np.ndarray) – Flattened observation corresponding to the identity unitary, shape (8,).

  • info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.

step(action)[source]

Execute one compilation step.

Applies the selected gate, updates the internal circuit unitary and history, computes the reward, and checks termination conditions.

Parameters:

action (int) – Index of the selected action in self.actions.

Returns:

  • observation (np.ndarray) – Updated flattened unitary observation, shape (8,).

  • reward (float) – Average gate fidelity after applying the action.

  • done (bool) – True if the episode has terminated due to reaching the fidelity threshold or the maximum number of steps.

  • info (dict) – Empty dictionary provided for compatibility with the Gymnasium API.

class qrl.env.ErrorChannelV0(n_qubits=3, faulty_qubits=None, max_steps=10, seed=None, ffmpeg=False)[source]

Bases: QuantumEnv

Multi-qubit error mitigation environment with bit-flip noise.

ErrorChannelV0 is a gymnasium.Env-compatible environment that models a noisy multi-qubit quantum system affected by independent bit-flip error channels. Each qubit may experience noise with a different probability, and the agent’s task is to apply corrective Pauli-X operations to recover the target computational basis state |0…0⟩.

The environment captures a simplified quantum error mitigation scenario, where the agent sequentially selects qubits on which to apply corrections based on observed measurement probabilities.

Key properties

  • Action space: Discrete choice of qubit index on which to apply an X gate.

  • Observation space: Probability distribution over all computational basis states (shape (2**n_qubits,)).

  • Reward: Negative mean-squared error between the corrected distribution and the ideal |0…0⟩ distribution.

  • Termination: Success when perfect correction is achieved or truncation at max_steps.

Rendering

The render() method visualizes the mitigation process using a side-by-side animation that compares ideal, noisy, and corrected probability distributions, along with a dynamically updated circuit diagram showing applied corrections.

Input Parameters

  • n_qubits (int) – Number of qubits in the system.

  • faulty_qubits (dict[int, float] or None) – Mapping from qubit indices to bit-flip probabilities.

  • max_steps (int) – Maximum number of correction steps per episode.

  • seed (int or None) – Random seed for reproducibility.

  • ffmpeg (bool) – If True, animations are saved as MP4 using FFmpeg; otherwise they are saved as GIFs using Pillow.

See also

ErrorChannelV0

Tutorial on multi-qubit error mitigation with bit-flip noise.

get_reward(action)[source]

Apply a correction action and compute the reward.

The selected qubit index is appended to the correction list, the noisy and corrected circuits are evaluated, and the reward is computed as the negative mean-squared error between the corrected and target probability distributions.

Parameters:

action (int) – Index of the qubit on which a Pauli-X correction is applied.

Returns:

Reward value defined as the negative mean-squared error between the corrected probability distribution and the target distribution.

Return type:

float
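The negative mean-squared-error reward can be illustrated with a toy model in NumPy. This sketch assumes independent qubits, treats each bit-flip probability as the chance of measuring 1, and takes qubit 0 as the most significant bit; none of these conventions are confirmed by the source, and the helper names are hypothetical.

```python
import numpy as np

def basis_distribution(p_one):
    """Joint distribution over basis states for independent qubits.

    p_one[q] is the probability of measuring qubit q in |1>.
    Qubit 0 is treated as the most significant bit (an assumption).
    """
    dist = np.array([1.0])
    for p in p_one:
        dist = np.kron(dist, np.array([1.0 - p, p]))
    return dist

def neg_mse_reward(corrected, ideal):
    """Negative mean-squared error between two probability vectors."""
    return -float(np.mean((corrected - ideal) ** 2))

n_qubits = 2
faulty_qubits = {0: 1.0}                      # hypothetical: qubit 0 always flips
p_one = [faulty_qubits.get(q, 0.0) for q in range(n_qubits)]

ideal = np.zeros(2 ** n_qubits)
ideal[0] = 1.0                                # all probability on |0...0>

noisy = basis_distribution(p_one)
reward_before = neg_mse_reward(noisy, ideal)

p_one[0] = 1.0 - p_one[0]                     # X on qubit 0 swaps its 0/1 outcomes
corrected = basis_distribution(p_one)
reward_after = neg_mse_reward(corrected, ideal)
```

In this toy case the reward rises from -0.5 to 0 once the flipped qubit is corrected, which corresponds to the perfect-correction termination condition.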

render(save_path_without_extension=None, interval_ms=600)[source]

Render the error-mitigation process as an animated visualization.

The animation consists of:

  • A bar chart comparing ideal, noisy, and corrected probability distributions for each computational basis state.

  • A dynamically updated ASCII-style circuit diagram showing the applied correction operations.

Parameters:
  • save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

  • interval_ms (int, optional) – Time between animation frames in milliseconds. Default is 600.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset(*, seed=None)[source]

Reset the environment to the initial noisy state.

Clears the correction history, resets the step counter, and evaluates the noisy circuit without any corrective operations.

Parameters:

seed (int or None, optional) – Random seed for reproducibility. If provided, reinitializes the internal random number generator.

Returns:

observation – Initial corrected probability distribution over computational basis states (identical to the noisy distribution at reset), with dtype float32.

Return type:

np.ndarray

step(action)[source]

Execute one environment step.

Applies a correction action, updates internal state and history, computes the reward, and checks termination conditions.

Parameters:

action (int) – Index of the qubit on which a Pauli-X correction is applied.

Returns:

  • observation (np.ndarray) – Corrected probability distribution over computational basis states, with dtype float32.

  • reward (float) – Negative mean-squared error between the corrected and target distributions.

  • done (bool) – True if the episode has terminated due to reaching the maximum number of steps or achieving perfect correction.

  • info (dict) – Dictionary containing metadata about the environment, including the mapping of faulty qubits.

class qrl.env.ExpressibilityV0(n_qubits=4, max_blocks=12, max_steps=20, n_pairs_eval=120, bins=50, lambda_depth=0.002, lambda_2q=0.002, terminate_bonus=0.1, device_name='default.qubit', seed=None, allow_all_to_all=False, ffmpeg=False)[source]

Bases: QuantumEnv

Parameterized circuit expressibility optimization environment.

ExpressibilityV0 is a gymnasium.Env-compatible environment that models the construction of parameterized quantum circuits with high expressibility. In the context of variational quantum algorithms, expressibility measures how well an ansatz can explore the Hilbert space of quantum states relative to the Haar-random distribution.

The agent incrementally builds a circuit by adding or removing predefined rotation and entangling blocks, or by explicitly terminating construction. Rewards encourage circuits whose fidelity distribution closely matches the Haar distribution, while penalizing excessive circuit depth and two-qubit gate usage.

Key properties

  • Action space: Discrete set of architectural edits (add/remove blocks or terminate construction).

  • Observation space: Vector of circuit statistics summarizing depth, parameter count, entanglement, and recent expressibility estimates (shape (7,)).

  • Reward: Negative KL divergence to the Haar distribution with regularization penalties for depth and two-qubit gates.

  • Termination: Explicit termination by the agent or truncation at max_steps.

Rendering

The render() method visualizes expressibility optimization via a two-panel animation showing the circuit’s fidelity distribution compared to the Haar-random distribution alongside a block-level diagram of the evolving circuit architecture.

Input Parameters

  • n_qubits (int) – Number of qubits in the circuit.

  • max_blocks (int) – Maximum number of blocks allowed in the circuit.

  • max_steps (int) – Maximum number of construction steps per episode.

  • n_pairs_eval (int) – Number of random state pairs used to estimate expressibility.

  • bins (int) – Number of histogram bins for fidelity distributions.

  • lambda_depth (float) – Penalty weight for circuit depth.

  • lambda_2q (float) – Penalty weight for two-qubit gate usage.

  • terminate_bonus (float) – Bonus reward for explicit termination.

  • device_name (str) – PennyLane device backend used for simulation.

  • seed (int or None) – Random seed for reproducibility.

  • allow_all_to_all (bool) – Whether to allow all-to-all entangling blocks.

  • ffmpeg (bool) – Whether to use FFmpeg when saving animations.

See also

ExpressibilityV0

Tutorial on optimizing ansatz expressibility with block-based circuits.

action_meanings()[source]

Return a mapping from action indices to action names.

Returns:

Dictionary mapping integer action indices to human-readable architectural action names.

Return type:

dict

get_reward(action)[source]

Apply an architectural action and compute the resulting reward.

The selected action modifies the circuit architecture by adding or removing a block, or by terminating construction. Expressibility is evaluated after the update, and a reward is computed based on the circuit’s deviation from the Haar distribution and architectural penalties.

Parameters:

action (int) – Index of the selected architectural action.

Returns:

reward – Reward value combining expressibility and architectural penalties.

Return type:

float
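One plausible way to combine these terms is sketched below in NumPy. It uses the standard Haar fidelity density (N-1)(1-F)^(N-2) with N = 2**n_qubits, integrated over each histogram bin, and a KL divergence between the circuit's fidelity histogram and that reference. The exact form of the library's reward is not stated in the source, so the way the penalties are combined here is an assumption, the terminate bonus is omitted, and all function names are hypothetical.

```python
import numpy as np

def haar_bin_probs(n_qubits, bins):
    """Probability mass of the Haar fidelity density (N-1)(1-F)^(N-2) in each bin."""
    n = 2 ** n_qubits
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Closed-form integral of the density over [edges[i], edges[i+1]].
    return (1 - edges[:-1]) ** (n - 1) - (1 - edges[1:]) ** (n - 1)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def expressibility_reward_sketch(fidelities, n_qubits, bins, depth, n_2q,
                                 lambda_depth=0.002, lambda_2q=0.002):
    """Assumed form: -KL - lambda_depth * depth - lambda_2q * (two-qubit gate count)."""
    hist, _ = np.histogram(fidelities, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    kl = kl_divergence(p, haar_bin_probs(n_qubits, bins))
    return -kl - lambda_depth * depth - lambda_2q * n_2q

n_qubits, bins, n_pairs = 2, 10, 120

# An empty circuit always prepares the same state, so every pair fidelity is 1.
reward_empty = expressibility_reward_sketch(np.ones(n_pairs), n_qubits, bins, depth=0, n_2q=0)

# Fidelities drawn from the Haar density via its inverse CDF, as an idealized ansatz.
rng = np.random.default_rng(0)
u = rng.random(n_pairs)
haar_like = 1 - (1 - u) ** (1 / (2 ** n_qubits - 1))
reward_haar = expressibility_reward_sketch(haar_like, n_qubits, bins, depth=3, n_2q=2)
```

The comparison between the two rewards shows the intended trade-off: a Haar-like fidelity distribution earns a much higher reward than an empty circuit, even after paying the depth and two-qubit penalties.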

render(save_path_without_extension=None, interval=800)[source]

Render the expressibility optimization process as an animation.

The animation shows:

  1. A histogram of circuit fidelity distribution compared to the Haar-random distribution.

  2. A block-diagram visualization of the evolving circuit architecture.

Parameters:
  • save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

  • interval (int, optional) – Delay between animation frames in milliseconds. Default is 800.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset(*, seed=None, options=None)[source]

Reset the environment to an empty circuit.

Clears the current circuit architecture, resets internal counters, and initializes the observation vector corresponding to an empty ansatz.

Parameters:
  • seed (int or None, optional) – Random seed for reproducibility. If provided, reinitializes the internal random number generator.

  • options (dict or None, optional) – Additional reset options (currently unused, included for Gymnasium compatibility).

Returns:

  • observation (np.ndarray) – Initial observation vector describing an empty circuit, shape (7,).

  • info (dict) – Empty dictionary provided for Gymnasium API compatibility.

step(action)[source]

Execute one architecture-modification step by calling the get_reward method.

Parameters:

action (int) – Index of the selected architectural action.

Returns:

  • observation (np.ndarray) – Updated observation vector summarizing circuit statistics, shape (7,).

  • reward (float) – Reward value combining expressibility and architectural penalties.

  • done (bool) – True if the episode ended due to termination by agent or truncation, False otherwise.

  • info (dict) – Diagnostic information including expressibility, KL divergence, depth, parameter count, current block sequence, and terminated (true if agent explicitly terminated, false if episode ended due to max steps).

class qrl.env.ProbabilityV0(n_qubits, target_distribution, ansatz=None, **kwargs)[source]

Bases: QuantumEnv

Probability distribution matching environment for variational quantum circuits.

ProbabilityV0 is a gymnasium.Env-compatible environment that trains a parameterized quantum circuit to approximate a target probability distribution over computational basis states. The agent optimizes continuous circuit parameters so that the measurement statistics of the circuit match a specified target distribution.

This environment is suitable for distribution learning, quantum generative modeling, and variational circuit optimization tasks.

Key properties

  • Action space: Continuous parameter updates applied to the circuit ansatz.

  • Observation space: Probability distribution over 2**n_qubits basis states produced by the current circuit.

  • Reward: Negative weighted cost combining KL divergence and L2 distance to the target distribution, with an additional step penalty.

  • Termination: Success when the reward exceeds the specified tolerance or truncation at max_steps.

Visualization

The render() method animates the evolution of the learned probability distribution relative to the target distribution, along with the reward trajectory over training steps.

Input Parameters

  • n_qubits (int) – Number of qubits in the circuit.

  • target_distribution (np.ndarray) – Target probability distribution over computational basis states.

  • ansatz (callable or None) – Custom parameterized circuit ansatz. If None, a default RY-based ansatz is used.

  • max_steps (int) – Maximum number of optimization steps per episode.

  • tolerance (float) – Reward threshold for early termination.

  • alpha (float) – Weight balancing KL divergence and L2 distance.

  • beta (float) – Penalty weight for step count.

  • ffmpeg (bool) – Whether to use FFmpeg when saving animations.

See also

ProbabilityV0

Tutorial on probability distribution learning with variational circuits.

get_reward(params)[source]

Compute the reward for a given set of circuit parameters.

The reward is based on a weighted combination of:

  • Kullback–Leibler (KL) divergence between the target distribution and the circuit output distribution.

  • L2 distance between the target and circuit distributions.

Parameters:

params (np.ndarray) – Vector of variational circuit parameters.

Returns:

Scalar reward value encouraging the circuit output distribution to match the target distribution.

Return type:

float
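A reward of this kind might be computed as in the sketch below. The source does not give the exact formula, so the convex combination alpha * KL + (1 - alpha) * L2 + beta * step is an assumption, as are the default values of alpha, beta, and eps and the function name.

```python
import numpy as np

def distribution_reward_sketch(probs, target, step, alpha=0.5, beta=0.01, eps=1e-12):
    """Assumed form: -(alpha * KL(target || probs) + (1 - alpha) * L2 + beta * step)."""
    p = np.clip(target, eps, None)      # clip to avoid log(0)
    q = np.clip(probs, eps, None)
    kl = np.sum(p * np.log(p / q))
    l2 = np.linalg.norm(target - probs)
    cost = alpha * kl + (1 - alpha) * l2 + beta * step
    return -float(cost)

target = np.array([0.5, 0.5, 0.0, 0.0])
uniform = np.full(4, 0.25)

reward_match = distribution_reward_sketch(target, target, step=0)
reward_uniform = distribution_reward_sketch(uniform, target, step=2)
```

A perfect match at step 0 gives the maximum reward of zero under this form; any mismatch or additional step pushes the reward negative.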

render(save_path_without_extension=None)[source]

Render the evolution of the probability distribution over training steps.

The animation shows a bar plot comparing the target probability distribution with the circuit’s predicted distribution at each step. Reward values are displayed in the plot title.

Parameters:

save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset()[source]

Reset the environment to a random initial parameter configuration.

Initializes the circuit parameters randomly, clears episode history, and resets the step counter.

Returns:

  • observation (np.ndarray) – Initial circuit parameter vector.

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.

step(action)[source]

Execute one optimization step.

Updates the circuit parameters using the provided action, evaluates the resulting probability distribution, computes the reward, and checks termination conditions.

Parameters:

action (np.ndarray) – Parameter update vector applied additively to the current circuit parameters.

Returns:

  • observation (np.ndarray) – Probability distribution over computational basis states produced by the circuit after the parameter update.

  • reward (float) – Reward value after applying the action.

  • done (bool) – True if the episode has terminated due to reaching the reward tolerance or the maximum number of steps.

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.
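The additive update and the resulting observation can be illustrated with a toy circuit in NumPy. The sketch assumes one RY rotation per qubit acting on |0⟩ with no entangling gates, which is a simplification and not necessarily the library's default ansatz; the function name and the qubit ordering (qubit 0 as the most significant bit) are likewise assumptions.

```python
import numpy as np

def ry_product_probs(params):
    """Basis-state probabilities for RY(theta_q) applied to each qubit of |0...0>.

    RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so P(1) = sin^2(theta/2).
    """
    dist = np.array([1.0])
    for theta in params:
        p1 = np.sin(theta / 2) ** 2
        dist = np.kron(dist, np.array([1.0 - p1, p1]))
    return dist

params = np.zeros(2)                     # toy starting parameters (reset() draws them randomly)
action = np.array([np.pi, 0.0])          # parameter update proposed by the agent

params = params + action                 # the action is applied additively
observation = ry_product_probs(params)   # distribution observed after the update
```

Here a rotation by π on the first qubit moves all probability mass from |00⟩ to |10⟩, which is the kind of change the agent sees in its next observation.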