ProbabilityV0

Implementation of the ProbabilityV0 environment.

Author: Jay Shah (@Jayshah25)

Contact: jay.shah@qrlqai.com

License: Apache-2.0

class qrl.env.core.probability.ProbabilityV0(n_qubits, target_distribution, ansatz=None, **kwargs)[source]

Bases: QuantumEnv

Probability distribution matching environment for variational quantum circuits.

ProbabilityV0 is a gymnasium.Env-compatible environment that trains a parameterized quantum circuit to approximate a target probability distribution over computational basis states. The agent optimizes continuous circuit parameters so that the measurement statistics of the circuit match a specified target distribution.

This environment is suitable for distribution learning, quantum generative modeling, and variational circuit optimization tasks.

Key properties

  • Action space: Continuous parameter updates applied to the circuit ansatz.

  • Observation space: Probability distribution over 2**n_qubits basis states produced by the current circuit.

  • Reward: Negative weighted cost combining KL divergence and L2 distance to the target distribution, with an additional step penalty.

  • Termination: Success when the reward exceeds the specified tolerance, or truncation at max_steps.
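To make the observation concrete, the following is a minimal sketch of the kind of probability vector a simple RY-based circuit could produce. It assumes one RY rotation per qubit and no entangling gates, which may differ from the default ansatz shipped with the library; the function name `ry_product_distribution` and the bit ordering (qubit 0 as the most significant bit) are illustrative choices, not part of the documented API. Plain Python lists are used to keep the sketch dependency-free.

```python
import math
from itertools import product


def ry_product_distribution(thetas):
    """Basis-state probabilities for independent RY(theta) rotations on |0>.

    Assumption: one RY gate per qubit and no entangling gates, so the
    distribution factorizes. Qubit 0 is treated as the most significant bit.
    """
    # RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so P(1) = sin^2(theta/2)
    p_one = [math.sin(t / 2.0) ** 2 for t in thetas]
    probs = []
    for bits in product((0, 1), repeat=len(thetas)):
        p = 1.0
        for bit, p1 in zip(bits, p_one):
            p *= p1 if bit else (1.0 - p1)
        probs.append(p)
    return probs


# Two qubits: the first stays in |0>, the second is rotated to an equal superposition.
obs = ry_product_distribution([0.0, math.pi / 2])
```

The resulting list has 2**n_qubits entries that sum to 1, matching the shape of the observation described above.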

Visualization

The render() method animates the evolution of the learned probability distribution relative to the target distribution, along with the reward trajectory over training steps.

Input Parameters

n_qubits (int) – Number of qubits in the circuit.

target_distribution (np.ndarray) – Target probability distribution over computational basis states.

ansatz (callable or None) – Custom parameterized circuit ansatz. If None, a default RY-based ansatz is used.

max_steps (int) – Maximum number of optimization steps per episode.

tolerance (float) – Reward threshold for early termination.

alpha (float) – Weight balancing KL divergence and L2 distance.

beta (float) – Penalty weight for step count.

ffmpeg (bool) – Whether to use FFmpeg when saving animations.
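Because the target distribution is defined over computational basis states, it presumably needs 2**n_qubits non-negative entries that sum to 1. The sketch below checks these conditions before a vector is passed as target_distribution. Whether the environment performs such validation itself is not stated here, and `check_target_distribution` is a hypothetical helper, not a library function.

```python
import math


def check_target_distribution(target, n_qubits, atol=1e-8):
    """Return True if `target` is a valid distribution over 2**n_qubits states."""
    if len(target) != 2 ** n_qubits:
        return False  # wrong number of basis states
    if any(p < 0.0 for p in target):
        return False  # probabilities must be non-negative
    return math.isclose(sum(target), 1.0, abs_tol=atol)


n_qubits = 2
# Illustrative target: all weight split between |00> and |11>.
target = [0.5, 0.0, 0.0, 0.5]
is_valid = check_target_distribution(target, n_qubits)
```

A vector that passes the check could then be converted to an np.ndarray and supplied as target_distribution together with n_qubits.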

See also

tutorials/probability

Tutorial on probability distribution learning with variational circuits.

get_reward(params)[source]

Compute the reward for a given set of circuit parameters.

The reward is based on a weighted combination of:

  • Kullback–Leibler (KL) divergence between the target distribution and the circuit output distribution.

  • L2 distance between the target and circuit distributions.

Parameters:

params (np.ndarray) – Vector of variational circuit parameters.

Returns:

Scalar reward value encouraging the circuit output distribution to match the target distribution.

Return type:

float
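The exact formula is not given in this reference, so the following is only a sketch of one way such a reward could be assembled. It assumes a convex combination `alpha * KL + (1 - alpha) * L2`, a linear step penalty `beta * step`, and a small `eps` to guard against log(0). The function name `sketch_reward` and its `step` and `eps` arguments are illustrative and do not appear in the documented signature of get_reward.

```python
import math


def sketch_reward(target, predicted, alpha=0.5, beta=0.0, step=0, eps=1e-12):
    """Negative weighted cost; an assumed form, not the library's formula."""
    # KL(target || predicted); terms with zero target probability contribute 0.
    kl = sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )
    # Euclidean (L2) distance between the two probability vectors.
    l2 = math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)))
    cost = alpha * kl + (1.0 - alpha) * l2
    # Assumed linear penalty on the number of steps taken so far.
    return -cost - beta * step


target = [0.5, 0.0, 0.0, 0.5]
perfect = sketch_reward(target, target)
mismatch = sketch_reward(target, [0.25, 0.25, 0.25, 0.25])
```

Under these assumptions the reward is 0 when the distributions match exactly and becomes more negative as they diverge, which is consistent with a tolerance that the reward must exceed for early termination.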

render(save_path_without_extension=None)[source]

Render the evolution of the probability distribution over training steps.

The animation shows a bar plot comparing the target probability distribution with the circuit’s predicted distribution at each step. Reward values are displayed in the plot title.

Parameters:

save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.

Returns:

This method produces a visualization but does not return a value.

Return type:

None

reset()[source]

Reset the environment to a random initial parameter configuration.

Initializes the circuit parameters randomly, clears episode history, and resets the step counter.

Returns:

  • observation (np.ndarray) – Initial circuit parameter vector.

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.

step(action)[source]

Execute one optimization step.

Updates the circuit parameters using the provided action, evaluates the resulting probability distribution, computes the reward, and checks termination conditions.

Parameters:

action (np.ndarray) – Parameter update vector applied additively to the current circuit parameters.

Returns:

  • observation (np.ndarray) – Probability distribution over computational basis states produced by the circuit after the parameter update.

  • reward (float) – Reward value after applying the action.

  • done (bool) – True if the episode has terminated due to reaching the reward tolerance or the maximum number of steps.

  • info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.
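The reset() and step() return values documented above suggest an interaction loop along the following lines. This sketch is written against any object exposing that interface; `run_episode`, `propose_action`, and `StubEnv` are hypothetical names introduced for illustration. `StubEnv` is not ProbabilityV0: it only mimics the documented return shapes so the loop can run without the library installed.

```python
import random


def run_episode(env, propose_action, max_iters=1000):
    """Drive one episode using the documented reset()/step() return shapes."""
    observation, info = env.reset()
    rewards = []
    for _ in range(max_iters):
        action = propose_action(observation)
        # step() is documented as returning a 4-tuple.
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        if done:
            break
    return observation, rewards


class StubEnv:
    """Hypothetical stand-in that mimics only the return shapes."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0], {}  # parameter vector, empty info dict

    def step(self, action):
        self.steps += 1
        reward = -1.0 / self.steps  # placeholder reward, not a real cost
        done = self.steps >= self.max_steps
        return [0.5, 0.5], reward, done, {}


rng = random.Random(0)
final_obs, rewards = run_episode(
    StubEnv(max_steps=5),
    lambda obs: [rng.uniform(-0.1, 0.1)],  # random additive parameter update
)
```

With the real environment, `propose_action` would be replaced by an agent or optimizer that returns a parameter update vector of the same length as the circuit parameters.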