ProbabilityV0
Implementation of ProbabilityV0 environment
Author: Jay Shah (@Jayshah25)
Contact: jay.shah@qrlqai.com
License: Apache-2.0
- class qrl.env.core.probability.ProbabilityV0(n_qubits, target_distribution, ansatz=None, **kwargs)[source]
Bases: QuantumEnv
Probability distribution matching environment for variational quantum circuits.
ProbabilityV0 is a gymnasium.Env-compatible environment that trains a parameterized quantum circuit to approximate a target probability distribution over computational basis states. The agent optimizes continuous circuit parameters so that the measurement statistics of the circuit match a specified target distribution.
This environment is suitable for distribution learning, quantum generative modeling, and variational circuit optimization tasks.
Key properties
- Action space: Continuous parameter updates applied to the circuit ansatz.
- Observation space: Probability distribution over 2**n_qubits basis states produced by the current circuit.
- Reward: Negative weighted cost combining KL divergence and L2 distance to the target distribution, with an additional step penalty.
- Termination: Success when the reward exceeds the specified tolerance, or truncation at max_steps.
Visualization
The render() method animates the evolution of the learned probability distribution relative to the target distribution, along with the reward trajectory over training steps.
Input Parameters
- n_qubits (int)
Number of qubits in the circuit.
- target_distribution (np.ndarray)
Target probability distribution over computational basis states.
- ansatz (callable or None)
Custom parameterized circuit ansatz. If None, a default RY-based ansatz is used.
- max_steps (int)
Maximum number of optimization steps per episode.
- tolerance (float)
Reward threshold for early termination.
- alpha (float)
Weight balancing KL divergence and L2 distance.
- beta (float)
Penalty weight for step count.
- ffmpeg (bool)
Whether to use FFmpeg when saving animations.
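To make the observation concrete, the following is a minimal NumPy sketch of how an RY-based ansatz maps parameters to a distribution over 2**n_qubits basis states. It assumes one RY rotation per qubit applied to the all-zeros state with no entangling gates; the actual default ansatz in qrl may differ. The function name ry_product_probabilities is hypothetical and is not part of the library.

```python
import numpy as np

def ry_product_probabilities(params):
    """Basis-state probabilities for one RY rotation per qubit on |0...0>.

    Assumption: no entangling gates, so the result is a product distribution.
    """
    params = np.asarray(params, dtype=float)
    probs = np.array([1.0])
    for theta in params:
        # RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>
        single = np.array([np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2])
        # The Kronecker product builds the joint distribution;
        # the first parameter corresponds to the most significant bit.
        probs = np.kron(probs, single)
    return probs

# Two qubits: the first stays in |0>, the second is an equal superposition.
probs = ry_product_probabilities([0.0, np.pi / 2])
```

Under these assumptions the output always has length 2**n_qubits and sums to 1, which matches the shape of the observation described above. A circuit without entangling gates can only represent product distributions, so a custom ansatz would be needed for correlated targets.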
See also
- tutorials/probability
Tutorial on probability distribution learning with variational circuits.
- get_reward(params)[source]
Compute the reward for a given set of circuit parameters.
The reward is based on a weighted combination of:
- Kullback–Leibler (KL) divergence between the target distribution and the circuit output distribution.
- L2 distance between the target and circuit distributions.
- Parameters:
params (np.ndarray) – Vector of variational circuit parameters.
- Returns:
Scalar reward value encouraging the circuit output distribution to match the target distribution.
- Return type:
float
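One way such a reward could be computed is sketched below. The exact formula is not given in this reference, so the weighting (alpha on the KL term, 1 - alpha on the L2 term), the additive step penalty, the default values, and the eps smoothing are all assumptions. Unlike get_reward(params), the hypothetical sketch_reward function takes the circuit output distribution directly so that it runs without a quantum simulator.

```python
import numpy as np

def sketch_reward(probs, target, step_count=0, alpha=0.5, beta=0.01, eps=1e-12):
    """Negative weighted cost: KL(target || probs), L2 distance, step penalty.

    Assumed form, not the library's confirmed formula.
    """
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target, dtype=float)
    # eps guards against log(0) and division by zero for empty bins
    kl = float(np.sum(target * np.log((target + eps) / (probs + eps))))
    l2 = float(np.linalg.norm(target - probs))
    cost = alpha * kl + (1.0 - alpha) * l2
    # Higher reward is better, so the cost and step penalty enter with a minus sign
    return -(cost + beta * step_count)

target = np.array([0.5, 0.5])
perfect = sketch_reward(target, target)            # approximately 0
mismatch = sketch_reward([0.9, 0.1], target)       # strictly negative
later = sketch_reward([0.9, 0.1], target, step_count=10)
```

With this form the reward is bounded above by zero, so a tolerance for early termination would be a small negative number.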
- render(save_path_without_extension=None)[source]
Render the evolution of the probability distribution over training steps.
The animation shows a bar plot comparing the target probability distribution with the circuit’s predicted distribution at each step. Reward values are displayed in the plot title.
- Parameters:
save_path_without_extension (str or None, optional) – Path (without file extension) to save the animation. If provided, the animation is saved using the configured writer (MP4 for FFmpeg or GIF for Pillow). If None, the animation is displayed interactively.
- Returns:
This method produces a visualization but does not return a value.
- Return type:
None
- reset()[source]
Reset the environment to a random initial parameter configuration.
Initializes the circuit parameters randomly, clears episode history, and resets the step counter.
- Returns:
observation (np.ndarray) – Initial circuit parameter vector.
info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.
- step(action)[source]
Execute one optimization step.
Updates the circuit parameters using the provided action, evaluates the resulting probability distribution, computes the reward, and checks termination conditions.
- Parameters:
action (np.ndarray) – Parameter update vector applied additively to the current circuit parameters.
- Returns:
observation (np.ndarray) – Probability distribution over computational basis states produced by the circuit after the parameter update.
reward (float) – Reward value after applying the action.
done (bool) – True if the episode has terminated due to reaching the reward tolerance or the maximum number of steps.
info (dict) – Empty dictionary provided for compatibility with Gymnasium-style APIs.
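The step contract described above (additive update, new distribution, reward, termination check, four return values) can be illustrated with a small stand-in environment. ToyProbabilityEnv is hypothetical and is not the qrl implementation: it assumes one RY rotation per qubit, uses only the negative L2 distance as the reward to stay short, and picks arbitrary defaults for max_steps and tolerance.

```python
import numpy as np

class ToyProbabilityEnv:
    """Hypothetical stand-in mirroring the documented reset()/step() contract."""

    def __init__(self, target, max_steps=50, tolerance=-0.05, seed=0):
        self.target = np.asarray(target, dtype=float)
        self.n_qubits = int(round(np.log2(self.target.size)))
        self.max_steps = max_steps
        self.tolerance = tolerance
        self.rng = np.random.default_rng(seed)
        self.reset()

    def _probs(self):
        # Product of single-qubit RY outcome distributions (assumed ansatz)
        probs = np.array([1.0])
        for theta in self.params:
            probs = np.kron(probs, [np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2])
        return probs

    def reset(self):
        # Random initial parameters; step counter cleared
        self.params = self.rng.uniform(-np.pi, np.pi, self.n_qubits)
        self.steps = 0
        return self.params.copy(), {}

    def step(self, action):
        # The action is applied additively to the current parameters
        self.params = self.params + np.asarray(action, dtype=float)
        self.steps += 1
        obs = self._probs()
        reward = -float(np.linalg.norm(self.target - obs))  # simplified reward
        done = bool(reward > self.tolerance or self.steps >= self.max_steps)
        return obs, reward, done, {}

# Drive the toy environment with small random parameter updates.
env = ToyProbabilityEnv(target=[0.5, 0.5, 0.0, 0.0])
env.reset()
done = False
while not done:
    action = env.rng.normal(scale=0.1, size=env.n_qubits)
    obs, reward, done, info = env.step(action)
```

The loop ends either when the reward rises above the tolerance or when max_steps is reached, which are the two conditions the done flag is documented to cover.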