A library to model multivariate data using copulas.
An Open Source Project from the Data to AI Lab, at MIT
Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table containing numerical data, we can use Copulas to learn the distribution and later on generate new synthetic rows following the same statistical properties.
Some of the features provided by this library include:
Copulas is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.
Optionally, Copulas can also be installed as a standalone library using either pip or conda:
pip install copulas
conda install -c sdv-dev -c conda-forge copulas
For more installation options, please visit the Copulas Installation Guide.
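After installing, it can be useful to confirm which version ended up in the active environment. The snippet below is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of Copulas.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether Copulas is available in the current environment.
version = installed_version("copulas")
print(f"copulas {version}" if version else "copulas is not installed")
```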
In this short quickstart, we show how to model a multivariate dataset and then generate synthetic data that resembles it.
import warnings
warnings.filterwarnings('ignore')

from copulas.datasets import sample_trivariate_xyz
from copulas.multivariate import GaussianMultivariate
from copulas.visualization import compare_3d
Load a dataset with 3 columns that are not independent
real_data = sample_trivariate_xyz()
Fit a Gaussian copula to the data
copula = GaussianMultivariate()
copula.fit(real_data)
Sample synthetic data
synthetic_data = copula.sample(len(real_data))
Plot the real and the synthetic data to compare
compare_3d(real_data, synthetic_data)
The output is a figure with two plots, one showing the real data and one showing the synthetic data that you just generated.
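To give a rough idea of what fitting and sampling a Gaussian copula involve, here is a minimal NumPy sketch of the general technique. It is not the library's implementation: the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, the marginals are handled with plain empirical quantiles (an assumption made for brevity), and the two-column dataset is an invented stand-in for real data.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform -> standard normal
_to_uniform = np.vectorize(_STD_NORMAL.cdf)     # standard normal -> uniform


def fit_gaussian_copula(data):
    """Estimate the correlation matrix of the normal scores of each column."""
    n_rows = data.shape[0]
    # Rank each column (1..n), then scale into the open interval (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    normal_scores = _to_normal(uniforms)
    return np.corrcoef(normal_scores, rowvar=False)


def sample_gaussian_copula(data, correlation, n_samples, rng):
    """Draw rows whose dependence follows `correlation` and whose marginals
    follow the empirical quantiles of `data`."""
    n_cols = data.shape[1]
    normals = rng.multivariate_normal(np.zeros(n_cols), correlation, size=n_samples)
    uniforms = _to_uniform(normals)
    columns = [np.quantile(data[:, j], uniforms[:, j]) for j in range(n_cols)]
    return np.column_stack(columns)


rng = np.random.default_rng(0)
# Hypothetical stand-in data: y depends on x, and x has a skewed marginal.
x = rng.exponential(size=2000)
y = 2.0 * x + rng.normal(scale=0.5, size=2000)
real = np.column_stack([x, y])

correlation = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, correlation, len(real), rng)

# Compare the column correlations of the real and the synthetic rows.
print(np.corrcoef(real, rowvar=False).round(2))
print(np.corrcoef(synthetic, rowvar=False).round(2))
```

Separating the dependence structure (the correlation of the normal scores) from the per-column marginals is what lets the synthetic rows keep a skewed column skewed while still reproducing how the columns move together.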
For more details about Copulas and all its possibilities and features, please check the documentation site.
There you can also learn how to contribute to Copulas and help us develop new features or cool ideas.
Copulas is an open source project from the Data to AI Lab at MIT which has been built and maintained over the years by the following team:
This repository is part of The Synthetic Data Vault Project