Gretel synthetics report

Synthetic Data Quality Report

Excellent

Synthetic Quality Score

Excellent

Data Privacy Score

Normal

Privacy Configuration

The Synthetic Data Quality Score is computed by taking a weighted combination of the individual quality metrics: Field Distribution Stability, Field Correlation Stability and Deep Structure Stability. The Synthetic Data Quality Score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the Synthetic Data Quality Score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you do not require statistical symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.

If your Synthetic Data Quality Score is not as high as you would like it to be, read here for a multitude of ideas for improving your model.

How to interpret your SQS	Excellent	Good	Moderate	Poor	Very Poor
Suitable for machine learning or statistical analysis
Suitable for balancing or augmenting machine learning data sources
Suitable for pre-production testing environments
Suitable for demo environments or mock data
Improve your model using our tips and advice
Significant tuning required to improve model

The Data Privacy Score analyzes the synthetic output to measure how well protected your original data is from adversarial attacks. It combines results from two common attacks: Membership Inference and Attribute Inference. Membership Inference Protection measures how well you are protected from an adversary attempting to determine if specific data points were part of the training set. Attribute Inference Protection measures how well you are protected from an adversary trying to predict sensitive attributes of the data used in training, given other attributes.

Data Sharing Use Case	Excellent	Very Good	Good	Normal	Poor
Internally, within the same team
Internally, across different teams
Externally, with trusted partners
Externally, public availability

If your Data Privacy Score is not high enough for your use case, we recommend applying the following techniques or filters to try to increase the score:

Use the Outlier Filter to ensure that no synthetic record is an outlier with respect to the training space. You can enable this filter by setting privacy_filters.outliers: [medium, high].
Use the Similarity Filter to ensure that no synthetic record is overly similar to a training record. You can enable this filter by setting privacy_filters.similarity: [medium, high].
Underfit the model to generate output that is less similar to the input. In all model types, you can reduce epochs to underfit or prevent overfitting. In LSTM, you can also set validation_split: True and early_stopping: True in the configuration.
Apply Differential Privacy, or reduce epsilon if Differential Privacy is applied.
Increase your training dataset size to reduce the influence of individual data points on the overall model.

Data Sharing Use Case	Excellent	Very Good	Good	Normal	Poor
Internally, within the same team
Internally, across different teams
Externally, with trusted partners
Externally, public availability

Privacy Configuration is determined by the privacy mechanisms you have enabled in the synthetic configuration. The use of these mechanisms helps to ensure that your synthetic data is safe from adversarial attacks. There are four primary protection mechanisms you can add to the creation of synthetic data.

The Outlier Filter ensures that no synthetic record is an outlier with respect to the training space, and is enabled with the privacy_filters.outliers: [medium, high].
The Similarity Filter ensures that no synthetic record is overly similar to a training record. This filter is enabled in the configuration by setting privacy_filters.similarity: [medium, high].
You can also set privacy_filters.outliers to auto which will try for medium, and fall back to turning the filter off if it prevents the synthetic model from generating the requested number of records.
Overfitting Prevention ensures that model training stops before it has a chance to overfit. In all model types, you can reduce epochs to prevent overfitting. In LSTM, you can also set validation_split: True and early_stopping: True in the configuration.
Differential Privacy is an experimental implementation of DP-SGD that modifies the optimizer to offer provable guarantees of privacy, enabling safe training on private data. Differential Privacy can cause a hit to utility, often requiring larger datasets to work well, but it uniquely provides privacy guarantees against both known and unknown attacks on data. Differential Privacy can be enabled by setting dp: True and can be modified using the associated configuration settings.

Synthetic Quality Summary

Excellent

Field Correlation Stability

Excellent

Deep Structure Stability

Excellent

Field Distribution Stability

	Training Data	Synthetic Data
Row Count	5000	5000
Column Count	15	15
Training Lines Duplicated	--	0

What do these values mean?

Privacy Configuration

Default Privacy Protections	Advanced Protections

Outlier Filter Disabled

Similarity Filter Disabled

Overfitting Prevention Disabled

Differential Privacy Disabled

Data Privacy Summary

Excellent

Membership Inference Protection

Very Good

Attribute Inference Protection

Training Field Overview

Field	Unique	Ave. Length	Type	Distribution Stability
native_country	41	13.24	Categorical	Good
education	16	9.40	Categorical	Good
education_num	16	1.54	Numeric	Good
occupation	15	13.28	Categorical	Good
race	5	6.55	Categorical	Good
workclass	9	8.83	Categorical	Good
marital_status	7	15.49	Categorical	Excellent
capital_loss	56	1.13	Numeric	Excellent
relationship	6	10.09	Categorical	Excellent
hours_per_week	84	1.99	Numeric	Excellent
fnlwgt	4590	5.83	Numeric	Excellent
capital_gain	82	1.27	Numeric	Excellent
sex	2	5.66	Binary	Excellent
age	68	2.00	Numeric	Excellent
target	2	5.76	Binary	Excellent

Synthetic Quality Summary

Privacy Configuration

Data Privacy Summary

Training Field Overview

Training and Synthetic Data Correlation

Principal Component Analysis

Field Distribution Comparisons

Membership Inference Protection

Attribute Inference Protection