The Synthetic Data Quality Score (SQS) is computed as a weighted combination of the individual quality metrics: Field Distribution Stability, Field Correlation Stability, and Deep Structure Stability.
The score estimates how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, it can be viewed as a utility or confidence score: an indication of whether scientific conclusions drawn from the synthetic dataset would match those drawn from the original dataset. If you do not require this level of statistical fidelity, as might be the case in a testing or demo environment, a lower score may be perfectly acceptable.
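For intuition, the weighted combination mentioned above can be sketched as a simple weighted average of the three sub-scores. The equal weights used below are purely illustrative assumptions; the report uses its own internal weighting.

    # Illustrative sketch only: the report's actual weights are internal to Gretel.
    def synthetic_data_quality_score(field_distribution_stability: float,
                                     field_correlation_stability: float,
                                     deep_structure_stability: float,
                                     weights=(1.0, 1.0, 1.0)) -> float:
        """Weighted combination of the three quality sub-scores."""
        scores = (field_distribution_stability,
                  field_correlation_stability,
                  deep_structure_stability)
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

    # Example: sub-scores of 92, 88 and 90 give an overall score of 90.
    print(synthetic_data_quality_score(92, 88, 90))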
If your Synthetic Data Quality Score is not as high as you would like, read here for ideas on improving your model.
How to interpret your SQS
Excellent: Suitable for machine learning or statistical analysis
Good: Suitable for balancing or augmenting machine learning data sources
Moderate: Suitable for pre-production testing environments
Poor: Suitable for demo environments or mock data; improve your model using our tips and advice
Very Poor: Significant tuning required to improve the model
The Data Privacy Score measures, based on an analysis of the synthetic output, how well your original data is protected from adversarial attacks.
It combines results from two common attacks: Membership Inference and Attribute Inference.
Membership Inference Protection measures how well you are protected from an adversary attempting to determine if specific data points
were part of the training set. Attribute Inference Protection measures how well you are protected from an adversary trying to predict
sensitive attributes of the data used in training, given other attributes.
Data Sharing Use Case: recommended minimum Data Privacy Score
Internally, within the same team: Normal
Internally, across different teams: Good
Externally, with trusted partners: Very Good
Externally, public availability: Excellent
A score of Poor falls below the recommended level for any of these sharing scenarios.
If your Data Privacy Score is not high enough for your use case, we recommend applying the following techniques or filters to try to increase it (a configuration sketch follows this list):
- Use the Outlier Filter to ensure that no synthetic record is an outlier with respect to the training space. You can enable this filter by setting privacy_filters.outliers to medium or high.
- Use the Similarity Filter to ensure that no synthetic record is overly similar to a training record. You can enable this filter by setting privacy_filters.similarity to medium or high.
- Underfit the model to generate output that is less similar to the input. In all model types, you can reduce epochs to underfit or prevent overfitting. In LSTM, you can also set validation_split: True and early_stopping: True in the configuration.
- Apply Differential Privacy, or reduce epsilon if Differential Privacy is already applied.
- Increase your training dataset size to reduce the influence of individual data points on the overall model.
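As a rough sketch of how these settings map onto a model configuration (the nesting under models and synthetics below is an assumed layout modeled on common Gretel configurations; only the individual keys are taken from this guide):

    # Sketch only: the surrounding structure is an assumption; the keys are
    # the privacy-related settings discussed in the list above.
    import yaml

    privacy_tuning = {
        "models": [{
            "synthetics": {
                "params": {
                    "epochs": 50,              # fewer epochs to underfit slightly
                    "validation_split": True,  # LSTM only
                    "early_stopping": True,    # LSTM only
                },
                "privacy_filters": {
                    "outliers": "medium",      # or "high"
                    "similarity": "medium",    # or "high"
                },
            }
        }]
    }

    print(yaml.safe_dump(privacy_tuning, sort_keys=False))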
Privacy Configuration is determined by the privacy mechanisms you have enabled in the synthetic
configuration. The use of these mechanisms helps to ensure that your synthetic data is safe from adversarial
attacks. There are four primary protection mechanisms you can add to the creation of synthetic data.
The Outlier Filter ensures that no synthetic record is an outlier with respect to the training space. It is enabled by setting privacy_filters.outliers to medium or high.
The Similarity Filter ensures that no synthetic record is overly similar to a training record. It is enabled by setting privacy_filters.similarity to medium or high.
You can also set privacy_filters.outliers to auto, which will try medium and fall back to turning the filter off if the filter prevents the synthetic model from generating the requested number of records.
Overfitting Prevention ensures that model training stops before it has a chance to overfit.
In all model types, you can reduce epochs to prevent overfitting.
In LSTM, you can also set validation_split: True and
early_stopping: True in the configuration.
Differential Privacy is an experimental implementation of DP-SGD that modifies the optimizer to offer provable
guarantees of privacy, enabling safe training on private data. Differential Privacy can cause a hit to utility, often
requiring larger datasets to work well, but it uniquely provides privacy guarantees against both known and unknown
attacks on data. Differential Privacy can be enabled by setting dp: True and can
be modified using the associated configuration settings.
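Putting the four mechanisms together, a hedged sketch of the relevant configuration fragment (the nesting is again an assumption, and any Differential Privacy tuning keys beyond dp itself vary by model type, so they are omitted):

    # Sketch only: one synthetics model entry with all four protections enabled.
    protections = {
        "synthetics": {
            "params": {
                "epochs": 50,              # reduced to help prevent overfitting
                "validation_split": True,  # LSTM only
                "early_stopping": True,    # LSTM only
                "dp": True,                # enable the DP-SGD optimizer
            },
            "privacy_filters": {
                "outliers": "auto",        # try medium, fall back to off if needed
                "similarity": "high",
            },
        }
    }
    print(protections)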
Synthetic Quality Summary
The summary panel of the report lists an overall rating alongside ratings for Field Correlation Stability, Deep Structure Stability, and Field Distribution Stability (shown as Excellent in this example).
To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the
training data, and then in the synthetic data. The absolute difference between these values is then computed and
averaged across all field pairs. The lower this average value is, the higher the Field Correlation Stability quality
score will be. To aid in the comparison of field correlations, a heatmap is shown for both the training data and
the synthetic data, as well as a heatmap for the computed difference of correlation values. If the intended purpose
of the synthetic data is to perform statistical analysis or machine learning, maintaining the integrity of field
correlations can be critical.
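A minimal sketch of this computation is shown below; it is illustrative only, since the report's own implementation may encode categorical fields, bin values, or weight pairs differently.

    # Illustrative sketch: average absolute difference between the pairwise
    # correlation matrices of the training and synthetic datasets.
    import numpy as np
    import pandas as pd

    def correlation_stability(train: pd.DataFrame, synth: pd.DataFrame) -> float:
        corr_train = train.corr(numeric_only=True)
        corr_synth = synth.corr(numeric_only=True)
        # Compare only the fields present in both datasets.
        fields = corr_train.columns.intersection(corr_synth.columns)
        diff = (corr_train.loc[fields, fields] - corr_synth.loc[fields, fields]).abs()
        # Average over unique field pairs (upper triangle, excluding the diagonal).
        upper = np.triu_indices(len(fields), k=1)
        return float(diff.values[upper].mean())

The lower the value returned, the higher the Field Correlation Stability score would be.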
To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a
Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. A synthetic
quality score is created by comparing the distributional distance between the principal components found in each
dataset. The closer the principal components are, the higher the synthetic quality score will be. As PCA is a very
common approach used in machine learning for both dimensionality reduction and visualization, this metric gives
immediate feedback as to the utility of the synthetic data for machine learning purposes.
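One plausible way to sketch this comparison is shown below; the choice of distance measure and number of components are assumptions, not Gretel's exact procedure.

    # Illustrative sketch: compare the principal components found in each dataset.
    import numpy as np
    from sklearn.decomposition import PCA

    def deep_structure_distance(train: np.ndarray, synth: np.ndarray,
                                n_components: int = 5) -> float:
        # Both inputs are numeric arrays with the same columns; n_components
        # must not exceed the number of columns.
        pca_train = PCA(n_components=n_components).fit(train)
        pca_synth = PCA(n_components=n_components).fit(synth)
        # Compare matching components via absolute cosine similarity; the sign
        # of a principal component is arbitrary, hence the abs().
        sims = [abs(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
                for a, b in zip(pca_train.components_, pca_synth.components_)]
        return 1.0 - float(np.mean(sims))

The closer the components are (a distance near zero), the higher the resulting quality score would be.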
Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those
in the original data. For each numeric or categorical field, we use a common approach for comparing two distributions
referred to as the Jensen-Shannon Distance. The lower the JS Distance score is on average across all fields, the
higher the Field Distribution Stability quality score will be. Note that highly unique strings (neither numeric nor
categorical) will not have a distributional distance score. To aid in the comparison of original versus synthetic
field distributions, a bar chart or histogram is shown for each numeric or categorical field. Depending on the
intended purpose of the synthetic data, maintaining the integrity of field distributions can be critical.
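For a single numeric field, the comparison can be sketched as follows; the binning strategy here is an assumption, and categorical fields would instead be compared on their category frequencies.

    # Illustrative sketch: Jensen-Shannon distance between the training and
    # synthetic distributions of one numeric field (0 = identical, 1 = disjoint).
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def field_distribution_distance(train_col: np.ndarray, synth_col: np.ndarray,
                                    bins: int = 20) -> float:
        lo = min(train_col.min(), synth_col.min())
        hi = max(train_col.max(), synth_col.max())
        p, _ = np.histogram(train_col, bins=bins, range=(lo, hi))
        q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
        # jensenshannon normalizes the histograms internally.
        return float(jensenshannon(p, q, base=2))

Averaging this distance across all comparable fields, a lower average corresponds to a higher Field Distribution Stability score.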
The row count is the number of records or lines in the training (or synthetic) dataset. The column count is the
number of fields in the dataset. The number of training rows used can directly impact the quality of the synthetic
data created. The more examples available when training a model, the easier it is for the model to accurately learn
the distributions and correlations in the data. Aim for a minimum of 3,000 training examples; 5,000 or even 50,000 is better still.
The more synthetic rows you generate, the easier it is to assess whether the statistical integrity of the data remains
intact. If your Synthetic Data Quality Score is not as high as you would like, make sure you have generated at
least 5,000 synthetic records.
The Training Lines Duplicated value is an important check on the privacy of the generated synthetic data.
In almost all situations, this value should be 0. The only exception is when the training data itself contains
a large number of duplicate rows; if that is the case, simply remove the duplicate rows before training.
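If your training data does contain duplicate rows, a quick way to drop them before training (the file names here are placeholders):

    # Remove exact duplicate rows from the training data before training a model.
    import pandas as pd

    df = pd.read_csv("training_data.csv")         # placeholder input path
    df = df.drop_duplicates()                     # keep the first copy of each row
    df.to_csv("training_data_deduped.csv", index=False)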
Privacy Configuration
Default Privacy Protections: Outlier Filter (Disabled), Similarity Filter (Disabled)
Advanced Protections: Overfitting Prevention (Disabled), Differential Privacy (Disabled)
The Outlier privacy filter ensures that no synthetic record is an outlier with respect to the training dataset.
Outliers revealed in the synthetic dataset can be exploited by Membership Inference Attacks, Attribute Disclosure,
and a wide variety of other adversarial attacks. They are a serious privacy risk. The Outlier Filter is enabled
by the "privacy_filters.outliers" configuration setting. A value of "medium" will filter out any synthetic record
that has a very high likelihood of being an outlier. A value of "high" will filter out any synthetic record that
has a medium to high likelihood of being an outlier.
The Similarity privacy filter ensures that no synthetic record is overly similar to a training record. Synthetic
records that closely resemble training records can be a severe privacy risk, as adversarial attacks commonly exploit
such records to gain insights into the original data. The Similarity Filter is enabled by the "privacy_filters.similarity"
configuration setting. A value of "medium" will filter out any synthetic record that is an exact duplicate of a
training record. A value of "high" will filter out any synthetic record that is 99% similar or more to a
training record.
The Overfitting Prevention privacy mechanism ensures that the synthetic model will stop training before it has a
chance to overfit. When a model is overfit, it will start to memorize the training data as opposed to learning
generalized patterns in the data. This is a severe privacy risk as overfit models are commonly exploited by
adversaries seeking to gain insights into the original data.
Differential Privacy ensures that no individual training record can unduly influence the output of a synthetic
model. It is very effective at preventing the generation of records that are overly similar to the training
set and at ensuring an even distribution of values, though it can result in a modest degradation of the SQS
utility score. Differential privacy is best suited to larger datasets (typically 50K rows or more) where
probabilistic privacy guarantees are required.
Data Privacy Summary
The privacy summary panel lists ratings for Membership Inference Protection and Attribute Inference Protection (shown as Excellent and Very Good in this example).
Membership Inference Protection is a measure of how well-protected your data is from membership inference attacks.
A membership inference attack is a type of privacy attack on machine learning models where an adversary aims
to determine whether a particular data sample was part of the model's training dataset. By exploiting the differences
in the model's responses to data points from its training set versus those it has never seen before, an attacker can
attempt to infer membership. This type of attack can have critical privacy implications, as it can reveal whether
specific individuals' data was used to train the model. Based on directly analyzing the synthetic output, a high
score indicates that your training data is well-protected from this type of attack.
Attribute Inference Protection is a measure of how well-protected your data is from attribute inference attacks.
An attribute inference attack is a type of privacy attack on machine learning models where an adversary seeks
to infer missing attributes or sensitive information about individuals from their data that was used to train
the model. By leveraging the model's output, the attacker can attempt to predict unknown attributes of a data sample.
This type of attack poses significant privacy risks, as it can uncover sensitive details about individuals that
were not intended to be revealed by the data owners. Based on directly analyzing the synthetic output, a high
score indicates that your training data is well-protected from this type of attack.
Training Field Overview
The high-level Field Distribution Stability score is computed by taking the average of the individual Field
Distribution Stability scores, shown in the table below. Distributional stability applies to numeric and
categorical fields, but not highly unique strings. To better understand a field's Distribution Stability score,
click on the field name to be taken to a graph comparing the training and synthetic distributions.
The table below also shows the count of unique and missing field values and the average length of each field, as well
as its datatype. When a dataset contains a large number of highly unique fields or a large amount of missing data,
these characteristics can impede the model's ability to accurately learn the statistical structure of the data.
Exceptionally long fields can also have the same impact.
Read here for advice on how best to handle fields like these.
Membership Inference Protection
Breakdown of protection level across 360 simulated attacks. (Membership inference attacks and this protection score are described under the Data Privacy Summary above.)
Attribute Inference Protection
Breakdown of protection across all columns. (Attribute inference attacks and this protection score are described under the Data Privacy Summary above.)