The Synthetic Text Data Quality Score is computed by taking a weighted
combination of the individual quality metrics: Text Semantic Similarity
and Text Structure Similarity. The report supports 50+ languages, including:
English, French, German, Dutch, Italian, Portuguese, Spanish, Russian,
Polish, Arabic, Turkish, Chinese, Japanese, Thai and Korean.
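As a rough illustration of a weighted combination of the two metrics, the sketch below combines two 0 to 100 scores. The weights used by the report are not stated here, so the equal weights and the function name `text_sqs` are assumptions made purely for illustration.

```python
def text_sqs(semantic: float, structure: float,
             w_semantic: float = 0.5, w_structure: float = 0.5) -> float:
    """Weighted combination of two 0-100 scores.

    The default weights are hypothetical; the report's real weights
    are not documented in this section.
    """
    total = w_semantic + w_structure
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight total so the result stays in 0-100.
    return (w_semantic * semantic + w_structure * structure) / total


if __name__ == "__main__":
    print(text_sqs(80, 60))
```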
The Synthetic Text Quality Score (Text SQS) is an estimate of how well
the generated synthetic data maintains the same semantic and structural
properties as the original dataset. In this sense, the score can be
viewed as a utility score or a confidence score as to whether
scientific conclusions drawn from the synthetic dataset would be
the same if one were to have used the original dataset instead. If
you do not require semantic or structural similarity, as might be
the case in a testing or demo environment, a lower score may be just
as acceptable.
The 50+ languages supported by the report are: ar, bg, ca, cs, da, de,
el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it,
ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro,
ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
If your Synthetic Text Data Quality Score isn't as high as you'd like
it to be, read here for a multitude of ideas for improving your model.
The Text Semantic Similarity Score is a value in the range of 0–100
that indicates how closely the meaning of the synthetic text matches
that of the real text, evaluated across all of the text data. An
embedding model converts each record into a 512-dimensional vector.
The embeddings are averaged across all records, separately for the
training and synthetic texts. The cosine similarity of the two average
vectors is then calculated, and a score is assigned based on that
similarity.
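The averaging and cosine similarity steps can be sketched as follows. This is a minimal sketch that assumes the embeddings are already available as arrays of shape (records, 512); the function name `mean_embedding_cosine` is hypothetical, random vectors stand in for real embeddings, and the mapping from cosine similarity to the 0 to 100 score is not shown because it is not described here.

```python
import numpy as np


def mean_embedding_cosine(train_emb, synth_emb) -> float:
    """Cosine similarity between the mean training and mean synthetic embedding."""
    train_mean = np.asarray(train_emb, dtype=float).mean(axis=0)
    synth_mean = np.asarray(synth_emb, dtype=float).mean(axis=0)
    denom = np.linalg.norm(train_mean) * np.linalg.norm(synth_mean)
    if denom == 0.0:
        raise ValueError("mean embedding has zero norm")
    return float(np.dot(train_mean, synth_mean) / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder data: random vectors standing in for real embeddings.
    train = rng.normal(loc=1.0, size=(80, 512))
    synth = train + rng.normal(scale=0.1, size=(80, 512))
    print(mean_embedding_cosine(train, synth))
```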
The Text Structure Similarity Score measures the distance between the
synthetic and training distributions of three statistics: average
characters per word, average words per sentence, and sentence count.
A score is computed for each of the three distributions, and their
average is reported as a single Text Structure Similarity value.
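To make the three statistics concrete, here is a minimal sketch that computes them for a single record. The tokenization is an assumption: words are matched with `\w+` and sentences are split on `.`, `!` and `?`, which is a naive rule that will not segment languages such as Chinese, Japanese or Thai correctly. The report's own tokenizer is not described here, and `structure_stats` is a hypothetical name.

```python
import re


def structure_stats(text: str) -> dict:
    """Per-record structure statistics using naive regex tokenization."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sentences = len(words), len(sentences)
    return {
        "sentence_count": n_sentences,
        # Guard against empty records to avoid division by zero.
        "words_per_sentence": n_words / n_sentences if n_sentences else 0.0,
        "chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(structure_stats("Hello world. How are you today?"))
```

Collecting these values over all training records and all synthetic records gives the pairs of distributions that are compared.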
The row count is the number of records or lines in the training (or synthetic)
dataset. The column count is the number of fields in the dataset. The number of
training rows used can directly impact the quality of the synthetic data created.
The more examples available when training a model, the easier it is for the model
to accurately learn the distributions in the data.
The Training Lines Duplicated value is an important check on the privacy
of the generated synthetic data. In almost all situations, this value should be 0.
The only exception is when the training data itself contains many duplicate rows.
In that case, remove the duplicate rows before training.
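The check and the suggested cleanup can be sketched as below. The exact counting rule used by the report is not specified here, so this sketch assumes one plausible reading: the number of distinct training lines that reappear verbatim in the synthetic data. Both function names are hypothetical.

```python
def training_lines_duplicated(training, synthetic) -> int:
    """Count distinct training lines that appear verbatim in the synthetic data.

    Assumption: exact string match on distinct lines; the report's own
    rule may differ.
    """
    return len(set(training) & set(synthetic))


def drop_duplicate_rows(training):
    """Remove duplicate rows while keeping the first occurrence and the order."""
    return list(dict.fromkeys(training))


if __name__ == "__main__":
    train = ["the cat sat", "hello there", "the cat sat"]
    synth = ["a dog ran", "hello there"]
    print(training_lines_duplicated(train, synth))
    print(drop_duplicate_rows(train))
```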
Missing values is the number of records that are empty strings. Unique values is the
count of unique records in the data. Average character, word and sentence counts
are calculated across all records in the dataset. These attributes give you a sense
of the shape and size of the training set and synthetic data.
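A minimal sketch of these summary attributes for a list of text records follows. It assumes that an empty string counts as one of the unique values and that words are separated by whitespace; sentence counts are left out because they depend on the sentence splitter in use. The name `summarize` is hypothetical.

```python
def summarize(records) -> dict:
    """Summary attributes for a list of text records."""
    n = len(records)
    if n == 0:
        return {"missing": 0, "unique": 0, "avg_chars": 0.0, "avg_words": 0.0}
    return {
        # Records that are exactly the empty string.
        "missing": sum(1 for r in records if r == ""),
        # Assumption: the empty string is counted as a unique value too.
        "unique": len(set(records)),
        "avg_chars": sum(len(r) for r in records) / n,
        # Assumption: whitespace-separated words.
        "avg_words": sum(len(r.split()) for r in records) / n,
    }


if __name__ == "__main__":
    print(summarize(["a b", "", "a b", "hello"]))
```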
To compute the text semantic and structural similarity scores, training and
synthetic records are each downsampled to 80 rows or the number of training
rows, whichever is smaller. This sample size is statistically robust and more
efficient for NLP models. Downsampling does not affect the number of records
used to train the language model that generates the synthetic records.
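The sample size rule can be sketched as follows. The sampling method is an assumption (uniform random sampling without replacement, with a fixed seed for repeatability), as is the extra guard for a dataset that holds fewer records than the target size; `downsample` is a hypothetical name.

```python
import random


def downsample(records, training_row_count: int, cap: int = 80, seed: int = 0):
    """Sample min(cap, training_row_count) records without replacement."""
    # Extra guard (an assumption): never ask for more records than exist.
    k = min(cap, training_row_count, len(records))
    rng = random.Random(seed)
    return rng.sample(list(records), k)


if __name__ == "__main__":
    training = [f"record {i}" for i in range(1000)]
    synthetic = [f"synthetic {i}" for i in range(500)]
    print(len(downsample(training, len(training))))
    print(len(downsample(synthetic, len(training))))
```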
Semantic Similarity Principal Component Analysis
Gretel compares the principal components of the training and synthetic
embedding vectors, along with the fraction of variance explained by each
principal component. This fraction is the ratio between the variance of that
component and the total variance. The first 4 principal components, sorted by
explained variance, are visualized so that the plots cover more of the variance
and track the semantic similarity score more closely; that score is calculated
across all 512 embedding dimensions. Diagonal plots show each principal
component's histograms for training and synthetic data, plotted on top of one
another. The closer the principal components are, the higher the semantic
similarity score will be.
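The fraction of variance explained by each component can be computed from the singular values of the centered embedding matrix. The sketch below is a generic PCA calculation and not necessarily how the report implements it; `explained_variance_ratio` is a hypothetical name and random vectors stand in for real embeddings.

```python
import numpy as np


def explained_variance_ratio(embeddings, n_components: int = 4):
    """Fraction of total variance explained by the leading principal components."""
    x = np.asarray(embeddings, dtype=float)
    x = x - x.mean(axis=0)  # center each embedding dimension
    # Singular values come back sorted in descending order.
    singular_values = np.linalg.svd(x, compute_uv=False)
    variances = singular_values ** 2 / (x.shape[0] - 1)
    return variances[:n_components] / variances.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder data: 80 records with 512-dimensional embeddings.
    emb = rng.normal(size=(80, 512))
    print(explained_variance_ratio(emb))
```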
Text Structure Similarity
Text Structure Similarity is a measure of how closely the distributions of
sentence count, average words per sentence, and average characters per word
in the synthetic data mirror those in the original data. For better structure
similarity, you can also change the maximum number of generated tokens in the
config, which is 100 by default. This might affect the semantic similarity
score, depending on the model. For each statistic, we use Jensen-Shannon (JS)
Distance to compare the two distributions. The lower the JS Distance is on
average across all distributions, the higher the text structure similarity
score will be. To aid in the comparison of original versus synthetic
distributions, a bar chart or histogram is shown. Depending on the intended
purpose of the synthetic text, maintaining the integrity of text structure
can be critical.
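For reference, the Jensen-Shannon distance between two histograms over the same bins can be computed as in the sketch below. It uses base-2 logarithms so the distance falls between 0 (identical distributions) and 1 (no overlap). The binning of each statistic and the conversion from the average distance to the final score are not described here, so they are left out; `js_distance` is a hypothetical name.

```python
import math


def js_distance(p, q, base: float = 2.0) -> float:
    """Jensen-Shannon distance between two histograms over the same bins."""
    p_sum, q_sum = sum(p), sum(q)
    if len(p) != len(q) or p_sum <= 0 or q_sum <= 0:
        raise ValueError("histograms must share bins and have positive mass")
    # Normalize counts to probabilities.
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability contribute nothing to the sum.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # The distance is the square root of the divergence.
    return math.sqrt(max(0.0, divergence))


if __name__ == "__main__":
    train_hist = [5, 20, 40, 25, 10]  # e.g. words-per-sentence bins
    synth_hist = [8, 25, 35, 22, 10]
    print(js_distance(train_hist, synth_hist))
```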