Gretel Synthetic Text Data Quality Report

Model

Evaluate Model

Model UID 66c4b66716e2a054e09083d1

Project

Generated 08/20/2024, 15:30

Synthetic Text Data Quality Score: Good

The Synthetic Text Data Quality Score is a weighted combination of two individual quality metrics: Text Semantic Similarity and Text Structure Similarity. The report supports 50+ languages, including English, French, German, Dutch, Italian, Portuguese, Spanish, Russian, Polish, Arabic, Turkish, Chinese, Japanese, Thai and Korean.

The Synthetic Text Data Quality Score (Text SQS) estimates how well the generated synthetic data maintains the semantic and structural properties of the original dataset. In this sense, it can be read as a utility score, or as a confidence score for whether scientific conclusions drawn from the synthetic dataset would match those drawn from the original dataset. If you do not require semantic or structural similarity, as might be the case in a testing or demo environment, a lower score may be just as acceptable.
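The weighted combination described above can be sketched as a normalized weighted average. This is only an illustration: the report does not disclose the actual weights or the score scale, so the function name `text_sqs`, the equal default weights, and the 0 to 100 scale are all assumptions.

```python
def text_sqs(semantic: float, structure: float,
             w_semantic: float = 0.5, w_structure: float = 0.5) -> float:
    """Combine two sub-scores into one score via a weighted average.

    Assumptions (not stated in the report): both sub-scores share the
    same scale (for example 0 to 100) and the weights are hypothetical.
    """
    total_weight = w_semantic + w_structure
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total weight keeps the result on the input scale
    # even when the weights do not sum to 1.
    return (w_semantic * semantic + w_structure * structure) / total_weight


if __name__ == "__main__":
    # Hypothetical sub-scores, chosen only to exercise the function.
    print(text_sqs(60.0, 90.0))
    print(text_sqs(60.0, 90.0, w_semantic=1.0, w_structure=3.0))
```

With unequal weights, the combined score moves toward whichever metric is weighted more heavily, which is why a strong structure score can offset a weaker semantic score, or the reverse.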

The 50+ languages supported by the report are: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.

If your Synthetic Text Data Quality Score isn't as high as you'd like it to be, see the tips and advice for improving your model.

How to interpret the Text SQS	Excellent	Good	Moderate	Poor	Very Poor
Demo environments or mock data
Pre-production testing environments
Suitable for statistical analysis
Augment machine learning data sources
Improve your model using our tips and advice

Text Semantic Similarity: Moderate

Text Structure Similarity: Excellent

Data Summary Statistics

Metric	Training Data	Synthetic Data
Row Count	5000	5000
Column Count	1	1
Training Lines Duplicated	-	16
Missing Values	0	0
Unique Values	5000	4963
Average Words Per Sentence	9.45	9.23
Average Characters Per Word	4.73	4.93
Average Sentence Count	1.01	1.03
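Statistics of this kind can be approximated with a few lines of Python. The sketch below is not the report's actual implementation: the sentence splitter, the word tokenizer, and the reading of "Training Lines Duplicated" as synthetic rows that exactly match a training row are all assumptions, and the helper names `summarize` and `training_lines_duplicated` are hypothetical.

```python
import re
from statistics import mean


def summarize(rows):
    """Compute rough summary statistics for one text column.

    Assumptions: a row is "missing" if it is None or blank, sentences end
    at '.', '!' or '?' followed by whitespace, and words are runs of
    word characters. Real tokenizers differ, especially across languages.
    """
    present = [r for r in rows if r is not None and r.strip()]
    sentence_counts, words_per_sentence, word_lengths = [], [], []
    for text in present:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        sentence_counts.append(len(sentences))
        for sentence in sentences:
            words = re.findall(r"\w+", sentence)
            if words:
                words_per_sentence.append(len(words))
                word_lengths.extend(len(w) for w in words)
    return {
        "row_count": len(rows),
        "missing_values": len(rows) - len(present),
        "unique_values": len(set(present)),
        "avg_words_per_sentence": mean(words_per_sentence) if words_per_sentence else 0.0,
        "avg_chars_per_word": mean(word_lengths) if word_lengths else 0.0,
        "avg_sentence_count": mean(sentence_counts) if sentence_counts else 0.0,
    }


def training_lines_duplicated(training, synthetic):
    """Count synthetic rows that are exact copies of some training row."""
    seen = set(training)
    return sum(1 for row in synthetic if row in seen)
```

Comparing the two columns of the table then amounts to calling `summarize` once on the training rows and once on the synthetic rows, and `training_lines_duplicated` on the pair.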

Semantic Similarity Principal Component Analysis

Text Structure Similarity