Synthetic data generation

From Systems Analysis Wiki

Synthetic data generation using LLMs is a technology for artificially creating data that mimics the statistical and structural characteristics of real-world data but does not contain actual personal information. This approach, leveraging the capabilities of large language models (LLMs), has become a key tool in modern machine learning for addressing issues of data scarcity, privacy, and the high cost of manual annotation[1].

Definition and Background

What is synthetic data?

Synthetic data is artificially generated information that reproduces the statistical properties and patterns of an original, real-world dataset. The U.S. National Institute of Standards and Technology (NIST) defines it as data that preserves the statistical properties of the original but does not reveal individual details[2]. Unlike simple de-identification (anonymization), data synthesis creates entirely new records, which provides a higher level of privacy protection.

Why the Need Arose

The growing interest in synthetic generation is driven by several factors:

  • Data scarcity: In many domains, especially highly specialized ones, there is often insufficient high-quality labeled data to train robust models.
  • High cost of annotation: Manual data annotation is a labor-intensive and expensive process.
  • Privacy requirements: Legal and ethical regulations (e.g., GDPR) restrict the use of real data containing personal, medical, or financial information.
  • Class imbalance: In real-world data, some important but rare events (edge cases) may be underrepresented, preventing the model from learning them effectively.

LLMs, trained on vast corpora of text and code, have become a powerful tool for solving these problems, as they can generate coherent and diverse content that mimics the styles and distributions of real data.

Core Generation Methods Using LLMs

There are several key approaches to creating synthetic data using LLMs.

1. Prompt-based Generation

This is a direct method where the LLM generates data based on a textual request (prompt).

  • Zero-shot: The model generates examples based solely on a task description, without any samples provided. This approach promotes diversity but can lead to less relevant results.
  • Few-shot: The prompt includes a few examples (samples) of the desired output. This guides the model and increases the relevance of the generated data but carries the risk of duplication and loss of diversity, as the model tends to copy patterns[1].

2. Retrieval-Augmented Generation

This method aims to improve the factual accuracy of synthetic data and reduce the risk of hallucinations. The model does not rely solely on its internal knowledge but uses context provided from a reliable external source. For example, to generate a question-answer pair, a relevant paragraph is first retrieved from Wikipedia, and then the LLM is asked to formulate a question and answer based strictly on that text.
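
The retrieval step above can be sketched with a toy keyword-overlap retriever standing in for BM25 or dense retrieval; the corpus, scoring, and prompt wording are illustrative assumptions, and the actual LLM call that turns the prompt into a question-answer pair is not shown.

```python
# Toy retrieval-augmented generation sketch.

CORPUS = [
    "The Amazon is the largest tropical rainforest on Earth.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query: str, corpus: list[str]) -> str:
    """Return the passage sharing the most words with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda p: len(q & set(p.lower().split())))

def qa_prompt(query: str) -> str:
    """Ground the generation prompt in a retrieved passage."""
    passage = retrieve(query, CORPUS)
    return (
        f"Passage: {passage}\n"
        "Write one question answerable strictly from this passage, "
        "followed by its answer."
    )
```

Constraining the model to the retrieved passage is what reduces hallucination risk: the generated pair can be checked against a known source.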

3. Iterative Refinement and Self-Instruction

This class of methods uses a feedback loop to improve data quality. The most well-known example is the Self-Instruct method[1].

  1. The model generates an initial dataset.
  2. This data is used to fine-tune the model itself (or a copy of it).
  3. Errors and weaknesses of the model on the generated data are analyzed.
  4. The model is prompted to generate new, more complex examples similar to those it failed on.
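
A key supporting detail of Self-Instruct is that each newly generated instruction is checked for novelty against the existing pool before being added (the original method uses a ROUGE-L similarity threshold). The word-overlap ratio below is a toy stand-in for that check:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a toy proxy for the ROUGE-L similarity check."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def add_if_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Add the candidate only if it is dissimilar to everything in the pool."""
    if all(word_overlap(candidate, kept) < threshold for kept in pool):
        pool.append(candidate)
        return True
    return False

pool = ["Summarize the given paragraph in one sentence."]
add_if_novel("Summarize the given paragraph in one sentence.", pool)  # rejected
add_if_novel("Translate the following sentence into French.", pool)   # kept
```

This filter is what keeps the feedback loop from collapsing into near-duplicates of its own earlier outputs.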

A closely related scheme was used to create the well-known Stanford Alpaca dataset: 52,000 instruction-response pairs generated with OpenAI's text-davinci-003 model, which enabled the fine-tuning of the open-source LLaMA model into an instruction-following assistant.

4. Post-processing and Filtering

After data generation, filtering is typically applied to discard low-quality examples. Methods range from simple checks (removing duplicates, validating format) to more complex ones, such as:

  • Using a critic model: A separate classifier is trained to distinguish between real and synthetic data and filter out the least realistic samples.
  • Confidence-based filtering: Only those examples for which the LLM predicts the correct answer/label with high confidence are kept.
  • Data weighting: Examples suspected of being erroneous or hallucinatory are assigned a lower weight in the loss function to reduce their negative impact (the SunGen method).
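
Confidence-based filtering and data weighting can be sketched together. The probability vectors here are illustrative stand-ins for an LLM's label scores; SunGen learns its weights during training, whereas the simple proxy below just uses the maximum label probability.

```python
# Sketch of confidence-based filtering and confidence-proportional weighting.

samples = [
    {"text": "great movie, would watch again", "label_probs": [0.97, 0.03]},
    {"text": "it was fine, i guess", "label_probs": [0.55, 0.45]},
]

def confidence_filter(data: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep only examples labeled with high model confidence."""
    return [s for s in data if max(s["label_probs"]) >= threshold]

def loss_weight(sample: dict) -> float:
    """Down-weight suspect examples in the loss instead of discarding them."""
    return max(sample["label_probs"])

kept = confidence_filter(samples)
weights = [loss_weight(s) for s in samples]
```

The two strategies trade off differently: hard filtering shrinks the dataset, while weighting retains every example but limits the damage a bad one can do.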

5. Learning from Execution Feedback

This method is particularly effective for generating program code. Unlike natural language text, code has a formal correctness criterion—it can be executed. The cycle is as follows:

  1. The LLM generates code to solve a task.
  2. The code is automatically executed and checked against tests.
  3. Correct solutions are included in the training set. Incorrect ones are discarded, or the model receives a signal (reward) to correct the error.
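
The cycle above can be sketched as an execution-based filter. The candidate strings are hardcoded stand-ins for LLM outputs, and a real pipeline would run them in a sandbox rather than in-process:

```python
# Sketch of execution-based filtering for generated code.

CANDIDATES = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",  # a plausible buggy generation
]

UNIT_TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

def passes_tests(source: str) -> bool:
    """Execute a candidate and check it against the unit tests."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # sandbox this in practice: the code is untrusted
        return all(namespace["add"](*args) == expected
                   for args, expected in UNIT_TESTS)
    except Exception:
        return False

training_set = [c for c in CANDIDATES if passes_tests(c)]
```

Because correctness is decided by execution rather than by the model's own judgment, this filter avoids the circularity that affects LLM-based critics.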

Applications of Synthetic Data

  • Improving performance in low-data settings: Synthetic data is most effective when real labeled data is scarce. Studies show that adding 100 synthetic examples to 100 real ones can increase a classifier's accuracy by 3–26%[3].
  • Creating instruction datasets (Instruction Tuning): Projects like Alpaca and Code Alpaca have demonstrated that LLMs can be used to create large, high-quality datasets for training assistant models almost from scratch.
  • Information Retrieval and Question Answering (QA): The InPars method uses an LLM to generate search queries for existing documents. This allows query-relevant-document pairs for training retrieval systems to be created automatically.
  • Privacy protection: In medicine and finance, synthetic data is used to train models without access to real personal data. For example, the U.S. Department of Veterans Affairs generated synthetic medical data during the COVID-19 pandemic to facilitate information sharing[2].
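
The InPars-style query generation mentioned above can be sketched as follows; `fake_llm` is a deterministic stand-in for the real model call, and the document and prompt wording are illustrative assumptions.

```python
# Toy InPars-style sketch: prompt an LLM to write a query each document
# answers, producing (query, document) pairs for retriever training.

DOCS = [
    "Photosynthesis converts sunlight into chemical energy in plants.",
]

def fake_llm(prompt: str) -> str:
    # a real LLM would generate a natural question conditioned on the prompt
    return "How do plants turn sunlight into usable energy?"

def make_training_pairs(docs: list[str]) -> list[tuple[str, str]]:
    """Pair each document with a generated query that it answers."""
    pairs = []
    for doc in docs:
        prompt = f"Document: {doc}\nWrite a question that this document answers."
        pairs.append((fake_llm(prompt), doc))
    return pairs

pairs = make_training_pairs(DOCS)
```

Each generated pair serves as a positive example for training a retriever, with negatives typically sampled from unrelated documents.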

Benefits and Risks

Benefits

  • Cost reduction and accelerated development: Data generation by a model is significantly cheaper and faster than manual annotation.
  • Scalability: Synthetic data can be generated in virtually unlimited volumes.
  • Controllability: Developers can flexibly configure the composition, style, and complexity of the generated data.
  • Privacy compliance: Provides a de-identified alternative for working with sensitive data.
  • Model robustness: Training on diverse and even "tricky" synthetic examples makes models less prone to overfitting and more robust to out-of-distribution inputs.

Limitations and Risks

  • Factual inaccuracies (hallucinations): LLMs can generate incorrect facts, which, when included in a training set, become ingrained in new models.
  • Lack of realism: Synthetic texts can be too formulaic, formal, or fail to reflect the full diversity of natural language, which reduces the model's generalization ability.
  • Amplification of systemic bias: LLMs inherit and can amplify social stereotypes and biases present in their training data.
  • Risk of "model collapse": A phenomenon where repeatedly training models on data generated by previous model versions leads to a gradual degradation of quality and "forgetting" of rare phenomena.
  • Potential privacy leaks: Without special measures (e.g., differential privacy), LLMs may accidentally reproduce fragments of real data from their training set, which poses a risk of de-anonymization[4].

Prospects and Research Directions

  • Prompt engineering automation: Developing methods that automatically find optimal prompts for generating high-quality data.
  • Multimodal synthetic generation: Extending methodologies to generate combined data (text + image, audio, video).
  • Development of quality metrics: Creating standardized benchmarks to evaluate the utility, diversity, and realism of synthetic data.
  • Bias management: Developing methods to control and reduce bias in generated data, for example, by generating counterfactual examples.
  • Safe industry adoption: Developing legal and ethical standards for the use of synthetic data in critical domains.

Literature

  • Ye, J. et al. (2025). Synthetic Data Generation Using Large Language Models: Advances in Text and Code. arXiv:2503.14023.
  • Wang, Y. et al. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
  • Gao, J. et al. (2022). Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning. arXiv:2205.12679.
  • Jeronymo, V. et al. (2023). InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv:2301.01820.
  • Li, Z. et al. (2023). Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. ACL 2023.
  • Shumailov, I. et al. (2023). Nepotistically Trained Generative-AI Models Collapse. arXiv:2311.12202.
  • Long, L. et al. (2024). On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. ACL Findings 2024.
  • Gao, J. C. et al. (2024). Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Synthetic Datasets. OpenReview.
  • Gehring, J. et al. (2025). RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. arXiv:2410.02089.
  • Barr, A. A. et al. (2025). Large Language Models Generating Synthetic Clinical Datasets: A Feasibility and Comparative Analysis with Real-World Perioperative Data. Frontiers in AI.
  • Rao, H. et al. (2025). A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications. arXiv:2506.16594.

Notes

  1. Ye, J., et al. "Synthetic Data Generation Using Large Language Models: Advances in Text and Code". arXiv:2503.14023 [cs.CL], March 20, 2025.
  2. "Federal chief data officers seek information on synthetic data generation". FedScoop.
  3. Li, Zhuoyan, et al. "Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations". ACL Anthology, 2023.
  4. Schoen, F. P., et al. "Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data". Frontiers in Artificial Intelligence, 2025.