The promise and perils of synthetic data
Synthetic data can train AI but has biases and risks; human input remains crucial.
As real-world training data becomes increasingly difficult to obtain, synthetic data is emerging as a viable alternative for training AI models. Companies like Anthropic and OpenAI are experimenting with synthetic training data, a potentially cheaper and more sustainable option than traditional data sources. Generating synthetic data is not without challenges, however: it can replicate biases present in the original datasets, so generated data must be examined carefully to ensure the resulting models remain robust.
Moreover, while synthetic data can supplement annotated datasets, over-reliance on it can erode model quality and diversity. Researchers at Rice University and Stanford found that a model's output diversity worsens over successive generations of synthetic training unless the synthetic data is interspersed with real-world data. Sampling biases aside, using complex models like OpenAI's o1 to generate synthetic data also increases the risk of hallucinations, which could degrade model accuracy.
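The collapse effect the Rice and Stanford researchers describe can be illustrated with a toy simulation (this is an illustrative sketch, not their actual experiment): a "model" that just fits a one-dimensional Gaussian to its data, then trains each new generation on samples from the previous one. Trained purely on its own synthetic output, the fitted spread, a crude stand-in for diversity, shrinks toward zero; mixing fresh real data into each generation keeps it stable.

```python
import random
import statistics

def fit(data):
    """'Train' a toy model: fit a Gaussian (mean, stdev) to the data."""
    return statistics.mean(data), statistics.stdev(data)

def sample(model, n, rng):
    """Generate n synthetic data points from the fitted model."""
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(20)]

# Generation after generation trained purely on the previous model's
# output: the fitted spread collapses toward zero.
model = fit(real)
for _ in range(1000):
    model = fit(sample(model, 20, rng))
collapsed_sigma = model[1]

# Same loop, but each generation mixes fresh real-world data back in:
# the spread stays close to the real distribution's.
model = fit(real)
for _ in range(1000):
    fresh_real = [rng.gauss(0.0, 1.0) for _ in range(20)]
    model = fit(fresh_real + sample(model, 20, rng))
mixed_sigma = model[1]

print(f"synthetic-only sigma: {collapsed_sigma:.6f}")
print(f"mixed-data sigma:     {mixed_sigma:.6f}")
```

The sample sizes, generation count, and Gaussian setup here are arbitrary choices for demonstration; real language-model collapse is far more complex, but the qualitative pattern, shrinking diversity without an injection of real data, is the same one the researchers report.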
The AI community acknowledges the risks of unfiltered synthetic data and favors a collaborative approach that pairs it with real data and human oversight. Experts like Luca Soldaini emphasize that synthetic data must be carefully curated before use. While synthetic data could significantly improve AI training efficiency, the technology cannot yet fully replace human insight, and human intervention remains necessary to prevent model failures.