The AI industry is grappling with a significant challenge: the supply of real-world data that has been crucial for training AI models is rapidly running out. Companies such as OpenAI and Google have relied on scraping data from the internet to fuel the large language models (LLMs) that power their AI applications. However, research firm Epoch AI predicts that the stock of usable public text could be exhausted by 2028, while tightening restrictions on data usage further limit access to valuable training data.
In response to this data scarcity, synthetic data has emerged as a potential alternative. Synthetic data refers to artificially generated data produced by AI systems trained on real-world data. Proponents argue that synthetic data can address the limitations of human-generated data, such as the need for cleaning and labeling. It also offers the tantalizing prospect of generating training data at a lower cost and seemingly infinite scale.
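To make the idea concrete, the sketch below shows how synthetic text examples might be generated from an existing model. It is a minimal illustration, not any company's actual pipeline: the use of GPT-2 through the Hugging Face transformers library, the seed prompts, and the crude length filter are all assumptions chosen for brevity.

```python
# Minimal sketch: producing synthetic text examples from a pretrained model.
# Assumptions: the Hugging Face `transformers` library is installed and GPT-2
# stands in for whatever generator a production pipeline would actually use.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

seed_prompts = [
    "Explain why the sky appears blue:",
    "Summarize the main causes of the French Revolution:",
]

synthetic_examples = []
for prompt in seed_prompts:
    outputs = generator(
        prompt,
        max_new_tokens=60,        # keep completions short for the demo
        num_return_sequences=3,   # several synthetic variants per seed prompt
        do_sample=True,           # sampling yields diverse rather than identical text
    )
    for out in outputs:
        text = out["generated_text"]
        # Crude quality filter; real pipelines apply much stricter checks.
        if len(text.split()) > 20:
            synthetic_examples.append({"prompt": prompt, "completion": text})

print(f"Kept {len(synthetic_examples)} synthetic examples")
```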
However, the use of synthetic data has sparked a heated debate within the AI community. Some researchers caution that overreliance on synthetic data could poison AI models with poor-quality information, potentially causing their performance to collapse. A recent study by Oxford and Cambridge researchers found that repeatedly training models on AI-generated data eventually caused them to produce nonsensical outputs. They argue that synthetic data must be balanced with real-world data to keep models effective.
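The failure mode the Oxford and Cambridge researchers describe can be reproduced in miniature without any neural network. The toy simulation below is an analogy rather than the study's actual setup: it repeatedly fits a simple Gaussian "model" to samples drawn from the previous generation's fit, and over generations the distribution tends to narrow and drift, losing the rare values present in the original data.

```python
# Toy illustration of model collapse: each "generation" is trained only on
# samples produced by the previous generation's model. Here the "model" is
# just a Gaussian fit (mean and standard deviation), so the effect is easy
# to see numerically without any neural network.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real-world" data with mean 0 and standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 21):
    mu, sigma = data.mean(), data.std()
    # The next generation never sees the original data, only synthetic
    # samples drawn from the model fitted to the previous generation.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# Typical result: the estimated spread tends to shrink and the mean drifts,
# because estimation error compounds and lost "tail" information is never
# recovered -- the statistical core of model collapse.
```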
The scarcity of real-world data has prompted more companies to explore synthetic data. Research firm Gartner predicts that by 2024, 60% of the data used for developing AI will be synthetically generated. However, skeptics, including cognitive scientist and AI critic Gary Marcus, argue that synthetic data alone cannot address the fundamental limitations of AI systems, such as their inability to reason and plan effectively.
Tech companies have been aggressively pursuing publicly available data to train their AI models, prompting data owners to impose tighter restrictions. OpenAI and Google have even paid substantial sums for access to data from platforms like Reddit and news outlets. Even these sources are finite, however, and few untapped areas of the textual web remain for data collection.
To overcome these challenges, companies like Nvidia and Tencent have developed AI models and tools for generating synthetic data. Startups such as Gretel and SynthLabs have also emerged, specializing in generating and selling specific types of synthetic data. The potential benefits of synthetic data include filling gaps in human-generated data and counteracting biases present in real-world data.
While synthetic data offers some advantages, concerns remain about its potential to undermine AI models. Researchers warn that indiscriminate use of synthetic data during training can introduce irreversible defects and lead to model collapse. The term “Habsburg AI” has been coined to describe models trained so heavily on the outputs of other AIs that they degrade, like an inbred lineage, into distorted versions of themselves.
Some companies are exploring hybrid data, a combination of synthetic and non-synthetic data, to mitigate these risks; Scale AI, for example, sees hybrid data as the future (a simple version of the idea is sketched below). Alternative approaches are also being explored, such as neuro-symbolic AI, which combines deep learning models with rule-based logical reasoning.
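Neither Scale AI nor others have published exact hybrid-data recipes, but the underlying idea is straightforward: keep real data in the mix and cap the share of synthetic examples so it cannot dominate training. The sketch below illustrates that idea; the 30% cap and the placeholder datasets are assumptions for illustration, not anyone's documented configuration.

```python
# Minimal sketch of a hybrid-data mix: combine real and synthetic examples
# while keeping the synthetic share below a fixed cap. The 30% cap and the
# placeholder datasets are assumptions for illustration only.
import random

real_examples = [f"real_{i}" for i in range(700)]
synthetic_examples = [f"synthetic_{i}" for i in range(5000)]

MAX_SYNTHETIC_FRACTION = 0.3  # illustrative cap, not an established best practice

def build_hybrid_dataset(real, synthetic, max_synth_frac, seed=0):
    """Combine all real data with at most max_synth_frac synthetic data."""
    rng = random.Random(seed)
    # Given len(real) real examples, the largest synthetic count that keeps
    # synthetic / (real + synthetic) <= max_synth_frac is:
    max_synth = int(len(real) * max_synth_frac / (1 - max_synth_frac))
    mixed = real + rng.sample(synthetic, min(max_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed

dataset = build_hybrid_dataset(real_examples, synthetic_examples,
                               MAX_SYNTHETIC_FRACTION)
synth_share = sum(x.startswith("synthetic") for x in dataset) / len(dataset)
print(f"{len(dataset)} examples, synthetic share = {synth_share:.1%}")
```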