Businesses and organizations around the world are coming to grips with the harsh reality of deploying artificial intelligence safely.
After barging onto the global scene and fueling the biggest tech hype of the past decade, AI presented itself as a magical black box that could do almost anything. The black box label, though, carries a negative connotation: nobody quite knows what is going on inside.
Still, progress marches forward, and the push to integrate AI into operations everywhere is proving irresistible.
But companies are learning, often the hard way, that AI risks are abundant, and — when mismanaged — the risks far outweigh the benefits.
The biggest problems with AI? Compliance, data consent, copyright, training data, and bias. Synthetic data, generated algorithmically rather than collected from the real world, can mimic real-world data and could be the key to unlocking AI’s full potential. But can it save AI?
Key Takeaways
- Synthetic data offers several benefits over traditional data. It’s cheaper, customizable, can reduce bias and privacy concerns, and allows for diverse scenario testing.
- However, it requires careful validation and human oversight to ensure realistic outputs and ethical use.
- Synthetic data is valuable in various fields like healthcare (faster clinical trials), autonomous vehicles (simulating rare events), and finance (preserving data privacy).
- It can also help mitigate model drift by incorporating a wider range of scenarios into training.
Another Letter of Warning: How Synthetic Data Responds
On June 4, 2024, former and current employees of OpenAI and Google DeepMind published an open letter urging leading AI companies to let their workers speak freely about the risks of AI.
Bound by non-disclosure agreements and operating in a legal limbo, since current whistleblower laws do not apply where AI risks are not yet regulated as crimes, the workers, most of whom signed the letter anonymously for fear of retaliation, say AI can deepen inequalities, manipulate people and societies, and drive misinformation.
Bias, model drift, data used in training, consent, and copyright management are at the heart of these risks.
Synthetic data, which gained popularity before today’s leading AI models went viral, offers several advantages over real-world data. It is cheaper, abundant, and accessible; it can be tailored and customized for different needs; it can be made diverse enough to counter bias; it does not require consent; and it does not run afoul of laws such as copyright regulations.
Advances in synthetic data also allow for hyperrealistic artificial data generation that aligns with real-world data.
The Brainy Insights estimates that the global synthetic data generation market will grow from $316.11 million in 2023 to more than $6.2 billion by 2033, driven by its potential to impact countless applications in the AI era.
But if synthetic data has so many benefits, why isn’t it a bigger part of the conversation about AI?
Ilia Badeev, Head of Data Science at Trevolution Group, a company bringing new technologies to the travel industry, spoke to Techopedia about how his team uses synthetic data.
“We use AI to generate synthetic training data that resembles real data without involving our clients’ information,” Badeev said. “This allows us to train AI models effectively without even the slightest risk of compromising our user privacy.
“Synthetic data can, however, still inherit or even amplify biases if the generation algorithms themselves are biased.”
Badeev explained that, for example, generative adversarial networks (GANs) can produce high-quality images, but if the data used to train the GANs is biased, the generated data will reflect those biases.
“Achieving the same richness and variability as real-world data is challenging,” Badeev added. The solution? To meticulously validate it. “Just as you would with real-world data,” Badeev said.
The good news? Cleaning, deduplication, and verification can be done by the AI itself.
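As a rough illustration of what that validation step can look like, the sketch below (a minimal Python example, not Trevolution’s actual pipeline; the column handling and thresholds are assumptions) deduplicates a synthetic table and compares its numeric columns against the real data with a two-sample Kolmogorov-Smirnov test.

```python
import pandas as pd
from scipy.stats import ks_2samp

def validate_synthetic(real: pd.DataFrame, synth: pd.DataFrame,
                       p_threshold: float = 0.05) -> pd.DataFrame:
    """Basic sanity checks before synthetic rows are used for training."""
    # Drop duplicates within the synthetic set, then drop any synthetic row
    # that is an exact copy of a real record (copies defeat the privacy goal).
    synth = synth.drop_duplicates()
    synth = synth.loc[~synth.apply(tuple, axis=1).isin(real.apply(tuple, axis=1))]

    # Compare each numeric column's distribution against the real data.
    for col in real.select_dtypes("number").columns:
        result = ks_2samp(real[col], synth[col])
        if result.pvalue < p_threshold:
            print(f"Column '{col}' diverges from the real data "
                  f"(KS={result.statistic:.3f}, p={result.pvalue:.4f})")
    return synth
```

In practice, teams would layer on richer checks, such as correlation and outlier comparisons, but the principle is the same: synthetic data earns its place in training only after it has been measured against the real thing.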
When Data Anonymization Holds Back AI Innovation
Studies warn that AI in healthcare is a double-edged sword: it enables breakthrough advances but also raises the risk that a patient’s personal medical records become accessible to people who should never see them.
Healthcare is not the only sector facing this problem. From government to finance and research, numerous industries struggle to deploy AI because of the strict data anonymization and accuracy standards they must meet to operate.
Torsten Staab, principal engineering fellow and head of Innovation and AI at Nightwing, an intelligence services company working to advance national security interests, spoke to Techopedia about the issue.
“Synthetic data can also be algorithmically designed to exclude personally identifiable information, which might be irrelevant for certain model training tasks anyway, thus eliminating potential privacy concerns.”
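To make that idea concrete, here is a minimal sketch of the approach, assuming a simple tabular dataset with known PII columns (the column names are hypothetical). It drops the PII fields entirely and resamples each remaining column independently, which preserves per-column statistics but deliberately ignores the cross-column correlations that a production-grade generator would model.

```python
import numpy as np
import pandas as pd

# Hypothetical PII column names; a real pipeline would use a proper generative
# model (e.g., a GAN or copula) rather than independent resampling.
PII_COLUMNS = ["name", "email", "ssn"]

def synthesize_without_pii(real: pd.DataFrame, n_rows: int,
                           seed: int = 0) -> pd.DataFrame:
    """Generate synthetic rows that never contain the PII columns.

    Each remaining column is resampled independently from its own empirical
    distribution, so no original record is reproduced as a whole row.
    """
    rng = np.random.default_rng(seed)
    usable = real.drop(columns=PII_COLUMNS, errors="ignore")
    synthetic = {
        col: rng.choice(usable[col].to_numpy(), size=n_rows, replace=True)
        for col in usable.columns
    }
    return pd.DataFrame(synthetic)
```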
By avoiding the cloning of potentially copyrighted materials, the risk of copyright infringement can also be lowered significantly, Staab explained.
“Synthetic data can also be used to help train models in a more ethical, controlled manner, preventing models from unfairly targeting or favoring a specific set of outputs.”
Staab warned that despite this potential, synthetic data is not a silver bullet.
“Checks and balances, in the form of human oversight, must be put in place to ensure that the algorithms used to generate synthetic data are unbiased and produce realistic outputs.”
Feeding non-representative, unrealistic synthetic data into a machine-learning model could potentially create even more harm. “To reduce bias, consent, copyright, and privacy conflicts, there must be a balance between the use of synthetic and real-world data,” Staab said.
Synthetic Data in the Pharma Industry: Better, Faster, Cheaper
Amber Gosney, a managing director within FTI Consulting’s Information Governance, Privacy & Security practice, spoke to Techopedia about synthetic data in the pharma industry.
Gosney pointed to studies showing that, in the clinical trials space, a synthetic data set can be more useful or valuable than anonymized data. The Accenture report “Faster and cheaper clinical trials” argues that an operating model that effectively integrates synthetic data into clinical trial design is essential for pharma companies to stay ahead of the game.
“Synthetic data can remain in the same (i.e. structured) format as the original data set and is often faster to produce than using regular anonymizing techniques,” Gosney said.
“It can also help with scaling problems, such as with rare diseases where the number of participants for a clinical trial might be very low.”
Gosney explained that a clinical trial data set could also be made “fairer” for under-represented groups in the trial that might otherwise experience disproportionate outcomes from a drug or product.
Model Drift: Real-World Data Vs Synthetic Data
‘Model drift’ is a machine learning (ML) term for the degradation in the performance and accuracy of an ML or AI system, usually caused by a widening gap between the data the model was trained on and the data it encounters once deployed.
For example, when the global pandemic hit, organizations around the world soon discovered that their AI models were drifting, providing inaccurate or misleading outputs.
The reason was the unexpected shift in data and behaviors that COVID-19 triggered globally. This sudden wave of unfamiliar data rendered many models ineffective. Left unmanaged, unmonitored, and without updates, every AI system tends to drift as the world keeps generating new information.
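One common way teams catch this kind of drift is to compare the distribution of live data against the training data. The sketch below is a minimal, generic example using the population stability index (PSI); the thresholds quoted in the comment are widely used rules of thumb rather than hard standards.

```python
import numpy as np

def population_stability_index(train: np.ndarray, live: np.ndarray,
                               bins: int = 10) -> float:
    """Population Stability Index between training data and live data.

    Rule of thumb often used in practice: PSI < 0.1 means little change,
    0.1-0.25 suggests moderate drift, and > 0.25 signals significant drift
    that is usually worth retraining for.
    """
    # Bin edges come from the training distribution; live values outside
    # that range simply fall out of the histogram in this simple sketch.
    edges = np.histogram_bin_edges(train, bins=bins)
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)

    # Avoid division by zero and log(0) in sparse bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)

    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))
```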
Badeev from Trevolution recognized that synthetic data may lack the complexity and richness that real-world data offers.
“However, synthetic data can be generated to include rare events, ensuring models are exposed to scenarios that might be underrepresented or non-existent in real-world data.”
Badeev said that, for example, in autonomous driving, synthetic data can simulate severe weather conditions and unusual or extreme driving scenarios, which might be lacking in real-world data but critical for safer operation.
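A simplified way to picture that idea, well short of a full driving simulator, is to oversample the few rare examples a dataset does contain and jitter them slightly so a model sees the scenario often enough to learn from it. The sketch below assumes a generic labeled table; the noise scale and counts are illustrative, not any vendor’s actual method.

```python
import numpy as np
import pandas as pd

def augment_rare_scenarios(data: pd.DataFrame, label_col: str,
                           rare_label: str, target_count: int,
                           noise_scale: float = 0.05,
                           seed: int = 0) -> pd.DataFrame:
    """Oversample a rare scenario by resampling its rows and adding small
    Gaussian jitter to the numeric features."""
    rng = np.random.default_rng(seed)
    rare = data[data[label_col] == rare_label]
    needed = max(target_count - len(rare), 0)
    if needed == 0:
        return data

    # Resample with replacement, then perturb numeric columns slightly so the
    # new rows are near-duplicates rather than exact copies.
    samples = rare.sample(n=needed, replace=True, random_state=seed).copy()
    numeric_cols = samples.select_dtypes("number").columns
    samples[numeric_cols] += rng.normal(0.0, noise_scale,
                                        size=samples[numeric_cols].shape)
    return pd.concat([data, samples], ignore_index=True)
```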
Staab from Nightwing added that synthetic data can augment limited real-world data sets with a broader range of scenarios, improving a model’s accuracy and robustness and significantly lowering training costs.
He added:
“A car manufacturer’s ability to train its self-driving car algorithms on billions of miles of synthetically generated roads and complex traffic scenarios provides a significant competitive advantage.”
However, training a model with synthetic training data that is biased or unrepresentative of real-world conditions could reduce the model’s output accuracy to the point where the model becomes useless or even a liability. Staab warned that a model could drift in the wrong direction, effectively decoupling it from reality.
The Bottom Line
Synthetic data shines in crafting controlled experiments for AI models. It allows researchers to probe a model’s response to specific inputs, offering a window into its decision-making process.
Even more valuable, synthetic data enables testing models across diverse scenarios, ensuring consistent and predictable behavior. This is critical in safety-sensitive fields like healthcare, finance, and autonomous vehicles, where model reliability is paramount.
However, while synthetic data is a powerful tool, it’s not a magic solution. A balanced approach that incorporates real-world data and human oversight remains essential.
Real-world data grounds the model in the complexities of the actual environment in which it will operate. Human expertise serves as a crucial check, ensuring the model’s goals align with ethical considerations and real-world applications.