What is synthetic data and how is it used for AI?


Over the last few months, the term ‘synthetic data’ has been cropping up more and more frequently among tech leaders across the world, heralded as a possible solution to the looming data shortage crisis.

Synthetic data has been around far longer than the current explosion in AI adoption, used as a technique by businesses to alter or synthesize data sets to mask information, achieve particular outcomes, or fill in knowledge gaps.

In short, it’s artificial data generated by algorithms to mimic organically created data, per a definition of synthetic data from the European Data Protection Supervisor (EDPS).

This means synthetic data and original data should deliver similar results when used for tasks such as statistical analysis and data analytics. The potential? Businesses can utilize the characteristics of user-generated data without having to harvest and use the data itself, or leverage the benefits of big data for data-driven decision making even if their organic datasets are too small.

How is synthetic data made?

Historically, synthetic data has been created using Generative Adversarial Networks (GANs), an approach in which a machine learning (ML) model uses a pair of competing neural networks to create new data.

As described by Google Cloud, GANs work by pitting a ‘generator’ and a ‘discriminator’ against one another. The generator creates fake data, and the discriminator has to spot the difference between this fake data and real data.

While the generator initially creates data that is easily identifiable as fake, it gradually begins to learn how to mimic real data more effectively. This process can be used to generate valuable synthetic data.
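As a rough illustration of that adversarial loop, the sketch below trains a tiny GAN in PyTorch (an assumption; neither Google Cloud's description nor the article names a framework) on toy numerical data standing in for real business records:

```python
# Minimal GAN sketch: a generator learns to mimic a toy "real" distribution
# while a discriminator learns to tell real samples from fake ones.
# The 2D Gaussian data is a stand-in for real business records.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator: turns random noise into candidate synthetic samples
generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))

# Discriminator: scores how "real" a sample looks
discriminator = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def sample_real(batch_size: int) -> torch.Tensor:
    """Stand-in for a batch of real data (here: a shifted Gaussian)."""
    return torch.randn(batch_size, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(2000):
    real = sample_real(64)
    fake = generator(torch.randn(64, 8))

    # 1) Train the discriminator to separate real from fake
    d_opt.zero_grad()
    d_loss = (
        loss_fn(discriminator(real), torch.ones(64, 1))
        + loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    )
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the (just-updated) discriminator
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

# After training, the generator produces synthetic samples on demand
synthetic_batch = generator(torch.randn(10, 8)).detach()
print(synthetic_batch)
```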

Now, however, large language models (LLMs) can be used to generate synthetic data. Users can prompt a model to create synthetic versions of existing data that share its characteristics. This method has its advantages and can be a useful way of producing synthetic data without having to build and train a GAN.
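A minimal sketch of that prompting approach, using the OpenAI Python SDK, might look like the following; the model name, column schema, and prompt wording are illustrative assumptions rather than a recommendation:

```python
# Hedged sketch of LLM-based synthetic data generation via the OpenAI SDK.
# The example rows and model choice are invented for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

example_rows = """age,region,monthly_spend
34,North West,42.10
58,Scotland,17.95
23,London,63.40"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model would do
    messages=[
        {
            "role": "user",
            "content": (
                "Here are three rows of customer data:\n"
                f"{example_rows}\n\n"
                "Generate 20 new CSV rows with the same columns and similar "
                "statistical characteristics, but do not copy any row verbatim."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```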

How can businesses deploy synthetic data?

Utilizing just the characteristics of data has many applications in a business context, from software development to ensuring data privacy compliance. For the latter, synthetic data is a useful tool that allows businesses to get value out of their data without breaching any data privacy regulation.

Synthetic data can be used to strengthen datasets from a privacy perspective in three main ways: businesses can create entirely synthetic data, they can create partially synthetic data that has had the sensitive information redacted, or they can tokenize or encrypt the data.
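A brief sketch of the second and third approaches, redacting sensitive fields to produce partially synthetic records and tokenizing identifiers so they can still be joined on, might look like this in Python (the field names and schema are invented for illustration):

```python
# Partially synthetic records: redact direct identifiers, tokenize join keys,
# keep the analytical fields intact. Field names are illustrative only.
import hashlib
import secrets

SALT = secrets.token_hex(16)  # kept secret so tokens can't be reversed by lookup

def tokenize(value: str) -> str:
    """Replace an identifier with a salted, one-way token."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

def partially_synthesize(record: dict) -> dict:
    """Produce a privacy-safer version of a single customer record."""
    return {
        "customer_id": tokenize(record["customer_id"]),  # tokenized
        "name": "[REDACTED]",                            # redacted
        "region": record["region"],                      # kept as-is
        "monthly_spend": record["monthly_spend"],        # kept as-is
    }

record = {"customer_id": "C-10492", "name": "Jane Doe",
          "region": "Yorkshire", "monthly_spend": 42.10}
print(partially_synthesize(record))
```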

Royles also explains how synthetic data can be used to generate useful data for industry-specific models, based in part on the existing data an enterprise may hold, such as customer purchases, phone call data in telecommunications, or transaction statistics in financial services.

“What’s interesting is, as a business, that data becomes proprietary to you. It’s unique to you and therefore it might not represent the full spectrum of scenarios. You might not have all customers of every age range, say. You might not have every customer who lives in the same geographic region,” Royles explains.

“So what synthetic data enables you to do in the first stage is take the data that you do have and supplement it with the data you don’t have. Think about the white space you might need to fill in order to create a better representation of the data you’re going to use to train the model.”

Andreas Kollegger, generative AI lead at Neo4j, tells ITPro that synthetic data also has a long history of use for software development, particularly when it comes to testing software.

“From the first moment somebody commits some code, almost the second thing they do – sometimes the first – is they write some tests,” Kollegger says. “In order to test if the code works, you need some kind of input to run your software and figure out if it’s doing the thing.”

Synthetic data works well in these situations, as developers can use it to test whether their code or software is operating correctly without having to source real data or put unfinished software live just so it can interact with realistic inputs.
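As a brief illustration, the test below feeds a thousand seeded, randomly generated inputs into a hypothetical discount-calculation function; both the function and its business rules are invented purely to show the pattern:

```python
# Synthetic test inputs for a hypothetical piece of business logic.
import random
import unittest

def apply_discount(price: float, loyalty_years: int) -> float:
    """Hypothetical rule under test: 2% off per loyalty year, capped at 20%."""
    discount = min(0.02 * loyalty_years, 0.20)
    return round(price * (1 - discount), 2)

class TestApplyDiscount(unittest.TestCase):
    def test_discount_stays_within_bounds(self):
        rng = random.Random(42)  # seeded so the synthetic cases are reproducible
        for _ in range(1000):
            price = rng.uniform(0.01, 10_000)
            years = rng.randint(0, 50)
            discounted = apply_discount(price, years)
            # Never more than the 20% cap off, never more than the original price
            self.assertGreaterEqual(discounted, price * 0.80 - 0.01)
            self.assertLessEqual(discounted, price + 0.01)

if __name__ == "__main__":
    unittest.main()
```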

Why is synthetic data useful for AI?

Synthetic data has several applications in the context of AI. First, it can be used in its traditional sense, as a way of generating large amounts of sanitized data for training AI models. As Kollegger reminds ITPro, businesses can’t expect information they enter into an AI system to be forgotten, which can be troubling if that information is sensitive.

“It’s made to remember the details. That’s the entire function of it. And so it becomes incredibly important, in the context of doing training, to do the data masking that you’re talking about,” Kollegger says.

“The personally identifying information (PII), all of that, has got to be stripped out before you feed it to an LLM, for sure, because the LLM will remember those things and then you will compromise people’s privacy,” he adds.
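A rough sketch of that kind of PII stripping, using simple regular expressions, is shown below. Real pipelines typically rely on dedicated named-entity recognition or data-loss-prevention tooling; these patterns are illustrative and far from exhaustive:

```python
# Mask common PII patterns in free text before it is used for training.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.I),
}

def mask_pii(text: str) -> str:
    """Replace matches of each pattern with a labelled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

raw = "Contact Jane at jane.doe@example.com or 07700 900123, postcode SW1A 1AA."
print(mask_pii(raw))
# -> "Contact Jane at [EMAIL] or [PHONE], postcode [UK_POSTCODE]."
```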

As mentioned above, organizations can also use AI to create synthetic data. Dael Williamson, EMEA CTO at Databricks, discusses this in more detail with ITPro, explaining that AI can be used to create data that is not entirely synthetic, but rather “synthesized real data”. This can in turn be used to train or fine-tune larger AI models that require enormous datasets.

“You create this life cycle, so it’s not as simple as old, deterministic systems, where you write an algorithm, put it in production, and leave it for the next four decades,” he adds.

Synthetic data vs organic data in AI training

While AI was initially fuelled by organic data, trained on the wealth of human-made information scraped from across the web, Williamson notes that there are increasing suggestions that AI developers have run out of data.

“I think it was Ilya Sutskever said this, that we’ve exhausted the fossil fuels of data. So there’s only X amount of data on the planet that’s been produced by humans, and we’ve used it all,” he tells ITPro.

While Williamson explains he doesn’t agree with the assertion that we’ve used all the available data, he acknowledges that the tech industry will ultimately hit these limits. Synthetic data could play a role in solving this issue and assist in large model development.

But how will synthetic data stack up to organic data? Asked whether the integrity of synthetic data is viable over time, Williamson suggests that there will continue to be a need for some organic data.

“The copy of a copy is fine,” he says. “The copy of a copy of a copy of a copy eventually degrades the effectiveness of it. That’s why, through this process, you have to collect actual traces from real world human interactions.”

Users can monitor their models to collect real-world traces that can be fed back into the creation of synthetic data. Because humans are interacting with the model or are involved in creating the synthetic data, elements of that organic signal end up in each new batch of synthetic data and prevent it from losing value.

“So what we do is we balance – we balance the real world with the synthetic and with the humans in the loop,” he adds. “It’s this harmony and equilibrium between real-world data, real-world expertise, and the synthetic data generation to fill the gaps.”
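A simplified sketch of that balancing act might look like the following, mixing a fixed share of real, human-generated traces into each refresh of the synthetic training set; the ratio and record format are assumptions for illustration, not a documented workflow:

```python
# Blend real-world traces into a synthetic training set so organic data
# anchors every refresh and the "copy of a copy" effect is limited.
import random

def build_training_set(real_traces, synthetic_records, real_fraction=0.3, seed=0):
    """Mix real traces into the synthetic set at roughly `real_fraction` of the total."""
    rng = random.Random(seed)
    # How many real examples are needed to hit the target share of the combined set
    n_real = int(len(synthetic_records) * real_fraction / (1 - real_fraction))
    sampled_real = rng.sample(real_traces, min(n_real, len(real_traces)))
    combined = sampled_real + list(synthetic_records)
    rng.shuffle(combined)
    return combined

real = [{"source": "real", "id": i} for i in range(100)]
synthetic = [{"source": "synthetic", "id": i} for i in range(700)]
training_set = build_training_set(real, synthetic)
print(sum(1 for r in training_set if r["source"] == "real"), "real of", len(training_set))
```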

