Can synthetic data solve ML, AI, and Deep Learning problems?

Posted by : Ajit Jain / Posted on : 09-Feb-2021

"Synthetic data will be the most important topic within the next decade."

Over 90% of the world's data was produced in the last two years alone, with roughly 2.5 quintillion bytes now being generated every day. That volume keeps growing, and yet, as if this much data were not enough, we are also creating artificial data, called synthetic data. What the notion behind synthetic data is, how it is produced, and how it makes for a viable business model are some of the points we will discuss in this article.

Yes, we are sitting on a humongous amount of data and we generate more of it every day, but is this the data we need for solving Machine Learning problems? In other words, is the data required for Machine Learning actually sufficient? This is the niche we need to look into, and it is just one of the reasons for generating synthetic data.

Take a look at this report to understand exactly what data is generated every day. Most of it is consumer-side data, useful only for a narrow set of industry use cases. The majority of business use cases need high-quality structured data, and they may face a shortage of it for various reasons, such as limited availability and privacy restrictions.

Synthetic Data Solves for:

  1. Privacy: production-quality data that contains no real personal information, removing the need for masking.
  2. Product Testing: a new product has no historical data to test against.
  3. Machine Learning: data that cannot be collected in the real world, such as crash scenarios for self-driving cars.
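As a minimal sketch of the privacy point, one can fit a simple statistical model to a sensitive table and sample fresh rows from it. The table and all numbers below are invented for illustration, and the per-column normal model is a deliberate oversimplification: it ignores cross-column correlations, which real synthesizers work hard to preserve.

```python
import numpy as np

# Hypothetical "real" customer table: age and monthly spend for 6 people.
real = np.array([[34, 120.0], [29, 80.5], [45, 310.0],
                 [52, 95.0], [23, 60.0], [41, 150.0]])

rng = np.random.default_rng(42)

# Fit a simple per-column normal model and sample new rows from it:
# the synthetic rows mimic the real columns' statistics but describe no real person.
mu, sigma = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(mu, sigma, size=(100, 2))
```

Because the synthetic rows are drawn from the fitted model rather than copied, they can be shared as production-like test data without exposing any individual record.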

Consider the Machine Learning use case, where test data has to be real-world data but training data can be either fully synthetic or partially synthetic (synthetic + real). Adding a little real data to the training set can actually improve model performance. Here are a few methods of generating synthetic data via a synthetic environment or a synthetic data generator:

  1. 3D Models
  2. Structured Domain Randomization for Transfer Learning
  3. Sampling techniques like SMOTE
  4. Deep Learning Models like GAN
  5. Agent-Based Simulation
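To make the sampling-technique item concrete, here is a hand-rolled sketch of the core SMOTE idea: create new minority-class points by interpolating between a point and one of its nearest neighbours. The function name and toy data are ours; in practice one would use the `SMOTE` class from the imbalanced-learn library rather than this simplified version.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between
    each point and one of its k nearest neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from point i to every minority point
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy minority class: 5 points in 2-D
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [1.0, 0.8]])
X_new = smote_like_oversample(X_min, n_new=10)
```

Every generated point lies on a segment between two real minority points, so the synthetic data stays inside the region the minority class already occupies.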

These methods are used in practice to synthesize data for, among other applications, computer vision, time-series analysis, and even tabular transactional data. But the question that arises after all this praise is that this is, after all, artificial data, which might produce artificial results in your model. As with any process, it is therefore important to measure the accuracy of an algorithm built on synthetic data.

What if synthetic data is used to predict the spread of the next pandemic, or to train self-driving cars? These are serious applications, and one should keep in mind that outliers may be missing from synthetic data. Basic means of assessing synthetic data are:

  1. Statistical Techniques
  2. Comparing with real data
  3. QA for utility
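The first two items above can be sketched by comparing summary statistics of the synthetic data against the real data. The function name and tolerance below are illustrative assumptions; in practice one would add formal two-sample tests (e.g. Kolmogorov-Smirnov via `scipy.stats.ks_2samp`) on top of such checks.

```python
import numpy as np

def compare_marginals(real, synthetic, tol=0.2):
    """Crude fidelity check: compare mean, std, and quartiles of each column.
    Passes only if every synthetic statistic is within `tol` of the real one,
    measured relative to the real column's standard deviation."""
    for col in range(real.shape[1]):
        r, s = real[:, col], synthetic[:, col]
        scale = r.std() + 1e-12
        stats_r = [r.mean(), r.std(), *np.percentile(r, [25, 50, 75])]
        stats_s = [s.mean(), s.std(), *np.percentile(s, [25, 50, 75])]
        for a, b in zip(stats_r, stats_s):
            if abs(a - b) / scale > tol:
                return False
    return True

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 2))
good_synth = rng.normal(0.02, 1.01, size=(1000, 2))  # distribution close to real
bad_synth = rng.normal(2.0, 1.0, size=(1000, 2))     # clearly shifted distribution
```

A check like this catches gross distribution shifts, though, as noted above, it will not reveal missing outliers; those need dedicated tail and rare-event analysis.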

Synthetic data also has limitations, such as a high upfront investment of cost and time, but once the pipeline is built correctly, an endless amount of clean, analytics-ready data can be produced at almost no marginal cost. Needless to say, synthetic data will mature to the point where, like artificial diamonds, it looks no different from the real thing. One could start considering AI implementations for business processes where data is scarce or expensive to acquire.