Today’s generative AI models, such as those behind ChatGPT and Gemini, are trained on vast amounts of real data, but even all the content on the internet is not enough to prepare a model for every possible situation.
To continue to grow, these models need to be trained on simulated, or synthetic, data: scenarios that are plausible but not real. And AI developers need to do this responsibly, experts said on a South by Southwest panel, or things could go wrong quickly.
The use of simulated data in training AI models has gained new attention this year since the launch of DeepSeek AI, a new model produced in China that was trained using more synthetic data than other models, saving money and processing power.
But experts say it’s about more than the economics of collecting and processing data. Synthetic data, which is computer-generated, often by AI itself, can teach a model about scenarios that don’t exist in the real-world information it was given but that it could face in the future. That one-in-a-million possibility doesn’t have to catch an AI model by surprise if it has seen a simulation of it.
“With simulated data, you can get rid of the idea of edge cases, assuming you can trust it,” said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists spoke Sunday at the SXSW conference in Austin, Texas. “We can build a product that works for 8 billion people, in theory, as long as we can trust it.”
The hard part is making sure you can trust it.
The problem with simulated data
Simulated data has plenty of advantages. For one, it costs less to produce. You can crash-test thousands of simulated cars using software, but to get the same results in real life, you have to actually smash cars, which costs a lot of money, Udezue said.
If you’re training a self-driving car, for example, you need to capture the less common scenarios a vehicle might encounter on the roads, even if they aren’t in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He pointed to the bats that make their spectacular nightly emergence from Austin’s Congress Avenue Bridge. That scene may not show up in training data, but a self-driving car will need some sense of how to respond to a swarm of bats.
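To make that concrete, here is a minimal Python sketch of how a training pipeline might blend simulated edge cases into mostly real driving data. The scenario labels and the mixing ratio are illustrative assumptions, not details from the panel or from any production system.

```python
import random

# Hypothetical scene labels; a real pipeline would use sensor logs, not strings.
real_scenes = ["highway_cruise", "city_stop_and_go", "parking_lot"]
simulated_edge_cases = ["bat_swarm_over_bridge", "sudden_hailstorm", "deer_at_dusk"]

def build_training_batch(batch_size=32, synthetic_fraction=0.2):
    """Sample a batch that is mostly real data, topped up with simulated rarities."""
    n_synthetic = int(batch_size * synthetic_fraction)
    batch = random.choices(real_scenes, k=batch_size - n_synthetic)
    batch += random.choices(simulated_edge_cases, k=n_synthetic)
    random.shuffle(batch)
    return batch

print(build_training_batch())
```

The point of mixing this way is that rare events show up during training far more often than they occur in the wild, so the model is not seeing them for the first time on the road.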
The risks come from how a machine trained using synthetic data responds to real-world changes. It can’t exist in an alternate reality, or it becomes less useful, or even dangerous, Ekin said. “How would you feel,” he asked, “getting into a self-driving car that wasn’t trained on the road, that was only trained on simulated data?” Any system using simulated data needs to “be grounded in the real world,” he said, including feedback on how its simulated reasoning aligns with what’s actually happening.
Udezue compared the problem to the creation of social media, which began as a way to expand communication worldwide, a goal it achieved. But social media has also been misused, he said, noting that “now despots use it to control people, and people use it to tell jokes at the same time.”
As AI tools grow in scale and popularity, a scenario made easier by the use of synthetic training data, the potential real-world impacts of untrustworthy training and of models drifting away from reality become more significant. “The burden is on us builders, scientists, to be double, triple sure that the system is reliable,” Udezue said. “It’s not a fantasy.”
How to keep simulated data in check
One way to ensure models are trustworthy is to make their training transparent, so users can choose which model to use based on their own evaluation of that information. The panelists repeatedly used the analogy of a nutrition label, which is easy for a user to understand.
Some transparency exists already, such as the model cards available through the developer platform Hugging Face, which break down the details of the different systems. That information needs to be as clear and transparent as possible, said Mike Hollinger, director of product management for enterprise generative AI at chipmaker Nvidia. “Those types of things should be in place,” he said.
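As a rough illustration of that kind of transparency, here is a short sketch that fetches and inspects a model card programmatically. It assumes the huggingface_hub Python library is installed, and the model ID is just an arbitrary public example.

```python
# A minimal sketch, assuming huggingface_hub is installed
# (pip install huggingface_hub). "gpt2" is an arbitrary public model ID.
from huggingface_hub import ModelCard

card = ModelCard.load("gpt2")

print(card.data)        # structured metadata: license, tags, datasets, etc.
print(card.text[:500])  # the start of the human-readable description
```

The structured metadata is the machine-readable half of the “nutrition label” the panelists described; the free text is the half meant for people.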
Hollinger said that, ultimately, it will be not just AI developers but also AI users who define the industry’s best practices.
The industry also has to keep ethics and risks in mind, Udezue said. “Synthetic data will make a lot of things easier to do,” he said. “It will bring down the cost of building things. But some of those things will change society.”
Udezue said observability, transparency and trust need to be built into models to ensure their reliability. That includes updating the training models so that they reflect accurate data and don’t compound the errors in synthetic data. One concern is model collapse, when an AI model trained on data produced by other AI models drifts increasingly far from reality, to the point of becoming useless.
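A toy numerical sketch, purely illustrative and not from the panel, shows the mechanism behind model collapse: repeatedly fitting a simple model to samples drawn from the previous generation’s own output gradually erodes the original variation in the data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: a small "real" dataset from a standard normal distribution.
samples = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 101):
    # Fit a simple model (just a mean and standard deviation) to the data,
    # then train the next generation only on that model's own samples.
    mu, sigma = samples.mean(), samples.std()
    samples = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: sigma = {sigma:.4f}")

# With small samples, each refit loses a little of the true spread, so the
# distribution narrows generation after generation: a toy model collapse.
```

Each generation loses a bit of information about the original distribution, which is why retraining needs to be anchored in fresh real-world data rather than recycled model output.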
“The more you fall short of capturing the diversity of the real world, the more unhealthy the responses can be,” Udezue said. The solution is error correction, he said. “These don’t feel like unsolvable problems if you combine the ideas of trust, transparency and error correction.”