Generative AI has become a key part of infrastructure across many industries, and healthcare is no exception. Yet as organizations like GSK push the limits of what generative AI can achieve, they face significant challenges, particularly around reliability. Hallucinations, in which AI models generate incorrect or fabricated information, are a persistent problem in high-stakes applications such as drug discovery and healthcare. For GSK, addressing these challenges means scaling compute at test time to improve its generative AI systems. Here’s how the company does it.
The Problem of Hallucinations in Generative Healthcare
Healthcare applications demand an exceptionally high level of accuracy and reliability. Mistakes aren’t just annoying; they can have life-changing consequences. This makes hallucinations in large language models (LLMs) a critical issue for companies like GSK, where generative AI is applied to tasks such as scientific literature analysis, genomic analysis, and drug discovery.
To mitigate hallucinations, GSK uses advanced inference-time compute strategies, including self-reflection mechanisms, multi-model sampling, and iterative evaluation of results. According to Kim Branson, SVP of AI and machine learning (ML) at GSK, these techniques help ensure agents are “robust and reliable,” while allowing scientists to generate actionable insights more quickly.
Leveraging compute scaling at test time
Compute scaling at test time refers to increasing computational resources during the inference phase of AI systems. This enables more complex operations, such as iterative refinement of results or multi-model aggregation, which are essential to reduce hallucinations and improve model performance.
Branson highlighted the transformative role of scaling in GSK’s AI efforts, noting that “we are looking to increase iteration cycles at GSK – to think faster.” Using strategies such as self-reflection and ensemble modeling, GSK can leverage these additional compute cycles to produce results that are both accurate and reliable.
Branson also touched on the broader industry trend, saying: “You see this war going on with what I can serve, my cost per token and time per token. This allows people to bring these different algorithmic strategies that were previously not technically feasible, and it will also determine the type of deployment and adoption of agents.”
Strategies to reduce hallucinations
GSK has identified hallucinations as a critical challenge in gen AI for healthcare. The company uses two main strategies that require additional computing resources during inference. Applying deeper processing steps ensures that each response is reviewed for accuracy and consistency before being provided in clinical or research settings, where reliability is paramount.
Self-reflection and iterative review of results
One of the basic techniques is self-reflection, in which LLMs critique or modify their own responses to improve quality. The model “thinks step by step,” analyzing its initial results, identifying weaknesses, and revising responses as necessary. GSK’s desk research tool is an example of this: it collects data from internal repositories and an LLM’s memory, then re-evaluates its findings through self-critique to uncover inconsistencies.
This iterative process results in clearer and more detailed final answers. Branson emphasized the value of self-critique, saying, “If you can only afford to do one thing, do it.” Refining its own logic before producing results allows the system to produce information that meets strict healthcare standards.
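The self-reflection loop described above can be sketched in a few lines. This is a minimal, illustrative sketch only, not GSK’s actual system: `call_model` is a stub standing in for a real LLM API call, and the prompt labels and return strings are hypothetical.

```python
def call_model(prompt: str) -> str:
    # Stub standing in for a real LLM endpoint call (illustrative only).
    if prompt.startswith("CRITIQUE"):
        # The critic passes only once the answer has been revised.
        return "PASS" if "checked" in prompt else "add supporting evidence"
    if prompt.startswith("REVISE"):
        return "checked answer with supporting evidence"
    return "draft answer"

def self_reflect(question: str, max_rounds: int = 3) -> str:
    """Draft an answer, then critique and revise it until the critic passes."""
    answer = call_model(f"ANSWER: {question}")
    for _ in range(max_rounds):
        critique = call_model(f"CRITIQUE: {answer}")
        if critique == "PASS":  # the critic found no remaining issues
            break
        # Feed the critique back so the model revises its own output.
        answer = call_model(f"REVISE: {answer}\nCritique: {critique}")
    return answer
```

Each extra round costs another inference pass, which is exactly why this technique depends on having spare compute at test time.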
Multi-model sampling
GSK’s second strategy relies on multiple LLMs or different configurations of a single model to cross-check results. In practice, the system can run the same query at different temperature settings to generate diverse responses, use fine-tuned versions of the same model specialized in particular domains, or use entirely separate models trained on separate data sets.
Comparing and contrasting these results lets the system confirm the most consistent or convergent conclusions. “You can achieve this effect by using different orthogonal ways to arrive at the same conclusion,” Branson said. Although this approach requires more computing power, it reduces hallucinations and increases confidence in the final answer – a critical advantage in high-stakes healthcare environments.
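One common way to implement this kind of cross-checking is simple majority voting across samples. The sketch below assumes that approach; `sample_model` is a stub for querying an LLM at a given temperature, and the answer strings are hypothetical, not real GSK outputs.

```python
from collections import Counter

def sample_model(question: str, temperature: float) -> str:
    # Stub standing in for an LLM query at a given temperature setting.
    # Simulates a high-temperature sample occasionally drifting off-answer.
    return "BRCA1 variant" if temperature < 0.9 else "unrelated gene"

def cross_check(question: str, temperatures=(0.2, 0.5, 0.8, 1.0)) -> str:
    """Sample the same query at several temperatures and keep the
    conclusion only if a majority of samples converge on it."""
    answers = [sample_model(question, t) for t in temperatures]
    best, count = Counter(answers).most_common(1)[0]
    # Only trust a conclusion reached by more than half the samples.
    return best if count > len(answers) // 2 else "NO CONSENSUS"
```

The same voting logic applies whether the diversity comes from temperature settings, fine-tuned variants of one model, or entirely separate models; only `sample_model` would change.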
The inference wars
GSK’s strategies depend on infrastructure capable of handling significantly heavier computing loads. In what Branson calls the “inference wars,” AI infrastructure companies such as Cerebras, Groq, and SambaNova compete to deliver hardware advancements that improve token throughput, reduce latency, and lower costs per token.
Specialized chips and architectures enable complex inference routines, including multi-model sampling and iterative self-reflection, at scale. Cerebras’ technology, for example, processes thousands of tokens per second, allowing advanced techniques to work in real-world scenarios. “You see the results of these innovations having a direct impact on how we can effectively deploy generative models in healthcare,” Branson noted.
When hardware keeps pace with software demands, solutions emerge to maintain accuracy and efficiency.
Challenges remain
Even with this progress, scaling compute resources presents obstacles. Longer inference times can slow down workflows, particularly if clinicians or researchers need rapid results. Increased use of compute also leads to increased costs, requiring careful management of resources. Nevertheless, GSK considers these compromises necessary for stronger reliability and richer functionality.
“As we enable more tools in the agent ecosystem, the system becomes more useful to users, and you end up with increased compute usage,” Branson noted. Balancing system performance, costs and capabilities allows GSK to maintain a strategy that is both practical and forward-looking.
What’s next?
GSK plans to continue refining its AI-based healthcare solutions, with test-time compute scaling a top priority. The combination of self-reflection, multi-model sampling, and robust infrastructure helps ensure that generative models meet the rigorous demands of clinical environments.
This approach also serves as a roadmap for other organizations, illustrating how to balance accuracy, efficiency, and scalability. Maintaining a lead in computational innovations and sophisticated inference techniques not only addresses today’s challenges, but also lays the foundation for breakthroughs in drug discovery, patient care and beyond.