Researchers have called into question claims made by health tech company Babylon Health that its diagnostic and triage system performed better than doctors on a subset of the final exam for trainee GPs in the UK.
Babylon said its technology had demonstrated the ability to provide health advice that was “on par with doctors” in June, publishing an internal study to back the claim. The company said it used example questions publicly available for its exam preparation and testing, since the Royal College of GPs, which sets the final test, does not publish past papers.
Babylon said its system scored 81 per cent when it first sat the exam, while the average pass mark for the last five years for doctors was 72 per cent. Their team worked with the Royal College of Physicians, Stanford University and Yale New Haven Health to further test the technology against seven primary care doctors by developing 100 symptom sets or ‘vignettes’, while four other GPs acted as patients. Its system scored 80 per cent for accuracy, while doctors achieved an accuracy range between 64 to 94 per cent.
But in a letter published in medical journal The Lancet this week, Hamish Fraser, associate professor of medical science, Enrico Coiera, medical informatics professor, and David Wong, health informatics lecturer, warned that the evaluation did “not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation” and that there was “a possibility that it might perform significantly worse”.
"Further clinical evaluation is necessary to ensure confidence in patient safety”, the experts said, adding that it was not possible to determine how the system would perform "with a larger randomised set of cases" or if data was logged by patients instead of doctors.
The researchers “commended” Babylon for publishing details about the development of the system and their evaluation studies, but, quoting other concerns previously raised when looking at the performance of computerised diagnostic decision support programmes, they said the cases illustrated the “urgent need” for improvements in evaluation methods assessing the “safety, efficacy, effectiveness, and cost” of these tools.
“Such guidelines should form the basis of a regulatory framework, as there is currently minimal regulatory oversight of these technologies. Without such structure, commercial entities have little incentive to develop a culture that supports peer-reviewed independent evaluation,” they wrote in the analysis.
Babylon did not respond to our request for comment. In a statement provided to The Times earlier this week, a spokesperson for the company said:
“Babylon took an unprecedented step in publishing a study of the efficacy of our AI. Professors from Harvard, Stamford, Yale and the Royal College of Physicians took part in the study which found that the Babylon Chatbot performed on par with doctors on specific primary care cases under test conditions. In fact, Babylon is commended by the authors of the paper in The Lancet for sharing our methodology. It is unrealistic to believe that AI can currently compete with specialist doctors for all areas of medicine, but the power of these advances in healthcare should be celebrated as it will help to make healthcare accessible and affordable to all.
“What makes Babylon unique is the combination of easy-to-use AI tools, leading on to virtual consultations with real doctors in minutes and when necessary physical face to face consultations in our clinics. This has allowed us to solve fundamental problems in access to primary care that have remained unsolved for decades. Our patients can see a doctor within minutes, 24 hours a day, 365 days a year. No other service can provide this level of access. Tens of thousands have been drawn to our service, with 4 out of 5 giving Babylon a 5 star rating.”