Babylon Health’s announcement yesterday that its triage chatbot outperformed doctors in a simulated version of the UK’s Membership of the Royal College of General Practitioners (MRCGP) exam made waves, but maybe not the type that the company had initially hoped for.
During a livestream, the company said that its AI recorded an 81 percent score on the recreated test, 9 percentage points higher than the average passing score for UK medical students. Further, the tool was 80 percent accurate in a separate 100-question trial devised by the company, while scores on the same test from seven primary care physicians ranged from 64 percent to 94 percent.
In a statement released by the Royal College of General Practitioners (RCGP) itself, Martin Marshall, vice chair for external affairs for the college and a professor at University College London, dismissed Babylon’s claims as “dubious” and said the results could not be compared with the full range of responsibilities handled by medical practitioners.
“The potential of technology to support doctors to deliver the best possible patient care is fantastic, but at the end of the day, computers are computers, and GPs are highly-trained medical professionals: the two can't be compared and the former may support but will never replace the latter,” he said in the statement. “No app or algorithm will be able to do what a GP does. Every day we deliver care to more than a million people across the UK, taking into account the physical, psychological and social factors that may be impacting on a patient's health; we consider the different health conditions a patient is living with, and medications they might be taking, when formulating a treatment plan. Much of what GPs do is based on a trusting relationship between a patient and a doctor, and research has shown GPs have a 'gut feeling' when they just know something is wrong with a patient.”
Marshall’s comments went on to stress that the MRCGP questions Babylon included, although collected from previous versions of the assessment, were “not necessarily representative of the full-range of questions and standard used in the actual MRCGP exam.”
In a written statement provided to MobiHealthNews, Babylon Medical Director Dr. Mobasher Butt said that the RCGP was “completely off the mark” with its criticisms, which he noted were released hours before the company published a conference paper elaborating the study procedures and results on its website.
Butt stressed that the test used sample RCGP questions and others from independent exam preparation sources that were presented “in exactly the same format as the MRCGP exam,” and emphasized that the passing grade referenced by the company and exceeded by the tool was the five-year average pass mark established by the college itself. In addition, he said that his company has made “no claims that an ‘app or algorithm will be able to do what a GP does,’” and that decisions made by the tool are meant to be supported by live practitioners.
“I am saddened by the response of a College of which I am a member,” Butt wrote in the statement. “The research that Babylon published should be acknowledged for what it is — a significant advancement in medicine. Instead, the RCGP’s statement focuses on shoring-up an outmoded and financially self-interested status quo which solely works to the benefit of a limited number of partner GPs, rather than celebrating a scientific achievement which has the potential to improve the lives of patients and clinicians globally.”
In November, Babylon partnered with the National Health Service to launch GP at Hand, a smartphone app that helps users book appointments, hold video consultations, and have their symptoms interpreted by the company’s triage chatbot. The service and partnership were criticized by the RCGP for “cherry-picking” low-complexity patients and leaving more difficult cases to practitioners. Marshall reaffirmed this stance in his comments.
“We do not endorse Babylon, or its GP at Hand service, being used in the way that it is, in the NHS,” he said. “Technology has the potential to transform the NHS, but it must be implemented in an equitable way, that doesn't benefit some patients, and not others, and is not to the detriment of the general practice service as a whole.”
Criticisms of Babylon’s claims weren’t limited to the RCGP, as numerous practitioners, researchers, and other commentators took to Twitter to voice their thoughts on whether the company’s tool could replace a human. Notably, a user named Dr Murphy (who only identifies online as an anonymous “NHS Consultant”) took the opportunity to post a 44-tweet chain describing interactions with the company and doubts regarding the chatbot’s capabilities. Among these were several videos running through the chatbot’s diagnostic process that highlighted questionable or unexpected conclusions.
While most reactions to Dr Murphy’s tweets expressed concern that Babylon’s tool has already been rolled out by the NHS for live patients, others, such as Enrico Coiera, a professor in medical informatics at Macquarie University and director of the Centre for Health Informatics, directed their criticism at the methodology of the conference paper Babylon released. In a chain of tweets, he described the data as “a very preliminary and artificial test of a Bayesian reasoner on cases for which it has already been trained,” noting that the ways in which questions were chosen and presented to the tool and the human practitioners wouldn’t replicate real-world implementation. As a result, he concluded that the data generated by this trial are “confounded by artificial conditions and use of few and non-independent assessors.”
So, it is fantastic that Babylon has undertaken this evaluation, and has sought to present it in public via this conference paper. They are to be applauded for that. One of the benefits of going public is that we can now provide feedback on the study's strength and weaknesses.
— Enrico Coiera (@EnricoCoiera) June 28, 2018
In a written response to criticisms of the preliminary study's findings and methodology, Babylon Chief Scientist Dr. Saurabh Johri stressed that the company's AI is not restricted to Bayesian reasoning alone and incorporates natural language processing as well as a deep neural network to better understand the patient's input.
"In addition, the independently created vignettes were not used for the model training, development or adjustment in any way — the cases on which the model was tested were fully unseen to the model making it a perfectly valid performance assessment set," he wrote. "Secondly, none of the doctors used in role-playing were part of the team who tested or developed the AI system, thus shielding them from knowing how the system preferred to ingest information and alienating such biases. Thirdly, as mentioned in the paper, we cover the majority of medical conditions encountered in general practice in the United Kingdom, which is an extensive list of conditions that have good coverage of what would be observed 'in-the-wild'."
Johri also wrote in the statement that the company is open to comments on the methodology, and will remain available to clarify additional details regarding the paper and the technology.