From A to Zebra: Data-driven strategies for training AI to understand rare diseases

Ahead of Rare Disease Day 2020, Paul Wicks, scientific advisor for Ada Health, writes about the potential of digital learning health systems in helping patients receive the right care faster.
By Paul Wicks

Paul Wicks is a scientific advisor for digital health company Ada Health. At Ada, Paul works on strategy, rare diseases and scientific communications. A rare diseases expert, Paul is a neuropsychologist and independent consultant for digital health, clinical trials, and patient centricity. Previously, he spent 13 years leading the R&D team at PatientsLikeMe, an online community for over 700,000 people living with medical conditions.

There’s a saying that medical students hear a lot: “When you hear hoofbeats, think of horses, not zebras”. This heuristic has earned its role in medical education because it counteracts a number of well-described human cognitive biases that might otherwise lead medical students to diagnose patients with exotic-sounding and ultra-rare diseases found in textbooks, rather than the more common (but less exciting) diagnosis. The numbers bear this out too; given that there are an estimated 60 million horses on earth, but only around 300,000 zebras, those hoofbeats are indeed 200 times more likely to belong to a horse than a zebra.

But while it might be true that a doctor could go their whole career without seeing any one rare disease, rare diseases as a group are not that rare at all. In fact, they’re quite common. The most recent estimates suggest there are about 10,000 rare diseases out there, with advances in medical genetics adding another 200 or so each year. When you add them all up, about 1 in 17 people have a rare disease. That’s around 6% of humanity, or some 400 million people worldwide.

AI has huge potential in tackling rare diseases — but it won’t be easy

Today there is a great deal of excitement about the potential for digital health to tackle some of the big challenges in rare diseases. For example, it takes patients an average of five years to be diagnosed with a rare disease, and much of this time can be spent without clear answers or support. But by leveraging the latest developments in AI and medical technology, we could make much faster diagnoses to direct people to the right treatments, trials, and support. After all, AI can be trained to spot patterns in data that human doctors can’t see, and should (in theory) be unaffected by human error.

So if you were trying to train an AI to understand all 10,000 rare diseases, where would you even begin?

You might be forgiven for thinking there would be one single, unified, global definition of what counts as a “rare” disease. Sadly not: one review counted 296 different definitions of “rare” across 1,109 organisations, and there’s no official rulebook to divide up “rare”, “orphan”, “ultra rare”, or any of the many other terms used in the field. You might start with your own personal experiences; having studied rare forms of ALS / motor neurone disease for nearly 20 years now, as well as being a caregiver to someone living with it, I am conscious that I often frame rare problems through that lens. You might start at the hospital with the most expertise, where there’s the most funding, where there are newly approved drugs, or where the most activated network of patient charities asks for your help.

But none of these approaches are systematic or equitable; perhaps we should start with the data and see what new solutions we can unearth.

Some rare diseases are rarer than others

Academics and non-profits across the rare space have been collaborating for decades to build databases like Orphanet, which hosts publicly accessible datasets including statistics on epidemiology, genetics, and lists of the signs and symptoms that can serve as a unique fingerprint for rare diseases. What patterns in the data could guide our path?

A global rare diseases study run by researchers from Orphanet, NORD, and Eurordis revealed a useful insight that could guide a more strategic, data-driven approach than random disease selection. The epidemiological data revealed that the majority of patients diagnosed with a rare disease (~80% of patients) were found in just ~400 conditions that were the “least rare” (4% of diseases).

It was those uncommon but not unheard-of conditions that affected about one to five people in every 10,000 members of the population — diseases like ALS/MND, cleft lip, Down syndrome, or lupus. Conversely, the majority of diseases described in the scientific literature (~85% of diseases) are so rare that they are found to affect fewer than one person in every million members of the population. For these conditions, there might only be a handful of affected individuals ever described in the literature or alive at any one time, yet they make up the bulk of the onerous “10,000 rare diseases”.
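The concentration described above can be sketched in a few lines of Python. This is only an illustration: the disease names and prevalence figures below are invented toy values chosen to mimic the shape of the distribution, not Orphanet data, and the helper name smallest_covering_set is hypothetical. It also assumes, for simplicity, that no patient has more than one condition, so prevalences can simply be added up.

```python
def smallest_covering_set(prevalence_per_million, target_share=0.80):
    """Pick the most prevalent diseases until they reach target_share of all patients."""
    total = sum(prevalence_per_million.values())
    # Rank diseases from most to least prevalent.
    ranked = sorted(prevalence_per_million.items(),
                    key=lambda item: item[1], reverse=True)
    chosen, covered = [], 0.0
    for name, prevalence in ranked:
        if covered >= target_share * total:
            break
        chosen.append(name)
        covered += prevalence
    return chosen, covered / total


# Hypothetical toy data (NOT real epidemiology): 100 made-up diseases,
# with prevalence expressed as patients per million people.
toy = {}
toy.update({f"least_rare_{i}": 300 for i in range(4)})   # 1-5 per 10,000 band
toy.update({f"mid_rare_{i}": 20 for i in range(11)})
toy.update({f"ultra_rare_{i}": 0.5 for i in range(85)})  # under 1 per million

chosen, share = smallest_covering_set(toy)
print(f"{len(chosen)} of {len(toy)} diseases cover {share:.0%} of patients")
# prints "4 of 100 diseases cover 82% of patients"
```

With these made-up numbers, the four “least rare” conditions account for roughly four-fifths of all patients, while the 85 ultra-rare ones together account for only a few percent. Ranking by prevalence in this way is one simple way to decide which disease models to build first.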

Digital disease models could provide the answer

To turn back to the hoofbeat analogy, not all zebras are created equal. There are plains zebras (about 300,000 on earth), mountain zebras (just 3,000), and the incredibly rare “golden zebra” born with a unique genetic mutation that causes a lack of melanin (of which there have only been a handful). So if we were training an AI model to recognise zebras, it would make sense to start with the plains zebras, not the golden zebras, if we wanted to create something that’s actually useful in the wild.

In the world of rare diseases, that means training disease models that can efficiently identify features suggestive of the autoimmune conditions, blood cancers, and neurological diseases that make up the “plains zebras”. Unlike the slowly trained pattern recognition that is imprinted on a human doctor’s brain, a digital disease model is scalable to a much broader population, continues to learn dynamically, and with minor modifications can be deployed in any language.

Even in an advanced health system like the UK’s, there are only about 200 clinical geneticists, so rare disease diagnosis today still hinges on serendipitously finding the right specialist through personal networks. By contrast, a digital learning health system could help ensure that a potentially affected person gets on the path to see the right healthcare professional with a much greater degree of speed and equality of access.

Advanced digital platforms with access to a vast body of context-rich real-time and real-world evidence have the potential to deliver more widely applicable diagnoses and insights than can be inferred from the peer-reviewed scientific literature, skewed as it is to “WEIRD” populations (i.e. those that are “Western, Educated, Industrialised, Rich, and Democratic”). Hopefully, by developing digital disease models we can help patients get the right diagnosis promptly anywhere on Earth.