Learning from Experience: FDA’s Treatment of Machine Learning

By Bradley Merrill Thompson

There seems to be a modern-day gold rush as companies explore how to use machine learning in clinical decision support (CDS) software. Unfortunately for libertarians, FDA will regulate some of that software because of its risk profile. While the 21st Century Cures Act, which passed last December, exempted certain CDS from regulation, and FDA intends to exempt even more, FDA will continue to regulate high-risk CDS. The question is: how will FDA regulate high-risk CDS when the software involves machine learning?

Some might assume that machine learning in healthcare is so new that we have no idea how FDA will react. But that’s simply not the case. FDA has decades of experience regulating machine learning and, fortunately, that experience gives us some useful clues as to how FDA will respond to the expanded uses of the technology.

FDA’s Experience With Machine Learning

FDA’s Division of Radiological Health has been reviewing software that employs computer-aided image analysis since 1998. In many cases the software uses sophisticated algorithms to highlight areas of an image that the radiologist ought to study more closely. Initially such software was placed in class III (the highest level of regulatory oversight, reserved for products with the greatest risk), but more recently FDA has regulated that software in class II, the category for products of only moderate risk. Importantly, such software, when sold for use by radiologists, instructs the radiologist to perform the full review he or she would normally perform, and not to rely on the software. In theory, software used that way presents zero risk, but FDA probably suspects that radiologists will depend on the software regardless of warnings against such reliance.

In 2012, FDA published a pair of guidance documents that synthesized the agency’s approach to this category of software. In those documents, FDA reiterated what frankly is always true in FDA regulation: the intended use of the product drives the level of regulation. The agency differentiates between CADe, which is intended merely to highlight areas of interest, and CADx, which indicates the likelihood that a disease is present or specifies a disease type. CADx, because it presents greater risk, may be regulated more stringently, often in class III.

But FDA’s attitude toward CADx seems to be evolving. Just last month, in July 2017, FDA decided to down-classify into class II CADx software for lesions suspicious for cancer. Specifically, FDA’s action addressed software “intended to aid in the characterization of lesions as suspicious for cancer identified on acquired medical images.” The software characterizes lesions based on features or information extracted from the images and provides information about the lesions to the user. Treating that software as class II is indeed a big step toward encouraging its development. Manufacturers of class III products must submit a voluminous premarket approval application that is typically based on extensive clinical trials, whereas manufacturers of class II products need only demonstrate that their product is substantially equivalent to products already on the market (which can also require clinical testing, but of a more modest design and scope).

For image analysis software that employs machine learning, FDA has a relatively well-established approach to clinical trials. FDA’s approach is predicated on the ability to confidently determine ground truth, meaning what a given image actually represents. Researchers can create a set of medical images where the underlying presence or absence of disease has been confirmed through other techniques. Because that ground truth exists, applicants can design clinical trials that compare human readers assisted by the software with readers who don’t have the software, to see how well each group does. Other clinical trial designs are certainly possible, depending on the specific hypothesis the applicant needs to test.
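To make the aided-versus-unaided comparison concrete, here is a minimal sketch in Python. It assumes each image has a confirmed ground-truth label and a yes/no call from a reader with and without the software. The `Reading` record, the ten cases, and the choice of sensitivity and specificity as the measures are all hypothetical illustrations, not a design FDA prescribes.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    truth: bool    # disease confirmed by a reference standard (e.g. biopsy)
    unaided: bool  # reader's call without the software
    aided: bool    # reader's call with the software


def sensitivity_specificity(truths, calls):
    """Return (sensitivity, specificity) of calls measured against ground truth."""
    pairs = list(zip(truths, calls))
    tp = sum(t and c for t, c in pairs)
    fn = sum(t and not c for t, c in pairs)
    tn = sum(not t and not c for t, c in pairs)
    fp = sum(not t and c for t, c in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical study data: 5 diseased and 5 healthy cases.
readings = [
    Reading(True, True, True),
    Reading(True, True, True),
    Reading(True, True, True),
    Reading(True, False, True),   # lesion found only with the software
    Reading(True, False, False),  # lesion missed in both conditions
    Reading(False, False, False),
    Reading(False, False, False),
    Reading(False, False, False),
    Reading(False, False, False),
    Reading(False, True, True),   # false positive in both conditions
]

truths = [r.truth for r in readings]
unaided = sensitivity_specificity(truths, [r.unaided for r in readings])
aided = sensitivity_specificity(truths, [r.aided for r in readings])
print("unaided (sensitivity, specificity):", unaided)
print("aided   (sensitivity, specificity):", aided)
```

A real reader study would involve many readers and cases and a pre-specified statistical analysis; the point here is only that every performance figure depends on having a trustworthy `truth` column.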

Beyond understanding how ground truth is established, FDA has developed a fairly well-specified list of information it needs to review for software that employs machine learning. In the 2012 guidance documents, FDA lists information such as algorithm design, features, models, classifiers, the data sets used to train and test the algorithm, and the test data hygiene practices followed. The last item is important because apparently some applicants inadvertently select classifiers based on the test set, which is not permitted. FDA also wants to understand how data are acquired, to make sure that the data reflect real life.
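The test data hygiene point can be illustrated with a short Python sketch. It assumes a three-way split in which candidate classifiers are derived from a training set, one is chosen on a separate validation set, and the test set is scored exactly once at the end. The synthetic data, the threshold "classifiers", and all names are hypothetical; the guidance does not prescribe this particular procedure.

```python
import random
from statistics import mean


def accuracy(threshold, cases):
    """Fraction of cases where 'score >= threshold' matches the true label."""
    return sum((score >= threshold) == label for score, label in cases) / len(cases)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: (feature score, disease present?) pairs.
cases = []
for _ in range(300):
    label = rng.random() < 0.5
    cases.append((rng.gauss(1.0 if label else 0.0, 0.7), label))

# Split once, up front. The test set is locked away until the final step.
train, validation, test = cases[:180], cases[180:240], cases[240:]

# Derive candidate classifiers (here, simple thresholds) from training data only.
pos_mean = mean(score for score, label in train if label)
neg_mean = mean(score for score, label in train if not label)
candidates = [neg_mean + f * (pos_mean - neg_mean) for f in (0.25, 0.5, 0.75)]

# Select the classifier on the validation set, never on the test set.
best_threshold = max(candidates, key=lambda t: accuracy(t, validation))

# Touch the test set exactly once, after the choice is frozen.
test_accuracy = accuracy(best_threshold, test)
print(f"chosen threshold={best_threshold:.2f}, held-out accuracy={test_accuracy:.2f}")
```

Had `best_threshold` been picked by comparing candidates on `test`, the reported accuracy would be optimistically biased, which is the problem the hygiene requirement is meant to prevent.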

FDA is also sensitive to the statistical plan used and the appropriateness of the study hypothesis. In the agency’s experience, many applicants include multiple hypotheses in their studies, and that affects the statistical plan, among other things. In the end, one of the agency’s most important goals is to make sure that the intended use is reflected in the product design and clinical validation.
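One reason multiple hypotheses affect the statistical plan is that the chance of at least one false positive grows with the number of tests. The Python sketch below shows that arithmetic and applies a Bonferroni correction, which is one common adjustment and is used here purely as an assumption; the source does not say which method FDA expects. The endpoint names and p-values are hypothetical.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni(p_values, alpha=0.05):
    """Mark each hypothesis significant only if its p-value is <= alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for three study endpoints.
p_values = {"sensitivity": 0.012, "specificity": 0.030, "reading_time": 0.200}

uncorrected_risk = family_wise_error_rate(0.05, len(p_values))
results = bonferroni(p_values)

print(f"family-wise error rate without correction: {uncorrected_risk:.3f}")
for name, significant in results.items():
    print(f"{name}: {'significant' if significant else 'not significant'} after correction")
```

In this example the specificity endpoint would pass an uncorrected 0.05 threshold but fails the corrected one, which is the kind of difference a statistical plan has to settle before the study is run.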

FDA has also begun to receive submissions to clear software that employs machine learning in what the agency refers to as “adaptive systems” – systems that evolve over time based on new evidence collected in the field after the device goes to market. While creating an adaptive system is in fact the ultimate goal of most developers, from an FDA standpoint it presents special challenges because, under the law, manufacturers are required to seek new clearances or approvals for changes to medical devices. If the device changes on its own, at what juncture is a new approval required? In the same vein, FDA must also decide when changes trigger the need for new validation. In at least some cases, it appears not to be adequate simply to specify the parameters used to control the software.

There are other questions here, including whether the software developer can continually reuse its test data set, or whether that amounts to training to the test. It’s possible that FDA might require that noise be added to the test set to ensure that the company is appropriately validating changes.

While the majority of its experience is in the context of medical image software, FDA is beginning to encounter machine learning in other medical software applications. The agency has received a handful of applications in areas such as software that analyzes the results of laboratory tests, vital signs collected as part of remote monitoring, and signals such as EEGs. Consequently, other divisions within FDA’s Device Center are grappling with questions of machine learning, and are most likely consulting their colleagues in the Division of Radiological Health.

What FDA’s Experience Tells Us About the Future of CDS Regulation

Based on FDA’s experience, we can predict at least four things.

  1. Companies will have to think long and hard about intended use and how aggressive to be. Anything beyond flagging something for physician interpretation and certain low-risk characterizations is likely to raise the regulatory bar substantially.
  2. Classification will be a big issue. Depending on the specific application, FDA may view machine learning as a new technology, and therefore require a new classification. If it does, the first company to bring such products to market will have to either seek premarket approval for a class III device, or seek to down-classify the product through what’s called the de novo process. But it’s not all doom and gloom. We have seen FDA be flexible in several instances in allowing machine learning to be added to existing technologies without placing the technology in class III.
  3. Study design will be complicated if an applicant cannot convincingly establish ground truth. In radiology, we can establish objective truth in many cases by biopsy and other diagnostic procedures. When an applicant can’t start with the truth, the applicant can’t simply compare how well people and machine perform in finding that truth. More creative clinical trial designs will be necessary.
  4. All of the technical concerns that FDA already has with regard to using machine learning in radiology will carry over to other forms of machine learning. FDA reviewers are likely to consult the experts in radiological health on machine learning. This applies particularly to the more challenging regulatory issues associated with adaptive systems.

The good news is that FDA seems to appreciate the value of machine learning and how it can significantly improve healthcare. So the agency is likely to be sympathetic in most instances, and not want to unduly stand in the way. Further, FDA has recently been announcing a series of improvements to its regulatory oversight of software that will likely help those developing products based on machine learning. For example, FDA seems intent on making the path to market easier, in exchange for greater requirements that manufacturers collect real-world evidence after the product is introduced. Nonetheless, the agency’s clinical and scientific concerns will need to be addressed through appropriate evidence.


FDA is continuing to research machine learning and will ultimately become more comfortable with it. But it is, practically speaking, very difficult for the agency to recruit and then retain experts in machine learning when they are so valuable in private industry.

FDA’s Division of Imaging, Diagnostics and Software Reliability within the Office of Science and Engineering Labs is doing research on computer-assisted diagnostics. The agency may one day be able to make a range of simulation and analytic tools and data available to the public that will speed the development of software in this area. In the meantime, applicants can study the path followed by the machine learning pioneers over the last 20 years to discern the best path for new technology.