Google researchers: Speech recognition tech can lessen physicians' transcription burden

By Dave Muoio

Voice recognition technology employed by Google Assistant, Google Home, and Google Translate could soon become a transcription tool for documenting patient-doctor conversations. In a recent proof-of-concept study, Google researchers described their experiences developing two automatic speech recognition (ASR) methodologies for multi-speaker medical conversations, and concluded that both models could be used to streamline practitioners’ workflows.

“With the widespread adoption of EHR systems, doctors are now spending ~6 hours of their 11-hour workdays inside the EHR and ~1.5 hours on documentation alone,” the researchers wrote in the study. “With the growing shortage of primary care physicians and higher burnout rates, an ASR technology that could accelerate transcription of the clinical visit seemed imminently useful. It is a foundational technology that information extraction and summarization technologies can build on top of to help relieve the documentation burden.”

Currently, most ASR products designed specifically for healthcare transcription are limited to doctors’ dictations, which consist of a single speaker using predictable terminology. Conversations between doctors and their patients, on the other hand, are harder to transcribe because of overlapping dialogue, differences in speakers’ distance from the microphone and voice quality, varying speech patterns, and differences in vocabulary.

To investigate transcription support for these multi-participant conversations, the researchers developed and evaluated two ASR methodologies. The first system, a connectionist temporal classification (CTC) model, focuses on the placement and sequence of individual units of phonetic speech to train its recurrent neural network. The other, known as a listen, attend, and spell (LAS) model, is a multi-part neural network that translates speech into individual characters of language, then sequentially selects subsequent entries based on prior predictions. Each model was trained with more than 14,000 collective hours of anonymized medical conversations, which the researchers noted required a “significant amount of effort” to clean and properly align for the training.

This data cleaning turned out to be particularly essential to the success of the team’s CTC models, which eventually achieved a word error rate of 20.1 percent. The researchers’ analysis of the errors showed that most mistakes occurred near the beginning and end of utterances, during speaker statements shorter than one second, and more often during patients’ speech than during doctors’. The LAS models were more resilient to incorrect data alignment or noise, and ultimately reached an 18.3 percent word error rate. The LAS models’ errors were rarely related to medical terms, with most mistakes occurring among more conversational phrases. Of note, the LAS model achieved a 98.2 percent rate of recall for drug names mentioned in a medical conversation.
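
For readers unfamiliar with the two metrics cited above, the following Python sketch shows how word error rate and term recall are commonly computed. It is an illustration only, not the researchers' evaluation code: the function names, the sample sentences, and the one-item drug list are all hypothetical, and the study's exact scoring rules (text normalization, handling of repeated mentions) are not described in the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def term_recall(reference: str, hypothesis: str, terms: set[str]) -> float:
    """Share of term mentions in the reference that also appear in the hypothesis.

    Assumption: a mention counts as recalled if the same word occurs anywhere
    in the hypothesis. Real evaluations may align mentions more strictly.
    """
    hyp_words = set(hypothesis.lower().split())
    mentions = [w for w in reference.lower().split() if w in terms]
    if not mentions:
        raise ValueError("reference contains none of the given terms")
    return sum(1 for w in mentions if w in hyp_words) / len(mentions)


# Hypothetical example: the transcript drops two short conversational words
# but keeps the drug name.
reference = "take two tablets of ibuprofen twice a day"
hypothesis = "take two tablets ibuprofen twice day"
wer = word_error_rate(reference, hypothesis)            # 2 errors / 8 words
recall = term_recall(reference, hypothesis, {"ibuprofen"})
print(f"WER: {wer:.1%}, drug-name recall: {recall:.1%}")
```

In this toy case the transcript has a 25 percent word error rate yet perfect drug-name recall, which mirrors the pattern the researchers reported: errors concentrated in conversational words rather than medical terms.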

Because both models’ mistakes were rarely related to the medical aspects of a discussion, the research team wrote that these technologies could feasibly provide high-quality results in a non-experimental setting. In a related Google Research Blog post, two of the authors, product manager Katherine Chou and software engineer Chung-Cheng Chiu, also said that they will begin working with Stanford University physicians and researchers to explore how ASR technologies such as these can better lessen the burden on physicians.

“We hope these technologies will not only help return joy to practice by facilitating doctors and scribes with their everyday workload, but also help the patients get more dedicated and thorough medical attention, ideally, leading to better care,” they wrote.