Voice recognition: we can hear you now…

A summary of the third (Tech)eTuesday

By Gloria Zaionz (Advanced Technologies team at Kaiser Permanente and Emerging Technologies lead at the Innovation Learning Network)

What do the phrases “speech recognition”, “natural language processing”, “voice recognition”, “Siri”, “Alexa”… all have in common? They refer to a technology category commonly known as voice interfaces. Voice is the most natural method of interaction. It requires no screens, no gestures, no special training.

Voice technologies have come a long way since DARPA began its first voice project in 1971. Today they can almost mimic a human conversation: from interactive voice response systems that sound like a live person, with a word error rate at the same level as human conversation, to virtual assistants that live inside our mobile phones, to Amazon’s Alexa. This accomplishment opens up possibilities never imagined before, especially in health care.

But just how can this be used in health care? Today, voice technology is primarily used for transcription or dictation and documentation. However, many health care innovators recognize voice technology has the ability to change how we interact with machines. Instead of tapping or gesturing, we speak.

The session featured three distinguished experts on voice technologies and their opportunities in health care:

  • Introduction: Laura Kusumoto, a director of innovation for a large entertainment company, who spoke as an innovator on Enabling Innovation with Emerging Technologies and not on behalf of her professional affiliation.
  • Use cases: Mike Holland, Director, Innovation Lab, Care Delivery Technology Services at Kaiser Permanente
  • Upcoming innovator: Marie Johnson, Co-creator of ‘Nadia’ AI, Managing Director and Chief Digital Officer, Centre for Digital Business



Voice Technologies – An overview by Laura Kusumoto


Natural Interfaces – The END of Typing

“We are on the verge of the end of typing,” pronounced Laura Kusumoto. But we’ve been typing into computers for so long, so why change now? Because it is natural to use our voice: our brains are wired for speech from birth. When we want to communicate with another brain, we rely on speech to be heard and to be understood.

Another reason why voice matters is that it reduces friction between our systems and our patients and consumers: we can hear them, and in a very personal fashion.

Advances in voice technologies over the past few years center on two areas:

  • Advancement in AI software and machine learning
  • Advancement in hardware devices that play the speech

A Glance at Speech Milestones

Courtesy: Laura Kusumoto

Even though DARPA embarked on the first speech project in 1971, early neural network research had already begun back in 1958. Neural network and machine learning technology advancements are closely intertwined with speech.

In 1978, Texas Instruments came out with a device that pronounced words. Although it was marketed as a toy for children, it was a huge advancement. The device contained a chip that stored digital signatures of word pronunciation.

In the 1980s, neural networks became a technique used in AI and an alternative to creating rules. If you are trying to represent human thought processes and patterns, ‘rules’ as we know them cannot be used because they need to be written and defined. With the advancement of fuzzy logic and neural networks came the use of patterns in lieu of rules.

1994 – The introduction of Dragon NaturallySpeaking and IBM ViaVoice transcription software laid the foundation for medical transcription.

1996 – SRI introduced the first Interactive Voice Response system.

2007 – Microsoft allowed voice search on Bing, letting people use voice to search the Internet without having to open a browser or type anything. This set the stage for the personal virtual assistants we know today, like Siri and Cortana.

2011 – Siri introduced on iPhone

2012 – Rapid advances in deep learning, machine learning, and neural networks that enabled unsupervised learning and other advances

Today – Microsoft offers Skype Translation, and as of 2016 voice transcription is at parity with human understanding.

Voice Technology Capabilities

Courtesy: Laura Kusumoto

There are different types of voice technology capabilities. The starting point is speech recognition, which takes the signals from the sound the voice makes and breaks them down into parts of speech. By breaking down parts of speech, the technology tries to determine what was said.

  • Voice identification is the analysis of voice signals to determine who is speaking. It may have aspects of demographics, e.g. age, gender, etc., and is an important part of biometric technology so you know the person speaking is who they say they are.
  • Language understanding is understanding what the spoken words mean. Today’s level of understanding is rather shallow.
  • Intent understanding combines the meanings of the individual words to try to determine what the speaker is trying to accomplish.
  • Sentiment analysis is used to understand the emotional state behind the spoken words.
  • Language translation has 2 different paths: the first is translating into a standard language and the second is translating into a different language.
  • Speech production: synthesis or producing sounds in the correct order so one can understand what is said.
  • Text production is an output of voice technology.

Speech Recognition achieves parity with human conversation

Courtesy: Laura Kusumoto

The Switchboard corpus is used as the benchmark for voice recognition accuracy. The Y-axis marks the error rate. As the image above shows, the error rate has been dropping steadily since the early 1990s. In 2016, Microsoft achieved an error rate of about 6%, the same level that humans achieve on conversational speech, marking a major milestone.
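For readers who want to see what an error rate of this kind measures, here is a minimal sketch of the word error rate calculation: the number of word substitutions, deletions, and insertions needed to turn the recognizer’s output into the reference transcript, divided by the number of words in the reference. The function name and the sample sentences are invented for illustration and are not taken from the session or from the Switchboard benchmark tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substitution ("lights" -> "light")
# and one deletion ("the") against a 7-word reference.
reference = "turn the lights on in the room"
hypothesis = "turn the light on in room"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

A 6% rate means roughly 6 word-level mistakes for every 100 words in the reference transcript.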

Machine learning drives the advancement in voice technologies

Courtesy: Laura Kusumoto

Advancement in machine learning is driving voice technology capabilities. Machine learning is a data-driven process. Instead of programming a computer, engineers are using machine learning to teach computers. Companies with large quantities of data have a significant advantage since the data can be mined for analysis and understanding.

Courtesy: Laura Kusumoto

For every speech application there are different modules using different machine learning components. It’s not one single giant module; rather, it’s a collection.

Many hardware devices now embody voice technologies

Courtesy: Laura Kusumoto

Voice interface advances in this area are primarily in devices – many of which are in our homes – with built-in microphones.

Innovative and unique voice technology offerings

#1: Houndify builds on 10 years of music recognition analysis and uses this process to quickly decipher complex questions.

Courtesy: Laura Kusumoto

#2: Interactions blends human and artificial intelligence to understand spoken words.

Courtesy: Laura Kusumoto

#3: Google Pixel Buds, a new entrant in this space, offers instant translation in the wearer’s ear for over 40 different languages at a price point of $159.

Courtesy: Laura Kusumoto

What’s next?

It’s important to remember Asia and the developing world, where people are very keen on using speech as an interface, as well as people who do not type quickly or who have an impairment that prevents them from using touch surfaces.

Courtesy: Laura Kusumoto

As speech synthesis advances, we will have more natural sounding speech production. Below is a list of items on the voice tech road map:

Courtesy: Laura Kusumoto
  • Enhanced client-side processing so one does not have to be connected in order to use it.
  • Being able to hear in a noisy environment is another challenge to be addressed.
  • Understanding users with thick accents and understanding children are still being researched.
  • Understanding the real meaning of a conversation, and holding an extended conversation across mixed topics without losing context.
  • Privacy is a big issue.
  • Assigning personality to the voices is an art form and a hot research topic.



Health care use cases and applications of voice technologies by Mike Holland

Setting context on the Kaiser Permanente technology channels

The Innovation Lab at Kaiser Permanente focuses on prototyping and bringing concepts to life using emerging technologies. Kaiser Permanente prides itself on using technology to help members access information and to enhance its interactions with members and patients.

History of Kaiser Permanente’s Technology Journey

Courtesy: Mike Holland

Starting in the 1990s, Kaiser Permanente used telephones to support pharmacy refills. It also provided members with a book where they could look up symptoms and advice.

As the internet became prevalent, Kaiser Permanente built out its website, kp.org, and moved the phone services and the books online, where people can access the same information. KP also rolled out additional features, such as viewing lab results, emailing physicians, and making appointments, and now supports most member transactions online.

With the onset of mobile devices, Kaiser Permanente added mobile as a channel, heading down the omni-channel path.

Voice technologies use cases: TODAY

Courtesy: Mike Holland

Voice is another channel for member interaction and KP is exploring different opportunities to implement voice technologies. Here are a few examples of pilots underway:

1. Making information available through Amazon Echo or Google Home, e.g. phone numbers, hours, pharmacy, location.

Right now, members can get standard information using voice. The next step is to use natural language search to give members access to more info.

2. Integrating voice into the hospital room.

The Kaiser Permanente San Diego hospital is being used to test voice technology opportunities. The first phase is to test simple commands like “turn the lights on”, “close the blinds”, or a request to play music, using a device that is placed in the room. The objective is to let people with mobility issues control their surroundings without having to ring a nurse or have someone help them complete a task. The next phase is to interact with the TV screen for infotainment purposes such as ordering food, watching videos, and surfing the internet. The long-term goal is to enable clinicians to use voice to access patient medical records and charts.

3. Moving everything available on kp.org onto a voice-enabled platform for the members’ home.

This would allow patients to schedule appointments, order refills, and conduct a number of other activities by speaking directly to their voice-enabled device at home, without having to use the internet or their mobile device.

4. Security and authenticating patients.

5. Filling out forms and questionnaires.

People get frustrated with filling out long forms on a mobile device, and calls are not optimal from a cost standpoint. As a result, KP is exploring using a voice assistant (Siri, Google) to access the backend of a dermatology questionnaire and guide members through the list of questions while capturing their responses.
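As a rough illustration of the questionnaire idea, the flow can be thought of as a loop that speaks each question, listens for the answer, and stores it. The sketch below is not KP’s implementation: the questions are invented, and the `ask` callback is a hypothetical stand-in for whatever a real voice assistant would provide to speak a prompt and return the transcribed reply.

```python
from typing import Callable, Dict, List

# Hypothetical questions, invented for illustration only.
QUESTIONS: List[str] = [
    "Where on your body is the affected area?",
    "How long ago did you first notice it?",
    "Is it itchy or painful?",
]


def run_questionnaire(
    questions: List[str], ask: Callable[[str], str]
) -> Dict[str, str]:
    """Ask each question in order and capture the member's responses.

    `ask` is assumed to speak the prompt and return the transcribed reply.
    """
    responses: Dict[str, str] = {}
    for question in questions:
        answer = ask(question).strip()
        if not answer:
            # Nothing was heard: repeat the question once before moving on.
            answer = ask("Sorry, I did not catch that. " + question).strip()
        responses[question] = answer
    return responses


# Stand-in for a voice assistant: replies come from a scripted list,
# including one empty reply to exercise the repeat path.
scripted_replies = iter(["left forearm", "", "two weeks", "itchy"])
captured = run_questionnaire(QUESTIONS, lambda prompt: next(scripted_replies))
for question, answer in captured.items():
    print(f"{question} -> {answer}")
```

Keeping the speech layer behind a single callback means the same question list could, in principle, be driven by different assistants or by a text form.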

Voice technologies use cases: NEAR FUTURE

Courtesy: Mike Holland

The next use cases are more complex and involve more AI interaction using voice and agent technologies. These are more futuristic.

1. Symptom checker + advice.

Members can access advice based on KP treatment approaches. When patients set up appointments through the voice interaction, the diagnosis and treatment recommendation are sent to the clinician for review. This makes for a more sophisticated interaction for members.

2. Physician assistance: charting and scribing for clinicians and completing tasks.

This use case intends to move doctors away from typing into the EMR. Behind the scenes is an intelligent agent platform that extracts relevant data, orders prescriptions, and performs tasks so that physicians can focus on the 1:1 conversation with patients. At the end, the system cues the physician to review and approve the charting data before it is submitted to the EMR.

3. Voice as a way to support health and wellbeing of seniors at home.

KP anticipates that voice can be used as a command module for controlling sensors and other monitors that help watch over seniors and patients who have special needs or chronic conditions, or who are recovering from surgery. This is especially important for patients who have been discharged after a hospital stay and need physical movement as part of their recovery. Reminders can help seniors live an independent life.

Challenges voice technologies face

Mike Holland touches on 2 big challenges voice technology faces today.

1. Privacy

Privacy is a big deal. Authentication is critical. Where does the interaction data live? Who owns the data? Is it secure? What about PHI that is shared? How does KP manage around hackers?

2. Usability

A big challenge is how to design for empathy. Take a symptom checker, for example: how does one build empathy into an electronic voice? KP studies show that the voice of a loved one is more comforting, and that people are more apt to listen to it and follow the instructions given. There are solutions that can mimic anyone’s voice to produce the ‘voice’ of a loved one. While this is an excellent option, it can also create authentication issues.


Featured Innovator

Voice Technologies – “Nadia” by Marie Johnson


Setting context that led to the creation of Nadia

Courtesy: Marie Johnson

Nadia was created for the National Disability Insurance Scheme (NDIS), with a focus on innovation using new technologies to support Australians with disabilities. The NDIS is the largest social reform in Australia in the past 30 – 40 years, since the introduction of Medicare. The scheme covers 460,000 people with disabilities, 60% of whom have an intellectual disability.

The scheme also supports 2 – 3 million caregivers and family members. When it is fully rolled out in 2 years, it will cost $22 billion per year to run. The scheme is projected to create 70,000 jobs, and $1 billion of the $22 billion will be spent on assistive technologies.

The Inspiration

Courtesy: Marie Johnson

The UN’s Convention on the Rights of Persons with Disabilities calls for adaptive services for disabled individuals based on their needs.

Courtesy: Marie Johnson

Co-design and co-creation with the target users

For many years, Marie spoke about the complexity of information on websites which pushes the responsibility of finding information to the users. The advent of virtual assistants simplified the information finding process. Extending this to the disabled population – how does one create a frictionless interaction?

Courtesy: Marie Johnson

The design process involved the creation of the Digital Innovation Reference Group, made up of individuals with intellectual and physical disabilities, psychologists, and carers, to co-design these new empathetic interfaces.

The left side is a drawing of how they would like to interact with the system: a face, simple words, a conversation. On the right side is a screenshot of the result after 12 months.

Courtesy: Marie Johnson

The project was a collaboration with the University of Auckland’s digital human team. The challenge put forth was: can we put together a digital human backed with a cognitive system?

Courtesy: Marie Johnson

The co-design and co-creation process involved university teams and people in the community. It revealed that AI is neither UX nor IT. It is an embodiment of many layers, including personality, which, once defined, drove many other aspects of the embodiment: interactions, conversation models, questions and answers, and the way answers were constructed. All of these were co-designed with individuals with cognitive disability so that the answers could be understood.

Courtesy: Marie Johnson

This model has broad application and can be applied in many other contexts.

Other health care examples of voice technologies by Gloria Zaionz

Sykehuset Østfold HF is using voice tech to turn doctors’ dictation from voice into text. It is exploring translation next, as the county has many foreign speakers. Today it uses expensive human translators by phone.

The Max Planck Institute in Nijmegen is researching telephone calls from COPD patients for diagnosis. For more info, contact the REshape Center for Innovation.

Future Use Cases of voice technology

What if… a nurse walks into a patient room and says: “Siri, give me the observations and medications from the last couple of days.” The AI answers with EHR information in the nurse’s earpiece, which keeps it secure. Measuring the nurse’s vital data could serve as an extra layer of authentication.




Fellowship opportunity at NeuroLex (Shared by Shawna Butler)

For those innovating in the voice technology space, there is a fellowship opportunity with a company called NeuroLex, a seed-stage diagnostics company applying speech analysis to detect various health conditions early, before full-blown symptoms occur. http://innovate.neurolex.co

NeuroLex’s core vision is to pioneer a universal voice test, like a blood test with extracted features and reference ranges, for use in primary care to detect psychiatric and neurological conditions. One of their execs, Jim Schwoebel, joined the webinar.


Q: Guesstimate when this tech will be so small we can put it in the ears of healthcare workers (get feedback in an earpiece) in healthcare organisations. Sort of medical EarPods meets HER

A: (Laura Kusumoto): That’s a great concept. The consumer electronics (Pixel Bud, Dash Pro) are already available, so the challenge is to develop the apps to retrieve and deliver the information needed at the right time. The Pixel Bud already uses Google Translate, which is available as a platform, too. The challenge would be integrating this with the hospital / EHR backend systems.

