SDS 724: Decoding Speech from Raw Brain Activity, with Dr. David Moses

Podcast Guest: David Moses

October 20, 2023

In this Friday episode, host Jon Krohn talks to UCSF’s David Moses about BRAVO (Brain-Computer Interface Restoration of Arm and Voice), a study led by Edward Chang and Karunesh Ganguly that helps patients who have lost the ability to speak to communicate once again via a speech neuroprosthesis. Postdoctoral engineer David Moses, who is a part of BRAVO, reveals the data and machine learning models that help BRAVO predict the words and facial expressions that a paralyzed patient is trying to form via their brain activity, crucially helping patients to communicate with medical practitioners and loved ones.

About David Moses

Dr. David Moses is an adjunct professor at UCSF within the Edward Chang lab. After getting his PhD and doing post-doctoral research in the lab, he is now the project lead on the BRAVO clinical trial to decode intended speech from the brain activity of participants with paralysis as they attempt to speak. He is co-first author on a recent Nature publication in which the team demonstrated high-performance translation of neural activity into text, audible speech, and avatar animation.

Overview

After watching a video in which a woman with paralysis was able to communicate through an avatar, Jon Krohn brought David Moses on the show to talk about these exciting new developments in MedTech. David is a part of the BRAVO team, who develop speech neuroprostheses to support people with paralysis. In 2021, they discovered that a clinical trial participant who had not spoken in over 15 years had still retained a functional representation of speech in the brain. They identified a way to help patients who are unable to speak to communicate once more by using machine learning to decode data gathered from the brain.
The speech motor cortex is the part of the brain that controls the coordination and contraction of the lips, tongue, jaw and larynx. It is essential for producing speech. Implanting sensors over this part of the brain lets scientists record the data that machine learning models then convert into the patient’s intended speech.
In a new frontier for medical technology, the team at UCSF is now finding ways to increase the accuracy of reconstructing what patients are trying to say. Doubling the channel count of their sensors, for example, increases the fidelity of the acquired data, enabling a participant to converse in full sentences and with a larger vocabulary at a rate of about 75 words a minute. The team achieved this rate after just two weeks of model training.
Considering the typical rate of speech is 150 words a minute, David highlights that BRAVO still has some way to go before the models’ approximated speech outputs can achieve these rates. Nevertheless, this development presents a major leap in brain-computer interface (BCI) technology. In addition, patients can now select from hundreds of avatars (digital likenesses) to best represent them, to help them feel more at ease with their digital identity. You can watch this in practice in the video, “How a Brain Implant and AI Gave a Woman with Paralysis Her Voice Back”.
Listen to the episode to hear David explain how advanced models are decoding speech sounds directly from the brain, how he selects and trains his model architecture, and how he plans to increase model performance in predicting facial expressions and speech.

Did you enjoy the podcast?

  • How close do you think machine learning will come to mapping and understanding the human brain’s capacities in the next 10 years?

Podcast Transcript

Jon Krohn: 00:00

This is episode number 724 with Dr. David Moses, adjunct professor at UC San Francisco. 
00:19
Welcome back to the Super Data Science Podcast. Today we’ve got a scientist on the show who has used his intellect and determination to pull off nothing short of a miracle, taking paralyzed patients who have lost the ability to speak, and then using machine learning to decode their brain activity, thereby enabling an avatar on a video screen to speak for them in real-time. 
00:39
The scientist is named Dr. David Moses, and he’s an adjunct professor at UC San Francisco. He’s also the project lead on the BRAVO (Brain-Computer Interface Restoration of Arm and Voice) clinical trial. The recent success of this extraordinary BRAVO project resulted in an article in the prestigious journal Nature that was published two months ago, as well as a YouTube video that has over 3 million views despite being published just last month.
01:04
Today’s episode does touch on specific machine learning terminology at points, but otherwise should be fascinating to anyone who’d like to hear how AI is facilitating real-life miracles. In this episode, David details the genesis of the BRAVO project, the data and the machine learning models they’re using on the BRAVO project in order to predict text, to predict speech sounds, and to predict facial expressions from the brain activity of paralyzed patients. And then he fills us in on what’s next for this exceptional project, including how long it might be before these brain-to-speech capabilities are available to anyone who needs them. This is a special one. Let’s jump right into our conversation.
01:41
David, welcome to the Super Data Science Podcast. I’ve been excited about this episode for a while. Thanks for coming on. Where are you calling in from? 
David Moses: 01:48
Thanks Jon, for having me. I’m calling in from San Francisco right now. I work at UCSF. 
Jon Krohn: 01:54
Yep, yep. And yeah, I feel super lucky to have you on the show. I actually told you before we started recording that I wasn’t going to tell the audience this story, but I’m going to do it anyway. 
David Moses: 02:07
Of course.
Jon Krohn: 02:07
Because I came across a video on LinkedIn. It was a clip of a video, and so it didn’t have context around it. It didn’t say whether there was an academic paper associated with the video, it didn’t say UC San Francisco. It didn’t have any of the authors’ names associated with this research, but there was this incredible video of a paralyzed woman with recording electrodes in the back of her head who was using her brain to control an avatar on a screen, and so this avatar was able to speak the words that she was thinking. It was able to capture her tone of voice from before she was paralyzed, and it was able to have both gestures, face gestures, those related to speech, as well as even those unrelated to speech. And so in those three modalities, text, as well as speech sound, as well as facial gestures… Yeah, recording electrodes just in a woman’s head, allowing her to be that expressive in real-time to her loved ones, to physicians. 
03:29
You’ve taken somebody who is, yeah, I mean, completely unable to do any of that kind of real-time communication, and all of a sudden making them as expressive as somebody who doesn’t have any paralysis at all. And so when I saw that video, I was blown away. And so because there was no context, I commented on the LinkedIn post, I was like, “Could somebody please provide me with some extra context on this? I want to read about it more.” And somebody commented, “You should talk to David Moses.” And now you’re here. So this is your work. You’re one of the authors of this paper. So yeah, so tell us about it, tell us about how this came about, maybe if there were like… Yeah, how did this research come about? Is this the kind of stuff that you’ve been doing in your lab the whole time? Yeah, go. Shoot. 
David Moses: 04:24
Yeah, that’s great. And shout out to my friend Kesshi who messaged me and was like, “You should talk to this person.” I think she’s the one who commented and kind of orchestrated this meeting. But yeah, thanks again for having me on. I’m happy to talk about it. I think it’d be good to give an overview. It is a very complicated project and I’m hoping that I can kind of break it down into something that everyone can understand. So to start, we’ve done a lot of work in the lab with… Actually, before we worked with people who have paralysis, we worked with people who have epilepsy. And they get an electrode array implanted on the surface of their brain, and these are small circular sensors that pick up electrical activity from the neurons on the brain surface, and in the cortex, which is the outermost layer of the brain, the outermost neuron containing layer of the brain.
05:25
The clinicians use it to identify where seizures are originating from so that they can do surgery to fix it or to plan treatment. And sometimes these patients very generously volunteer to work with us and to do speech tasks. So we play them speech sounds, or they speak, and we learn the mapping of the brain activity to speech. So how does the brain represent the vowels and consonants that make up English or even other languages? And this was the basis of our scientific understanding, and also our engineering understanding as we had some projects where we tried to reconstruct what they were saying from the brain activity. This was a really great progression of work throughout the past decade, even a little bit longer, from our lab and other labs who have been really focused on understanding the neurological basis of speech.
06:22
And now our question turned to: can we actually use what we’ve learned to help people who are paralyzed and who are unable to speak? And this is what made us start the BRAVO trial, which stands for BCI (Brain-Computer Interface) Restoration of Arm and Voice. Our collaborators in Dr. Ganguly’s lab, also at UCSF, focus on motor restoration, like cursor control and robotics. We don’t focus on that, we focus on speech. And so we had a work that came out in 2021 with Pancho, our first clinical trial participant, that really showed that someone who hasn’t even spoken in over 15 years still retains this functional representation of speech in the brain. And we can actually pick up on that and decode that into words as they’re trying to speak. 
Jon Krohn: 07:14
It’s really interesting. Do you think that part of that is that people would still be having an internal dialogue? 
David Moses: 07:21
So it’s a really great question, and that is one point of clarification. We do really try to make this clear. It is subtle, but it actually does make a huge difference in the brain activity, and that is, for this to work, for our approach to work, you can’t just think about the speech. You can’t have an internal monologue. You can’t just imagine yourself saying it or imagine yourself hearing it. You have to try to say it, because what we’re doing is we’re implanting the sensors over the surface of the speech motor cortex, which is the part of the brain around here that allows us to control our vocal tract and speak. So it’s actually sending commands that go through and activate different muscles in your vocal tract, your lips, your tongue, your larynx, your jaw, and it’s the precise and rapid coordination of these muscles that allows us to speak.
08:17
And that is the part of the brain that we’re picking up on. That’s the part of the brain that we record from. So the person has to actually try to speak because when they try to speak, even though both Pancho and Ann, our latest participant who I’ll talk about shortly, they have brainstem strokes. So the connection from this brain area, through the brainstem, and then to the facial muscles, that is what has been damaged, that pathway. And so what we’re doing is we’re bypassing the brainstem where they had a stroke, and we’re taking the data from the speech motor cortex, the articulator representations, and converting that directly into their intended speech using machine learning. And so that was our first work with Pancho, and now with Ann, our latest participant, and that’s the video you saw. 
09:07
We used a denser electrode array with more channels. So we basically doubled the channel count in our sensors, so we get more information, higher fidelity signals from the speech motor cortex, and that’s what we use to decode their intended speech. And from this increase in fidelity, we were able to achieve something a little bit beyond what we showed with Pancho, and that is we can have Ann communicate in full sentences at a rapid 75 words per minute. That’s kind of around that speed; we can even show a little bit beyond that for more constrained outputs. But for the purposes of the main takeaway, it’s more rapid, with a larger vocabulary. So now she can communicate with over a thousand words and construct sentences that are relatively arbitrary as long as she uses those words, and communicate at about 75% accuracy at about 75 words per minute. 
10:10
So it isn’t at the rate and efficacy of fluent speech, which is usually about 150 words per minute when we’re talking, probably around that rate, and with a larger vocabulary. But this is a major next step in brain-computer interface technology to restore this level of function to someone who is paralyzed. And as you mentioned earlier, we not only are translating this into text, but we can also directly decode the speech sounds. So restoring a voice directly from brain activity to vocal output. And we conditioned this model on some footage we had of her before she was injured. 
Jon Krohn: 10:56
Wow. 
David Moses: 10:56
And so it’s in her own voice, it’s personalized to her likeness. And she was also able to choose an avatar from hundreds of options that she felt embodied her the most, and we animated the avatar alongside the synthesized speech. So she’s able to communicate using this digital likeness of herself in an audio visual output, both the speech sounds and a facial animation of a digital avatar. And for that, we partnered with this company Speech Graphics, and they helped us with the avatar animation processing and algorithm. But all of this is being decoded from the brain. All of the information that’s driving all of these outputs is directly from brain activity. 
Jon Krohn: 11:43
That is so fascinating. So these three different elements, did you do them sequentially, or…? Yeah, yeah. Yeah. So it wasn’t like you were trying to do all three at once?
David Moses: 11:54
Well, yeah. Sorry to interrupt. It’s a little bit of both. We have tried those three individually, and then we also tried them all at the same time. And so, we think our culminating demonstration is video one in the publication, and it shows her using the system, and we’re decoding text through one model as she’s attempting to speak, and we’re also decoding the speech sounds directly from the brain in a separate model pathway. The model structures are similar, which we can get into later. And then the animation of the avatar is happening in real-time alongside, using the synthesized speech. So all three outputs are being generated at the same time, in real-time. 
Jon Krohn: 12:45
Yeah, yeah. Very cool. Yeah. So I guess there’s separate machine learning models for all three modalities, or is it one? 
David Moses: 12:54
We did train, for this, three separate machine learning models. That’s right. There are advantages and disadvantages, and I think it’s this really interesting problem to say, “Oh, how can we have one universal model to do all of that?” But for this work, we did train three separate models. Now the structure of these models was very similar, but the output features that they were targeting were different. And yeah, there might be some small differences in the model structure, but overall the pathway is very similar. 
Jon Krohn: 13:30
So in terms of the way that the hardware works, I’m actually going to leave listeners to check out episode number 696 with Bob Knight, who did an episode focused primarily on how the hardware works for this kind of brain recording. And that’s his expertise, he’s been doing that for decades. And you actually know him because he’s affiliated; I think his primary institution is Berkeley, but he is also associated with UCSF. So, yeah. So listeners can check out that episode, 696, if they want to learn more about the recording hardware. Bob mostly just left it at, “Then the data are collected, and you can train a machine learning model downstream for various kinds of tasks.” And so let’s focus on that with you, David. So yeah, how do you do it? I mean, I guess you could pick one of these three that you think is most interesting, or potentially go through all three and kind of explain how you collect the training data, how you choose a model architecture, how you train it, how you validate it, and then how you get it into production. 
David Moses: 14:33
Great. Yeah, happy to do that. We can start with the text model. And I should say this work was an enormous collaborative effort, as you might imagine. I did not do all this by myself, and we have some really, really talented people on the team who have done a lot of the machine learning, like the hands-on machine learning, the nitty-gritty. So that’s Sean Metzger, Kaylo Littlejohn, Alex Silva. They were hugely instrumental to this project. They’re also the co-first authors on the paper, and also a shout-out to Margaret Seaton, who is our Clinical Research Coordinator who did a lot of organizing. It takes a lot of work to bring something like this to fruition. But I can definitely walk through the high level so that people understand what we’re doing. One of the beauties of our approach is that it’s the same training data that we can use to train all three models. And the training data is very simple: we show a sentence on the screen, there’s a brief little countdown, the text turns green, and she silently attempts to say it. I’ll take a brief pause to explain it a little bit. When she tries to vocalize, because of her specific form of paralysis, it’s very laborious, very effortful, and very slow, and it’s unintelligible.
16:01
But when she tries to speak silently, so that’s what we call miming, it’s like mouthing words. So you’re moving your mouth, your lips, but you’re not producing any vocal output. She found that she could do that fairly rapidly; that was actually something she could try to do rapidly. So that’s what we do in this work. So she’s not producing any vocal output and she’s just silently trying. Even though she doesn’t have full control of her articulators, she’s trying to read the sentences aloud, or rather, she’s trying to say the sentences silently. And so from that, that’s basically our training data. We know the times where she’s instructed to start to try to say the sentences, we know what sentences she’s trying to say, and we have the brain activity. That’s our dataset. 
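To make the shape of that training data concrete, here is a minimal sketch, not the lab's actual code; the field names, sampling rate, and feature choice are illustrative assumptions. Each trial simply pairs the prompted sentence and the go-cue time with the neural window recorded while she attempts to say it.

    # Illustrative sketch only: one training example pairs the prompted sentence
    # with the neural activity recorded after the go cue. Names and rates are assumed.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Trial:
        sentence: str        # the prompted sentence shown on screen
        go_time_s: float     # when the text turned green and she began attempting it
        neural: np.ndarray   # [time_steps, channels] neural features (e.g. high-gamma)

    def make_trial(sentence, go_time_s, recording, rate_hz=200, max_s=10):
        """Slice the recording from the go cue onward; no audio or face tracking needed."""
        start = int(go_time_s * rate_hz)
        return Trial(sentence, go_time_s, recording[start:start + int(max_s * rate_hz)])

    # Example with fake data: a 60-second, 253-channel recording at an assumed 200 Hz.
    recording = np.random.randn(60 * 200, 253)
    trial = make_trial("Hello, how are you today?", go_time_s=2.5, recording=recording)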
Jon Krohn: 16:49
Wow. 
David Moses: 16:49
There’s no acoustics.
Jon Krohn: 16:51
No. 
David Moses: 16:51
There’s no face tracking or any of that. It’s just what the sentence was when she was trying to silently say it and the brain activity during that process. And so let’s start. So now we have this dataset of brain activity and sentences. Let’s start with the text model. So what we do is we actually break this down from graphemes, so letters and words, into phonemes, and phonemes you can think of as the alphabet of speech sounds. So they’re different contrastive units where if you change one of these to something else, it’s a different word. So let’s say the word cat, that has three phonemes, the /k/ sound, the /æ/ sound, and the /t/ sound. And if I were to change one of those to /ɔ/, then it’d be cot, which is a different word. And so there’s about 39 of these in English, and that’s what we convert the sentences to, is to sequences of these phonemes. 
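As a quick aside for readers, the phoneme decomposition David describes can be explored directly with the CMU Pronouncing Dictionary; this is just an illustration of the cat/cot example, not part of the BRAVO pipeline.

    # Illustration of phoneme decomposition using the CMU Pronouncing Dictionary
    # (ARPABET symbols); requires the "cmudict" corpus to be downloaded once.
    import nltk
    nltk.download("cmudict", quiet=True)
    from nltk.corpus import cmudict

    pron = cmudict.dict()
    print(pron["cat"][0])   # ['K', 'AE1', 'T']
    print(pron["cot"][0])   # ['K', 'AA1', 'T'] -- change one phoneme, different word

    # Stripping the stress digits leaves the base phoneme inventory David mentions
    # (roughly 39 symbols in English).
    print([p.rstrip("012") for p in pron["cat"][0]])  # ['K', 'AE', 'T']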
17:54
Now, we don’t have explicit timing because she’s silently trying to say these sentences. So we don’t know exactly when each phoneme is occurring, but we know they’ll occur in this specific sequence, and we have the brain activity. And so we use a model structure. Well, we use a bidirectional recurrent neural network to learn the mapping between the brain activity and the phoneme sequences. And we do that using what’s called a CTC loss, a Connectionist Temporal Classification loss. And all this is doing, it’s a way to let the model learn without forcing it to care about the exact timing of certain elements in the sequence. So basically, the model is able to, across many sentences, learn the mapping between the neural activity and phonemes, even if it doesn’t know exactly when each phoneme happens in the sentences. It just knows the order in which they happen, and it will learn and predict the timing on its own, and it’s a way of basically learning the mapping and the timings across many, many sentences.
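For the machine-learning-minded, here is a minimal PyTorch sketch of the kind of model described: a bidirectional recurrent network mapping neural feature sequences to per-timestep phoneme probabilities, trained with CTC loss so phoneme timing never has to be labeled. The layer sizes and feature dimensions are illustrative guesses, not the published architecture.

    # Minimal sketch (not the published model): bidirectional GRU + CTC loss mapping
    # neural features to phoneme probabilities without explicit phoneme timing labels.
    import torch
    import torch.nn as nn

    N_CHANNELS, N_PHONEMES = 253, 39   # 253 electrode channels; ~39 English phonemes

    class NeuralToPhonemes(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(N_CHANNELS, hidden, num_layers=3,
                              batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, N_PHONEMES + 1)  # +1 for the CTC blank

        def forward(self, x):                    # x: [batch, time, channels]
            h, _ = self.rnn(x)
            return self.proj(h).log_softmax(-1)  # per-timestep phoneme log-probabilities

    model = NeuralToPhonemes()
    ctc = nn.CTCLoss(blank=N_PHONEMES)

    neural = torch.randn(8, 400, N_CHANNELS)               # fake batch: 8 trials, 400 time steps
    phoneme_seqs = torch.randint(0, N_PHONEMES, (8, 30))   # target phoneme sequences (order only)
    log_probs = model(neural).transpose(0, 1)              # CTCLoss expects [time, batch, classes]
    loss = ctc(log_probs, phoneme_seqs,
               input_lengths=torch.full((8,), 400, dtype=torch.long),
               target_lengths=torch.full((8,), 30, dtype=torch.long))
    loss.backward()

At inference time, the per-timestep probabilities form exactly the time-by-phoneme "heat map" David describes next.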
19:01
And so the output of this model when we run it in real-time is a series of phoneme probability vectors. So you can think of a 2D matrix. Time is on one axis, phonemes, like the 39 different phonemes, on the other, and it’s like a heat map of how likely each phoneme was at each time step. So that’s our output from this model step, okay? And so from that, we use language model techniques to convert that into sentences, and so there’s a few steps there. The first is to apply lexical constraints. So basically, we have our vocabulary and so we know any word in the output should be in that vocabulary. And so, it basically generates a bunch of hypotheses of saying, “Okay, I have these phoneme probabilities. What could the sentence have been? Maybe it’s, ‘Hello, I’m here,’ or, ‘Hello, I am near,’ or, ‘Hello, I go there.’” That’s not a great example, but you can see from the probabilities you can generate a bunch of potential sentences just by applying this constraint to make sure it’s actually a series of allowed vocabulary words. And then we apply an actual language model to rescore these possibilities based on how likely they are to occur in English. 
20:36
So, “How are you?” is a more likely sentence than, “How are glasses?” for example. And so you can learn that from English, from online sources. You don’t have to use brain data to learn that. And the vocabulary mapping I just talked about, you don’t have to use brain data to learn that either. You can use readily available technology developed in natural language processing, et cetera, to go from this mapping of basically the phoneme probabilities to the output sentences. And so that’s kind of one beauty that I’d like to emphasize of this approach: we know that the brain activity in this area is correlated to at least a certain… Well, it’s correlated to phonemes. Phoneme representations are encoded. It’s primarily articulatory representations in this brain area, but phoneme representations are a close correlate as well. 
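As an illustrative stand-in for that rescoring step (not necessarily the language model the team used), here is how candidate sentences, already restricted to the allowed vocabulary, could be ranked by how plausible they are as English using an off-the-shelf GPT-2 from Hugging Face.

    # Illustrative rescoring sketch: rank vocabulary-constrained hypotheses by how
    # likely they are as English. GPT-2 is a stand-in for whatever LM the team used.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def lm_score(sentence):
        """Average negative log-likelihood per token; lower means more plausible English."""
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            return lm(ids, labels=ids).loss.item()

    candidates = ["Hello, I'm here.", "Hello, I am near.", "Hello, I go there."]
    print(min(candidates, key=lm_score))   # the hypothesis the language model prefers

    # In the real pipeline this score would be combined with the phoneme-probability
    # evidence from the neural decoder, not used on its own.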
21:33
And so that’s the only part that we actually have to create from neural data, is the neural-to-phoneme probability mapping, and everything else downstream can be done using technology, and techniques, and algorithms that have been developed and optimized over years and years of research using millions of hours of data, millions of words of data. And so we can get this approach where we can leverage a lot of advances in this kind of adjacent field of natural language processing. And so that’s the pipeline, and that can all happen in real-time. The models are trained on our servers, and then we pull them down to our system, and we run it in real-time right there, side by side with the participant, and that’s what generates the text output. So I’ll pause there if there was anything that was maybe unclear about that pipeline.
Jon Krohn: 22:28
No, that was crystal clear, David. I’ll just try to articulate back to you what you said in my own words. So yeah, for this text model, you have amazing results. You’re getting something like 75 words per minute, with a thousand word vocabulary, at 75% accuracy, and you’re using the same training data for all three models. For the case of this text model, where you’re trying to predict what the words are, you use a bidirectional recurrent neural network to map brain activity into phonemes, which you described as the alphabet. It’s like the unit of speech sounds. 
David Moses: 23:09
That’s right. 
Jon Krohn: 23:10
And so I think you said that there are 39 possible phonemes making up the parts of speech. And yeah, so that model right there is the completely new thing that you’ve done. Nobody had done that before. And I love that visualization, it was so clear to me, of a heat map where you have just two-dimensional… So over time, going from left to right in the heat map, you have words and then you have… Sorry, going left to right you have time, and then the rows represent the 39 possible phonemes. 
David Moses: 23:57
That’s right. 
Jon Krohn: 23:58
So you have this heat map of, at a given unit in time, what is the phoneme that is likely being spoken here? And sometimes I would imagine there’d be some ambiguity. So some of the phonemes would have closer neural correlates. So I don’t know, I’m kind of guessing here, but things like /b/ and /ʌ/ or something. You’re using more of the same motor neurons to cause those two sounds than, like, /b/ and /ɑː/, so there could be… So this heat map that is produced has some ambiguity, but then you can resolve a lot of that ambiguity by using NLP models that have been trained on orders of magnitude more data, maybe more or less out of the box, to be able to convert, like you said, “To use probabilities of what phoneme is likely to be next,” so that you could predict, yeah, sentences that make sense. 
David Moses: 24:58
Exactly. It’s almost like an autocorrect or autocomplete features on your phone. 
Jon Krohn: 25:03
Right. 
David Moses: 25:03
It’s a similar vein of effort and algorithm. Yeah, I mean the model we use, this RNN, to go from the neural activity to phonemes, it was something that we hadn’t shown before, but it’s definitely… There’s a lot that we learned from these adjacent fields, and I think that there was another work that came out recently as well alongside our paper that used a similar approach. So it’s all… I think the field is moving together to use these newer advances, newer techniques. It’s really exciting to be able to have this kind of huge fascination with speech processing just outside of the neuroscience realm, and that’s benefiting. It can actually benefit us and benefit patients one day. 
Jon Krohn: 25:49
Yeah, it’s super cool. I don’t know if we could really quickly talk about the models for the speech sounds, to actually try to… So it was really cool that you mentioned there how you were able to use a video from prior to Ann’s paralysis to be able to emulate what her speech actually sounded like.
David Moses: 26:08
Right. 
Jon Krohn: 26:08
If we could do that one quickly, and then the animation part quickly as well, because then I still have some follow-up questions to squeeze in. 
David Moses: 26:15
Oh, no worries. Let’s do it. So yeah, the speech part was very interesting, and that really was something that hadn’t been shown before, that you can go from brain activity of someone who has paralysis, directly into these speech sounds. And for that, what we did was it actually ended up being fairly similar to the text. There were some differences. We couldn’t use a language model like what we use with the text, because now we’re just in an acoustic space. But what we did was we used what’s called these HuBERT units. And so HuBERT is a way, it’s kind of like a self-supervised model of speech sounds. 
26:57
And so what this model… This is something that was developed not by us, but it processes acoustics and it learns discrete units that can represent and be used to reconstruct the acoustics. And so instead of the complicated space of direct acoustics and sound waveforms, this is a representation that we can use to go to something that’s not that different from phonemes, actually. I’m not saying that the units are phonemes, but in how we work with them, the structure is similar because they’re now these discrete units, just like phonemes are discrete units. But these are just discrete abstract units; they may not all have a direct, easily intuitive interpretation. But we have this representation of speech that we can now take the brain activity and map it to. 
27:55
So let’s take a step back. So we have the sentences that she strives to read; we take those sentences and we use a text-to-speech model to generate a waveform from the sentence, and then we take that waveform and apply this model to generate these units. So now we have, imagine another heat map, it’s time again, but HuBERT units on the y-axis this time, if you want to imagine it like that, and then this map is telling you which unit occurred at each time point. 
28:28
And actually, for the training labels, this is actually not a map, it’s a sequence. So at every time point you have one unit. But when you predict it… So now we do the same thing, using this CTC loss to learn the mapping, since we don’t know the timing. We have the brain activity and we have the sequence of units, and now we’re learning this mapping. And so when we predict it, then we have this heat map of HuBERT units. And then we pass those through additional models called vocoding models. So eventually we actually take those units and convert them back to a spectrogram of the speech. So that’s another representation of acoustics, and then we take that and generate an actual speech output from it.
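For readers curious what those discrete units look like in practice, here is a rough sketch of deriving HuBERT-style units from a waveform: extract features from a pretrained HuBERT model (via torchaudio) and quantize them with k-means. The specific TTS model, HuBERT layer, cluster count, and file name here are assumptions; the team's actual unit extraction may differ.

    # Rough sketch: waveform -> HuBERT features -> k-means cluster IDs ("units").
    # Layer choice, cluster count, and the input file are illustrative assumptions.
    import torch
    import torchaudio
    from sklearn.cluster import KMeans

    bundle = torchaudio.pipelines.HUBERT_BASE
    hubert = bundle.get_model().eval()

    waveform, sr = torchaudio.load("tts_output.wav")   # hypothetical TTS rendering of a training sentence
    waveform = waveform.mean(0, keepdim=True)          # mix down to mono, shape [1, time]
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.no_grad():
        features, _ = hubert.extract_features(waveform)   # list of per-layer feature tensors
    frames = features[6].squeeze(0)                       # one intermediate layer: [frames, dim]

    # Quantize each frame to a discrete unit ID. In practice the k-means codebook is
    # fit on a large speech corpus beforehand, not on a single utterance like this.
    units = KMeans(n_clusters=100, n_init=10).fit_predict(frames.numpy())
    print(units[:20])   # a sequence of discrete unit IDs over time

These unit sequences then play the same role the phoneme sequences did for the text model: targets for a CTC-trained decoder.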
29:11
And then finally we apply a voice conversion model to go from that representation and convert it into the participant’s voice. And actually in a different… In the latest version of this, which we do talk about briefly in the publication, we don’t have to go through that extra step. We can go directly from… We cut a step out, we don’t need the voice conversion model. It’s actually the synthesizer itself that has the participant’s vocal likeness in it, and so we can cut a step out there. But just to recap, it’s brain activity to these special units, which you can think of as a compressed representation of speech, that’s what we predict, and then we do a few steps to synthesize that into a personalized voice for the participant. 
Jon Krohn: 30:02
Very cool. And you even did the summary back there for our audience, so you can skip me doing that this time. You articulated that perfectly. And yeah, let’s just skip quickly to the third modality, the actual visual animation. So you mentioned that Ann chose from hundreds of different avatars, and I’ll be sure to include a link to this YouTube video. It’s only five minutes long and shows all of this, so you can see for yourselves exactly how this looked. But yeah, there’s just a screen in front of Ann, and so in real-time, as she imagines silently making sounds. 
David Moses: 30:46
I guess so. 
Jon Krohn: 30:48
Yeah, silently attempting to make these sounds. It’s so funny, we don’t really have the language for describing that well. 
David Moses: 30:57
If you’re not in the space, it’s really hard to think about that specific thing, yeah. But yeah, I’m happy to describe that. So for this, we worked with this company called Speech Graphics, and they specialize in… You take acoustics and you can animate a digital avatar from the acoustics. So they do this for, for example, video games. You have a video game character, and they have a voice line. You can animate their face given the acoustics. In our implementation, we use Unreal Engine, and we have an avatar sheet. The structure of it enabled us to choose from a variety of different candidates, and it would all work the same. We can animate the same kind of avatar because the structure, the underlying structure of the avatar is similar.
31:51
And so what we do there is we work with Speech Graphics to get their representation. It’s like a proprietary representation of the articulators that can be used for animation. And we actually were able to convert to that. So we had the text sentences. So during training she sees the sentence, attempts to silently say the sentence, we convert that to speech, and then we convert that to this other representation that’s articulatory. And then from that we actually do something to discretize it, because it’s easier for our model. And so what we’re left with actually still resembles the first two. 
32:33
We have a series of articulatory units through time for each sentence, and then we map the brain activity to that series, and we predict a heat map of the articulatory units over time. And this we actually do show: we compress those into a series of units, and then that is fed into the animation engine that’s used to animate the avatar. So you can see, for text, it’s brain activity to this representation of text, into the text itself. For speech, it’s brain activity into a representation of speech, and then back into the speech. And for the avatar, it’s brain activity into a representation of articulation, of muscle movements, and then that gets rendered into an avatar. 
33:25
And in this work, when we ran this in real-time, we actually just took the synthesized acoustics, so the actual voice output, and we used that to animate the avatar in real-time, but offline we did analyses where we went directly from brain to articulators. Those are just two different ways of doing it. And this culminates in the outputs. You can see the video that we have really demonstrates a future use case of the technology, and how we envision this being used in the future, where she can use this to… A participant can use this to communicate with their loved ones and with their caregivers.
34:08
The first video in the paper also shows, kind of in the experimental laboratory setting, what we actually showed in the work, which is her being given a prompt, she reads back the prompt, and it decodes all three outputs in kind of a more controlled environment. But we definitely want to move towards having someone be able to use this to just do their daily conversations and go live their daily lives, express themselves. That’s really the goal. 
Jon Krohn: 34:37
Yeah, it’s really amazing. So the video on YouTube has over 3 million views, which is wild. It came out just a month ago, and to summarize what a lot of us are thinking about this, the most upvoted comment on YouTube says, “To see these young scientists using their dedication and intellect to change lives like this is really heartwarming. You’re really making a difference, guys.” It is, yeah. That’s their quote, but it expresses my sentiment and I’m sure a lot of our listeners’ sentiments exactly. It’s amazing, man. So what’s next? I mean, so I imagine there’s things like getting that accuracy up from 75%, maybe getting the words per minute closer to what you and I are able to do in this podcast. So it sounded like about double what you have right now. And then probably beyond that it’s like scaling this up, because this was with Ann, was it months or years of work to be able to train?
David Moses: 35:35
Actually, for Ann, what we saw is we could achieve the results that you see that we published, and you see there, in two weeks of training. 
Jon Krohn: 35:48
Oh, wow. 
David Moses: 35:49
So this, it can learn fairly, fairly quickly. I think it was about 12 or 13 hours of neural data. It’s in the paper in case I got that a little bit off. But yeah, it’s in that realm of amount of training required. So it is something that can be trained fairly quickly. But to answer your overarching question of what’s next is definitely we want to see how we can drive those error rates lower, like drive the accuracy higher. I think the speed, it would be great of course, if we can get that even faster, but to be honest, for this patient population who has to communicate letter by letter with eye trackers, or head trackers, or other things like that, even something like 75 words per minute, it is not just a linear improvement. And this is something that was described by Ann and her partner, which is that it actually changes, it’s not just, “Oh, now I can communicate X times faster.” It’s categorically different, what kinds of conversations they can have. 
37:13
They were describing that sometimes when they’re arguing about something, like talking about something, debating, and they’re getting… You know how we are, we get into the moment and we get passionate about it. And it turns out if you have to wait two or three minutes or longer for someone to respond, that cadence is disrupted and the engagement is lost, and it’s something very natural, it’s very human that it’s just not possible when you are communicating very slowly. Maybe I shouldn’t say it’s not possible, but it’s difficult, and it’s very unnatural. 
Jon Krohn: 37:48
I know what you mean, although it also sounds like maybe in an argument, people having to wait two or three minutes to make their point might… 
David Moses: 37:55
Maybe they’d be more levelheaded. Right. 
Jon Krohn: 37:57
Yeah. 
David Moses: 37:57
Yeah, so… 
Jon Krohn: 38:00
But I know exactly what you mean, as an illustrative example, yeah. 
David Moses: 38:03
Exactly. Exactly. And so we just want to focus on expanding the vocabulary, making the accuracy much higher. Getting it more expressive, so they can control the pitch of the output, whether I’m communicating a statement or asking a question, or placing emphasis on certain words. These things are all very interesting to us. And in parallel, there are improvements that can be made to the hardware as well. But for this to actually reach patients, we believe that you can’t have this system that we have, where we have to plug in basically to a port that’s implanted through the skull. It’s just that there’s so much complication and risk there. We need a fully implantable device. We want more channels. 
38:49
We have 253 electrode channels right now. What happens if we go even beyond that, right? There’s evidence to suggest that performance could be even higher. And so I think there’s just a variety of angles that will all over time come together as more improvements are made, and hopefully, yeah, within a decade this can become a solution for patients. There’s lots of companies, brain computer interface companies, that are working towards this goal or a similar goal. So it’s a very, very exciting time for brain computer interfaces. 
Jon Krohn: 39:29
It’s fantastic. Amazing to hear it, David. Quickly before I let you go, do you have a book recommendation for us, and then how can people follow you, get your thoughts on the latest in these developments after the episode?
David Moses: 39:43
That’s great. Let me think about the book. I had one that I read that I thought was very interesting, it was just kind of… It’s about willpower and motivation, but not in a… It’s in a very scientific way, it’s not in a, “Do these five things to become more productive.” It’s not like that. It’s very backed by science and some theory and philosophy into willpower, and how people can fight the urges to get instant gratification, and just a different way of thinking about the world, and we don’t agree with everything. But it’s called Breakdown of Will by George Ainslie, I want to say. Apologies to George if I’m butchering his last name, but this was recommended by a friend, and I thought it was quite interesting. I think there’s probably a little bit for everyone to take away from that book, even though it’s a little dense. And then to follow me, I’m admittedly not super active on social media, but probably the best chance would be on Twitter, which is the @ sign and then the word AtDavidMoses. There was already a @DavidMoses, so I’m @AtDavidMoses. Good enough. 
Jon Krohn: 41:05
Just @ and then AtDavidMoses? 
David Moses: 41:09
Yes. 
Jon Krohn: 41:10
Nice. All right. Yeah, we’ll be sure to include that in the show notes. David, thank you so much for being on the show. This was an eye-opening episode, and yeah, it’s amazing the things that you’re doing. Keep it up and we can’t wait to see what you do next. 
David Moses: 41:26
Thank you so much for having me. It was my pleasure. Yeah, take care. 
Jon Krohn: 41:31
Amazing. In today’s episode, David covered how, after a dozen hours of data collection over two weeks, the BRAVO trial enables a paralyzed woman named Ann to have a video avatar speak for her in real time using her neural activity alone. He also talked about how distinct machine learning models are used for three simultaneous capabilities, namely predicting text, predicting speech sounds, and predicting facial expressions. And he filled us in on how, going forward, implanted recording electrodes with many times more channels could allow for lower error rates and become a commonplace clinical solution in about 10 years’ time. 
42:05
All right, that’s it for today’s episode. I hope you found it inspiring. If you enjoyed it, consider supporting the show by sharing, reviewing, or subscribing, but most importantly, just keep listening. And until next time, keep on rocking it out there. I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 