Knowledge Vault 3/38 - G.TEC BCI & Neurotechnology Spring School 2024 - Day 3
A high-performance neuroprosthesis for speech decoding and avatar control
Kaylo Littlejohn, University of California, San Francisco
2nd-place winner, BCI Award (USA)
[Resume Image]

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4 | Llama 3:

```mermaid
graph LR
    classDef neural fill:#f9d4d4, font-weight:bold, font-size:14px;
    classDef decoding fill:#d4f9d4, font-weight:bold, font-size:14px;
    classDef synthesis fill:#d4d4f9, font-weight:bold, font-size:14px;
    classDef avatar fill:#f9f9d4, font-weight:bold, font-size:14px;
    classDef future fill:#f9d4f9, font-weight:bold, font-size:14px;
    A[Kaylo Littlejohn] --> B[Speech prosthesis restores communication for paralyzed. 1]
    A --> C[Current tech slower than speech. 2]
    A --> D[Decode speech from brain activity. 3]
    D --> E[Century of speech brain research. 4]
    E --> F[Neural responses for speech sounds. 5]
    D --> G[Decoded speech from brain to waveforms. 6]
    D --> H[Decoded brain to text sentences. 7]
    D --> I[BRAVO trial: decode paralyzed communication, movement. 8]
    I --> J[Decoded 50-word speech, 15 wpm. 9]
    I --> K[Spelling BCI allows open vocabulary. 10]
    K --> L[11% WER, 6% CER, 1000 words. 11]
    A --> M[Synthesized speech from paralyzed, without speech. 12]
    M --> N[Anne had stroke, lost speech. 13]
    M --> O[Decode phonemes, acoustics, articulation from brain. 14]
    O --> P[CTC maps brain to phoneme probabilities. 15]
    O --> Q[25% WER, 78 wpm, 1000 words. 16]
    O --> R[Synthesized personalized speech from decoded probabilities. 17]
    R --> S[90% intelligibility on 50 phrases. 18]
    M --> T[Decoded gestures animate avatar in real-time. 19]
    T --> U[Avatar intelligible, correlated with real speakers. 20]
    M --> V[Decoded audio, text, avatar for embodied prosthesis. 21]
    V --> W[Anne: voice, avatar enable self-expression, interaction. 22]
    M --> X[Articulatory brain representations intact post-paralysis. 23]
    A --> Y[Future: implantable home prosthesis device. 24]
    Y --> Z[Challenges: user robustness, wireless miniaturization, accuracy. 25]
    Y --> AA[Spelling accurate but slow, synthesis faster. 26]
    AA --> AB[Streaming synthesis doubles rate vs delayed. 27]
    AB --> AC[Uninterrupted multi-minute use with streaming, voice detection. 28]
    Y --> AD[Error neural signals may improve BCI. 29]
    Y --> AE[Decoding articulation enables multi-language support. 30]
    class B,C,M,V,W,X,Y,Z,AA,AB,AC,AD,AE neural;
    class D,E,F,G,H,I,J,K,L,N,O,P,Q decoding;
    class R,S synthesis;
    class T,U avatar;
    class Y,Z future;
```

Resume:

1.- A speech neural prosthesis aims to restore natural communication to people with severe paralysis, potentially benefiting over 3 million people in the U.S.

2.- Current assistive technologies like spelling with head movements or eye tracking are much slower than natural speech (15 vs 120+ wpm).

3.- A speech neural prosthesis could decode intended speech from the brain using invasive (ECoG, microelectrode arrays) or non-invasive (EEG) neural interfaces.

4.- Over a century of work has characterized how speech is represented in the brain, with rapid advances in decoding speech and text over the last 15 years.

5.- Bouchard et al. 2013 showed distinct neural responses for different speech sounds and an articulatory map in speech cortex during syllable production.

6.- Anumanchipalli et al. decoded speech from healthy speakers by mapping brain activity to speech waveforms, but the approach requires residual speech ability.

7.- Makin et al. decoded brain activity into text sentences using a convolutional-recurrent neural network, but the method also relies on alignment to overt speech.

8.- The BRAVO clinical trial aims to decode and restore communication and movement to paralyzed individuals using various neural recording interfaces.

9.- With a 50-word vocabulary, they decoded speech from a paralyzed participant at 15 wpm with 25% WER, using an RNN decoder and a language model.

10.- Expanding to a spelling-based BCI allows open-vocabulary decoding: 26 letter classes are decoded as the user attempts to spell words.

11.- The spelling-based approach achieves 11% WER and 6% CER with a 1,000-word vocabulary, driven mainly by the neural decoding, with a language model refining the output.
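
To make the mechanism concrete, here is a minimal sketch of combining per-letter classifier probabilities with a word-level language-model prior. The toy vocabulary, probabilities, and priors are invented for illustration; the actual system used an RNN letter decoder with a far larger vocabulary and language model.

```python
import numpy as np

# Toy per-attempt log-probabilities over the 26 letter classes (random here;
# in the real system these come from decoding attempted-spelling neural data).
rng = np.random.default_rng(0)
n_attempts = 5                                  # a 5-letter word is spelled
letter_logp = np.log(rng.dirichlet(np.ones(26), size=n_attempts))

# Tiny illustrative vocabulary with unigram log-priors (the language model).
vocab = {"hello": np.log(0.5), "world": np.log(0.3), "happy": np.log(0.2)}

def word_score(word: str) -> float:
    """Classifier log-probability of each letter plus the LM prior."""
    letters = [ord(c) - ord("a") for c in word]
    neural = sum(letter_logp[t, l] for t, l in enumerate(letters))
    return neural + vocab[word]

# The decoded word is the vocabulary entry with the best combined score.
best = max(vocab, key=word_score)
print(best, word_score(best))
```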

12.- However, spelling is unnatural and slower than natural speech. Their new approach can synthesize speech from a paralyzed person without requiring overt speech.

13.- Participant Anne, who lost intelligible speech after a stroke, was implanted with a high-density ECoG grid over her speech cortex.

14.- They decode probabilities over phonemes, acoustic speech features, and articulatory gestures from Anne's neural activity as she attempts to speak.
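
A minimal sketch of what such a multi-stream decoder could look like, assuming hypothetical dimensions (253 ECoG channels, 40 phoneme classes, 80 acoustic features, 30 gesture weights) and a simple bidirectional GRU trunk; this is an illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class MultiStreamDecoder(nn.Module):
    """Shared recurrent trunk with three heads: phoneme probabilities for
    text, acoustic features for synthesis, and articulatory gestures for
    the avatar. All dimensions are illustrative assumptions."""
    def __init__(self, n_channels=253, hidden=256,
                 n_phonemes=40, n_acoustic=80, n_gestures=30):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)  # incl. CTC blank
        self.acoustic_head = nn.Linear(2 * hidden, n_acoustic)
        self.gesture_head = nn.Linear(2 * hidden, n_gestures)

    def forward(self, x):                        # x: (batch, time, channels)
        h, _ = self.gru(x)
        return (self.phoneme_head(h),            # logits per time step
                self.acoustic_head(h),
                self.gesture_head(h))

# One second of neural features at a hypothetical 100 Hz frame rate.
ecog = torch.randn(1, 100, 253)
phon, acoust, gest = MultiStreamDecoder()(ecog)
print(phon.shape, acoust.shape, gest.shape)
```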

15.- The text decoding model uses CTC loss to map neural activity to phoneme probabilities without requiring frame-level alignment, enabling open-vocabulary decoding via a language model.
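
Because CTC sums over all possible alignments between the neural time series and the attempted phoneme sequence, the model can be trained on unaligned pairs. A minimal PyTorch sketch with dummy tensors (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

T, N, C = 100, 1, 40     # time steps, batch size, phoneme classes (0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 12))        # attempted 12-phoneme sequence

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((N,), T),
           target_lengths=torch.full((N,), 12))
loss.backward()          # gradients flow with no frame-level alignment
print(loss.item())
```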

16.- They achieved 25% WER at 78 wpm over a 1,000-word vocabulary within a few weeks, with the neural decoding driving performance more than the language model.

17.- They synthesized speech by decoding speech-unit probabilities and passing them to a synthesizer conditioned on Anne's pre-stroke voice, yielding a personalized voice.
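
Conceptually, each decoded frame yields a distribution over discrete speech units, and the most likely unit sequence is handed to a vocoder trained on the participant's pre-stroke voice. A toy sketch in which the vocoder is a placeholder stand-in, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_units = 50, 100       # hypothetical: 100 discrete speech units

# Decoded per-frame probabilities over speech units (random here).
unit_probs = rng.dirichlet(np.ones(n_units), size=n_frames)
unit_sequence = unit_probs.argmax(axis=1)     # most likely unit per frame

def vocoder(units: np.ndarray) -> np.ndarray:
    """Stand-in for a neural vocoder conditioned on the pre-stroke voice;
    emits a placeholder 80 ms chunk (1280 samples at 16 kHz) per unit."""
    return np.concatenate([np.zeros(1280) for _ in units])

waveform = vocoder(unit_sequence)
print(unit_sequence[:10], waveform.shape)
```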

18.- The synthesized speech reached up to 90% intelligibility on a 50-phrase set, with lower but promising performance on larger 500-1,000-phrase sets.

19.- They decoded articulatory gestures from neural activity to drive a 3D avatar in real time, capturing both speech and expressive facial movements.
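
One way to picture the avatar stage is as a per-frame mapping from decoded gesture weights to the blendshape coefficients of a 3D face rig. The dimensions and the linear mapping below are illustrative assumptions, not the published system:

```python
import numpy as np

rng = np.random.default_rng(2)
n_gestures, n_blendshapes = 30, 52     # hypothetical gesture and rig sizes

# Decoded articulatory gesture weights for one video frame.
gestures = rng.random(n_gestures)

# Illustrative fixed mapping from gestures to face-rig blendshapes
# (jaw open, lip rounding, ...), clipped to the rig's valid [0, 1] range.
W = rng.random((n_blendshapes, n_gestures)) / n_gestures
blendshapes = np.clip(W @ gestures, 0.0, 1.0)
print(blendshapes[:5])                 # streamed to the renderer each frame
```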

20.- Avatar animation from decoded gestures was both intelligible and correlated well with real speakers' face movements during the same speech.

21.- Combined decoding of speech audio, text, and avatar animation provides an embodied neuroprosthesis for more complete communication restoration.

22.- Anne felt the personalized voice synthesis and avatar could enable her to counsel clients again and have fuller self-expression and interaction.

23.- The same articulatory representations remain intact in the brain even years after paralysis, enabling the speech neuroprosthesis to work.

24.- Future work aims to translate these proofs-of-concept into a fully implantable clinical device suitable for day-to-day use at home.

25.- Challenges include robustness across users, wireless system miniaturization, improving performance metrics like accuracy and latency, and expanding languages supported.

26.- High-performance spelling-based approaches reach above 99% accuracy but are slow; streaming synthesis could enable more natural back-and-forth conversation.

27.- A continuously streaming approach synthesizes speech audio and text in 80 ms increments during decoding, doubling the speaking rate versus delayed synthesis.

28.- Streaming synthesis with implicit voice activity detection enables uninterrupted multi-minute use of the decoder without explicit trial windowing.
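
Putting items 27 and 28 together: a schematic sketch of a streaming loop that decodes fixed 80 ms increments and uses the decoder's own speech probability as an implicit voice-activity gate, so no explicit trial windowing is needed. All components are stand-ins for the real decoder:

```python
import numpy as np

rng = np.random.default_rng(3)
FRAME_MS = 80                        # decode increment from the talk

def next_neural_frame():
    """Stand-in for 80 ms of streamed neural features (253 channels)."""
    return rng.standard_normal(253)

def decode_frame(frame):
    """Stand-in decoder: returns (speech probability, 80 ms audio chunk)."""
    p_speech = rng.random()
    return p_speech, rng.standard_normal(1280)   # 80 ms at 16 kHz

audio_out = []
for _ in range(25):                  # 2 seconds of continuous operation
    p_speech, chunk = decode_frame(next_neural_frame())
    if p_speech > 0.5:               # implicit VAD: emit only during speech
        audio_out.append(chunk)      # synthesized as soon as it is decoded

print(f"emitted {len(audio_out)} x {FRAME_MS} ms chunks")
```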

29.- Error-related neural signals may help identify mistakes and improve the performance and robustness of the brain-computer interface system.

30.- These speech neuroprosthesis approaches work across multiple languages since they rely on decoding speech articulation rather than language-specific features.

Knowledge Vault built by David Vivancos 2024