Knowledge Vault 2/2 - ICLR 2014-2023
Hynek Hermansky ICLR 2014 - Invited Talk - Speech Representations: Knowledge or Data?

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef speech fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef data fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef auditory fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef recognition fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef representations fill:#f9d4f9, font-weight:bold, font-size:14px;
  classDef learning fill:#d4f9f9, font-weight:bold, font-size:14px;
  A[Hynek Hermansky ICLR 2014] --> B[High-rate speech reduced to low-rate sounds. 1]
  A --> C[Data-driven approaches dominate since 1970s. 2]
  C --> D[Bayes' rule: likelihoods from acoustic data. 3]
  A --> E[Architecture, features key design choices. 4]
  A --> F[Filtering, smoothing spectra using human hearing. 5]
  F --> G[Estimating auditory parameters from speech data. 6]
  F --> H[Removing slow spectral variations reduces differences. 7]
  F --> I[Data-driven analysis revealed frequency resolution. 8]
  A --> J[Neural networks derive efficient feature sets. 9]
  J --> K[Posterior probabilities used in hybrid recognition. 10]
  J --> L[Convolutional nets learn general auditory features. 11]
  A --> M[Cortical receptive fields selective to frequencies. 12]
  M --> N[Principal components do filtering, span bands. 13]
  A --> O[Auditory system maintains information rate. 14]
  O --> P[Hearing derives multiple varying representations. 15]
  P --> Q[Adapting by monitoring representation agreement promising. 16]
  A --> R[Deriving reusable knowledge avoids redundant learning. 17]
  A --> S[Recognizers need deep, long, parallel representations. 18]
  S --> T[Recurrent networks model long speech dependencies. 19]
  S --> U[Cortical representations progress from acoustic to words. 20]
  U --> V[Success reconstructing neural responses from representations. 21]
  A --> W[Non-linear methods may better match auditory system. 22]
  A --> X[Handling unknown words/languages/environments is open problem. 23]
  X --> Y[Detecting out-of-vocabulary words would be useful. 24]
  A --> Z[Sensory systems use parallel representations to adapt. 25]
  Z --> AA[Learning high-level abstractions helps changing environments. 26]
  Z --> AB[Parallel representations racing seems promising, consistent. 27]
  A --> AC[Future data matching past is concerning assumption. 28]
  AC --> AD[Detecting changed data/environments is important problem. 29]
  AD --> AE[Principled ways to handle unfamiliar data needed. 30]
  class B,G,H,T,U speech;
  class C,D,I data;
  class F,L,M,N,O,W auditory;
  class J,K,S,V,X,Y recognition;
  class P,Q,Z,AA,AB representations;
  class R,AC,AD,AE learning;

Resume:

1.-Speech recognition reduces a high-rate speech signal to a low-rate sequence of speech sounds, which requires knowledge drawn from both data and textbooks.

2.-Early recognizers used knowledge-based rules or data-driven templates, with data-driven approaches dominating since the 1970s.

3.-The stochastic approach uses Bayes' rule, with likelihoods trained on acoustic data and prior probabilities estimated from language data.

4.-Architecture and feature representation are key design choices. The talk weighs raw speech input against biologically inspired auditory models.

5.-Filtering and smoothing spectra using aspects of human hearing like critical bands and loudness compression helps normalize speaker differences.

6.-Estimating parameters of auditory processing from speech data, not just textbooks, is important. Speech may have evolved to match hearing.

7.-Removing slow spectral variations, similar to cortical processing, helps reduce effects of different frequency responses.

8.-Data-driven discriminative spectral basis using linear discriminant analysis revealed decreasing frequency resolution with increasing frequency, matching human hearing.

9.-Neural networks are useful for deriving small, efficient feature sets like posterior probabilities of speech sounds.

10.-Posterior probabilities can be used directly in hybrid recognition or converted to normally-distributed features for conventional recognizers.

11.-Convolutional nets with shared weights in initial layers learn general auditory filterbank features from data.

12.-Physiological recordings show cortical receptive fields selective to different frequencies, temporal resolutions, and spectral resolutions.

13.-Principal components of receptive fields act as bandpass filters and span 3 critical bands in frequency, as also seen in engineered features.

14.-The auditory system maintains its information rate by increasing the number of neurons as firing rates decrease in higher areas.

15.-Hearing may derive multiple representations of varying sparsity and time scale to pick the most useful one for a situation.

16.-Adapting to unknown situations by monitoring agreement between multiple representations and picking reliable ones is promising.

17.-Deriving reusable knowledge, not just classification boundaries, from training data is important to avoid redundant learning of common sense.

18.-Speech recognizers should use deep processing, long time spans, and multiple wide parallel representations to handle real-world complexity.

19.-Recurrent networks are a natural way to model long time dependencies in speech spanning at least a segment length.

20.-Cortical representations progress from acoustic features to phonetic features to phonemes to syllables and words at different levels.

21.-There has been some success in reconstructing auditory neural responses from learned sparse representations, similar to results in vision research.

22.-PCA is a linear approximation, while non-linear ICA and sparse coding may better match the auditory system.

23.-Dealing with unknown words/languages/environments not seen in training is a key open problem in speech recognition.

24.-Successfully detecting out-of-vocabulary words would itself be very useful for speech recognizers to handle the unknown.

25.-Sensory systems seem to use multiple parallel representations to extract useful stimulus features and adapt to new situations.

26.-Learning high-level abstractions applicable to many prediction problems may help systems deal with changing environments.

27.-Experimentally investigating parallel representations racing to provide useful features seems promising and consistent with neuroscience.

28.-Machine learning often assumes future data will match past training data, but this is a concerning limitation.

29.-Having systems detect when data/environments have changed from training conditions is an important machine learning problem to address.

30.-The speaker hopes the ML community will work on principled ways to handle unfamiliar data that doesn't match training sets.
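
Point 3 can be written out as the usual Bayes decision rule. The symbols are assumptions added here for illustration and do not appear in the source: X stands for the acoustic observations and W for a candidate word sequence.

```latex
\[
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} \; p(X \mid W)\, P(W)
\]
```

Here p(X | W) is the likelihood trained on acoustic data, P(W) is the prior estimated from language data, and p(X) can be dropped because it does not depend on W.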
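
Point 7 can be illustrated with a minimal sketch. The summary does not say which filter the talk used, so this example substitutes the simplest possible one, subtracting each band's time average in the log-spectral domain. It assumes a fixed frequency response acts as a per-band constant added to the log spectrum; the data, sizes, and function names are all hypothetical.

```python
import random


def remove_slow_variation(log_spec):
    """Subtract each band's time average from a log spectrogram.

    log_spec is a list of frames; each frame is a list of per-band log energies.
    """
    n_frames = len(log_spec)
    n_bands = len(log_spec[0])
    means = [sum(frame[b] for frame in log_spec) / n_frames for b in range(n_bands)]
    return [[frame[b] - means[b] for b in range(n_bands)] for frame in log_spec]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two spectrograms."""
    return max(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


random.seed(0)
N_FRAMES, N_BANDS = 200, 8  # illustrative sizes, not from the talk

# Synthetic "clean" log spectrogram (random values stand in for real speech).
clean = [[random.gauss(0.0, 1.0) for _ in range(N_BANDS)] for _ in range(N_FRAMES)]

# Assumed fixed frequency response: one constant log-gain per band.
channel = [0.5 * (b - 3) for b in range(N_BANDS)]
filtered = [[x + c for x, c in zip(frame, channel)] for frame in clean]

# Difference between the two recordings before and after normalization.
before = max_abs_diff(clean, filtered)
after = max_abs_diff(remove_slow_variation(clean), remove_slow_variation(filtered))
```

In this toy setting the two versions differ by up to 2.0 in the log domain before normalization and agree to within floating-point error afterwards, because a constant offset is the slowest possible spectral variation.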
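
Point 16 describes monitoring agreement between multiple representations. The sketch below is one hypothetical way to do that and is not the method from the talk: each representation is assumed to yield a posterior distribution over the same speech-sound classes, agreement is measured as one minus the total variation distance, and the representation that agrees least with the others is flagged. The stream names and probabilities are invented for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def agreement_scores(posteriors):
    """Mean agreement (1 - total variation) of each stream with all other streams.

    posteriors maps a stream name to a list of class probabilities.
    """
    names = list(posteriors)
    scores = {}
    for name in names:
        others = [n for n in names if n != name]
        scores[name] = sum(
            1.0 - total_variation(posteriors[name], posteriors[o]) for o in others
        ) / len(others)
    return scores


# Hypothetical posteriors over three classes from three parallel representations.
streams = {
    "stream_a": [0.70, 0.20, 0.10],
    "stream_b": [0.65, 0.25, 0.10],
    "stream_c": [0.10, 0.10, 0.80],  # disagrees with the other two
}

scores = agreement_scores(streams)
least_trusted = min(scores, key=scores.get)
```

With these made-up numbers, `stream_c` receives the lowest agreement score, so a system following this scheme would rely on the other two streams for that frame.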

Knowledge Vault built by David Vivancos 2024