Knowledge Vault 2/2 - ICLR 2014-2023
Hynek Hermansky ICLR 2014 - Invited Talk - Speech Representations: Knowledge or Data?

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

Hynek Hermansky ICLR 2014
1. High-rate speech reduced to low-rate sounds.
2. Data-driven approaches dominate since 1970s.
3. Bayes' rule: likelihoods from acoustic data.
4. Architecture, features key design choices.
5. Filtering, smoothing spectra using human hearing.
6. Estimating auditory parameters from speech data.
7. Removing slow spectral variations reduces differences.
8. Data-driven analysis revealed frequency resolution.
9. Neural networks derive efficient feature sets.
10. Posterior probabilities used in hybrid recognition.
11. Convolutional nets learn general auditory features.
12. Cortical receptive fields selective to frequencies.
13. Principal components do filtering, span bands.
14. Auditory system maintains information rate.
15. Hearing derives multiple varying representations.
16. Adapting by monitoring representation agreement promising.
17. Deriving reusable knowledge avoids redundant learning.
18. Recognizers need deep, long, parallel representations.
19. Recurrent networks model long speech dependencies.
20. Cortical representations progress from acoustic to words.
21. Success reconstructing neural responses from representations.
22. Non-linear methods may better match auditory system.
23. Handling unknown words/languages/environments is open problem.
24. Detecting out-of-vocabulary words would be useful.
25. Sensory systems use parallel representations to adapt.
26. Learning high-level abstractions helps changing environments.
27. Parallel representations racing seems promising, consistent.
28. Future data matching past is concerning assumption.
29. Detecting data/environment changes is important.
30. Principled ways to handle unfamiliar data needed.

Resume:

1.-Speech recognition involves reducing the high-rate speech signal to a low-rate sequence of speech sounds, requiring knowledge from both data and textbooks.

2.-Early recognizers used knowledge-based rules or data-driven templates, with data-driven approaches dominating since the 1970s.

3.-The stochastic approach uses Bayes' rule, with likelihoods trained on acoustic data and prior probabilities from language data.
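
A minimal sketch of this decoding rule in Python: pick the word sequence W maximizing p(X|W)P(W). The candidate sentences and scores below are made-up numbers, not outputs of real models.

    # Toy Bayes-rule decoding: W* = argmax_W p(X|W) * P(W).
    # Log scores are invented for illustration; real systems use trained models.
    acoustic_loglik = {"recognize speech": -12.1, "wreck a nice beach": -11.8}  # log p(X|W)
    lm_logprior = {"recognize speech": -2.3, "wreck a nice beach": -7.9}        # log P(W)

    # Work in log space: the argmax of the sum equals the argmax of the product.
    best = max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprior[w])
    print(best)  # -> recognize speech: the language prior outweighs the likelihood gap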

4.-Architecture and feature representation are key design choices. Raw speech vs biologically-inspired auditory models considered.

5.-Filtering and smoothing spectra using aspects of human hearing, like critical bands and loudness compression, help normalize speaker differences.
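
A minimal sketch of critical-band smoothing with loudness compression, in the spirit of Hermansky's PLP analysis. The mel-spaced triangular filters, filter count, and cube-root exponent here are simplifying assumptions; PLP proper uses Bark-scale filters plus equal-loudness weighting.

    import numpy as np

    def mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def imel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def critical_band_spectrum(power_spec, sr=16000, n_filters=20):
        freqs = np.linspace(0.0, sr / 2.0, len(power_spec))
        centers = imel(np.linspace(mel(100.0), mel(sr / 2.0), n_filters + 2))
        out = np.zeros(n_filters)
        for i in range(n_filters):
            lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
            rising = np.clip((freqs - lo) / (c - lo), 0.0, 1.0)
            falling = np.clip((hi - freqs) / (hi - c), 0.0, 1.0)
            out[i] = np.sum(power_spec * np.minimum(rising, falling))  # band energy
        return out ** 0.33  # cube-root intensity-to-loudness compression

    frame = np.random.randn(400)                 # one 25 ms frame at 16 kHz (toy)
    spec = np.abs(np.fft.rfft(frame, 512)) ** 2  # short-term power spectrum
    print(critical_band_spectrum(spec).shape)    # -> (20,)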

6.-Estimating parameters of auditory processing from speech data, not just textbooks, is important. Speech may have evolved to match hearing.

7.-Removing slow spectral variations, similar to cortical processing, helps reduce effects of different frequency responses.
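
The sketch below band-pass filters log band-energy trajectories over time, in the spirit of Hermansky's RASTA processing: slow channel-induced spectral variations (and very fast frame-to-frame noise) are suppressed in each band. The coefficients follow the commonly published RASTA filter; the input is random stand-in data.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_spectra):
        """log_spectra: (n_frames, n_bands) log critical-band trajectories."""
        b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # sums to zero: blocks DC,
                                                         # i.e. constant channel offsets
        a = np.array([1.0, -0.98])                       # pole shapes band-pass response
        return lfilter(b, a, log_spectra, axis=0)        # filter along time

    log_spectra = np.log(np.random.rand(300, 20) + 1e-6)  # 300 frames, 20 bands
    print(rasta_filter(log_spectra).shape)                # -> (300, 20)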

8.-Data-driven discriminative spectral basis using linear discriminant analysis revealed decreasing frequency resolution with increasing frequency, matching human hearing.
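
A sketch of that data-driven analysis: fit linear discriminant analysis to spectral vectors labeled by speech sound, and inspect the resulting discriminant basis. The random spectra and labels below are stand-ins for real labeled speech.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    spectra = rng.standard_normal((5000, 257))  # stand-in FFT power spectra
    phones = rng.integers(0, 40, size=5000)     # stand-in phoneme labels

    lda = LinearDiscriminantAnalysis(n_components=12).fit(spectra, phones)
    basis = lda.scalings_[:, :12]               # discriminant spectral vectors
    # Plotted against frequency, such basis vectors were reported to widen at
    # higher frequencies, mirroring the decreasing resolution of human hearing.
    print(basis.shape)                          # -> (257, 12)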

9.-Neural networks are useful for deriving small, efficient feature sets like posterior probabilities of speech sounds.

10.-Posterior probabilities can be used directly in hybrid recognition or converted to normally-distributed features for conventional recognizers.
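
A sketch of that conversion (the "tandem" idea): log the posteriors to reduce their skew toward 0/1, then decorrelate with PCA before handing them to a Gaussian-based recognizer. The softmax posteriors below are random stand-ins for real network outputs.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    logits = rng.standard_normal((1000, 40))    # 1000 frames, 40 phone classes
    posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    log_post = np.log(posteriors + 1e-10)       # compress toward Gaussian shape
    features = PCA(n_components=24).fit_transform(log_post)  # decorrelate, reduce
    print(features.shape)                       # -> (1000, 24)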

11.-Convolutional nets with shared weights in initial layers learn general auditory filterbank features from data.
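
A sketch of such a first layer over raw waveform; kernel count, length, and stride are illustrative assumptions (about 25 ms windows with a 10 ms hop at 16 kHz), not the architecture from the talk. After training, such kernels often resemble band-pass auditory filters.

    import torch
    import torch.nn as nn

    frontend = nn.Sequential(
        nn.Conv1d(1, 40, kernel_size=400, stride=160),  # 40 learnable "filters",
                                                        # weights shared across time
        nn.ReLU(),
    )
    wave = torch.randn(8, 1, 16000)  # batch of 8 one-second utterances (toy)
    print(frontend(wave).shape)      # -> torch.Size([8, 40, 98])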

12.-Physiological recordings show cortical receptive fields selective to different frequencies, temporal resolutions, and spectral resolutions.

13.-Principal components of receptive fields do bandpass filtering and span 3 critical bands spectrally, as seen in engineered features.

14.-The auditory system maintains its information rate by increasing the number of neurons as firing rates decrease in higher areas.

15.-Hearing may derive multiple representations of varying sparsity and time scale to pick the most useful one for a situation.

16.-Adapting to unknown situations by monitoring agreement between multiple representations and picking reliable ones is promising.
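
One way such monitoring could look in code: score each parallel stream by its average divergence from the others and trust the stream that agrees most. The symmetric KL measure and Dirichlet stand-in posteriors are assumptions; the talk proposes the monitoring idea, not this formula.

    import numpy as np

    def sym_kl(p, q, eps=1e-10):
        p, q = p + eps, q + eps
        return np.mean(np.sum(p * np.log(p / q) + q * np.log(q / p), axis=-1))

    def most_consistent_stream(streams):
        """streams: list of (n_frames, n_classes) posterior arrays."""
        n = len(streams)
        disagreement = [
            np.mean([sym_kl(streams[i], streams[j]) for j in range(n) if j != i])
            for i in range(n)
        ]
        return int(np.argmin(disagreement))  # stream agreeing most with the rest

    rng = np.random.default_rng(0)
    streams = [rng.dirichlet(np.ones(40), size=500) for _ in range(4)]
    print(most_consistent_stream(streams))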

17.-Deriving reusable knowledge, not just classification boundaries, from training data is important to avoid redundant learning of common sense.

18.-Speech recognizers should use deep architectures, long time spans, and multiple wide parallel representations to handle real-world complexity.

19.-Recurrent networks are a natural way to model long time dependencies in speech spanning at least a segment length.
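
A minimal sketch of such a recurrent model over feature frames; the LSTM variant and sizes are assumptions, chosen only to show how recurrent state carries context across a segment or longer.

    import torch
    import torch.nn as nn

    class SpeechRNN(nn.Module):
        def __init__(self, n_feats=20, n_hidden=128, n_phones=40):
            super().__init__()
            self.rnn = nn.LSTM(n_feats, n_hidden, batch_first=True)
            self.out = nn.Linear(n_hidden, n_phones)

        def forward(self, x):          # x: (batch, n_frames, n_feats)
            h, _ = self.rnn(x)         # hidden state spans long time dependencies
            return self.out(h)         # per-frame phone scores

    frames = torch.randn(4, 300, 20)   # 4 utterances, 300 feature frames each
    print(SpeechRNN()(frames).shape)   # -> torch.Size([4, 300, 40])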

20.-Cortical representations progress from acoustic features to phonetic features to phonemes to syllables and words at different levels.

21.-Some success reconstructing auditory neural responses from learned sparse representations, similar to vision research.

22.-PCA is a linear approximation, while non-linear ICA and sparse coding may better match the auditory system.
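
A sketch contrasting the two on the same data; the heavy-tailed Laplace patches below stand in for spectro-temporal speech patches, where ICA's independence criterion tends to yield more localized components than PCA's variance criterion.

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    patches = rng.laplace(size=(2000, 64))  # heavy-tailed toy "auditory" data

    pca_basis = PCA(n_components=16).fit(patches).components_
    ica_basis = FastICA(n_components=16, max_iter=1000,
                        random_state=0).fit(patches).components_
    # PCA components are orthogonal and ordered by explained variance;
    # ICA instead maximizes statistical independence of the sources.
    print(pca_basis.shape, ica_basis.shape)  # -> (16, 64) (16, 64)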

23.-Dealing with unknown words/languages/environments not seen in training is a key open problem in speech recognition.

24.-Successfully detecting out-of-vocabulary words would itself be very useful, letting speech recognizers handle the unknown.
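
One sketch of such a detector: flag frames whose phone posteriors are close to uniform (high normalized entropy), on the assumption that unfamiliar words yield low-confidence posteriors. The cue and the 0.8 threshold are illustrative, not from the talk.

    import numpy as np

    def low_confidence_frames(posteriors, threshold=0.8):
        """posteriors: (n_frames, n_classes); returns mask of suspect frames."""
        p = posteriors + 1e-10
        entropy = -(p * np.log(p)).sum(axis=1)
        max_entropy = np.log(p.shape[1])           # entropy of a uniform guess
        return entropy / max_entropy > threshold   # normalized-entropy test

    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(40) * 0.3, size=200)  # stand-in posteriors
    print(low_confidence_frames(post).sum(), "of 200 frames flagged")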

25.-Sensory systems seem to use multiple parallel representations to extract useful stimulus features and adapt to new situations.

26.-Learning high-level abstractions applicable to many prediction problems may help systems deal with changing environments.

27.-Experimentally investigating parallel representations racing to provide useful features seems promising and consistent with neuroscience.

28.-Machine learning often assumes future data will match past training data, but this is a concerning limitation.

29.-Having systems detect when data/environments have changed from training conditions is an important machine learning problem to address.
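
A sketch of such a change detector: compare incoming feature distributions against training statistics with a two-sample test. The Kolmogorov-Smirnov test and Bonferroni-corrected threshold are assumptions; the talk poses the problem rather than prescribing a method.

    import numpy as np
    from scipy.stats import ks_2samp

    def environment_changed(train_feats, new_feats, alpha=0.01):
        """True if any feature dimension's distribution shifted significantly."""
        n_dims = train_feats.shape[1]
        pvals = [ks_2samp(train_feats[:, d], new_feats[:, d]).pvalue
                 for d in range(n_dims)]
        return min(pvals) < alpha / n_dims  # Bonferroni-corrected test

    rng = np.random.default_rng(0)
    train = rng.standard_normal((2000, 20))
    shifted = rng.standard_normal((2000, 20)) + 1.5  # simulated new environment
    print(environment_changed(train, shifted))       # -> True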

30.-The speaker hopes the ML community will work on principled ways to handle unfamiliar data that doesn't match training sets.

Knowledge Vault built by David Vivancos 2024