Hynek Hermansky ICLR 2014 - Invited Talk - Speech Representations: Knowledge or Data?
1.-Speech recognition reduces a high-rate speech signal to a low-rate sequence of speech sounds, which requires knowledge drawn from both data and textbooks.

2.-Early recognizers used knowledge-based rules or data-driven templates, with data-driven approaches dominating since the 1970s.

3.-The stochastic approach applies Bayes' rule, with likelihoods trained on acoustic data and prior probabilities estimated from language data.

4.-Architecture and feature representation are the key design choices. The talk weighs raw speech input against biologically inspired auditory models.

5.-Filtering and smoothing spectra according to properties of human hearing, such as critical bands and loudness compression, helps normalize speaker differences.

6.-Estimating parameters of auditory processing from speech data, not just textbooks, is important. Speech may have evolved to match hearing.

7.-Removing slow spectral variations, similar to cortical processing, helps reduce the effects of differing frequency responses (for example, of the recording channel).

8.-A data-driven discriminative spectral basis, derived with linear discriminant analysis, showed frequency resolution decreasing as frequency increases, matching human hearing.

9.-Neural networks are useful for deriving small, efficient feature sets like posterior probabilities of speech sounds.

10.-Posterior probabilities can be used directly in hybrid recognition or converted to normally distributed features for conventional recognizers.

11.-Convolutional nets with shared weights in initial layers learn general auditory filterbank features from data.

12.-Physiological recordings show cortical receptive fields selective to different frequencies, temporal resolutions, and spectral resolutions.

13.-The principal components of these receptive fields act as bandpass filters and span 3 critical bands in frequency, as engineered features do.

14.-The auditory system maintains its information rate by increasing the number of neurons as firing rates decrease in higher areas.

15.-Hearing may derive multiple representations of varying sparsity and time scale to pick the most useful one for a situation.

16.-Adapting to unknown situations by monitoring agreement between multiple representations and picking reliable ones is promising.

17.-Deriving reusable knowledge, not just classification boundaries, from training data is important to avoid redundant learning of common sense.

18.-Speech recognizers should use deep architectures, long time spans, and multiple wide, parallel representations to handle real-world complexity.

19.-Recurrent networks are a natural way to model long-time dependencies in speech, which span at least the length of a segment.

20.-Cortical representations progress from acoustic features to phonetic features to phonemes to syllables and words at different levels.

21.-There has been some success in reconstructing auditory neural responses from learned sparse representations, as in vision research.

22.-PCA is a linear approximation, while non-linear ICA and sparse coding may better match the auditory system.

23.-Dealing with unknown words/languages/environments not seen in training is a key open problem in speech recognition.

24.-Successfully detecting out-of-vocabulary words would in itself help speech recognizers handle the unknown.

25.-Sensory systems seem to use multiple parallel representations to extract useful stimulus features and adapt to new situations.

26.-Learning high-level abstractions applicable to many prediction problems may help systems deal with changing environments.

27.-Experimentally investigating parallel representations racing to provide useful features seems promising and consistent with neuroscience.

28.-Machine learning often assumes that future data will match past training data, which is a concerning limitation.

29.-Having systems detect when data/environments have changed from training conditions is an important machine learning problem to address.

30.-The speaker hopes the ML community will work on principled ways to handle unfamiliar data that doesn't match training sets.
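
The decision rule in point 3 can be illustrated with a minimal sketch: pick the hypothesis that maximizes the acoustic likelihood times the language prior, computed in the log domain. The function name `map_decode`, the two candidate transcripts, and all probability values are hypothetical and chosen only for illustration; they do not come from the talk.

```python
import numpy as np

def map_decode(log_likelihoods, log_priors):
    """Return the index of the hypothesis maximizing log p(x|w) + log P(w)."""
    scores = np.asarray(log_likelihoods) + np.asarray(log_priors)
    return int(np.argmax(scores))

# Hypothetical toy values, not taken from the talk.
words = ["wreck a nice beach", "recognize speech"]
log_likelihoods = np.log([0.6, 0.4])  # acoustic model: p(x | w)
log_priors = np.log([0.01, 0.2])      # language model: P(w)

best = words[map_decode(log_likelihoods, log_priors)]
print(best)
```

In this toy case the acoustically better-matching hypothesis loses because the language prior strongly favors the other one, which is the interplay between acoustic and language data that the stochastic approach relies on.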

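Point 7 can be made concrete with a small sketch. A fixed frequency response multiplies the spectrum, so it becomes an additive constant in each band of the log spectrum, and removing the slowest temporal component cancels it. Per-band mean subtraction is used here as the simplest such operation; it is an assumption made for illustration and not necessarily the filter discussed in the talk. The data are synthetic and `remove_slow_variation` is a hypothetical helper name.

```python
import numpy as np

def remove_slow_variation(log_spectrogram):
    """Subtract the temporal mean of each band (input shape: frames x bands)."""
    log_spectrogram = np.asarray(log_spectrogram, dtype=float)
    return log_spectrogram - log_spectrogram.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 15))   # synthetic log band energies
channel = rng.normal(size=(1, 15))   # fixed log frequency response (hypothetical)
filtered = clean + channel           # same signal through a different channel

# After mean removal, both versions yield the same features.
same = np.allclose(remove_slow_variation(clean), remove_slow_variation(filtered))
print(same)
```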

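For point 10, one common recipe for turning posteriors into features that suit a conventional Gaussian-based recognizer is to take logarithms and then decorrelate with PCA. The sketch below assumes that recipe; the talk summary does not specify the exact transform. The posteriors are synthetic stand-ins for neural network outputs, and `tandem_features` and its parameters are hypothetical names.

```python
import numpy as np

def tandem_features(posteriors, n_components, eps=1e-10):
    """Log-compress posteriors, then project onto the top principal components."""
    log_post = np.log(np.asarray(posteriors, dtype=float) + eps)
    centered = log_post - log_post.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    return centered @ eigvecs[:, order]

# Synthetic posteriors: softmax over random scores (500 frames, 10 classes).
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

feats = tandem_features(posteriors, n_components=4)
print(feats.shape)
```

The logarithm spreads out the heavily skewed posterior values, and the PCA projection yields decorrelated, lower-dimensional features.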