CONTRIBUTION OF ACOUSTIC LANDMARKS TO SPEECH RECOGNITION
IN NOISE BY COCHLEAR IMPLANT USERS
Ning Li, PhD
December 2009
Cochlear implant (CI) users’ performance degrades significantly in noisy environments, especially in non-steady (fluctuating) noise. Unlike normal-hearing listeners, CI users generally perform better when listening to speech in steady-state noise than in fluctuating maskers, and the reasons for this remain unclear. In this dissertation, we propose a new hypothesis for the observed absence of release from masking in CI users. A new strategy is also developed and integrated into existing CI systems to improve speech recognition in noise for CI users.
In our hypothesis, when listening to speech in fluctuating maskers (e.g., competing talkers), CI users cannot fuse the pieces of the message across temporal gaps because they are unable to reliably perceive the acoustic landmarks introduced by obstruent consonants (e.g., stops). These landmarks, often blurred in noisy conditions, are evident in the spectral discontinuities associated with consonant closures and releases, and are posited to aid listeners in determining word/syllable boundaries. To test this hypothesis, normal-hearing (NH) listeners were presented with vocoded sentences containing clean obstruent segments but corrupted (by steady noise or fluctuating maskers) sonorant segments (e.g., vowels). Results indicated that NH listeners performed better with fluctuating maskers than with steady noise. This outcome suggests that access to the acoustic landmarks provided by the obstruent consonants enables listeners to effectively integrate the pieces of the message glimpsed across temporal gaps into one coherent speech stream.
The same hypothesis was also tested with CI listeners. IEEE sentences containing clean obstruent segments but corrupted (by steady noise or fluctuating maskers) sonorant segments (e.g., vowels) were presented to CI users. Results indicated that cochlear implant users received a substantial gain in intelligibility when they had access to the acoustic landmarks provided by obstruent consonants. A second experiment was conducted to examine the factors contributing to CI users’ inability to reliably perceive the acoustic landmarks embedded in the signal. We hypothesized that envelope compression smears the acoustic landmarks that signify syllable/word boundaries. To test this hypothesis, we presented to CI users noise-corrupted sentences processed via an algorithm that compresses the envelopes with a logarithmic-shaped function during voiced segments (e.g., vowels) and a less-compressive mapping function during unvoiced segments (e.g., stops). Results showed substantial improvements in sentence recognition under both stationary and non-stationary noise conditions.
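The selective compression idea can be illustrated with a short sketch. This is a minimal illustration, not the thesis's actual implementation; the two mapping functions and all constants below are assumptions chosen for the example:

```python
import numpy as np

def selective_compress(envelope, voiced, full_scale=1.0):
    """Map envelope amplitudes using a logarithmic-shaped function during
    voiced segments and a less-compressive power-law function during
    unvoiced (obstruent) segments.

    envelope : per-frame envelope amplitudes in [0, full_scale]
    voiced   : boolean array, True where the frame is voiced
    The specific constants below are illustrative, not those of the thesis.
    """
    x = np.clip(envelope / full_scale, 1e-6, 1.0)
    # Logarithmic-shaped map (similar in spirit to clinical log compression).
    log_out = np.log1p(255.0 * x) / np.log1p(255.0)
    # Less-compressive power-law map for unvoiced segments, intended to
    # better preserve the low-amplitude landmark cues at consonant releases.
    pow_out = x ** 0.6
    return np.where(voiced, log_out, pow_out)
```

For a given low-level input, the voiced (logarithmic) map boosts the output more than the unvoiced map, so the unvoiced map is the less compressive of the two.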
To detect the landmarks used in the selective compression strategy, automatic consonant-landmark detection algorithms were developed to handle adverse speech conditions. High detection accuracy was achieved using machine-learning algorithms. A selective compression algorithm, driven by the estimated landmarks, was incorporated into the CI strategy and tested by presenting the processed stimuli to CI users. Significant benefits were observed compared to performance obtained with the unprocessed noisy speech. Overall, the data from the present dissertation highlight the importance of preserving the acoustic landmarks present in the speech signal for improved speech understanding by cochlear implant users in noisy conditions.
Download thesis: [PDF - 5 M]
Kalyan Kasturi, PhD
December 2006
Cochlear implants are prosthetic devices, consisting of implanted electrodes and a signal processor, designed to restore partial hearing to the profoundly deaf. Since their inception in the early 1970s, cochlear implants have gradually gained popularity, and considerable research has consequently been done to advance and improve cochlear implant technology. Most of the research conducted so far in the field has focused primarily on improving speech perception in quiet. Music perception and speech perception in noisy listening conditions with cochlear implants remain highly challenging problems. Many research studies have reported low recognition scores on simple melody-recognition tasks. Most cochlear implant devices use envelope cues to provide electric stimulation. Understanding the effect of various factors on melody recognition in the context of cochlear implants is important for improving existing coding strategies.
In the present work we investigate the effect of various factors, such as filter spacing, relative phase, spectral up-shifting, carrier frequency and phase perturbation, on melody recognition in acoustic hearing. The filter spacing currently used in cochlear implants is wider than the musical semitone steps, and hence not all musical notes can be resolved. In the current work we investigate new ‘semitone filter spacing’ techniques, in which filter bandwidths are varied in correspondence with the musical semitone steps.
Noise reduction methods investigated so far for use with cochlear implants are mostly pre-processing methods: the speech signal is first enhanced using the noise reduction method, and the enhanced signal is then processed by the speech processor. A better and more efficient approach is to integrate the noise reduction mechanism into the cochlear implant signal processing itself. In this dissertation we investigate the use of two such embedded noise reduction methods, namely ‘SNR weighting’ and ‘S-shaped compression’, to improve speech perception in noisy listening conditions. The SNR weighting method is an exponential weighting method that uses the instantaneous signal-to-noise ratio (SNR) estimate to perform noise reduction in each frequency band corresponding to a particular electrode in the cochlear implant. The S-shaped compression technique divides the compression curve into two regions based on the noise estimate; it applies a different type of compression to the noise portion and the speech portion, and hence suppresses noise better than the regular power-law compression.
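As an illustration of the SNR weighting idea, the sketch below computes one weight per analysis band (electrode) from instantaneous SNR estimates. The exponential form and the constant `c` are assumptions made for this example; the thesis's exact weighting rule is not reproduced here:

```python
import numpy as np

def snr_weights(signal_power, noise_power, c=1.0):
    """Exponential SNR-based weights, one per analysis band/electrode.

    w = exp(-c / snr) tends to 1 for bands with high instantaneous SNR
    (pass the band through) and to 0 for noise-dominated bands (attenuate).
    """
    # Guard against division by zero in silent or noise-free bands.
    snr = np.maximum(signal_power / np.maximum(noise_power, 1e-12), 1e-12)
    return np.exp(-c / snr)
```

The weights would then scale the band envelopes before the usual compression and mapping stages of the coding strategy.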
Download thesis: [PDF - 1 M]
SPEECH ENHANCEMENT USING A LAPLACIAN-BASED MMSE ESTIMATOR
OF THE MAGNITUDE SPECTRUM
Chen Bin, PhD
December 2005
A number of speech enhancement algorithms based on MMSE spectrum estimators have been proposed over the years. Although some of these algorithms were developed based on Laplacian and Gamma distributions, no optimal spectral magnitude estimators were derived. This dissertation focuses on optimal estimators of the magnitude spectrum for speech enhancement. We present an analytical solution for estimating in the MMSE sense the magnitude spectrum when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients are modeled by a Gaussian distribution. Furthermore, we derive the MMSE estimator under speech presence uncertainty and a Laplacian statistical model. Results indicated that the Laplacian-based MMSE estimator yielded less residual noise in the enhanced speech than the traditional Gaussian-based MMSE estimator. Overall, the present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech.
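For reference, the traditional Gaussian-based MMSE magnitude estimator mentioned above (due to Ephraim and Malah) has the closed form

```latex
\hat{A}_k \;=\; \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v_k}}{\gamma_k}\,
\exp\!\left(-\frac{v_k}{2}\right)
\left[(1+v_k)\,I_0\!\left(\tfrac{v_k}{2}\right)+v_k\,I_1\!\left(\tfrac{v_k}{2}\right)\right]\,Y_k,
\qquad
v_k=\frac{\xi_k}{1+\xi_k}\,\gamma_k,
```

where Y_k is the noisy spectral magnitude, ξ_k and γ_k are the a priori and a posteriori SNRs, and I_0, I_1 are modified Bessel functions. The Laplacian-based estimator derived in the dissertation replaces the Gaussian speech prior, so it does not reduce to this expression.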
Download thesis: [PDF - 1.7M]
NOISE ESTIMATION ALGORITHMS FOR HIGHLY NON-STATIONARY ENVIRONMENTS
Sundarrajan Rangachari, M.S.E.E.
August 2004
The quality and intelligibility of speech in the presence of background noise can be improved by speech enhancement algorithms. This thesis addresses the issue of estimating the noise spectrum for speech enhancement applications. Two noise estimation algorithms are proposed for highly non-stationary noise environments. In the first method, a voice activity detector is used to continuously classify each frame as speech-present or speech-absent, and the noise spectrum estimate is updated using a constant smoothing factor in speech-absent frames and a frequency-dependent smoothing factor in speech-present frames. In the second method, the noise spectrum estimate is updated using a frequency-dependent smoothing factor in every frame, irrespective of the speech-presence decision. In both methods, the frequency-dependent smoothing factor is computed from the estimated speech-presence probabilities in subbands. Speech presence is determined by computing the ratio of the noisy speech power spectrum to its local minimum, which is obtained by averaging past values of the noisy speech power spectra with a look-ahead factor. The local-minimum estimation algorithm adapts very quickly to highly non-stationary noise environments. This was confirmed by formal listening tests, which indicated that the proposed noise estimation algorithms, when integrated into speech enhancement, were preferred over other noise estimation algorithms.
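The frequency-dependent smoothing update described above can be sketched per frame as follows. This is a simplified illustration: the speech-presence probability is reduced here to a hard threshold on the power-to-minimum ratio, and the constants are assumptions rather than the thesis's values:

```python
import numpy as np

def update_noise(noise_psd, noisy_psd, local_min, alpha_d=0.85, ratio_thresh=5.0):
    """One frame of a frequency-dependent noise-PSD update (second-method style).

    Speech presence in each bin is inferred from the ratio of the noisy
    power spectrum to its tracked local minimum. The smoothing factor moves
    toward 1 in speech-dominated bins (freezing the noise estimate there)
    and toward alpha_d in noise-dominated bins (tracking the noise).
    """
    ratio = noisy_psd / np.maximum(local_min, 1e-12)
    p = (ratio > ratio_thresh).astype(float)   # speech-presence "probability" (hard decision here)
    alpha_s = alpha_d + (1.0 - alpha_d) * p    # frequency-dependent smoothing factor
    return alpha_s * noise_psd + (1.0 - alpha_s) * noisy_psd
```

Because the smoothing factor is computed per frequency bin, the estimate keeps adapting in noise-dominated bins even while speech is active elsewhere in the spectrum, which is what lets the tracker follow highly non-stationary noise.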
Download thesis: [PDF - 556 kB]
It is generally accepted that the fusion of two speech signals presented dichotically is affected by their relative onset time. This study investigated the hypothesis that spectral resolution might be an additional factor influencing spectral fusion when spectral information is split and presented dichotically to the two ears. Two different methods of splitting the spectral information were investigated. In the first method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method, the lower-frequency channels were presented to one ear and the higher-frequency channels to the other ear. The experiments were conducted with both normal-hearing listeners and bilateral cochlear implant listeners. Results with normal-hearing listeners indicated that spectral resolution did affect spectral fusion. Results with bilateral cochlear implant users indicated that subjects were able to accurately fuse information presented to the two ears in quiet but not in noise.
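The two splitting schemes can be sketched as follows (an illustration operating on a list of channel signals ordered from low to high frequency; the function and method names are hypothetical):

```python
def split_channels(channels, method="interleaved"):
    """Split vocoder channel signals (ordered low to high frequency)
    between the two ears, following the two schemes studied.
    """
    if method == "interleaved":
        # Odd-index channels to one ear, even-index channels to the other.
        left, right = channels[0::2], channels[1::2]
    elif method == "split_band":
        # Lower-frequency half to one ear, higher-frequency half to the other.
        mid = len(channels) // 2
        left, right = channels[:mid], channels[mid:]
    else:
        raise ValueError(f"unknown method: {method}")
    return left, right
```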
Download thesis: [PDF - 520 kB]
SUBSPACE AND MULTITAPER METHODS FOR SPEECH ENHANCEMENT
Yi Hu, Ph.D.
December 2003
Several speech enhancement algorithms have been proposed over the years. Although most algorithms improve the quality of speech, they introduce speech distortion and suffer from the “musical noise” artifact. To minimize speech distortion, we propose subspace methods that can be applied generally to colored-noise environments. To make the residual noise perceptually inaudible, we propose two methods for incorporating psychoacoustic models. In the first method, we use a well-known perceptual weighting technique from speech coding to shape the residual noise spectrum. In the second method, we constrain the noise spectrum to be less than the masking threshold of the speech signal. To eliminate musical noise, we propose the use of multitaper spectrum estimators, which have low variance. We further apply wavelet thresholding to the multitaper spectrum to reduce the estimation variance. For the subspace methods, we propose the use of multiwindow covariance matrix estimation.
Results, based on formal listening tests and objective measures, indicated significant improvements in speech quality with the proposed algorithms. Furthermore, the proposed subspace methods yielded improved speech intelligibility when tested with cochlear implant listeners.
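The variance-reducing multitaper estimate can be sketched as follows. Sine tapers are used here as a common low-variance choice; the thesis's exact taper family and the wavelet-thresholding stage are not reproduced:

```python
import numpy as np

def multitaper_psd(x, num_tapers=5):
    """Low-variance multitaper spectrum estimate using sine tapers.

    Each taper yields one eigenspectrum; averaging over tapers reduces the
    estimation variance relative to a single-window periodogram, which in
    turn reduces the "musical noise" of spectral-domain enhancement.
    """
    n = len(x)
    k = np.arange(1, num_tapers + 1)[:, None]
    t = np.arange(1, n + 1)[None, :]
    # Sine tapers: orthonormal windows sqrt(2/(N+1)) * sin(pi*k*t/(N+1)).
    tapers = np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * k * t / (n + 1))
    eigenspectra = np.abs(np.fft.rfft(tapers * x[None, :], axis=1)) ** 2
    return eigenspectra.mean(axis=0)
```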
Download thesis: [pdf - 635 kB]
A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT
Sunil Devdas Kamath, M.S.E.E
May 2001
The corruption of speech by additive background noise causes severe difficulties in various communication environments. This thesis addresses the problem of reducing additive background noise in speech. The proposed approach is a frequency-dependent speech enhancement method based on the proven spectral subtraction method. Most implementations and variations of the basic spectral subtraction technique advocate subtraction of the noise spectrum estimate over the entire speech spectrum. However, real-world noise is mostly colored and does not affect the speech signal uniformly over the entire spectrum. This thesis explores a Multi-Band Spectral Subtraction (MBSS) approach with suitable pre-processing of the speech data. The spectrum is divided into frequency bands, and spectral subtraction is performed independently on each band using band-specific over-subtraction factors. This method provides a greater degree of flexibility and control over the noise subtraction levels, which reduces artifacts in the enhanced speech and results in improved speech quality. The effect of the number of frequency bands and the type of filter spacing (linear, logarithmic or mel) was investigated. Results showed that the proposed MBSS method with four linearly spaced frequency bands outperformed the conventional spectral subtraction method with respect to speech quality and reduced musical noise.
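A minimal sketch of the band-wise subtraction follows, using the familiar Berouti-style SNR-dependent over-subtraction rule as a stand-in for the thesis's band-specific factors; the constants and spectral floor are illustrative:

```python
import numpy as np

def multiband_subtract(noisy_mag, noise_mag, num_bands=4, floor=0.002):
    """Multi-band spectral subtraction over linearly spaced bands.

    Each band gets its own over-subtraction factor derived from the band
    SNR; bands with poor SNR are subtracted more aggressively. The thesis
    adds further band-specific tweaks and pre-processing not shown here.
    """
    enhanced = np.empty_like(noisy_mag)
    edges = np.linspace(0, len(noisy_mag), num_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        ns, nn = noisy_mag[lo:hi], noise_mag[lo:hi]
        snr_db = 10.0 * np.log10(np.sum(ns**2) / max(np.sum(nn**2), 1e-12))
        if snr_db < -5.0:
            alpha = 4.75
        elif snr_db <= 20.0:
            alpha = 4.0 - snr_db * 3.0 / 20.0   # ramps from 4.75 down to 1.0
        else:
            alpha = 1.0
        # Power subtraction with a spectral floor to limit musical noise.
        sub = ns**2 - alpha * nn**2
        enhanced[lo:hi] = np.sqrt(np.maximum(sub, floor * ns**2))
    return enhanced
```

With zero estimated noise the spectrum passes through unchanged; when the noise estimate equals the noisy spectrum, every band is driven down to the spectral floor.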
Publications
Kamath, S. and Loizou, P. (2002). “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” Proceedings of ICASSP-2002, Orlando, FL, May 2002.
Download thesis: [pdf - 906 kB]
Reducing noise in corrupted speech remains an important problem with a broad range of applications, most of which are driven by the explosive growth of mobile communications. Numerous approaches have been proposed for speech enhancement, with the spectral subtraction method being one of the most popular due to its relatively simple implementation and computational efficiency. The spectral subtraction method, however, has some inherent limitations and drawbacks. This thesis proposes a modification to the conventional spectral subtraction approach to address the musical noise and speech distortion that are inherent to conventional spectral-subtraction-based approaches. Further improvements in speech quality were obtained by applying a perceptual weighting function (estimated using a psychoacoustic model) designed to minimize noise distortion. Objective measures and informal listening tests showed that the proposed modified spectral subtraction method combined with perceptual weighting outperformed the conventional power spectral subtraction method, resulting in better speech quality and reduced levels of musical noise.
Download thesis: [pdf - 1.1 Mb]
Real-world noise is mostly colored and does not affect the speech signal uniformly over the entire spectrum. Little is known about the effect of noise on the spectrum of speech, and such knowledge could potentially help us develop better speech enhancement algorithms. This thesis investigates the effect of colored noise, namely multi-talker babble and speech-shaped noise, on the spectrum of vowels and consonants. Multi-talker babble and speech-shaped noise were added to vowels and stop consonants at -5 to 15 dB SNR, and the spectral effect of noise was quantified in terms of various acoustic measures: (a) spectral contrast of the noisy vowel spectra, (b) spectral distance between the noisy and clean vowel and consonant spectra for three frequency bands, (c) detection and estimation of the first two formant frequencies in noise, (d) frequency deviation of the first two formant frequencies in noise, (e) spectral tilt of stop consonants, and (f) burst frequency of stop consonants. Results showed that, for both vowels and stop consonants, the effect of colored noise on the frequency spectrum was non-uniform.
Download thesis: [pdf - 436 Kb]
An understanding of how information about the speech signal is spread among the various frequency bands of the spectrum is essential in numerous communications, audio and hearing-related applications. Although many studies have investigated the intelligibility of high-pass, low-pass and band-pass filtered speech, few have investigated the perception of band-stop filtered speech (i.e., speech with holes in the spectrum) or speech composed of disjoint frequency bands. The most recent studies examined speech recognition either for a single hole varying in frequency location and size or for a single hole in the middle of the spectrum. The scope of these studies is limited in that they did not consider perception of speech composed of multiple disjoint bands involving low-, middle- and/or high-frequency information. The present study addresses this question in a systematic fashion, considering all possible combinations of missing disjoint bands from the spectrum. In this work, we also derive frequency-importance functions for consonant and vowel recognition using (a) a least-squares approach that utilizes the results of intelligibility tests for speech with holes in the spectrum and (b) an information-theoretic approach based on the calculation of mutual information between frequency bands and phonetic labels.
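The least-squares derivation of frequency-importance weights from the "speech with holes" scores can be sketched as follows. This assumes a simple linear model in which each condition's score is approximated by the sum of the importance weights of its retained bands; the thesis's actual fitting procedure may differ:

```python
import numpy as np

def importance_weights(band_present, scores):
    """Least-squares estimate of per-band importance weights.

    band_present : (num_conditions, num_bands) 0/1 matrix indicating which
                   frequency bands were retained in each filtering condition
    scores       : measured recognition score for each condition
    Solves band_present @ w ~= scores in the least-squares sense.
    """
    w, *_ = np.linalg.lstsq(band_present.astype(float), scores, rcond=None)
    return w
```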
Download thesis: [pdf - 320 Kb]
This thesis presents a new technique for subband feedback active noise control. The problem of controlling the noise level in the environment has been the focus of a tremendous amount of research over the years. Active Noise Cancellation (ANC) is one such approach proposed for the reduction of steady-state noise. ANC refers to an electromechanical or electroacoustic technique of canceling an acoustic disturbance to yield a quieter environment. The basic principle of ANC is to introduce a canceling “antinoise” signal that has the same amplitude but exactly opposite phase, resulting in an attenuated residual noise signal. Wideband active noise control systems often involve adaptive filters with hundreds of taps. Using subband processing can considerably reduce the length of the adaptive filter. Conventional subband algorithms generally operate in the frequency domain and use at least two sensors. This thesis presents a time-domain algorithm for single-sensor subband feedback ANC targeted for use in headsets and hearing protectors. The subband processing is done using relatively short fixed FIR filters. The algorithm also adopts a weight-constrained NLMS algorithm for feedback ANC. Results showed that the proposed subband feedback ANC algorithm outperformed the traditional single-band ANC system.
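A single update of a weight-constrained NLMS adaptive filter, the adaptation rule named above, might look like the sketch below. The norm-clipping constraint shown is a simple stand-in for the thesis's actual weight constraint, and the step size and bound are assumed values:

```python
import numpy as np

def nlms_step(w, x_buf, d, mu=0.5, eps=1e-8, w_max=2.0):
    """One update of a weight-constrained NLMS adaptive filter.

    w     : current filter weights
    x_buf : most recent len(w) reference samples (newest first)
    d     : desired-signal sample (e.g., residual noise at the error sensor)
    """
    y = np.dot(w, x_buf)                 # filter output (antinoise estimate)
    e = d - y                            # residual error
    # Normalized LMS update: step scaled by instantaneous input power.
    w = w + mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)
    # Constrain the weight norm to guard against divergence in feedback ANC.
    norm = np.linalg.norm(w)
    if norm > w_max:
        w = w * (w_max / norm)
    return w, e
```

In the subband structure, one such short adaptive filter runs per subband on the down-sampled signals, which is what keeps the individual filter lengths small.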
Download thesis: [pdf - 1.4 Mb]
This thesis addresses the problem of helping hearing-impaired people use telephones. There are two aspects to this work: a Bluetooth-based wireless phone adapter and a bandwidth-extension algorithm. Built upon Bluetooth technology, the proposed phone adapter routes the telephone audio signal to the hearing aid or the CI processor wirelessly, and hence bypasses environmental noise and interference. The proposed bandwidth-extension algorithm has the potential to increase speech intelligibility for hearing-impaired listeners by estimating a wide-band signal from the narrow-band telephone signal. This is done through piecewise linear estimation based on line spectral frequencies, combined with a statistical speech-frame classification technique based on hidden Markov models to overcome the drawbacks of conventional bandwidth-extension algorithms. The phone adapter was tested by CI users, and the proposed algorithm was evaluated with objective measures. Both evaluations showed good performance.
Download thesis: [pdf - 572 Kb]
Multichannel cochlear implants electrically stimulate the auditory nerve to restore partial hearing to the profoundly deaf patient. The multichannel implant was designed to selectively stimulate discrete populations of spiral ganglion cells along the length of the cochlea. However, selective stimulation is often not achieved, or achieved only imperfectly, even with the most modern cochlear implant designs and speech processing strategies. When multiple electrodes are stimulated simultaneously, the electrical fields generated around each electrode can interact with the electrical fields of neighboring electrodes, thereby reducing selectivity. Several studies have suggested that electrical-field interactions can disrupt the acoustic properties of the signal and severely degrade speech intelligibility; however, this relationship has not been directly tested.

Electrical-field interactions can be reduced by decreasing the current levels delivered to each electrode through improved electrode positioning and design, or by using speech processing strategies that maximize the separation between simultaneously stimulated electrodes or stimulate the electrodes sequentially. The proximity of the cochlear implant electrode array to the modiolus has been shown to reduce the amount of current required to reach threshold (Rebscher et al., 1994). When less current is required, current spread and electrical-field overlap are reduced. Recently, cochlear implant manufacturers have taken an interest in designing “positioners”, which place the electrode array in close proximity to the spiral ganglion cells, and new electrode arrays that attempt to direct their current toward the spiral ganglion cell bodies.

The following experiments examine electrical-field interactions and speech recognition performance for three electrode designs: patients implanted with the Enhanced Bipolar Clarion electrode array without a “positioner”, patients implanted with the Clarion Electrode Positioning System™ (EPS) and the Enhanced Bipolar electrode array, and patients with the EPS and the Clarion Hi-Focus™ electrode array. A simultaneous masking task was used to measure electrical-field interactions as a function of electrode separation for monopolar and bipolar configurations. The relationship between electrical-field interaction and speech recognition was also examined for several speech strategies varying in the number of electrodes stimulated simultaneously. Subjects identified consonants, vowels, and sentences with each of the following speech strategies, listed in order from sequential to fully simultaneous stimulation: Continuous Interleaved Sampler (CIS), Paired Pulsatile Sampler (PPS), Quadruple Pulsatile Sampler (QPS), Hybrid Analog Pulsatile (HAPs), and Simultaneous Analog Stimulation (SAS). Based on previous research, susceptibility to electrical-field interactions was expected to vary as a function of electrode design, the speech processing strategy used in the device, and factors specific to each patient. The contribution of each of these variables was investigated.

The results showed a moderate to strong negative correlation between electrical-field interaction and speech recognition performance, indicating that patients with lower levels of electrical-field interaction have higher speech recognition scores than patients with high levels of electrical-field interaction. In addition, patients with strong susceptibility to electrical-field interactions produced higher speech recognition scores with sequential than with simultaneous speech strategies. An information analysis revealed that vowel recognition and consonant place-of-articulation were most affected by electrical-field interactions, demonstrating that electrode interactions severely disrupt spectral cues. The pattern of results also suggests that, in acute listening trials, patients achieve the highest speech recognition scores with the speech processing strategy most similar to their own. Future studies are needed to determine whether patients with minimal levels of electrical-field interaction can benefit from the partially simultaneous QPS or HAPs strategies given more listening exposure.
Download dissertation: [pdf - 1.2 Mb]
The variability in patient performance observed among cochlear implant users demands the development of new and improved speech processing strategies that will help improve speech recognition for poor users of the device. The Clarion cochlear implant has various parameters that can be manipulated, and Clarion patients can be fitted with several speech processing strategies. In this thesis, the Clarion research interface was used to evaluate the performance of commercially available as well as new speech processing strategies. Six different strategies were implemented and tested with 12 Clarion implant patients (10 CIS users and 2 SAS users): three commercially available strategies (CIS, PPS and SAS) and three new strategies not commercially available in the Clarion device: the hybrid, the quadruple pulsatile sampler (QPS) and the 6-of-8 strategy. These strategies differed in the degree of simultaneity and the rate of stimulation. Speech recognition results showed that the performance obtained with the CIS strategy was not statistically different from that obtained with the PPS, QPS and hybrid strategies in quiet, or with the 6-of-8 strategy in noise. There was large variability in performance among subjects. In noise, some subjects benefited from the 6-of-8 strategy. In quiet, some subjects obtained higher performance with the PPS, QPS and hybrid strategies than with the CIS strategy. We believe that this variability was due to the amount of channel interaction: subjects with small channel interaction are most likely to benefit from the high rates of stimulation provided by the PPS and QPS strategies. Further research is needed to identify the various factors that affect implant users' performance.
Download thesis: [pdf - 653 Kb]