EURASIP Journal on Audio, Speech, and Music Processing

The latest articles from Hindawi Publishing Corporation

Phoneme and Sentence-Level Ensembles for Speech Recognition

Tue, 02/28/2012 - 09:46
We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
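
As a hedged illustration of the bagging idea the abstract favours (not the authors' exact HMM setup), the sketch below trains several Gaussian-mixture phoneme models on bootstrap resamples of the training frames and averages their log-likelihoods at test time; all names and parameters are illustrative.

```python
# Minimal bagging sketch for acoustic modelling (illustrative, not the paper's HMM system).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_bagged_gmms(features, n_models=5, n_components=8, seed=0):
    """Train one GMM per bootstrap resample of the feature frames."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(features), size=len(features))  # bootstrap sample
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0).fit(features[idx])
        models.append(gmm)
    return models

def bagged_log_likelihood(models, frames):
    """Average the per-frame log-likelihoods over the bagged models."""
    return np.mean([m.score_samples(frames) for m in models], axis=0)

# Usage: train one bag per phoneme class and pick the class with the highest
# summed bagged log-likelihood for a test segment.
```
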
Categories: Journals

Multiple Source Localization Based on Acoustic Map De-Emphasis

Tue, 02/28/2012 - 09:46
This paper describes a novel approach for the localization of multiple sources overlapping in time. The proposed algorithm relies on acoustic maps computed in multi-microphone settings, which describe the distribution of acoustic activity in a monitored area. Through proper processing of the acoustic maps, the positions of two or more simultaneously active acoustic sources can be estimated in a robust way. Experimental results obtained on real data collected for this specific task show the capabilities of the proposed method both with distributed microphone networks and with compact arrays.
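
A rough sketch of the de-emphasis idea, under the assumption that the acoustic map is a 2D power grid (e.g., from SRP-style beamforming): find the strongest peak, attenuate a neighbourhood around it, and repeat for the next source. The Gaussian notch and its width are illustrative choices, not the paper's exact processing.

```python
import numpy as np

def localize_by_deemphasis(acoustic_map, n_sources=2, notch_sigma=3.0):
    """Iteratively pick the map maximum and de-emphasize the area around it."""
    amap = acoustic_map.astype(float).copy()
    ys, xs = np.indices(amap.shape)
    positions = []
    for _ in range(n_sources):
        y0, x0 = np.unravel_index(np.argmax(amap), amap.shape)
        positions.append((y0, x0))
        # A multiplicative Gaussian notch suppresses the detected source so the
        # next-strongest source can emerge as the new global maximum.
        notch = 1.0 - np.exp(-((ys - y0) ** 2 + (xs - x0) ** 2) / (2 * notch_sigma ** 2))
        amap *= notch
    return positions
```
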
Categories: Journals

Pitch Ranking, Melody Contour and Instrument Recognition Tests Using Two Semitone Frequency Maps for Nucleus Cochlear Implants

Tue, 02/28/2012 - 09:46
To overcome harmonic structure distortions of complex tones in the low-frequency range caused by the frequency-to-electrode mapping function used in Nucleus cochlear implants, two modified frequency maps based on a semitone frequency scale (Smt-MF and Smt-LF) were implemented and evaluated. The semitone maps were compared against the standard (Std) mapping in three psychoacoustic experiments: pitch ranking, melody contour identification (MCI), and instrument recognition (IR). In the pitch ranking test, two tones were presented to normal hearing (NH) subjects. In the MCI test, NH subjects and cochlear implant (CI) recipients were asked to identify different acoustic patterns. In the IR test, a musical piece was played by eight instruments, which the subjects had to identify. Pitch ranking and MCI results showed improvements with semitone mapping over Std mapping, for both NH subjects and CI recipients. Clarinet recognition was significantly enhanced with Smt-MF, although the average IR score decreased. However, the frequency limits of Smt-LF and Smt-MF caused difficulties when partials were filtered out. Although Smt-LF provided better pitch ranking and MCI, it sounded unnaturally high-pitched due to frequency transposition, and some CI recipients disliked it. Smt-MF maps the tones closer to their natural characteristic frequencies and probably sounded more natural than Smt-LF.
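
For illustration, band edges on a semitone scale can be generated as below; the reference frequency, number of bands, and any mapping to electrodes are assumptions for this sketch, not the clinical Smt-MF/Smt-LF maps.

```python
import numpy as np

def semitone_band_edges(f_ref=130.81, n_bands=22):
    """Band edges spaced one semitone apart: f_k = f_ref * 2**(k/12)."""
    k = np.arange(n_bands + 1)
    return f_ref * 2.0 ** (k / 12.0)

edges = semitone_band_edges()
print(edges[:5])  # consecutive edges differ by a factor of 2**(1/12), about 1.0595
```
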
Categories: Journals

Monaural Voiced Speech Segregation Based on Dynamic Harmonic Function

Tue, 02/28/2012 - 09:46
The correlogram is an important representation for periodic signals and is widely used in pitch estimation and source separation. For these applications, the major problems of the correlogram are its low resolution and redundant information. This paper proposes a voiced speech segregation system based on a newly introduced concept called the dynamic harmonic function (DHF). In the proposed system, conventional correlograms are further processed by replacing the autocorrelation function (ACF) with the DHF. The advantages of the DHF are that (1) the peak width is adjustable by controlling the variance of the Gaussian function, and (2) invalid ACF peaks, that is, peaks not located at the pitch period, tend to be suppressed. Based on the DHF, pitch detection and effective source segregation algorithms are proposed. Our system is systematically evaluated and compared with a correlogram-based system. Both the signal-to-noise ratio results and the perceptual evaluation of speech quality scores show that the proposed system yields substantially better performance.
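
The DHF itself is defined in the paper; as a loose, hedged stand-in, the sketch below builds a pitch-salience function whose peak width is set by a Gaussian parameter and which rewards only candidates whose harmonics line up with spectral energy, which is the behaviour the abstract attributes to the DHF.

```python
import numpy as np

def harmonic_salience(frame, fs, f0_grid, n_harm=8, sigma_hz=10.0):
    """Gaussian-weighted harmonic summation over a magnitude spectrum.

    sigma_hz controls how sharp the salience peaks are; candidates whose
    harmonics miss the spectral peaks receive little support and are suppressed.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    salience = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for k in range(1, n_harm + 1):
            w = np.exp(-0.5 * ((freqs - k * f0) / sigma_hz) ** 2)
            salience[i] += np.sum(w * spec) / k  # 1/k de-emphasizes higher harmonics
    return salience

# Usage: f0_grid = np.linspace(80, 400, 321); pick np.argmax(harmonic_salience(...)).
```
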
Categories: Journals

A Novel MPEG Audio Degrouping Algorithm and Its Architecture Design

Tue, 02/28/2012 - 09:46
Degrouping is a key component in MPEG Layer II audio decoding; it mainly involves division and modulo arithmetic. So far, no dedicated degrouping algorithm and architecture has been well realized. In this paper, we propose a novel degrouping algorithm and its architecture design with low complexity in mind. Our approach uses only additions and subtractions in place of the division and modulo operations, and it achieves an equivalent result without any loss of accuracy. The proposed design requires no multiplier, divider, or ROM table, which reduces design complexity and chip area; in addition, it requires no programming effort for numerical analysis. The results show that the design is simple and low cost, and it achieves a fixed throughput of one sample per clock cycle. The VLSI implementation requires a gate count of only 527.
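
For context, MPEG-1 Layer II packs three quantized samples into one grouped codeword when 3, 5, or 9 quantization levels are used, and degrouping recovers them with modulo and division. A minimal sketch of the division-free idea, using repeated subtraction (a software simplification, not the paper's hardware-oriented architecture):

```python
def degroup(code, nlevels):
    """Recover three quantized samples from a grouped Layer II codeword
    without division or modulo: quotient/remainder via repeated subtraction."""
    samples = []
    for _ in range(3):
        quotient = 0
        while code >= nlevels:        # replaces code // nlevels
            code -= nlevels
            quotient += 1
        samples.append(code)          # what is left is the remainder (code % nlevels)
        code = quotient
    return samples

assert degroup(2 + 1 * 3 + 0 * 9, nlevels=3) == [2, 1, 0]
```
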
Categories: Journals

Instrumental Estimation of E-Model Parameters for Wideband Speech Codecs

Tue, 02/28/2012 - 09:46
A method is described for quantifying the quality of wideband speech codecs. Two parameters are derived from signal-based speech quality model estimations: (i) a wideband equipment impairment factor Ie,WB and (ii) a wideband packet-loss robustness factor Bpl,WB. The equipment impairment factor can be combined with impairment factors for other quality degradations to form an estimate of the overall conversational quality R of a wideband communication scenario, using a wideband extension of the E-model. The packet-loss robustness factor captures the robustness of the codec against packet-loss degradations. In contrast to past work, these parameters are no longer determined on the basis of auditory test results, but from signal-based speech quality models. We applied three intrusive models to several databases and compared the derived quality estimates and impairment factors to those obtained from auditory tests. The results show that, when migrating from narrowband to wideband transmission, a quality improvement of roughly 30% can be obtained, which is very similar to the one observed in auditory tests. The estimated impairment factors show a high correlation with those derived from auditory scores. Congruences and discrepancies with auditory test results are discussed, and an outline of the work necessary to set up a wideband or even superwideband E-model is given.
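
For orientation, the E-model combines impairment factors additively on the R scale, and the packet-loss robustness factor enters through the effective equipment impairment. A sketch along the lines of ITU-T G.107, with placeholder constants (the wideband extension uses a larger maximum R; the exact values used here are assumptions):

```python
def effective_impairment(ie_wb, bpl_wb, ppl, burst_ratio=1.0):
    """Effective equipment impairment under packet loss (G.107-style formula).
    ppl is the packet-loss percentage; bpl_wb is the packet-loss robustness factor."""
    return ie_wb + (95.0 - ie_wb) * ppl / (ppl / burst_ratio + bpl_wb)

def r_wideband(ro=129.0, i_s=0.0, i_d=0.0, ie_eff=0.0, advantage=0.0):
    """Overall conversational quality on an extended wideband R scale.
    ro=129 is an assumed wideband maximum; the narrowband scale tops out near 93."""
    return ro - i_s - i_d - ie_eff + advantage

# Example: a hypothetical wideband codec with Ie,WB = 10 and Bpl,WB = 20 at 3% packet loss.
print(r_wideband(ie_eff=effective_impairment(10.0, 20.0, ppl=3.0)))
```
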
Categories: Journals

The Effect of a Voice Activity Detector on the Speech Enhancement Performance of the Binaural Multichannel Wiener Filter

Tue, 02/28/2012 - 09:46
A multimicrophone speech enhancement algorithm for binaural hearing aids that preserves interaural time delays was proposed recently. The algorithm is based on multichannel Wiener filtering and relies on a voice activity detector (VAD) for estimation of second-order statistics. Here, the effect of a VAD on the speech enhancement of this algorithm was evaluated using an envelope-based VAD, and the performance was compared to that achieved using an ideal error-free VAD. The performance was considered for stationary directional noise and nonstationary diffuse noise interferers at input SNRs from −10 to +5 dB. Intelligibility-weighted SNR improvements of about 20 dB and 6 dB were found for the directional and diffuse noise, respectively. No large degradations (<1 dB) due to the use of envelope-based VAD were found down to an input SNR of 0 dB for the directional noise and −5 dB for the diffuse noise. At lower input SNRs, the improvement decreased gradually to 15 dB for the directional noise and 3 dB for the diffuse noise.
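
A compact sketch of how a VAD feeds the second-order statistics of a multichannel Wiener filter: the noisy-speech covariance is estimated from frames the VAD marks as speech-active, the noise covariance from the remaining frames, and the filter is solved for a reference microphone. This is the generic MWF recipe, not the exact binaural algorithm evaluated in the paper.

```python
import numpy as np

def mwf_from_vad(frames, vad, ref_mic=0, diag_load=1e-6):
    """frames: (n_frames, n_mics) complex STFT values for one frequency bin;
    vad: boolean array, True where speech is active."""
    y_sp = frames[vad]           # speech-plus-noise frames
    y_n = frames[~vad]           # noise-only frames
    Ryy = y_sp.T @ y_sp.conj() / len(y_sp)
    Rnn = y_n.T @ y_n.conj() / len(y_n)
    Rss = Ryy - Rnn              # speech covariance estimate
    Ryy = Ryy + diag_load * np.eye(Ryy.shape[0])
    # Multichannel Wiener filter estimating the speech component at the reference mic.
    w = np.linalg.solve(Ryy, Rss[:, ref_mic])
    return w  # apply as np.vdot(w, y) for each frame vector y
```
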
Categories: Journals

Optimizing the Directivity of Multiway Loudspeaker Systems

Tue, 02/28/2012 - 09:46
In multiway loudspeaker systems, digital signal processing techniques have been used to correct the frequency response, the propagation time, and lobing errors. These solutions are mainly based on correcting the delays between the signals coming from the loudspeaker system transducers, and they still show limited performance over the overlap frequency bands. In this paper, we propose an enhanced optimization of the relevant directivity characteristics of a multiway loudspeaker system, such as the frequency response, the radiation pattern, and the directivity index, over the extended frequency overlap bands of the transducers. The optimization process is based on applying complex weights to the crossover filter transfer functions using an iterative approach.
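
A toy sketch of the underlying optimization step, assuming the per-driver responses toward each angle are known (measured or modelled): at each frequency in the overlap band, complex weights applied on top of the crossover filters are fitted to a target polar response. The paper uses an iterative approach; a direct per-frequency least-squares solve is shown here only to convey the idea.

```python
import numpy as np

def optimize_weights(driver_resp, crossover, target):
    """driver_resp: (n_angles, n_drivers) complex responses at one frequency;
    crossover: (n_drivers,) crossover filter values at that frequency;
    target: (n_angles,) desired complex polar response.
    Returns complex weights applied on top of the crossover filters."""
    A = driver_resp * crossover[None, :]           # summation model: A @ w ~ target
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w

# Repeating this per frequency bin across the drivers' overlap band yields the
# complex weight trajectories that shape the combined radiation pattern.
```
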
Categories: Journals

On the Characterization of Slowly Varying Sinusoids

Tue, 02/28/2012 - 09:46
We give a brief discussion of the amplitude and frequency variation rates of the sinusoid representation of signals. In particular, we derive three inequalities showing that these rates are upper bounded by the 2nd and 4th spectral moments, which, in a loose sense, indicates that every complex signal with a narrow short-time bandwidth is a slowly varying sinusoid. Further discussion shows how this result provides extra insight into relevant signal processing techniques.
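
For reference, the central spectral moments that such bounds are typically stated in terms of have the standard definitions below, computed from the short-time spectrum S(ω); the paper's exact normalization and windowing may differ.

```latex
\mu_n \;=\; \frac{\displaystyle\int_{-\infty}^{\infty} (\omega-\bar{\omega})^{\,n}\,\lvert S(\omega)\rvert^{2}\,d\omega}
                 {\displaystyle\int_{-\infty}^{\infty} \lvert S(\omega)\rvert^{2}\,d\omega},
\qquad
\bar{\omega} \;=\; \frac{\displaystyle\int_{-\infty}^{\infty} \omega\,\lvert S(\omega)\rvert^{2}\,d\omega}
                        {\displaystyle\int_{-\infty}^{\infty} \lvert S(\omega)\rvert^{2}\,d\omega}.
```

Here the square root of the 2nd moment is the usual short-time bandwidth, so a small bandwidth directly limits how fast the amplitude and frequency can vary.
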
Categories: Journals

Efficient Advertisement Discovery for Audio Podcast Content Using Candidate Segmentation

Tue, 02/28/2012 - 09:46
Audio podcasting is now widely used by online sites such as newspapers, web portals, and journals to deliver audio content to users through download or subscription. Within a podcast story of 1 to 30 minutes, multiple audio advertisements (ads), each 5 to 30 seconds long, are often inserted and repeated at different locations. Automatic detection of these attached ads is a challenging task due to the complexity of the search algorithms. Based on the typical structure of podcast content, this paper proposes a novel and efficient advertisement discovery approach for large audio podcast collections. The proposed approach offers a significant improvement in search speed with sufficient accuracy. The key to the acceleration lies in the candidate segmentation and sampling technique introduced to reduce both the search area and the number of matching frames. The approach has been tested on a variety of podcast content collected from MIT Technology Review, Scientific American, and Singapore Podcast websites. Experimental results show that the proposed algorithm achieves a detection rate of 97.5% with significant computational savings compared to existing state-of-the-art methods.
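
A hedged sketch of the frame-sampling idea mentioned above (not the paper's candidate segmentation itself): candidate match positions are first located on a coarsely sampled feature sequence, and only those candidates need to be verified at full frame resolution. The feature representation and threshold are placeholders.

```python
import numpy as np

def find_candidates(features, ad_features, step=10, coarse_thr=0.8):
    """Coarse pass: compare only every `step`-th frame to cut matching cost.
    features, ad_features: (n_frames, n_dims) arrays of per-frame features."""
    n, m = len(features), len(ad_features)
    candidates = []
    for start in range(0, n - m + 1, step):
        a = features[start:start + m:step]
        b = ad_features[::step]
        sim = np.mean(np.sum(a * b, axis=1) /
                      (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9))
        if sim > coarse_thr:
            candidates.append((start, sim))
    return candidates  # verify each candidate with a full-resolution comparison afterwards
```
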
Categories: Journals

Correlation-Based Amplitude Estimation of Coincident Partials in Monaural Musical Signals

Tue, 02/28/2012 - 09:46
This paper presents a method for estimating the amplitude of coincident partials generated by harmonic musical sources (instruments and vocals). It was developed as an alternative to the commonly used interpolation approach, which has several limitations in terms of performance and applicability. The strategy is based on the following observations: (a) the parameters of partials vary with time; (b) such variation tends to be correlated when the partials belong to the same source; (c) the presence of an interfering coincident partial reduces the correlation; and (d) such a reduction is proportional to the relative amplitude of the interfering partial. Besides the improved accuracy, the proposed technique has other advantages over its predecessors: it works properly even if the sources have the same fundamental frequency; it is able to estimate the first partial (fundamental), which is not possible using the conventional interpolation method; it can estimate the amplitude of a given partial even if its neighbors suffer intense interference from other sources; it works properly under noisy conditions; and it is immune to intraframe permutation errors. Experimental results show that the strategy clearly outperforms the interpolation approach.
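
To make observations (b)-(d) concrete, here is a small hedged illustration (not the authors' estimator): the frame-by-frame amplitude trajectories of two partials of the same source are strongly correlated, and adding an interfering coincident partial to one of them lowers that correlation roughly in proportion to the interferer's relative amplitude.

```python
import numpy as np

rng = np.random.default_rng(1)
frames = 200
common = 1.0 + 0.3 * np.sin(np.linspace(0, 6 * np.pi, frames))   # shared amplitude modulation
partial_a = common * (1.0 + 0.02 * rng.standard_normal(frames))
partial_b = common * (1.0 + 0.02 * rng.standard_normal(frames))

for rel_amp in (0.0, 0.5, 1.0):   # relative amplitude of an interfering coincident partial
    interferer = rel_amp * (1.0 + 0.3 * rng.standard_normal(frames))
    corr = np.corrcoef(partial_a, partial_b + interferer)[0, 1]
    print(f"interferer level {rel_amp:.1f} -> correlation {corr:.2f}")
```
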
Categories: Journals

Employing Second-Order Circular Suprasegmental Hidden Markov Models to Enhance Speaker Identification Performance in Shouted Talking Environments

Tue, 02/28/2012 - 09:46
Speaker identification performance is almost perfect in neutral talking environments; however, it deteriorates significantly in shouted talking environments. This work is devoted to proposing, implementing, and evaluating new models called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) to alleviate this deterioration. The proposed models possess the characteristics of both Circular Suprasegmental Hidden Markov Models (CSPHMMs) and Second-Order Suprasegmental Hidden Markov Models (SPHMM2s). The results of this work show that CSPHMM2s outperform First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s), and First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s) in shouted talking environments. In such environments and using our collected speech database, the average speaker identification performance based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s, and CSPHMM2s is 74.6%, 78.4%, 78.7%, and 83.4%, respectively. Speaker identification performance obtained with CSPHMM2s is close to that obtained from subjective assessment by human listeners.
Categories: Journals

Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

Tue, 02/28/2012 - 09:46
When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
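
A hedged sketch of the postprocessing-plus-uncertainty idea (the specific mask, uncertainty rule, and decoder interface in the paper differ): a soft time-frequency mask attenuates residual interference on one ICA output, and the amount of masking applied per bin is reused as a crude per-bin uncertainty that an uncertainty-aware decoder could consume.

```python
import numpy as np
from scipy.signal import stft, istft

def mask_with_uncertainty(ica_out, other_out, fs=16000, nperseg=512, floor=0.1):
    """Soft-mask one ICA output against the competing output and return a
    per-bin uncertainty proportional to the energy removed by the mask."""
    f, t, X = stft(ica_out, fs=fs, nperseg=nperseg)
    _, _, Z = stft(other_out, fs=fs, nperseg=nperseg)
    mask = np.abs(X) ** 2 / (np.abs(X) ** 2 + np.abs(Z) ** 2 + 1e-12)  # Wiener-like ratio mask
    mask = np.maximum(mask, floor)
    _, enhanced = istft(mask * X, fs=fs, nperseg=nperseg)
    uncertainty = (1.0 - mask) * np.abs(X) ** 2    # energy attributed to interference
    return enhanced, uncertainty
```
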
Categories: Journals

Adaptive Long-Term Coding of LSF Parameters Trajectories for Large-Delay/Very- to Ultra-Low Bit-Rate Speech Coding

Tue, 02/28/2012 - 09:46
This paper presents a model-based method for coding the LSF parameters of LPC speech coders on a “long-term” basis, that is, beyond the usual 20–30 ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with large delay but a very-low to ultra-low bit rate (i.e., below 1 kb/s). To do this, speech is first segmented into voiced/unvoiced segments. A discrete cosine model of the time trajectory of the LSF vectors is then applied to each segment to capture the LSF interframe correlation over the whole segment. Bidirectional transformation between the model coefficients and a reduced set of LSF vectors enables both efficient “sparse” coding (using here multistage vector quantizers) and the generation of interpolated LSF vectors at the decoder. The proposed method provides up to 50% gain in bit rate over frame-by-frame quantization while preserving signal quality, and it competes favorably with 2D-transform coding over the lower range of tested bit rates. Moreover, the implicit time-interpolation nature of the long-term coding process gives this technique high potential for use in speech synthesis systems.
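
A minimal sketch of the long-term trajectory coding idea, assuming a single LSF dimension over a segment of frames: the trajectory is represented by a few DCT coefficients, from which the trajectory can be reconstructed (and, as in the paper, interpolated) at the decoder. Quantization of the coefficients (the multistage VQ in the paper) is omitted.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_trajectory(lsf_track, n_coeffs=4):
    """Keep only the first n_coeffs DCT coefficients of one LSF trajectory."""
    c = dct(lsf_track, type=2, norm="ortho")
    return c[:n_coeffs]

def decode_trajectory(coeffs, n_frames):
    """Reconstruct the trajectory from the truncated coefficient set."""
    c = np.zeros(n_frames)
    c[:len(coeffs)] = coeffs
    return idct(c, type=2, norm="ortho")

track = np.linspace(0.2, 0.3, 40) + 0.01 * np.sin(np.linspace(0, 3, 40))
recon = decode_trajectory(encode_trajectory(track), n_frames=40)
print(np.max(np.abs(track - recon)))   # small reconstruction error with only 4 coefficients
```
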
Categories: Journals

Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition

Tue, 02/28/2012 - 09:46
Fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, selecting the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. Firstly, FFT- and FrFT- based spectrograms of an artificially-generated vowel are compared to demonstrate the methods. Secondly, an acoustic feature set combining MFCC and FrFT is proposed, and the transform orders for the FrFT are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that the FrFT-MFCC yields a better discriminability of tones and also of vowels, especially by using multitransform-order methods. Thirdly, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods can obtain slightly higher recognition rates compared to the reference MFCC-based recognizer.
Categories: Journals

Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

Tue, 02/28/2012 - 09:46
We are developing a method of Web-based unsupervised language model adaptation for the recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from the selected keyword candidates so that misrecognized words and correct words do not fall into the same query. The second idea is to determine the number of Web documents downloaded for each query according to the “query relevance.” Combining these two ideas, we can alleviate the bad effects of misrecognized keywords by decreasing the number of Web documents downloaded for queries that contain them. Finally, we examine a method of determining the number of adaptation iterations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimal number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).
Categories: Journals

Drum Sound Detection in Polyphonic Music with Hidden Markov Models

Tue, 02/28/2012 - 09:46
This paper proposes a method for transcribing drums from polyphonic music using a network of connected hidden Markov models (HMMs). The task is to detect the temporal locations of unpitched percussive sounds (such as bass drum or hi-hat) and recognise the instruments played. Contrary to many earlier methods, a separate sound event segmentation is not done, but connected HMMs are used to perform the segmentation and recognition jointly. Two ways of using HMMs are studied: modelling combinations of the target drums and a detector-like modelling of each target drum. Acoustic feature parametrisation is done with mel-frequency cepstral coefficients and their first-order temporal derivatives. The effect of lowering the feature dimensionality with principal component analysis and linear discriminant analysis is evaluated. Unsupervised acoustic model parameter adaptation with maximum likelihood linear regression is evaluated for compensating the differences between the training and target signals. The performance of the proposed method is evaluated on a publicly available data set containing signals with and without accompaniment, and compared with two reference methods. The results suggest that the transcription is possible using connected HMMs, and that using detector-like models for each target drum provides a better performance than modelling drum combinations.
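
The feature parametrisation described above (MFCCs plus first-order temporal derivatives) can be reproduced roughly as below; librosa is used here as a convenient stand-in, and the frame settings are assumptions rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def drum_features(path, sr=44100, n_mfcc=13):
    """MFCCs and their first-order deltas, stacked per frame."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)              # first-order temporal derivative
    return np.vstack([mfcc, delta]).T                # shape: (n_frames, 2 * n_mfcc)

# Dimensionality could then be reduced with PCA or LDA before HMM training, as
# evaluated in the paper.
```
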
Categories: Journals

Compact Acoustic Models for Embedded Speech Recognition

Tue, 02/28/2012 - 09:46
Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the same constraints.
Categories: Journals

An Adaptive Framework for Acoustic Monitoring of Potential Hazards

Tue, 02/28/2012 - 09:46
Robust recognition of general audio events constitutes a topic of intensive research in the signal processing community. This work presents an efficient methodology for the acoustic surveillance of atypical situations which can be used under different acoustic backgrounds. The primary goal is the continuous acoustic monitoring of a scene for potentially hazardous events, in order to help an authorized officer take appropriate action towards preventing human loss and/or property damage. A probabilistic hierarchical scheme is designed based on Gaussian mixture models and state-of-the-art sound parameters selected through extensive experimentation. A feature of the proposed system is its model adaptation loop, which provides adaptability to different sound environments. We report extensive experimental results, including installation in a real environment and operational detection rates for three days of continuous 24-hour operation. Moreover, we adopt a reliable testing procedure that demonstrates high detection rates in terms of average recognition, miss probability, and false alarm rates.
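
A compact sketch of the kind of GMM-based scoring such a hierarchical scheme rests on (the paper's hierarchy, feature set, and adaptation loop are not reproduced here): one Gaussian mixture per sound class, with a test clip assigned to the class whose model gives the highest average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_event_models(class_features, n_components=16):
    """class_features: dict mapping class name -> (n_frames, n_dims) training features."""
    return {name: GaussianMixture(n_components=n_components, covariance_type="diag",
                                  random_state=0).fit(feats)
            for name, feats in class_features.items()}

def classify_clip(models, clip_features):
    """Pick the class whose GMM best explains the clip's frames."""
    scores = {name: m.score(clip_features) for name, m in models.items()}
    return max(scores, key=scores.get)

# An adaptation loop in the spirit of the paper would periodically re-estimate (or
# adapt) the background model with recent field data to track the changing environment.
```
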
Categories: Journals

Performance Study of Objective Speech Quality Measurement for Modern Wireless-VoIP Communications

Tue, 02/28/2012 - 09:46
Wireless-VoIP communications introduce perceptual degradations that are not present in traditional VoIP communications. This paper investigates the effects of such degradations on the performance of three state-of-the-art standard objective quality measurement algorithms: PESQ, P.563, and an “extended” E-model. The comparative study suggests that measurement performance is significantly affected by the type and level of acoustic background noise, as well as by the speech codec and packet loss concealment strategy. On our data, PESQ attains superior overall performance, while P.563 and the E-model attain comparable performance figures.
Categories: Journals