News aggregator

Efficient Advertisement Discovery for Audio Podcast Content Using Candidate Segmentation

Audio podcasting is now widely used by online sites such as newspapers, web portals, and journals to deliver audio content to users through download or subscription. Within a single podcast story of 1 to 30 minutes, multiple audio advertisements (ads), each 5 to 30 seconds long, are often inserted and repeated at different locations. Automatic detection of these inserted ads is a challenging task due to the complexity of the search algorithms involved. Based on knowledge of the typical structure of podcast content, this paper proposes a novel, efficient advertisement discovery approach for large audio podcasting collections. The proposed approach offers a significant improvement in search speed with sufficient accuracy. The key to the acceleration is the candidate segmentation and sampling technique introduced to reduce both the search area and the number of matching frames. The approach has been tested on a variety of podcast content collected from the MIT Technology Review, Scientific American, and Singapore Podcast websites. Experimental results show that the proposed algorithm achieves a detection rate of 97.5% with significant computational savings compared to existing state-of-the-art methods.
Categories: Journals
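
The abstract does not spell out the algorithm; the Python sketch below only illustrates the two speed-ups it names: candidate segmentation (searching only promising regions) and frame sampling (comparing only a subset of frames). All function names, thresholds, and the fingerprint representation are illustrative assumptions, not the paper's method.

    import numpy as np

    def frame_energy(signal, frame_len=1024):
        """Mean energy per non-overlapping frame."""
        n = len(signal) // frame_len
        frames = signal[:n * frame_len].reshape(n, frame_len)
        return (frames ** 2).mean(axis=1)

    def candidate_frames(energies, threshold):
        """Frame indices loud enough to be worth searching (candidates)."""
        return np.flatnonzero(energies > threshold)

    def sampled_match(podcast_fp, ad_fp, start, stride=4, tol=1e-3):
        """Match the ad fingerprint at `start`, comparing only every
        `stride`-th frame to cut the number of matching operations."""
        idx = np.arange(0, len(ad_fp), stride)
        return np.mean(np.abs(podcast_fp[start + idx] - ad_fp[idx])) < tol

    fp = np.abs(np.random.default_rng(0).standard_normal(2000))  # toy fingerprint
    ad = fp[200:300].copy()            # an "ad" known to start at frame 200
    print(sampled_match(fp, ad, 200), sampled_match(fp, ad, 500))  # True False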

Correlation-Based Amplitude Estimation of Coincident Partials in Monaural Musical Signals

This paper presents a method for estimating the amplitude of coincident partials generated by harmonic musical sources (instruments and vocals). It was developed as an alternative to the commonly used interpolation approach, which has several limitations in terms of performance and applicability. The strategy is based on the following observations: (a) the parameters of partials vary with time; (b) this variation tends to be correlated when the partials belong to the same source; (c) the presence of an interfering coincident partial reduces the correlation; and (d) the reduction is proportional to the relative amplitude of the interfering partial. Besides improved accuracy, the proposed technique has other advantages over its predecessors: it works properly even if the sources have the same fundamental frequency; it can estimate the first partial (the fundamental), which is not possible with the conventional interpolation method; it can estimate the amplitude of a given partial even if its neighbors suffer intense interference from other sources; it works properly under noisy conditions; and it is immune to intraframe permutation errors. Experimental results show that the strategy clearly outperforms the interpolation approach.
Categories: Journals
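
A toy demonstration of observations (b) through (d) above: amplitude trajectories of partials from the same source are correlated, and an interfering coincident partial lowers that correlation. The signals and numbers here are synthetic; this is not the authors' estimator.

    import numpy as np

    rng = np.random.default_rng(0)
    frames = np.arange(100)
    envelope = 1.0 + 0.3 * np.sin(2 * np.pi * frames / 40)   # shared modulation

    partial_1 = envelope + 0.02 * rng.standard_normal(100)   # same source
    partial_2 = envelope + 0.02 * rng.standard_normal(100)
    interference = 0.5 * np.sin(2 * np.pi * frames / 7)      # coincident partial
    partial_2_mixed = partial_2 + interference

    clean = np.corrcoef(partial_1, partial_2)[0, 1]
    mixed = np.corrcoef(partial_1, partial_2_mixed)[0, 1]
    print(f"correlation, clean pair: {clean:.3f}")   # close to 1
    print(f"correlation, interfered: {mixed:.3f}")   # noticeably lower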

Employing Second-Order Circular Suprasegmental Hidden Markov Models to Enhance Speaker Identification Performance in Shouted Talking Environments

Speaker identification performance is almost perfect in neutral talking environments, but it deteriorates significantly in shouted talking environments. This work is devoted to proposing, implementing, and evaluating new models called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) to alleviate this degradation. The proposed models combine the characteristics of Circular Suprasegmental Hidden Markov Models (CSPHMMs) and Second-Order Suprasegmental Hidden Markov Models (SPHMM2s). The results of this work show that CSPHMM2s outperform First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s), and First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s) in shouted talking environments. In such environments, and using our collected speech database, average speaker identification performance based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s, and CSPHMM2s is 74.6%, 78.4%, 78.7%, and 83.4%, respectively. The performance obtained with CSPHMM2s is close to that obtained from subjective assessment by human listeners.
Categories: Journals

Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be recognized robustly. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs to reduce the remaining interference. To improve robustness to the artefacts and loss of information caused by this postprocessing, recognition can be greatly enhanced by treating the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential of nonlinear postprocessing combined with uncertainty-based decoding techniques for improving recognition of multiple overlapping speech signals.
Categories: Journals
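
A minimal sketch, using scikit-learn's FastICA, of the first stage described above: separating two simultaneously active sources from a two-channel mixture. The sources, mixing matrix, and parameters are synthetic assumptions; the paper's nonlinear post-masking and uncertainty decoding stages are not reproduced.

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 1, 8000)
    s1 = np.sin(2 * np.pi * 220 * t)              # "speaker" 1 (tone)
    s2 = np.sign(np.sin(2 * np.pi * 137 * t))     # "speaker" 2 (square wave)
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.6], [0.4, 1.0]])        # mixing matrix (the "room")
    X = S @ A.T                                   # two-microphone recording

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)                  # estimated sources
    # S_hat recovers s1 and s2 up to permutation and scaling; the residual
    # interference left in S_hat is what the paper's nonlinear postprocessing
    # and uncertainty decoding are designed to handle.
    print(S_hat.shape)                            # (8000, 2)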

Adaptive Long-Term Coding of LSF Parameters Trajectories for Large-Delay/Very- to Ultra-Low Bit-Rate Speech Coding

This paper presents a model-based method for coding the LSF parameters of LPC speech coders on a “long-term” basis, that is, beyond the usual 20–30 ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with large delay but a very to ultra-low bit rate (i.e., below 1 kb/s). To do this, speech is first segmented into voiced/unvoiced segments. A discrete cosine model of the time trajectory of the LSF vectors is then applied to each segment to capture the LSF interframe correlation over the whole segment. A bidirectional transformation between the model coefficients and a reduced set of LSF vectors enables both efficient “sparse” coding (here using multistage vector quantizers) and the generation of interpolated LSF vectors at the decoder. The proposed method provides up to 50% gain in bit rate over frame-by-frame quantization while preserving signal quality, and it competes favorably with 2D-transform coding at the lower end of the tested bit-rate range. Moreover, the implicit time-interpolation nature of the long-term coding process gives this technique high potential for use in speech synthesis systems.
Categories: Journals
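
A hedged sketch of the core idea: fit a truncated discrete cosine model to the time trajectory of one LSF coefficient across a segment, keep only a few model coefficients (the "sparse" code), and reconstruct the interpolated trajectory at the decoder. The quantization stages are omitted and the trajectory is synthetic.

    import numpy as np
    from scipy.fft import dct, idct

    frames = 40                                  # frames in the voiced segment
    t = np.arange(frames)
    lsf_track = 0.8 + 0.1 * np.sin(np.pi * t / frames)   # slowly varying LSF

    coeffs = dct(lsf_track, norm='ortho')
    k = 6                                        # keep only k DCT coefficients
    coeffs[k:] = 0.0                             # truncated long-term model
    reconstructed = idct(coeffs, norm='ortho')   # decoder-side interpolation

    err = np.max(np.abs(reconstructed - lsf_track))
    print(f"max reconstruction error with {k}/{frames} coefficients: {err:.4f}")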

Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition

The fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, how to select the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. First, FFT- and FrFT-based spectrograms of an artificially generated vowel are compared to demonstrate the methods. Second, an acoustic feature set combining MFCC and FrFT is proposed, and the FrFT transform orders are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that FrFT-MFCC yields better discriminability of tones and also of vowels, especially when multiple transform orders are used. Third, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods obtain slightly higher recognition rates than the reference MFCC-based recognizer.
Categories: Journals

Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas to avoid the effects of keywords derived from misrecognized words. The first is to compose multiple queries from the selected keyword candidates so that misrecognized words and correct words do not fall into the same query. The second is to determine the number of Web documents downloaded for each query according to its “query relevance.” Combining these two ideas, we can alleviate the adverse effect of misrecognized keywords by decreasing the number of documents downloaded for queries that contain them. Finally, we examine a method of determining the number of adaptation iterations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion finds nearly the optimal number of iterations. In the final experiment, word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).
Categories: Journals
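
An illustrative sketch of the two ideas just described: keyword candidates are split across several small queries so that one misrecognized word cannot poison every query, and the download budget is allocated in proportion to a relevance score per query. The scoring function, hit counts, and names are assumptions, not the paper's exact formulation.

    from itertools import combinations

    candidates = ["parliament", "budget", "minsiter", "tax"]  # one misrecognition

    queries = [set(pair) for pair in combinations(candidates, 2)]

    def relevance(query, hit_counts):
        """Toy relevance: product of normalized per-word hit counts."""
        total = sum(hit_counts.values())
        score = 1.0
        for w in query:
            score *= hit_counts.get(w, 0) / total
        return score

    hit_counts = {"parliament": 900, "budget": 800, "minsiter": 2, "tax": 700}
    budget = 100                     # total documents to download
    scores = [relevance(q, hit_counts) for q in queries]
    total = sum(scores)
    for q, s in zip(queries, scores):
        # Queries containing the misrecognized word get almost no budget.
        print(sorted(q), "->", round(budget * s / total), "documents")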

Drum Sound Detection in Polyphonic Music with Hidden Markov Models

This paper proposes a method for transcribing drums from polyphonic music using a network of connected hidden Markov models (HMMs). The task is to detect the temporal locations of unpitched percussive sounds (such as bass drum or hi-hat) and recognise the instruments played. Unlike many earlier methods, no separate sound event segmentation is performed; instead, connected HMMs carry out segmentation and recognition jointly. Two ways of using HMMs are studied: modelling combinations of the target drums, and detector-like modelling of each target drum. Acoustic feature parametrisation is done with mel-frequency cepstral coefficients and their first-order temporal derivatives. The effect of lowering the feature dimensionality with principal component analysis and linear discriminant analysis is evaluated. Unsupervised acoustic model parameter adaptation with maximum likelihood linear regression is evaluated as a way to compensate for the differences between the training and target signals. The performance of the proposed method is evaluated on a publicly available data set containing signals with and without accompaniment, and compared with two reference methods. The results suggest that transcription is possible using connected HMMs, and that detector-like models for each target drum perform better than models of drum combinations.
Categories: Journals
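
A sketch of the acoustic front end described above: MFCCs plus their first-order temporal derivatives, optionally reduced with PCA. It uses librosa and scikit-learn with synthetic noise standing in for a drum track; the connected-HMM network itself is not reproduced.

    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    sr = 22050
    y = np.random.default_rng(0).standard_normal(sr * 2).astype(np.float32)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
    delta = librosa.feature.delta(mfcc)                  # first-order derivatives
    features = np.vstack([mfcc, delta]).T                # (n_frames, 26)

    pca = PCA(n_components=12)                           # dimensionality reduction
    reduced = pca.fit_transform(features)
    print(reduced.shape)                                 # (n_frames, 12)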

Compact Acoustic Models for Embedded Speech Recognition

Speech recognition applications are known to require a significant amount of resources, whereas embedded speech recognition allows only a few kilobytes of memory, a few MIPS, and a small amount of training data. To fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model to obtain the probability density functions of each HMM state, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. To significantly reduce computational costs, adaptation is performed only on the global model (using techniques related to speaker recognition adaptation), with no need for state-dependent data. The whole approach yields a relative gain of more than 20% over a basic HMM-based system fitting the same constraints.
Categories: Journals

An Adaptive Framework for Acoustic Monitoring of Potential Hazards

Robust recognition of general audio events constitutes a topic of intensive research in the signal processing community. This work presents an efficient methodology for acoustic surveillance of atypical situations that can operate under different acoustic backgrounds. The primary goal is continuous acoustic monitoring of a scene for potentially hazardous events, to help an authorized officer take appropriate action toward preventing human loss and/or property damage. A probabilistic hierarchical scheme is designed based on Gaussian mixture models and state-of-the-art sound parameters selected through extensive experimentation. A distinctive feature of the proposed system is its model adaptation loop, which provides adaptability to different sound environments. We report extensive experimental results, including installation in a real environment and operational detection rates for three days of continuous 24-hour operation. Moreover, we adopt a reliable testing procedure that demonstrates high detection rates in terms of average recognition, miss probability, and false alarm rates.
Categories: Journals
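
A minimal sketch, on synthetic features, of the kind of probabilistic scheme described above: one Gaussian mixture models typical background sound, another models a hazardous event class, and each incoming feature frame is labeled by comparing log-likelihoods. The features, dimensions, and mixture sizes are assumptions; the hierarchy and adaptation loop are omitted.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    background = rng.normal(0.0, 1.0, size=(500, 4))   # stand-in feature vectors
    hazard = rng.normal(3.0, 1.0, size=(200, 4))       # e.g. scream / explosion

    gmm_bg = GaussianMixture(n_components=4, random_state=0).fit(background)
    gmm_hz = GaussianMixture(n_components=4, random_state=0).fit(hazard)

    frame = rng.normal(3.0, 1.0, size=(1, 4))          # incoming frame
    is_hazard = gmm_hz.score(frame) > gmm_bg.score(frame)
    print("hazard detected" if is_hazard else "background")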

Performance Study of Objective Speech Quality Measurement for Modern Wireless-VoIP Communications

Wireless-VoIP communications introduce perceptual degradations that are not present in traditional VoIP communications. This paper investigates the effects of such degradations on the performance of three state-of-the-art standard objective quality measurement algorithms: PESQ, P.563, and an “extended” E-model. The comparative study suggests that measurement performance is significantly affected by the type and level of acoustic background noise as well as by the speech codec and packet loss concealment strategy. On our data, PESQ attains superior overall performance, while P.563 and the extended E-model attain comparable performance.
Categories: Journals

Adaptive V/UV Speech Detection Based on Characterization of Background Noise

The paper presents an adaptive system for Voiced/Unvoiced (V/UV) speech detection in the presence of background noise. Genetic algorithms were used to select the features that offer the best V/UV detection according to the output of a background Noise Classifier (NC) and a Signal-to-Noise Ratio Estimation (SNRE) system. The system was implemented, and tests were performed using the TIMIT speech corpus and its phonetic classification. The results were compared with a nonadaptive classification system and with the V/UV detectors adopted by two important speech coding standards: the V/UV detection system in ETSI ES 202 212 v1.1.2 and the speech classification in the Selectable Mode Vocoder (SMV) algorithm. In all cases the proposed adaptive V/UV classifier outperforms the traditional solutions, giving an improvement of 25% in very noisy environments.
Categories: Journals
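
A toy sketch of a frame-level V/UV decision using two classic features, energy and zero-crossing rate. In the paper the feature set and thresholds are selected by a genetic algorithm per noise class and SNR; here both features and thresholds are fixed, illustrative values.

    import numpy as np

    def vuv_decision(frame, energy_thr=0.01, zcr_thr=0.25):
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        # Voiced speech: relatively high energy, low zero-crossing rate.
        return "V" if (energy > energy_thr and zcr < zcr_thr) else "UV"

    t = np.linspace(0, 0.02, 320)                       # 20 ms at 16 kHz
    voiced = 0.5 * np.sin(2 * np.pi * 120 * t)          # pitch-like frame
    unvoiced = 0.05 * np.random.default_rng(0).standard_normal(320)
    print(vuv_decision(voiced), vuv_decision(unvoiced))  # V UV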

Signal Processing Implementation and Comparison of Automotive Spatial Sound Rendering Strategies

Design and implementation strategies for spatial sound rendering in automotive scenarios are investigated in this paper. Six design methods are implemented for various rendering modes with different numbers of passengers. Specifically, downmixing algorithms aimed at balancing the front and back reproduction are developed for 5.1-channel input. The other five algorithms, based on inverse filtering, are implemented in two approaches. The first approach utilizes binaural head-related transfer functions (HRTFs) measured in the car interior, whereas the second, named the point-receiver model, targets a point receiver positioned at the center of the passenger's head. The proposed processing algorithms were compared via objective and subjective experiments under various listening conditions. Test data were processed with multivariate analysis of variance (MANOVA) and the least significant difference (Fisher's LSD) method as a post hoc test to establish the statistical significance of the experimental results. The results indicate that the inverse filtering algorithms are preferred for the single-passenger mode, whereas for the multipassenger mode the downmixing algorithms generally outperform the other processing techniques.
Categories: Journals
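
A hedged sketch of a 5.1-to-stereo downmix of the general kind described above. The gains follow common ITU-style downmix practice with an adjustable rear gain to trade off front/back balance; they are illustrative values, not the coefficients used in the paper.

    import numpy as np

    def downmix_51_to_stereo(fl, fr, c, lfe, sl, sr_ch, rear_gain=0.7071):
        """All inputs are equal-length 1-D sample arrays.
        The LFE channel is commonly omitted in a stereo downmix."""
        center_gain = 0.7071          # -3 dB for the center channel
        left = fl + center_gain * c + rear_gain * sl
        right = fr + center_gain * c + rear_gain * sr_ch
        # Normalize to avoid clipping after summation.
        peak = max(np.max(np.abs(left)), np.max(np.abs(right)), 1.0)
        return left / peak, right / peak

    chans = [np.random.default_rng(i).standard_normal(1000) for i in range(6)]
    L, R = downmix_51_to_stereo(*chans)
    print(L.shape, R.shape)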

Tracking Intermittently Speaking Multiple Speakers Using a Particle Filter

Tracking multiple intermittently speaking speakers is difficult because several distinct problems must be addressed: the number of active speakers must be estimated, those active speakers must be identified, and the locations of all speakers, including inactive ones, must be tracked. In this paper we propose a method for tracking intermittently speaking multiple speakers using a particle filter. In the proposed algorithm, the number of active speakers is first estimated with the Exponential Fitting Test (EFT), a source-number estimation technique we have previously proposed. The speaker locations are then tracked within a particle filtering framework in which a decomposed likelihood is used to decouple the observed audio signal and associate each element of the decomposed signal with an active speaker. Tracking accuracy is further improved by a silence-region detection step and estimation of the noise-only covariance matrix. The method was evaluated using live recordings of three speakers, and the results show that it produces highly accurate tracking.
Categories: Journals
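
A minimal bootstrap particle filter sketch for tracking a single source position in one dimension, to illustrate the framework the paper builds on. The paper's contributions (EFT-based speaker counting, decomposed likelihoods, silence-region handling) sit on top of this basic predict/weight/resample loop and are not shown; all models and parameters below are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    particles = rng.uniform(0.0, 5.0, 500)   # initial position guesses

    def step(particles, observation, motion_std=0.05, obs_std=0.2):
        # Predict: random-walk motion model.
        particles = particles + rng.normal(0.0, motion_std, particles.size)
        # Weight: Gaussian likelihood of the (e.g. acoustic) location observation.
        weights = np.exp(-0.5 * ((observation - particles) / obs_std) ** 2)
        weights /= weights.sum()
        # Resample according to the weights (multinomial, for brevity).
        idx = rng.choice(particles.size, particles.size, p=weights)
        return particles[idx]

    true_path = 2.0 + 0.01 * np.arange(100)  # slowly moving speaker
    for obs in true_path + rng.normal(0.0, 0.2, 100):
        particles = step(particles, obs)
    print(f"estimate: {particles.mean():.2f}, truth: {true_path[-1]:.2f}")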

Musical Sound Separation Based on Binary Time-Frequency Masking

The problem of overlapping harmonics is particularly acute in musical sound separation and has not been addressed adequately. We propose a monaural system based on binary time-frequency masking, with an emphasis on robust decisions in time-frequency regions where harmonics from different sources overlap. Our computational auditory scene analysis system exploits the observation that sounds from the same source tend to have similar spectral envelopes. Quantitative results show that utilizing spectral similarity helps binary decision making in overlapped time-frequency regions and significantly improves separation performance.
Categories: Journals
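
A sketch of the ideal binary time-frequency mask that systems like this one try to approximate: each STFT cell of the mixture is assigned to whichever source is locally dominant. The paper's contribution, using spectral-envelope similarity to resolve cells where harmonics overlap, is not reproduced; the two "instruments" here are synthetic tones.

    import numpy as np
    from scipy.signal import stft, istft

    fs = 8000
    t = np.arange(fs) / fs
    src1 = np.sin(2 * np.pi * 440 * t)              # "instrument" 1
    src2 = np.sin(2 * np.pi * 660 * t)              # "instrument" 2
    mix = src1 + src2

    _, _, S1 = stft(src1, fs)
    _, _, S2 = stft(src2, fs)
    _, _, M = stft(mix, fs)

    mask = np.abs(S1) > np.abs(S2)                  # binary decision per TF cell
    _, est1 = istft(M * mask, fs)                   # estimate of source 1
    _, est2 = istft(M * ~mask, fs)                  # estimate of source 2
    print(est1.shape, est2.shape)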

Analysis of Salient Feature Jitter in the Cochlea for Objective Prediction of Temporally Localized Distortion in Synthesized Speech

Temporally localized distortions account for the highest variance in subjective evaluation of coded speech signals (Sen (2001) and Hall (2001)). The ability to discern and decompose perceptually relevant, temporally localized coding noise from other types of distortion is of theoretical importance as well as a valuable tool for designing and deploying speech synthesis systems. The work described here uses a physiologically motivated cochlear model to provide a tractable analysis of salient feature trajectories as processed by the cochlea. Subsequent statistical analysis shows simple relationships between the jitter of these trajectories and temporal attributes of the Diagnostic Acceptability Measure (DAM).
Categories: Journals

A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation

We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning phase and a classification phase. In the learning phase, predefined training data is used to compute various time-domain and frequency-domain features for speech and music signals separately, and to estimate the optimal speech/music thresholds based on the probability density functions of the features. An automatic procedure is employed to select the features best suited for separation. In the classification phase, an initial classification is performed for each segment of the audio signal using a three-stage, sieve-like approach that applies both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique is applied that averages the decision on each segment with past segment decisions. Extensive evaluation of the algorithm on a database of more than 12 hours of speech and more than 22 hours of music showed correct identification rates of 99.4% and 97.8%, respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm can easily be adapted to different audio types and is suitable for real-time operation.
Categories: Journals
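
A sketch of the decision-smoothing step described above: the raw per-segment speech/music decision is averaged with recent past decisions to suppress erroneous rapid alternations. The feature extraction and sieve-like classifier stages are omitted; the raw decisions below are made up for illustration.

    import numpy as np

    raw = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0])  # 1 = speech, 0 = music

    def smooth(decisions, history=3):
        out = []
        for i in range(len(decisions)):
            # Average the current decision with up to `history` past ones.
            window = decisions[max(0, i - history):i + 1]
            out.append(int(round(np.mean(window))))
        return out

    print(smooth(raw))   # isolated flips are absorbed by the running average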

Integrated Phoneme Subspace Method for Speech Feature Extraction

Speech feature extraction has been a key focus in robust speech recognition research. In this work, we discuss data-driven linear feature transformations applied to feature vectors in the logarithmic mel-frequency filter bank domain. The transformations are based on principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA). Furthermore, this paper introduces a new feature extraction technique that collects the correlation information among phoneme subspaces and reconstructs the feature space to represent phonemic information efficiently. The proposed speech feature vector is generated by projecting an observed vector onto an integrated phoneme subspace (IPS) based on PCA or ICA. The performance of the new feature was evaluated on isolated word speech recognition. The proposed method provided higher recognition accuracy than conventional methods in clean and reverberant environments.
Categories: Journals
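
A hedged sketch of the projection step only: log-mel filter bank features are projected onto a low-dimensional subspace learned with PCA. The paper's integrated phoneme subspace additionally combines per-phoneme subspaces via their correlations, which this toy example (with random stand-in features) does not do.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    log_mel = rng.standard_normal((1000, 24))     # stand-in (frames x filters)

    subspace = PCA(n_components=12).fit(log_mel)  # learn basis from training data
    projected = subspace.transform(log_mel)       # features used for recognition
    print(projected.shape)                        # (1000, 12)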

Analysis of Damped Mass-Spring Systems for Sound Synthesis

There are many ways of synthesizing sound on a computer. The method we consider, called a mass-spring system, synthesizes sound by simulating the vibrations of a network of interconnected masses, springs, and dampers. Numerical methods are required to approximate the differential equations of a mass-spring system. The standard numerical method for implementing mass-spring systems for sound synthesis is the symplectic Euler method. Implementers and users of mass-spring systems should be aware of the limitations of the numerical methods employed; in particular, we are interested in their stability and accuracy. We present an analysis of the symplectic Euler method that shows the conditions under which it is stable, and the accuracy of the decay rates and frequencies of the sounds produced.
Categories: Journals
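
A minimal sketch of the symplectic Euler update for a single damped mass-spring oscillator, the building block of the networks described above. For the undamped case the method is stable when the time step h satisfies h < 2 * sqrt(m / k); the specific mass, stiffness, and damping values below are illustrative.

    import math

    m, k, d = 1.0, 1e5, 0.5        # mass, spring stiffness, damping
    h = 1.0 / 44100                # audio-rate time step
    assert h < 2 * math.sqrt(m / k), "time step too large: method unstable"

    x, v = 1e-3, 0.0               # initial displacement and velocity
    samples = []
    for _ in range(44100):
        # Symplectic Euler: update velocity first, then position with the
        # *new* velocity (this ordering is what makes the method symplectic).
        a = (-k * x - d * v) / m
        v = v + h * a
        x = x + h * v
        samples.append(x)          # x is the synthesized sound signal

    print(f"peak amplitude over one second: {max(abs(s) for s in samples):.2e}")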

An Overview of the Coding Standard MPEG-4 Audio Amendments 1 and 2: HE-AAC, SSC, and HE-AAC v2

In 2003 and 2004, the ISO/IEC MPEG standardization committee added two amendments to the MPEG-4 audio coding standard. These amendments concern parametric coding techniques and encompass Spectral Band Replication (SBR), Sinusoidal Coding (SSC), and Parametric Stereo (PS). In this paper, we give an overview of the basic ideas behind these techniques and provide references to more detailed information. Furthermore, the results of listening tests performed during the final stages of the MPEG-4 standardization process are presented to illustrate the performance of these techniques.
Categories: Journals