Voice-Conversion

We invite the community interested in the field of Voice Conversion (VC) to participate in the definition of an evaluation framework for Voice Conversion algorithms. We aim to define subjective and objective evaluation measures and to propose a standard methodology that allows an efficient comparison of Voice Conversion algorithms. The discussion on this topic is just starting, and we strongly encourage colleagues working on VC research to get in contact with us and share their ideas.

We especially invite you to join us at the DAFxTRa presentation talk that will be organized in the context of the DAFx 08 conference.

Here we present a brief introduction to the evaluation of Voice Conversion systems and some proposals as a starting point. Feel free to send us your comments.

 

Introduction

The goal of a Voice Conversion system is to make the voice of a "source" speaker perceptually similar to that of a "target" speaker. Typically, this conversion effect is achieved through statistical learning of acoustic, articulatory or physical-model-based parameters computed from speech databases. Apart from the gender and age information conveyed by the average pitch, the "identity" of a speaker is mainly attributed to voice timbre. This vocal quality is characterized by the shape of the spectral envelope of the speech signal. Consequently, in its simplest form, the task of a VC system consists of modifying the spectral envelope information and normalizing the average pitch of the source speaker to match that of the target.
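
The average-pitch normalization described above can be sketched as follows. This is only a minimal illustration, not the method of any particular VC system. It assumes that F0 contours are given as lists of values in Hz, that a value of 0 marks an unvoiced frame, and that the averages are matched in the log-F0 domain (a common choice, but an assumption here). The function name `normalize_mean_f0` is hypothetical.

```python
import math

def normalize_mean_f0(source_f0, target_f0):
    """Scale a source F0 contour so its mean log-F0 matches the target's.

    Both arguments are sequences of F0 values in Hz. Frames with F0 <= 0
    are treated as unvoiced (assumed convention) and returned as 0.0.
    """
    src_voiced = [f for f in source_f0 if f > 0]
    tgt_voiced = [f for f in target_f0 if f > 0]
    if not src_voiced or not tgt_voiced:
        raise ValueError("both contours need at least one voiced frame")

    mean_log_src = sum(math.log(f) for f in src_voiced) / len(src_voiced)
    mean_log_tgt = sum(math.log(f) for f in tgt_voiced) / len(tgt_voiced)

    # A shift in the log domain is a constant multiplicative factor in Hz.
    ratio = math.exp(mean_log_tgt - mean_log_src)
    return [f * ratio if f > 0 else 0.0 for f in source_f0]
```

Because only a global factor is applied, the local shape of the source contour (and therefore the source prosody) is preserved; only its average level moves towards the target.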

Evaluation of VC systems


Usually, the performance of VC algorithms is evaluated using objective measures and some well-known subjective tests. However, a performance comparison between different VC approaches cannot be established in a straightforward way, since different databases are commonly used and the evaluation criteria are not homogeneous. After investigating the evaluation methodology used in some of the current state-of-the-art VC proposals, we propose to start our discussion by presenting some aspects that we consider important for defining efficient, uniform and homogeneous evaluation criteria.

 

  • Categorization of the conversion cases
The performance of VC systems proves to be sensitive to the source-target similarity with respect to characteristics such as speaker gender (female, male), age (child, young, old) and voice quality (rough, breathy, clean). In general, the gender category is evaluated independently. Nevertheless, an extended categorization of the conversion cases that also considers age and voice quality should lead us to a more judicious comparison. More precisely, considering the voice quality of the speaker could help us clarify whether or not a perceived low performance is due to the conversion performance of the particular synthesis framework. The inclusion of language as an additional category could also be discussed, since the performance of some VC approaches can be strongly dependent on the semantic, prosodic and phonetic content of the speech databases.
 
  • Similar evaluation set
Clearly, in order to establish a direct comparison, the evaluation set of utterances applied across the different VC algorithms must be the same. In principle, the nature of the speakers used at the evaluation stage must be defined in relation to the categorization mentioned in the first point. In addition, the definition of a standard sampling rate at the evaluation stage is an important issue to discuss, since some VC approaches were designed to deal with speech signals of different quality (telephone applications, high-quality audio).
 
  • Universal database content

Some algorithms rely on large parallel source-target databases for the learning stage. On the other hand, some approaches that are not restricted to paired data need a variety of small single-speaker databases in order to fit a multi-speaker model. It seems preferable to use a speech database that meets all these conditions (large size, parallel content and a large number of speakers).

 

  • Definition of the F0 evaluation modality
As mentioned in the introduction, pitch "conversion" is achieved by applying the average target F0 to the source speech. This is especially necessary for inter-gender conversion and/or cases where the ages of the speakers differ significantly. However, some VC approaches perform local F0 transformation, since they include the characterization and conversion of prosodic features. Clearly, this could represent an advantage for these methods (compared to the average normalization case) in perceptual tests where the evaluated utterances are pronounced with a different prosody by the source and target speakers. The way in which the prosodic information will be evaluated must definitely be considered when defining the prosodic nature of the speech database. In addition, it is necessary to take into account that some VC approaches could be restricted, at the training stage, to the use of specific prosodic content.

 

  • Use of perceptual criteria in the objective evaluation
Commonly, the objective evaluation of VC systems is based on spectral distortion measures. These measures give us some information about the conversion performance, but they do not provide enough information about the perceptual effect of the conversion error. We therefore propose to include some perceptual concepts (such as those found in the PEAQ and PESQ standards) in the objective evaluation. In addition, since most VC systems include a machine learning stage, a conversion rate could also be considered in order to evaluate the performance of the learning stage.
 
  • Definition of subjective test modalities
It appears logical to use original speech, rather than synthetic speech, as the source and target reference signals; otherwise, the perceptual tests could be significantly biased. Also, the use of five-point scores such as the MOS could be extended to include categories concerning the synthesis quality and some aspects of the conversion effect (speech naturalness, third-speaker effect, voice quality). Finally, if the aim is to evaluate the timbre conversion exclusively, it is necessary to exclude as much as possible the perceptual effect of the prosodic information when comparing source, target and converted phrases (XAB and ABX tests). One solution is to present the source and target speakers using utterances different from the one used for the conversion.
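
As an illustration of how the answers of an ABX identity test could be tallied, the sketch below counts how often listeners judged the converted sample X to be closer to the target speaker, and computes how likely such a count would be under pure guessing. The response encoding (pairs of the listener's answer and the label that held the target speaker, each 'A' or 'B') and the function names are assumptions made for this example, not part of any agreed protocol.

```python
from math import comb

def abx_target_rate(trials):
    """Fraction of trials where X was judged closer to the target speaker.

    `trials` is an iterable of (answer, target_label) pairs, each 'A' or 'B'
    (hypothetical encoding): `answer` is the listener's choice and
    `target_label` says which of A/B was the target speaker in that trial.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("no trials given")
    hits = sum(1 for answer, target_label in trials if answer == target_label)
    return hits / len(trials)

def binomial_tail(hits, n, p=0.5):
    """Probability of at least `hits` successes in `n` trials at chance level p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))
```

Randomizing which of A and B holds the target speaker in each trial, and storing that label with the answer as assumed here, keeps a listener's preference for one position from inflating the rate.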
 
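As a concrete instance of the spectral distortion measures mentioned under the objective evaluation point, the sketch below computes a root-mean-square log-spectral distance in dB between converted and target spectral envelopes. It is one common definition among several, shown under stated assumptions: the two utterances are already time-aligned frame by frame, both envelopes are sampled on the same frequency bins, and all magnitudes are strictly positive. The function names are hypothetical.

```python
import math

def log_spectral_distance(env_a, env_b):
    """RMS difference in dB between two magnitude envelopes of one frame.

    Both envelopes must have the same number of bins and contain
    strictly positive linear magnitudes (assumed input format).
    """
    if len(env_a) != len(env_b) or len(env_a) == 0:
        raise ValueError("envelopes must be non-empty and of equal length")
    squared = [(20.0 * math.log10(a / b)) ** 2 for a, b in zip(env_a, env_b)]
    return math.sqrt(sum(squared) / len(squared))

def mean_log_spectral_distance(frames_a, frames_b):
    """Average the per-frame distance over two aligned envelope sequences."""
    if len(frames_a) != len(frames_b) or len(frames_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(log_spectral_distance(a, b) for a, b in zip(frames_a, frames_b))
    return total / len(frames_a)
```

A measure of this kind weights every frequency bin equally, which is precisely why it says little about the perceptual effect of the conversion error.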

To summarize, our initial interest lies in unifying the criteria for the selection of the speech database and in considering perceptual cues in the evaluation framework. We therefore invite everyone interested in this field to share their opinions and suggestions with us, in order to take a step forward towards the definition of the evaluation framework.

 

 

This page is maintained by Fernando Villavicencio (fvillavicencio@iua.upf.edu)