Sound

In this section we review the research on sound that is being carried out within the boundaries identified in the definition of the field. From a sound to sense point of view, we include the analysis, understanding and description of all musical and non-musical sounds except speech. Then, in the sense to sound direction, we include the research that is more related to sound synthesis and processing.

Sound Description and Understanding

One of the basic aims of SMC research is to understand the different facets of sound from a computational point of view, or by using computational means and models. We want to understand and model not only the properties of sound waves but also the mechanisms of their generation, transmission and perception by humans. Even more, we want to understand sound as the basic communication channel for music and a fundamental element in our interaction with the environment. Sound serves as one of the main signals for human communication, and its understanding and description requires a notably multidisciplinary approach.

Traditionally, the main interest of SMC researchers has been musical sounds and thus the understanding of the sound generated by musical instruments and the specific transmission and perception mechanisms involved in the music communication chain. In recent years, this focus has been broadened and there is currently an increased interest in non-musical sounds and aspects of communication beyond music. A number of the methodologies and technologies developed for music are starting to be used for human communication and interaction through sound in general (e.g., ecological sounds) and there is increasing cross-fertilisation between the various sound-related disciplines.

There has been a great deal of research work on the analysis and description of sound by means of signal processing techniques, extracting features at different abstraction levels and developing source-specific and application-dependent technologies. Most of the current research in this domain starts from frequency domain techniques as a step towards developing sound models that might be used for recognition, retrieval, or synthesis applications. Other approaches consider sparse atomic signal representations such as matching pursuit, the analytical counterpart to granular synthesis [Sturm et al. 2006].
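
As a concrete illustration of such an atomic decomposition, the following sketch (Python/NumPy; the Gabor-like dictionary and all parameter values are illustrative assumptions rather than details taken from the cited work) runs a plain matching pursuit loop, repeatedly picking the dictionary atom most correlated with the residual and subtracting its contribution:

    import numpy as np

    def matching_pursuit(x, dictionary, n_atoms=10):
        # Greedy sparse decomposition of x over unit-norm dictionary columns.
        # Returns the chosen atom indices, their gains and the final residual.
        residual = x.astype(float).copy()
        indices, gains = [], []
        for _ in range(n_atoms):
            correlations = dictionary.T @ residual        # inner product with every atom
            k = int(np.argmax(np.abs(correlations)))      # best-matching atom
            indices.append(k)
            gains.append(correlations[k])
            residual -= correlations[k] * dictionary[:, k]
        return indices, gains, residual

    # Illustrative dictionary of Gabor-like atoms (Gaussian-windowed cosines).
    N = 1024
    t = np.arange(N)
    atoms = []
    for f in np.linspace(0.01, 0.4, 40):                  # normalised frequencies
        for c in np.linspace(0.1, 0.9, 8):                # window centres
            a = np.exp(-0.5 * ((t - c * N) / (0.05 * N)) ** 2) * np.cos(2 * np.pi * f * t)
            atoms.append(a / np.linalg.norm(a))
    D = np.stack(atoms, axis=1)

    x = 0.8 * D[:, 50] + 0.5 * D[:, 200] + 0.01 * np.random.randn(N)
    idx, g, r = matching_pursuit(x, D)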

Also of importance has been the study of sound-producing physical objects. The aim of such study is to understand the acoustic characteristics of musical instruments and other physical objects which produce sounds relevant to human communication. Its main application has been the development of physical models of these objects for synthesis applications [Smith, 2006; Välimäki, et al., 2006; Rocchesso & Fontana, 2003], so that the user can produce sound by interacting with the models in a physically meaningful way.

However, beyond the physical aspect, sound is a communication channel that carries information. We are therefore interested in identifying and representing this information. Signal processing techniques can only go so far in extracting the meaningful content of a sound. Thus, in the past few years there has been an exponential increase in research activity which aims to generate semantic descriptions automatically from audio signals. Statistical Modelling, Machine Learning, Music Theory and Web Mining technologies have been used to raise the semantic level of sound descriptors. MPEG-7 [Kim et al., 2005] has been created to establish a framework for effective management of multimedia materials, standardising the description of sources, perceptual aspects and other relevant descriptors of a sound or any multimedia asset.

Most research approaches to sound description are essentially bottom-up, starting from the audio signal and trying to reach the highest possible semantic level. There is a general consensus that this approach has clear limitations and does not allow us to bridge what is known as the ‘semantic gap’ – that is, the discrepancy between what can currently be extracted from audio signals and the kinds of high-level, semantically meaningful concepts that human listeners associate with sounds and music. The current trend is towards multimodal processing methods and top-down approaches based on ontologies, reasoning rules, and cognition models. Also, in practical applications (e.g., in web-based digital music services), collaborative tagging by users is being increasingly used to gain semantic information that would be hard or impossible to extract with current computational methods.

Sound Description and Understanding: Key Issues

The above synopsis of research in sound description and understanding has already revealed a number of current limitations and open problems. Below, we present some selected research questions that should be addressed, or issues that should be taken into account, in future research.

Perceptually informed models of acoustic information processing: There is an active field of research in neuroscience that tries to relate behavioural and physiological observations by means of computational models. There is a wide variety of approaches in the computational neuroscience field, from models based on accurate simulations of single neurons to systems-based models relying heavily on information theory. SMC has already benefitted in the past from auditory models as signal processing tools. For instance, audio compression schemes such as MP3 are heavily based on models of perceptual masking. This trend is set to continue as the models become more robust and computationally efficient. In the future, the interaction between auditory models and SMC could also be on a conceptual level. For instance, the sensory-motor theory suggests that the study of sound perception and production should be intimately related.

Sound source recognition and classification: The ability of a normal human listener to recognise objects in the environment from only the sounds they produce is extraordinarily robust. In contrast, computer systems designed to recognise sound sources function precariously, breaking down whenever the target sound is degraded by reverberation, noise, or competing sounds. Musical signals present a real challenge for existing systems, as all three sources of difficulty are almost always present. SMC can thus contribute to the development of sound source recognition systems by providing well-controlled test situations that retain an ecological value [Elhilali, Shamma, Thorpe and Pressnitzer, 2007]. In return, models of sound source recognition will have obvious applications in current and future applications of SMC, such as score following (adding timbre cues to the pitch cues normally used) or music information retrieval systems.
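
To make the task concrete, a minimal recognition pipeline might look like the sketch below (Python with NumPy and scikit-learn). The three clip-level descriptors and the k-nearest-neighbour classifier are illustrative assumptions, and the labelled clips are hypothetical; real systems use far richer features and models:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def clip_features(signal, sr, frame=2048, hop=1024):
        # Coarse per-clip descriptors: mean spectral centroid, mean spectral
        # bandwidth and zero-crossing rate.
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        centroids, bandwidths = [], []
        for start in range(0, len(signal) - frame, hop):
            spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * np.hanning(frame)))
            weights = spectrum / (spectrum.sum() + 1e-12)
            c = float(np.sum(freqs * weights))
            centroids.append(c)
            bandwidths.append(float(np.sqrt(np.sum(weights * (freqs - c) ** 2))))
        zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) / 2.0))
        return np.array([np.mean(centroids), np.mean(bandwidths), zcr])

    # Hypothetical labelled training clips: pairs of (waveform, source label).
    # X = np.stack([clip_features(y, 44100) for y, _ in training_clips])
    # labels = [label for _, label in training_clips]
    # model = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
    # print(model.predict([clip_features(new_clip, 44100)]))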

Sound search and retrieval based on content: Audio content analysis and description enables various new and advanced audiovisual applications and services. Search engines or specific filters could use the extracted description to help users navigate or browse through large collections of audio data. Digital analysis of an audio file may be able to discriminate between speech, music and other entities, or identify how many speakers are contained in a speech segment, what gender they are, and even who exactly is speaking. Spoken content may be identified and converted to text. Music might be classified into categories, such as jazz, rock and classical [Tzanetakis & Cook, 2002] (although this is problematic because such categories are user-dependent and perhaps cannot be unequivocally defined). Finally, it may be possible to automatically identify and find particular sounds, such as explosions, gunshots, etc. [Cano, 2007]. For such scenarios to become really useful, the necessary improvements in sound search and retrieval will call for a change of paradigm in the description of sounds – from descriptions constrained to a finite number of crisp labels, towards natural language descriptions, at a higher semantic level, similar to that used by humans. A step in this direction might be the inclusion of reasoning rules and knowledge bases (sound ontologies) encoding common sense knowledge about sound. Another key issue is the combination of information from complementary media, such as video or images.
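
In its simplest form, content-based retrieval amounts to query-by-example over precomputed descriptor vectors, as in the sketch below (Python/NumPy; the descriptor matrix and the feature extractor are assumptions, e.g. the clip_features function sketched earlier). Richer, natural-language-level search would require mapping such vectors onto ontology concepts rather than ranking them directly:

    import numpy as np

    def rank_by_similarity(query_vector, descriptor_matrix):
        # Rank collection items from most to least similar to the query,
        # using cosine similarity between descriptor vectors.
        q = query_vector / (np.linalg.norm(query_vector) + 1e-12)
        db = descriptor_matrix / (np.linalg.norm(descriptor_matrix, axis=1, keepdims=True) + 1e-12)
        return np.argsort(-(db @ q))

    # Hypothetical usage: one row of descriptor_matrix per sound in the collection.
    # ranking = rank_by_similarity(clip_features(query_clip, 44100), descriptor_matrix)
    # best_matches = ranking[:10]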

Sound Synthesis and Processing

Sound synthesis and processing has been the most active research area in SMC for more than 40 years. Quite a number of the research results of the 1960s and 70s are now standard components of many audio and music devices, and new technologies are continuously being developed and integrated into new products [Välimäki et al., 2007]. The sounds of our age are digital. Most of them are generated, processed, and transcoded digitally. Given that these technologies have already become so common and that most recent developments represent only incremental improvements, research in this area has lost some of its prominence in comparison to others in SMC. Nonetheless, there remain a number of open issues to be worked on, and some of the new trends have the potential for huge industrial impact.

With respect to sound synthesis, most of the abstract algorithms that were the focus of work in the 1970s and 80s – e.g., FM and waveshaping – were not directly related to a sound source or its perception (though some of the research was informed by knowledge of musical acoustics and source physics). The 1990s saw the emergence of computational modeling approaches to sound synthesis. These aimed either at capturing the characteristics of a sound source, known as physical models [Smith, 2006; Välimäki et al., 2006; Cadoz et al., 1993], or at capturing the perceptual characteristics of the sound signal, generally referred to as spectral or signal models [Serra, 1997]. The technology transfer expectations of the physical models of musical instruments have not been completely fulfilled. Their expressiveness and intuitive control – advantages originally attributed to this kind of model – did not help commercial music products to succeed in the marketplace. Meanwhile, synthesis techniques based on spectral modelling have met with competitive success in voice synthesisers, both for speech and singing voices [Bonada and Serra, 2007], but to a lesser extent in the synthesis of all other musical instruments. A recent and promising trend is the combination of physical and spectral models, such as physically informed sonic modelling [Cook, 1997] and commuted synthesis [Smith, 2006; Välimäki et al., 2006]. Another recent trend is to simulate traditional analog electronics used in music synthesizers of the 1960s and 1970s [Lane, 1997; Välimäki and Huovilainen, 2006] and in amplifiers used by electric guitar and bass players [Karjalainen et al., 2006; Yeh and Smith, 2006].
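
For reference, the abstract-algorithm family mentioned above can be illustrated in a few lines. The sketch below (Python/NumPy; all parameter values are arbitrary examples) generates a simple two-operator FM tone, in which a modulator sinusoid varies the carrier's instantaneous phase and the modulation index controls spectral richness:

    import numpy as np

    def fm_tone(carrier_hz=440.0, mod_hz=220.0, mod_index=3.0, duration=1.0, sr=44100):
        # Two-operator FM: the modulation index is scaled by the amplitude
        # envelope so the tone is brightest at the attack, as in classic FM bells.
        t = np.arange(int(duration * sr)) / sr
        envelope = np.exp(-3.0 * t)
        phase = 2 * np.pi * carrier_hz * t + mod_index * envelope * np.sin(2 * np.pi * mod_hz * t)
        return envelope * np.sin(phase)

    tone = fm_tone()   # one second of a bell-like FM tone at 44.1 kHz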

As an evolution of granular synthesis techniques (e.g., [Roads, 2001]), new corpus-based concatenative methods for musical sound synthesis, also known as mosaicing, have attracted much attention recently [Schwarz, 2007]. They make use of a variety of sound snippets in a database to assemble a desired sound or phrase according to a target specification given via sound descriptors or by an example sound. With ever-larger sound databases readily available, together with a pertinent description of their contents, these methods are increasingly used for composition, high-level instrument synthesis, interactive exploration of sound corpora, and other applications [Lindeman, 2007].
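
The core selection step of such corpus-based methods can be sketched as follows (Python/NumPy). The descriptor vectors, unit length and hop size are hypothetical; a real system of the kind described by Schwarz adds continuity costs, unit transformations and much larger corpora:

    import numpy as np

    def mosaic(target_descr, corpus_descr, corpus_units, hop=512):
        # For each target frame descriptor, pick the nearest corpus unit in
        # descriptor space and overlap-add it with a Hann window.
        unit_len = len(corpus_units[0])
        window = np.hanning(unit_len)
        out = np.zeros(hop * len(target_descr) + unit_len)
        for i, d in enumerate(target_descr):
            k = int(np.argmin(np.linalg.norm(corpus_descr - d, axis=1)))
            out[i * hop:i * hop + unit_len] += window * corpus_units[k]
        return out

    # Hypothetical usage, with per-unit descriptors such as pitch, loudness, centroid:
    # phrase = mosaic(target_descr, corpus_descr, corpus_units)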

In sound processing, there are a large number of active research topics. Probably the most well established are audio compression and sound spatialisation, both of which have clear industrial contexts and quite well defined research agendas. Digital audio compression techniques allow the efficient storage and transmission of audio data, offering various trade-offs between computational complexity, compressed audio quality and compression ratio. With the widespread uptake of MP3, audio compression technology has spread to mainstream audio and is being incorporated into most sound devices [Mock, 2004]. These recent advances have resulted from an understanding of the human auditory system and the implementation of efficient algorithms in advanced DSP processors. Improvements to the state of the art will not be easy, but there is a trend towards trying to make use of our new understanding of human cognition and of the sound sources to be coded. Sound spatialisation effects attempt to widen the stereo image produced by two loudspeakers or stereo headphones, or to create the illusion of sound sources placed anywhere in three-dimensional space, including behind, above or below the listener. Some techniques, such as ambisonics, vector base amplitude panning and wave-field synthesis, are readily available, and new models are being worked on that combine signal-driven bottom-up processing with hypothesis-driven top-down processing [Blauert, 2005]. Auditory models and listening tests currently help us to understand the mechanisms of binaural hearing and exploit them in transcoding and spatialisation. Recent promising examples include the Binaural Cue Coding method [Faller, 2006] and Spatial Impulse Response Rendering [Pulkki and Merimaa, 2006].
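
As a small example of the spatialisation techniques mentioned, the sketch below computes amplitude gains for a loudspeaker pair in the spirit of two-dimensional vector base amplitude panning: the gain-weighted sum of the speaker direction vectors is made to point towards the virtual source, and the gains are normalised for constant power. This is a simplified reading of the published formulation, with illustrative angles:

    import numpy as np

    def vbap_2d(source_azimuth_deg, speaker_azimuths_deg):
        # Solve p = g1*l1 + g2*l2 for the gains g, where p is the source
        # direction and l1, l2 are the loudspeaker direction vectors, then
        # normalise for constant power.
        p = np.array([np.cos(np.radians(source_azimuth_deg)),
                      np.sin(np.radians(source_azimuth_deg))])
        L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))]
                      for a in speaker_azimuths_deg])      # one row per speaker
        g = p @ np.linalg.inv(L)
        return g / np.linalg.norm(g)

    # Virtual source at 10 degrees, speakers at -30 and +30 degrees:
    gains = vbap_2d(10.0, [-30.0, 30.0])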

Digital sound processing also includes techniques for audio post-production and other creative uses in music and multimedia applications [Zölzer, 2002]. Time and frequency domain techniques have been developed for transforming sounds in different ways. But the current trend is to move from signal processing to content processing; that is, to move towards higher levels of representation for describing and processing audio material. There is a strong trend towards the use of all these signal processing techniques in the general field of interactive sound design. Sound generation techniques have been integrated into various multimedia and entertainment applications (e.g., sound effects and background music for gaming), sound product design (ring tones for mobile phones) and interactive sound generation for virtual reality or other multimodal systems. Old sound synthesis technologies have been brought back to life and adapted to the needs of these new interactive situations. The importance of control has been emphasised, and source-centred and perception-centred modelling approaches have been expanded towards interactive sonification [Hermann & Ritter, 2005].

Sound Synthesis and Processing: Key Issues

Interaction-centred sound modelling: The interactive aspects of music and sound generation should be given greater weight in the design of future sound synthesis techniques. A challenge is how to make controllability and interactivity central design principles in sound modelling. It is widely believed that the main missing element in existing synthesis techniques is adequate control. The extraction of expressive content from human gestures, from haptics (e.g., pressure, impacts or friction-like interactions on tangible interfaces), from movement (motion capture and analysis) or voice (extraction of expressive content from the voice or breath of the performer), should become a focus of new research in sound generation. This will also open the field to multisensory and cross-modal interaction research. The next problem then concerns how to exploit the extracted contents in order to model sound. Effective sound generation needs to achieve a perceptually robust link between gesture and sound. The mapping problem is in this sense crucial both in musical instruments and in any other device/artefact involving sound as one of its interactive elements.
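
As a toy illustration of the mapping problem (not a proposal from the literature), the sketch below maps a single 'pressure' control stream onto two perceptually meaningful sound parameters, loudness and brightness, of a simple oscillator. The control stream is assumed to be sampled at audio rate and scaled to [0, 1]:

    import numpy as np

    def pressure_to_sound(pressure, sr=44100, base_hz=220.0):
        # Gesture-to-sound mapping: the pressure stream drives both loudness
        # and brightness (one-pole low-pass cutoff) of a naive sawtooth.
        n = len(pressure)
        t = np.arange(n) / sr
        raw = 2.0 * ((base_hz * t) % 1.0) - 1.0            # naive sawtooth
        out = np.zeros(n)
        state = 0.0
        for i in range(n):
            cutoff = 200.0 + 5000.0 * pressure[i]          # brighter with pressure
            a = 1.0 - np.exp(-2.0 * np.pi * cutoff / sr)   # one-pole coefficient
            state += a * (raw[i] - state)                  # low-pass filter
            out[i] = pressure[i] * state                   # louder with pressure
        return out

    # e.g. a one-second swelling gesture:
    # sound = pressure_to_sound(np.linspace(0.0, 1.0, 44100))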

Modular sound generation: Sound synthesis by physical modelling has, so far, mainly focused on accurate reproduction of the behaviour of musical instruments. Some other efforts have been devoted to everyday sounds [Rocchesso et al., 2003; Rocchesso and Fontana, 2004; Peltola et al., 2007] or to the application of sophisticated numerical methods for solving wave propagation problems [Trautmann et al., 2005; Bilbao, 2007]. A classic dream is to be able to build or alter the structure of a musical instrument on the computer and listen to it before it is actually built. By generalizing this thought, the dream changes to the idea of having a toolkit for constructing sounding objects from elementary blocks such as waveguides, resonators and nonlinear functions [Rabenstein et al., 2007]. This goal has
faced a number of intrinsic limitations in block-based descriptions of musical instruments. In general, it is difficult to predict the sonic outcome of an untested connection of blocks. However, by associating macro-blocks with salient phenomena, it should be possible to devise a constructivist approach to sound modelling. At the lowest level, blocks should correspond to fundamental interactions (impact, friction, air flow on edge, etc.). The sound quality of these blocks should be tunable, based on properties of both the interaction (e.g., pressure, force) and the interactants (e.g., size and material of resonating object). Higher-level, articulated phenomena should be modelled on top of lower-level blocks according to characteristic dynamic evolutions (e.g., bouncing, breaking). This higher level of sound modelling is suitable for tight coupling with emerging computer animation and haptic rendering techniques, as its time scale is compatible with the scale of visual motion and gestural/tactile manipulation. In this way, sound synthesis can become part of a more general constructivist, physics-based approach to multisensory interaction and display.
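
An elementary building block of the kind envisaged here is illustrated below: a Karplus-Strong plucked string, the textbook member of the digital waveguide family, in which a delay line sets the pitch and a two-point average acts as the loss filter (parameter values are illustrative). Higher-level blocks such as bouncing or breaking would then sequence and modulate primitives of this sort over time:

    import numpy as np

    def plucked_string(freq_hz=220.0, duration=1.0, sr=44100, damping=0.996):
        # Karplus-Strong string: a noise burst recirculates through a delay
        # line of 'period' samples; averaging two adjacent delayed samples
        # acts as a gentle low-pass loss filter.
        n = int(duration * sr)
        period = int(round(sr / freq_hz))
        out = np.zeros(n)
        out[:period] = np.random.uniform(-1.0, 1.0, period)   # excitation
        for i in range(period, n):
            out[i] = damping * 0.5 * (out[i - period] + out[i - period + 1])
        return out

    note = plucked_string(110.0)   # one second of a low plucked string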

Physical modelling based on data analysis: To date, physical models of sound and voice have been appreciated for their desirable properties in terms of synthesis, control and expressiveness. However, it is also widely recognised that they are very difficult to fit onto real observed data due to the high number of parameters involved, the fact that control parameters are not related to the produced sound signal in an intuitive way and, in some cases, the radical non-linearities in the numerical schemes. All these issues make the parametric identification of physics-based models a formidable problem. Future research in physical voice and sound modelling should thus take into account the importance of models fitting real data, in terms of both system structure design and parametric identification. Co-design of numerical structures and identification procedures may also be a possible path to complexity reduction. It is also desirable that from the audio-based physical modelling paradigm, new model structures emerge which will be general enough to capture the main sound features of broad families of sounds (e.g. sustained tones from wind and string instruments, percussive sounds) and to be trained to reproduce the peculiarities of a given instrument from recorded data.
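
A minimal example of what fitting a physical model to data can mean in practice is sketched below: the amplitude, frequency, decay and phase of a single exponentially damped mode (the simplest modal description of a struck resonator) are identified from an observed signal by non-linear least squares (Python with SciPy). Here the 'recorded' signal is synthetic and the initial guess is assumed to come from, e.g., a spectral peak estimate; real identification problems are far less well behaved:

    import numpy as np
    from scipy.optimize import curve_fit

    def damped_mode(t, amplitude, freq_hz, decay, phase):
        # One exponentially decaying sinusoidal mode.
        return amplitude * np.exp(-decay * t) * np.cos(2 * np.pi * freq_hz * t + phase)

    sr = 44100
    t = np.arange(int(0.05 * sr)) / sr                      # 50 ms analysis window
    # Stand-in for a measured impulse response of a real resonating object.
    recorded = damped_mode(t, 0.8, 443.0, 6.0, 0.3) + 0.01 * np.random.randn(len(t))

    initial_guess = [1.0, 440.0, 5.0, 0.0]                  # e.g. from a spectral peak
    params, _ = curve_fit(damped_mode, t, recorded, p0=initial_guess)
    amplitude, freq_hz, decay, phase = params               # identified model parameters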

Audio content processing: Currently, a very active field of research is Auditory Scene Analysis [Bregman, 1990], which is conducted both from perceptual and computational points of view. This research is carried out mostly within the cognitive neuroscience community, but a multidisciplinary approach would allow its fundamental advances to be translated into many practical applications. For instance, as soon as robust results emerge from this field, it will be possible to approach (re)synthesis from a higher-level sound-object perspective, permitting us to identify, isolate, transform and recombine sound objects in a flexible way. Sound synthesis and manipulation using spectral models is based on features emerging from audio analysis; the use of auditory scene representations for sound manipulation and synthesis could likewise be based on sound objects captured from the analysis. This possibility offers great prospects for music, sound and media production. With the current work on audio content analysis, we can start identifying and processing higher-level elements in an audio signal. For example, by identifying the rhythm of a song, a time-stretching technique can become a rhythm-changing system, and by identifying chords, a pitch shifter might be able to transpose the key of the song.
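
The rhythm-changing idea mentioned above can already be prototyped with existing analysis tools. The example below is a sketch assuming the librosa library (exact API details vary between versions): it estimates the tempo of a recording and time-stretches it so that the estimated tempo matches a target value:

    import librosa

    def change_tempo(path, target_bpm):
        # Estimate the tempo of a recording and time-stretch it so that the
        # estimated tempo matches target_bpm: time stretching used as a
        # content-aware rhythm changer.
        y, sr = librosa.load(path)                          # mono, default sample rate
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # estimated tempo in BPM
        rate = float(target_bpm) / float(tempo)             # >1 speeds up, <1 slows down
        return librosa.effects.time_stretch(y, rate=rate), sr

    # stretched, sr = change_tempo("song.wav", target_bpm=120)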

References

J. Blauert. Communication Acoustics (Signals and Communication Technology). Springer, Berlin, Germany, July 2005.

P. R. Cook. Physically informed sonic modeling (PhISM): Synthesis of percussive sounds. Computer Music J., 21(3):38-49, 1997.

X. Serra. Musical sound modeling with sinusoids plus noise. In C. Roads, S. Pope, A. Piccialli, and G. De Poli, editors, Musical Signal Processing, pages 91-122. Swets & Zeitlinger Publishers, Lisse, the Netherlands, 1997.

J. O. Smith. Physical audio signal processing: for virtual musical instruments and digital audio effects. http://ccrma.stanford.edu/~jos/pasp/, 2006.

V. Välimäki, J. Pakarinen, C. Erkut, and M. Karjalainen. Discrete-time modelling of musical instruments. Rep. Prog. Phys., 69(1), January 2006.

T. Hermann and H. Ritter. Model-based sonification revisited - Authors' comments on Hermann and Ritter, ICAD 2002. ACM Trans. Appl. Percept., 2(4):559-563, October 2005.

U. Zölzer, editor. DAFX: Digital Audio Effects. John Wiley & Sons, May 2002.

A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, September 1990.

M.S. Gazzaniga. The New Cognitive Neurosciences. MIT Press, Cambridge, Mass., 2000.

D. Rocchesso and F. Fontana, editors. The Sounding Object. Edizioni di Mondo Estremo, 2003.

H.-G. Kim, N. Moreau, and T. Sikora. MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley & Sons, 2005.

K. D. Martin. Sound-source recognition: A theory and computational model. PhD thesis, MIT, 1999.

P. Cano. Content-based Audio Search: From Fingerprinting to Semantic Audio Retrieval. PhD thesis, Pompeu Fabra University, 2007.

D. Schwarz. Corpus-Based Concatenative Synthesis. IEEE Signal Processing Magazine, 24(2):92-104, 2007.

C. Cadoz, A. Luciani, and J.-L. Florens. CORDIS-ANIMA: A Modeling and Simulation System for Sound and Image Synthesis - The General Formalism. Computer Music Journal, 17(1):19-29, 1993.