journal article Open Access May 12, 2016

A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds

Applied Sciences Vol. 6 No. 5 pp. 143 · MDPI AG
DOI: 10.3390/app6050143
Abstract
Endowing machines with sensing capabilities similar to those of humans is a prevalent goal in engineering and computer science. In pursuit of computers that sense their surroundings, considerable effort has been devoted to enabling machines to acquire, process, analyze and understand their environment in a human-like way. For the sense of hearing, the ability of computers to perceive their acoustic environment as humans do is known as machine hearing. To achieve this ambitious aim, the representation of the audio signal is of paramount importance. In this paper, we present an up-to-date review of the most relevant audio feature extraction techniques developed to analyze the most common audio signals: speech, music and environmental sounds. Besides revisiting classic approaches for completeness, we include the latest advances in the field based on new domains of analysis, together with novel bio-inspired proposals. These approaches are described following a taxonomy that organizes them according to their physical or perceptual basis and then subdivides them by domain of computation (time, frequency, wavelet, image-based, cepstral, or other domains). The description of the approaches is accompanied by recent examples of their application to machine hearing problems.


Metrics
Citations: 180
References: 247
Details
Published: May 12, 2016
Vol/Issue: 6(5)
Pages: 143
Funding
European Commission Award: ENV/IT/001254
Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement (Generalitat de Catalunya) Award: 2014-SGR-0590
Cite This Article
Francesc Alías, Joan Socoró, Xavier Sevillano (2016). A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. https://doi.org/10.3390/app6050143