journal article Open Access Jan 09, 2024

Multimodal fusion for audio-image and video action recognition

DOI: 10.1007/s00521-023-09186-5
Abstract
Multimodal Human Action Recognition (MHAR) is an important research topic in the computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information from image representations of audio signals and spatial information from the video modality using Convolutional Neural Network (CNN)-based feature extractors, and we fuse these features to recognize the respective action classes. We apply a high-level weights assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n.
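The fusion idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' MAiVAR implementation: the feature vectors are random stand-ins for CNN outputs, the modality weights `w_audio` and `w_video` are fixed by hand rather than assigned by the paper's weights assignment algorithm, and the function names `fuse` and `classify` are hypothetical. It only shows the general pattern of weighting two modality feature vectors, concatenating them, and classifying the fused vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_feat, video_feat, w_audio=0.5, w_video=0.5):
    """Scale each modality's features by its weight, then concatenate.

    The weights are hand-picked assumptions for illustration only.
    """
    return [w_audio * a for a in audio_feat] + [w_video * v for v in video_feat]


def classify(fused, weights, biases):
    """Linear layer followed by softmax over the action classes."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


rng = random.Random(0)

# Stand-ins for CNN features (assumed sizes, not taken from the paper).
audio_feat = [rng.random() for _ in range(4)]  # from an audio-image representation
video_feat = [rng.random() for _ in range(6)]  # from video frames

num_classes = 3
fused = fuse(audio_feat, video_feat, w_audio=0.4, w_video=0.6)

# Randomly initialized classifier parameters; a real model would learn these.
weights = [[rng.uniform(-1, 1) for _ in fused] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classify(fused, weights, biases)
predicted = max(range(num_classes), key=probs.__getitem__)
print("class probabilities:", probs)
print("predicted class index:", predicted)
```

In a trained system the stand-in vectors would come from the audio-image and video feature extractors, and both the modality weights and the classifier parameters would be learned rather than set by hand.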


Metrics
Citations: 35
References: 70
Details
Published: Jan 09, 2024
Vol/Issue: 36(10)
Pages: 5499-5513
Funding
Higher Education Commission, Pakistan (Award: PM/HRDI-UESTPs/UETs-457 I/Phase-1/Batch-VI/2018)
Edith Cowan University
Office of National Intelligence (Award: NIPG-2021-001)
Cite This Article
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, et al. (2024). Multimodal fusion for audio-image and video action recognition. Neural Computing and Applications, 36(10), 5499-5513. https://doi.org/10.1007/s00521-023-09186-5