Multimodal fusion for audio-image and video action recognition

Muhammad Bilal Shaikh; Douglas Chai; Syed Mohammed Shamsul Islam; Naveed Akhtar

doi:10.1007/s00521-023-09186-5

journal article Open Access Jan 09, 2024

Neural Computing and Applications Vol. 36 No. 10 pp. 5499-5513 · Springer Science and Business Media LLC

Abstract

Multimodal Human Action Recognition (MHAR) is an important research topic in the computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information from image representations of audio signals and spatial information from the video modality using Convolutional Neural Network (CNN)-based feature extractors, and we fuse these features to recognize the respective action classes. We apply a high-level weights assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of the audio and video feature maps to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n.
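The feature-level fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' MAiVAR implementation: the function names, the feature dimensions, the fixed modality weights (0.4 and 0.6), and the random linear classifier are all hypothetical, and the paper's weights assignment algorithm is not reproduced here. The sketch only assumes that each modality has already been reduced to a feature vector by some CNN extractor, and shows one common way to weight, concatenate, and classify such vectors.

```python
import math
import random


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(audio_feat, video_feat, audio_weight=0.5, video_weight=0.5):
    """Feature-level fusion: normalize each modality, weight it, then concatenate."""
    audio = [audio_weight * x for x in l2_normalize(audio_feat)]
    video = [video_weight * x for x in l2_normalize(video_feat)]
    return audio + video


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear layer followed by softmax; `weights` holds one row per action class."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical stand-ins for CNN outputs: an 8-d audio-image feature
# and a 16-d video feature (real extractors would produce far larger vectors).
rng = random.Random(0)
audio_feat = [rng.gauss(0, 1) for _ in range(8)]
video_feat = [rng.gauss(0, 1) for _ in range(16)]
num_classes = 4  # assumed number of action classes for this toy example

# Assumed fixed modality weights; the paper assigns weights with its own algorithm.
fused = fuse_features(audio_feat, video_feat, audio_weight=0.4, video_weight=0.6)

# Untrained random classifier, used only to show the data flow.
weights = [[rng.gauss(0, 0.1) for _ in range(len(fused))] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classify(fused, weights, biases)
predicted_class = max(range(num_classes), key=probs.__getitem__)
```

Normalizing each modality before weighting keeps one feature vector from dominating the other purely because of its scale, so the weights alone control the relative influence of audio and video.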

Details

Published: Jan 09, 2024
Vol/Issue: 36(10)
Pages: 5499-5513

Authors

Muhammad Bilal Shaikh

Douglas Chai

Syed Mohammed Shamsul Islam

School of Science, Edith Cowan University, Joondalup 6027, Australia; School of Computer Science and Software Engineering, The University of Western Australia, Crawley 6009, Australia

Naveed Akhtar

Department of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA, Australia

Funding

Higher Education Commission, Pakistan Award: PM/HRDI-UESTPs/UETs-457 I/Phase-1/Batch-VI/2018

Edith Cowan University

Office of National Intelligence Award: NIPG-2021-001

Cite This Article

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, et al. (2024). Multimodal fusion for audio-image and video action recognition. Neural Computing and Applications, 36(10), 5499-5513. https://doi.org/10.1007/s00521-023-09186-5
