journal article Apr 11, 2026

A lightweight bidirectional GRU–DCNN hybrid framework for end-to-end automatic speech recognition

DOI: 10.1007/s00542-025-05973-3

Details
Published: Apr 11, 2026
Vol/Issue: 32(5)
Cite This Article
V. Karthikeyan, P. Saranya, M. Natchiyar (2026). A lightweight bidirectional GRU–DCNN hybrid framework for end-to-end automatic speech recognition. Microsystem Technologies, 32(5). https://doi.org/10.1007/s00542-025-05973-3