A lightweight bidirectional GRU–DCNN hybrid framework for end-to-end automatic speech recognition

V. Karthikeyan; P. Saranya; M. Natchiyar

doi:10.1007/s00542-025-05973-3

journal article Apr 11, 2026

Microsystem Technologies Vol. 32 No. 5 · Springer Science and Business Media LLC

Details

Published: Apr 11, 2026
Vol/Issue: 32(5)

Authors

V. Karthikeyan

Department of ECE, Mepco Schlenk Engineering College, Sivakasi, Tamil Nadu, India

Cite This Article

V. Karthikeyan, P. Saranya, M. Natchiyar (2026). A lightweight bidirectional GRU–DCNN hybrid framework for end-to-end automatic speech recognition. Microsystem Technologies, 32(5). https://doi.org/10.1007/s00542-025-05973-3
