On the interface between linguistics, computer science and psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts

J. V. Miranda e Silva; C. Rodrigues; E. Vital Brazil

doi:10.3389/frai.2026.1781552

Journal article · Open Access · Apr 08, 2026

Frontiers in Artificial Intelligence Vol. 9 · Frontiers Media SA

Abstract

Introduction
This paper investigates language impairments in schizophrenia (SZ) by analyzing the decision-making process of a transformer-based model in discriminating between texts produced by persons with SZ and persons without SZ. By doing so, we integrate insights from language-centered investigations with computational approaches. Using BERT-base-cased, we explore how linguistic markers of SZ can be identified through Natural Language Processing (NLP) techniques, with emphasis on improving performance reliability via dataset refinement and approaching interpretability of deep learning outputs via statistical analyses of thematic content.

Methods
We report the fine-tuning of a BERT model for text classification of 31,278 Reddit posts (15,639 SZ, 15,639 controls). The experiment evaluated the capacity of the model to distinguish language produced by individuals with SZ.

Results

The model achieved moderate performance (Accuracy = 0.6969; AUC = 0.78) and remained stable across hyperparameter configurations. This stability suggests that foundation models such as BERT fit the data readily and that further performance gains are therefore more likely to come from dataset refinement than from additional hyperparameter optimization. Three key factors affected the model's performance: text length, topic of discussion, and vocabulary choices. Correctly classified posts tended to be significantly longer (p < 0.001, M = 37.30), focused on specific topics (e.g., r/Christianity), and contained more words related to mental health conditions, particularly those semantically related to SZ.
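
As a reading aid for the two reported metrics, the following is a minimal sketch of how accuracy and AUC can be computed for a binary classifier. The labels, scores, 0.5 threshold, and function names are illustrative assumptions and are not taken from the paper; the AUC is computed through its rank interpretation, the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded score matches the label.

    The 0.5 threshold is an assumption made for this illustration.
    """
    preds = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(p == t for p, t in zip(preds, y_true))
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise version is quadratic in the number of examples and is
    meant for clarity, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = SZ post, 0 = control post.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

print(f"accuracy = {accuracy(labels, scores):.4f}")  # accuracy = 0.6667
print(f"auc      = {roc_auc(labels, scores):.4f}")   # auc      = 0.7778
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two values can differ as they do in the reported results.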

Discussion
These factors have also been reported in manual analyses of the impacts of SZ on language. The findings can inform more accurate computational models for linguistic classification tasks and underscore the value of carefully curated datasets, while demonstrating the viability of NLP methods for profiling SZ language.

Details

Published: Apr 08, 2026
Vol/Issue: 9

Authors

J. V. Miranda e Silva

Programa de Pós-Graduação em Estudos da Linguagem, PUC-Rio

C. Rodrigues

Programa de Pós-Graduação em Estudos da Linguagem, PUC-Rio; IMPA Tech

E. Vital Brazil

IMPA Tech; PUC-Behring Institute for Artificial Intelligence

Cite This Article

J. V. Miranda e Silva, C. Rodrigues, E. Vital Brazil (2026). On the interface between linguistics, computer science and psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts. Frontiers in Artificial Intelligence, 9. https://doi.org/10.3389/frai.2026.1781552
