On the interface between linguistics, computer science and psychiatry: analyzing textual key-factors affecting BERT-based classification of schizophrenia in social media texts
This paper investigates language impairments in schizophrenia (SZ) by analyzing how a transformer-based model discriminates between texts produced by persons with SZ and persons without SZ. In doing so, we integrate insights from language-centered investigations with computational approaches. Using BERT-base-cased, we explore how linguistic markers of SZ can be identified through Natural Language Processing (NLP) techniques, with emphasis on improving performance reliability through dataset refinement and on making deep learning outputs more interpretable through statistical analyses of thematic content.
Methods
We report the fine-tuning of a BERT model for binary text classification on 31,278 Reddit posts (15,639 SZ, 15,639 controls). The experiment evaluated the capacity of the model to distinguish language produced by individuals with SZ from language produced by controls.
Results
The model achieved moderate performance (Accuracy = 0.6969; AUC = 0.78) and remained stable across hyperparameter configurations. This suggests that foundation models such as BERT fit the data regardless of configuration, so further performance gains are more likely to come from dataset refinement than from additional hyperparameter optimization. Three key factors affected the model's performance: text length, topic of discussion and vocabulary choices. Posts that were correctly classified tended to be significantly longer (p < 0.001, M = 37.30), focused on specific topics (e.g., r/Christianity), and contained more words related to mental health conditions, particularly those semantically related to SZ.
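As an illustration of the quantities reported above, the following is a minimal sketch, not the authors' pipeline, of how accuracy, AUC and a length comparison between correctly and incorrectly classified posts could be computed. All data are invented toy values, the function names are hypothetical, and the permutation test is an assumption: the abstract does not state which statistical test produced the reported p-value, nor the unit of text length.

```python
import random
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random SZ post (label 1) receives a
    higher score than a random control post (label 0); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in group means.
    Assumed test for illustration only; not confirmed by the source."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy labels (1 = SZ, 0 = control) and toy model scores for the SZ class.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # threshold of 0.5 is an assumption

# Toy text lengths (unit assumed) for correctly vs. incorrectly classified posts.
correct_lengths = [41, 38, 45, 36, 40, 39]
incorrect_lengths = [22, 25, 19, 27, 24, 21]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
p_value = permutation_p_value(correct_lengths, incorrect_lengths)

print(f"accuracy={acc:.4f} auc={auc:.4f} p={p_value:.4f}")
```

The pairwise AUC definition is quadratic in the number of posts, so for a corpus of the size described here a rank-based implementation from an established library would be the more practical choice.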
Discussion
These factors have also been reported in manual analyses of how SZ affects language. The findings can help improve the accuracy of computational models for linguistic classification tasks and underscore the value of carefully curated datasets, while demonstrating the viability of NLP methods for profiling SZ language.
Erik Cambria, Bebo White
- Published: Apr 08, 2026
- Vol/Issue: 9