Enabling Early Health Care Intervention by Detecting Depression in Users of Web-Based Forums using Language Models: Longitudinal Analysis and Evaluation

Background: Major depressive disorder is a common mental disorder affecting 5% of adults worldwide. Early contact with health care services is critical for achieving accurate diagnosis and improving patient outcomes. Key symptoms of major depressive disorder (depression hereafter) such as cognitive distortions are observed in verbal communication, which can also manifest in the structure of written language. Thus, the automatic analysis of text outputs may provide opportunities for early intervention in settings where written communication is rich and regular, such as social media and web-based forums.
Objective: The objective of this study was 2-fold. We sought to gauge the effectiveness of different machine learning approaches to identify users of the mass web-based forum Reddit, who eventually disclose a diagnosis of depression. We then aimed to determine whether the time between a forum post and a depression diagnosis date was a relevant factor in performing this detection.
Methods: A total of 2 Reddit data sets containing posts belonging to users with and without a history of depression diagnosis were obtained. The intersection of these data sets provided users with an estimated date of depression diagnosis. This derived data set was used as an input for several machine learning classifiers, including transformer-based language models (LMs).
Results: Bidirectional Encoder Representations from Transformers (BERT) and MentalBERT transformer-based LMs proved the most effective in distinguishing forum users with a known depression diagnosis from those without. They each obtained a mean F1-score of 0.64 across the experimental setups used for binary classification. The results also suggested that the final 12 to 16 weeks (about 3-4 months) of posts before a depressed user’s estimated diagnosis date are the most indicative of their illness, with data before that period not helping the models detect more accurately. Furthermore, in the 4- to 8-week period before the user’s estimated diagnosis date, their posts exhibited more negative sentiment than any other 4-week period in their post history.
Conclusions: Transformer-based LMs may be used on data from web-based social media forums to identify users at risk for psychiatric conditions such as depression. Language features picked up by these classifiers might predate depression onset by weeks to months, enabling proactive mental health care interventions to support those at risk for this condition.


Background
Major depressive disorder (MDD) is one of the most prevalent mental illnesses worldwide, affecting nearly 5% of adults [1]. Depressive episodes, which are symptoms of MDD and other psychiatric conditions, are even more common, with nearly 30% of individuals developing them at least once in their lifetime [2]. The characteristics of MDD and depressive episodes ("depression" hereafter) include low mood, feelings of worthlessness or guilt, and recurrent thoughts of death [3]. Early intervention has been reported to significantly improve patient outcomes and reduce the financial burden on health care services [4]. However, the stigma associated with psychiatric conditions, such as depression, leads to patients underreporting to health care services [5,6].
Given that a number of individuals who would normally meet the criteria for depression underreport to health care services, consideration should be given to how key symptoms may manifest in written language on social media platforms [7]. Longhand discussion websites such as Reddit are a rich source of such information where users may publish a series of posts spanning many months or years [8]. Natural language processing (NLP) can be used to identify features in posts that are predictive of a user who may have depression. Crucially, if affected users are identified before formal diagnosis, this may provide an opportunity for early health care intervention in these cases.
In this study, we derive a specialized subset of an annotated data set that contains Reddit posts belonging to users who have received a diagnosis of depression. This subset allowed us to consider posts before each user's approximate diagnosis date.
We used state-of-the-art, domain-specific language models (LMs) to assist in the detection of depression. These LMs outperformed the baseline approaches in various experimental settings. Notably, they are adept at early detection of depression. Moreover, we provide a thorough analysis of the temporal aspect of preemptive detection, offering insights into how long before diagnosis the symptoms of depression materialize in users' posts. Finally, we investigated the role of sentiment in depressed users' posts and provided a qualitative analysis based on the model performance.

Related Work
There is a growing body of literature on the use of NLP techniques to analyze depression patterns on social media [9,10].
Yates et al [11] developed an approach to distinguish forum users who self-reported a diagnosis of depression from those who did not. It used a convolutional neural network to aggregate user posts in a purpose-built data set, the Reddit Self-reported Depression Diagnosis (RSDD) data set. Their follow-up work involved the conception of a sister data set, RSDD-Time [12], which contained Reddit posts where users declared a past diagnosis of depression, and this diagnosis was linked to an estimated date. Dates were inferred from explicit but often imprecise time expressions in user posts. However, these works did not consider the preemptive detection of depression among Reddit users in their data sets. That is, they did not consider methods for detecting depression in users before their diagnoses.
Recent NLP studies have explicitly focused on the early detection of depression. Preemptive detection of mentions of depression among Twitter users has been demonstrated with a degree of success by Owen et al [13]. Abed-Esfahani [14] reported similar findings using Reddit data. However, both studies were limited by the uncertainty of whether the users referring to this condition were formally diagnosed. Shah et al [15] also considered approaches for the early detection of depression in Reddit users. In this case, it was determined whether the user had received a physician's diagnosis. However, it was not certain whether the users' posts occurred before or after their diagnoses because the dates of the diagnoses were unknown. To gauge the effectiveness of the preemptive detection methods, a series of user posts before a known diagnosis date is required. Eichstaedt et al [16] examined the language in Facebook posts that may have been predictive of depression, as shown in patients' medical records. They achieved an F1-score of 0.66 via logistic regression modeling, which used only the language preceding each patient's depression diagnosis.
Therefore, this study also sought to extend existing work on preemptive depression detection. We considered social media users whose depression diagnosis date is known and used LMs to harness the language of user posts.
Ren et al [17] performed emotion-driven detection of depression using Reddit, achieving F1-scores exceeding 0.9. Their work considered individual depression posts, rather than a series of posts. Nevertheless, their effective use of emotional semantic information suggested that the dissection of our own results could be enhanced using sentiment analysis, which we included in our analysis to provide further insights.

Objectives
We sought to gauge the performance of several machine learning classifiers in the task of distinguishing between RSDD data set users reporting and not reporting a diagnosis of depression, whom we term "depressed" and "controls," respectively, from here onward. We then used the best-performing classifier in a temporally driven binary classification task. The purpose was to determine which portion of a depressed user's post timeline was the most indicative of their illness. To do this, we considered only the posts authored before the depressed users' estimated diagnosis dates. Moreover, we considered only posts published up to 6 months before those dates.
The motivation for considering this 6-month time range hails from Winkour et al [18], and their observation that over 50% of patients with depression experienced their first onset at least 6 months before their formal diagnosis. Reece et al [19] made similar observations when examining Twitter users.
The time during which individuals with symptoms or traits of depression remain undiagnosed poses serious health risks. Patients who remain undiagnosed and thus untreated experience a worse outcome than would be the case if they were treated [20], particularly after their first episode [21]. Studies assessing suitable time points for health care interventions are needed to identify ways to improve patient outcomes. They are also likely to advance the field of psychiatric therapeutics by supporting modifications to clinical guidelines or the design of randomized controlled trials [22]. A larger body of evidence on this matter could also help identify patients to be targeted for more thorough mental health assessments and provided with further resources, support, and treatment [23].

Overview
Our work is based on the RSDD and RSDD-Time data sets [24].

Deriving RSDD-Matched
We used this information to estimate the diagnosis dates of the 529 users present in both RSDD and RSDD-Time. Those with recency annotations of 0 or 1 were ignored because their diagnosis dates could not be estimated with any degree of accuracy. For each of the remaining users, we determined whether the estimated diagnosis date fell between the date of their first RSDD post and the date of their RSDD-Time diagnosis post. A total of 72 depressed users remained in the study.
A total of 10 matching control users were sought for each of the 72 depressed users. To accomplish this, candidate control users were randomly retrieved from the RSDD and analyzed sequentially. Only the candidate's posts dated before the corresponding depressed user's estimated diagnosis date were considered. If the number of these posts differed by no more than 15% from the number of posts of the depressed user, the candidate was considered a match. A control user matched in this manner was not considered a candidate for subsequent depressed users.
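The matching procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name, the data layout, the fixed random seed, and the decision to release partial matches when fewer than 10 controls are found are all hypothetical.

```python
import random

def match_controls(depressed, candidates, per_user=10, tolerance=0.15, seed=0):
    """Greedy control matching (illustrative sketch).

    depressed:  list of (user_id, diagnosis_date, n_posts_before_diagnosis)
    candidates: dict mapping control_id -> list of post dates
    Dates may be any mutually comparable values (eg, datetime.date or int).
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)              # candidates are retrieved in random order
    used = set()                   # controls already matched to a depressed user
    matches = {}
    for user_id, diagnosis_date, n_posts in depressed:
        found = []
        for cid in pool:           # candidates are analyzed sequentially
            if cid in used:
                continue
            # Count only the candidate's posts dated before the diagnosis date.
            n_cand = sum(1 for d in candidates[cid] if d < diagnosis_date)
            if abs(n_cand - n_posts) <= tolerance * n_posts:
                found.append(cid)
                used.add(cid)
                if len(found) == per_user:
                    break
        if len(found) == per_user:
            matches[user_id] = found
        else:
            # Assumption: a depressed user without enough matches is dropped
            # and the partial matches are returned to the pool.
            used.difference_update(found)
    return matches
```

Because the matching is greedy, the result depends on the order in which depressed users and candidates are processed; the text does not state how this order was fixed.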
Because sufficient matching control users could not be found for 2 of the depressed users, they were excluded from the resulting data set. The data set contained 70 depressed users, each of whom had 10 matching control users. Thus, there were a total of 770 users. The posts were published between April 2006 and June 2016. We named our data set RSDD-Matched. The characteristics of RSDD-Matched are shown in Table 1. Statistics pertaining to individual users in RSDD-Matched can be found in Multimedia Appendix 1.
Because RSDD does not include posts made in mental health subreddits, a depressed user's diagnosis is certain to not be revealed until the time of their diagnosis post. There is language indicative of mental health conversation in the other subreddits.

Descriptive Analysis of RSDD
To better understand our data set, we performed a simple descriptive analysis of RSDD. Word-level exploratory analyses of corpora have been extensively used in corpus linguistics and NLP to gain insight into word prominence. Typically, these follow a bag-of-words [25], pointwise mutual information [26], or term frequency-inverse document frequency (TF-IDF) [27] approach. In our case, we used lexical specificity [28], which is a statistical measure based on hypergeometric distribution, to identify the most prominent words in a corpus. We chose to use lexical specificity because it is structured in a way that is ideal for extracting corpus-specific vocabulary given a global corpus (RSDD) and its subsets (depressed and control users) [29]. It is also a more robust metric for term importance when dealing with different lengths of text [30], which is often the case for Reddit posts.
RSDD is partitioned into 2 subsets, or subcorpora, one containing posts of depressed users, and another containing posts of the control users. After lemmatizing the corpus, lexical specificity analysis revealed the unigrams (single words) that were the most frequently used by depressed and control participants (Table 2). The score column indicates the relevance of a unigram to each subset. For reference, the term "woman" makes up 0.18% (460,893/257,873,124) of the total words that appear in the depressed user subset compared with only 0.06% (569,330/950,988,726) of the control user subset.
To put the results into context, we should mention that a lexical specificity score of X for a given word W with frequency f means that the probability of W occurring at least f times in the subcorpus is lower than 10^-X (assuming a random distribution). For instance, a lexical specificity score of 42,234 for "game" means that the probability of "game" having a frequency of f=5,373,938 or higher in the control users subcorpus is 10^-42,234 (ie, an exceptionally low probability which means "game" is overrepresented in the control users' subset). In general, we can observe a pattern in which depressed users tend to use more relationship or family-related words (eg, "woman" or "relationship") and words related to the depression symptoms themselves (eg, "life"). In contrast, control users seem to use more mundane terms related to the subreddit communities, such as game-related terms (eg, "game" or "team"). Although this analysis is based only on the statistical frequency of the terms used, it may provide further evidence that developing automatic methods to identify users with depression may indeed be feasible. In the Results section, we extend this initial inspection to better understand the errors made by the automatic models.
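To make the definition of the score concrete, the following sketch computes the negative base-10 logarithm of the hypergeometric tail probability P(X >= f). It is an illustration of the definition given above, not the implementation used in the study; the function and parameter names are hypothetical, and the log-space summation is one of several ways to avoid numerical underflow.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def lexical_specificity(corpus_size, sub_size, corpus_freq, sub_freq):
    """Return -log10 P(X >= sub_freq) for a hypergeometric X.

    corpus_size: total number of words in the global corpus
    sub_size:    total number of words in the subcorpus
    corpus_freq: frequency of the word in the global corpus
    sub_freq:    frequency f of the word in the subcorpus
    """
    lo = max(sub_freq, 0, sub_size - (corpus_size - corpus_freq))
    hi = min(corpus_freq, sub_size)
    if lo > hi:
        return math.inf  # the observed frequency is impossible under the model
    log_total = log_comb(corpus_size, sub_size)
    log_terms = [
        log_comb(corpus_freq, k)
        + log_comb(corpus_size - corpus_freq, sub_size - k)
        - log_total
        for k in range(lo, hi + 1)
    ]
    # Log-sum-exp keeps very small probabilities representable.
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(x - peak) for x in log_terms))
    return max(0.0, -log_p / math.log(10))
```

For a toy corpus of 10 words in which a word occurs 5 times, finding all 5 occurrences in a 5-word subcorpus has probability 1/252, so the score is log10(252), or about 2.4. For corpora as large as RSDD, the loop over the tail would be long and floating-point error in the log-gamma differences would need attention, so this sketch is suited to small examples only.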

Methodology
In this section, we provide more details of our proposed methods for tackling the depression detection task. Framing the task as a machine learning problem, we considered 9 methods based on linear classifiers and more recent LMs.
The initial baselines entailed a support vector machine (SVM) architecture. SVM is an algorithm that learns by example to assign labels to objects [31]. In our case, the objects are Reddit users, and permissible labels are "depressed" and "control." SVMs have demonstrated effectiveness in the detection of depression-related posts in Reddit [8,32]. Our SVM configurations used different features derived from user posts. These features included TF-IDF, word embeddings, and a combination of both TF-IDF and word embeddings. The TF-IDF [33] features represent the words deemed most notable among the user posts. A word embedding is a real-valued vector representation of a word [34]. Words with similar meanings have vectors with similar values.
The SVM model used was that of scikit-learn [35], as was the TF-IDF vectorizer implementation. The word embeddings generated for each Reddit post were drawn from global vectors trained on Wikipedia and Gigaword data [36]. These vectors had a dimensionality of 300, as did the average embedding generated from them. We preprocessed the text of the Reddit posts before inputting it to the SVM. All posts underwent quotation normalization so that each quotation character was represented by a single apostrophe. All new line and carriage return characters were replaced with spaces so that each post was represented as a single-line string. The posts were then concatenated on a per-user basis so that each user's posting history was represented as a single-line string. The SVM used a linear kernel, which is appropriate for text-classification problems [37][38][39].
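The preprocessing steps above can be sketched in a few lines. This is an illustrative sketch only: the set of characters treated as quotation characters and the use of a single space as the separator between concatenated posts are assumptions, as the text does not specify either.

```python
# Assumed set of quotation characters: straight double quote, backtick,
# and the curly single and double quotes.
QUOTE_CHARS = "\"`\u2018\u2019\u201c\u201d"

_TRANSLATION = {ord(ch): "'" for ch in QUOTE_CHARS}
_TRANSLATION.update({ord("\n"): " ", ord("\r"): " "})

def normalize_post(text):
    """Map quotation characters to an apostrophe and line breaks to spaces."""
    return text.translate(_TRANSLATION)

def user_document(posts):
    """Concatenate a user's normalized posts into one single-line string."""
    return " ".join(normalize_post(post) for post in posts)
```

The resulting per-user strings are the kind of input that a TF-IDF vectorizer or an embedding-averaging step would then consume.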
The remaining 6 classifiers were transformer-based LMs. LMs are a statistical means of predicting words [40], whereas transformers provide a neural-network-based approach to generating such models [41]. Transformer-based LMs have proven effective in detecting psychiatric illness-related Reddit posts [12,42,43]. Therefore, we chose to use transformer-based LMs to support the detection of depression in RSDD-Matched. We chose Bidirectional Encoder Representations from Transformers (BERT) [44] and A Lite BERT (ALBERT) [45], which are appropriate for a wide variety of applications. We also chose 4 specialist LMs: BioBERT [46], Longformer [47], MentalBERT [48], and MentalRoBERTa [48]. BioBERT is suitable for use where biomedical concepts are prevalent, such as electronic medical records [49], patient descriptions [50], and health-related Twitter posts [51]. Longformer is designed for use when text is formed from long documents. Indeed, there were posts in RSDD-Matched that exceed 2000 words. Finally, MentalBERT and MentalRoBERTa are customized for the domain of mental health care and trained using text drawn from mental health discussion forums.
All 6 transformer-based LMs were pretrained bidirectional language representations. This means that for any given word in a text segment, its neighboring words to both the left and right are examined so that the context of the word is well understood. These representations lend themselves to high performance in text classification tasks when compared with traditional approaches using SVMs, for example [52,53].
We used the Simple Transformers software library [54] to deploy LMs. The library provides an application programming interface to the Transformers library, which itself provides access to the BERT, ALBERT, BioBERT, Longformer, MentalBERT, and MentalRoBERTa models [55]. The BERT, ALBERT, BioBERT, Longformer, MentalBERT, and MentalRoBERTa classifiers used were "bert-base-uncased," "albert-base-v1," "biobert-base-cased-v1.1," "longformer-base-4096," "mental-bert-base-uncased," and "mental-roberta-base," respectively. The LM classifiers were instantiated with the default hyperparameters of Simple Transformers and with the sliding window enabled. Transformer-based LMs may consume only a limited number of tokens (512 tokens). Because the posting histories of most users in RSDD-Matched exceed 512 words, a specialist approach to applying LMs to these posts is needed. A sliding window is one such approach [56].
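The idea behind a sliding window can be illustrated independently of any library. The sketch below is not the Simple Transformers implementation: the stride value, the omission of special tokens, and the majority vote used to combine window-level predictions into one user-level label are all assumptions made for illustration.

```python
from collections import Counter

def sliding_windows(tokens, max_len=512, stride=410):
    """Split a token sequence into overlapping windows of at most max_len.

    stride is the number of tokens by which each window start advances;
    the default here is an assumed value, not one taken from the study.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the text
        start += stride
    return windows

def majority_vote(window_predictions):
    """Combine per-window labels into one label (assumed aggregation rule)."""
    return Counter(window_predictions).most_common(1)[0][0]
```

With this scheme, every token of a long posting history is seen by the classifier at least once, at the cost of one forward pass per window.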

Preemptive Depression Identification Experiment
The first experiment examined the performance of several machine learning classifiers in the task of distinguishing between depressed and control users in RSDD-Matched. The purpose of this experiment was to understand the extent to which the preemptive detection of depression in social media is possible. Moreover, this experiment was aimed at understanding the capabilities of machine learning classifiers for this task and the suitability of different methods in the task. The results were used to provide a competitive model for subsequent fine-grained temporal experiments.
We used 9 different classifiers. Three entailed an SVM, as described in the Methodology section. The remaining 6 were BERT, ALBERT, BioBERT, Longformer, MentalBERT, and MentalRoBERTa, which are also described in the Methodology section.
In addition to the aforementioned classifiers, we included a naive baseline that predicted positive instances in all cases.
Because the number of positive instances (ie, depressed users) in RSDD-Matched was small, we chose not to use a traditional train-test split. Instead, we used 5-fold cross-validation, an approach also used by Eichstaedt et al [16]. Furthermore, we varied the number of matching control users across the 4 iterations of the experiment (Table 3).
The purpose of these variations is to test the performance of classifiers against increasingly imbalanced data sets. This mimics the conditions likely to be observed in web-based forums where the number of positive instances (ie, depressed users) is dwarfed by the number of negative instances (ie, nondepressed users).
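A minimal sketch of how users might be assigned to 5 folds while preserving the class ratio in each fold is given below. Whether the folds in the study were stratified is not stated, so stratification, the fixed seed, and the function name are assumptions made for illustration.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign item indices to k folds, spreading each class evenly.

    labels: list of class labels (eg, 1 for depressed, 0 for control)
    Returns a list of k lists of indices; fold i serves as the test set
    in iteration i, and the remaining folds form the training set.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

For example, with 70 depressed users and 3 controls per depressed user (210 controls), each fold would contain 14 depressed users and 42 controls.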

Temporal Experiment
The purpose of the second primary experiment was to determine which posting period in a depressed user's post timeline was the most indicative of depression. This involved the use of a subset of RSDD-Matched users. We measured the performance of binary classifiers on temporal subsets of the posts made in the 6 months before the users' estimated diagnosis dates.
The RSDD-Matched subset contained only depressed users who had at least one post in the 2 weeks before their estimated diagnosis date. Of the 70 depressed users in RSDD-Matched, 14 did not have any posts in this 2-week period. Consequently, we used only 56 depressed users in the temporal experiment. Furthermore, not all 10 control users matched with each of the 56 depressed users were usable because some did not have at least one post in this 2-week period. Thus, we performed additional random exclusions of controls to rebalance the data set. After these exclusions, the data set used in the temporal experiment contained 56 depressed users, each of whom had 3 matching control users, totaling 224 users.
The results of the preemptive depression identification experiment were used to partially inform the design of the temporal experiment. Because BERT scored the highest average F1-score across all runs of the preemptive depression identification experiment, it was decided that this was the sole general-purpose transformer-based LM to be used in the temporal experiment. Likewise, MentalBERT had the highest average F1-score; therefore, it was selected as the sole specialist LM. The 3 variations of the SVM classifier used in the preemptive depression-identification experiment were used once again.
Once again, we used 5-fold cross-validation. Two chief variations of the RSDD-Matched subset and several different temporal configurations were used (Table 4).
The 2 chief strands to our experimental setup are summarized in Figure 1.
We complemented the temporal experiment with sentiment analysis. The purpose of this analysis was to identify whether there is a link between sentiment and depression with respect to user posts. Text sentiment has been extensively used as a predictor for detecting signs of depressive mood in microblog users [57][58][59]. Specifically, negatively charged text has often been correlated with depression via expressions of low mood and suicidal ideation [60]. Approaches used to extract sentiment from social media posts include the use of LMs [61] and lexicons such as Valence Aware Dictionary and Sentiment Reasoner (VADER) [62].
To determine whether there is a relationship between sentiment and depression, we used BERTweet-sentiment, a state-of-the-art transformer model, to classify each post in RSDD-Matched as either negative, neutral, or positive. BERTweet-sentiment is based on the BERTweet [63] implementation, which is trained on a large Twitter corpus and fine-tuned for sentiment analysis. Although the model is not trained on Reddit data, we believe that there are enough overlapping lexical characteristics between the 2 domains in terms of internet slang and text lengths that justify its use.
Our sentiment analysis focused on changes in the sentiment distribution of depressed and control users over time. In step with the design of our temporal experiment, each user's posts are divided into 6 temporal bands, namely 0-4, 4-8, 8-12, 12-16, 16-20, and 20-24 weeks before their estimated diagnosis date (for a control user, this is the estimated diagnosis of its matched depressed user). The average percentage of each sentiment in each band was considered.
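The banding step can be sketched as follows for a single user. The treatment of band boundaries (a post made exactly 28 days before the diagnosis date is placed in the 0-4 week band) and the exclusion of posts made on the diagnosis date itself are assumptions; the names are hypothetical.

```python
from collections import Counter
from datetime import date

BAND_DAYS = 4 * 7   # each band spans 4 weeks
N_BANDS = 6         # 0-4, 4-8, ..., 20-24 weeks before diagnosis
LABELS = ("negative", "neutral", "positive")

def band_index(post_date, diagnosis_date):
    """Return 0 for the 0-4 week band, 1 for 4-8, and so on; None if outside."""
    days_before = (diagnosis_date - post_date).days
    if days_before < 1 or days_before > BAND_DAYS * N_BANDS:
        return None
    return (days_before - 1) // BAND_DAYS

def sentiment_percentages(posts, diagnosis_date):
    """posts: iterable of (post_date, sentiment_label) for one user.

    Returns {band: {label: percentage}} for bands with at least one post.
    """
    counts = {band: Counter() for band in range(N_BANDS)}
    for post_date, label in posts:
        band = band_index(post_date, diagnosis_date)
        if band is not None:
            counts[band][label] += 1
    result = {}
    for band, counter in counts.items():
        total = sum(counter.values())
        if total:
            result[band] = {lab: 100.0 * counter[lab] / total for lab in LABELS}
    return result
```

Averaging these per-user percentages within each band, separately for depressed and control users, would give the kind of band-level summary described above.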
To establish whether the diagnosis was associated with the sentiment of a post, 2 regression models were used. The first was based on the lme4 framework [64], and the second on mgcv [65]. The implementations used were those of the R (version 4.0.2) statistical environment [66]. We set our outcome variable to be whether a post is "sentimental" (that is, either negative or positive) or not (neutral), and a logistic mixed effects regression was fitted using all the available posts with the individual user identifier as a random effect term. As fixed effects, we used the estimated depression diagnosis (ie, either depressed or control), the time to estimated diagnosis in weeks, the post's word count, and the interaction term of estimated diagnosis with time.
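For reference, the logistic mixed effects model described above can be written out as follows. The notation is introduced here for illustration and is not taken from the analysis code; a random intercept per user is assumed, because the text specifies only that the user identifier was a random effect term.

```latex
\operatorname{logit}\,\Pr(y_{ij} = 1)
  = \beta_0 + \beta_1 D_j + \beta_2 t_{ij} + \beta_3 w_{ij} + \beta_4 D_j t_{ij} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Here y_{ij} indicates whether post i of user j is sentimental (negative or positive) rather than neutral, D_j is 1 for a depressed user and 0 for a control, t_{ij} is the time to the estimated diagnosis in weeks, w_{ij} is the word count of the post, and u_j is the user-level random intercept. The coefficient \beta_4 captures how the effect of time differs between depressed and control users.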
Having sought to establish whether the diagnosis of the user was associated with the sentimentality inferred for each post, we also considered a more fine-grained multinomial regression model. This is equivalent to fitting a series of logistic models against a reference category [67] and is similar to the "stacked" designs used in other disciplines [68]. For our purposes, we will consider "neutral" as the reference category of our multinomial outcome, so all effect sizes will indicate the probability of a post being positive or negative instead of neutral.

Preemptive Depression Identification Experiment
The results of the preemptive depression identification experiment are presented in Tables 5-8. Each table shows a variation in the number of matched control users. Positive predictive value, sensitivity, and F1-score were used to measure the performance in each variation. The positive predictive value denotes the proportion of users classified as depressed who were indeed depressed. Sensitivity denotes the proportion of depressed users who were correctly classified as depressed. The F1-score, which is the harmonic mean of the positive predictive value and sensitivity, is suitable for use with data sets such as ours, where the class distribution (of depressed and controls) is uneven [69]. In contrast, accuracy is not suitable for such data sets [70]. Therefore, we used F1-score as the primary performance metric.
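The 3 metrics can be computed directly from the true and predicted labels, as in the sketch below. The function name is hypothetical, and the control-to-depressed ratios in the usage example are illustrative settings rather than results reported in the tables.

```python
def ppv_sensitivity_f1(y_true, y_pred):
    """Return (positive predictive value, sensitivity, F1-score).

    Labels are 1 for depressed and 0 for control. A metric whose
    denominator is zero is reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denominator if denominator else 0.0
    return ppv, sensitivity, f1
```

As a usage example, a naive classifier that labels every user as depressed has a sensitivity of 1. With 70 depressed users and 1 control per depressed user, its positive predictive value is 0.5 and its F1-score is 2/3; with 10 controls per depressed user, the positive predictive value falls to 1/11 and the F1-score to 1/6. This shows how strongly the baseline depends on class balance.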
Using F1-score as a primary performance indicator, MentalBERT performs best across the variations.
A detailed breakdown of the results of the preemptive depression identification experiment can be found in Multimedia Appendix 1.
Word embeddings (vector representations) result in strong sensitivity (recall), whereas TF-IDF features cause deficient performance. The positive predictive value (precision) was best observed when using the specialist LM, MentalBERT. The best F1-score was also achieved by MentalBERT and exceeded the naive baseline.
We now consider the selected users from RSDD-Matched and the performance of the classifiers against them. We will examine one misclassified user per variation in the experiment (in terms of depressed users and the number of matched controls). For each variation, we will examine the strongest performing classifier and the user that it misclassified with the highest probability.
To identify the potential reasons for the misclassifications, we examined the lexical properties of user posts using 3 approaches. The first approach involves ascertaining the chief topic conveyed by the posts, a topic represented by 5 words. Topic modeling via latent Dirichlet allocation was used to accomplish this [71,72]. The second approach examines the chief TF-IDF features of the user posts. The third approach is to count the frequencies of depressed and control vocabularies ( Table 2) that appear across the posts.
We present the misclassified depressed users with respect to each variation in the experiment (Table 9). We also present the misclassified control users with respect to each variation (Table 10).
One depressed user was often misclassified. User d13 was deemed a control user by 3 different classifiers across 3 different variations. Although depressed vocabulary counts slightly outweigh their control counterparts, the totals for both vocabularies were nominal. The topic of the user's posts is probably more indicative of the reasons for the misclassification. Certainly, a theme concerning death or dying appears to be present, but this is diluted by optimistic-sounding references of a temporal and geographic nature. Further diluting references are revealed among the TF-IDF features, where strong terms such as "love" are present. It seems that the classifiers construe such references as those belonging to a control user.
User d38 may have been misclassified for similar reasons. Counts for both depressed and control vocabularies were small.
Positive terms, such as "welcome" and "invite" might be deemed to belong to a control user.
An inferior performance was observed across the classifiers in the most imbalanced environment. We examine depressed user d57, who was misclassified with a probability close to certainty. The depressed vocabulary count dwarfs the control vocabulary count. However, when making its decision, the classifier seems to harness the overarching nature of the user's posts, as indicated by the topic model and TF-IDF features. The prevalence of "good" natured posts will inevitably see the user deemed similar to a control user when represented in a vector space.
We now consider misclassified control users with respect to each variation in the experiment (Table 10).
Certain users appear to be confounding across several different classifiers and variations. User c13 was strongly misclassified as a depressed user by both MentalBERT and MentalRoBERTa in the relatively noisy environments of 3 and 5 matched control users, respectively (Table 10). The depressed vocabulary counts far outweigh the control vocabulary counts for this user. In addition, the theological topic and TF-IDF features of the user's posts are deemed likely to be those of a depressed user, according to the classifier.
MentalBERT demonstrated adeptness in the most balanced variation in the experiment. We sought possible explanations for the misclassification of user c521. The control vocabulary count slightly outweighed that of depressed vocabulary. Moreover, the topic model and TF-IDF features are composed of terms that complement the control vocabulary. Intuitive reasons for misclassification as depressed are difficult to cite. Therefore, it is possible that, in a balanced environment, the classifier simply has too few control users to compare with depressed users.
In the noisiest environment, the simpler word-based model (SVM using word embeddings) demonstrated the strongest performance. Transformer-based language modeling cannot be performed. The vocabulary of the most strongly misclassified user in this case (c535) only offers a tenuous explanation. The count of depressed vocabulary was small, although it outweighed that of the control vocabulary. However, the topic and TF-IDF terms appeared to complement the depressed vocabulary, which may have been the cause of the misclassification.

Temporal Experiment
We then performed a temporal experiment. Because BERT achieved the highest average F1-score across all preemptive depression identification experiment variations, it was selected as the sole general-purpose LM here. For the same reason, MentalBERT was selected as the sole specialist LM. The results are presented in Tables 11 and 12. Each table shows a variation in the number of matched control users. The average performance of each LM across the 2 variations is shown in Figure 2.
For BERT, the strongest sensitivity and F1-scores were observed when only 12 weeks (approximately 3 months) of posts before the estimated diagnosis dates were considered. Subsets larger or smaller than 12 weeks caused degradation in the classifier performance. For MentalBERT, the strongest sensitivity and F1-scores were obtained when either 16 or 24 weeks of posts were considered. With BERT scoring a higher F1-score at 12 weeks than MentalBERT, this suggests that the final 12 weeks of posts before a depressed user's estimated diagnosis date may be the most indicative of their illness.
An explanation for the slightly inferior performance of MentalBERT may be found in its construction: it is pretrained on text from mental health subreddits such as "r/depression" and "r/mentalhealth" [48]. However, RSDD (from which we derived RSDD-Matched) does not contain posts from mental health subreddits. Therefore, when RSDD-Matched data are limited, as in our temporal experiment, more general-purpose models, such as BERT, may be able to achieve stronger performance. BERT is pretrained on more general corpora, such as Wikipedia [44].
A detailed breakdown of the results of the temporal experiment can be found in Multimedia Appendix 1.
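The temporal subsets described above can be illustrated with a minimal sketch. It assumes each user's history is available as (timestamp, text) pairs alongside an estimated diagnosis date; the function name `posts_in_final_weeks` and the example data are hypothetical and are not taken from the study's code.

```python
from datetime import datetime, timedelta


def posts_in_final_weeks(posts, diagnosis_date, weeks):
    """Keep posts made in the `weeks` weeks before `diagnosis_date`.

    `posts` is a list of (timestamp, text) tuples. Posts made on or after
    the estimated diagnosis date are excluded.
    """
    window_start = diagnosis_date - timedelta(weeks=weeks)
    return [(ts, text) for ts, text in posts if window_start <= ts < diagnosis_date]


# Hypothetical user history (illustrative only).
diagnosis = datetime(2016, 6, 1)
history = [
    (datetime(2015, 11, 1), "old post"),
    (datetime(2016, 1, 15), "winter post"),
    (datetime(2016, 4, 20), "recent post"),
    (datetime(2016, 6, 10), "after diagnosis"),
]

subset_12 = posts_in_final_weeks(history, diagnosis, 12)
subset_24 = posts_in_final_weeks(history, diagnosis, 24)
```

Each subset would then be passed to the classifier in place of the full post history, so that only the chosen time span contributes to training and evaluation.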
We once again consider selected users from RSDD-Matched and the performance of the classifiers against them. We again examined one misclassified user per variation in the experiment (in terms of depressed users and number of matched controls). For each variation, we examined the strongest performing time span and the user who was misclassified with the highest probability. To identify the reasons for the misclassifications, we again examined the lexical properties of the user posts using topic models, TF-IDF features, and vocabulary (Table 2) frequency counts.
Misclassified depressed users with respect to the 2 variations in the experiment are listed in Table 13.
User d52 is a depressed user who was misclassified in both the balanced and imbalanced environments when only the final 12 weeks of their posts were considered. The vocabulary of these posts intersects with very little of the chief depressed vocabulary and with slightly more of the chief control vocabulary. Intuitively, the topic and TF-IDF features appear to belong to a control user rather than a depressed user. Perhaps a balanced environment with temporally limited post histories provides little training data from which the classifier can learn to differentiate between controls and depressed users. Although rare, these cases may occur in practice, and they highlight the importance of not overrelying on automatic models for individual assessments without human expert intervention.
We now consider the misclassified control users with respect to the 2 variations in the experiment (Table 14).
First, we consider user c481. Both their depressed and control vocabulary counts were zero, which offers some insight into the misclassification. The topic and TF-IDF features of the posts appear to align with those of a control user. However, it is likely that the prevalence of "pain" is a confounding factor. This term may be intuitively linked to depressed users, which may mislead the classifier. Again, the limited temporal range of posts in this setting provided little data from which the classifier could learn.
User c13 was a confounder in the preemptive depression identification experiment and proved to be one again in the temporal experiment. Even when only the last 12 weeks of the user's posts are considered in an imbalanced environment, their theologically themed vocabulary is not diluted. It intersects strongly with the vocabulary of depressed users, which explains this misclassification.
Table 11. Binary classification scores using 56 depressed users and 1 of their matched control users and 6 temporal post subsets.
Table 12. Binary classification scores using 56 depressed users and 3 of their matched control users and 6 temporal post subsets.

Sentiment Analysis
A sentiment analysis was then performed to complement the temporal experiment. We present the band-wise changes in sentiment for each class (Figures 3 and 4). Negatively charged posts from depressed users become less frequent as the (estimated) diagnosis date approaches, which may be deemed counterintuitive (Figure 3). However, it is also notable that depressed users' posts were, on average, more negative than those of control users throughout the 24-week period (Figure 4). This aligns with previous studies that found a positive correlation between mental illness and negative sentiments [73].
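As a rough illustration of a band-wise view, the sketch below groups posts into consecutive 4-week bands counted backward from the estimated diagnosis date and averages a sentiment score per band. It assumes each post already carries a numeric sentiment score (negative values meaning negative sentiment); the scoring tool, the function name `mean_sentiment_by_band`, and the example values are assumptions for illustration, not the study's implementation.

```python
from collections import defaultdict
from datetime import datetime


def mean_sentiment_by_band(scored_posts, diagnosis_date, band_weeks=4, max_weeks=24):
    """Average sentiment per band; band 0 is the final `band_weeks` weeks.

    `scored_posts` is a list of (timestamp, score) tuples. Posts on or after
    the diagnosis date, or older than `max_weeks` weeks, are ignored.
    """
    scores_by_band = defaultdict(list)
    for post_time, score in scored_posts:
        if post_time >= diagnosis_date:
            continue
        days_before = (diagnosis_date - post_time).days
        if days_before >= 7 * max_weeks:
            continue
        scores_by_band[days_before // (7 * band_weeks)].append(score)
    return {band: sum(s) / len(s) for band, s in sorted(scores_by_band.items())}


# Hypothetical scored posts (illustrative only).
diagnosis = datetime(2016, 6, 1)
scored = [
    (datetime(2016, 5, 25), -0.25),  # 7 days before: band 0
    (datetime(2016, 5, 10), -0.75),  # 22 days before: band 0
    (datetime(2016, 4, 20), -0.5),   # 42 days before: band 1
    (datetime(2015, 11, 1), 0.5),    # more than 24 weeks before: ignored
]
band_means = mean_sentiment_by_band(scored, diagnosis)
```

Computing these per-band averages separately for depressed and control users would give one series per class, which is the kind of comparison Figures 3 and 4 present.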
We then sought to establish whether the diagnosis was associated with the sentiment of the post. The results of the logistic regression model (Table 15) indicate that there is a clear significant association between the diagnosis and the "sentimentality" of the post (P<.05), despite no apparent effect of temporality. Interestingly, the word count of a post appeared as a significant covariate of this model (P=.001), indicating that longer posts are slightly more likely to be classified as "sentimental," irrespective of the depression status of the user.

Principal Findings
We obtained evidence that LMs (particularly BERT-like models) can be used in preemptive mental health detection and analysis in longhand forums, even if they have room for improvement.
In our preemptive depression detection experiment, depressed and control subjects were placed in ratios of 1:1, 1:3, 1:5, and 1:10. The purpose was to simulate increasingly realistic settings in which most users were controls. In the balanced arrangement of 1:1, we obtained an F1-score of 0.738 using the MentalBERT LM. This is comparable with the works of Eichstaedt et al [14], de Choudhury et al [74], and Reece et al [19], who obtained F1-scores of 0.660, 0.680, and 0.650, respectively. This study provides evidence that LMs are more effective than existing methods for predicting depression in social media data before diagnosis.
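To show why the F1-score is sensitive to the depressed-to-control ratio, the following sketch computes it from confusion counts for a hypothetical classifier whose sensitivity and false-positive rate stay fixed while the number of controls grows. The user counts and rates are invented for illustration and are not results from this study.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall (sensitivity)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_at_ratio(n_depressed, controls_per_depressed, sensitivity, false_positive_rate):
    """F1 for the depressed class at a given depressed:control ratio."""
    tp = round(n_depressed * sensitivity)
    fn = n_depressed - tp
    fp = round(n_depressed * controls_per_depressed * false_positive_rate)
    return f1_score(tp, fp, fn)


# Hypothetical classifier: sensitivity 0.8, false-positive rate 0.2.
for ratio in (1, 3, 5, 10):
    print(f"1:{ratio} -> F1 = {f1_at_ratio(100, ratio, 0.8, 0.2):.3f}")
```

Because false positives scale with the number of controls, precision (and therefore F1) falls as the ratio moves from 1:1 toward 1:10 even when sensitivity is unchanged, which is why scores from balanced and imbalanced arrangements are not directly comparable.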
Our temporal analysis suggested that the final 12 weeks (approximately 3 months) of posts before a depressed user's estimated diagnosis date are likely to be the most indicative of their condition. A broader interpretation is that LMs do not appear to improve when posts from earlier than 12 to 16 weeks before the diagnosis date are added. BERT and MentalBERT obtained F1-scores of 0.726 and 0.715, respectively. This contrasts to some extent with the results of Eichstaedt et al [14], who reported area under the curve scores rather than F1-scores: 0.72 at 6 months before the diagnosis date and 0.62 at 3 months before. It is difficult to draw clear conclusions from this comparison because the results may be affected by the nature of the data and models used.
We also observed that posts made during the 4- to 8-week period before the user's estimated diagnosis date are pertinent. They exhibited more negative sentiment than posts made during any other 4-week period (up to 24 weeks before the estimated diagnosis date). This finding may support prior work suggesting that distinct changes in mood may be predictive of the onset of depression [75].
We were able to corroborate the importance of sentiment in the discourse of depressed users. We found that depressed users are approximately 1.18 times more likely to make a sentimental post than nondepressed users.
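One way to read the figure of 1.18 is as an odds ratio from the logistic regression, that is, the exponential of the coefficient on the diagnosis variable. The text does not state this explicitly, so the sketch below treats it as an assumption; the baseline probability is likewise hypothetical.

```python
import math


def coefficient_from_odds_ratio(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio: beta = ln(OR)."""
    return math.log(odds_ratio)


def apply_odds_ratio(baseline_probability, odds_ratio):
    """Probability after multiplying the baseline odds by `odds_ratio`."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


# Assumption: 1.18 is interpreted as an odds ratio.
beta = coefficient_from_odds_ratio(1.18)

# Hypothetical baseline: a control user's post is "sentimental" half the time.
p_depressed = apply_odds_ratio(0.5, 1.18)
print(f"beta = {beta:.4f}, P(sentimental | depressed) = {p_depressed:.4f}")
```

Under this reading, an odds ratio multiplies the odds rather than the probability, so the change in probability depends on the baseline and is smaller than a factor of 1.18 whenever the baseline is not close to zero.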

Limitations
Constraints on our investigation primarily concern RSDD-Matched, in which the 70 depressed users make up a small sample. However, we used 5-fold cross-validation to mitigate this and performed experiments with various numbers of control users.
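A minimal sketch of 5-fold cross-validation over users is given below. It assumes a stratified split, in which each fold keeps the same class proportions; the text does not say whether the folds were stratified, and the function name `stratified_k_folds` and the 1:1 label list are illustrative.

```python
import random


def stratified_k_folds(labels, k=5, seed=0):
    """Split indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 70 depressed users (label 1) and, hypothetically, 70 matched controls (label 0).
labels = [1] * 70 + [0] * 70
folds = stratified_k_folds(labels)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    n_train_depressed = sum(labels[i] for i in train_idx)
    print(f"fold {held_out}: train on {n_train_depressed} depressed users, "
          f"test on {sum(labels[i] for i in test_idx)}")
```

With 70 depressed users and 5 folds, each held-out fold contains 14 depressed users and each training split contains 56, so every user is used for testing exactly once.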
RSDD-Matched is derived from RSDD and RSDD-Time. As a result, the diagnosis dates of the users in RSDD-Matched are estimates only. Furthermore, posts made in mental health subreddits were deliberately elided from the RSDD and were not available for consideration by our machine classifiers.

Conclusions
Using state-of-the-art LMs, this study examined how far in advance a diagnosis of depression can be anticipated in a person with depressive traits. With this knowledge, it may be possible to direct people with depression to physicians much sooner than they would otherwise be. Perhaps more importantly, we have shown how these automatic NLP tools can serve to analyze the main traits arising from web-based posts.
We have also observed that the sentiment exhibited in web-based forum postings demonstrates good sensitivity in detecting depressive traits.
Further work may include a multimodal approach to the detection of people with depression in web-based forums such as Reddit. For example, along with the text of Reddit users' posts, we might also consider the subreddits where they have upvoted and downvoted posts. The awards received or given may also indicate a user's mental health. Such a study would, of course, be contingent on the ability to synthesize a suitable data set or source an existing one. Moreover, the use of temporal information such as temporal word embeddings [76] may enhance any multimodal approach.
Methods for gauging the severity of depression in web-based forum users should also be investigated. This might involve mining language features from user posts and observing how they correlate with ground-truth severity. Features of interest may include terms used in Linguistic Inquiry and Word Count dictionaries, sentiment, and emotion [77].