This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.
A wide range of potential adverse health effects, ranging from headaches to cardiovascular disease, is associated with long-term negative emotions and chronic stress. Because many indicators of stress are imperceptible to observers, the early detection of stress remains a pressing medical need, as it can enable early intervention. Physiological signals offer a noninvasive method for monitoring affective states and are recorded by a growing number of commercially available wearables.
We aim to study the differences between personalized and generalized machine learning models for 3-class emotion classification (neutral, stress, and amusement) using wearable biosignal data.
We developed a neural network for the 3-class emotion classification problem using data from the Wearable Stress and Affect Detection (WESAD) data set, a multimodal data set with physiological signals from 15 participants. We compared the results among a participant-exclusive generalized, a participant-inclusive generalized, and a personalized deep learning model.
For the 3-class classification problem, our personalized model achieved an average accuracy of 95.06% and an average F1-score of 91.72%.
Our results emphasize the need for increased research in personalized emotion recognition models given that they outperform generalized models in certain contexts. We also demonstrate that personalized machine learning models for emotion classification are viable and can achieve high performance.
Stress and negative affect can have long-term consequences for physical and mental health, such as chronic illness, higher mortality rates, and major depression [
Physiological signals, including electrocardiography (ECG), electrodermal activity (EDA), and photoplethysmography (PPG), have been shown to be robust indicators of emotions [
The vast majority of research in recognizing emotions from biosignals involves machine learning models that are generalizable, which means that the models were trained on one group of subjects and tested on a separate group of subjects [
We present 1 personalized and 2 generalized machine learning approaches for the 3-class emotion classification problem (neutral, stress, and amusement) on the Wearable Stress and Affect Detection (WESAD) data set, a publicly available data set that includes both stress and emotion data [
To classify physiological data into the neutral, stress, and amusement classes, we developed a machine learning framework and evaluated it on the WESAD data set. Our machine learning framework consists of data preprocessing, a convolutional encoder for feature extraction, and a feedforward neural network for supervised prediction (
Overview of our model architecture for the 3-class emotion classification task. FNN: feedforward neural network; SiLU: sigmoid linear unit.
We selected WESAD, a publicly available data set that combines both stress and emotion annotations. WESAD consists of multimodal physiological data in the form of continuous time-series data for 15 participants and corresponding annotations of 4 affective states: neutral, stress, amusement, and meditation. However, we only considered the neutral, stress, and amusement classes since the objective of WESAD is to provide data for the 3-class classification problem, and the benchmark model in WESAD ignores the meditation state as well. Our model incorporated data from 8 modalities recorded in WESAD: ECG, EDA, electromyogram (EMG), respiration, temperature, and acceleration (x, y, and z axes). In the data set, measurements for each of the 8 modalities were sampled by a RespiBAN sensor at 700 Hz to enforce uniformity, and data were collected for approximately 36 minutes per participant.
Each data modality was normalized to a mean of 0 and an SD of 1. We used a sliding window algorithm to partition each modality into intervals consisting of 64 data points, with a 50% overlap between consecutive intervals. We ensured that all 64 data points within an interval shared a common annotation, which allowed us to assign a single affective state to each interval. The process of normalization, followed by a sliding window partition, is illustrated in
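The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration written for this description, not the paper's code; the function name and signature are our own.

```python
import numpy as np

def make_windows(signal, labels, win=64, overlap=0.5):
    """Normalize one modality, then cut it into fixed-size windows.

    Windows advance by win * (1 - overlap) samples (50% overlap by
    default) and are kept only when all samples share one annotation,
    so each window maps to a single affective state.
    """
    # z-score normalization: mean 0, SD 1
    signal = (signal - signal.mean()) / signal.std()

    step = int(win * (1 - overlap))
    windows, window_labels = [], []
    for start in range(0, len(signal) - win + 1, step):
        chunk_labels = labels[start:start + win]
        if np.all(chunk_labels == chunk_labels[0]):  # single annotation only
            windows.append(signal[start:start + win])
            window_labels.append(chunk_labels[0])
    return np.stack(windows), np.array(window_labels)
```

In practice this would be applied per modality and the resulting windows stacked along a channel axis before being fed to the encoder.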
For the personalized model, we partitioned the training, validation, and testing sets as follows: each participant in the data set had their own model that was trained, validated, and tested independently of other participants. For each affective state (neutral, stress, and amusement), we allocated the initial 70% of intervals with that affective state for training, the next 15% for validation, and the final 15% for testing. This guaranteed that the relative frequencies of each affective state were consistent across all 3 sets. Simply using the first 70% of all intervals for the training data would skew the distribution of affective states, given the nature of the WESAD data set. Furthermore, our partitioning of intervals according to sequential time order rather than random selection helped prevent overfitting by guaranteeing that 2 adjacent intervals with similar features would be in the same set. The partitioning of training, validation, and testing sets for the personalized model is shown in
A comparison of different generalized and personalized approaches to the 3-class emotion classification task. The participant-exclusive generalized model mimics generalized approaches used in other papers. The participant-exclusive generalized model shown in the figure differs from what we use in this paper.
Standard generalized models partition the training, validation, and testing sets by participant [
A second generalized model baseline was created, called the participant-inclusive generalized model. Like the testing sets for the participant-exclusive generalized and personalized models, the testing set for this model contained the last 15% of intervals for each affective state for a single participant. The training set consisted of the first 70% of intervals for each affective state for all participants, and the validation set consisted of the next 15%. The set of participants in the training and testing sets overlapped by 1 participant—the subject in the testing set—which is why this model is called the participant-inclusive generalized model. This is illustrated in
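The distinction between the two generalized baselines reduces to which participants contribute training and validation data. A minimal sketch of that bookkeeping, under our own naming:

```python
def generalized_splits(participants, test_subject, inclusive):
    """Return which participants feed the train/val sets and who is tested.

    Participant-exclusive: train and validate on everyone except the
    test subject, as in leave-one-subject-out evaluation.
    Participant-inclusive: train and validate on everyone, including
    the test subject's earlier windows; only that subject's final 15%
    of windows per class is held out for testing.
    """
    if inclusive:
        train_val = list(participants)
    else:
        train_val = [p for p in participants if p != test_subject]
    return train_val, test_subject
```

In both cases the tested windows themselves (the last 15% per class for the test subject) never appear in training, so the comparison isolates whether seeing *other* data from the same person helps.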
The model architecture consisted of an encoder network followed by a feedforward head, which is shown in
Hyperparameters relating to model structure.
Hyperparameter  Value 
Encoder depth (number of blocks), n  3 
Dropout rate, %  15 
Number of fully connected layers, n  2 
Convolutional kernel size, n  3 
Max pooling kernel size, n  2 
Activation function  SiLU^{a} 
^{a}SiLU: sigmoid linear unit.
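A PyTorch sketch consistent with the hyperparameters in the table above (3 encoder blocks, kernel-3 convolutions, SiLU activations, kernel-2 max pooling, 15% dropout, 2 fully connected layers). The channel widths (16/32/64), hidden size (64), and use of padding are illustrative assumptions not listed in the table.

```python
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    """Convolutional encoder + feedforward head for 3-class prediction.

    Input is a batch of windows shaped (batch, 8 modalities, 64 samples).
    Each of the 3 encoder blocks applies a kernel-3 convolution, a SiLU
    activation, and kernel-2 max pooling, halving the temporal length.
    """

    def __init__(self, in_channels=8, n_classes=3, window=64):
        super().__init__()
        blocks, channels = [], [in_channels, 16, 32, 64]  # widths assumed
        for c_in, c_out in zip(channels, channels[1:]):
            blocks += [
                nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.MaxPool1d(kernel_size=2),
            ]
        self.encoder = nn.Sequential(*blocks)
        flat = 64 * (window // 2 ** 3)  # 3 pooling layers halve length 3x
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.15),
            nn.Linear(flat, 64),
            nn.SiLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.head(self.encoder(x))
```

The encoder's flattened output feeds the 2 fully connected layers, matching the encoder-plus-feedforward-head structure described above.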
We trained the 2 generalized baseline models and the personalized model under the same hyperparameters to guarantee a fair comparison. All 3 models were trained with cross-entropy loss using AdamW optimization. All models were written using PyTorch [
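The stated training setup (cross-entropy loss, AdamW) can be sketched as a minimal loop; the learning rate is an illustrative placeholder, as the paper's exact value is not reproduced here.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=1, lr=1e-3):
    """Train with AdamW and cross-entropy loss, as described above.

    loader yields (windows, labels) pairs with windows shaped
    (batch, channels, samples) and integer class labels.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for windows, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(windows), labels)
            loss.backward()
            optimizer.step()
    return loss.item()
```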
This study did not require institutional review board (IRB) review because we exclusively used a commonly analyzed publicly available data set. We did not work with any human subjects.
For the 3-class emotion classification task (neutral, stress, and amusement),
A comparison of model accuracy between the personalized and generalized models.
Participant  Model accuracy, %
Personalized model  Participant-inclusive generalized model  Participant-exclusive generalized model
1  68.36  82.69  53.94 
2  82.32  67.12  81.91 
3  99.99  82.81  82.81 
4  99.90  82.86  82.31 
5  98.02  82.94  74.67 
6  99.57  54.57  54.03 
7  100.00  82.05  83.23 
8  100.00  53.72  53.70 
9  100.00  51.86  51.83 
10  93.69  82.05  79.85 
11  100.00  60.86  62.11 
12  98.34  53.53  53.60 
13  99.81  53.26  65.35 
14  100.00  53.47  53.54 
15  85.83  60.43  81.91 
A comparison of F1-score between the personalized and generalized models.
Participant  F1-score, %
Personalized model  Participant-inclusive generalized model  Participant-exclusive generalized model
1  58.14  61.91  23.36 
2  58.88  44.55  58.53 
3  99.98  62.05  62.05 
4  99.87  61.95  61.50 
5  96.87  61.99  54.74 
6  99.35  24.94  23.59 
7  100.00  61.16  62.09 
8  100.00  23.38  23.29 
9  100.00  22.85  22.89 
10  94.29  61.04  59.23 
11  100.00  38.27  40.15 
12  97.40  26.79  26.90 
13  99.75  24.47  44.63 
14  100.00  23.93  24.09 
15  71.28  38.26  58.71 
Average accuracy and F1-score of models across all participants.
Model type  Accuracy, mean (SD), %  F1-score, mean (SD), %
Personalized  95.06 (9.24)  91.72 (15.33) 
Participantinclusive generalized  66.95 (13.76)  42.50 (17.37) 
Participantexclusive generalized  67.65 (13.48)  43.05 (17.20) 
Model comparison  P value (accuracy)  P value (F1-score)
Personalized versus participantinclusive generalized  
Personalized versus participantexclusive generalized  
Participantinclusive generalized versus participantexclusive generalized  .81  .88 
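The paper does not restate here which statistical test produced the P values above; a paired Wilcoxon signed-rank test over the 15 per-participant accuracies is one common choice for this kind of comparison, and a sketch using SciPy with the accuracy values from the tables above follows.

```python
from scipy.stats import wilcoxon

# Per-participant accuracies (%) copied from the tables above.
personalized = [68.36, 82.32, 99.99, 99.90, 98.02, 99.57, 100.00, 100.00,
                100.00, 93.69, 100.00, 98.34, 99.81, 100.00, 85.83]
inclusive = [82.69, 67.12, 82.81, 82.86, 82.94, 54.57, 82.05, 53.72,
             51.86, 82.05, 60.86, 53.53, 53.26, 53.47, 60.43]
exclusive = [53.94, 81.91, 82.81, 82.31, 74.67, 54.03, 83.23, 53.70,
             51.83, 79.85, 62.11, 53.60, 65.35, 53.54, 81.91]

# Paired test: each participant contributes one accuracy per model.
stat, p = wilcoxon(personalized, inclusive)
print(f"personalized vs participant-inclusive: P={p:.4f}")
stat, p = wilcoxon(inclusive, exclusive)
print(f"participant-inclusive vs participant-exclusive: P={p:.4f}")
```

Under this test, the personalized-versus-generalized comparisons are significant while the two generalized models do not differ significantly, consistent with the pattern reported in the table.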
We demonstrated that a personalized deep learning model outperforms a generalized model in both the accuracy and F1-score metrics.
Our work indicates that personalized models for emotion recognition should be further explored in the realm of health care. Machine learning methods for emotion classification are clearly viable and can achieve high accuracy, as shown by our personalized model. Furthermore, given that numerous wearable technologies collect physiological signals, data acquisition is both straightforward and noninvasive. Combined with the popularity of consumer wearable technology, it is feasible to scale emotion recognition systems. This can ultimately play a major role in the early detection of stress and negative emotions, thus serving as a preventative measure for serious health problems.
The vast majority of prior studies using WESAD developed generalized approaches to the emotion classification task. Schmidt et al [
Sah and Ghasemzadeh [
As shown in
Deviations of mean and SD for participants 1 and 2 for neutral class modalities.
Deviations of mean and SD for participants 1 and 2 for stress class modalities. EMG: electromyogram.
Deviations of mean and SD for participants 1 and 2 for amusement class modalities.
Ranges of emotion class distributions per participant.
Emotion class  Range, % 
Neutral  51.8-54.0
Stress  29.0-31.8
Amusement  16.3-17.4
Our participant-inclusive and participant-exclusive generalized models do not outperform previously published generalized models on the WESAD data set (eg, Schmidt et al [
Given the variations between participants, one approach to improving generalized model performance is to add embedding representations for each participant, or participant-specific demographic data, as additional features to distinguish individual participants in generalized models. However, to prevent overfitting to participant-specific features like demographic data, data sets with significantly more participants would need to be created, given the small sample size of the WESAD data set.
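The participant-embedding idea suggested above can be sketched in PyTorch; the class name and all dimensions here are illustrative assumptions, not an implementation from the paper.

```python
import torch
import torch.nn as nn

class ParticipantConditionedHead(nn.Module):
    """Learn one embedding vector per participant and concatenate it
    with the encoded signal features before classification, so a
    generalized model can distinguish individuals.
    """

    def __init__(self, n_participants=15, embed_dim=8,
                 feat_dim=64, n_classes=3):
        super().__init__()
        self.participant_embedding = nn.Embedding(n_participants, embed_dim)
        self.classifier = nn.Linear(feat_dim + embed_dim, n_classes)

    def forward(self, features, participant_ids):
        # features: (batch, feat_dim); participant_ids: (batch,) int IDs
        emb = self.participant_embedding(participant_ids)
        return self.classifier(torch.cat([features, emb], dim=-1))
```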
One limitation that personalized models may encounter during training is the cold start problem, given that personalized models receive less data than generalized models. Moreover, despite the accuracy improvement in personalized models, developing a model for each participant may be costly and unscalable: data must be labeled specifically per participant, and enough data must be provided to the model to overcome the cold start problem (notably, however, even though the cold start problem should theoretically put our personalized model at a disadvantage, the WESAD data set provided enough data for our personalized model to outperform our generalized model). Both of these limitations can be addressed by a self-supervised learning approach to emotion recognition.
A self-supervised learning approach follows a framework used by natural language processing models such as the Bidirectional Encoder Representations from Transformers (BERT) model [
Finally, to expand beyond the WESAD data set, it is valuable to reproduce results on additional physiological signal data sets for emotion analysis, such as the Database for Emotion Analysis using Physiological Signals (DEAP) [
BERT: Bidirectional Encoder Representations from Transformers
CLAS: Cognitive Load, Affect, and Stress
DEAP: Database for Emotion Analysis using Physiological Signals
ECG: electrocardiography
EDA: electrodermal activity
EMG: electromyogram
LOSO: leave-one-subject-out
PPG: photoplethysmography
SiLU: sigmoid linear unit
SWELL: Smart Reasoning for Wellbeing at Home and at Work
SWELL-KW: SWELL knowledge work
WESAD: Wearable Stress and Affect Detection
The project described was supported by grant U54GM138062 from the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH), and its contents are solely the responsibility of the author and do not necessarily represent the official view of NIGMS or NIH. The project was also supported by a grant from the Medical Research Award fund of the Hawai’i Community Foundation (grant MedRes_2023_00002689).
None declared.