This is an openaccess article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR AI, is properly cited. The complete bibliographic information, a link to the original publication on https://www.ai.jmir.org/, as well as this copyright and license information must be included.
Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019.
We gathered a diverse set of non–personally identifiable information from public data sources to better understand and predict COPD rates at the corebased statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD.
We integrated non–personally identifiable information from multiple Centers for Disease Control and Prevention sources and used them to analyze COPD with different types of methods. We included cigarette smoking, a wellknown contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods.
The most accurate multiple linear regression model has variance explained of 81.1%, mean absolute error of 0.591, and symmetric mean absolute percentage error of 9.666. The most accurate machine learning model has variance explained of 85.7%, mean absolute error of 0.456, and symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level.
This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.
Lung disease is a severe problem in the United States. According to the Centers for Disease Control and Prevention (CDC), asthma is responsible for at least 3000 deaths per year and chronic obstructive pulmonary disease (COPD) is responsible for at least 150,000 deaths per year. COPD is a progressive lung disease, encompassing chronic bronchitis and emphysema, which is characterized by airflow limitation and breathing difficulties. Asthma and COPD can cooccur (asthmaCOPD overlap), with increased risk of mortality [
Cigarette smoking has been trending downward in recent years, thanks in part to public health advertisement campaigns. Nevertheless, air quality can be dangerously poor at times, which exacerbates lung health problems [
The rest of the paper is organized as follows. We first review prior work regarding the possible factors contributing to COPD in adults. We then describe our methods, including data sources for the variables of interest and descriptive statistics. Following this, we will describe and interpret the results of our multiple linear regression (MLR) and machine learning (ML) models. We conclude by describing the overall research contributions as well as limitations and future directions.
There is substantial literature on factors contributing to COPD, including a wide variety of environmental, economic, and demographic variables; the etiology of COPD is multifactorial, with smoking being the most wellknown contributing factor. Furthermore, the combination of environmental pollutants and cigarette smoke has shown synergistic effects, accelerating the decline in lung function and worsening COPD [
Pollutants and copollutants are associated with decreased lung function and can lead to COPD. The loss can range from mild, such as allergies, to severe, that is, mortality. Air quality varies widely throughout the United States because of pollutants and copollutants, and climate change may be worsening it, particularly for populations considered vulnerable [
ML methods have been applied increasingly to public health and medical problems. For example, ML has been used to support the public health response to COVID19 through surveillance, case identification, contact tracing, and evaluating interventions [
ML methods have been used to study COPD, in particular. For example, ML methods have been used to develop a prediction system using lifestyle data, environmental factors, and patient symptoms for the early detection of acute exacerbations of COPD within a 7day window [
Different races and ethnicities may have different baseline rates of disease due to various factors, including historical misdiagnosis and mistreatment of various racial or ethnic groups, which leads to differential outcomes [
We had three general expectations of COPD in our models:
Cigarette smoking will have the highest impact on COPD rates.
AQI will have a strong impact on COPD rates.
There will be differences in COPD rates based on racial or ethnic demographics.
This paper used MLR and ML methods to predict COPD at the corebased statistical area (CBSA) level [
Data sources.
Source  Reference 
National Center for Health Statistics  [ 
Chronic Disease Indicators data  [ 
US Chronic Disease Indicator, stratification values  [ 
Data were collected for all CBSAs that were available from 2016 to 2019. All data obtained from the CDC were contributed voluntarily at the individual level and aggregated to remove all personally identifiable information [
The COPD rates are for 2019, whereas all the predictor variables are averaged over the timespan from 2016 to 2018. As such, the models obtained are predictive over time. The data collection result was 517 (56%) of the 929 CBSAs, with proportionally more from the 388 metropolitan statistical areas than from the 541 micropolitan statistical areas. The response variable is the percentage of the CBSA having COPD. We modeled all factors as random variables directly contributing to COPD, which is measured as the proportion (percentage) of the population having COPD. Race or ethnicity was also modeled as percentage of the population rather than as categorical variables. All variables in
In
Main variables and descriptive statistics and average within corebased statistical areas.

Years  Values, median (IQR)  Values, mean (SD)  Values, range 
Population (n)  20162018  96,811 (48,763180,484)  191,892 (408,308)  73516,633,096 
GDP^{a} (US $)  20162018  13,126,907 (2,562,70439,046,120)  64,223,036 (212,975,821)  447,3553,218,209,695 
Median household income (US $)  20162018  52,632 (46,86760,494)  54,736 (11,319)  27,842119,332 
GDP per capita (US $)  20162018  100.07 (47.83277.17)  253.77 (479)  16.864731.50 
Air quality index  20162018  38.67 (34.0043.00)  38.02 (10)  9.0095.00 
Smoking rate  20162018  17.12 (15.3319.28)  17.29 (3)  8.4129.59 
Poverty rate (all ages)  20162018  13.80 (10.9217.12)  14.36 (4)  3.8735.56 
Unemployment rate  20162018  4.52 (3.675.43)  4.71 (2)  1.9720.93 
Education rate  20162018  22.91 (17.9427.96)  24.22 (8)  8.7765.75 
White (%)  20162018  87.6 (78.692.8)  84.6 (0.129)  22.1100 
Black (%)  20162018  4 (1.512.5)  9.3 (0.124)  0.3100 
AI or AN^{c} (%)  20162018  0.7 (0.41.7)  2 (0.044)  0.145.9 
Asian (%)  20162018  1.6 (0.93)  2.8 (0.041)  0.242.8 
NH or PI^{d} (%)  20162018  0.1 (0.10.2)  0.3 (0.01)  0.012.9 
Hispanic (%)  20162018  7 (3.914.9)  13.3 (0.164)  0.995.5 
COPD^{b} rate (%)  2019  6.7 (5.77.9)  6.871 (1.511)  3.215 
^{a}GDP: gross domestic product.
^{b}COPD: chronic obstructive pulmonary disease.
^{c}AI or AN: American Indian or Alaska Native.
^{d}NH or PI: Native Hawaiian or Pacific Islander.
Population, gross domestic product (GDP), GDP per capita, and median household income.
Therefore, we made a log transformation of these variables (ie, logPopl, logGDP, logGDPpc, and logHHI) to make them less skewed, and we show a heat map of correlations of them with the other variables in
Correlations among main variables. COPD: chronic obstructive pulmonary disease; GDP: gross domestic product.
We see a range of correlations, from very negative (green) to negative (orange) to positive (purple) to very positive (pink). In the rightmost column, we see the correlations between the response variable, COPD rate, and the other variables, ranging from very positive (smoking rate) to moderately positive (poverty and unemployment rates) to moderately negative (education and logged household income) to slightly negative (log of GDP, log of population, log of GDP per capita, and AQI). Given these correlations, we are likely to find good predictive models, but we need to check for multicollinearity in any linear model that we identify.
To understand and model COPD, one has to consider the consistently largest contributing factor: cigarette smoking. Research tends to either control for cigarette smoking or exclude it entirely. In this paper, we chose to include cigarette smoking, accounting for it in our models, but also to examine other factors to compare the magnitudes of influence among the various factors. We aimed to model a variety of factors, including cigarette smoking, to arrive at the model that predicts COPD with the greatest accuracy.
Our MLR baseline model in R (version 4.2.3) yielded the output in
Multiple linear regression.

Estimate  SE  
(Intercept)  32.4000  2.930  11.065  <.001 
Smoking_Rate  0.2570  0.015  16.635  <.001 
Log_HH_Income  −2.8100  0.264  −10.638  <.001 
Hispanic_percentage  −2.3900  0.234  −10.249  <.001 
Education_Rate  −0.0334  0.005  −6.627  <.001 
AI_or_AN_percentage^{a}  −2.9000  0.778  −3.726  <.001 
Black_percentage  −1.2700  0.356  −3.558  <.001 
NH_or_PI_percentage^{b}  −15.8000  4.430  −3.558  <.001 
Log_GDP  0.0741  0.035  2.126  .034 
White_percentage  −0.7380  0.358  −2.060  .04 
Unemployment_Rate  0.0456  0.023  1.993  .047 
Asian_percentage  2.4800  1.250  1.988  .047 
Log_Population  0.0899  0.055  1.626  .105 
Air_Quality_Index  0.0003  0.003  0.098  .922 
^{a}AI_or_AN: American Indian or Alaska Native.
^{b}NH_or_PI: Native Hawaiian or Pacific Islander.
The model has residual SE 0.658 on 503
There are 7 predictors of high statistical significance: smoking rate, Black percentage, Native Hawaiian or Pacific Islander percentage, American Indian or Alaska Native percentage, education rate, Hispanic percentage, and log of household income. Smoking rate has a positive association with COPD, with every additional percentage increase associated with a 0.257% increase in the COPD rate. The other 6 highly significant predictors have a negative association. Every percentage increase in the log of household income lowers the COPD rate by 2.81%. The Hispanic percentage is nearly as strong; every percentage increase corresponds to a drop of 2.39% in COPD rate. American Indian or Alaska Native is a bit stronger in its coefficient estimate; every percentage point increase corresponds to a drop of 2.9% in COPD rate. Every percentage point increase in Native Hawaiian or Pacific Islanders corresponds to a drop of 15.8%, which is much stronger. Every percentage point increase in Black percentage corresponds to a drop of 1.27% in COPD rate. Education rate has a strongly statistically significant relationship, but a small percentage point impact: every percentage increase corresponds to a decrease of 0.0334% in COPD rate. The remaining 4 predictors—White percentage, GDP (logged), unemployment rate, and Asian percentage—are far less statistically significant and, therefore, should be interpreted with caution.
Linear models are simpler than ML models, and they are sometimes perfectly adequate for explaining a phenomenon. They are easier to interpret, communicate, and implement as new policy. They make statistical assumptions, which can be verified. Linear regression is certainly a good place to start. However, we argue that one should not stop there because an ML model can capture substantial variance from nonlinear relationships (if there are any) in the data and thus produce a more accurate model. By capturing additional variance, the model can capture subtler effects and relationships due to interactions, context, and tipping points. This is crucial because public health practice tends to use simple ifthen rules, that is, decision trees. ML models can add nuance to those decision trees based on the captured nonlinearities. Although an adjusted
The 7 ML methods evaluated in this paper are lasso regression, ridge regression, generalized additive model, support vector machine, artificial neural network, random forest, and GBT. These methods were selected for their known strengths in minimizing errors of bias or errors of variance, that is, their ability to fit data well on test data without overfitting. They also represent the range of algorithms commonly used in ML prediction, from methods established in classical statistics to more modern methods derived from computer science. They are commonly used because they are accurate and well understood. Trying a variety of methods is a common practice because the different methods make different statistical assumptions, which may enhance or inhibit optimal performance. All methods were available as R packages for R (version 4.2.3). We summarize each method in terms of its main pros and cons:
Lasso regression is an MLR method that incorporates regularization to perform variable selection. It minimizes the sum of squared errors between predicted and actual values, while adding a penalty term based on the absolute value of coefficients multiplied by a tuning parameter. Doing so shrinks some coefficients to exactly 0, effectively performing feature selection by excluding less important variables from the model. This reduces model complexity and minimizes multicollinearity. This is a standard refinement of MLR (R package glmnet).
Ridge regression is an MLR technique that adds a penalty term to the objective function to reduce the coefficients of less important predictors and guard against overweighting the most important predictors. While it retains all predictors in the model, ridge regression can help improve the robustness of the model in the presence of correlated predictors by reducing multicollinearity. This is a standard refinement of MLR (R package ridge).
The generalized additive model is a nonparametric generalization of MLR, which allows for nonlinear terms and coefficient regularization while maintaining interpretability. Each term is a function of X_{n} rather than simply a numeric coefficient multiplied with X_{n}. As with MLR, all the terms are added together. Although overfitting can occur, regularization and crossvalidation help to minimize it (R package mgcv).
Support vector machine is a technique that transforms the data into a highdimensional variable space using a kernel function, fitting a function that best fits the data while allowing a certain margin of error (epsilon) and maintaining robustness against outliers. Epsilon tubes can provide a visual representation of the model’s uncertainty. Points within the tube are considered well predicted, while those outside represent errors. A regularization parameter controls the tradeoff between accuracy and complexity (R package e1071).
Artificial neural network is a generalization of MLR with hidden layers of nodes between input and output nodes; it may result in overfitting. Depending on the number of hidden layers, nodes per layer, and the activation function used to convert inputs to outputs, an arbitrarily complex model can be fit. This can be thought of as a simplified version of a human brain, in which input and output nodes are separated by ≥1 layers of hidden nodes. Prediction error causes the weights of the hidden nodes to be adjusted until minimal error is achieved (R package neuralnet).
Random forest is an ensemble technique to fit a large number of a bootstrapsampled aggregation (bagging) of trees by considering a random subset of variables at each tree split. Intuitively, a random forest is a blending of a large number of decision trees, the “wisdom of the forest.” The random subset of variables restriction is done to prevent strong variables from dominating the weaker variables. A random forest tends to perform very well but is difficult to interpret (R package RandomForest).
GBT is an ensemble of sequential trees that focuses on the errors of the previous tree. It is able to find interaction effects implicitly. It uses gradient descent search to rapidly minimize error via an arbitrary, differentiable loss function. It uses many trees to help ensure that the local minimum error found is the global minimum. Intuitively, this builds a strong predictive model by combining many weak models, each correcting the errors of the previous one (R package XGBoost).
Our ML approach followed best practices. We randomly partitioned the data set into train (311/517, 60%), crossvalidate (103/517, 20%), and test (103/517, 20%) subsets. We checked for outliers, multicollinearity, and target leakage to ensure valid models [
This research did not involve human subjects at the individual level and therefore did not require institutional review board approval. Our data were collected from CDC sources at the level of CBSA. All sources were free of personal identifying information, because the CDC is legally required to ensure the protection of the data. All data were collected and aggregated in a non–personal identifying information manner. The results of our analysis do suggest communicating with different racial and ethnic groups differently, tailoring the implications directly to patients as well as indirectly to their families, communities, and health care providers in a race or ethnicitysensitive manner.
In
Machine learning models versus multiple linear regression.
Method  Adjusted 
Root mean square error  Mean absolute error  Symmetric mean absolute percentage error  


Train  CV^{a}  Test  Train  CV  Test  Train  CV  Test  
Gradient boosted tree (XGBoost, loss function=least squares, learning rate=0.05, and maximum tree depth=10)  0.857  0.550  0.598  0.557  0.433  0.445 

6.473  6.543 


Support vector machine (Nystroem kernel and loss function=Poisson deviance) 

0.555 


0.435 

0.462  6.515 

6.989  
Random forest (maximum trees=500, maximum depth=none, and maximum leaves=100)  0.836 

0.614  0.596 

0.462  0.479 

6.819  7.339  
Neural network (2 layers: 512, 512 units; regularization via random dropout rate=0.05 and activation function=prelu)  0.845  0.601  0.609  0.580  0.455  0.467  0.468  6.856  6.928  7.182  
Generalized additive model (learning rate=0.3, maximum bins=100, and loss function=least squares)  0.822  0.629  0.658  0.621  0.515  0.508  0.488  7.619  7.502  7.212  
Ridge regression  0.810  0.589  0.618  0.641  0.467  0.483  0.527  6.986  7.346  7.986  
Lasso regression  0.758  0.750  0.778  0.724  0.585  0.593  0.597  8.425  8.544  8.824  
Multiple linear regression  0.811  0.620  0.699  0.749  0.474  0.548  0.591  7.205  8.403  9.666 
^{a}CV: crossvalidation.
^{b}Values in italics represent the best metrics.
The ML methods were superior to MLR on most metrics. Support vector machine was the best on adjusted
The top five variables in terms of impact were (1) smoking rate and (2) household income, followed by (3) American Indian or Alaska Native percentage, (4) education rate, and (5) unemployment rate. Black percentage was sixth, Hispanic percentage was seventh, and there was only a small impact from the remaining variables: White percentage, AQI, Asian percentage, Native Hawaiian or Pacific Islander percentage, population, and GDP. Relative to the MLR, smoking rate, household income, education rate, and Black percentage remained the same in terms of rank importance. Hispanic percentage dropped from third to seventh rank; American Indian or Alaska Native percentage rose from fifth to third rank; and unemployment rate rose sharply, from 10th to 5th in importance. Native Hawaiian or Pacific Islander percentage dropped sharply, from 7th to 11th in rank.
Lift plot showing chronic obstructive pulmonary disease rate as a function of 10 decile bins; predicted values are in blue and actual values are in red. COPD: chronic obstructive pulmonary disease.
Prediction residuals.
In addition to the variable importance plot, other plots were used to gain an understanding of ML models: local interpretable modelagnostic explanations (LIME) models and SHAP (Shapley additive explanations) plots [
The top 5 variables (smoking rate, household income, American Indian or Alaska Native percentage, education rate, and unemployment rate) have substantially more impact on COPD percentage than the remaining variables. We show the top 5 variables as well as the next 4 as individual SHAP plots of the GBT in
Smoking had the greatest impact: as the smoking rate increased, the COPD rate also rose substantially, following a steeply curved, nearly exponential relationship. Median household income had the second highest impact, an almost linear (and negative) relationship. The greater the household income, the lower the COPD rate. This could indicate better insurance coverage, better health care access, higher quality health care (ie, prevention or treatment), lower occupational exposure, or lower home exposure (eg, gas stoves). The next variable was American Indian or Alaska Native percentage, indicating a negative but nonlinear relationship with COPD rate: a steep drop followed by a gradual tapering. This represents a significant protective influence shown for the American Indian or Alaska Native community, which has not yet been noted in the literature.
SHAP (Shapley additive explanations) values for all features (variables). AI: American Indian; AN: Alaska Native; GDP: gross domestic product; NH: Native Hawaiian; PI: Pacific Islander.
SHAP (Shapley additive explanations) plots for the 9 most important variables. AI: American Indian; AN: Alaska Native.
The next variable, education rate had a negative, curvilinear relationship. The more educated the population, the lower the COPD rate. The explanation could be similar to that of income: better insurance coverage or health care access, better quality of health care, lower occupational exposure, or lower home exposure [
The next variable, Hispanic percentage, showed a negative linear relationship with COPD rate. This represents a significant protective influence shown for people in the Hispanic community, which is consistent with the literature [
We had three general expectations, which were largely met:
The impact of cigarette smoking was the largest in all models.
The AQI had an impact in the best ML model, but it was smaller than expected.
There were substantial racial or ethnic differences, particularly among American Indian or Alaska Native, Black, and Hispanic communities.
Consistent with the literature, we found that smoking remains the most significant risk factor for COPD, with research consistently demonstrating a strong association between smoking status and COPD prevalence. In our MLR, we found that smoking rate is the strongest predictor of COPD rate. We found the same result in our GBT but also found that the smoking rate has a curvilinear, almost exponential, relationship with COPD. The Rotterdam study, a largescale populationbased cohort study, found that current and former smokers had a substantially higher risk of developing COPD compared to never smokers [
Notably, 3 of the 4 next most important variables, in terms of impact in our GBT, are socioeconomic variables: household income (rank 2), education rate (rank 4), and unemployment rate (rank 5). In the MLR, we found that household income (logged) had the second highest impact. In the GBT, household income had the second highest impact, but the tipping point was around US $40,000, after which higher income had a linear, negative relationship with COPD. Education rate had a strongly negative, curvilinear relationship with COPD. Unemployment rate had a sharply positive relationship with COPD, but then peaked at 5% unemployment, after which it plateaued.
These results are largely consistent with the literature on socioeconomic factors and smoking behavior, suggesting an indirect relationship with COPD via smoking. A study examining smoking among adolescents in 6 European cities found that disposable income was positively associated with smoking [
Ethnic or racial variables accounted for 3 of the top 7 variables in the GBT: American Indian or Alaska Native percentage (rank 3), Black percentage (rank 6), and Hispanic percentage (rank 7). The greater the size of those minority populations, the lower the COPD rate. Our SHAP plots show significant tipping points (nonlinearities) for American Indian or Alaska Native percentage and Black percentage and a mostly linear relationship for Hispanic percentage. Consistent with the literature, all 3 variables show a strongly negative association with COPD.
The regression and GBT models show that in addition to strongly protective impacts for lower cigarette smoking and higher household income, there are protective impacts for larger American Indian or Alaska Native and Hispanic populations as well as a nonlinear impact on larger Black populations. Higher education rate and lower unemployment rate are also protective, whereas AQI shows mixed effects. These results have implications for private health care practitioners, public health care officials, and health care policy makers who aim to reduce COPD rates. Such policies and programs should not assume high digital literacy [
Several studies have identified ethnic and racial disparities in COPD prevalence and risk among smokers. One study found that racial and ethnic minority individuals, particularly African Americans and Hispanics, had a lower prevalence of airflow obstruction than nonHispanic White individuals, even after adjusting for smoking status and other risk factors [
There are varying levels of patient trust and implicit bias in health care practitioners themselves [
Educational materials and behavior change strategies may need to be customized according to different risk factors, beliefs, preferences, and technographics of different subpopulations [
AQI was not significant in the MLR, but it was significant in the GBT, albeit not as strongly as we expected. It could be that the AQI is more of a diffuse, macrolevel environmental factor that fluctuates over time, making some CBSAs worse on average, but with wide volatility, for example, as weather and wind directions change [
There is a small but growing body of research that uses ML models in health care and medicine. There is recognition that the models can be highly accurate, but there is no consensus yet on how to interpret the results in a way that meshes seamlessly with clinical practice. The following examples provide an overview.
Elshawi et al [
Hakkoum et al [
Meng et al [
In sum, there is no consensus on the best way to interpret highperforming ML models in health care. There are always tradeoffs between accuracy and interpretability or explainability. We chose to use Shapley values because they represent the frontier in explainability, and they are similar to interpreting a multiple regression, interpreting 1 variable at a time, without the assumptions of linear models. In addition, Shapley values allow for nonlinear relationships between each independent (predictor) variable and the dependent variable. Variable importance plots in conjunction with Shapley values help us to identify the most important variables and characterize their relationships with COPD.
Our best MLR model had variance explained of 81.1%, MAE of 0.591, and SMAPE of 9.666. Our best ML model was the GBT, with variance explained of 85.7%, MAE of 0.456, and SMAPE of 6.956. The GBT explains most of the variance—4.6% more than the best MLR—with far less predictive error. The GBT’s SMAPE (6.956) was 28% lower than that of the MLR’s SMAPE (9.666). Similarly, the GBT’s RMSE was 26% lower than the MLR’s RMSE, and its MAE was 23% lower than that of the MLR. Realworld predictive accuracy should be similar to that found in the test data set because the test data were never used in the GBT’s model development.
Our GBT performed strongly on the test data, with very little performance deterioration on the test data versus performance on the training and validation data. This demonstrates that the GBT model does not overfit the data. To interpret the GBT, we used a variable importance plot [
This research has a few limitations. The data were obtained from 517 (56%) of the 929 CBSAs. We assumed that this was an adequate sample and that the remaining CBSAs that did not report the data were similar to those that did. Alternatively, it could be that the CBSA that did not report COPD rates did so because the rates were low, that is, COPD was not considered a major problem by the local public health officials. Data covering additional demographic variables, such as gender and age, in addition to occupational exposures and physical exercise, could be gathered [
Future data collection could focus on understanding racial or ethnic disparities. By collecting data more intensively from the minority populations, we could go deeper into understanding how their rates of COPD drop so dramatically. Is it related to active peer recommendations for better selfcare in a predominantly White health care system and population? Is it related to successfully tailored smoking prevention or cessation programs? Data pertaining to answering these more specific questions could be collected to enhance our understanding of how best to tailor communications to different demographic or ethnographic groups.
All our models were structured as direct effects. We applied MLR and ML methods with data from CBSAs, which have significant variation in terms of health care access and quality. Using these models as a foundation, we should recognize the interconnectedness (ie, direct, indirect, and interactive) of pollutants and copollutants to fully understand COPD’s complex etiology. Future research could model interaction, moderating, or mediating effects, perhaps with a structural equation model, to identify the direct and indirect effects of COPD, for example, showing how asthma may lead to COPD or to asthmaCOPD overlap [
There are many research knowledge gaps in the health impacts of extreme air pollution, including the effects of interactions between temperature and air pollution on respiratory health due to climate change [
Greenhouse gas emissions may exacerbate overall air quality [
The association between COPD and environmental pollutants, including tropospheric ozone, nitrogen dioxide, sulfur dioxide, and occupational exposures, has been extensively investigated [
Our novel contributions in this paper include the following: (1) integration of multiple publicly available CDC data sources, (2) development of highly accurate models using linear and nonlinear methods, and (3) interpretation of the variable impacts for the best model. Smoking was the number 1 variable impacting the COPD rate, which was expected. Household income was the second most influential predictor variable. Four economic factors spanned the full range of influence, from large (household income) to moderate (education rate) to small (unemployment rate and GDP). The race or ethnicity variable also had a range of impacts, from moderately high (American Indian or Alaska Native percentage) to moderate (Black or Hispanic percentage) to small (White, Asian, or Native Hawaiian or Pacific Islander percentage).
This research demonstrates the power of ML methods in general and a GBT, which produced a highly accurate model of COPD rates. The computational complexity of a GBT enables it to obtain high accuracy, but health care policy makers may be reluctant to adopt it unless they can obtain a rulebased explanation. Furthermore, clinicians typically want to be able to explain, justify, and communicate results to others in an intuitive manner. Finally, there may be legal, auditing, or regulatory requirements concerning transparency. If the method is audited, and it cannot be clearly explained, there may be serious legal or financial consequences [
Importance of model variables for gradient boosted tree.
air quality index
corebased statistical area
Centers for Disease Control and Prevention
chronic obstructive pulmonary disease
gradient boosted tree
gross domestic product
local interpretable modelagnostic explanations
mean absolute error
machine learning
multiple linear regression
root mean square error
Shapley additive explanations
symmetric mean absolute percentage error
The authors thank Malavika Andavilli, Xuhui Bai, Chulin Chen, Shan He, Lanxiang Shao, Moshe Shpits, and Ziyu Tang for contributions to an early version of this paper. The authors also thank the anonymous reviewers at the International Business School, Brandeis University, for their helpful comments.
None declared.