Air pollution is a pressing environmental issue, with many types of pollutants affecting air quality. Among these, fine particulate matter (PM2.5), particles smaller than 2.5 µm in diameter, is particularly harmful. 1 Exposure to PM2.5 is linked to severe health problems, and some regions in the United States experience pollution levels that exceed the World Health Organization’s recommended limits. 2 3
Accurately predicting annual average air pollution concentrations in the U.S. has significant benefits, such as informing public health initiatives and guiding policy decisions. While traditional air pollution measurement methods provide valuable data, their uneven distribution nationwide and limited coverage of PM2.5 levels leave gaps in understanding. 4 To address this problem, we use machine learning to develop a model that predicts annual average PM2.5 concentrations. The model also incorporates climate region as a predictor to account for geographic variability, with the goal of improving prediction accuracy, especially in regions with sparse monitor coverage.
Our data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).
This dataset contains information from air quality monitors placed throughout the United States, although they are not uniformly distributed. There are 48 features (variables) with values for each of the 876 monitors (each monitor is a row/observation).
id: Monitor number
fips: Federal Information Processing Standard (FIPS) code of the county where the monitor is located
lat: Latitude of the monitor in degrees
lon: Longitude of the monitor in degrees
state: State where the monitor is located
county: County where the monitor is located
city: City where the monitor is located
CMAQ: Estimated air pollution values from a computational model called Community Multiscale Air Quality (CMAQ)
zcta: Zip Code Tabulation Area (ZCTA) where the monitor is located
zcta_area: Land area of the ZCTA in square meters
zcta_pop: Population of the ZCTA
imp_a500, imp_a1000, imp_a5000, imp_a10000, imp_a15000: Impervious surface measures
county_area: Land area of the monitor's county in square meters
county_pop: Population of the monitor's county
log_dist_prisec: Natural log of the distance from the monitor to a primary or secondary road
log_pri_5000m, log_pri_10000m, log_pri_15000m, log_pri_25000m: Natural log of the count of primary road length in meters within circles of radius 5,000, 10,000, 15,000, and 25,000 meters, respectively, around the monitor
log_prisec_500m, log_prisec_1000m, log_prisec_5000m, log_prisec_10000m, log_prisec_15000m, log_prisec_25000m: Natural log of the count of primary and secondary road length in meters within circles of radius 500, 1,000, 5,000, 10,000, 15,000, and 25,000 meters, respectively, around the monitor
log_nei_pm25_10000m, log_nei_pm25_15000m, log_nei_pm25_25000m: Natural log of tons of emissions from the major-sources database (annual data), summed over all sources within circles of radius 10,000, 15,000, and 25,000 meters, respectively, around the monitor
log_nei_pm10_10000m, log_nei_pm10_15000m, log_nei_pm10_25000m: Natural log of tons of emissions from the major-sources database (annual data), summed over all sources within circles of radius 10,000, 15,000, and 25,000 meters, respectively, around the monitor
popdens_county: Population density of the county (number of people per square kilometer)
popdens_zcta: Population density of the ZCTA (number of people per square kilometer)
no_hs: Percentage of people in the monitor's ZCTA who do not have a high school degree
some_hs: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was some high school education
hs_grad: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was completing a high school degree
some_college: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was completing some college education
associate: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was completing an associate degree
bachelor: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was a bachelor's degree
grad: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was a graduate degree
poverty: Percentage of people in the monitor's ZCTA who lived in poverty in 2008
hs_or_less: Percentage of people in the monitor's ZCTA whose highest formal educational attainment was a high school degree or less (sum of no_hs, some_hs, and hs_grad)
urc2013, urc2006: 2013 and 2006 urban-rural classifications of the county where the monitor is located
aod: Aerosol Optical Depth measurement from a NASA satellite
To start, we renamed certain lengthy variables for ease of viewing and consistency:
pm <- pm |>
rename(
# Distance to a primary or secondary road (Natural Log)
log_dist_prisec = log_dist_to_prisec,
# Road lengths (Natural Log) - Primary roads
log_pri_5000m = log_pri_length_5000,
log_pri_10000m = log_pri_length_10000,
log_pri_15000m = log_pri_length_15000,
log_pri_25000m = log_pri_length_25000,
# Road lengths (Natural Log) - Primary and Secondary roads
log_prisec_500m = log_prisec_length_500,
log_prisec_1000m = log_prisec_length_1000,
log_prisec_5000m = log_prisec_length_5000,
log_prisec_10000m = log_prisec_length_10000,
log_prisec_15000m = log_prisec_length_15000,
log_prisec_25000m = log_prisec_length_25000,
# NEI PM2.5 emission sums (Natural Log)
log_nei_pm25_10000m = log_nei_2008_pm25_sum_10000,
log_nei_pm25_15000m = log_nei_2008_pm25_sum_15000,
log_nei_pm25_25000m = log_nei_2008_pm25_sum_25000,
# NEI PM10 emission sums (Natural Log)
log_nei_pm10_10000m = log_nei_2008_pm10_sum_10000,
log_nei_pm10_15000m = log_nei_2008_pm10_sum_15000,
log_nei_pm10_25000m = log_nei_2008_pm10_sum_25000,
# Education-related variables
no_hs = nohs,
some_hs = somehs,
hs_grad = hs,
some_college = somecollege,
# Poverty rate
poverty = pov,
hs_or_less = hs_orless
)
To explore our data, we split the monitors into four quadrants using the 40°N latitude and 100°W longitude lines that roughly quarter the contiguous U.S., and compared CMAQ levels across these quadrants. We found that the northwestern quadrant had much lower CMAQ levels than the other three quadrants.
# Creating a new column based on the quadrant of the US the monitor is in
pm_quad <- pm |>
mutate(quadrant = case_when(
lon >= -100 & lat >= 40 ~ "northeast",
lon >= -100 & lat < 40 ~ "southeast",
lon < -100 & lat >= 40 ~ "northwest",
lon < -100 & lat < 40 ~ "southwest"
))
pm_quad |>
ggplot(aes(y = CMAQ, x = quadrant)) +
geom_boxplot(fill = "coral") +
labs(title = "Lowest CMAQ Levels in Northwestern Quadrant",
subtitle = "CMAQ levels per quadrant of the United States",
x = "Quadrant of the U.S.") +
theme_minimal() +
theme(plot.title.position = "plot") +
scale_x_discrete(
labels = c(
"northeast" = "Northeast",
"northwest" = "Northwest",
"southeast" = "Southeast",
"southwest" = "Southwest"
))
We were also interested in examining PM2.5 levels based on population density. In the following box plot you can see that the high population density cluster has very consistent PM2.5 levels—there are no outliers and the box is very thin—while the low population density cluster has a greater spread and several outliers.
pm_cl <- pm |>
mutate(pop_density_cluster = cut(popdens_county, breaks = 3, labels = c("Low", "Medium", "High")))
ggplot(pm_cl, aes(x = pop_density_cluster, y = value)) +
geom_boxplot(fill = "coral") +
labs(title = "PM2.5 Levels Most Widely Distributed in the Low Population Density Cluster",
x = "Population Density Cluster",
y = "PM2.5 Levels") +
theme_minimal()
We also wanted to examine urbanicity and how the level of urbanization is correlated with different pollution levels. To visualize the potential correlation between urbanicity and emissions, we created a correlation matrix:
#Correlation Matrix between Urbanicity and Log NEI Emissions
ggcorr(select(pm, matches("nei|urc")),
hjust = .85,
size = 3,
layout.exp = 2,
label = TRUE) +
labs(title = "Correlation Between Urbanicity and PM 2.5 & PM 10 Levels",
subtitle = "Log NEI Emissions from Major Sources within 10,000m, 15,000m, and 25,000m Radii\nfor 2006 and 2013 Urban-Rural Classifications"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 15, face = "bold"),
plot.subtitle = element_text(size = 12))
The shaded blue areas highlight a negative correlation between urbanicity (urc) and log NEI emissions (log_nei). The negative correlation means that as urc increases (from urban to rural), emission levels tend to decrease. This finding is expected given that urban environments are typically more industrialized, densely populated, and subject to higher vehicle traffic, all of which contribute to elevated pollution levels.
Since there seems to be a correlation between urbanicity and emission levels, we employed sets of boxplots to examine the data more closely. As a reminder, the urbanicity measure spans six categorical values, with 1 being most urban and 6 most rural.
#Reshaped Log NEI Emission and Urbanicity values to long format for easier faceting
pm_long_urc <- pm |>
pivot_longer(cols = starts_with("log_nei_pm25"),
names_to = "log_nei",
values_to = "emissions") |>
pivot_longer(cols = starts_with("urc"),
names_to = "urc_year",
values_to = "urbanicity")
#Facet Labels
facet_labels <- list(
log_nei = c(
"log_nei_pm25_10000m" = "10,000m",
"log_nei_pm25_15000m" = "15,000m",
"log_nei_pm25_25000m" = "25,000m"),
urc_year = c(
"urc2006" = "2006 Classification",
"urc2013" = "2013 Classification"))
#Faceted Boxplots for both urbanicity years and emission levels
ggplot(pm_long_urc, aes(x = urbanicity, y = emissions)) +
geom_boxplot(aes(fill = factor(urbanicity))) +
labs(x = "Urbanicity (1 = Most Urban, 6 = Most Rural)", y = "Log NEI Emissions",
title = "PM2.5 Levels by Urban-Rural Classifications",
subtitle = "Log NEI Emissions from Major Sources within 10,000m, 15,000m, and 25,000m Radii\nfor 2006 and 2013 Urban-Rural Classifications") +
facet_wrap(~ urc_year + log_nei,
labeller = labeller(urc_year = as_labeller(facet_labels$urc_year),
log_nei = as_labeller(facet_labels$log_nei))) +
scale_x_continuous(breaks = 1:6) +
theme_minimal() +
theme(
strip.text.x = element_text(size = 10, hjust = 0.5),
axis.text.x = element_text(size = 8),
plot.title = element_text(size = 15, face = "bold"),
plot.subtitle = element_text(size = 12),
legend.position = "none")
From these boxplots, we observe a clear decrease in average emissions as counties become more rural (higher urbanicity values), further reinforcing the idea that urban counties contribute more significantly to pollution.
To create our first model, we used the pm dataset. First, id, fips, and zcta were coded as factors to ensure they were treated as categorical variables in the model. In addition, we modified the city column, classifying locations as either “In a city” or “Not in a city” based on the original city column values.
pm <- pm |>
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
pm
## # A tibble: 876 × 50
## id value fips lat lon state county city CMAQ zcta zcta_area
## <fct> <dbl> <fct> <dbl> <dbl> <chr> <chr> <chr> <dbl> <fct> <dbl>
## 1 1003.001 9.60 1003 30.5 -87.9 Alabama Baldwin In a… 8.10 36532 190980522
## 2 1027.0001 10.8 1027 33.3 -85.8 Alabama Clay In a… 9.77 36251 374132430
## 3 1033.1002 11.2 1033 34.8 -87.7 Alabama Colbert In a… 9.40 35660 16716984
## 4 1049.1003 11.7 1049 34.3 -86.0 Alabama DeKalb In a… 8.53 35962 203836235
## 5 1055.001 12.4 1055 34.0 -86.0 Alabama Etowah In a… 9.24 35901 154069359
## 6 1069.0003 10.5 1069 31.2 -85.4 Alabama Houston In a… 9.12 36303 162685124
## 7 1073.0023 15.6 1073 33.6 -86.8 Alabama Jeffer… In a… 10.2 35207 26929603
## 8 1073.1005 12.4 1073 33.3 -87.0 Alabama Jeffer… Not … 10.2 35111 166239542
## 9 1073.1009 11.1 1073 33.5 -87.3 Alabama Jeffer… Not … 8.16 35444 385566685
## 10 1073.101 13.1 1073 33.5 -86.5 Alabama Jeffer… In a… 9.30 35094 148994881
## # ℹ 866 more rows
## # ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_prisec <dbl>, log_pri_5000m <dbl>,
## # log_pri_10000m <dbl>, log_pri_15000m <dbl>, log_pri_25000m <dbl>,
## # log_prisec_500m <dbl>, log_prisec_1000m <dbl>, log_prisec_5000m <dbl>,
## # log_prisec_10000m <dbl>, log_prisec_15000m <dbl>, …
set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm, prop = 0.7)
pm_split
## <Training/Testing/Total>
## <613/263/876>
We split our data into training and testing subsets, with 70% of the data allocated to the training set and the remaining 30% reserved for testing.
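The chunk that extracts the training and testing sets is not shown; a minimal sketch using rsample, assuming the pm_split object from the previous chunk:

```r
# Sketch: pull the training (70%) and testing (30%) sets out of the split
library(rsample)

train_pm <- rsample::training(pm_split)  # 613 monitors
test_pm  <- rsample::testing(pm_split)   # 263 monitors
```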
Next, we created a 4-fold cross-validation object from the train_pm dataset to divide the training data into four subsets (folds) for cross-validation, enabling model performance evaluation across multiple splits.
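The chunk creating this object is omitted; vfold_pm, used throughout the rest of the analysis, was presumably built along these lines (the seed here is an assumption, as it is not shown in the original):

```r
# Sketch: 4-fold cross-validation object from the training data
set.seed(456)  # assumed seed; the actual seed is not shown
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
```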
## # 4-fold cross-validation
## # A tibble: 4 × 2
## splits id
## <list> <chr>
## 1 <split [459/154]> Fold1
## 2 <split [460/153]> Fold2
## 3 <split [460/153]> Fold3
## 4 <split [460/153]> Fold4
We then defined a preprocessing recipe, RF_rec, to prepare the train_pm dataset for training a random forest model by cleaning and optimizing the features. Assigning roles:
All variables (everything()) are first assigned the role of “predictor” for model input.
value was reassigned as the response/outcome variable.
id was reassigned as the “id variable” to uniquely identify observations.
We used step_novel("state") to prepare the model to handle any unseen levels in the state variable during cross-validation. We also converted categorical columns into factors, dropped the unnecessary categorical features, and removed variables that are highly correlated or have near-zero variance.
RF_rec <- recipe(train_pm) |>
update_role(everything(), new_role = "predictor")|>
update_role(value, new_role = "outcome")|>
update_role(id, new_role = "id variable") |>
update_role("fips", new_role = "county id") |>
step_novel("state") |>
step_string2factor("state", "county", "city") |>
step_rm("county") |>
step_rm("zcta") |>
step_corr(all_numeric())|>
step_nzv(all_numeric())
We used a workflow to combine the preprocessing recipe and the predictive model steps for streamlined modeling.
RF_PM_model <- parsnip::rand_forest(mtry = 10, min_n = 3) |>
set_engine("randomForest") |>
set_mode("regression")
RF_PM_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 10
## min_n = 3
##
## Computational engine: randomForest
RF_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(RF_PM_model)
RF_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 10
## min_n = 3
##
## Computational engine: randomForest
We fit the RF_wflow workflow to the training dataset, train_pm.
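The fitting call itself is not shown; a minimal sketch consistent with the trained-workflow output below, assuming train_pm holds the training set:

```r
# Sketch: fit the workflow (recipe + random forest) on the training data
library(workflows)
library(parsnip)

RF_fit <- fit(RF_wflow, data = train_pm)
RF_fit
```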
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10, x), nodesize = min_rows(~3, x))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.662351
## % Var explained: 58.07
We generated a variable importance plot for the fitted Random Forest model.
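The plotting code is not shown; one common approach, sketched here assuming the vip package and a fitted workflow named RF_fit, is:

```r
# Sketch: variable importance from the fitted random forest
library(vip)
library(workflows)

RF_fit |>
  workflows::extract_fit_parsnip() |>  # pull out the underlying randomForest fit
  vip::vip(num_features = 10)          # plot the 10 most important predictors
```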
We performed cross-validation and collected the performance metrics for the RF_wflow workflow using the vfold_pm cross-validation object.
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.68 4 0.0676 Preprocessor1_Model1
## 2 rsq standard 0.579 4 0.0417 Preprocessor1_Model1
We defined a rand_forest model with hyperparameters to be tuned.
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
set_engine("randomForest") |>
set_mode("regression")
tune_RF_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
We defined a workflow, RF_tune_wflow, that combines the preprocessing recipe and the tunable random forest model.
RF_tune_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(tune_RF_model)
RF_tune_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
## [1] 128
We performed hyperparameter tuning for the random forest model within the RF_tune_wflow workflow using grid search. The model was evaluated using the vfold_pm cross-validation object (which contains 4 folds) across a grid of 20 different hyperparameter combinations.
doParallel::registerDoParallel(cores = n_cores)
set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation
## # A tibble: 4 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [459/154]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [460/153]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [460/153]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [460/153]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>
We collected the performance metrics from the hyperparameter tuning process stored in tune_RF_results.
## # A tibble: 40 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 12 33 rmse standard 1.72 4 0.0802 Preprocessor1_Model01
## 2 12 33 rsq standard 0.551 4 0.0457 Preprocessor1_Model01
## 3 27 35 rmse standard 1.71 4 0.0773 Preprocessor1_Model02
## 4 27 35 rsq standard 0.542 4 0.0551 Preprocessor1_Model02
## 5 22 40 rmse standard 1.73 4 0.0705 Preprocessor1_Model03
## 6 22 40 rsq standard 0.537 4 0.0502 Preprocessor1_Model03
## 7 1 27 rmse standard 2.02 4 0.101 Preprocessor1_Model04
## 8 1 27 rsq standard 0.433 4 0.0645 Preprocessor1_Model04
## 9 6 32 rmse standard 1.78 4 0.0892 Preprocessor1_Model05
## 10 6 32 rsq standard 0.537 4 0.0494 Preprocessor1_Model05
## # ℹ 30 more rows
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 4 rmse standard 1.67 4 0.0687 Preprocessor1_Model17
We selected the best hyperparameter combination based on the root mean square error (RMSE) from the results of the hyperparameter tuning process.
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 18 4 Preprocessor1_Model17
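The selection call itself is omitted; tuned_RF_values, referenced in the next chunk, was presumably created along these lines:

```r
# Sketch: select the hyperparameter combination with the lowest RMSE
library(tune)
tuned_RF_values <- tune::select_best(tune_RF_results, metric = "rmse")
tuned_RF_values
```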
We finalized the random forest workflow by applying the best hyperparameters identified through grid search. The model was then trained on the training data and evaluated on the test data using the last_fit() function. Finally, performance metrics such as R² and RMSE were collected.
# specify best combination from tune in workflow
RF_tuned_wflow <- RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)
# fit the finalized workflow on the training set and evaluate on the test set
overallfit <- RF_tuned_wflow |>
  tune::last_fit(pm_split)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.76 Preprocessor1_Model1
## 2 rsq standard 0.619 Preprocessor1_Model1
## # A tibble: 263 × 5
## .pred id .row value .config
## <dbl> <chr> <int> <dbl> <chr>
## 1 12.0 train/test split 3 11.2 Preprocessor1_Model1
## 2 11.7 train/test split 5 12.4 Preprocessor1_Model1
## 3 11.1 train/test split 6 10.5 Preprocessor1_Model1
## 4 13.1 train/test split 7 15.6 Preprocessor1_Model1
## 5 12.0 train/test split 8 12.4 Preprocessor1_Model1
## 6 10.8 train/test split 9 11.1 Preprocessor1_Model1
## 7 11.0 train/test split 16 10.0 Preprocessor1_Model1
## 8 11.5 train/test split 18 12.0 Preprocessor1_Model1
## 9 12.1 train/test split 20 13.2 Preprocessor1_Model1
## 10 11.2 train/test split 24 11.7 Preprocessor1_Model1
## # ℹ 253 more rows
For our extension of the analysis, we copied the pm dataframe into a new dataframe and added a column called climate_region. Each value is one of nine climate regions, determined by the state the monitor is located in. These climate regions are defined by the National Centers for Environmental Information as “climatically consistent regions within the contiguous United States which are useful for putting current climate anomalies into a historical perspective”. 5
climate_regions <- c(
"Connecticut" = "Northeast",
"Delaware" = "Northeast",
"District Of Columbia" = "Northeast",
"Maine" = "Northeast",
"Maryland" = "Northeast",
"Massachusetts" = "Northeast",
"New Hampshire" = "Northeast",
"New Jersey" = "Northeast",
"New York" = "Northeast",
"Pennsylvania" = "Northeast",
"Rhode Island" = "Northeast",
"Vermont" = "Northeast",
"Iowa" = "Upper Midwest",
"Michigan" = "Upper Midwest",
"Minnesota" = "Upper Midwest",
"Wisconsin" = "Upper Midwest",
"Illinois" = "Ohio Valley",
"Indiana" = "Ohio Valley",
"Kentucky" = "Ohio Valley",
"Missouri" = "Ohio Valley",
"Ohio" = "Ohio Valley",
"Tennessee" = "Ohio Valley",
"West Virginia" = "Ohio Valley",
"Alabama" = "Southeast",
"Florida" = "Southeast",
"Georgia" = "Southeast",
"North Carolina" = "Southeast",
"South Carolina" = "Southeast",
"Virginia" = "Southeast",
"Montana" = "Northern Rockies and Plains",
"Nebraska" = "Northern Rockies and Plains",
"North Dakota" = "Northern Rockies and Plains",
"South Dakota" = "Northern Rockies and Plains",
"Wyoming" = "Northern Rockies and Plains",
"Arkansas" = "South",
"Kansas" = "South",
"Louisiana" = "South",
"Mississippi" = "South",
"Oklahoma" = "South",
"Texas" = "South",
"Arizona" = "Southwest",
"Colorado" = "Southwest",
"New Mexico" = "Southwest",
"Utah" = "Southwest",
"Idaho" = "Northwest",
"Oregon" = "Northwest",
"Washington" = "Northwest",
"California" = "West",
"Nevada" = "West"
)
# Add a new column with region labels
pm_clim <- pm |>
mutate(climate_region = recode(state, !!!climate_regions))
The boxplot below illustrates PM2.5 concentrations across the nine climate regions, showcasing the varying levels of air pollution experienced in each region.
pm_clim |>
ggplot(aes(x = climate_region, y = value, fill = climate_region)) +
geom_boxplot() +
theme_classic() +
labs(title = "Distribution of PM2.5 Concentrations by Climate Region",
x = "Climate Region",
y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
To create the second model that includes climate regions as a factor,
we used the pm_clim
dataset. We followed the same steps,
except for some changes that are noted.
pm_clim <- pm_clim |>
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
pm_clim
## # A tibble: 876 × 51
## id value fips lat lon state county city CMAQ zcta zcta_area
## <fct> <dbl> <fct> <dbl> <dbl> <chr> <chr> <chr> <dbl> <fct> <dbl>
## 1 1003.001 9.60 1003 30.5 -87.9 Alabama Baldwin In a… 8.10 36532 190980522
## 2 1027.0001 10.8 1027 33.3 -85.8 Alabama Clay In a… 9.77 36251 374132430
## 3 1033.1002 11.2 1033 34.8 -87.7 Alabama Colbert In a… 9.40 35660 16716984
## 4 1049.1003 11.7 1049 34.3 -86.0 Alabama DeKalb In a… 8.53 35962 203836235
## 5 1055.001 12.4 1055 34.0 -86.0 Alabama Etowah In a… 9.24 35901 154069359
## 6 1069.0003 10.5 1069 31.2 -85.4 Alabama Houston In a… 9.12 36303 162685124
## 7 1073.0023 15.6 1073 33.6 -86.8 Alabama Jeffer… In a… 10.2 35207 26929603
## 8 1073.1005 12.4 1073 33.3 -87.0 Alabama Jeffer… Not … 10.2 35111 166239542
## 9 1073.1009 11.1 1073 33.5 -87.3 Alabama Jeffer… Not … 8.16 35444 385566685
## 10 1073.101 13.1 1073 33.5 -86.5 Alabama Jeffer… In a… 9.30 35094 148994881
## # ℹ 866 more rows
## # ℹ 40 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_prisec <dbl>, log_pri_5000m <dbl>,
## # log_pri_10000m <dbl>, log_pri_15000m <dbl>, log_pri_25000m <dbl>,
## # log_prisec_500m <dbl>, log_prisec_1000m <dbl>, log_prisec_5000m <dbl>,
## # log_prisec_10000m <dbl>, log_prisec_15000m <dbl>, …
set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm_clim, prop = 0.7)
pm_split
## <Training/Testing/Total>
## <613/263/876>
We created a 5-fold cross-validation object from the train_pm dataset to divide the training data into five subsets for cross-validation. Increasing v from 4 to 5 yielded higher test accuracy for the final model.
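As before, the chunk creating the cross-validation object is omitted; a sketch with v = 5 (the seed here is an assumption):

```r
# Sketch: 5-fold cross-validation object for the second model
set.seed(456)  # assumed seed; the actual seed is not shown
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 5)
vfold_pm
```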
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [490/123]> Fold1
## 2 <split [490/123]> Fold2
## 3 <split [490/123]> Fold3
## 4 <split [491/122]> Fold4
## 5 <split [491/122]> Fold5
When creating the recipe, we ensured climate_region was included as a factor.
RF_rec <- recipe(train_pm) |>
update_role(everything(), new_role = "predictor")|>
update_role(value, new_role = "outcome")|>
update_role(id, new_role = "id variable") |>
update_role("fips", new_role = "county id") |>
step_novel("state") |>
step_string2factor("state", "county", "city", "climate_region") |>
step_rm("county") |>
step_rm("zcta") |>
step_corr(all_numeric())|>
step_nzv(all_numeric())
RF_PM_model <- parsnip::rand_forest(mtry = 10, min_n = 3) |>
set_engine("randomForest") |>
set_mode("regression")
RF_PM_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 10
## min_n = 3
##
## Computational engine: randomForest
RF_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(RF_PM_model)
RF_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 10
## min_n = 3
##
## Computational engine: randomForest
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10, x), nodesize = min_rows(~3, x))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.613063
## % Var explained: 58.84
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.66 5 0.0982 Preprocessor1_Model1
## 2 rsq standard 0.582 5 0.0438 Preprocessor1_Model1
We now see a slightly improved R² value.
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
set_engine("randomForest") |>
set_mode("regression")
tune_RF_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
RF_tune_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(tune_RF_model)
RF_tune_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
## [1] 128
doParallel::registerDoParallel(cores = n_cores)
set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [490/123]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [490/123]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [490/123]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [491/122]> Fold4 <tibble [40 × 6]> <tibble [1 × 3]>
## 5 <split [491/122]> Fold5 <tibble [40 × 6]> <tibble [0 × 3]>
##
## There were issues with some computations:
##
## - Warning(s) x1: 37 columns were requested but there were 36 predictors in the dat...
##
## Run `show_notes(.Last.tune.result)` for more information.
## # A tibble: 40 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 13 33 rmse standard 1.71 5 0.108 Preprocessor1_Model01
## 2 13 33 rsq standard 0.548 5 0.0453 Preprocessor1_Model01
## 3 28 35 rmse standard 1.72 5 0.102 Preprocessor1_Model02
## 4 28 35 rsq standard 0.535 5 0.0414 Preprocessor1_Model02
## 5 23 40 rmse standard 1.72 5 0.101 Preprocessor1_Model03
## 6 23 40 rsq standard 0.536 5 0.0414 Preprocessor1_Model03
## 7 1 27 rmse standard 2.00 5 0.100 Preprocessor1_Model04
## 8 1 27 rsq standard 0.451 5 0.0338 Preprocessor1_Model04
## 9 6 32 rmse standard 1.74 5 0.103 Preprocessor1_Model05
## 10 6 32 rsq standard 0.551 5 0.0378 Preprocessor1_Model05
## # ℹ 30 more rows
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 7 rmse standard 1.66 5 0.0997 Preprocessor1_Model09
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 15 7 Preprocessor1_Model09
# specify best combination from tune in workflow
RF_tuned_wflow <- RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)
# fit the finalized workflow on the training set and evaluate on the test set
overallfit <- RF_tuned_wflow |>
  tune::last_fit(pm_split)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.73 Preprocessor1_Model1
## 2 rsq standard 0.626 Preprocessor1_Model1
We now see a slightly improved R² value on the test data.
## # A tibble: 263 × 5
## .pred id .row value .config
## <dbl> <chr> <int> <dbl> <chr>
## 1 12.0 train/test split 3 11.2 Preprocessor1_Model1
## 2 11.8 train/test split 5 12.4 Preprocessor1_Model1
## 3 11.1 train/test split 6 10.5 Preprocessor1_Model1
## 4 13.1 train/test split 7 15.6 Preprocessor1_Model1
## 5 12.0 train/test split 8 12.4 Preprocessor1_Model1
## 6 10.8 train/test split 9 11.1 Preprocessor1_Model1
## 7 11.2 train/test split 16 10.0 Preprocessor1_Model1
## 8 11.6 train/test split 18 12.0 Preprocessor1_Model1
## 9 12.1 train/test split 20 13.2 Preprocessor1_Model1
## 10 11.3 train/test split 24 11.7 Preprocessor1_Model1
## # ℹ 253 more rows
Our extension of the analysis demonstrated that we can improve model predictions by including the climate regions of the U.S. as a predictor. Our initial model's predictors of annual air pollution levels explained 61.9% of the variance in the outcome variable (PM2.5 levels) on the test set. When we included information about climate regions, the variance explained increased to 62.6%, suggesting that climate regions play a role in understanding air pollution. Our analysis also revealed that the variable state was the most important predictor, showing how strongly location affects pollution levels.
In other words, after adding a new variable based on the climate region in which each monitor is located, the test-set R² improved from 0.619 to 0.626. Overall, our results highlight the importance of considering both geographic and regional factors when studying air quality and finding ways to manage pollution.
https://www.epa.gov/pm-pollution/particulate-matter-pm-basics↩︎
https://www.epa.gov/pm-pollution/particulate-matter-pm-basics↩︎
https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf↩︎
https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63↩︎
https://www.ncei.noaa.gov/access/monitoring/reference-maps/us-climate-regions↩︎