Introduction

Air pollution is a pressing environmental issue, with various types of pollutants affecting air quality. Among these, fine particulate matter (PM2.5)—particles smaller than 2.5 µm in diameter—is particularly harmful. 1 Exposure to PM2.5 is linked to severe health problems, and some regions in the United States experience pollution levels that exceed the World Health Organization’s recommended limits. 2 3

Accurately predicting annual average air pollution concentrations in the U.S. has significant benefits, such as informing public health initiatives and guiding policy decisions. While traditional air pollution measurement methods provide valuable data, monitors are unevenly distributed nationwide and cover PM2.5 levels incompletely, leaving gaps in understanding. 4 To address this problem, we use machine learning to develop a model aimed at improving the accuracy of air pollution predictions. The model also incorporates climate region as a factor to account for geographic variability, seeking to improve prediction accuracy especially in regions with sparse monitor coverage.

Questions

  • With what accuracy can we predict US annual average air pollution concentrations?
  • How does incorporating a climate region category affect the accuracy of PM2.5 concentration predictions?

Load packages

library(tidyverse)
library(tidymodels)
library(olsrr)
library(GGally)

The Data

Our data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

This dataset contains information from air quality monitors placed throughout the United States, although they are not uniformly distributed. There are 48 features (variables) with values for each of the 876 monitors (each monitor is a row/observation).

The features are:
  • id | Monitor number
    • The county number is indicated before the decimal and the monitor number after the decimal
    • Example: 1073.0023 is in Jefferson County (county code 1073), and .0023 identifies one of that county's 8 monitors (see the parsing sketch after the data import below)
  • fips | Federal information processing standard number for the county where the monitor is located
    • 5-digit ID code for counties (a leading zero is common and is sometimes dropped)
    • First two digits indicate the state; last three digits indicate the county
    • Example: Alabama’s state code is 01 because it is first alphabetically
    • Note: Alaska and Hawaii are not included because they are not part of the contiguous US
  • lat | Latitude of the monitor in degrees
  • lon | Longitude of the monitor in degrees
  • state | State where the monitor is located
  • county | County where the monitor is located
  • city | City where the monitor is located
  • CMAQ | Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
    • A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict air pollution
    • Data from EPA
  • zcta | Zip Code Tabulation Area where the monitor is located
    • Postal Zip codes are converted into “generalized areal representations” that are non-overlapping
    • Data from the 2010 Census
  • zcta_area | Land area of the zip code area in square meters
    • Data from the 2010 Census
  • zcta_pop | Population in the zip code area
    • Data from the 2010 Census
  • imp_a500, imp_a1000, imp_a5000, imp_a10000, imp_a15000 | Impervious surface measure
    • Impervious surfaces include roads, concrete, parking lots, and buildings
    • Measured within a circle with a radius of 500, 1000, 5000, 10000, and 15000 meters respectively around the monitor
  • county_area | Land area of the county of the monitor in square meters
  • county_pop | Population of the county of the monitor
  • log_dist_prisec | Distance from the monitor to a primary or secondary road (Natural log)
    • Highway or major road
  • log_pri_5000m, log_pri_10000m, log_pri_15000m, log_pri_25000m | Total length in meters of primary roads within a circle with a radius of 5000, 10000, 15000, and 25000 meters respectively around the monitor (Natural log)
    • Highways only
  • log_prisec_500m, log_prisec_1000m, log_prisec_5000m, log_prisec_10000m, log_prisec_15000m, log_prisec_25000m | Total length in meters of primary and secondary roads within a circle with a radius of 500, 1000, 5000, 10000, 15000, and 25000 meters respectively around the monitor (Natural log)
    • Highways and secondary roads
  • log_nei_pm25_10000m, log_nei_pm25_15000m, log_nei_pm25_25000m | Tons of annual PM2.5 emissions from major sources in the 2008 National Emissions Inventory (NEI), summed over all sources within a circle with a radius of 10000, 15000, and 25000 meters respectively around the monitor (Natural log)
    • Fine particulate matter (diameter ≤ 2.5 µm)
  • log_nei_pm10_10000m, log_nei_pm10_15000m, log_nei_pm10_25000m | Tons of annual PM10 emissions from major sources in the 2008 National Emissions Inventory (NEI), summed over all sources within a circle with a radius of 10000, 15000, and 25000 meters respectively around the monitor (Natural log)
    • Coarse particulate matter (diameter ≤ 10 µm)
  • popdens_county | Population density of the county (people per square kilometer)
  • popdens_zcta | Population density of the zcta (people per square kilometer)
  • From the Census:
    • no_hs | Percentage of people in the zcta area where the monitor is located who do not have a high school degree
    • some_hs | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education
    • hs_grad | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree
    • some_college | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education
    • associate | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree
    • bachelor | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor's degree
    • grad | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree
    • poverty | Percentage of people in the zcta area where the monitor is located who lived in poverty in 2008
  • hs_or_less | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of no_hs, some_hs, and hs_grad)
  • urc2013, urc2006 | 2013 and 2006 urban-rural classification of the county where the monitor is located (1 = most urban, 6 = most rural)
  • aod | Aerosol Optical Depth measurement from a NASA satellite
    • Based on the diffraction of a laser
    • Used as a proxy of particulate pollution
    • Unitless; a higher value indicates more pollution
    • Data from NASA

Data Import

pm <- read_csv("data/pm25_data.csv")
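
As a quick sanity check of the id and fips encodings described in the feature list above, we can unpack them with stringr (a minimal sketch; the derived column names fips5, state_code, county_code, and monitor_num are our own illustrative choices):

# Minimal sketch: unpack the id and fips encodings (derived column names are illustrative)
pm |>
  mutate(
    fips5       = str_pad(fips, width = 5, pad = "0"),           # restore any dropped leading zero
    state_code  = str_sub(fips5, 1, 2),                          # first two digits: state
    county_code = str_sub(fips5, 3, 5),                          # last three digits: county
    monitor_num = str_extract(as.character(id), "(?<=\\.)\\d+")  # digits after the decimal in id
  ) |>
  select(id, monitor_num, fips5, state_code, county_code)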

Data Wrangling and EDA

To start, we renamed certain lengthy variables for ease of viewing and consistency:

pm <- pm |>
  rename( 
    
    # Distance to a primary or secondary road (Natural Log)
    log_dist_prisec = log_dist_to_prisec,
    
    # Road lengths (Natural Log) - Primary roads
    log_pri_5000m = log_pri_length_5000,
    log_pri_10000m = log_pri_length_10000,
    log_pri_15000m = log_pri_length_15000,
    log_pri_25000m = log_pri_length_25000,
    
    # Road lengths (Natural Log) - Primary and Secondary roads
    log_prisec_500m = log_prisec_length_500,
    log_prisec_1000m = log_prisec_length_1000,
    log_prisec_5000m = log_prisec_length_5000,
    log_prisec_10000m = log_prisec_length_10000,
    log_prisec_15000m = log_prisec_length_15000,
    log_prisec_25000m = log_prisec_length_25000,
    
    # NEI PM2.5 emission sums (Natural Log)
    log_nei_pm25_10000m = log_nei_2008_pm25_sum_10000,
    log_nei_pm25_15000m = log_nei_2008_pm25_sum_15000,
    log_nei_pm25_25000m = log_nei_2008_pm25_sum_25000,
    
    # NEI PM10 emission sums (Natural Log)
    log_nei_pm10_10000m = log_nei_2008_pm10_sum_10000,
    log_nei_pm10_15000m = log_nei_2008_pm10_sum_15000,
    log_nei_pm10_25000m = log_nei_2008_pm10_sum_25000,
    
    # Education-related variables
    no_hs = nohs,
    some_hs = somehs,
    hs_grad = hs,
    some_college = somecollege,
    
    # Poverty rate
    poverty = pov,
    hs_or_less = hs_orless
)

To explore our data, we split the monitors into four quadrants at latitude 40°N and longitude 100°W, which roughly bisect the contiguous U.S., and compared CMAQ levels across these quadrants. We found that the northwestern quadrant had much lower CMAQ levels than the other three quadrants.

# Creating a new column based on the quadrant of the US the monitor is in
pm_quad <- pm |>
  mutate(quadrant = case_when(
    lon >= -100 & lat >= 40 ~ "northeast",
    lon >= -100 & lat < 40 ~ "southeast",
    lon < -100 & lat >= 40 ~ "northwest",
    lon < -100 & lat < 40 ~ "southwest"
  ))
pm_quad |>
  ggplot(aes(y = CMAQ, x = quadrant)) +
  geom_boxplot(fill = "coral") +
  labs(title = "Lowest CMAQ Levels in Northwestern Quadrant",
       subtitle = "CMAQ levels per quadrant of the United States",
       x = "Quadrant of the U.S.") +
  theme_minimal() +
  theme(plot.title.position = "plot") +
  scale_x_discrete(
    labels = c(
      "northeast" = "Northeast",
      "northwest" = "Northwest",
      "southeast" = "Southeast",
      "southwest" = "Southwest"
    ))

We were also interested in examining PM2.5 levels by population density. In the following box plot, the high population density cluster has very consistent PM2.5 levels (the box is narrow and there are no outliers), while the low population density cluster shows a much greater spread and several outliers.

pm_cl <- pm |> 
  mutate(pop_density_cluster = cut(popdens_county, breaks = 3, labels = c("Low", "Medium", "High")))

ggplot(pm_cl, aes(x = pop_density_cluster, y = value)) +
  geom_boxplot(fill = "coral") +
  labs(title = "PM2.5 Levels Most Widely Distributed in the Low Population Density Cluster",
       x = "Population Density Cluster",
       y = "PM2.5 Levels") +
  theme_minimal()
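
Note that cut(popdens_county, breaks = 3) forms three equal-width bins, so with a right-skewed variable like county population density most monitors are expected to land in the Low cluster, which helps explain the thin box for the High cluster. A quick check of the cluster sizes (sketch):

# Count monitors per equal-width population-density bin
pm_cl |>
  count(pop_density_cluster)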

We also wanted to examine urbanicity and how the level of urbanization is correlated with different pollution levels. To visualize the potential correlation between urbanicity and emissions, we created a correlation matrix:

# Correlation matrix between urbanicity and log NEI emissions
ggcorr(select(pm, matches("nei|urc")), 
       hjust = .85, 
       size = 3, 
       layout.exp = 2, 
       label = TRUE) + 
  labs(title = "Correlation Between Urbanicity and PM 2.5 & PM 10 Levels",
       subtitle = "Log NEI Emissions from Major Sources within 10,000m, 15,000m, and 25,000m Radii\nfor 2006 and 2013 Urban-Rural Classifications"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold"),
    plot.subtitle = element_text(size = 12))

The shaded blue areas highlight a negative correlation between urbanicity (urc) and log NEI emissions (log_nei). The negative correlation means that as urc increases (from urban to rural), emission levels tend to decrease. This finding is expected given that urban environments are typically more industrialized, densely populated, and often have higher vehicle traffic—all contributing to elevated pollution levels.

Since there seems to be a correlation between urbanicity and emission levels, we employed sets of boxplots to examine the data more closely. As a reminder, the urbanicity measure spans six categorical values, with 1 being most urban and 6 most rural.

# Reshape log NEI emission and urbanicity values to long format for easier faceting
pm_long_urc <- pm |>
  pivot_longer(cols = starts_with("log_nei_pm25"), 
               names_to = "log_nei", 
               values_to = "emissions") |>
  pivot_longer(cols = starts_with("urc"), 
               names_to = "urc_year", 
               values_to = "urbanicity")

# Facet labels
facet_labels <- list(
  log_nei = c(
    "log_nei_pm25_10000m" = "10,000m",
    "log_nei_pm25_15000m" = "15,000m",
    "log_nei_pm25_25000m" = "25,000m"),
  urc_year = c(
    "urc2006" = "2006 Classification",
    "urc2013" = "2013 Classification"))

# Faceted boxplots for both urbanicity years and emission radii
ggplot(pm_long_urc, aes(x = urbanicity, y = emissions)) +
  geom_boxplot(aes(fill = factor(urbanicity))) +
  labs(x = "Urbanicity (1 = Most Urban, 6 = Most Rural)", y = "Log NEI Emissions",
       title = "PM2.5 Levels by Urban-Rural Classifications",
       subtitle = "Log NEI Emissions from Major Sources within 10,000m, 15,000m, and 25,000m Radii\nfor 2006 and 2013 Urban-Rural Classifications") +
  facet_wrap(~ urc_year + log_nei,
             labeller = labeller(urc_year = as_labeller(facet_labels$urc_year),
                                 log_nei = as_labeller(facet_labels$log_nei))) +
  scale_x_continuous(breaks = 1:6) +
  theme_minimal() +
  theme(
    strip.text.x = element_text(size = 10, hjust = 0.5),
    axis.text.x = element_text(size = 8),
    plot.title = element_text(size = 15, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.position = "none")

From these boxplots, we observe a clear decrease in median emissions as the classification moves from most urban (1) to most rural (6), further reinforcing the idea that urban counties contribute more significantly to pollution.

Analysis

Model 1

To create our first model, we used the pm dataset. First, id, fips, and zcta were coded as factors to ensure they were treated as categorical variables in the model.

pm <- pm |>
  mutate(across(c(id, fips, zcta), as.factor)) 

In addition, we modified the city column, classifying locations as either “In a city” or “Not in a city” based on the original city column values.

pm <- pm |>
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))
pm
## # A tibble: 876 × 50
##    id        value fips    lat   lon state   county  city   CMAQ zcta  zcta_area
##    <fct>     <dbl> <fct> <dbl> <dbl> <chr>   <chr>   <chr> <dbl> <fct>     <dbl>
##  1 1003.001   9.60 1003   30.5 -87.9 Alabama Baldwin In a…  8.10 36532 190980522
##  2 1027.0001 10.8  1027   33.3 -85.8 Alabama Clay    In a…  9.77 36251 374132430
##  3 1033.1002 11.2  1033   34.8 -87.7 Alabama Colbert In a…  9.40 35660  16716984
##  4 1049.1003 11.7  1049   34.3 -86.0 Alabama DeKalb  In a…  8.53 35962 203836235
##  5 1055.001  12.4  1055   34.0 -86.0 Alabama Etowah  In a…  9.24 35901 154069359
##  6 1069.0003 10.5  1069   31.2 -85.4 Alabama Houston In a…  9.12 36303 162685124
##  7 1073.0023 15.6  1073   33.6 -86.8 Alabama Jeffer… In a… 10.2  35207  26929603
##  8 1073.1005 12.4  1073   33.3 -87.0 Alabama Jeffer… Not … 10.2  35111 166239542
##  9 1073.1009 11.1  1073   33.5 -87.3 Alabama Jeffer… Not …  8.16 35444 385566685
## 10 1073.101  13.1  1073   33.5 -86.5 Alabama Jeffer… In a…  9.30 35094 148994881
## # ℹ 866 more rows
## # ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## #   imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## #   county_pop <dbl>, log_dist_prisec <dbl>, log_pri_5000m <dbl>,
## #   log_pri_10000m <dbl>, log_pri_15000m <dbl>, log_pri_25000m <dbl>,
## #   log_prisec_500m <dbl>, log_prisec_1000m <dbl>, log_prisec_5000m <dbl>,
## #   log_prisec_10000m <dbl>, log_prisec_15000m <dbl>, …
set.seed(1234) # set a seed for reproducibility
pm_split <- rsample::initial_split(data = pm, prop = 0.7)
pm_split
## <Training/Testing/Total>
## <613/263/876>

We split our data into training and testing subsets, with 70% of the data allocated to the training set and the remaining 30% reserved for testing.

train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

Next, we created a 4-fold cross-validation object from the train_pm dataset to divide the training data into four subsets (folds) for cross-validation, enabling model performance evaluation across multiple splits.

set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
vfold_pm
## #  4-fold cross-validation 
## # A tibble: 4 × 2
##   splits            id   
##   <list>            <chr>
## 1 <split [459/154]> Fold1
## 2 <split [460/153]> Fold2
## 3 <split [460/153]> Fold3
## 4 <split [460/153]> Fold4

We then defined a preprocessing recipe, RF_rec, to prepare the train_pm dataset for training a Random Forest model by cleaning and optimizing the features.

  • Assigning roles:

    • All variables (everything()) were first assigned the role of “predictor”
    • value was reassigned as the response/outcome variable
    • id was reassigned as the “id variable” to uniquely identify observations
    • fips was reassigned the role of “county id”
  • We used step_novel("state") to prepare the model to handle any unseen levels in the state variable during cross-validation

  • We converted categorical columns into factors, dropped the unnecessary categorical features, and removed variables that are highly correlated or have near-zero variance

RF_rec <- recipe(train_pm) |>
    update_role(everything(), new_role = "predictor")|>
    update_role(value, new_role = "outcome")|>
    update_role(id, new_role = "id variable") |>
    update_role("fips", new_role = "county id") |>
    step_novel("state") |>
    step_string2factor("state", "county", "city") |>
    step_rm("county") |>
    step_rm("zcta") |>
    step_corr(all_numeric())|>
    step_nzv(all_numeric())
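
To see which predictors survive the correlation and near-zero-variance filters, the recipe can be estimated and applied to the training data (a quick sketch using the standard recipes functions prep() and bake()):

# Estimate the recipe on the training data and inspect the processed predictors
prepped_rec <- prep(RF_rec, training = train_pm)
baked_train <- bake(prepped_rec, new_data = NULL)
ncol(baked_train)  # number of columns remaining after the filtering steps
tidy(prepped_rec)  # one row per step, showing what was trained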

We then specified a Random Forest model and used a workflow to combine it with the preprocessing recipe for streamlined modeling.

RF_PM_model <- parsnip::rand_forest(mtry = 10, min_n = 3) |> 
  set_engine("randomForest") |>
  set_mode("regression")

RF_PM_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 10
##   min_n = 3
## 
## Computational engine: randomForest
RF_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(RF_PM_model)

RF_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 10
##   min_n = 3
## 
## Computational engine: randomForest

We fit the RF_wflow workflow to the training dataset, train_pm.

RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)

RF_wflow_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10,      x), nodesize = min_rows(~3, x)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.662351
##                     % Var explained: 58.07

We generated a variable importance plot for the fitted Random Forest model.

RF_wflow_fit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

We performed cross-validation on the RF_wflow workflow using the vfold_pm cross-validation object and collected the performance metrics.

set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.68      4  0.0676 Preprocessor1_Model1
## 2 rsq     standard   0.579     4  0.0417 Preprocessor1_Model1

We defined a rand_forest model with mtry and min_n marked for tuning.

tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("regression")
    
tune_RF_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest

We defined a workflow, RF_tune_wflow, that combines the preprocessing recipe and the tunable Random Forest model.

RF_tune_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(tune_RF_model)

RF_tune_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest
n_cores <- parallel::detectCores()
n_cores
## [1] 128

We performed hyperparameter tuning for the Random Forest model within the RF_tune_wflow workflow using grid search, evaluating the model with the vfold_pm cross-validation object (4 folds) over a grid of 20 hyperparameter combinations.

doParallel::registerDoParallel(cores = n_cores)

set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation 
## # A tibble: 4 × 4
##   splits            id    .metrics          .notes          
##   <list>            <chr> <list>            <list>          
## 1 <split [459/154]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [460/153]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [460/153]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [460/153]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>

We collected the performance metrics from the hyperparameter tuning process stored in tune_RF_results.

tune_RF_results |>
  collect_metrics()
## # A tibble: 40 × 8
##     mtry min_n .metric .estimator  mean     n std_err .config              
##    <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1    12    33 rmse    standard   1.72      4  0.0802 Preprocessor1_Model01
##  2    12    33 rsq     standard   0.551     4  0.0457 Preprocessor1_Model01
##  3    27    35 rmse    standard   1.71      4  0.0773 Preprocessor1_Model02
##  4    27    35 rsq     standard   0.542     4  0.0551 Preprocessor1_Model02
##  5    22    40 rmse    standard   1.73      4  0.0705 Preprocessor1_Model03
##  6    22    40 rsq     standard   0.537     4  0.0502 Preprocessor1_Model03
##  7     1    27 rmse    standard   2.02      4  0.101  Preprocessor1_Model04
##  8     1    27 rsq     standard   0.433     4  0.0645 Preprocessor1_Model04
##  9     6    32 rmse    standard   1.78      4  0.0892 Preprocessor1_Model05
## 10     6    32 rsq     standard   0.537     4  0.0494 Preprocessor1_Model05
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    18     4 rmse    standard    1.67     4  0.0687 Preprocessor1_Model17
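
To visualize how RMSE and R^2 vary across the sampled mtry and min_n values, we can rely on the autoplot() method that the tune package provides for tuning results (sketch):

# Plot performance metrics across the sampled hyperparameter grid
autoplot(tune_RF_results)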

We selected the best hyperparameter combination based on the root mean square error (RMSE) from the results of the hyperparameter tuning process.

tuned_RF_values <- select_best(tune_RF_results,  metric = "rmse")
tuned_RF_values
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1    18     4 Preprocessor1_Model17

We finalized the Random Forest workflow by applying the best hyperparameters identified through grid search. The model was then trained on the training set and evaluated on the test set using the last_fit() function. Finally, performance metrics such as R^2 and RMSE were collected.

# specify best combination from tune in workflow
RF_tuned_wflow <- RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)

# fit the finalized workflow on the training set and evaluate on the test set
overallfit <- RF_tuned_wflow |>
  tune::last_fit(pm_split)

collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.76  Preprocessor1_Model1
## 2 rsq     standard       0.619 Preprocessor1_Model1
test_predictions <- collect_predictions(overallfit)
test_predictions
## # A tibble: 263 × 5
##    .pred id                .row value .config             
##    <dbl> <chr>            <int> <dbl> <chr>               
##  1  12.0 train/test split     3  11.2 Preprocessor1_Model1
##  2  11.7 train/test split     5  12.4 Preprocessor1_Model1
##  3  11.1 train/test split     6  10.5 Preprocessor1_Model1
##  4  13.1 train/test split     7  15.6 Preprocessor1_Model1
##  5  12.0 train/test split     8  12.4 Preprocessor1_Model1
##  6  10.8 train/test split     9  11.1 Preprocessor1_Model1
##  7  11.0 train/test split    16  10.0 Preprocessor1_Model1
##  8  11.5 train/test split    18  12.0 Preprocessor1_Model1
##  9  12.1 train/test split    20  13.2 Preprocessor1_Model1
## 10  11.2 train/test split    24  11.7 Preprocessor1_Model1
## # ℹ 253 more rows
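
A quick way to assess these predictions is to plot them against the observed values with a 45-degree reference line; points near the line indicate accurate predictions (sketch):

# Predicted vs. observed PM2.5 on the held-out test set
ggplot(test_predictions, aes(x = value, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(linetype = "dashed") +
  labs(x = "Observed PM2.5", y = "Predicted PM2.5",
       title = "Test-Set Predictions vs. Observed Values") +
  theme_minimal()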

Model 2

For our extension of the analysis, we copied the pm dataframe into a new dataframe, pm_clim, and added a column called climate_region, assigning each monitor to one of nine climate regions based on the state it is located in. These climate regions are defined by the National Centers for Environmental Information as “climatically consistent regions within the contiguous United States which are useful for putting current climate anomalies into a historical perspective”. 5

climate_regions <- c(
  "Connecticut" = "Northeast",
  "Delaware" = "Northeast",
  "District Of Columbia" = "Northeast",
  "Maine" = "Northeast",
  "Maryland" = "Northeast",
  "Massachusetts" = "Northeast",
  "New Hampshire" = "Northeast",
  "New Jersey" = "Northeast",
  "New York" = "Northeast",
  "Pennsylvania" = "Northeast",
  "Rhode Island" = "Northeast",
  "Vermont" = "Northeast",
  "Iowa" = "Upper Midwest",
  "Michigan" = "Upper Midwest",
  "Minnesota" = "Upper Midwest",
  "Wisconsin" = "Upper Midwest",
  "Illinois" = "Ohio Valley",
  "Indiana" = "Ohio Valley",
  "Kentucky" = "Ohio Valley",
  "Missouri" = "Ohio Valley",
  "Ohio" = "Ohio Valley",
  "Tennessee" = "Ohio Valley",
  "West Virginia" = "Ohio Valley",
  "Alabama" = "Southeast",
  "Florida" = "Southeast",
  "Georgia" = "Southeast",
  "North Carolina" = "Southeast",
  "South Carolina" = "Southeast",
  "Virginia" = "Southeast",
  "Montana" = "Northern Rockies and Plains",
  "Nebraska" = "Northern Rockies and Plains",
  "North Dakota" = "Northern Rockies and Plains",
  "South Dakota" = "Northern Rockies and Plains",
  "Wyoming" = "Northern Rockies and Plains",
  "Arkansas" = "South",
  "Kansas" = "South",
  "Louisiana" = "South",
  "Mississippi" = "South",
  "Oklahoma" = "South",
  "Texas" = "South",
  "Arizona" = "Southwest",
  "Colorado" = "Southwest",
  "New Mexico" = "Southwest",
  "Utah" = "Southwest",
  "Idaho" = "Northwest",
  "Oregon" = "Northwest",
  "Washington" = "Northwest",
  "California" = "West",
  "Nevada" = "West"
)

# Add a new column with region labels
pm_clim <- pm |>
  mutate(climate_region = recode(state, !!!climate_regions))
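
Because recode() silently leaves unmatched values unchanged, it is worth confirming that every state in the data was mapped to a region (a minimal sanity check):

# Any states missing from the lookup would be returned here (should be empty)
setdiff(unique(pm$state), names(climate_regions))

# Number of monitors per climate region
pm_clim |>
  count(climate_region, sort = TRUE)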

The boxplot below illustrates PM2.5 concentrations across the nine climate regions, showcasing the varying levels of air pollution experienced in each region.

pm_clim |>
  ggplot(aes(x = climate_region, y = value, fill = climate_region)) + 
    geom_boxplot() + 
    theme_classic() + 
    labs(title = "Distribution of PM2.5 Concentrations by Climate Region", 
         x = "Climate Region", 
         y = "Value") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none")

To create the second model that includes climate regions as a factor, we used the pm_clim dataset. We followed the same steps, except for some changes that are noted.

pm_clim <- pm_clim |>
  mutate(across(c(id, fips, zcta), as.factor)) 
pm_clim <- pm_clim |>
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))
pm_clim
## # A tibble: 876 × 51
##    id        value fips    lat   lon state   county  city   CMAQ zcta  zcta_area
##    <fct>     <dbl> <fct> <dbl> <dbl> <chr>   <chr>   <chr> <dbl> <fct>     <dbl>
##  1 1003.001   9.60 1003   30.5 -87.9 Alabama Baldwin In a…  8.10 36532 190980522
##  2 1027.0001 10.8  1027   33.3 -85.8 Alabama Clay    In a…  9.77 36251 374132430
##  3 1033.1002 11.2  1033   34.8 -87.7 Alabama Colbert In a…  9.40 35660  16716984
##  4 1049.1003 11.7  1049   34.3 -86.0 Alabama DeKalb  In a…  8.53 35962 203836235
##  5 1055.001  12.4  1055   34.0 -86.0 Alabama Etowah  In a…  9.24 35901 154069359
##  6 1069.0003 10.5  1069   31.2 -85.4 Alabama Houston In a…  9.12 36303 162685124
##  7 1073.0023 15.6  1073   33.6 -86.8 Alabama Jeffer… In a… 10.2  35207  26929603
##  8 1073.1005 12.4  1073   33.3 -87.0 Alabama Jeffer… Not … 10.2  35111 166239542
##  9 1073.1009 11.1  1073   33.5 -87.3 Alabama Jeffer… Not …  8.16 35444 385566685
## 10 1073.101  13.1  1073   33.5 -86.5 Alabama Jeffer… In a…  9.30 35094 148994881
## # ℹ 866 more rows
## # ℹ 40 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## #   imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## #   county_pop <dbl>, log_dist_prisec <dbl>, log_pri_5000m <dbl>,
## #   log_pri_10000m <dbl>, log_pri_15000m <dbl>, log_pri_25000m <dbl>,
## #   log_prisec_500m <dbl>, log_prisec_1000m <dbl>, log_prisec_5000m <dbl>,
## #   log_prisec_10000m <dbl>, log_prisec_15000m <dbl>, …
set.seed(1234) # same seed as before
pm_split <- rsample::initial_split(data = pm_clim, prop = 0.7)
pm_split
## <Training/Testing/Total>
## <613/263/876>
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

We created a 5-fold cross-validation object from the train_pm dataset to divide the training data into five subsets for cross-validation. In our experiments, increasing v from 4 to 5 slightly improved test accuracy for the final model.

set.seed(1234)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 5)
vfold_pm
## #  5-fold cross-validation 
## # A tibble: 5 × 2
##   splits            id   
##   <list>            <chr>
## 1 <split [490/123]> Fold1
## 2 <split [490/123]> Fold2
## 3 <split [490/123]> Fold3
## 4 <split [491/122]> Fold4
## 5 <split [491/122]> Fold5

When creating the recipe, we ensured climate_region was included as a factor.

RF_rec <- recipe(train_pm) |>
    update_role(everything(), new_role = "predictor")|>
    update_role(value, new_role = "outcome")|>
    update_role(id, new_role = "id variable") |>
    update_role("fips", new_role = "county id") |>
    step_novel("state") |>
    step_string2factor("state", "county", "city", "climate_region") |>
    step_rm("county") |>
    step_rm("zcta") |>
    step_corr(all_numeric())|>
    step_nzv(all_numeric())
RF_PM_model <- parsnip::rand_forest(mtry = 10, min_n = 3) |> 
  set_engine("randomForest") |>
  set_mode("regression")

RF_PM_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 10
##   min_n = 3
## 
## Computational engine: randomForest
RF_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(RF_PM_model)

RF_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 10
##   min_n = 3
## 
## Computational engine: randomForest
RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)

RF_wflow_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10,      x), nodesize = min_rows(~3, x)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.613063
##                     % Var explained: 58.84
RF_wflow_fit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.66      5  0.0982 Preprocessor1_Model1
## 2 rsq     standard   0.582     5  0.0438 Preprocessor1_Model1

We now see a slightly improved R^2 value.

tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("regression")
    
tune_RF_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest
RF_tune_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(tune_RF_model)

RF_tune_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest
n_cores <- parallel::detectCores()
n_cores
## [1] 128
doParallel::registerDoParallel(cores = n_cores)

set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits            id    .metrics          .notes          
##   <list>            <chr> <list>            <list>          
## 1 <split [490/123]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [490/123]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [490/123]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [491/122]> Fold4 <tibble [40 × 6]> <tibble [1 × 3]>
## 5 <split [491/122]> Fold5 <tibble [40 × 6]> <tibble [0 × 3]>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: 37 columns were requested but there were 36 predictors in the dat...
## 
## Run `show_notes(.Last.tune.result)` for more information.
tune_RF_results |>
  collect_metrics()
## # A tibble: 40 × 8
##     mtry min_n .metric .estimator  mean     n std_err .config              
##    <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1    13    33 rmse    standard   1.71      5  0.108  Preprocessor1_Model01
##  2    13    33 rsq     standard   0.548     5  0.0453 Preprocessor1_Model01
##  3    28    35 rmse    standard   1.72      5  0.102  Preprocessor1_Model02
##  4    28    35 rsq     standard   0.535     5  0.0414 Preprocessor1_Model02
##  5    23    40 rmse    standard   1.72      5  0.101  Preprocessor1_Model03
##  6    23    40 rsq     standard   0.536     5  0.0414 Preprocessor1_Model03
##  7     1    27 rmse    standard   2.00      5  0.100  Preprocessor1_Model04
##  8     1    27 rsq     standard   0.451     5  0.0338 Preprocessor1_Model04
##  9     6    32 rmse    standard   1.74      5  0.103  Preprocessor1_Model05
## 10     6    32 rsq     standard   0.551     5  0.0378 Preprocessor1_Model05
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    15     7 rmse    standard    1.66     5  0.0997 Preprocessor1_Model09
tuned_RF_values <- select_best(tune_RF_results,  metric = "rmse")
tuned_RF_values
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1    15     7 Preprocessor1_Model09
# specify best combination from tune in workflow
RF_tuned_wflow <- RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)

# fit the finalized workflow on the training set and evaluate on the test set
overallfit <- RF_tuned_wflow |>
  tune::last_fit(pm_split)

collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.73  Preprocessor1_Model1
## 2 rsq     standard       0.626 Preprocessor1_Model1

We now see a slightly improved R^2 value on the test data.

test_predictions <- collect_predictions(overallfit)
test_predictions
## # A tibble: 263 × 5
##    .pred id                .row value .config             
##    <dbl> <chr>            <int> <dbl> <chr>               
##  1  12.0 train/test split     3  11.2 Preprocessor1_Model1
##  2  11.8 train/test split     5  12.4 Preprocessor1_Model1
##  3  11.1 train/test split     6  10.5 Preprocessor1_Model1
##  4  13.1 train/test split     7  15.6 Preprocessor1_Model1
##  5  12.0 train/test split     8  12.4 Preprocessor1_Model1
##  6  10.8 train/test split     9  11.1 Preprocessor1_Model1
##  7  11.2 train/test split    16  10.0 Preprocessor1_Model1
##  8  11.6 train/test split    18  12.0 Preprocessor1_Model1
##  9  12.1 train/test split    20  13.2 Preprocessor1_Model1
## 10  11.3 train/test split    24  11.7 Preprocessor1_Model1
## # ℹ 253 more rows

Results & Discussion

Our extension of the analysis demonstrated that we can improve model predictions by including the climate region of each monitor as a predictor. Our initial model's predictors explained 61.9% of the variance in annual average PM2.5 levels on the test set. When we included information about climate regions, the variance explained increased to 62.6%, suggesting that climate regions play a role in understanding air pollution. The variable importance analysis also revealed that state was the most important predictor, showing how strongly location affects pollution levels.
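
For a side-by-side view of the two test-set results, the last_fit() metrics can be combined, assuming the two fits were saved under separate names such as overallfit_m1 and overallfit_m2 (hypothetical names; in the code above both fits were stored as overallfit, so the second overwrote the first):

# Hypothetical object names: overallfit_m1 (Model 1), overallfit_m2 (Model 2 with climate_region)
bind_rows(
  collect_metrics(overallfit_m1) |> mutate(model = "Model 1"),
  collect_metrics(overallfit_m2) |> mutate(model = "Model 2 (+ climate region)")
) |>
  select(model, .metric, .estimate)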

Conclusion

Our initial model explained 61.9% of the variance in annual average PM2.5 levels (test-set R^2 = 0.619). After adding a new variable based on the climate region in which each monitor is located, the variance explained improved from 61.9% to 62.6%. Overall, our results highlight the importance of considering both geographic and regional factors when studying air quality and finding ways to manage pollution.