Introduction to the {tidymodels} Machine Learning Ecosystem

Introduction

This January I was fortunate enough to attend rstudio::conf(2020), the official conference hosted by RStudio, PBC, creators of the RStudio IDE and major contributors to the {tidyverse} ecosystem that has become the de facto standard for data importation, manipulation, and visualization in R.

I could not have asked for a better time for this conference to land in my backyard (San Francisco). As of this writing, I am only 7 months removed from graduating with my master’s degree. My graduate training left me with a deep understanding of linear models and design-based causal inference, but with little or no training in other types of predictive modeling, unsupervised machine learning, version control, or putting models into production.

Being able to attend this conference in my first year as a data professional is an enormous blessing, and I thank all of the workshop leaders, TAs, session presenters, and conference attendees for creating such a welcoming environment. Every interaction I had at the conference was a positive learning moment.

With that, I hope to pay it forward by sharing some of what I learned at the conference.

The first two days of the conference were divided into 19 workshops, each running from 9 to 5 on both days. I chose the Applied Machine Learning workshop to fill the gap in my knowledge about machine learning models beyond OLS and logistic regression. Max Kuhn and Davis Vaughan were the two workshop leaders, and I knew they were in the process of developing the {tidymodels} ecosystem, which stands to be a successor to their popular {caret} package and promises to fill the modeling gap in the {tidyverse} ecosystem. This was an amazing opportunity to both fill in the gaps and learn from the package developers themselves.

My notes for this workshop were incredibly sparse, in no small part because the workshop materials (which are free and available online from the workshop’s GitHub repo) are very detailed.

Instead of sharing those, I’ve decided to revisit a dataset we worked with during the workshop and present an example of a {tidymodels} workflow from start to finish: from sample splitting, to data preprocessing, to modeling, to tuning hyperparameters, to packaging it all up into a single workflow object.

We’ll be using the Ames Housing dataset, which contains 81 variables and 2930 observations; our dependent variable is Sale_Price. Obviously, in an actual analysis we would spend much more time exploring this dataset, but for the sole purpose of demonstrating the {tidymodels} workflow, we’ll just perform a variety of preprocessing, throw the kitchen sink at the data, then fit a Lasso model and a tuned elastic net model.

Let’s start by inspecting the data.

library(tidymodels)
library(AmesHousing)

ames <- make_ames()

ames %>% 
  head() %>% 
  knitr::kable()
[Table: the first six rows of ames across all 81 columns, from MS_SubClass and MS_Zoning through Sale_Price, Longitude, and Latitude; far too wide to reproduce legibly here.]

Tons of (probably) strongly correlated variables here. In the real world, we’d probably spend some time thinking about which variables are strictly necessary.

Stratified training/test splits

First, let’s split the data. Here we’ll use one of the neat features of {rsample}, a package within the {tidymodels} ecosystem, which lets us perform stratified sampling on our dependent variable to ensure better balance between the sets. We’ll stick with the default, a 75-25 training-test split.

## Setting seed
set.seed(1)

## Generate split
ames_split <- initial_split(ames, strata = "Sale_Price")

## Printing the split object gives us <rows in training set/rows in testing set/total rows>
ames_split
## <2199/731/2930>
## Calling training() on this object will give us our training set, and calling testing() on it will give us our testing set
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
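
As a quick sanity check (my own aside, not from the workshop materials): stratified sampling should leave the Sale_Price distribution nearly identical across the two sets, and comparing quartiles is an easy way to eyeball this.

## Sale_Price quartiles should be nearly identical across the two sets
quantile(ames_train$Sale_Price, probs = c(0.25, 0.5, 0.75))
quantile(ames_test$Sale_Price, probs = c(0.25, 0.5, 0.75))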

ames_train %>% 
  head() %>% 
  knitr::kable()
[Table: the first six rows of ames_train, with the same 81 columns as above.]

Data pre-processing

Now let’s preprocess our data using the {recipes} package, also part of the {tidymodels} ecosystem. To do this, we’ll first specify our formula and our data, and then chain together the preprocessing steps we want. To demonstrate the range of what can be done with {recipes}, we’ll: log-transform our dependent variable (Sale_Price); remove variables whose names match "Qual" or "Cond" (these are subjective ratings made by the appraiser on or after the sale, and we want to predict Sale_Price before the sale!); create dummy variables out of our factor variables; center and scale our predictors; run PCA on the 13 variables containing "SF" or "Area", keeping enough components to capture 75% of the variation in those variables; and finally remove any near-zero-variance predictors. This sounds like a lot, but the {recipes} package makes the preprocessing almost self-documenting!

ames_rec <- recipe(
  Sale_Price ~ .,
  data = ames_train
) %>% 
  step_log(Sale_Price, base = 10) %>%
  step_rm(matches("Qual"), matches("Cond")) %>% 
  step_dummy(all_nominal()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_pca(contains("SF"), contains("Area"), threshold = .75) %>% 
  step_nzv(all_predictors())

ames_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Operations:
## 
## Log transformation on Sale_Price
## Delete terms matches, Qual, matches, Cond
## Dummy variables from all_nominal
## Centering for all_predictors
## Scaling for all_predictors
## No PCA components were extracted.
## Sparse, unbalanced variable filter on all_predictors

The next step is to prepare, or prep(), this recipe, which estimates any parameters needed by the preprocessing steps from the training set so they can later be applied to other datasets.

ames_rec_trained <- prep(ames_rec, training = ames_train, verbose = TRUE)
## oper 1 step log [training] 
## oper 2 step rm [training] 
## oper 3 step dummy [training] 
## oper 4 step center [training] 
## oper 5 step scale [training] 
## oper 6 step pca [training] 
## oper 7 step nzv [training] 
## The retained training set is ~ 1.47 Mb  in memory.
ames_rec_trained
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Training data contained 2199 data points and no missing data.
## 
## Operations:
## 
## Log transformation on Sale_Price [trained]
## Variables removed Overall_Qual, Exter_Qual, Bsmt_Qual, ... [trained]
## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, ... [trained]
## Centering for Lot_Frontage, Lot_Area, ... [trained]
## Scaling for Lot_Frontage, Lot_Area, ... [trained]
## PCA extraction with BsmtFin_SF_1, BsmtFin_SF_2, ... [trained]
## Sparse, unbalanced variable filter removed Kitchen_AbvGr, ... [trained]

Now we can “juice” the prepared recipe, which gives us our preprocessed training set. Let’s take a look at our PCA extraction.

ames_rec_trained %>% 
  juice() %>% 
  select(starts_with("PC"))
## # A tibble: 2,199 x 7
##         PC1    PC2    PC3     PC4     PC5     PC6     PC7
##       <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 -1.79     1.20   0.783  0.194   0.120   0.204   2.13  
##  2 -0.625    1.51   0.164 -0.0112  0.0919  1.50    0.256 
##  3 -2.57    -0.201 -0.677  0.723   1.08   -0.455  -0.799 
##  4  0.339    0.932  0.400 -0.230  -0.486   0.819   0.846 
##  5  0.159    0.903  0.385 -0.360  -0.390   1.62    0.510 
##  6 -0.00598 -0.337 -0.603  0.247   0.395  -0.0660 -0.781 
##  7 -0.172   -0.277 -0.689 -0.116  -0.219  -0.772  -0.838 
##  8  0.568   -1.19   0.716 -0.382  -0.347   0.243   0.715 
##  9  0.0123   1.80   0.152 -0.217  -0.0602  2.17   -0.0132
## 10  1.06    -1.55   0.423 -0.211  -0.251  -0.520   0.318 
## # … with 2,189 more rows

Not bad: we’ve reduced 13 variables down to 7. This probably wasn’t the best use case for PCA, but it provides a good example of some advanced preprocessing made simple in {recipes}.
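
One note before we move on: juice() only returns the preprocessed training set. To apply the same trained recipe to data it hasn’t seen, like our test set, you would use bake() instead. A minimal sketch (we won’t need this below, since workflows will handle the preprocessing for us):

## Apply the trained preprocessing steps to new data
ames_test_processed <- bake(ames_rec_trained, new_data = ames_test)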

Modeling

Now let’s specify our model. We’re going to go with a Lasso model with a penalty of 0.001 using the {parsnip} package.

To do this, we’re first going to specify our model as a linear regression using linear_reg(), set the mixture proportion to 1 for full L1 regularization (the Lasso), and set the penalty to 0.001. Then we’ll set the engine to “glmnet”, as opposed to alternatives like “lm”, “stan”, “spark”, or “keras”. The beauty of {parsnip} is that it unifies the interface for model specifications so that you don’t need to remember dozens of different interfaces for each implementation of a model.

ames_lasso <- linear_reg(penalty = 0.001, mixture = 1) %>% 
  set_engine("glmnet")

Now that we have our recipe and our model, we can create our “workflow”, which packages up the preprocessing steps and the model. Using workflows, we don’t need to repeat the prep() and juice() steps from earlier when we fit our model (I demonstrated prep() and juice() because they’re useful for inspecting your preprocessed data, as we did above).

ames_lasso_wfl <- workflow() %>% 
  add_recipe(ames_rec) %>% 
  add_model(ames_lasso)

ames_lasso_wfl
## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## ● step_log()
## ● step_rm()
## ● step_dummy()
## ● step_center()
## ● step_scale()
## ● step_pca()
## ● step_nzv()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.001
##   mixture = 1
## 
## Computational engine: glmnet

With our workflow designed, fitting our model is as simple as passing our training data and workflow to the fit() function.

ames_lasso_fit <- fit(ames_lasso_wfl, ames_train)
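
As an aside, if you ever need the underlying parsnip fit back out of a fitted workflow, say to inspect coefficients, {workflows} provides an accessor for that (a quick sketch):

## Extract the parsnip model fit from the fitted workflow
pull_workflow_fit(ames_lasso_fit)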

And getting predictions is as simple as passing our fitted model and the data we want predictions for to the predict() function.

predict(ames_lasso_fit, ames_train) %>% slice(1:5)
## # A tibble: 5 x 1
##   .pred
##   <dbl>
## 1  5.32
## 2  5.16
## 3  5.36
## 4  5.30
## 5  5.32

Model evaluation

How does our model perform on our training set? Let’s find out using metrics from the {yardstick} package. We’ll use three: root mean squared error (RMSE), R-squared, and the concordance correlation coefficient (CCC).

First we’ll set our three metrics, then we’ll generate predictions, and compare those predictions to the true values within the training set.

perf_metrics <- metric_set(rmse, rsq, ccc)

perf_lasso <- ames_lasso_fit %>% 
  predict(ames_train) %>% 
  bind_cols(juice(ames_rec_trained)) %>% 
  perf_metrics(truth = Sale_Price, estimate = .pred)

perf_lasso %>% 
  arrange(.metric)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 ccc     standard      0.924 
## 2 rmse    standard      0.0657
## 3 rsq     standard      0.861

Easy peasy! But of course, this is all in-sample. Perhaps we want to know what kind of out-of-sample performance we can expect from our model using cross-validation. {rsample} makes that easy too, so let’s create folds for evaluating our training set models using vfold_cv(), which defaults to 10-fold cross-validation.

cv_splits <- vfold_cv(ames_train)
cv_splits
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits           id    
##    <named list>     <chr> 
##  1 <split [2K/220]> Fold01
##  2 <split [2K/220]> Fold02
##  3 <split [2K/220]> Fold03
##  4 <split [2K/220]> Fold04
##  5 <split [2K/220]> Fold05
##  6 <split [2K/220]> Fold06
##  7 <split [2K/220]> Fold07
##  8 <split [2K/220]> Fold08
##  9 <split [2K/220]> Fold09
## 10 <split [2K/219]> Fold10
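
As an aside, each element of the splits column is itself a split object, so if you want to peek inside a fold, analysis() and assessment() return its in-sample and held-out rows:

## Peek inside the first fold: ~2K rows for fitting, 220 held out
analysis(cv_splits$splits[[1]]) %>% dim()
assessment(cv_splits$splits[[1]]) %>% dim()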

Now we’ll take our workflow and use it to fit 10 models on these 10 splits with the fit_resamples() function from the {tune} package (also part of the tidymodels ecosystem), telling it to compute the performance metrics we set earlier.

cv_eval <- fit_resamples(ames_lasso_wfl, resamples = cv_splits, metrics = perf_metrics)
cv_eval
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits           id     .metrics         .notes          
##  * <list>           <chr>  <list>           <list>          
##  1 <split [2K/220]> Fold01 <tibble [3 × 3]> <tibble [0 × 1]>
##  2 <split [2K/220]> Fold02 <tibble [3 × 3]> <tibble [0 × 1]>
##  3 <split [2K/220]> Fold03 <tibble [3 × 3]> <tibble [0 × 1]>
##  4 <split [2K/220]> Fold04 <tibble [3 × 3]> <tibble [0 × 1]>
##  5 <split [2K/220]> Fold05 <tibble [3 × 3]> <tibble [0 × 1]>
##  6 <split [2K/220]> Fold06 <tibble [3 × 3]> <tibble [0 × 1]>
##  7 <split [2K/220]> Fold07 <tibble [3 × 3]> <tibble [0 × 1]>
##  8 <split [2K/220]> Fold08 <tibble [3 × 3]> <tibble [0 × 1]>
##  9 <split [2K/220]> Fold09 <tibble [3 × 3]> <tibble [0 × 1]>
## 10 <split [2K/219]> Fold10 <tibble [3 × 3]> <tibble [0 × 1]>

Now let’s compare the in-sample performance we just checked to our cross-validated performance, which is as easy as passing the fit_resamples() output to the collect_metrics() function!

collect_metrics(cv_eval)
## # A tibble: 3 x 5
##   .metric .estimator   mean     n std_err
##   <chr>   <chr>       <dbl> <int>   <dbl>
## 1 ccc     standard   0.911     10 0.00767
## 2 rmse    standard   0.0686    10 0.00289
## 3 rsq     standard   0.847     10 0.0155
perf_lasso %>% 
  arrange(.metric)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 ccc     standard      0.924 
## 2 rmse    standard      0.0657
## 3 rsq     standard      0.861

Not bad at all! Our cross-validated performance is fairly close to our in-sample performance, so it doesn’t seem like we’re overfitting here.

But I think we can do better. We’ve already used the {tune} package to fit across these resamples, but as the package name suggests, its real power comes in allowing us to easily tune the hyperparameters in our model.

Model tuning

Recall that we set our regularization penalty to 0.001 and chose to use pure L1 regularization. Both of those decisions were relatively arbitrary. Let’s use the {tune} package to build an elastic net model, exploring other penalty values and regularization mixtures and using cross-validation performance to decide what those parameter values should be.

We’ll start by defining a new model, ames_mixture, which will not take specific values for penalty and mixture, instead leaving them as parameters to be tuned, and then swap this new model into our ames_lasso_wfl workflow.

ames_mixture <- linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

ames_mixture_wfl <- update_model(ames_lasso_wfl, ames_mixture)

Next, we will define a parameter space to search. {tune} allows you to perform either grid search (where the candidate values are pre-defined) or iterative search (e.g., Bayesian optimization), where the results of previous models are used to select the next parameter values to try.

There are pros and cons to each. A big plus of grid search is that it allows you to take advantage of parallel processing to speed up your search, while iterative search is, by construction, sequential. A big plus of iterative search is that it can quickly rule out unpromising regions of the parameter space, which is efficient in high-dimensional spaces, where a grid may require many, many models to provide comfortable coverage and where many of those models may turn out to be redundant.
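
For the curious, an iterative search over the same objects might look something like the sketch below. This is not run here, and the arguments reflect my reading of the {tune} documentation rather than anything from the workshop:

## Not run: Bayesian optimization over the same workflow and resamples.
## Starts from a few initial candidates, then sequentially proposes new
## ones based on a Gaussian process model of the results so far.
ames_bayes <- tune_bayes(
  ames_mixture_wfl,
  resamples = cv_splits,
  metrics = perf_metrics,
  initial = 5,
  iter = 20
)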

For this post, we’re going to stick with grid search. The simplest form of grid search uses regular grids, where you provide a vector of values for each parameter and the grid is composed of every possible value combination.

{dials}, the tidymodels package that handles parameter objects, provides useful defaults for searching many common hyperparameters, for example, creating grids for the “penalty” parameter in log-10 space. We can simply specify the parameters, pass these to grid_regular(), and specify that we want 5 levels of penalty and 5 levels of mixture, giving 5 × 5 = 25 candidates.

mixture_param <- parameters(penalty(), mixture())

regular_grid <- grid_regular(mixture_param, levels = c(5, 5))

regular_grid %>% 
  ggplot(aes(x = mixture, y = penalty)) +
  geom_point() +
  scale_y_log10()

{dials} provides ways to create non-regular grids as well.

  • Random grids generated using grid_random() will uniformly sample the parameter space (a quick sketch follows this list).

  • Space-filling designs (SFD) generated using grid_max_entropy() will try to keep candidate values away from one another in order to more efficiently cover the parameter space.
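
For instance, a random grid is a one-liner (a quick sketch; we won’t use it below):

## Uniformly sample 25 candidate values from the parameter space
random_grid <- grid_random(mixture_param, size = 25)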

The code below creates an SFD grid of 25 candidate values and plots them.

sfd_grid <- grid_max_entropy(mixture_param, size = 25)

sfd_grid
## # A tibble: 25 x 2
##     penalty mixture
##       <dbl>   <dbl>
##  1 1.21e- 8  0.614 
##  2 1.32e- 1  0.679 
##  3 1.20e-10  0.269 
##  4 3.90e- 9  0.414 
##  5 8.38e- 6  0.0756
##  6 4.02e- 1  0.838 
##  7 5.94e-10  0.504 
##  8 1.83e- 1  0.177 
##  9 3.61e- 3  0.866 
## 10 2.86e- 7  0.856 
## # … with 15 more rows
sfd_grid %>% 
  ggplot(aes(x = mixture, y = penalty)) +
  geom_point() +
  scale_y_log10()

For simplicity’s sake, we’ll stick with the regular grid we generated. Let’s start tuning.

First, we’ll set up our parallelization.

library(doParallel)

all_cores <- parallel::detectCores(logical = FALSE)

cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

clusterEvalQ(cl, {library(tidymodels)})

Now we’re going to create our tuning object, which takes our recipe, our model, our resamples, and our metrics, and fits each of our 25 candidate models on each of the 10 resamples (250 fits in all) while computing our performance metrics. Then we’ll stop our parallelization.

ames_tune <- tune_grid(
  ames_rec,
  model = ames_mixture,
  resamples = cv_splits,
  grid = regular_grid,
  metrics = perf_metrics
)

stopCluster(cl)

# Naive Lasso performance
collect_metrics(cv_eval)
## # A tibble: 3 x 5
##   .metric .estimator   mean     n std_err
##   <chr>   <chr>       <dbl> <int>   <dbl>
## 1 ccc     standard   0.911     10 0.00767
## 2 rmse    standard   0.0686    10 0.00289
## 3 rsq     standard   0.847     10 0.0155
# Best tuned models
show_best(ames_tune, "ccc")
## # A tibble: 5 x 7
##        penalty mixture .metric .estimator  mean     n std_err
##          <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1 0.0000000001    0.25 ccc     standard   0.913    10 0.00760
## 2 0.0000000316    0.25 ccc     standard   0.913    10 0.00760
## 3 0.00001         0.25 ccc     standard   0.913    10 0.00760
## 4 0.0000000001    0.75 ccc     standard   0.913    10 0.00762
## 5 0.0000000316    0.75 ccc     standard   0.913    10 0.00762
show_best(ames_tune, "rmse", maximize = FALSE)
## # A tibble: 5 x 7
##        penalty mixture .metric .estimator   mean     n std_err
##          <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl>
## 1 0.0000000001    0    rmse    standard   0.0682    10 0.00235
## 2 0.0000000316    0    rmse    standard   0.0682    10 0.00235
## 3 0.00001         0    rmse    standard   0.0682    10 0.00235
## 4 0.00316         0    rmse    standard   0.0682    10 0.00235
## 5 0.00316         0.25 rmse    standard   0.0684    10 0.00274
show_best(ames_tune, "rsq")
## # A tibble: 5 x 7
##        penalty mixture .metric .estimator  mean     n std_err
##          <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1 0.0000000001    0    rsq     standard   0.849    10  0.0130
## 2 0.0000000316    0    rsq     standard   0.849    10  0.0130
## 3 0.00001         0    rsq     standard   0.849    10  0.0130
## 4 0.00316         0    rsq     standard   0.849    10  0.0130
## 5 0.00316         0.25 rsq     standard   0.848    10  0.0148

Our results suggest that our original parameter choices definitely had room for improvement. A much smaller penalty combined with pure L2 regularization seems to perform better on this data. The improvements are relatively modest (RMSE: 0.0686 –> 0.0682, R-squared: 0.847 –> 0.849), but when tuning is this easy, why leave money on the table?

The plot below visualizes the performance of each grid candidate, along with a dotted line indicating where our original model would’ve been.

collect_metrics(ames_tune) %>% 
  filter(.metric == "rmse") %>%
  mutate(mixture = format(mixture)) %>% 
  ggplot(aes(x = penalty, y = mean, col = mixture)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  geom_vline(xintercept = 0.001, color = "purple", lty = "dotted")

Now let’s select our best grid candidate, finalize our workflow, and fit our model.

best_mixture <- select_best(ames_tune, metric = "rmse", maximize = FALSE)
best_mixture
## # A tibble: 1 x 2
##        penalty mixture
##          <dbl>   <dbl>
## 1 0.0000000001       0
ames_mixture_final <- ames_mixture_wfl %>% 
  finalize_workflow(best_mixture) %>% 
  fit(data = ames_train)

And we’re done! We now have a fitted, tuned, regularized mixture model (albeit one where the mixture is 100% L2 regularization, but we got there via tuning!).

Finally we get to the fun part. What variables turned out to be the most important in predicting sale price?

## Dig into the fitted workflow: the first $fit is the workflow's fit
## stage, the second is the parsnip model fit, and the third is the
## underlying glmnet object, whose full coefficient path we then tidy
tidy_coefs <- ames_mixture_final$fit$fit$fit %>% 
  broom::tidy() %>% 
  filter(term != "(Intercept)") %>% 
  select(-step, -dev.ratio)

## glmnet fits an entire path of lambda values; find the value on the
## path closest to the penalty we selected during tuning
delta <- abs(tidy_coefs$lambda - best_mixture$penalty)
lambda_opt <- tidy_coefs$lambda[which.min(delta)]

## Label only the terms whose coefficient reaches at least 0.01 in
## absolute value somewhere on the path, placed at the optimal lambda
label_coefs <- tidy_coefs %>% 
  mutate(abs_estimate = abs(estimate)) %>% 
  filter(abs_estimate >= 0.01) %>% 
  distinct(term) %>% 
  inner_join(tidy_coefs, by = "term") %>% 
  filter(lambda == lambda_opt)

label_coefs
## # A tibble: 10 x 3
##    term                              estimate lambda
##    <chr>                                <dbl>  <dbl>
##  1 Garage_Cars                         0.0156 0.0139
##  2 Year_Remod_Add                      0.0235 0.0139
##  3 TotRms_AbvGrd                       0.0134 0.0139
##  4 Full_Bath                           0.0124 0.0139
##  5 PC1                                -0.0244 0.0139
##  6 Fireplaces                          0.0121 0.0139
##  7 Central_Air_Y                       0.0121 0.0139
##  8 Bsmt_Full_Bath                      0.0103 0.0139
##  9 Neighborhood_Gilbert               -0.0110 0.0139
## 10 MS_Zoning_Residential_Low_Density   0.0107 0.0139
tidy_coefs %>% 
  ggplot(aes(x = lambda, y = estimate, group = term, col = term, label = term)) +
  geom_vline(xintercept = lambda_opt, lty = 3) +
  geom_line(alpha = .4) +
  theme(legend.position = "none") +
  scale_x_log10() +
  ggrepel::geom_text_repel(data = label_coefs)

The plot above shows the coefficient estimates plotted against lambda, with the dotted line indicating the optimal lambda we selected during tuning. It’s nice to see that one of our principal components ended up being important!

With that all said and done, let’s finally see how our model did against our test set.

ames_mixture_final %>% 
  predict(ames_test) %>% 
  bind_cols(select(ames_test, Sale_Price)) %>% 
  mutate(Sale_Price = log10(Sale_Price)) %>% 
  perf_metrics(truth = Sale_Price, estimate = .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0828
## 2 rsq     standard      0.787 
## 3 ccc     standard      0.877

Having practiced on this data before, I can say that this is not great performance, but as mentioned before, all of this is just to demonstrate {tidymodels} functionality and workflow.

There are many, many more features in the {tidymodels} ecosystem, and more are continually being developed. I encourage you to explore the vignettes for its component packages. The parent website is not live as of this writing, but I expect it will be soon. In the meantime, the tidymodels GitHub repository can point you to the vignettes for each of its component packages.
