Introduction
One of the biggest challenges machine learning beginners face is deciding which algorithms to learn and what to focus on. In the case of R, the problem is accentuated by the fact that different algorithms have different syntax, different parameters to tune and different requirements on the data format. This can be too much for a beginner.
So, how can you go from a beginner to a data scientist who builds hundreds of models and stacks them together? There is certainly no shortcut, but what I will show you today will let you apply hundreds of machine learning models without having to:
- remember the different package names for each algorithm.
- learn the application syntax of each algorithm.
- memorize the parameters to tune for each algorithm.
All this has been made possible by the years of effort behind caret (Classification And REgression Training), which is possibly the biggest project in R. This package alone is practically everything you need to know to solve almost any supervised machine learning problem. It provides a uniform interface to several machine learning algorithms and standardizes other tasks such as data splitting, preprocessing, feature selection, variable importance estimation, etc.
For a detailed description of the various functionalities provided by Caret, you can refer to this article.
Today we will work on the Loan Prediction problem III to show you the power of the caret package.
P.S. While Caret definitely simplifies the job to some extent, it can't take away the hard work and practice required to become a master of machine learning.
Table of Contents
- Getting started
- Preprocessing with Caret
- Split the data with Caret
- Feature selection with Caret
- Training models with Caret
- Parameter tuning with Caret
- Variable importance estimation with Caret
- Make predictions with Caret
1. Getting started
In a nutshell, Caret is essentially a wrapper for more than 200 machine learning algorithms. What's more, it provides several features that make it a one-stop solution for all the modeling needs of supervised machine learning problems.
Caret does not load all the packages it depends on up front; instead, it loads them only when they are needed. It does, however, assume that you already have the corresponding algorithm packages installed on your system.
To install Caret on your system, use the following command. Note that it may take some time:
> install.packages("caret", dependencies = c("Depends", "Suggests"))
Now, let's start using the caret package on the Loan Prediction problem III:
#Loading caret package
library("caret")
#Loading training data
train<-read.csv("train_u6lujuX_CVtuZ9i.csv",stringsAsFactors = T)
#Looking at the structure of the training data
str(train)
#'data.frame': 614 obs. of 13 variables:
#$ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
#$ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#$ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#$ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#$ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#$ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#$ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#$ CoapplicantIncome: num 0 1508 0 2358 0 ...
#$ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
#$ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
#$ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
#$ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#$ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
In this problem, we have to predict the status of a person's loan based on their profile.
2. Pre-processing with Caret
We need to preprocess our data before we can use it for modeling. Let's check whether the data has missing values:
sum(is.na(train))
#[1] 86
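Before imputing, it can help to see which columns those missing values come from; a quick sketch using base R:
#Counting missing values per column (sketch)
colSums(is.na(train))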
Next, let's use Caret to impute these missing values using the KNN algorithm. We will predict the missing values based on the other attributes of that row. What's more, we will scale and center the numerical data using Caret's convenient preProcess() function.
#Imputing missing values using KNN.Also centering and scaling numerical columns
preProcValues <- preProcess(train, method = c("knnImpute","center","scale"))
library('RANN')
train_processed <- predict(preProcValues, train)
sum(is.na(train_processed))
#[1] 0
It is also very easy to use one-hot encoding in Caret to create dummy variables for each level of a categorical variable. But first, we will convert the dependent variable to numeric.
#Converting outcome variable to numeric
train_processed$Loan_Status<-ifelse(train_processed$Loan_Status=='N',0,1)
id<-train_processed$Loan_ID
train_processed$Loan_ID<-NULL
#Checking the structure of processed train file
str(train_processed)
#'data.frame': 614 obs. of 12 variables:
#$ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#$ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#$ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#$ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#$ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#$ ApplicantIncome : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome: num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#$ Loan_Status : num 1 0 1 1 1 1 1 0 1 0 ...
Now, let's create the dummy variables using one-hot encoding:
#Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = train_processed,fullRank = T)
train_transformed <- data.frame(predict(dmy, newdata = train_processed))
#Checking the structure of transformed train file
str(train_transformed)
#'data.frame': 614 obs. of 19 variables:
#$ Gender.Female : num 0 0 0 0 0 0 0 0 0 0 ...
#$ Gender.Male : num 1 1 1 1 1 1 1 1 1 1 ...
#$ Married.No : num 1 0 0 0 1 0 0 0 0 0 ...
#$ Married.Yes : num 0 1 1 1 0 1 1 1 1 1 ...
#$ Dependents.0 : num 1 0 1 1 1 0 1 0 0 0 ...
#$ Dependents.1 : num 0 1 0 0 0 0 0 0 0 1 ...
#$ Dependents.2 : num 0 0 0 0 0 1 0 0 1 0 ...
#$ Dependents.3. : num 0 0 0 0 0 0 0 1 0 0 ...
#$ Education.Not.Graduate : num 0 0 0 1 0 0 1 0 0 0 ...
#$ Self_Employed.No : num 1 1 0 1 1 0 1 1 1 1 ...
#$ Self_Employed.Yes : num 0 0 1 0 0 1 0 0 0 0 ...
#$ ApplicantIncome : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome : num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area.Semiurban: num 0 0 0 0 0 0 0 1 0 1 ...
#$ Property_Area.Urban : num 1 0 1 1 1 1 1 0 1 0 ...
#$ Loan_Status : num 1 0 1 1 1 1 1 0 1 0 ...
#Converting the dependent variable back to categorical
train_transformed$Loan_Status<-as.factor(train_transformed$Loan_Status)
Here, fullRank = T will create only (n-1) dummy columns for a categorical column with n distinct levels. This works well for categorical predictors such as Gender, Married, etc., that have only two levels (Male/Female, Yes/No), since 0 can represent one class while 1 represents the other class in the same column.
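To see exactly what fullRank changes, here is a minimal sketch on a hypothetical two-level factor (the toy data frame below is not part of the loan data):
#Toy illustration of fullRank on a hypothetical two-level factor
toy <- data.frame(Gender = factor(c("Male", "Female", "Male")))
predict(dummyVars(" ~ .", data = toy, fullRank = TRUE), newdata = toy) #one indicator column
predict(dummyVars(" ~ .", data = toy, fullRank = FALSE), newdata = toy) #one column per level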
3. Split the data with Caret
We will create a cross-validation set from the training set to evaluate our model against. It is important to rely on this cross-validation set for the actual evaluation of your model; otherwise, you could end up overfitting the public leaderboard.
We will use createDataPartition() to split our training data into two sets: 75% and 25%. Since our outcome variable is categorical, this function ensures that the distribution of the outcome classes is similar in both sets.
#Spliting training set into two parts based on outcome: 75% and 25%
index <- createDataPartition(train_transformed$Loan_Status, p=0.75, list=FALSE)
trainSet <- train_transformed[ index,]
testSet <- train_transformed[-index,]
#Checking the structure of trainSet
str(trainSet)
#'data.frame': 461 obs. of 19 variables:
#$ Gender.Female : num 0 0 0 0 0 0 0 0 0 0 ...
#$ Gender.Male : num 1 1 1 1 1 1 1 1 1 1 ...
#$ Married.No : num 1 0 0 0 1 0 0 0 0 0 ...
#$ Married.Yes : num 0 1 1 1 0 1 1 1 1 1 ...
#$ Dependents.0 : num 1 0 1 1 1 0 1 0 0 0 ...
#$ Dependents.1 : num 0 1 0 0 0 0 0 0 1 0 ...
#$ Dependents.2 : num 0 0 0 0 0 1 0 0 0 1 ...
#$ Dependents.3. : num 0 0 0 0 0 0 0 1 0 0 ...
#$ Education.Not.Graduate : num 0 0 0 1 0 0 1 0 0 0 ...
#$ Self_Employed.No : num 1 1 0 1 1 0 1 1 1 1 ...
#$ Self_Employed.Yes : num 0 0 1 0 0 1 0 0 0 0 ...
#$ ApplicantIncome : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome : num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area.Semiurban: num 0 0 0 0 0 0 0 1 1 0 ...
#$ Property_Area.Urban : num 1 0 1 1 1 1 1 0 0 1 ...
#$ Loan_Status : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 1 1 2 ...
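As a quick sanity check of the stratified split described above, the class proportions of Loan_Status should come out very similar in both sets; a small sketch:
#Comparing the outcome distribution in both sets
prop.table(table(trainSet$Loan_Status))
prop.table(table(testSet$Loan_Status))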
4. Feature selection with Caret
Feature selection is an extremely crucial part of modeling. To understand the importance of feature selection and the various techniques used for it, I highly recommend that you read my previous article. For now, we will use recursive feature elimination (RFE), which is a wrapper method, to find the best subset of features to use for modeling.
#Feature selection using rfe in caret
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      repeats = 3,
                      verbose = FALSE)
outcomeName<-'Loan_Status'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
Loan_Pred_Profile <- rfe(trainSet[,predictors], trainSet[,outcomeName],
rfeControl = control)
Loan_Pred_Profile
#Recursive feature selection
#Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
#Resampling performance over subset size:
# Variables Accuracy  Kappa AccuracySD KappaSD Selected
#         4   0.7737 0.4127    0.03707 0.09962
#         8   0.7874 0.4317    0.03833 0.11168
#        16   0.7903 0.4527    0.04159 0.11526        *
#        18   0.7882 0.4431    0.03615 0.10812
#The top 5 variables (out of 16):
# Credit_History, LoanAmount, Loan_Amount_Term, ApplicantIncome, CoapplicantIncome
#Taking only the top 5 predictors
predictors<-c("Credit_History", "LoanAmount", "Loan_Amount_Term", "ApplicantIncome", "CoapplicantIncome")
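Instead of typing the names by hand, the selected variables can also be pulled from the rfe object itself; a small sketch, assuming you want all the variables chosen by RFE rather than just the top 5:
#Extracting the variables selected by rfe
predictors(Loan_Pred_Profile)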
5. Training models with Caret
This is probably the part where Caret stands out from every other package available. It provides the ability to implement more than 200 machine learning algorithms using consistent syntax. To get a list of all the algorithms that Caret supports, you can use:
names(getModelInfo())
#[1] "ada" "AdaBag" "AdaBoost.M1" "adaboost"
#[5] "amdai" "ANFIS" "avNNet" "awnb"
#[9] "awtan" "bag" "bagEarth" "bagEarthGCV"
#[13] "bagFDA" "bagFDAGCV" "bam" "bartMachine"
#[17] "bayesglm" "bdk" "binda" "blackboost"
#[21] "blasso" "blassoAveraged" "Boruta" "bridge"
#….
#[205] "svmBoundrangeString" "svmExpoString" "svmLinear" "svmLinear2"
#[209] "svmLinear3" "svmLinearWeights" "svmLinearWeights2" "svmPoly"
#[213] "svmRadial" "svmRadialCost" "svmRadialSigma" "svmRadialWeights"
#[217] "svmSpectrumString" "tan" "tanSearch" "treebag"
#[221] "vbmpRadial" "vglmAdjCat" "vglmContRatio" "vglmCumulative"
#[225] "widekernelpls" "WM" "wsrf" "xgbLinear"
#[229] "xgbTree" "xyf"
For more details of any model, you can consult here.
We can simply apply a large number of algorithms with a similar syntax. For instance, to apply GBM, random forest, neural network and logistic regression:
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm')
model_rf<-train(trainSet[,predictors],trainSet[,outcomeName],method='rf')
model_nnet<-train(trainSet[,predictors],trainSet[,outcomeName],method='nnet')
model_glm<-train(trainSet[,predictors],trainSet[,outcomeName],method='glm')
You can continue tuning the parameters in all of these algorithms using the parameter tuning techniques.
6. Parameter tuning with Caret
It is extremely easy to tune parameters using Caret. Typically, parameter tuning in Caret works as follows:
Almost every step of the fitting process can be customized. By default, the resampling technique used to evaluate model performance for a given set of parameters is the bootstrap, but Caret provides alternatives such as k-fold, repeated k-fold and leave-one-out cross-validation (LOOCV), which can be specified using trainControl(). In this example, we will use 5-fold cross-validation repeated 5 times.
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5)
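For reference, the other resampling schemes mentioned above are requested in the same way; the objects below are only sketches and are not used in the rest of this walkthrough:
#Alternative resampling schemes (sketches, not used below)
cvControl <- trainControl(method = "cv", number = 10) #plain 10-fold cross-validation
loocvControl <- trainControl(method = "LOOCV") #leave-one-out cross-validation
bootControl <- trainControl(method = "boot", number = 25) #the bootstrap default, made explicit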
If the parameter search space is not defined, Caret will use 3 values of each tunable parameter and use the cross-validation results to find the best set of parameters for that algorithm. Otherwise, there are two more ways to tune the parameters:
6.1. Using tuneGrid
To find the parameters of a model that can be tuned, you can use:
modelLookup(model="gbm")
#  model         parameter                   label forReg forClass probModel
#1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE      TRUE
#2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE      TRUE
#3   gbm         shrinkage               Shrinkage   TRUE     TRUE      TRUE
#4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE      TRUE
#using grid search
#Creating grid
grid <- expand.grid(n.trees=c(10,20,50,100,500,1000),shrinkage=c(0.01,0.05,0.1,0.5),n.minobsinnode = c(3,5,10),interaction.depth=c(1,5,10))
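The grid above spans 6 values of n.trees, 4 of shrinkage, 3 of n.minobsinnode and 3 of interaction.depth, i.e. 6 x 4 x 3 x 3 = 216 candidate parameter combinations, each of which will be evaluated with the resampling scheme defined in fitControl; a quick check:
#Number of parameter combinations that will be evaluated
nrow(grid) #216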
# training the model
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneGrid=grid)
# summarizing the model
print(model_gbm)
#Stochastic Gradient Boosting
#461 samples
#5 predictor
#2 classes: '0', '1'
#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 370, 369, 369, 368, 369, ...
#Resampling results across tuning parameters:
# shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
#0.01 1 3 10 0.6876416 0.0000000
#0.01 1 3 20 0.6876416 0.0000000
#0.01 1 3 50 0.7982345 0.4423609
#0.01 1 3 100 0.7952190 0.4364383
#0.01 1 3 500 0.7904882 0.4342300
#0.01 1 3 1000 0.7913627 0.4421230
#0.01 1 5 10 0.6876416 0.0000000
#0.01 1 5 20 0.6876416 0.0000000
#0.01 1 5 50 0.7982345 0.4423609
#0.01 1 5 100 0.7943635 0.4351912
#0.01 1 5 500 0.7930783 0.4411348
#0.01 1 5 1000 0.7913720 0.4417463
#0.01 1 10 10 0.6876416 0.0000000
#0.01 1 10 20 0.6876416 0.0000000
#0.01 1 10 50 0.7982345 0.4423609
#0.01 1 10 100 0.7943635 0.4351912
#0.01 1 10 500 0.7939525 0.4426503
#0.01 1 10 1000 0.7948362 0.4476742
#0.01 5 3 10 0.6876416 0.0000000
#0.01 5 3 20 0.6876416 0.0000000
#0.01 5 3 50 0.7960556 0.4349571
#0.01 5 3 100 0.7934987 0.4345481
#0.01 5 3 500 0.7775055 0.4147204
#...
#0.50 5 10 100 0.7045617 0.2834696
#0.50 5 10 500 0.6924480 0.2650477
#0.50 5 10 1000 0.7115234 0.3050953
#0.50 10 3 10 0.7389117 0.3681917
#0.50 10 3 20 0.7228519 0.3317001
#0.50 10 3 50 0.7180833 0.3159445
#0.50 10 3 100 0.7172417 0.3189655
#0.50 10 3 500 0.7058472 0.3098146
#0.50 10 3 1000 0.7001852 0.2967784
#0.50 10 5 10 0.7266895 0.3378430
#0.50 10 5 20 0.7154746 0.3197905
#0.50 10 5 50 0.7063535 0.2984819
#0.50 10 5 100 0.7151012 0.3141440
#0.50 10 5 500 0.7108516 0.3146822
#0.50 10 5 1000 0.7147320 0.3225373
#0.50 10 10 10 0.7314871 0.3327504
#0.50 10 10 20 0.7150814 0.3081869
#0.50 10 10 50 0.6993723 0.2815981
#0.50 10 10 100 0.6977416 0.2719140
#0.50 10 10 500 0.7037864 0.2854748
#0.50 10 10 1000 0.6995610 0.2869718
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were n.trees = 10, interaction.depth = 1, shrinkage = 0.05 and n.minobsinnode = 3.
plot(model_gbm)
Therefore, for every parameter combination that you listed in expand.grid(), a model is created and tested using cross-validation. The set of parameters with the best cross-validation performance is then used to train the final model.
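The winning combination can also be inspected directly on the fitted object; a small sketch using the bestTune component stored by train():
#Best parameter combination found by the grid search
model_gbm$bestTune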
6.2. Using tuneLength
Instead of specifying the exact values for each tuning parameter, we can simply ask it to use any number of possible values for each tuning parameter via tuneLength. Let's try an example using tuneLength = 10.
#using tune length
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneLength=10)
print(model_gbm)
#Stochastic Gradient Boosting
#461 samples
#5 predictor
#2 classes: '0', '1'
#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 369, 369, 370, 368, 369, ...
#Resampling results across tuning parameters:
# interaction.depth n.trees Accuracy Kappa
#1 50 0.7978084 0.4541008
#1 100 0.7978177 0.4566764
#1 150 0.7934792 0.4472347
#1 200 0.7904310 0.4424091
#1 250 0.7869714 0.4342797
#1 300 0.7830488 0.4262414
#...
#10 100 0.7575230 0.3860319
#10 150 0.7479757 0.3719707
#10 200 0.7397290 0.3566972
#10 250 0.7397285 0.3561990
#10 300 0.7362552 0.3513413
#10 350 0.7340812 0.3453415
#10 400 0.7336416 0.3453117
#10 450 0.7306027 0.3415153
#10 500 0.7253854 0.3295929
#The tuning parameter 'shrinkage' was held constant at a value of 0.1
#The tuning parameter 'n.minobsinnode' was held constant at a value of 10
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
plot(model_gbm)
Here, it keeps shrinkage and n.minobsinnode constant while varying n.trees and interaction.depth over 10 values, and uses the best combination to train the final model.
7. Variable importance estimation with Caret
Caret also makes variable importance estimates accessible via varImp() for any model. Let's take a look at the variable importance for the four models we created:
#Checking variable importance for GBM
#Variable Importance
varImp(object=model_gbm)
#gbm variable importance
#Overall
#Credit_History 100.000
#LoanAmount 16.633
#ApplicantIncome 7.104
#CoapplicantIncome 6.773
#Loan_Amount_Term 0.000
#Plotting Variable importance for GBM
plot(varImp(object=model_gbm),main="GBM - Variable Importance")
#Checking variable importance for RF
varImp(object=model_rf)
#rf variable importance
#Overall
#Credit_History 100.00
#ApplicantIncome 73.46
#LoanAmount 60.59
#CoapplicantIncome 40.43
#Loan_Amount_Term 0.00
#Plotting Variable importance for Random Forest
plot(varImp(object=model_rf),main="RF - Variable Importance")
#Checking variable importance for NNET
varImp(object=model_nnet)
#nnet variable importance
#Overall
#ApplicantIncome 100.00
#LoanAmount 82.87
#CoapplicantIncome 56.92
#Credit_History 41.11
#Loan_Amount_Term 0.00
#Plotting Variable importance for Neural Network
plot(varImp(object=model_nnet),main="NNET - Variable Importance")
#Checking variable importance for GLM
varImp(object=model_glm)
#glm variable importance
#Overall
#Credit_History 100.000
#CoapplicantIncome 17.218
#Loan_Amount_Term 12.988
#LoanAmount 5.632
#ApplicantIncome 0.000
#Plotting Variable importance for GLM
plot(varImp(object=model_glm),main="GLM - Variable Importance")
Clearly, the variable importance estimates from different models differ and, thus, can be used to get a more holistic view of the importance of each predictor. Two main uses of variable importance from multiple models are:
- Predictors that are important to most models represent really important predictors.
- On the whole, we should use predictions from models whose variable importance differs significantly, since their predictions are also expected to differ. However, one thing you need to make sure of is that all of them are sufficiently accurate (see the sketch after this list).
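To put the four models side by side, the importance scores can be collected into a single table. This is just a minimal sketch, assuming each varImp() call returns a single Overall column as in the outputs above:
#Collecting variable importance from all four models into one table (sketch)
imp_gbm <- varImp(object=model_gbm)$importance
imp_rf <- varImp(object=model_rf)$importance
imp_nnet <- varImp(object=model_nnet)$importance
imp_glm <- varImp(object=model_glm)$importance
vars <- rownames(imp_gbm)
importance_table <- data.frame(GBM = imp_gbm[vars, "Overall"],
                               RF = imp_rf[vars, "Overall"],
                               NNET = imp_nnet[vars, "Overall"],
                               GLM = imp_glm[vars, "Overall"],
                               row.names = vars)
importance_table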
8. Predictions with Caret
To predict the dependent variable for the test set, Caret offers predict.train(). You need to specify only the model object and the test data. For classification problems, Caret also offers an argument called type, which can be set to "prob" or "raw". With type = "raw", the predictions are just the outcome classes for the test data, while type = "prob" gives the probability of each observation belonging to each class of the outcome variable.
Let's take a look at the predictions of our GBM model:
#Predictions
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
table(predictions)
#predictions
#0 1
#28 125
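For comparison, type = "prob" returns the predicted class probabilities instead of the class labels; a small sketch:
#Getting class probabilities instead of class labels
prob_predictions <- predict.train(object=model_gbm, testSet[,predictors], type="prob")
head(prob_predictions)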
Caret also provides a confusionMatrix function that produces the confusion matrix along with various other metrics for your predictions. Here is the performance analysis of our GBM model:
confusionMatrix(predictions,testSet[,outcomeName])
#Confusion Matrix and Statistics
#Reference
#Prediction 0 1
#0 25 3
#1 23 102
#Accuracy : 0.8301
#95% CI : (0.761, 0.8859)
#No Information Rate : 0.6863
#P-Value [Acc > NIR] : 4.049e-05
#Kappa : 0.555
#Mcnemar's Test P-Value : 0.0001944
#Sensitivity : 0.5208
#Specificity : 0.9714
#Pos Pred Value : 0.8929
#Neg Pred Value : 0.8160
#Prevalence : 0.3137
#Detection Rate : 0.1634
#Detection Prevalence : 0.1830
#Balanced Accuracy : 0.7461
#'Positive' Class : 0
Additional Resources
Final notes
Caret is one of the most powerful and useful packages ever created in R. It alone has the ability to meet all your predictive modeling needs, from preprocessing to interpretation. What's more, its syntax is also very easy to use. If you use R, I encourage you to make Caret part of your toolkit.
Caret is a very comprehensive package and, instead of covering all the functionality it offers, I thought it would be a better idea to show an end-to-end implementation of Caret on an actual hackathon dataset. I tried to cover as many functions in Caret as I could, but Caret has much more to offer. To dig deeper, you may find the resources listed above very helpful. Several of these resources have been written by Max Kuhn himself (the creator of the caret package).