Welcome to the rtemis vignette for MediBoost.
Let’s load rtemis:
library(rtemis)
## .:rtemis v0.7: Welcome, egenn
## [x86_64-apple-darwin15.6.0: 4 threads available]
In rtemis, you can either provide a feature matrix / data frame, x, and an outcome vector, y, separately, or provide a combined dataset x alone, in which case the last column should be the outcome. For classification, the outcome should be a factor where the first level is the ‘positive’ case.
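For instance, the two conventions would look like this (a minimal sketch using the built-in iris data and a hypothetical two-class outcome; s.MDB is demonstrated properly below):
x <- iris[, 1:4]                              # features only
y <- factor(ifelse(iris$Species == "setosa", "pos", "neg"),
            levels = c("pos", "neg"))         # first level is the 'positive' case
# mod <- s.MDB(x, y)                          # features and outcome separately
# mod <- s.MDB(data.frame(x, Outcome = y))    # combined; outcome is the last column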
Let’s load a dataset from the online UCI ML repository and use the checkData function to examine the dataset:
parkinsons <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data")
parkinsons$Status <- factor(parkinsons$status, levels = c(1, 0))
parkinsons$status <- NULL
parkinsons$name <- NULL
checkData(parkinsons)
## -------------------------------
## Dataset: parkinsons
## -------------------------------
##
## Summary
## -------------------------------
## 195 cases with 23 features:
## * 22 continuous features
## * 0 integer features
## * 1 categorical feature, which is not ordered
## * 0 constant features
## * 0 features include 'NA' values
##
## Recommendations
## -------------------------------
## * Everything looks good
Let’s train a MediBoost (MDB) model on the full sample:
parkinsons.mdb <- s.MDB(parkinsons, gamma = .8, learning.rate = .1)
## [2018-03-02 04:14:19 s.MDB] Hello, egenn
## [2018-03-02 04:14:19 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 195 x 22
## Training outcome: 195 x 1
## Testing features: Not available
## Testing outcome: Not available
##
## [2018-03-02 04:14:20 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 145 1
## 0 2 47
##
## Accuracy : 0.9846
## 95% CI : (0.9557, 0.9968)
## No Information Rate : 0.7538
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9588
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9864
## Specificity : 0.9792
## Pos Pred Value : 0.9932
## Neg Pred Value : 0.9592
## Precision : 0.9932
## Recall : 0.9864
## F1 : 0.9898
## Prevalence : 0.7538
## Detection Rate : 0.7436
## Detection Prevalence : 0.7487
## Balanced Accuracy : 0.9828
##
## 'Positive' Class : 1
##
## [2018-03-02 04:14:23 s.MDB] Run completed in 0.06 minutes (Real: 3.81; User: 3.62; System: 0.13)
MDB trees are saved as data.tree objects. We can plot them using mplot3.mdb, which creates HTML output using graphviz. The first line in each node shows the rule, followed by the number of samples that match the rule, and lastly by the percent of those samples that were outcome-positive. By default, leaf nodes with an estimate of 1 (positive class) are orange, and those with an estimate of 0 are teal. You can mouse over nodes, edges, and the plot background for some popup info:
mplot3.mdb(parkinsons.mdb)
We can also explore the tree in the console without plotting:
parkinsons.mdb$mod$mdb.tree.pruned
## levelName
## 1 All cases
## 2 ¦--PPE < 0.1339935
## 3 °--PPE ≥ 0.1339935
## 4 ¦--Shimmer.APQ5 < 0.012745
## 5 ¦ ¦--MDVP.Fo.Hz. < 117.25
## 6 ¦ ¦ ¦--Shimmer.APQ3 < 0.008825
## 7 ¦ ¦ ¦ ¦--MDVP.Fo.Hz. < 110.723
## 8 ¦ ¦ ¦ °--MDVP.Fo.Hz. ≥ 110.723
## 9 ¦ ¦ °--Shimmer.APQ3 ≥ 0.008825
## 10 ¦ °--MDVP.Fo.Hz. ≥ 117.25
## 11 °--Shimmer.APQ5 ≥ 0.012745
Any attribute can be printed alongside the hierarchical tree structure:
print(parkinsons.mdb$mod$mdb.tree.pruned, "Estimate")
## levelName Estimate
## 1 All cases 1
## 2 ¦--PPE < 0.1339935 0
## 3 °--PPE ≥ 0.1339935 1
## 4 ¦--Shimmer.APQ5 < 0.012745 1
## 5 ¦ ¦--MDVP.Fo.Hz. < 117.25 0
## 6 ¦ ¦ ¦--Shimmer.APQ3 < 0.008825 0
## 7 ¦ ¦ ¦ ¦--MDVP.Fo.Hz. < 110.723 1
## 8 ¦ ¦ ¦ °--MDVP.Fo.Hz. ≥ 110.723 0
## 9 ¦ ¦ °--Shimmer.APQ3 ≥ 0.008825 1
## 10 ¦ °--MDVP.Fo.Hz. ≥ 117.25 1
## 11 °--Shimmer.APQ5 ≥ 0.012745 1
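Since the tree is a data.tree object, standard data.tree operations also apply. For example (a sketch assuming the data.tree API), we can collect an attribute from the leaf nodes only:
# Get the Estimate attribute from leaves only, using data.tree's Get with a filter
parkinsons.mdb$mod$mdb.tree.pruned$Get("Estimate", filterFun = data.tree::isLeaf)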
To get predicted values, use the predict S3 generic with the familiar syntax predict(mod, newdata):
predict(parkinsons.mdb)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0
## [36] 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0
## Levels: 1 0
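A quick way to compare these fitted values against the true labels is a cross-tabulation (a sketch; parkinsons$Status is the outcome we defined above):
# Cross-tabulate training-set predictions against the true outcome
table(Predicted = predict(parkinsons.mdb), True = parkinsons$Status)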
Let’s use resample to create 10 stratified folds of the data; we will use the first fold’s training indices to split into training and test sets:
train <- resample(parkinsons, n.resamples = 10, resampler = "kfold", verbose = TRUE)
## [2018-03-02 04:14:23 resample] Input contains more than one columns; will stratify on last
## ------------------------------------------------------
## Resampling Parameters
## ------------------------------------------------------
## n.resamples: 10
## resampler: kfold
##
## [2018-03-02 04:14:23 resample] Created 10 independent folds
mplot3.res(train)
parkinsons.train <- parkinsons[train$Fold01, ]
parkinsons.test <- parkinsons[-train$Fold01, ]
parkinsons.mdb <- s.MDB(parkinsons.train, x.test = parkinsons.test,
gamma = .8, learning.rate = .1)
## [2018-03-02 04:14:23 s.MDB] Hello, egenn
## [2018-03-02 04:14:23 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 176 x 22
## Training outcome: 176 x 1
## Testing features: 19 x 22
## Testing outcome: 19 x 1
##
## [2018-03-02 04:14:23 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 124 0
## 0 9 43
##
## Accuracy : 0.9489
## 95% CI : (0.9051, 0.9764)
## No Information Rate : 0.7557
## P-Value [Acc > NIR] : 6.465e-12
##
## Kappa : 0.8707
## Mcnemar's Test P-Value : 0.007661
##
## Sensitivity : 0.9323
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8269
## Precision : 1.0000
## Recall : 0.9323
## F1 : 0.9650
## Prevalence : 0.7557
## Detection Rate : 0.7045
## Detection Prevalence : 0.7045
## Balanced Accuracy : 0.9662
##
## 'Positive' Class : 1
##
## ------------------------------------------------------
## MDB Classification Testing Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 13 0
## 0 1 5
##
## Accuracy : 0.9474
## 95% CI : (0.7397, 0.9987)
## No Information Rate : 0.7368
## P-Value [Acc > NIR] : 0.02352
##
## Kappa : 0.8725
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.9286
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8333
## Precision : 1.0000
## Recall : 0.9286
## F1 : 0.9630
## Prevalence : 0.7368
## Detection Rate : 0.6842
## Detection Prevalence : 0.6842
## Balanced Accuracy : 0.9643
##
## 'Positive' Class : 1
##
## [2018-03-02 04:14:25 s.MDB] Run completed in 0.03 minutes (Real: 1.75; User: 1.68; System: 0.03)
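As before, the predict generic also works on held-out data (a sketch; we assume newdata should contain features only, so we drop the outcome column):
# Predict on the held-out fold, excluding the outcome (last) column
predicted.test <- predict(parkinsons.mdb, parkinsons.test[, -ncol(parkinsons.test)])
table(Predicted = predicted.test, True = parkinsons.test$Status)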
rtemis supervised learners, like s.MDB, support automatic hyperparameter tuning. When more than a single value is passed to a tunable argument, grid search with internal resampling takes place using all available cores (threads).
parkinsons.mdb.tune <- s.MDB(parkinsons.train, x.test = parkinsons.test,
gamma = seq(.6, .9, .1), learning.rate = .1)
## [2018-03-02 04:14:25 s.MDB] Hello, egenn
## [2018-03-02 04:14:25 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 176 x 22
## Training outcome: 176 x 1
## Testing features: 19 x 22
## Testing outcome: 19 x 1
##
## [2018-03-02 04:14:25 gridSearchLearn] Hello, egenn
## ------------------------------------------------------
## Resampling Parameters
## ------------------------------------------------------
## n.resamples: 5
## resampler: kfold
##
## [2018-03-02 04:14:25 resample] Created 5 independent folds
##
## ------------------------------------------------------
## Input parameters
## ------------------------------------------------------
## grid.params:
## gamma: 0.6, 0.7, 0.8, 0.9
## max.depth: 30
## learning.rate: 0.1
## min.hessian: 0.001
##
## [2018-03-02 04:14:25 gridSearchLearn] Tuning MediBoost Tree-Structured Boosting by exhaustive grid search:
## [2018-03-02 04:14:25 gridSearchLearn] 5 resamples; 20 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
##
##
## ------------------------------------------------------
## Best parameters to maximize BalancedAccuracy
## ------------------------------------------------------
## best.tune:
## gamma: 0.7
## max.depth: 30
## learning.rate: 0.1
## min.hessian: 0.001
##
## [2018-03-02 04:14:34 gridSearchLearn] Run completed in 0.15 minutes (Real: 9.01; User: 0.07; System: 0.04)
## [2018-03-02 04:14:34 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 126 0
## 0 7 43
##
## Accuracy : 0.9602
## 95% CI : (0.9198, 0.9839)
## No Information Rate : 0.7557
## P-Value [Acc > NIR] : 1.502e-13
##
## Kappa : 0.8979
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.9474
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8600
## Precision : 1.0000
## Recall : 0.9474
## F1 : 0.9730
## Prevalence : 0.7557
## Detection Rate : 0.7159
## Detection Prevalence : 0.7159
## Balanced Accuracy : 0.9737
##
## 'Positive' Class : 1
##
## ------------------------------------------------------
## MDB Classification Testing Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 13 1
## 0 1 4
##
## Accuracy : 0.8947
## 95% CI : (0.6686, 0.987)
## No Information Rate : 0.7368
## P-Value [Acc > NIR] : 0.0894
##
## Kappa : 0.7286
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9286
## Specificity : 0.8000
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.8000
## Precision : 0.9286
## Recall : 0.9286
## F1 : 0.9286
## Prevalence : 0.7368
## Detection Rate : 0.6842
## Detection Prevalence : 0.7368
## Balanced Accuracy : 0.8643
##
## 'Positive' Class : 1
##
## [2018-03-02 04:14:35 s.MDB] Run completed in 0.17 minutes (Real: 10.19; User: 1.13; System: 0.08)
We can define the tuning resampling parameters with the grid.resampler.rtSet argument. The rtSet.resampler convenience function helps easily build the list needed by grid.resampler.rtSet, providing auto-completion.
parkinsons.mdb.tune <- s.MDB(parkinsons.train, x.test = parkinsons.test,
gamma = seq(.6, .9, .1), learning.rate = .1,
grid.resampler.rtSet = rtSet.resampler(resampler = 'strat.boot',
n.resamples = 5))
## [2018-03-02 04:14:35 s.MDB] Hello, egenn
## [2018-03-02 04:14:35 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 176 x 22
## Training outcome: 176 x 1
## Testing features: 19 x 22
## Testing outcome: 19 x 1
##
## [2018-03-02 04:14:35 gridSearchLearn] Hello, egenn
## ------------------------------------------------------
## Resampling Parameters
## ------------------------------------------------------
## n.resamples: 5
## resampler: strat.boot
## stratify.var: y
## cv.p: 0.75
## cv.groups: 4
## target.length: NULL
##
## [2018-03-02 04:14:35 resample] Created 5 stratified bootstraps
##
## ------------------------------------------------------
## Input parameters
## ------------------------------------------------------
## grid.params:
## gamma: 0.6, 0.7, 0.8, 0.9
## max.depth: 30
## learning.rate: 0.1
## min.hessian: 0.001
##
## [2018-03-02 04:14:35 gridSearchLearn] Tuning MediBoost Tree-Structured Boosting by exhaustive grid search:
## [2018-03-02 04:14:35 gridSearchLearn] 5 resamples; 20 models total; running on 4 cores (x86_64-apple-darwin15.6.0)
##
##
## ------------------------------------------------------
## Best parameters to maximize BalancedAccuracy
## ------------------------------------------------------
## best.tune:
## gamma: 0.7
## max.depth: 30
## learning.rate: 0.1
## min.hessian: 0.001
##
## [2018-03-02 04:14:45 gridSearchLearn] Run completed in 0.17 minutes (Real: 10.13; User: 0.12; System: 0.03)
## [2018-03-02 04:14:45 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 126 0
## 0 7 43
##
## Accuracy : 0.9602
## 95% CI : (0.9198, 0.9839)
## No Information Rate : 0.7557
## P-Value [Acc > NIR] : 1.502e-13
##
## Kappa : 0.8979
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.9474
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8600
## Precision : 1.0000
## Recall : 0.9474
## F1 : 0.9730
## Prevalence : 0.7557
## Detection Rate : 0.7159
## Detection Prevalence : 0.7159
## Balanced Accuracy : 0.9737
##
## 'Positive' Class : 1
##
## ------------------------------------------------------
## MDB Classification Testing Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 13 1
## 0 1 4
##
## Accuracy : 0.8947
## 95% CI : (0.6686, 0.987)
## No Information Rate : 0.7368
## P-Value [Acc > NIR] : 0.0894
##
## Kappa : 0.7286
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9286
## Specificity : 0.8000
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.8000
## Precision : 0.9286
## Recall : 0.9286
## F1 : 0.9286
## Prevalence : 0.7368
## Detection Rate : 0.6842
## Detection Prevalence : 0.7368
## Balanced Accuracy : 0.8643
##
## 'Positive' Class : 1
##
## [2018-03-02 04:14:46 s.MDB] Run completed in 0.18 minutes (Real: 11.01; User: 0.96; System: 0.06)
Let’s look at the tuning results (this is a small dataset and tuning may not be very accurate):
parkinsons.mdb.tune$extra$gridSearch$tune.results
We now use the core rtemis supervised learning function elevate, which performs nested resampling for cross-validation and hyperparameter tuning:
parkinsons.mdb.10fold <- elevate(parkinsons, mod = 'mdb',
gamma = c(.8, .9),
learning.rate = c(.01, .05),
seed = 2018)
## [2018-03-02 04:14:46 elevate] Hello, egenn
## ------------------------------------------------------
## Classification Input Summary
## ------------------------------------------------------
## Training features: 195 x 22
## Training outcome: 195 x 1
##
## [2018-03-02 04:14:46 elevate] Training MDB on 10 resamples...
## [2018-03-02 04:14:46 resLearn] Hello, egenn: resLearn running...
## ------------------------------------------------------
## Resampling Parameters
## ------------------------------------------------------
## n.resamples: 10
## resampler: kfold
##
## [2018-03-02 04:14:46 resample] Created 10 independent folds
##
## ------------------------------------------------------
## MDB Parameters
## ------------------------------------------------------
## params:
## gamma: 0.8, 0.9
## learning.rate: 0.01, 0.05
##
## [2018-03-02 04:14:46 resLearn] Training MediBoost Tree-Structured Boosting on 10 resamples...
## [2018-03-02 04:17:21 resLearn] Run completed in 2.58 minutes (Real: 154.55; User: 16.81; System: 0.81)
##
## ======================================================
## elevate MDB
## ======================================================
## N repeats = 1
## N resamples = 10
## Resampler = kfold
## Balanced Accuracy of 10 aggregated test sets in each repeat = 0.87
## ======================================================
## [2018-03-02 04:17:21 elevate] Run completed in 2.58 minutes (Real: 154.65; User: 16.91; System: 0.81)
We can get a summary of the cross-validation by printing the elevate object:
parkinsons.mdb.10fold
## ======================================================
## .:rtemis Cross-Validated Model
## ======================================================
## Algorithm: MDB (MediBoost Tree-Structured Boosting)
## Resampling: n = 10, type = kfold
## N of repeats: 1
## Average Balanced Accuracy across repeats = 0.8722364
Let’s grab a dataset from the massive OpenML repository. (We can read the .arff files as CSVs.)
con <- gzcon(url("https://www.openml.org/data/get_csv/53273/sleep.arff"),
text = TRUE)
sleep <- read.csv(con, header = TRUE, na.strings = "?")
checkData(sleep)
## -------------------------------
## Dataset: sleep
## -------------------------------
##
## Summary
## -------------------------------
## 62 cases with 8 features:
## * 4 continuous features
## * 3 integer features
## * 1 categorical feature, which is not ordered
## * 0 constant features
## * 2 features include 'NA' values; 8 'NA' values total
## ** Max percent missing in a feature is 6.45% (max_life_span)
## ** Max percent missing in a case is 25.00% (case #13)
##
## Recommendations
## -------------------------------
## * Consider imputing missing values or use complete cases only
## * Check the 3 integer features and consider if they should be converted to factors
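If we wanted to follow the second recommendation, converting integer columns to factors is straightforward in base R (a hypothetical sketch, not run here; we keep the integers as they are):
# int.idx <- sapply(sleep, is.integer)
# sleep[int.idx] <- lapply(sleep[int.idx], factor)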
We can impute missing data with preproc:
sleep <- preproc(sleep, impute = TRUE)
## [2018-03-02 04:17:22 preproc.default] Imputing missing values...
## [2018-03-02 04:17:22 preproc.default] Done
Train and plot MDB:
sleep.mdb <- s.MDB(sleep, gamma = .8, learning.rate = .1)
## [2018-03-02 04:17:22 s.MDB] Hello, egenn
## [2018-03-02 04:17:22 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 62 x 7
## Training outcome: 62 x 1
## Testing features: Not available
## Testing outcome: Not available
##
## [2018-03-02 04:17:22 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction N P
## N 30 2
## P 3 27
##
## Accuracy : 0.9194
## 95% CI : (0.8217, 0.9733)
## No Information Rate : 0.5323
## P-Value [Acc > NIR] : 3.924e-11
##
## Kappa : 0.8384
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9091
## Specificity : 0.9310
## Pos Pred Value : 0.9375
## Neg Pred Value : 0.9000
## Precision : 0.9375
## Recall : 0.9091
## F1 : 0.9231
## Prevalence : 0.5323
## Detection Rate : 0.4839
## Detection Prevalence : 0.5161
## Balanced Accuracy : 0.9201
##
## 'Positive' Class : N
##
## [2018-03-02 04:17:23 s.MDB] Run completed in 0.01 minutes (Real: 0.64; User: 0.62; System: 0.01)
mplot3.mdb(sleep.mdb)
Let’s load a dataset from the Penn ML Benchmarks GitHub repository. R allows us to read a gzipped file and unzip on the fly using gzcon and read.table:
rzd <- gzcon(url("https://github.com/EpistasisLab/penn-ml-benchmarks/raw/master/datasets/classification/chess/chess.tsv.gz"),
             text = TRUE)
chess <- read.table(rzd, header = TRUE)
chess$target <- factor(chess$target, levels = c(1, 0))
checkData(chess)
## -------------------------------
## Dataset: chess
## -------------------------------
##
## Summary
## -------------------------------
## 3196 cases with 37 features:
## * 0 continuous features
## * 36 integer features
## * 1 categorical feature, which is not ordered
## * 0 constant features
## * 0 features include 'NA' values
##
## Recommendations
## -------------------------------
## * Everything looks good
chess.mdb <- s.MDB(chess, gamma = .8, learning.rate = .1)
## [2018-03-02 04:17:25 s.MDB] Hello, egenn
## [2018-03-02 04:17:25 dataPrepare] Imbalanced classes: using Inverse Probability Weighting
## ------------------------------------------------------
## Input Summary
## ------------------------------------------------------
## Training features: 3196 x 36
## Training outcome: 3196 x 1
## Testing features: Not available
## Testing outcome: Not available
##
## [2018-03-02 04:17:25 s.MDB] Training MDB...
## ------------------------------------------------------
## MDB Classification Training Summary
## ------------------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 1623 28
## 0 46 1499
##
## Accuracy : 0.9768
## 95% CI : (0.971, 0.9818)
## No Information Rate : 0.5222
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9536
## Mcnemar's Test P-Value : 0.04813
##
## Sensitivity : 0.9724
## Specificity : 0.9817
## Pos Pred Value : 0.9830
## Neg Pred Value : 0.9702
## Precision : 0.9830
## Recall : 0.9724
## F1 : 0.9777
## Prevalence : 0.5222
## Detection Rate : 0.5078
## Detection Prevalence : 0.5166
## Balanced Accuracy : 0.9771
##
## 'Positive' Class : 1
##
## [2018-03-02 04:17:33 s.MDB] Run completed in 0.13 minutes (Real: 7.72; User: 7.37; System: 0.24)
mplot3.mdb(chess.mdb)