# Packages Required
library(caret)    # _C_lassification _A_nd _RE_gression _T_raining
library(kernlab)  # Support Vector Machine
library(RSNNS)    # Artificial Neural Network
library(dummies)  # Dummy Variables
library(doSNOW)   # Executing in Parallel
library(ggplot2)  # Plotting

# Introduction

Choosing the correct evaluation metric is important when evaluating a model. Accuracy is a simple metric that provides quick feedback on how good a model is, but for skewed data it can be misleading. For example, if the data contains 99,990 examples of class ‘0’ and 10 examples of class ‘1,’ a model that always guesses ‘0’ will be 99.99% accurate. As another example, take a neural net with 88 outputs indicating the keys on a piano; if the ANN always guesses that no note is played (all zeros), it is still around 98% accurate.
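The skewed-data point can be sketched numerically. This is a minimal illustration with made-up labels, not part of the abalone analysis:

```r
# Hypothetical skewed labels: 99,990 examples of class '0', 10 of class '1'
labels <- c(rep(0, 99990), rep(1, 10))

# A "model" that always guesses the majority class '0'
predictions <- rep(0, length(labels))

# Accuracy looks excellent despite the model learning nothing
accuracy <- mean(predictions == labels)
accuracy  # 0.9999
```

The model never detects a single positive example, yet accuracy alone would rate it highly.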

There are many different metrics to consider; the best one depends on the goal of the model. In the case of cancer diagnosis, it is much better to lean towards a positive diagnosis even when there is no cancer (false positive) than to give a negative diagnosis when cancer is present (false negative). For spam detection, it is better to let some spam through if it means never blocking ham messages. The model should be evaluated based on the metric that matters most.

This project uses the caret package to tune and train a Multi-Layer Perceptron (ANN) and an SVM for the purpose of investigating different evaluation metrics.

# The Abalone Data Set

The data set by Nash, Sellers, Talbot, Cawthorn, & Ford (1994), retrieved from the UCI machine learning repository (Lichman, 2013), contains 4177 examples with 9 features describing an abalone. The number of rings gives the age of the abalone. The goal is to predict the age of the abalone using the other features, not by counting the rings.

| Name           | Data Type  | Meas. | Description                 |
|----------------|------------|-------|-----------------------------|
| Sex            | nominal    |       | M, F, and I (infant)        |
| Length         | continuous | mm    | Longest shell measurement   |
| Diameter       | continuous | mm    | perpendicular to length     |
| Height         | continuous | mm    | with meat in shell          |
| Whole Weight   | continuous | grams | whole abalone               |
| Shucked Weight | continuous | grams | weight of meat              |
| Viscera Weight | continuous | grams | gut weight (after bleeding) |
| Shell Weight   | continuous | grams | after being dried           |
| Rings          | integer    |       | +1.5 gives the age in years |

# Exploratory Data Analysis

The purpose of exploring the data first is to get familiar with it and to see if anything is of interest.

abalone <- read.csv("abalone-data.csv") # Read in the data
str(abalone)
## 'data.frame':    4177 obs. of  9 variables:
##  $ Sex      : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ Length   : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height   : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ Whole_W  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ Shucked_W: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ Viscera_W: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ Shell_W  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings    : int  15 7 9 10 7 8 20 16 9 19 ...

lapply(abalone, summary)
## $Sex
##    F    I    M
## 1307 1342 1528
##
## $Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.075   0.450   0.545   0.524   0.615   0.815
##
## $Diameter
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0550  0.3500  0.4250  0.4079  0.4800  0.6500
##
## $Height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1150  0.1400  0.1395  0.1650  1.1300
##
## $Whole_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0020  0.4415  0.7995  0.8287  1.1530  2.8255
##
## $Shucked_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0010  0.1860  0.3360  0.3594  0.5020  1.4880
##
## $Viscera_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0005  0.0935  0.1710  0.1806  0.2530  0.7600
##
## $Shell_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0015  0.1300  0.2340  0.2388  0.3290  1.0050
##
## $Rings
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   8.000   9.000   9.934  11.000  29.000

ggplot(data = abalone, aes(x = Rings + 1.5)) +
  geom_histogram(binwidth = 1, color = 'black', fill = '#099DD9') +
  scale_x_continuous(limits = c(0, 31), breaks = seq(0, 31, 1)) +
  xlab("Approx Age (years)") +
  ylab("Count")

There is one interesting level in the Sex feature: ‘infant.’ The number of rings for infants is in fact lower than that for males and females:

by(abalone$Rings, abalone$Sex, summary)
## abalone$Sex: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.00    9.00   10.00   11.13   12.00   29.00
## --------------------------------------------------------
## abalone$Sex: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    6.00    8.00    7.89    9.00   21.00
## --------------------------------------------------------
## abalone$Sex: M
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00    9.00   10.00   10.71   12.00   27.00

# Data Preprocessing

### Normalization

The SVM and ANN work better with normalized numerical data.

# Returns a vector normalized between min and max
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

abalone$Length    <- normalize(abalone$Length)
abalone$Diameter  <- normalize(abalone$Diameter)
abalone$Height    <- normalize(abalone$Height)
abalone$Whole_W   <- normalize(abalone$Whole_W)
abalone$Shucked_W <- normalize(abalone$Shucked_W)
abalone$Viscera_W <- normalize(abalone$Viscera_W)
abalone$Shell_W   <- normalize(abalone$Shell_W)

### Dummies

The Sex feature is categorical, so it should be split into dummy variables.

# Create dummies out of the Sex feature
abalone <- dummy.data.frame(abalone)

# Check the new features
str(abalone)
## 'data.frame':    4177 obs. of  11 variables:
##  $ SexF     : int  0 0 1 0 0 0 1 1 0 1 ...
##  $ SexI     : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ SexM     : int  1 1 0 1 0 0 0 0 1 0 ...
##  $ Length   : num  0.514 0.372 0.615 0.493 0.345 ...
##  $ Diameter : num  0.521 0.353 0.613 0.521 0.336 ...
##  $ Height   : num  0.0841 0.0796 0.1195 0.1106 0.0708 ...
##  $ Whole_W  : num  0.1813 0.0792 0.2391 0.182 0.0719 ...
##  $ Shucked_W: num  0.1503 0.0662 0.1718 0.1443 0.0595 ...
##  $ Viscera_W: num  0.1323 0.0632 0.1856 0.1494 0.0513 ...
##  $ Shell_W  : num  0.148 0.0683 0.2078 0.153 0.0533 ...
##  $ Rings    : int  15 7 9 10 7 8 20 16 9 19 ...
##  - attr(*, "dummies")=List of 1
##   ..$ Sex: int  1 2 3

# Remove the 'SexI' feature, (SexF == 0 & SexM == 0) implies SexI
abalone <- subset(abalone, select = -SexI)

### Add an Age Label

Because we are doing classification instead of regression, we need to create age labels for different ages:

• 0 - 6: Young
• 7 - 11: Adult
• 12+: Old

# Create a new variable 'Age,' add the recommended 1.5 to get years
abalone$Age <- abalone$Rings + 1.5

# Aggregate into 3 labels
abalone$Age <- cut(abalone$Age,
                   breaks = c(0, 7, 12, 31),
                   labels = c("Young", "Adult", "Old"))

# Check the labels
by(abalone$Rings + 1.5, abalone$Age, table)
## abalone$Age: Young
##
##  2.5  3.5  4.5  5.5  6.5
##    1    1   15   57  115
## --------------------------------------------------------
## abalone$Age: Adult
##
##  7.5  8.5  9.5 10.5 11.5
##  259  391  568  689  634
## --------------------------------------------------------
## abalone$Age: Old
##
## 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5
##  487  267  203  126  103   67   58   42   32   26   14    6    9    2    1
## 27.5 28.5 30.5
##    1    2    1

# Convert to factor
abalone$Age <- as.factor(abalone$Age)

# Remove the 'Rings' feature
abalone <- subset(abalone, select = -Rings)

# Check the data
lapply(abalone, summary)
## $SexF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3129  1.0000  1.0000
##
## $SexM
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3658  1.0000  1.0000
##
## $Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.5068  0.6351  0.6067  0.7297  1.0000
##
## $Diameter
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.4958  0.6218  0.5931  0.7143  1.0000
##
## $Height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1018  0.1239  0.1235  0.1460  1.0000
##
## $Whole_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1557  0.2825  0.2928  0.4077  1.0000
##
## $Shucked_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1244  0.2253  0.2410  0.3369  1.0000
##
## $Viscera_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1224  0.2245  0.2371  0.3325  1.0000
##
## $Shell_W
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.1281  0.2317  0.2365  0.3264  1.0000
##
## $Age
## Young Adult   Old
##   189  2541  1447

### Split the Data Between Testing and Training

The createDataPartition() function from the caret package splits the data into training and test sets in a way that preserves label proportions.

set.seed(77)
# Create the index using the caret package
idx <- createDataPartition(abalone$Age, times = 1, p = 0.7, list = FALSE)

# Split the data
abalone_train <- abalone[idx, ]
abalone_test  <- abalone[-idx, ]

# Check proportions
prop.table(table(abalone_train$Age))
##
##      Young      Adult        Old
## 0.04547009 0.60820513 0.34632479

prop.table(table(abalone_test$Age))
##
##      Young      Adult        Old
## 0.04472843 0.60862620 0.34664537

# SVM

### Train the Model

Using the caret package, an SVM with a radial kernel will be tuned using 10-fold cross validation to find the best Cost and Sigma.

# Run 10-fold cross validation 2 times with a grid search
tuning_method <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 2,
                              search = "grid")

# Tune on 3x3 parameter combinations
tuning_grid <- expand.grid(sigma = c(0.1, 1, 10),
                           C = c(0.1, 1, 10))

# Run 6 parallel instances to speed up execution
cluster <- makeCluster(6, type = "SOCK")
registerDoSNOW(cluster)

# Train the model with the best cost and sigma found
# during 10-fold cross validation
abalone_model <- caret::train(Age ~ .,
                              data = abalone_train,
                              method = "svmRadial",
                              tuneGrid = tuning_grid,
                              trControl = tuning_method)

# Stop parallel instances
stopCluster(cluster)

# View best hyperparameters
abalone_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 2925 samples
##    9 predictor
##    3 classes: 'Young', 'Adult', 'Old'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 2632, 2632, 2633, 2632, 2633, 2632, ...
## Resampling results across tuning parameters:
##
##   sigma  C     Accuracy   Kappa
##    0.1    0.1  0.7182847  0.38695481
##    0.1    1.0  0.7485544  0.47637887
##    0.1   10.0  0.7593328  0.50516639
##    1.0    0.1  0.7396672  0.44592210
##    1.0    1.0  0.7490611  0.48422412
##    1.0   10.0  0.7362245  0.45697881
##   10.0    0.1  0.6109429  0.01322378
##   10.0    1.0  0.7193057  0.42810069
##   10.0   10.0  0.6794591  0.36960108
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 10.

### Test

abalone_pred <- predict(abalone_model, abalone_test)

### Evaluate

caret::confusionMatrix(abalone_pred, abalone_test$Age)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Young Adult Old
##      Young    34     3   0
##      Adult    22   674 165
##      Old       0    85 269
##
## Overall Statistics
##
##                Accuracy : 0.7804
##                  95% CI : (0.7564, 0.803)
##     No Information Rate : 0.6086
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5444
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Young Class: Adult Class: Old
## Sensitivity               0.60714       0.8845     0.6198
## Specificity               0.99749       0.6184     0.8961
## Pos Pred Value            0.91892       0.7828     0.7599
## Neg Pred Value            0.98189       0.7749     0.8163
## Prevalence                0.04473       0.6086     0.3466
## Detection Rate            0.02716       0.5383     0.2149
## Detection Prevalence      0.02955       0.6877     0.2827
## Balanced Accuracy         0.80232       0.7514     0.7580

# ANN

### Train the Model

Using the caret package, an ANN will be tuned using 10-fold cross validation to find the best combination of hidden neurons across three layers.

# Tune on 3x3x3 parameter combinations
tuning_grid <- expand.grid(layer1 = c(2, 4, 6),
                           layer2 = c(2, 4, 6),
                           layer3 = c(2, 4, 6))

# Run 6 parallel instances to speed up execution
cluster <- makeCluster(6, type = "SOCK")
registerDoSNOW(cluster)

# Train the model with the best number of neurons in each
# hidden layer found during 10-fold cross validation
abalone_model <- caret::train(Age ~ .,
                              data = abalone_train,
                              method = "mlpML",
                              tuneGrid = tuning_grid,
                              trControl = tuning_method)

# Stop parallel instances
stopCluster(cluster)

# View best hyperparameters
abalone_model
## Multi-Layer Perceptron, with multiple layers
##
## 2925 samples
##    9 predictor
##    3 classes: 'Young', 'Adult', 'Old'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 2632, 2632, 2632, 2633, 2633, 2634, ...
## Resampling results across tuning parameters:
##
##   layer1  layer2  layer3  Accuracy   Kappa
##   2       2       2       0.6453464  0.1480024
##   2       2       4       0.6801665  0.2877089
##   2       2       6       0.7227212  0.4307243
##   2       4       2       0.6762544  0.2524850
##   2       4       4       0.6988264  0.3437947
##   2       4       6       0.7198540  0.4337084
##   2       6       2       0.6743066  0.2595837
##   2       6       4       0.7138628  0.3879127
##   2       6       6       0.7090811  0.3738796
##   4       2       2       0.7012067  0.3462089
##   4       2       4       0.7217148  0.4348955
##   4       2       6       0.7299481  0.4566246
##   4       4       2       0.6729534  0.2331192
##   4       4       4       0.7252938  0.4351600
##   4       4       6       0.7418813  0.4680028
##   4       6       2       0.6966137  0.3446560
##   4       6       4       0.7286968  0.4342664
##   4       6       6       0.7261938  0.4170635
##   6       2       2       0.6986368  0.3689989
##   6       2       4       0.7304390  0.4420467
##   6       2       6       0.7360815  0.4551009
##   6       4       2       0.7053355  0.3688204
##   6       4       4       0.7259455  0.4153914
##   6       4       6       0.7328164  0.4554820
##   6       6       2       0.7114210  0.3727608
##   6       6       4       0.7256638  0.4415206
##   6       6       6       0.7366012  0.4619021
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were layer1 = 4, layer2 = 4 and
##  layer3 = 6.

### Test

abalone_pred <- predict(abalone_model, abalone_test)

### Evaluate

caret::confusionMatrix(abalone_pred, abalone_test\$Age)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Young Adult Old
##      Young    47    19   0
##      Adult     9   624 125
##      Old       0   119 309
##
## Overall Statistics
##
##                Accuracy : 0.7827
##                  95% CI : (0.7589, 0.8053)
##     No Information Rate : 0.6086
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5746
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Young Class: Adult Class: Old
## Sensitivity               0.83929       0.8189     0.7120
## Specificity               0.98411       0.7265     0.8545
## Pos Pred Value            0.71212       0.8232     0.7220
## Neg Pred Value            0.99241       0.7206     0.8483
## Prevalence                0.04473       0.6086     0.3466
## Detection Rate            0.03754       0.4984     0.2468
## Detection Prevalence      0.05272       0.6054     0.3419
## Balanced Accuracy         0.91170       0.7727     0.7833

# Conclusion

The evaluation depends on the goal of the analysis; in this case, predicting the age of the abalone without going through the painstaking process of counting the rings. Depending on what needed to be known, the number of categories to break the abalone ages into could have been different. Perhaps it is only useful to know whether an abalone is past a certain age; the other end of this spectrum would be a regression analysis to predict the age by year. Most examples in our training set (around 60%) were labeled ‘Adult.’ The boundaries of the ‘Adult’ label, however, were arbitrarily chosen. There is a chance that what our model predicted as ‘Adult’ is a better indication of what makes an abalone an adult. The confusion matrix makes it clear that there is a useful threshold: ‘Young’ and ‘Old’ were never misclassified as each other. The confusion matrix provides an overview of how well a model performs, but it can also give some insight into the data.

Without getting into too much depth, the caret::confusionMatrix() function returns many different evaluation metrics. Which metric to focus on depends on the goal of the model.
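Most of these metrics are simple ratios of the confusion matrix cells. As a sketch with made-up binary counts (not the abalone results), they can be computed by hand:

```r
# Hypothetical binary confusion matrix counts
tp <- 40  # true positives
fn <- 10  # false negatives
fp <- 5   # false positives
tn <- 45  # true negatives

sensitivity <- tp / (tp + fn)              # recall: how many positives were caught
specificity <- tn / (tn + fp)              # how many negatives were correctly rejected
precision   <- tp / (tp + fp)              # pos pred value: how trustworthy a positive call is
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

c(sensitivity = sensitivity, specificity = specificity,
  precision = precision, accuracy = accuracy)
```

For the cancer-diagnosis scenario above, sensitivity would be the metric to maximize; for spam detection, precision would matter more.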

# References

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml

Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., & Ford, W. B. (1994). The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report, (48).

# Revision History

| Revision | Date           | Author       | Description |
|----------|----------------|--------------|-------------|
| 1.0      | April 16, 2018 | Ryan Whitell | Genesis     |