# Packages Required
library(e1071)  # SVM

# Introduction

Support Vector Machines are not really “machines” but more a clever algorithm with a simple concept that has a complicated implementation. The simple concept is just a hyperplane that separates classes with the widest possible margin between the hyperplane and closest data points. New examples will be classified by which side of the hyperplane (decision boundary) they fall. The complicated, and clever, implementation of this algorithm involves finding this hyperplane using vectors and quadratic programming. Basically, only the points closest to the decision boundary will be considered (the support vectors). For data sets that are not linearly separable, the kernel trick is used.

# The Mushroom Data Set

The data set by Schlimmer (1987), retrieved from the UCI machine learning repository (Lichman, 2013) contains 8124 examples with 22 features representing 23 species of gilled mushrooms. The labels are whether the mushroom is edible or poisonous:

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y

# Exploratory Data Analysis

shroom <- read.csv("agaricus-lepiota.csv") # Read in the data
str(shroom)
## 'data.frame':    8124 obs. of  23 variables:
##  $class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ... ##$ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ... ##$ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ... ##$ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ... ##$ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ... ##$ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ... ##$ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ... ##$ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ... ##$ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ... ##$ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ... ##$ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ... ##$ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ... lapply(shroom, summary) ##$class
##    e    p
## 4208 3916
##
## $cap.shape ## b c f k s x ## 452 4 3152 828 32 3656 ## ##$cap.surface
##    f    g    s    y
## 2320    4 2556 3244
##
## $cap.color ## b c e g n p r u w y ## 168 44 1500 1840 2284 144 16 16 1040 1072 ## ##$bruises
##    f    t
## 4748 3376
##
## $odor ## a c f l m n p s y ## 400 192 2160 400 36 3528 256 576 576 ## ##$gill.attachment
##    a    f
##  210 7914
##
## $gill.spacing ## c w ## 6812 1312 ## ##$gill.size
##    b    n
## 5612 2512
##
## $gill.color ## b e g h k n o p r u w y ## 1728 96 752 732 408 1048 64 1492 24 492 1202 86 ## ##$stalk.shape
##    e    t
## 3516 4608
##
## $stalk.root ## ? b c e r ## 2480 3776 556 1120 192 ## ##$stalk.surface.above.ring
##    f    k    s    y
##  552 2372 5176   24
##
## $stalk.surface.below.ring ## f k s y ## 600 2304 4936 284 ## ##$stalk.color.above.ring
##    b    c    e    g    n    o    p    w    y
##  432   36   96  576  448  192 1872 4464    8
##
## $stalk.color.below.ring ## b c e g n o p w y ## 432 36 96 576 512 192 1872 4384 24 ## ##$veil.type
##    p
## 8124
##
## $veil.color ## n o w y ## 96 96 7924 8 ## ##$ring.number
##    n    o    t
##   36 7488  600
##
## $ring.type ## e f l n p ## 2776 48 1296 36 3968 ## ##$spore.print.color
##    b    h    k    n    o    r    u    w    y
##   48 1632 1872 1968   48   72   48 2388   48
##
## $population ## a c n s v y ## 384 340 400 1248 4040 1712 ## ##$habitat
##    d    g    l    m    p    u    w
## 3148 2148  832  292 1144  368  192

Immediately noticed is that veil type only has one factor and stock root is missing in 2480 examples. Other than that the data looks good.

# Data Preprocessing

### Drop Veil Type

The veil type is a constant, so drop it from the data set.

shroom <- subset(x = shroom, select = -veil.type)

### Handle Missing Values

In this case, stalk root contains many missing values. There are many different ways to handle this type of data, in this example the missing data is turned into a factor called “u” for unknown.

table(shroom$stalk.root) # Inspect ## ## ? b c e r ## 2480 3776 556 1120 192 levels(shroom$stalk.root) <- c("u", "b", "c", "e", "r") # Convert
table(shroom$stalk.root) # Inspect Again ## ## u b c e r ## 2480 3776 556 1120 192 ### Create Numeric Data The SVM algorithm requires all data to be numeric. shroom <- as.data.frame(lapply(shroom, as.numeric)) # Change all rows to numeric shroom$class <- factor(shroom$class) # Convert class back to factor levels(shroom$class) <- c("e", "p")                 # Re-label class factors
str(shroom)
## 'data.frame':    8124 obs. of  22 variables:
##  $class : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ... ##$ cap.shape               : num  6 6 1 6 6 6 1 1 6 1 ...
##  $cap.surface : num 3 3 3 4 3 4 3 4 4 3 ... ##$ cap.color               : num  5 10 9 9 4 10 9 9 9 10 ...
##  $bruises : num 2 2 2 2 1 2 2 2 2 2 ... ##$ odor                    : num  7 1 4 7 6 1 1 4 7 1 ...
##  $gill.attachment : num 2 2 2 2 2 2 2 2 2 2 ... ##$ gill.spacing            : num  1 1 1 1 2 1 1 1 1 1 ...
##  $gill.size : num 2 1 1 2 1 1 1 1 2 1 ... ##$ gill.color              : num  5 5 6 6 5 6 3 6 8 3 ...
##  $stalk.shape : num 1 1 1 1 2 1 1 1 1 1 ... ##$ stalk.root              : num  4 3 3 4 4 3 3 3 4 3 ...
##  $stalk.surface.above.ring: num 3 3 3 3 3 3 3 3 3 3 ... ##$ stalk.surface.below.ring: num  3 3 3 3 3 3 3 3 3 3 ...
##  $stalk.color.above.ring : num 8 8 8 8 8 8 8 8 8 8 ... ##$ stalk.color.below.ring  : num  8 8 8 8 8 8 8 8 8 8 ...
##  $veil.color : num 3 3 3 3 3 3 3 3 3 3 ... ##$ ring.number             : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ring.type : num 5 5 5 5 1 5 5 5 5 5 ... ##$ spore.print.color       : num  3 4 4 3 4 3 3 4 3 3 ...
##  $population : num 4 3 3 4 1 3 3 4 5 4 ... ##$ habitat                 : num  6 2 4 6 2 2 4 4 2 4 ...

# The Support Vector Machine

### Splitting the Data Between Testing and Training

set.seed(77)                                          # Get the same data each time
idx <- sample(nrow(shroom), round(nrow(shroom)*0.7))  # Create 2 samples with ratio 70:30
shroom_train <- shroom[idx, ]                         # Split 70%
shroom_test <- shroom[-idx, ]                         # Split 30%

### Train the Model

shroom_model <- svm(formula = class ~ ., data = shroom_train, kernel = "linear")
summary(shroom_model)
##
## Call:
## svm(formula = class ~ ., data = shroom_train, kernel = "linear")
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  linear
##        cost:  1
##       gamma:  0.04761905
##
## Number of Support Vectors:  616
##
##  ( 305 311 )
##
##
## Number of Classes:  2
##
## Levels:
##  e p

### Test the Model

shroom_pred <- predict(shroom_model, shroom_test)
summary(shroom_pred)
##    e    p
## 1280 1157

### Evaluating Accuracy

# Returns the percentage of correct predictions
get.accuracy <- function(prediction, real) {
accuracy <- prediction == real
return (length(accuracy[accuracy == TRUE])/length(accuracy))
}

get.accuracy(shroom_pred, shroom_test$class) ## [1] 0.983176 ### Different Kernels Different kernels may provide better or worse performance. Picking the right kernel requires a bit of domain knowledge. Knowing what features make elements similar can provide insight into which kernel should be applied. # Returns the accuracy of the SVM with the given kernels all.kernels <- function(kernel) { shroom_model <- svm(formula = class ~ ., data = shroom_train, kernel = kernel) shroom_pred <- predict(shroom_model, shroom_test) return(get.accuracy(shroom_pred, shroom_test$class))
}

kernels <- c("linear", "polynomial", "radial", "sigmoid")
kernel_accuracy <- unlist(lapply(kernels, all.kernels))
kernel_accuracy*100
## [1]  98.31760 100.00000 100.00000  79.23677

Polynomial and radial kernels provide 100% classification accuracy on this data set. The sigmoid kernel performance is poor.

### Tuning

The tune.svn function will run 10-fold cross validation[link to evaluation metrics] for different hyperparameters in a specified range. In this case, performance with polynomial and radial kernels are 100% and tuning is not necessary. It is an expensive operation, so this example only checks 3 different values for the ‘cost’ and ‘gamma’ hyperparameters for reference. Figure 1 plots the results of the tuned hyperparameter grid.

shroom_tune_params <- tune.svm(class ~ ., data = shroom_train, gamma = 2^(-1:1), cost = 2^(-1:1))
summary(shroom_tune_params)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma cost
##    0.5  0.5
##
## - best performance: 0
##
## - Detailed performance results:
##   gamma cost        error   dispersion
## 1   0.5  0.5 0.0000000000 0.0000000000
## 2   1.0  0.5 0.0005272408 0.0008489383
## 3   2.0  0.5 0.0286603629 0.0104148429
## 4   0.5  1.0 0.0000000000 0.0000000000
## 5   1.0  1.0 0.0000000000 0.0000000000
## 6   2.0  1.0 0.0172300057 0.0089118050
## 7   0.5  2.0 0.0000000000 0.0000000000
## 8   1.0  2.0 0.0000000000 0.0000000000
## 9   2.0  2.0 0.0168785118 0.0088889684

plot(shroom_tune_params)

# Conclusion

For this data set, the support vector machine black box was able to classify mushrooms as either edible or poisonous with 100% accuracy. This does not mean SVMs are always 100% accurate, it means that the data itself contained highly correlated features.

# References

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml

Schlimmer, J. (1987). Mushroom records drawn from the audubon society field guide to north american mushrooms: Alfred a. knopf. GH Lincoff (Pres.), New York. Retrieved from http://archive.ics.uci.edu/ml/datasets/Mushroom

Revision History
Revision Date Author Description
1.0 April 16, 2018 Ryan Whitell
1. Genesis