# Packages Required
library(e1071)  # SVM
GitHub Repository

   

Introduction

Support Vector Machines are not really “machines” but more a clever algorithm with a simple concept that has a complicated implementation. The simple concept is just a hyperplane that separates classes with the widest possible margin between the hyperplane and closest data points. New examples will be classified by which side of the hyperplane (decision boundary) they fall. The complicated, and clever, implementation of this algorithm involves finding this hyperplane using vectors and quadratic programming. Basically, only the points closest to the decision boundary will be considered (the support vectors). For data sets that are not linearly separable, the kernel trick is used.

   

The Mushroom Data Set

The data set by Schlimmer (1987), retrieved from the UCI machine learning repository (Lichman, 2013) contains 8124 examples with 22 features representing 23 species of gilled mushrooms. The labels are whether the mushroom is edible or poisonous:

  1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
  2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
  3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
  4. bruises?: bruises=t,no=f
  5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
  6. gill-attachment: attached=a,descending=d,free=f,notched=n
  7. gill-spacing: close=c,crowded=w,distant=d
  8. gill-size: broad=b,narrow=n
  9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
  10. stalk-shape: enlarging=e,tapering=t
  11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
  12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
  13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
  14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  16. veil-type: partial=p,universal=u
  17. veil-color: brown=n,orange=o,white=w,yellow=y
  18. ring-number: none=n,one=o,two=t
  19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
  20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
  21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
  22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

   

Exploratory Data Analysis

shroom <- read.csv("agaricus-lepiota.csv") # Read in the data
str(shroom)
## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

 

lapply(shroom, summary)
## $class
##    e    p 
## 4208 3916 
## 
## $cap.shape
##    b    c    f    k    s    x 
##  452    4 3152  828   32 3656 
## 
## $cap.surface
##    f    g    s    y 
## 2320    4 2556 3244 
## 
## $cap.color
##    b    c    e    g    n    p    r    u    w    y 
##  168   44 1500 1840 2284  144   16   16 1040 1072 
## 
## $bruises
##    f    t 
## 4748 3376 
## 
## $odor
##    a    c    f    l    m    n    p    s    y 
##  400  192 2160  400   36 3528  256  576  576 
## 
## $gill.attachment
##    a    f 
##  210 7914 
## 
## $gill.spacing
##    c    w 
## 6812 1312 
## 
## $gill.size
##    b    n 
## 5612 2512 
## 
## $gill.color
##    b    e    g    h    k    n    o    p    r    u    w    y 
## 1728   96  752  732  408 1048   64 1492   24  492 1202   86 
## 
## $stalk.shape
##    e    t 
## 3516 4608 
## 
## $stalk.root
##    ?    b    c    e    r 
## 2480 3776  556 1120  192 
## 
## $stalk.surface.above.ring
##    f    k    s    y 
##  552 2372 5176   24 
## 
## $stalk.surface.below.ring
##    f    k    s    y 
##  600 2304 4936  284 
## 
## $stalk.color.above.ring
##    b    c    e    g    n    o    p    w    y 
##  432   36   96  576  448  192 1872 4464    8 
## 
## $stalk.color.below.ring
##    b    c    e    g    n    o    p    w    y 
##  432   36   96  576  512  192 1872 4384   24 
## 
## $veil.type
##    p 
## 8124 
## 
## $veil.color
##    n    o    w    y 
##   96   96 7924    8 
## 
## $ring.number
##    n    o    t 
##   36 7488  600 
## 
## $ring.type
##    e    f    l    n    p 
## 2776   48 1296   36 3968 
## 
## $spore.print.color
##    b    h    k    n    o    r    u    w    y 
##   48 1632 1872 1968   48   72   48 2388   48 
## 
## $population
##    a    c    n    s    v    y 
##  384  340  400 1248 4040 1712 
## 
## $habitat
##    d    g    l    m    p    u    w 
## 3148 2148  832  292 1144  368  192

 

Immediately noticed is that veil type only has one factor and stock root is missing in 2480 examples. Other than that the data looks good.

   

Data Preprocessing

Drop Veil Type

The veil type is a constant, so drop it from the data set.

shroom <- subset(x = shroom, select = -veil.type)

 

Handle Missing Values

In this case, stalk root contains many missing values. There are many different ways to handle this type of data, in this example the missing data is turned into a factor called “u” for unknown.

table(shroom$stalk.root) # Inspect
## 
##    ?    b    c    e    r 
## 2480 3776  556 1120  192

 

levels(shroom$stalk.root) <- c("u", "b", "c", "e", "r") # Convert
table(shroom$stalk.root) # Inspect Again
## 
##    u    b    c    e    r 
## 2480 3776  556 1120  192

 

Create Numeric Data

The SVM algorithm requires all data to be numeric.

shroom <- as.data.frame(lapply(shroom, as.numeric)) # Change all rows to numeric
shroom$class <- factor(shroom$class)                # Convert class back to factor
levels(shroom$class) <- c("e", "p")                 # Re-label class factors
str(shroom)
## 'data.frame':    8124 obs. of  22 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : num  6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : num  3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : num  5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : num  2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : num  7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : num  1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : num  2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : num  5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : num  1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : num  4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.color              : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : num  5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : num  3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : num  4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : num  6 2 4 6 2 2 4 4 2 4 ...

   

The Support Vector Machine

Splitting the Data Between Testing and Training

set.seed(77)                                          # Get the same data each time
idx <- sample(nrow(shroom), round(nrow(shroom)*0.7))  # Create 2 samples with ratio 70:30
shroom_train <- shroom[idx, ]                         # Split 70%
shroom_test <- shroom[-idx, ]                         # Split 30%

 

Train the Model

shroom_model <- svm(formula = class ~ ., data = shroom_train, kernel = "linear")
summary(shroom_model)
## 
## Call:
## svm(formula = class ~ ., data = shroom_train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.04761905 
## 
## Number of Support Vectors:  616
## 
##  ( 305 311 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  e p

 

Test the Model

shroom_pred <- predict(shroom_model, shroom_test)
summary(shroom_pred)
##    e    p 
## 1280 1157

 

Evaluating Accuracy

# Returns the percentage of correct predictions
get.accuracy <- function(prediction, real) {
  accuracy <- prediction == real
  return (length(accuracy[accuracy == TRUE])/length(accuracy))
}

 

get.accuracy(shroom_pred, shroom_test$class)
## [1] 0.983176

 

Different Kernels

Different kernels may provide better or worse performance. Picking the right kernel requires a bit of domain knowledge. Knowing what features make elements similar can provide insight into which kernel should be applied.

# Returns the accuracy of the SVM with the given kernels
all.kernels <- function(kernel) {
  shroom_model <- svm(formula = class ~ ., data = shroom_train, kernel = kernel)
  shroom_pred <- predict(shroom_model, shroom_test)
  return(get.accuracy(shroom_pred, shroom_test$class))
}

 

kernels <- c("linear", "polynomial", "radial", "sigmoid")
kernel_accuracy <- unlist(lapply(kernels, all.kernels))
kernel_accuracy*100
## [1]  98.31760 100.00000 100.00000  79.23677

 

Polynomial and radial kernels provide 100% classification accuracy on this data set. The sigmoid kernel performance is poor.

 

Tuning

The tune.svn function will run 10-fold cross validation[link to evaluation metrics] for different hyperparameters in a specified range. In this case, performance with polynomial and radial kernels are 100% and tuning is not necessary. It is an expensive operation, so this example only checks 3 different values for the ‘cost’ and ‘gamma’ hyperparameters for reference. Figure 1 plots the results of the tuned hyperparameter grid.

shroom_tune_params <- tune.svm(class ~ ., data = shroom_train, gamma = 2^(-1:1), cost = 2^(-1:1))
summary(shroom_tune_params)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##    0.5  0.5
## 
## - best performance: 0 
## 
## - Detailed performance results:
##   gamma cost        error   dispersion
## 1   0.5  0.5 0.0000000000 0.0000000000
## 2   1.0  0.5 0.0005272408 0.0008489383
## 3   2.0  0.5 0.0286603629 0.0104148429
## 4   0.5  1.0 0.0000000000 0.0000000000
## 5   1.0  1.0 0.0000000000 0.0000000000
## 6   2.0  1.0 0.0172300057 0.0089118050
## 7   0.5  2.0 0.0000000000 0.0000000000
## 8   1.0  2.0 0.0000000000 0.0000000000
## 9   2.0  2.0 0.0168785118 0.0088889684

 

plot(shroom_tune_params)
Figure 1: 2x2 Tuned Hyperparameter Grid Plot

Figure 1: 2x2 Tuned Hyperparameter Grid Plot

   

Conclusion

For this data set, the support vector machine black box was able to classify mushrooms as either edible or poisonous with 100% accuracy. This does not mean SVMs are always 100% accurate, it means that the data itself contained highly correlated features.

   

References

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml

Schlimmer, J. (1987). Mushroom records drawn from the audubon society field guide to north american mushrooms: Alfred a. knopf. GH Lincoff (Pres.), New York. Retrieved from http://archive.ics.uci.edu/ml/datasets/Mushroom








Revision History
Revision Date Author Description
1.0 April 16, 2018 Ryan Whitell
  1. Genesis