# Packages Required
library(ggplot2)       # Plotting
library(rpart)         # Decision Tree
library(randomForest)  # Random Forest
library(gmodels)       # Cross Table
library(rattle)        # Tree Diagram
library(rpart.plot)    # Tree Diagram
library(RColorBrewer)  # Tree Diagram

   

Introduction

Decision trees are perhaps the most intuitive classification models for humans to interpret. Decision tree algorithms build trees by recursively splitting the data according to some information measure until a stopping criterion is met. The many nuances of improving performance are omitted from this discussion but are easy to find online. The important characteristic is that each node contains a subset of the data, and a node is split in the way that minimizes the diversity (impurity) of its children.
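
To make the idea of "diversity" concrete, here is a minimal sketch (not part of the original analysis) of the Gini impurity, one common measure used to score candidate splits. The helper names gini and split_gini are hypothetical.

# Gini impurity of a set of class labels: 0 means pure, higher means more mixed
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Weighted impurity of a candidate split of a numeric feature at a threshold;
# a tree-growing algorithm would pick the feature/threshold that minimizes this
split_gini <- function(feature, labels, threshold) {
  left  <- labels[feature <= threshold]
  right <- labels[feature >  threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}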

Random forests are an ensemble method for decision trees. Many trees are built on different views of the data, usually by taking random subsets of the training examples (and of the features considered at each split). A new observation is assigned the majority vote of all the trees for classification, or their average prediction for regression. Random forests generally generalize better than single decision trees.
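
As a toy illustration (not from the original post) of the voting step, majority voting over per-tree predictions could look like this; votes and majority_vote are hypothetical names.

# votes: one row per observation, one column per tree, entries are class labels
majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))
}
# e.g. majority_vote(rbind(c("med", "med", "high"), c("low", "med", "med"))) returns "med" "med"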

   

The Wine Data Set

The data set by Cortez, Cerdeira, Almeida, Matos, & Reis (2009), retrieved from the UCI Machine Learning Repository (Lichman, 2013), contains examples of the Portuguese “Vinho Verde” red wine, each described by 11 physicochemical features and a corresponding quality score. The features are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

   

Exploratory Data Analysis

wine <- read.table(file = "winequality-red.csv", header = TRUE, sep = ";") # Read in the data
str(wine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

 

lapply(wine, summary)
## $fixed.acidity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90 
## 
## $volatile.acidity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800 
## 
## $citric.acid
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000 
## 
## $residual.sugar
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500 
## 
## $chlorides
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100 
## 
## $free.sulfur.dioxide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00 
## 
## $total.sulfur.dioxide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00 
## 
## $density
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037 
## 
## $pH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010 
## 
## $sulphates
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000 
## 
## $alcohol
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90 
## 
## $quality
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

 

The only thing that really stands out is the quality variable. The scale runs from 0 to 10, but the wines in this data set range only from 3 to 8.
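
A quick sanity check (not in the original code) confirms this range:

range(wine$quality) # should return 3 8, matching the Min. and Max. in the summary above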

 

ggplot(data = wine, aes(x = quality)) +
  geom_histogram(bins = 6) +
  labs(x = "Quality", y = "Count", title = "Wine Quality") +
  scale_x_continuous(breaks = seq(0, 8)) # Plot the frequency
Figure 1: Wine Quality Rating Frequency

 

prop.table(table(wine$quality))*100 # Get the percentages
## 
##          3          4          5          6          7          8 
##  0.6253909  3.3145716 42.5891182 39.8999375 12.4452783  1.1257036

 

Medium-quality wines (ratings of 5 and 6) dominate this data set, accounting for more than 80% of the observations.

   

Data Preprocessing

The data are already in a workable format. Decision trees are insensitive to feature scaling and handle mixed data types well, so minimal preprocessing is needed beyond converting the response variable into a factor for the algorithm.

wine$quality <- as.factor(wine$quality)

 

Also, the distribution of the quality variable should be considered carefully. Because quality is rated 0-10 but the data only contain wines rated between 3 and 8, a decision needs to be made about grouping. This decision should be made with the input of a domain expert, in this case a sommelier, about which grouping best represents different wine qualities. For example, because there are no wines with a quality of 1, our model won’t be able to predict a quality of 1 from the results of a physicochemical test. However, if a quality of 1 is not much different in the eyes of a sommelier than a 3, then we can group this range and label it ‘low.’ This assumption is made here and the qualities are grouped as follows:

 

levels(wine$quality) # Inspect
## [1] "3" "4" "5" "6" "7" "8"

 

levels(wine$quality) <- c("low", "low", "med", "med", "high", "high") # Convert: 3-4 -> low, 5-6 -> med, 7-8 -> high
levels(wine$quality) # Inspect
## [1] "low"  "med"  "high"

   

Decision Tree

Splitting the Data Between Testing and Training

set.seed(77) # Get the same data each time
idx <- sample(nrow(wine), round(nrow(wine)*0.7))  # Create 2 samples with ratio 70:30
wine_train <- wine[idx, ] # 1119 (70%)
wine_test <- wine[-idx, ] # 480 (30%)
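
An optional sanity check (not part of the original analysis): because the split is random, the class proportions in the training and testing sets should roughly match those of the full data.

prop.table(table(wine_train$quality)) # training proportions
prop.table(table(wine_test$quality))  # testing proportions, should look similar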

 

Train the Model

# control=rpart.control(minsplit=2, cp=0), adding these parameters would build a full tree
wine_model <- rpart(formula = quality ~ fixed.acidity + 
                                        volatile.acidity +
                                        citric.acid +
                                        residual.sugar +
                                        chlorides +
                                        free.sulfur.dioxide +
                                        total.sulfur.dioxide +
                                        density +
                                        pH +
                                        sulphates +
                                        alcohol,
                    data = wine_train,
                    method = "class") 

 

Visualize the Tree

fancyRpartPlot(model = wine_model, sub = "Figure 2: Wine Quality Tree")

 

Test the Model

wine_pred <- predict(wine_model, wine_test, type = "class")

 

Evaluating Performance

# Returns the proportion of correct predictions
get.accuracy <- function(prediction, real) {
  mean(prediction == real)
}

 

get.accuracy(wine_pred, wine_test$quality)
## [1] 0.8291667

 

Cross Table

CrossTable(wine_pred, wine_test$quality, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##              | actual 
##    predicted |       low |       med |      high | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##          med |        17 |       376 |        46 |       439 | 
##              |     0.039 |     0.856 |     0.105 |     0.915 | 
##              |     1.000 |     0.952 |     0.676 |           | 
## -------------|-----------|-----------|-----------|-----------|
##         high |         0 |        19 |        22 |        41 | 
##              |     0.000 |     0.463 |     0.537 |     0.085 | 
##              |     0.000 |     0.048 |     0.324 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        17 |       395 |        68 |       480 | 
##              |     0.035 |     0.823 |     0.142 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 

 

Summary

The decision tree was able to predict the quality of wine with about 83% accuracy. However, the tree that was built did not have any leaves leading to a ‘low’ quality wine. This is most likely because the tree stopped growing early in order to avoid overfitting the data. Even though this tree is accurate on this data set, it would likely do poorly in a real-world scenario where ‘low’ quality wines must be identified. A reasonable next step would be to tweak the minimum split and complexity parameters until a tree that can classify ‘low’ quality wines is built. Another technique would be to overgrow the tree and prune it back as necessary.
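
A minimal sketch of the second suggestion, assuming the same training data; the control values are illustrative, not tuned:

# Grow a deliberately large tree, then prune back to the cp value that
# minimizes the cross-validated error (xerror)
full_tree <- rpart(quality ~ ., data = wine_train, method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))
best_cp     <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)
# pruned_tree can then be evaluated on wine_test in the same way as wine_model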

   

Random Forest

Random forests should generalize even better than a single tree. A random forest is also less likely to miss the ‘low’ quality rating entirely.

 

Train the Model

# randomForest determines classification vs. regression from the response type,
# so no extra arguments are needed for a factor response like quality
wine_rf_model <- randomForest(formula = quality ~ fixed.acidity  
                                  + volatile.acidity 
                                  + citric.acid 
                                  + residual.sugar 
                                  + chlorides 
                                  + free.sulfur.dioxide 
                                  + total.sulfur.dioxide
                                  + density
                                  + pH
                                  + sulphates
                                  + alcohol,
               data = wine_train) 

 

Importance of Variables

randomForest::importance(wine_rf_model)
##                      MeanDecreaseGini
## fixed.acidity                25.33092
## volatile.acidity             36.06153
## citric.acid                  27.98890
## residual.sugar               28.20345
## chlorides                    26.98044
## free.sulfur.dioxide          21.10616
## total.sulfur.dioxide         29.98270
## density                      29.55651
## pH                           23.07235
## sulphates                    35.41551
## alcohol                      50.05013

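The same information can be shown graphically with randomForest's built-in importance plot (not included in the original):

varImpPlot(wine_rf_model, main = "Variable Importance") # dot chart of MeanDecreaseGini
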
 

Test the Model

wine_rf_pred <- predict(wine_rf_model, wine_test, type = "class")

 

Evaluating Performance

get.accuracy(wine_rf_pred, wine_test$quality)
## [1] 0.8895833

 

Cross Table

CrossTable(wine_rf_pred, wine_test$quality, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##              | actual 
##    predicted |       low |       med |      high | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##          low |         0 |         2 |         0 |         2 | 
##              |     0.000 |     1.000 |     0.000 |     0.004 | 
##              |     0.000 |     0.005 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##          med |        17 |       389 |        30 |       436 | 
##              |     0.039 |     0.892 |     0.069 |     0.908 | 
##              |     1.000 |     0.985 |     0.441 |           | 
## -------------|-----------|-----------|-----------|-----------|
##         high |         0 |         4 |        38 |        42 | 
##              |     0.000 |     0.095 |     0.905 |     0.087 | 
##              |     0.000 |     0.010 |     0.559 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        17 |       395 |        68 |       480 | 
##              |     0.035 |     0.823 |     0.142 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 

   

Conclusion

The random forest provided a better model than the decision tree. Not only did it have higher accuracy, it also generalized better by including the ‘low’ class among its predictions, something the best-performing decision tree was not able to do. However, the random forest was still unable to correctly classify any ‘low’ quality wines.

The decision tree needed only 6 of the 11 variables to classify wine at over 80% accuracy. Alcohol and sulphates were the dominant variables for both the decision tree and the random forest, although volatile.acidity had a greater impact on the random forest than on the decision tree.

The accuracy of both methods was expected: the data are dominated by ‘med’ quality wines, so most leaves lead to that classification.
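
One possible follow-up for the ‘low’ class problem, sketched here under the assumption that stratified sampling is acceptable; the per-class sample sizes are illustrative, not tuned:

# Draw a balanced bootstrap sample from each quality class for every tree,
# which stops the dominant 'med' class from swamping the rare 'low' class
wine_rf_balanced <- randomForest(quality ~ ., data = wine_train,
                                 strata = wine_train$quality,
                                 sampsize = c(30, 30, 30))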

   

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. Retrieved from https://archive.ics.uci.edu/ml/datasets/wine+quality

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from https://archive.ics.uci.edu/ml








Revision History
Revision  Date            Author        Description
1.0       April 16, 2018  Ryan Whitell  Genesis