# Packages Required
library(tm)         # Text Mining
library(SnowballC)  # Word Stemming
library(wordcloud)  # Visualize Word Clouds
library(e1071)      # Naïve Bayes
library(gmodels)    # Cross Table

# Introduction

Naïve Bayes classifiers apply Bayes rule to classify data using probabilities estimated from previously seen data, and those estimates can be updated as new data becomes available. Naïve Bayes performs surprisingly well at text classification, even with minimal data preprocessing or wrangling.

   

# The SMS Spam Collection Data Set

This project demonstrates how Naïve Bayes can be used to predict and filter SMS spam utilizing the SMS Spam Collection Data Set by Almeida & Hidalgo (2011), which was retrieved from the UCI machine learning repository (Lichman, 2013). The example borrows heavily from Machine Learning with R by Lantz (2015), chapter 4, with a few minor changes.

The data set is a collection of 5574 SMS messages labeled as either spam or ham. There are 747 spam messages and 4827 ham messages. A snapshot of the text file looks like the following:

ham Going to join tomorrow.
spam You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99
ham I want to tell you how bad I feel that basically the only times I text you lately are when I need drugs
spam PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S.I.M. points. Call 08718738001 Identifier Code: 49557 Expires 26/11/04
ham Total disappointment, when I texted you was the craziest shit got :(
ham Its just the effect of irritation. Just ignore it
ham What about this one then.
ham I think that tantrum’s finished so yeah I’ll be by at some point
ham Compliments to you. Was away from the system. How your side.
ham happened here while you were adventuring
ham Hey chief, can you give me a bell when you get this. Need to talk to you about this royal visit on the 1st june.

The Naïve Bayes classifier uses Bayes rule to estimate the posterior probability that an SMS message is spam or ham, given the probabilities of each word appearing in previously seen spam and ham messages. The conditional probabilities of the problem are:

 

\[ P(Spam \mid Words) = \frac{P(Words \mid Spam) \, P(Spam)}{P(Words)} \] \[ P(Ham \mid Words) = \frac{P(Words \mid Ham) \, P(Ham)}{P(Words)} \]

 

Where:

 

\[ P(Words) = P(Spam)P(Words \mid Spam) + P(\neg Spam)P(Words \mid \neg Spam) \]

 

For a given set of words (a new SMS message), the posterior probabilities of both ham and spam are computed, and the likelier classification is chosen.
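As a toy illustration with made-up probabilities (these numbers are hypothetical, not estimates from the data set): suppose 13% of messages are spam, and the word "free" appears in 20% of spam messages but only 1% of ham messages.

```r
# Toy Bayes-rule calculation with made-up probabilities,
# not values estimated from the SMS data set
p_spam   <- 0.13   # P(Spam)
p_ham    <- 0.87   # P(Ham) = 1 - P(Spam)
p_w_spam <- 0.20   # P("free" | Spam), hypothetical
p_w_ham  <- 0.01   # P("free" | Ham), hypothetical

# P(Words) by the law of total probability
p_w <- p_spam * p_w_spam + p_ham * p_w_ham

# Posterior probabilities from Bayes rule
p_spam_w <- p_spam * p_w_spam / p_w
p_ham_w  <- p_ham  * p_w_ham  / p_w

round(p_spam_w, 3)   # about 0.749: a message containing "free" leans spam
```

Since P(Spam | Words) exceeds P(Ham | Words) here, the message would be classified as spam.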

   

# Text Transformation

Before training a model, the data needs to be transformed into a more useful form. The tm package in R provides functions for transforming and manipulating text data.

 

Import the data
sms_raw <- read.table(file = "SMSSpamCollection.tsv", header = FALSE, sep = "\t", quote = "", colClasses = "character")
str(sms_raw)
## 'data.frame':    5574 obs. of  2 variables:
##  $ V1: chr  "ham" "ham" "spam" "ham" ...
##  $ V2: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...

 

Rename the columns, convert the type column to a factor, and check the counts of ham and spam messages
colnames(sms_raw) <- c('type', 'text')
sms_raw$type <- factor(sms_raw$type)
table(sms_raw$type)
## 
##  ham spam 
## 4827  747

 

Visualize the data in a word cloud
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham") 
wordcloud(spam$text, max.words = 40, scale = c(3,0.5), random.order = FALSE)
Figure 1: Common Spam Words

 

wordcloud(ham$text, max.words = 40, scale = c(3,0.5), random.order = FALSE)
Figure 2: Common Ham Words

 

These word clouds suggest it should be possible to distinguish ham from spam SMS messages: words like ‘free,’ ‘prize,’ and ‘call’ are good indicators of spam.

 

Transform the data using the tm package

First, create a corpus. A corpus is a collection of text documents.

sms_corpus <- VCorpus(VectorSource(sms_raw$text))
sms_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5574

 

Change all words to lower case
as.character(sms_corpus[[10]])
## [1] "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"

 

sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
as.character(sms_corpus_clean[[10]])
## [1] "had your mobile 11 months or more? u r entitled to update to the latest colour mobiles with camera for free! call the mobile update co free on 08002986030"

 

Reduce words to their stems
as.character(sms_corpus_clean[[10]])
## [1] "had your mobile 11 months or more? u r entitled to update to the latest colour mobiles with camera for free! call the mobile update co free on 08002986030"

 

sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
as.character(sms_corpus_clean[[10]])
## [1] "had your mobil 11 month or more? u r entitl to updat to the latest colour mobil with camera for free! call the mobil updat co free on 08002986030"

 

Remove stop words
as.character(sms_corpus_clean[[10]])
## [1] "had your mobil 11 month or more? u r entitl to updat to the latest colour mobil with camera for free! call the mobil updat co free on 08002986030"

 

sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
as.character(sms_corpus_clean[[10]])
## [1] "  mobil 11 month  ? u r entitl  updat   latest colour mobil  camera  free! call  mobil updat co free  08002986030"

 

Remove numbers
as.character(sms_corpus_clean[[10]])
## [1] "  mobil 11 month  ? u r entitl  updat   latest colour mobil  camera  free! call  mobil updat co free  08002986030"

 

sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
as.character(sms_corpus_clean[[10]])
## [1] "  mobil  month  ? u r entitl  updat   latest colour mobil  camera  free! call  mobil updat co free  "

 

Remove punctuation
# Replaces punctuation with spaces
replacePunctuation <- function(x) {
  gsub("[[:punct:]]+", " ", x)
}

 

as.character(sms_corpus_clean[[10]])
## [1] "  mobil  month  ? u r entitl  updat   latest colour mobil  camera  free! call  mobil updat co free  "

 

sms_corpus_clean <- tm_map(sms_corpus_clean, replacePunctuation)
as.character(sms_corpus_clean[[10]])
## [1] "  mobil  month    u r entitl  updat   latest colour mobil  camera  free  call  mobil updat co free  "

 

Remove white space
as.character(sms_corpus_clean[[10]])
## [1] "  mobil  month    u r entitl  updat   latest colour mobil  camera  free  call  mobil updat co free  "

 

sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
as.character(sms_corpus_clean[[10]])
## [1] " mobil month u r entitl updat latest colour mobil camera free call mobil updat co free "

 

Check a few examples
as.character(sms_corpus[[20]])
## [1] "England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+"
as.character(sms_corpus_clean[[20]])
## [1] "england v macedonia dont miss goals team news txt ur nation team eg england try wales scotland txt ã poboxoxwwq "

 

as.character(sms_corpus[[30]])
## [1] "Ahhh. Work. I vaguely remember that! What does it feel like? Lol"
as.character(sms_corpus_clean[[30]])
## [1] "ahhh work vagu rememb doe feel like lol"

 

as.character(sms_corpus[[40]])
## [1] "Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!"
as.character(sms_corpus_clean[[40]])
## [1] "hello saturday go just text see decid anyth tomo tri invit anything "

 

It looks like the text data is in a more usable form.

   

# Data Preprocessing

Create a Document-Term-Matrix

A document-term matrix records the frequency of every term in the corpus for each individual document. This is the form of input Naïve Bayes needs.
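The structure can be sketched in base R with a hypothetical two-message corpus (a hand-rolled illustration only; tm's tokenizer differs in details such as dropping very short words):

```r
# Hypothetical two-message corpus, tokenized by hand
docs  <- list(doc1 = c("free", "prize", "call", "now"),
              doc2 = c("call", "me", "tomorrow"))
terms <- sort(unique(unlist(docs)))

# One row per document, one column per term, cells = term counts
dtm <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm
```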

sms_corpus_clean <- tm_map(sms_corpus_clean, PlainTextDocument) # Convert back to the correct data type
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

 

Split the Data

The messages are already in random order, so splitting into training and testing sets is as simple as taking the first 75% for training and the remaining 25% for testing.

# Train
sms_dtm_train <- sms_dtm[1:4180, ]            # 75%
sms_train_labels <- sms_raw[1:4180, ]$type    # Labels
prop.table(table(sms_train_labels))           # Check
## sms_train_labels
##       ham      spam 
## 0.8648325 0.1351675

 

# Test
sms_dtm_test <- sms_dtm[4181:5574, ]          # 25%
sms_test_labels <- sms_raw[4181:5574, ]$type  # Labels
prop.table(table(sms_test_labels))            # Check
## sms_test_labels
##       ham      spam 
## 0.8694405 0.1305595

 

Check the Frequent Words

# Frequent words in spam
sms_spam_freq <- sort(colSums(as.matrix(sms_dtm_train[sms_train_labels=="spam",])), decreasing=TRUE)
head(sms_spam_freq, 5)
## call free  now  txt text 
##  270  175  164  134  105

 

# Frequent words in ham
sms_ham_freq <- sort(colSums(as.matrix(sms_dtm_train[sms_train_labels=="ham",])), decreasing=TRUE)
head(sms_ham_freq, 5)
##  get  can will  now just 
##  263  261  244  221  216

 

Remove the Infrequent Words

# Infrequent words in spam
sms_spam_freq <- sort(colSums(as.matrix(sms_dtm_train[sms_train_labels=="spam",])), decreasing=FALSE)
head(sms_spam_freq, 5)
##    â‘morrow      â‘rent         –         aah aaooooright 
##           0           0           0           0           0

 

# Infrequent words in ham
sms_ham_freq <- sort(colSums(as.matrix(sms_dtm_train[sms_train_labels=="ham",])), decreasing=FALSE)
head(sms_ham_freq, 5)
##  â“harri      “ “harri  abdomen aberdeen 
##        0        0        0        0        0

 

Remove words that appear in fewer than 5 SMS messages

sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
sms_dtm_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_test <- sms_dtm_test[ , sms_freq_words]

 

Convert Numerical to Categorical

The Naïve Bayes classifier only needs to know whether a document contains a word; it does not matter how many times the word is present.

# Converts the frequency to "Yes" if greater than 0, "No" otherwise
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

 

sms_train <- apply(sms_dtm_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_test, MARGIN = 2, convert_counts)

   

# The Naïve Bayes Classifier

Like most machine learning models, the Naïve Bayes classifier is first trained on the training data; predictions are then made on the testing data to measure how accurate the model is.

 

Train the Model

sms_model <- naiveBayes(sms_train, sms_train_labels)

 

Test the Model

sms_test_pred <- predict(sms_model, sms_test)

 

Evaluate

CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1394 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1205 |        17 |      1222 | 
##              |     0.986 |     0.014 |     0.877 | 
##              |     0.994 |     0.093 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         7 |       165 |       172 | 
##              |     0.041 |     0.959 |     0.123 | 
##              |     0.006 |     0.907 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1212 |       182 |      1394 | 
##              |     0.869 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

 

Laplace Smoothing

If a word appears only in ham messages and never in spam messages in the training data, its estimated probability given spam is exactly zero, so the classifier will classify every SMS containing that word as ham. Laplace smoothing prevents this by adding a small pseudo-count to every word.

sms_model <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred <- predict(sms_model, sms_test)
CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1394 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1206 |        19 |      1225 | 
##              |     0.984 |     0.016 |     0.879 | 
##              |     0.995 |     0.104 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       163 |       169 | 
##              |     0.036 |     0.964 |     0.121 | 
##              |     0.005 |     0.896 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1212 |       182 |      1394 | 
##              |     0.869 |     0.131 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

   

# Conclusion

The classifier correctly identified 90.7% of the spam messages (165 of 182), misclassifying spam as ham only 17 times. It was even better at classifying ham messages, correctly identifying 99.4% (1205 of 1212) and misclassifying ham as spam only 7 times, for an overall accuracy of 98.3%.

Using Laplace smoothing slightly reduced the overall accuracy but cut the number of ham messages misclassified as spam from 7 to 6. That error is arguably the more important one to users: it is better to let a few more spam messages through than to block any legitimate messages.

The model can be further improved by tweaking the Laplace value, doing more data preprocessing, and including or excluding different words. The more data that is available, the better the accuracy will be as well. The model can be improved as new data becomes available.

   

# References

Almeida, T. A., & Hidalgo, J. M. G. (2011). SMS Spam Collection Data Set. Retrieved from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

Lantz, B. (2015). Machine learning with R. Packt Publishing Ltd.

Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from https://archive.ics.uci.edu/ml








# Revision History

| Revision | Date           | Author       | Description |
|----------|----------------|--------------|-------------|
| 1.0      | April 16, 2018 | Ryan Whitell | Genesis     |