Wednesday, February 8, 2017

Spam classification using Spark MLlib in IBM BigInsights

This post shows how to classify SMS messages as spam or ham using Spark MLlib.

Environment : IBM BigInsights 4.2

Step 1: Set up the dataset

We use the SMS Spam Collection Data Set from the UCI Machine Learning Repository.

For more details refer -

Download the dataset -

Unzip and upload the file (SMSSpamCollection) to HDFS (/tmp).
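
Each line of SMSSpamCollection holds a label (spam or ham), a tab, and the message text. As a minimal sketch of that layout in plain Scala (no Spark needed), the snippet below splits a line into its two fields. The helper name parseLine and the sample messages are made up for illustration and are not part of the dataset.

```scala
// Hypothetical helper: split one "label<TAB>message" line into its two fields.
def parseLine(line: String): (String, String) = {
  // Limit the split to 2 parts so any tab inside the message body is kept.
  val parts = line.split("\t", 2)
  (parts(0), parts(1))
}

// Made-up sample lines in the same layout, for illustration only.
val sampleLines = Seq(
  "ham\tSee you at lunch",
  "spam\tWIN a free prize now"
)

val parsed = sampleLines.map(parseLine)
parsed.foreach { case (label, text) => println(label + " -> " + text) }
```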

Step 2: Log in to the Spark shell

su hdfs
cd /usr/iop/current/spark-client
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

Step 3: At the Scala prompt, run the commands below

// Read the dataset.
val inputRdd = sc.textFile("/tmp/SMSSpamCollection")

// Split the records into spam and ham by their label (the first tab-separated field),
// then keep only the message text (the second field).
val linesWithSpam = inputRdd.filter(line => line.startsWith("spam\t"))
val spam = linesWithSpam.map(x => x.split("\t")(1))

val linesWithHam = inputRdd.filter(line => line.startsWith("ham\t"))
val ham = linesWithHam.map(x => x.split("\t")(1))

// Import the required MLlib classes.
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Convert each message into a vector of 100 features based on term frequency.
val tf = new HashingTF(numFeatures = 100)
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
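
To make the feature step more concrete, here is a minimal plain-Scala sketch of the idea behind hashed term frequencies: each word is hashed to one of a fixed number of buckets, and the bucket counts form the feature vector. It is an illustration only, not MLlib's implementation. The helper name hashedTermFrequency and the sample message are hypothetical, and the hash function used here may differ from the one HashingTF uses.

```scala
// Hypothetical helper illustrating the hashing trick; not MLlib's own code.
def hashedTermFrequency(terms: Seq[String], numFeatures: Int): Array[Double] = {
  val vector = Array.fill(numFeatures)(0.0)
  for (term <- terms) {
    // Map the term's hash code to a non-negative bucket index.
    val raw = term.hashCode % numFeatures
    val index = if (raw < 0) raw + numFeatures else raw
    vector(index) += 1.0
  }
  vector
}

// Made-up message, for illustration only.
val sampleVector = hashedTermFrequency("free prize free entry".split(" ").toSeq, 100)
println(sampleVector.sum) // total number of terms counted
```

With only 100 buckets, different words can land in the same bucket. That keeps the vectors small at the cost of some collisions, and a larger numFeatures reduces them.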

// Label spam as 1 and ham as 0.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val training_data = positiveExamples.union(negativeExamples)
// Cache the training data.
training_data.cache()

// Use 60% of the dataset for training and the remaining 40% for testing the model.
val Array(trainset, testset) = training_data.randomSplit(Array(0.6, 0.4))

// Train a logistic regression model on the training set, then predict on the test set.
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainset)

val predictionLabel = testset.map(x => (model.predict(x.features), x.label))
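
As a rough sketch of what a binary logistic regression prediction does, the plain-Scala snippet below computes a weighted sum of the features, passes it through the sigmoid function, and compares the result against a 0.5 threshold. The weights, intercept, and the sigmoid and predict helpers are hypothetical values and names chosen for illustration. A trained model learns its own weights from the training set.

```scala
// Hypothetical weights and intercept, for illustration only.
val weights = Array(0.8, -0.5, 1.2)
val intercept = -0.3

// Squash any real number into the range (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Score a feature vector and apply the threshold: 1.0 = spam, 0.0 = ham.
def predict(features: Array[Double], threshold: Double = 0.5): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  if (sigmoid(margin) > threshold) 1.0 else 0.0
}

println(predict(Array(1.0, 0.0, 1.0))) // margin is 1.7, so this prints 1.0
println(predict(Array(0.0, 2.0, 0.0))) // margin is -1.3, so this prints 0.0
```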

val accuracy = predictionLabel.filter(r => r._1 == r._2).count.toDouble / testset.count

println("Model accuracy : " + accuracy)
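
Accuracy alone can hide how well the spam class is detected, so it may help to look at precision and recall as well. The plain-Scala sketch below computes all three from (prediction, label) pairs shaped like the ones in predictionLabel. The pairs and the variable names are made up for illustration and are not results from this model.

```scala
// Made-up (prediction, label) pairs, for illustration only; 1.0 = spam, 0.0 = ham.
val samplePairs = Seq((1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0))

val truePositives  = samplePairs.count { case (p, l) => p == 1.0 && l == 1.0 }
val falsePositives = samplePairs.count { case (p, l) => p == 1.0 && l == 0.0 }
val falseNegatives = samplePairs.count { case (p, l) => p == 0.0 && l == 1.0 }

// Share of all messages classified correctly.
val sampleAccuracy = samplePairs.count { case (p, l) => p == l }.toDouble / samplePairs.size
// Share of messages flagged as spam that really are spam.
val samplePrecision = truePositives.toDouble / (truePositives + falsePositives)
// Share of real spam messages that were flagged.
val sampleRecall = truePositives.toDouble / (truePositives + falseNegatives)

println("accuracy=" + sampleAccuracy + " precision=" + samplePrecision + " recall=" + sampleRecall)
```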

With this, we have trained and run a model that classifies a message as spam or ham.

