Big Data Analytics: 2017

Monday, November 6, 2017

Integrating Watson Alchemy API with IBM Data Science Experience

This post describes how to integrate the "Watson Alchemy API" service in IBM Bluemix with RStudio in IBM Data Science Experience.

We call the NLP APIs provided by the Alchemy API from RStudio in IBM Data Science Experience, using the rJava library to get the sentiment, entities and relations for unstructured text.

1) Setup the "Watson Alchemy API" service in IBM Bluemix

Log in to https://console.ng.bluemix.net/ (create an account if you do not have one), then create an Alchemy service.

Get the apikey from the service credentials.

2) Build a Java Application

Download Watson Developer Cloud Java SDK

https://github.com/watson-developer-cloud/java-sdk/releases/download/java-sdk-3.3.1/java-sdk-3.3.1-jar-with-dependencies.jar

Create a Java Project with the below code

package com.bluemix;

import java.util.HashMap;
import java.util.Map;

import com.ibm.watson.developer_cloud.alchemy.v1.AlchemyLanguage;
import com.ibm.watson.developer_cloud.alchemy.v1.model.DocumentSentiment;
import com.ibm.watson.developer_cloud.alchemy.v1.model.Entities;
import com.ibm.watson.developer_cloud.alchemy.v1.model.TypedRelations;

// Thin wrapper around the Alchemy Language service. Each method returns the
// service response as a String so that it is easy to call from R through rJava.
public class BlueMix_Alchemy_API {

    public static void main(String[] args) {}

    // Returns the document-level sentiment of inputStr.
    public String getSentiment(String inputStr, String ApiKey) {
        AlchemyLanguage service = new AlchemyLanguage();
        service.setApiKey(ApiKey);
        Map<String, Object> params = new HashMap<String, Object>();
        params.put(AlchemyLanguage.TEXT, inputStr);
        DocumentSentiment sentiment = service.getSentiment(params).execute();
        return sentiment.toString();
    }

    // Returns the typed relations found in inputStr.
    public String getTypedRelations(String inputStr, String ApiKey) {
        AlchemyLanguage service = new AlchemyLanguage();
        service.setApiKey(ApiKey);
        Map<String, Object> params = new HashMap<String, Object>();
        params.put(AlchemyLanguage.TEXT, inputStr);
        TypedRelations relations = service.getTypedRelations(params).execute();
        return relations.toString();
    }

    // Returns the entities found in inputStr.
    public String getEntities(String inputStr, String ApiKey) {
        AlchemyLanguage service = new AlchemyLanguage();
        service.setApiKey(ApiKey);
        Map<String, Object> params = new HashMap<String, Object>();
        params.put(AlchemyLanguage.TEXT, inputStr);
        Entities entities = service.getEntities(params).execute();
        return entities.toString();
    }
}

Export the project as a jar. The R commands below expect it to be named WatsonAlchemySentimentAnalysis.jar.

3) Setup the "IBM Data Science Experience"

Open RStudio and install the rJava R library.

Create a folder and set it as the working directory. Upload your jar and java-sdk-3.3.1-jar-with-dependencies.jar to the working directory.

R Commands:

> library(rJava)
> cp = c("/home/rstudio/WatsonAlchemySentimentAnalysis/WatsonAlchemySentimentAnalysis.jar",
+        "/home/rstudio/WatsonAlchemySentimentAnalysis/java-sdk-3.3.1-jar-with-dependencies.jar")
> .jinit(classpath=cp)
> instance = .jnew("com.bluemix.BlueMix_Alchemy_API")
> sentiment <- .jcall(instance, "S", "getSentiment", "IBM Watson won the Jeopardy television show hosted by Alex Trebek","<provideTheAPIKeyFromBluemix>")
> cat(sentiment)
> entities <- .jcall(instance, "S", "getEntities", "IBM Watson won the Jeopardy television show hosted by Alex Trebek","<provideTheAPIKeyFromBluemix>")
> cat(entities)
> relation <- .jcall(instance, "S", "getTypedRelations", "IBM Watson won the Jeopardy television show hosted by Alex Trebek","<provideTheAPIKeyFromBluemix>")
> cat(relation)


Further Reading:

Integrating the "IBM BigInsights for Apache Hadoop" service in IBM Bluemix with RStudio in IBM Data Science Experience - Part 1

Integrating the "IBM BigInsights for Apache Hadoop" service in IBM Bluemix with RStudio in IBM Data Science Experience - Part 2

Sunday, March 5, 2017

Configuring and Running Apache Kafka in IBM BigInsights

This post describes how to configure and run Kafka in IBM BigInsights.

Apache Kafka is an open-source messaging system built on a publish-subscribe model. See https://kafka.apache.org/

I assume you are familiar with terms such as Producer, Consumer, Broker, Topic and Partition. Here, I focus on creating multiple brokers in BigInsights, creating a topic, publishing messages from the command line and reading them back with a console consumer.

Environment: BigInsights 4.2

Step 1: Creating Kafka Brokers from Ambari

By default, Ambari has one Kafka Broker configured. Depending on your use case, you may need to create more brokers.

Log in to the Ambari UI, click on Hosts, and add a Kafka Broker to each node where you want a broker installed.

You can then see multiple brokers running in the Kafka UI.

Step 2: Create a Topic

Log in to one of the nodes where a broker is running, then create a topic.

cd /usr/iop/4.2.0.0/kafka/bin

su kafka -c "./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic CustomerOrder"

You can get the details of the topic using the below describe command.

su kafka -c "./kafka-topics.sh --describe --zookeeper localhost:2181 --topic CustomerOrder"

Step 3: Start the Producer

In the argument --broker-list, pass all the brokers that are running.

su kafka -c "./kafka-console-producer.sh --broker-list bi1.test.com:6667,bi2.test.com:6667 --topic CustomerOrder"

When you run the above command, it waits for user input. You can type a sample message:

{"ID":99, "CUSTOMID":234,"ADDRESS":"12,5-7,westmead", "ORDERID":99, "ITEM":"iphone6", "COST":980}

Step 4: Start the Consumer

Open another Linux terminal and start the consumer. It displays all the messages sent by the producer.

su kafka -c "./kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic CustomerOrder"
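The effect of --from-beginning can be pictured with a small in-memory model. The sketch below is plain Java and does not use the Kafka client API. The class TopicLogSketch and its methods are hypothetical names, and it models a single partition as an append-only list in which each message's offset is its index. A consumer that starts at offset 0 replays everything, while one that starts at the current end of the log only sees messages published afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a single-partition topic; not the Kafka API. */
final class TopicLogSketch {
    private final List<String> log = new ArrayList<>();

    /** Appends a message and returns the offset it was stored at. */
    int publish(String message) {
        log.add(message);
        return log.size() - 1;
    }

    /** Offset a consumer starts from when it only wants new messages. */
    int endOffset() {
        return log.size();
    }

    /** Returns a copy of every message from the given offset onward. */
    List<String> readFrom(int offset) {
        return new ArrayList<>(log.subList(offset, log.size()));
    }

    public static void main(String[] args) {
        TopicLogSketch topic = new TopicLogSketch();
        topic.publish("{\"ORDERID\":99, \"ITEM\":\"iphone6\"}");
        int latest = topic.endOffset();   // a consumer joining now, without --from-beginning
        topic.publish("{\"ORDERID\":100, \"ITEM\":\"iphone7\"}");

        System.out.println("from beginning: " + topic.readFrom(0).size() + " message(s)");
        System.out.println("from latest:    " + topic.readFrom(latest).size() + " message(s)");
    }
}
```

In this model the first reader sees both messages and the second sees only the one published after it joined, which is why the console consumer above replays the sample order even though it was started after the producer.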

Thus, we are able to configure and run a sample pub-sub system using Kafka.

Thursday, March 2, 2017

Configuring and Running Apache Phoenix in IBM BigInsights

This post describes how to configure and run Phoenix in IBM BigInsights.

Apache Phoenix is an open-source project that provides SQL on HBase. See https://phoenix.apache.org/

Environment: BigInsights 4.2

1) Configure Phoenix from Ambari

Log in to the Ambari UI, then go to the HBase configuration and enable Phoenix.

Save the changes and restart HBase.

2) Validating Phoenix

Log in to a Linux terminal as the hbase user and run the commands below. They create the sample tables and run some select queries. You can see the output in the console.

cd /usr/iop/current/phoenix-client/bin

./psql.py localhost:2181:/hbase-unsecure ../doc/examples/WEB_STAT.sql ../doc/examples/WEB_STAT.csv ../doc/examples/WEB_STAT_QUERIES.sql

3) Running Queries using Phoenix

This section runs a few queries on Phoenix, focusing on basic operations.

Open the Terminal and run the below commands

cd /usr/iop/current/phoenix-client/bin

./sqlline.py testiop.in.com:2181:/hbase-unsecure

Create the table then insert some rows and do a select on the table.

CREATE TABLE IF NOT EXISTS CUSTOMER_ORDER (
   ID BIGINT NOT NULL,
   CUSTOMID INTEGER,
   ADDRESS VARCHAR,
   ORDERID INTEGER,
   ITEM VARCHAR,
   COST INTEGER
   CONSTRAINT PK PRIMARY KEY (ID)
   );

upsert into CUSTOMER_ORDER values (1,234,'11,5-7,westmead',99,'iphone7',1200);
upsert into CUSTOMER_ORDER values (2,288,'12,5-7,westmead',99,'iphone6',1000);
upsert into CUSTOMER_ORDER values (3,299,'13,5-7,westmead',99,'iphone5',600);

select * from CUSTOMER_ORDER;
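Phoenix uses UPSERT rather than separate INSERT and UPDATE statements: a row is inserted if its primary key is new and overwritten otherwise. The sketch below illustrates that behaviour with a plain Java map. It is not Phoenix or JDBC code, the class UpsertSketch is a hypothetical name, and only the ID and ITEM columns are modelled.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a table keyed by its primary key; not Phoenix code. */
final class UpsertSketch {
    // Primary key (ID) -> ITEM. A TreeMap keeps the rows ordered by key.
    private final Map<Long, String> rows = new TreeMap<>();

    /** Inserts the row if the key is new, otherwise overwrites the existing row. */
    void upsert(long id, String item) {
        rows.put(id, item);
    }

    int rowCount() {
        return rows.size();
    }

    String item(long id) {
        return rows.get(id);
    }

    public static void main(String[] args) {
        UpsertSketch table = new UpsertSketch();
        table.upsert(1L, "iphone7");
        table.upsert(2L, "iphone6");
        table.upsert(1L, "iphone8");   // same key: replaces the row instead of failing
        System.out.println(table.rowCount() + " rows, ID 1 -> " + table.item(1L));
    }
}
```

By the same logic, running one of the upsert statements above a second time with a different ITEM for the same ID would leave a single row for that ID, holding the latest values.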

For the rest of the SQL syntax, see https://phoenix.apache.org/language/

4) Bulk Loading the data to the table

Here, we are doing a bulk load to the above table.

Upload the data to HDFS. Enclose any field that contains commas in double quotes, the default quote character of the bulk load tool.

[root@test bin]# hadoop fs -cat /tmp/importData.csv
11,234,"11,5-7,westmead",99,"iphone7",1200
12,288,"11,5-7,westmead",99,"iphone7",1200
13,299,"11,5-7,westmead",99,"iphone7",1200
14,234,"11,5-7,westmead",99,"iphone7",1200

Run the import command from the terminal

sudo -u hbase hadoop jar ../phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table CUSTOMER_ORDER --input /tmp/importData.csv

Thus, we are able to configure and perform some basic Queries on Phoenix.

Monday, February 20, 2017

Create and Configure Separate Queue in YARN Capacity Scheduler for running Spark Jobs in IBM BigInsights

This post describes how to create and configure a separate queue in the YARN Capacity Scheduler for running Spark jobs.

Environment : BigInsights 4.2

1) Create a queue for Spark from YARN Queue Manager

Here I am allocating 50% of the resources to the default queue and the remaining 50% to Spark jobs. You can configure the queues based on your use case, and you can also create hierarchical queues.

Log in to the Ambari UI and go to YARN Queue Manager.

The default queue is configured to use 100% resources. You need to modify the Capacity and Max Capacity to 50%.

Save the changes by clicking the tick button.

Now, click on the +Add Queue button and create a new queue for Spark jobs (named sparkQueue in the submit command below).

Save and refresh the queues.

Open the Resource Manager UI and confirm the Queues configured.
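The Capacity Scheduler requires the capacities of sibling queues under the same parent to add up to 100%, which is why the default queue has to be reduced to 50% before a second 50% queue can be added. The check below is a minimal Java sketch of that rule. The class QueueCapacitySketch and the method capacitiesAreValid are hypothetical names used for illustration and are not part of YARN or Ambari.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that checks a set of sibling queue capacities; not a YARN API. */
final class QueueCapacitySketch {

    /** Returns true when the sibling queue capacities (in percent) add up to 100. */
    static boolean capacitiesAreValid(Map<String, Double> siblingCapacities) {
        double total = 0.0;
        for (double capacity : siblingCapacities.values()) {
            total += capacity;
        }
        // Small tolerance because capacities may be fractional.
        return Math.abs(total - 100.0) < 1e-6;
    }

    public static void main(String[] args) {
        Map<String, Double> queues = new LinkedHashMap<>();
        queues.put("default", 50.0);
        queues.put("sparkQueue", 50.0);
        System.out.println("50/50 split valid: " + capacitiesAreValid(queues));

        queues.put("default", 100.0);   // default left at 100% while sparkQueue takes 50%
        System.out.println("100/50 split valid: " + capacitiesAreValid(queues));
    }
}
```

The same rule applies at every level of a hierarchical layout: the children of each parent queue share that parent's capacity and must themselves sum to 100%.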

2) Submit a Spark job to the queue

Login to the cluster and run the below commands to submit the job.

[root@cluster ~]# su hdfs

[hdfs@cluster root]$ cd /usr/iop/current/spark-client

[hdfs@cluster spark-client]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue sparkQueue lib/spark-examples.jar 10

In the YARN Resource Manager UI, you can see that the job is running in the new queue.

In the logs, you can see the output from the Spark job.

Thus, you are able to run Spark jobs in a separate queue.

Sunday, February 12, 2017

Running HDFS Word Count using Spark Streaming in IBM BigInsights

This post runs a simple word count example to demonstrate Spark Streaming in IBM BigInsights.

Environment : IBM BigInsights 4.2

Step 1: Run the Spark Streaming word Count example for HDFS.

su hdfs

cd /usr/iop/current/spark-client

./bin/spark-submit --class org.apache.spark.examples.streaming.HdfsWordCount --master yarn-client lib/spark-examples.jar /tmp/wordcount

The above command listens to the HDFS folder (/tmp/wordcount). Whenever a new file is loaded into that folder, it counts the words in the file and prints the result.

Step 2: Open another Linux terminal and run the below command as hdfs user.
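The per-file computation amounts to splitting each line into words and summing a count per word. The following is a plain Java sketch of that step only, with no Spark involved. The class WordCountSketch is a hypothetical name, and splitting on single spaces while skipping empty tokens is an assumption made for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-alone version of the counting step; it does not use Spark. */
final class WordCountSketch {

    /** Splits the text on single spaces and counts how often each word occurs. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split(" ")) {
            if (word.isEmpty()) {
                continue;   // skip the empty tokens produced by consecutive spaces
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "word=count" entry per distinct word, sorted by word.
        System.out.println(countWords("Hello - Date is Hello"));
    }
}
```

In the streaming job, the same counting is applied to each batch of newly arrived files rather than to a single string.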

echo "Hello - Date is `date`" | hadoop fs -put - /tmp/wordcount/test1.txt

In the Linux terminal in step 1, you can see the output of the word count.

The above example helps us validate that Spark Streaming is working.

Wednesday, February 8, 2017

Spam classification using Spark MLlib in IBM BigInsights

This post classifies SMS messages into Spam and Ham using Spark MLlib.

Environment : IBM BigInsights 4.2

Step 1: Setup the dataset

We are using the dataset from UCI Machine Learning Repository - SMS Spam Collection Data Set.

For more details refer -
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Download the dataset - https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

Unzip and upload the file (SMSSpamCollection) to HDFS (/tmp).

Step 2: Login to Spark Shell

su hdfs
cd /usr/iop/current/spark-client
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

Step 3: At the Scala prompt, run the below commands

// Read the dataset. Each line has the form "<label>\t<message>".
val inputRdd = sc.textFile("/tmp/SMSSpamCollection")

// Get the records that are Spam and Ham. Match on the label column only,
// so that a ham message containing the word "spam" is not misfiled.
val linesWithSpam = inputRdd.filter(line => line.startsWith("spam\t"))
val spam = linesWithSpam.map( x => x.split("\t")(1))

val linesWithHam = inputRdd.filter(line => line.startsWith("ham\t"))
val ham = linesWithHam.map( x => x.split("\t")(1))

// Import the required mllib classes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Convert the text to vector of 100 features based on term frequency.
val tf = new HashingTF(numFeatures = 100)
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

// Label the Spam as 1 and ham as 0.
val positiveExamples = spamFeatures.map( features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map( features => LabeledPoint(0, features))
val training_data = positiveExamples.union(negativeExamples)
// cache the training data
training_data.cache()

// We use 60% of dataset for training and remaining for testing the model.
val Array(trainset, testset) = training_data.randomSplit(Array(0.6, 0.4))

// We use Logistic Regression model, and make predictions with the resulting model
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainset)

val predictionLabel = testset.map( x => (model.predict(x.features), x.label))

val accuracy = predictionLabel.filter(r => r._1 == r._2).count.toDouble / testset.count

println("Model accuracy : " + accuracy)
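The feature step above turns each message into a fixed-length vector by hashing every word into one of 100 buckets and counting the hits per bucket. The following plain Java sketch shows the idea only. The class HashingTfSketch is a hypothetical name, and using String.hashCode as the hash function is an assumption for illustration, so the bucket indices are not guaranteed to match what a particular Spark version produces.

```java
/** Hypothetical illustration of hashed term frequencies; not the Spark MLlib implementation. */
final class HashingTfSketch {

    /** Maps each term to a bucket in [0, numFeatures) and counts the terms per bucket. */
    static double[] transform(String[] terms, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String term : terms) {
            // floorMod keeps the index non-negative even when hashCode() is negative.
            int index = Math.floorMod(term.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] words = "free entry win free prize".split(" ");
        double[] features = transform(words, 100);
        int bucket = Math.floorMod("free".hashCode(), 100);
        System.out.println("bucket for \"free\": " + bucket + ", count: " + features[bucket]);
    }
}
```

With only 100 buckets, different words can land in the same bucket. A larger numFeatures reduces such collisions at the cost of longer vectors.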

Thus, we are able to create and run a model that predicts whether a message is Spam or Ham.