Tuesday, March 29, 2016

Automated Java code to read the BigInsights Web Crawler output

Several customers have asked for Java code that reads the output of the BigInsights WebCrawler application and converts it to a CSV file for further processing.

This post shows how to read the web crawler output using Java. The WebCrawler application writes its output as a sequence file, so we use the SequenceFile reader to read it.

1)  Create a Java project with the code below, then generate a runnable jar.

package com.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class NutchConverter {

    public void readData(String coreSitePath, String inputDirPath)
            throws Exception {

        Configuration conf = NutchConfiguration.create();
        // Load the cluster settings from the core-site.xml passed on the command line
        conf.addResource(new Path(coreSitePath));
        FileSystem fs = FileSystem.get(conf);
        // Each crawl segment stores the fetched pages under content/part-*/data
        FileStatus[] statuses = fs.globStatus(new Path(inputDirPath
                + "/segments/*/content/part-*/data"));

        for (FileStatus status : statuses) {
            System.out.println("Processing file: " + status.getPath());
            SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                    status.getPath(), conf);
            try {
                // Key is the page URL, value is the fetched content
                Text key = new Text();
                Content content = new Content();
                while (reader.next(key, content)) {
                    System.out.println("Key :" + key);
                    System.out.println("content :" + content);
                }
            } finally {
                reader.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {

        System.out.println("Testing... .... ");
        NutchConverter sam = new NutchConverter();

        String coreSitePath = args[0];
        String inputDirPath = args[1];
        sam.readData(coreSitePath, inputDirPath);
    }
}

The Hadoop and Nutch jar files that provide the classes imported above are required to compile the Java class.


 Generate the runnable jar.


2) Running the jar from the BigInsights cluster

Copy the jar to the cluster, then run the command below.


hadoop jar /opt/Webcrawler/NutchConverter.jar $BIGINSIGHTS_HOME/hadoop-conf/core-site.xml /user/biadmin/IBMCrawlTest1

General form: hadoop jar <NutchConverterJar> <passTheCoreSiteXML> <OutputFolderUsedInWebCrawlerApplication>
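The class in step 1 prints each record to the console, while the requirement is a CSV file. As a minimal sketch of that last step, the helper below turns the fields of one record into a CSV line, quoting any field that contains a comma, a double quote, or a line break. CrawlCsv, escape and toRow are hypothetical names used only for illustration. They are not part of the WebCrawler application or of Nutch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of BigInsights or Nutch.
final class CrawlCsv {

    private CrawlCsv() {
    }

    // Quote a field when it contains a comma, a double quote, or a line break,
    // and double any embedded quotes (RFC 4180 style).
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join the fields of one record into a single CSV line.
    static String toRow(String... fields) {
        List<String> cells = new ArrayList<>();
        for (String field : fields) {
            cells.add(escape(field));
        }
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        // Header line followed by one sample record
        System.out.println(toRow("url", "contentType", "body"));
        System.out.println(toRow("http://example.com/a,b", "text/html",
                "<p>He said \"hi\"</p>"));
    }
}
```

Inside the while loop of readData, a row could then be built with something like CrawlCsv.toRow(key.toString(), content.getContentType(), new String(content.getContent(), "UTF-8")), assuming the pages are UTF-8 text, and written to a file instead of the console.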


How to run the BigInsights WebCrawler Application as an Oozie Job in BigInsights 3.2

This post continues my previous one. Here we discuss how to port the BigInsights WebCrawler application to an Oozie job.

1) Set up the WebCrawler application in Eclipse as described in my previous blog

2) Modify the NutchApp.java file and generate the jar file.

a) Open the NutchApp.java in /webcrawlerapp/src/main/java/com/ibm/biginsights/apps/nutch/NutchApp.java


Comment out the line RunJar.main (runJarArgs); and add the lines below in its place, so that the job file is unpacked before RunJar.main runs

             //RunJar.main (runJarArgs);
              RunJar.unJar(new File ("./nutch-1.4.job.new"), new File("/tmp/unpack"));

             RunJar.main (runJarArgs);

Import the FileUtils class

               import org.apache.commons.io.FileUtils;

In the finally block, add

              FileUtils.deleteDirectory(new File("/tmp/unpack"));

 b) Export the project as a jar to webcrawlerapp\BIApp\workflow\lib\NutchCrawler.jar
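The change in step 2 follows a common pattern: unpack into a scratch directory, run the job, and remove the directory in a finally block so that a failed run does not leave /tmp/unpack behind for the next one. The sketch below shows the same pattern using only JDK classes in place of RunJar.unJar and FileUtils.deleteDirectory, so it can run on its own. UnpackCleanupSketch and its method names are hypothetical, and writing a placeholder file stands in for the real unpack step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in for the NutchApp change; uses JDK calls only.
final class UnpackCleanupSketch {

    private UnpackCleanupSketch() {
    }

    // Delete a directory tree, visiting children before their parents.
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prepare the scratch directory, run the job, and always clean up.
    static void runWithScratchDir(Path scratch, Runnable job) {
        try {
            Files.createDirectories(scratch);
            // Placeholder for the unpack step (RunJar.unJar in the post)
            Files.write(scratch.resolve("unpacked.txt"),
                    "placeholder".getBytes(StandardCharsets.UTF_8));
            job.run();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Runs whether the job succeeded or threw
            deleteRecursively(scratch);
        }
    }
}
```

The finally block is what keeps repeated Oozie runs independent of each other: the scratch directory is removed even when the job throws.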


3) Creating an Oozie workflow

a) Copy the files under /webcrawlerapp/BIApp/workflow to the BigInsights cluster. Make sure the files have the proper permissions in Linux.

b) Create a job.properties file with the details below. Change the hostname and port to match your cluster.

# JobTracker and NameNode Details

#HDFS path where you need to copy workflow.xml and lib/*.jar to

#one of the values from Hadoop mapred.queue.names

## urls are delimited by \n
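The note about urls being delimited by \n is easy to get wrong when the file is edited by hand. In a Java properties file, \n inside a value is an escape sequence that is read back as a real line break. The sketch below, assuming job.properties is parsed with java.util.Properties semantics, writes a multi-URL value and reads it back. The key name urls, the example URLs and the class JobPropertiesSketch are assumptions for illustration. Only oozie.wf.application.path is a standard Oozie property, and its value here reuses the HDFS folder from step 4 with a ${nameNode} variable that is also assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; key names other than
// oozie.wf.application.path are assumptions.
final class JobPropertiesSketch {

    private JobPropertiesSketch() {
    }

    // Serialize the properties in .properties file syntax.
    static String write(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Parse .properties file syntax back into a Properties object.
    static Properties read(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Split a loaded urls value into individual URLs.
    static List<String> splitUrls(String value) {
        return Arrays.asList(value.split("\n"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("oozie.wf.application.path",
                "${nameNode}/user/biadmin/NutchCrawlOozie1");
        // Assumed key name; the line break is stored as the two characters \n
        props.setProperty("urls", "http://a.example\nhttp://b.example");

        String text = write(props);
        System.out.println(text);
        System.out.println(splitUrls(read(text).getProperty("urls")));
    }
}
```

Printing the serialized text shows the line break written as a backslash followed by n on a single line, which is the form to type when editing job.properties directly.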

4) Running the Oozie workflow - Run the commands below from a Linux terminal

# Remove the output and input folders
hadoop fs -rmr /user/biadmin/NutchCrawlOozie1
hadoop fs -rmr /user/biadmin/IBMCrawlTest1

# Move the workflow to hdfs
hadoop fs -put /opt/Webcrawler /user/biadmin/NutchCrawlOozie1

# Create output folder in HDFS
hadoop fs -mkdir /user/biadmin/IBMCrawlTest1

# Invoke your Oozie workflow from command line.

cd $BIGINSIGHTS_HOME/oozie/bin

# Change the hostname & port based on your cluster 
export OOZIE_URL=http://bivm:8280/oozie

./oozie job -run -config /opt/Webcrawler/job.properties 

5) Testing the output - Log in to the Web Console, then open the output folder as a sheet using the Basic Crawl Data reader.

In my next blog, I will cover how to set up the Web Crawler in BigInsights 4.