Tuesday, March 29, 2016

How to run the BigInsights WebCrawler Application as an Oozie Job in BigInsights 3.2

This blog is a continuation of my previous post. Here we discuss how to port the BigInsights WebCrawler application as an Oozie job.

1) Set up the WebCrawler application in Eclipse as described in my previous blog

2) Modify the NutchApp.java file and generate the jar file.

a) Open the NutchApp.java in /webcrawlerapp/src/main/java/com/ibm/biginsights/apps/nutch/NutchApp.java


 

Just before the existing RunJar.main(runJarArgs); call, add a RunJar.unJar call so the job file is unpacked before it runs:

              RunJar.unJar(new File("./nutch-1.4.job.new"), new File("/tmp/unpack"));
              RunJar.main(runJarArgs);


Import the FileUtils class:

               import org.apache.commons.io.FileUtils;


In the finally block, add a cleanup call so the unpacked directory is removed:

              FileUtils.deleteDirectory(new File("/tmp/unpack"));

b) Generate the Java jar as webcrawlerapp\BIApp\workflow\lib\NutchCrawler.jar

3) Creating an Oozie workflow

a) Copy the files under /webcrawlerapp/BIApp/workflow to the BigInsights cluster. Ensure the files have the proper permissions on Linux.
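The workflow directory contains the workflow.xml that drives the crawl. As a rough sketch (the file shipped with the sample may differ; the action name and the argument order here are assumptions on my part), a Java-action workflow wired to the properties defined in job.properties would look roughly like this:

```xml
<!-- Illustrative sketch only; not the exact file from the sample app.
     The ${...} parameters correspond to entries in job.properties. -->
<workflow-app name="NutchCrawl" xmlns="uri:oozie:workflow:0.2">
    <start to="crawl"/>
    <action name="crawl">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.ibm.biginsights.apps.nutch.NutchApp</main-class>
            <arg>${urls}</arg>
            <arg>${outputDir}</arg>
            <arg>${depth}</arg>
            <arg>${topN}</arg>
            <arg>${filters}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Crawl failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The main class matches the NutchApp.java we modified in step 2; everything else should be checked against the workflow.xml actually bundled with the application.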

b) Create a job.properties file with the details below. Change the hostname and port based on your cluster.

# JobTracker and NameNode details
jobTracker=bivm:9001
nameNode=hdfs://bivm:9000

#HDFS path where you need to copy workflow.xml and lib/*.jar to
oozie.wf.application.path=hdfs://bivm:9000/user/biadmin/NutchCrawlOozie1/


#one of the values from Hadoop mapred.queue.names
queueName=default

## urls are delimited by \n
urls=www.ibm.com/developerworks\nhttps://play.google.com/store?hl=en\n
topN=10
outputDir=/user/biadmin/IBMCrawlTest1
depth=2
filters=+.
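The urls property packs all seed URLs into a single line with literal \n separators; at run time they end up as a seed list with one URL per line. A quick way to sanity-check the expansion from a Linux shell (seed.txt here is just a scratch file for illustration):

```shell
# Same value as the urls property; the \n sequences are literal characters here.
urls='www.ibm.com/developerworks\nhttps://play.google.com/store?hl=en\n'

# printf interprets the \n escapes, yielding one URL per line --
# the shape a Nutch seed-URL file expects.
printf "$urls" > seed.txt
cat seed.txt
```

If the two URLs do not print on separate lines, the delimiters in job.properties are likely mistyped.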

4) Running the Oozie Workflow - Run the commands below from a Linux terminal


# Remove any previous workflow and output folders
hadoop fs -rmr /user/biadmin/NutchCrawlOozie1
hadoop fs -rmr /user/biadmin/IBMCrawlTest1

# Move the workflow to hdfs
hadoop fs -put /opt/Webcrawler /user/biadmin/NutchCrawlOozie1

# Create the output folder in HDFS
hadoop fs -mkdir /user/biadmin/IBMCrawlTest1

# Invoke your Oozie workflow from command line.

cd $BIGINSIGHTS_HOME/oozie/bin

# Change the hostname & port based on your cluster 
export OOZIE_URL=http://bivm:8280/oozie

./oozie job -run -config /opt/Webcrawler/job.properties 
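When the submission succeeds, oozie job -run prints the new workflow id on a line of the form "job: <id>". The sketch below shows one way to capture that id for follow-up CLI calls; the sample id is made up for illustration, and in practice you would pipe the real command output instead:

```shell
# Hypothetical sample of the line printed by "./oozie job -run".
run_output="job: 0000000-160329123456789-oozie-biad-W"

# Strip the "job: " prefix to get the bare workflow id.
job_id=$(printf '%s\n' "$run_output" | sed -n 's/^job: *//p')
echo "$job_id"

# The id can then be passed back to the Oozie CLI, e.g. to poll the
# status (RUNNING / SUCCEEDED / KILLED) or fetch the action log:
#   ./oozie job -info "$job_id"
#   ./oozie job -log  "$job_id"
```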


5) Testing the output - Log in to the Web Console, then open the output folder in a sheet with the Basic Crawl Data reader.
 
In my next blog, I will cover how to set up the Web Crawler in BigInsights 4.
