
Wednesday, May 4, 2016

Building Apache Nutch Job & Running the WebCrawler in Hadoop

This blog covers how to build the Nutch job from the Apache Nutch source code and run it in Hadoop.


1) Set up Apache Nutch

Download the Apache Nutch source code from http://nutch.apache.org/downloads.html to the Linux machine.

[root@rvm ~]# cd /opt

[root@rvm opt]# mkdir nutch_build


[root@rvm opt]# cd nutch_build/


[root@rvm nutch_build]# wget apache.mirror.digitalpacific.com.au/nutch/1.11/apache-nutch-1.11-src.tar.gz
--2016-05-03 17:51:02--  http://apache.mirror.digitalpacific.com.au/nutch/1.11/apache-nutch-1.11-src.tar.gz
Resolving apache.mirror.digitalpacific.com.au... 101.0.120.90, 2401:fc00:0:20e::a0
Connecting to apache.mirror.digitalpacific.com.au|101.0.120.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3807144 (3.6M) [application/x-gzip]
Saving to: “apache-nutch-1.11-src.tar.gz”

100%[========================================>] 3,807,144   1.34M/s   in 2.7s

2016-05-03 17:51:07 (1.34 MB/s) - “apache-nutch-1.11-src.tar.gz”

[root@rvm nutch_build]# ls
apache-nutch-1.11-src.tar.gz



[root@rvm nutch_build]# tar -xvzf apache-nutch-1.11-src.tar.gz -C /opt/nutch_build/

[root@rvm nutch_build]# ls
apache-nutch-1.11  apache-nutch-1.11-src.tar.gz




2) Install Apache Ant

Ensure Java is installed.

[root@rvm nutch_build]# java -version
openjdk version "1.8.0_45"
OpenJDK Runtime Environment (build 1.8.0_45-b13)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
[root@rvm nutch_build]#


Download the Apache ANT Binaries. 

[root@rvm nutch_build]# pwd
/opt/nutch_build


[root@rvm nutch_build]# ls
apache-nutch-1.11  apache-nutch-1.11-src.tar.gz


[root@rvm nutch_build]# wget mirror.ventraip.net.au/apache//ant/binaries/apache-ant-1.9.7-bin.tar.gz
--2016-05-03 18:13:57--  http://mirror.ventraip.net.au/apache//ant/binaries/apache-ant-1.9.7-bin.tar.gz
Resolving mirror.ventraip.net.au... 103.252.152.2, 2400:8f80:0:11::1
Connecting to mirror.ventraip.net.au|103.252.152.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5601575 (5.3M) [application/x-gzip]
Saving to: “apache-ant-1.9.7-bin.tar.gz”

100%[=========================================>] 5,601,575   1.90M/s   in 2.8s

2016-05-03 18:14:02 (1.90 MB/s) - “apache-ant-1.9.7-bin.tar.gz”

[root@rvm nutch_build]# ls
apache-ant-1.9.7-bin.tar.gz  apache-nutch-1.11  apache-nutch-1.11-src.tar.gz
[root@rvm nutch_build]#


Move the downloaded Ant binary to the Java home directory and untar it.

[root@rvm nutch_build]# mv apache-ant-1.9.7-bin.tar.gz /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/

[root@rvm nutch_build]# cd /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/

[root@rvm java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64]# tar xzf apache-ant-1.9.7-bin.tar.gz


[root@rvm java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64]# rm -rf apache-ant-1.9.7-bin.tar.gz
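
Optionally, add Ant to your PATH so that later build steps can call ant directly instead of using the full path shown in this post. A small convenience sketch; the paths match the Java home used above:

export ANT_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/apache-ant-1.9.7
export PATH=$ANT_HOME/bin:$PATH
ant -version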



3) Building Apache Nutch

Set the JAVA_HOME and NUTCH_JAVA_HOME. 

[root@rvm apache-nutch-1.11]# pwd
/opt/nutch_build/apache-nutch-1.11


[root@rvm apache-nutch-1.11]# export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/


[root@rvm apache-nutch-1.11]# export NUTCH_JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/
[root@rvm apache-nutch-1.11]#


Add the property http.agent.name to /opt/nutch_build/apache-nutch-1.11/conf/nutch-site.xml

[root@rvm nutch_build]# pwd
/opt/nutch_build


[root@rvm nutch_build]# cat apache-nutch-1.11/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
         <configuration>

        </configuration>


[root@rvm nutch_build]# vi apache-nutch-1.11/conf/nutch-site.xml


 

[root@rvm nutch_build]# cat apache-nutch-1.11/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
      <property>
                   <name>http.agent.name</name>
                    <value>WebCrawler</value>
                    <description></description>
       </property>
</configuration>
[root@rvm nutch_build]#


Run the build using ant. The execution can take more than 45 minutes because it downloads the required jars from external repositories, so ensure that the system has internet access.

[root@rvm apache-nutch-1.11]# /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/apache-ant-1.9.7/bin/ant runtime

The Nutch job will be generated under /opt/nutch_build/apache-nutch-1.11/runtime/deploy

[root@rvm deploy]# pwd
/opt/nutch_build/apache-nutch-1.11/runtime/deploy


[root@rvm deploy]# ls
apache-nutch-1.11.job  bin
[root@rvm deploy]#



4) Modifying the Nutch code to use the class org.apache.nutch.crawl.Crawl

Older versions of Nutch shipped a class, org.apache.nutch.crawl.Crawl, that performed all the crawling operations in a single API call; it has been removed in the latest Nutch versions. If your application still uses org.apache.nutch.crawl.Crawl, you can add that class back and build the job with it.

To do that, download Crawl.java from Apache. Below, I fetch Crawl.java from the Nutch 1.7 branch.

[root@rvm crawl]# pwd
/opt/nutch_build/apache-nutch-1.11/src/java/org/apache/nutch/crawl


[root@rvm crawl]# wget http://svn.apache.org/viewvc/nutch/branches/branch-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=co -O Crawl.java

--2016-05-03 23:14:50--  http://svn.apache.org/viewvc/nutch/branches/branch-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=co
Resolving svn.apache.org... 209.188.14.144
Connecting to svn.apache.org|209.188.14.144|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: “Crawl.java”

    [ <=>                                                                                                                                                ] 5,895       --.-K/s   in 0.04s

2016-05-03 23:14:53 (137 KB/s) - “Crawl.java”


In /opt/nutch_build/apache-nutch-1.11/src/java/org/apache/nutch/crawl/Crawl.java, remove the references to Solr by deleting the lines below from the code:

a)
import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;

b)
else if ("-solr".equals(args[i])) {
        solrUrl = args[i + 1];
        i++;
      }

c)
if (solrUrl != null) {
        // index, dedup & merge
        FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));
     
        IndexingJob indexer = new IndexingJob(getConf());
        indexer.index(crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));

        SolrDeleteDuplicates dedup = new SolrDeleteDuplicates();
        dedup.setConf(getConf());
        dedup.dedup(solrUrl);
      }

d)
LOG.info("solrUrl=" + solrUrl);

e)

if (solrUrl == null) {
      LOG.warn("solrUrl is not set, indexing will be skipped...");
    }
    else {
        // for simplicity assume that SOLR is used
        // and pass its URL via conf
        getConf().set("solr.server.url", solrUrl);
    }

f)
 String solrUrl = null;

g) Modify the line
System.out.println
      ("Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]");
to
System.out.println
      ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");


Rebuild the code.

[root@rvm apache-nutch-1.11]# pwd
/opt/nutch_build/apache-nutch-1.11


[root@rvm apache-nutch-1.11]# /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/apache-ant-1.9.7/bin/ant runtime




5) Running the generated Nutch job from Hadoop

Create the output and input directory in HDFS

[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob
[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob/input
[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob/output
[root@rvm apache-nutch-1.11]#


Create a file containing a set of seed URLs and load it into HDFS.

[root@rvm apache-nutch-1.11]# vi /opt/nutch_build/urllist.txt
[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# cat /opt/nutch_build/urllist.txt
http://www.ibm.com/
http://www.ibm.com/developerworks/
 


[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# hadoop fs -put /opt/nutch_build/urllist.txt /tmp/testNutchJob/input
[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# hadoop fs -tail /tmp/testNutchJob/input/urllist.txt
http://www.ibm.com/
http://www.ibm.com/developerworks/



Run the Nutch job from Hadoop as the hdfs user.

The org.apache.nutch.crawl.Crawl class takes the arguments <urlDirContainingSeedURL> [-dir d] [-threads n] [-depth i] [-topN N]. Refer to https://wiki.apache.org/nutch/bin/nutch%20crawl


[root@rvm apache-nutch-1.11]# su hdfs

[hdfs@rvm apache-nutch-1.11]$
[hdfs@rvm apache-nutch-1.11]$ hadoop jar /opt/nutch_build/apache-nutch-1.11/runtime/deploy/apache-nutch-1.11.job org.apache.nutch.crawl.Crawl /tmp/testNutchJob/input -dir /tmp/testNutchJob/output -depth 2 -topN 10




6) Viewing the crawled data

Copy the generated output from Hadoop file system to Linux file system.

[hdfs@rvm apache-nutch-1.11]$ hadoop fs -copyToLocal /tmp/testNutchJob/output /tmp

[hdfs@rvm apache-nutch-1.11]$ cd /tmp/output/
 
[hdfs@rvm output]$ ls
crawldb  linkdb  segments
[hdfs@rvm output]$


The command below converts the crawled output, stored in sequence-file format, into an HTML dump for testing.

[hdfs@rvm bin]$ su root
Password:
[root@rvm bin]#
[root@rvm bin]# ./nutch commoncrawldump -outputDir /tmp/commoncrawlOutput -segment /tmp/output/segments
[root@rvm bin]#
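
To take a quick look at what was dumped, you can list the output directory and open a few of the dumped files. The file names depend on the crawled URLs, so the commands below are only illustrative:

ls -lR /tmp/commoncrawlOutput
find /tmp/commoncrawlOutput -type f | head -3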

If you want to change the Nutch configuration, you can manually open apache-nutch-1.11.job with 7-Zip and update the nutch-site.xml inside it.
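
The same edit can also be made from the command line, since the .job file is an ordinary zip archive. A minimal sketch, assuming nutch-site.xml sits at the root of the archive (which is where the standard Nutch build places it):

cd /opt/nutch_build/apache-nutch-1.11/runtime/deploy
unzip -o apache-nutch-1.11.job nutch-site.xml     # extract the packaged copy
vi nutch-site.xml                                 # change the properties you need
zip apache-nutch-1.11.job nutch-site.xml          # write it back into the archive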

In my next blog, I will cover how to set these properties and the URL filter dynamically.

Tuesday, March 29, 2016

Automated Java code to read the BigInsights Web Crawler output


 
I have heard this requirement from multiple customers: they were looking for Java code to read the output of the BigInsights WebCrawler application and convert it to a CSV file for further processing.

This blog shows how to read the web crawler output using Java. The WebCrawler application writes its output as a sequence file, so we use the Hadoop SequenceFile reader to read it.

1) Create a Java project with the code below, then generate a runnable jar.


package com.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class NutchConverter {

    public void readData(String coreSitePath, String inputDirPath) throws Exception {

        Configuration conf = NutchConfiguration.create();
        // Load the cluster's core-site.xml from the file path passed on the command line
        conf.addResource(new Path(coreSitePath));
        FileSystem fs = FileSystem.get(conf);

        // The crawler writes its content as sequence files under <output>/segments/*/content/part-*/data
        FileStatus[] statuses = fs.globStatus(new Path(inputDirPath
                + "/segments/*/content/part-*/data"));

        for (FileStatus status : statuses) {
            System.out.println("Processing file: " + status.getPath());
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
            Text key = new Text();           // the crawled URL
            Content content = new Content(); // the fetched page content and metadata
            while (reader.next(key, content)) {
                System.out.println("Key :" + key);
                System.out.println("content :" + content);
            }
            reader.close();
        }
        fs.close();
    }

    public static void main(String[] args) throws Exception {

        System.out.println("Testing... .... ");
        NutchConverter converter = new NutchConverter();

        String coreSitePath = args[0];
        String inputDirPath = args[1];
        converter.readData(coreSitePath, inputDirPath);
        System.out.println("\nDone");
    }

}
  




The following jar files are required to compile the Java class:

$BIGINSIGHTS_HOME/sheets/libext/nutch-1.4.jar
$BIGINSIGHTS_HOME/IHC/share/hadoop/common/hadoop-common-2.2.0.jar


 Generate the runnable jar.
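
If you prefer to build the jar from the command line instead of the Eclipse export, a rough sketch looks like this (the src/ directory layout is an assumption; adjust the paths to your project):

# compile against the two jars listed above, then package a runnable jar
mkdir -p classes
javac -cp $BIGINSIGHTS_HOME/sheets/libext/nutch-1.4.jar:$BIGINSIGHTS_HOME/IHC/share/hadoop/common/hadoop-common-2.2.0.jar -d classes src/com/test/NutchConverter.java
jar cfe NutchConverter.jar com.test.NutchConverter -C classes .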
  


  

2) Running the jar from BigInsights Cluster

Copy the jar to the cluster, then run the commands below:

export HADOOP_CLASSPATH=$BIGINSIGHTS_HOME/sheets/libext/nutch-1.4.jar:$HADOOP_CLASSPATH

hadoop jar /opt/Webcrawler/NutchConverter.jar $BIGINSIGHTS_HOME/hadoop-conf/core-site.xml /user/biadmin/IBMCrawlTest1

hadoop jar <jarPath> <coreSiteXmlPath> <outputFolderUsedInWebCrawlerApplication>
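
The class simply prints each key (the crawled URL) and its content to standard output. If you need a CSV for further processing, one hypothetical post-processing step is to capture stdout and strip the "Key :" prefix that the code prints (adjust the grep/sed to whichever fields you need):

# capture the printed key/content pairs, then keep only the URLs as a one-column CSV
hadoop jar /opt/Webcrawler/NutchConverter.jar $BIGINSIGHTS_HOME/hadoop-conf/core-site.xml /user/biadmin/IBMCrawlTest1 > crawl_dump.txt
grep '^Key :' crawl_dump.txt | sed 's/^Key ://' > crawled_urls.csv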


 





How to run the BigInsights WebCrawler Application as an Oozie Job in BigInsights 3.2

This blog is a continuation of my previous blog. Here we discuss how to port the BigInsights WebCrawler application as an Oozie job.

1) Set up the WebCrawler application in Eclipse as described in my previous blog.

2) Modify the NutchApp.java file and generate the jar file.

a) Open the NutchApp.java in /webcrawlerapp/src/main/java/com/ibm/biginsights/apps/nutch/NutchApp.java


 

Comment out the existing RunJar.main(runJarArgs); call and add the lines below in its place:

             //RunJar.main (runJarArgs);
             RunJar.unJar(new File("./nutch-1.4.job.new"), new File("/tmp/unpack"));
             RunJar.main(runJarArgs);


Import the FileUtils class:

               import org.apache.commons.io.FileUtils;


In the finally block, add:

              FileUtils.deleteDirectory(new File("/tmp/unpack"));
 



 b) Generate the java jar into webcrawlerapp\BIApp\workflow\lib\NutchCrawler.jar





 

3) Creating an Oozie workflow

a) Copy the files under /webcrawlerapp/BIApp/workflow to the BigInsights cluster. Ensure the files have the proper permissions in Linux.


b) Create a job.properties file with the details below. You need to change the hostname and port based on your cluster.

# JobTracker and NameNode details
jobTracker=bivm:9001
nameNode=hdfs://bivm:9000

#HDFS path where you need to copy workflow.xml and lib/*.jar to
oozie.wf.application.path=hdfs://bivm:9000/user/biadmin/NutchCrawlOozie1/


#one of the values from Hadoop mapred.queue.names
queueName=default

## urls are delimited by \n
urls=www.ibm.com/developerworks\nhttps://play.google.com/store?hl=en\n
topN=10
outputDir=/user/biadmin/IBMCrawlTest1
depth=2
filters=+.


4) Running the Oozie Workflow - run the commands below from a Linux terminal


# Remove the output and input folders
hadoop fs -rmr /user/biadmin/NutchCrawlOozie1
hadoop fs -rmr /user/biadmin/IBMCrawlTest1

# Move the workflow to hdfs
hadoop fs -put /opt/Webcrawler /user/biadmin/NutchCrawlOozie1

# Create output folder in HDFS
hadoop fs -mkdir /user/biadmin/IBMCrawlTest1

# Invoke your Oozie workflow from command line.

cd $BIGINSIGHTS_HOME/oozie/bin

# Change the hostname & port based on your cluster 
export OOZIE_URL=http://bivm:8280/oozie

./oozie job -run -config /opt/Webcrawler/job.properties 
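
The -run command prints an Oozie job ID. As a quick check, you can poll the workflow status with the same client (the job ID below is only a placeholder):

./oozie job -info 0000001-160101000000000-oozie-oozi-W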


5) Testing the output - Log in to the web console, then open the output folder in a sheet with the Basic Crawl Data reader.
 




I will cover how to set up the Web Crawler in BigInsights 4 in my next blog.

Tuesday, July 7, 2015

How to build a Web Crawler in IBM BigInsights to crawl https URL

If you want to crawl https URLs, you need to build your own custom crawler.
This blog shows how to build a Web Crawler application in BigInsights to crawl https URLs.

Step 1: Install Eclipse 4.2
Download Eclipse V4.2 from https://www.eclipse.org/downloads/

Step 2: Install BigInsights Plugin

a) There are two options for installing the BigInsights Eclipse tooling; we will use the first option and install it directly from the web server.



b) Launch Eclipse.

c) From Help-->Install New Software, click Add to add a repository. Then click OK to return to the Install page.



d) Select the URL that you just added, and select the IBM InfoSphere BigInsights category to install.



e) Restart Eclipse after the installation completes successfully.

Step 3: Create a BigInsights Server Connection

a) Go to the BigInsights perspective by clicking Window --> Open Perspective and selecting BigInsights.



b) In the BigInsights Servers view, right-click on BigInsights Servers, and click New.



c) Fill in the server information and click Finish to create a connection.



d) The newly created connection displays under the BigInsights Servers folder.




Step 4: Set up the Web Crawler Project in Eclipse

a) Download the applications (Eclipse projects) from the "Download client library and development software" page.



b) Download WebCrawlerProject_eclipse.zip and import it into the Eclipse workspace.





Step 5: Set the Property plugin.includes to protocol-httpclient for getting the HTTPS URLs

a) Open the NutchApp.java in /webcrawlerapp/src/main/java/com/ibm/biginsights/apps/nutch/NutchApp.java

b) Add the code below to the function updateNutchConf(String confName) and save the file.

// Set the Property plugin.includes to protocol-httpclient for getting the HTTPS URLs
    conf.set("plugin.includes",
            "protocol-httpclient|urlfilter-(regex|suffix)|parse-(text|html|js)|index-(basic|anchor)|" +
            "query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");



Step 6: Publishing the Application to BigInsights Server

a) Right-click on the project and select BigInsights Application Publish



b) Select the server details and click Next



c) Provide an application name, click Next, and go with the default options as shown below.









d) On the Zip and Publish Application page, click Create Jar.



e) Select the files, set the export destination path to webcrawlerapp\BIApp\workflow\lib\NutchCrawler.jar, and click Finish.



f) Click Finish to publish the application to the server.



Step 7: Deploy the Application in BigInsights Server

a) Log in to the web console Applications tab, open the published app, and click Deploy.





Step 8: Testing the Application

a) Open the Run tab, click the deployed application, pass the parameters below, and click the Run button.




b) After the job completes, open the output directory /user/Biadmin/TestApp, view it in sheets, and change the collection reader to Basic Crawl Data.



If you want to customise the web crawling based on your business use case, you need to build a custom Nutch application and integrate it with BigInsights.

There are three options for integrating your Nutch application with BigInsights.

    1) You can link the Nutch application through an Oozie workflow and run it from BigInsights.

    2) You can create a BigInsights application, integrate the Nutch application into it, and then deploy it.

    3) You can download WebCrawlerProject_eclipse.zip from the console, modify it based on your requirements, and re-deploy it in BigInsights.

I have already covered option 3; I will cover the other options in my next blog.