Tuesday, July 7, 2015

How to build a Web Crawler in IBM BigInsights to crawl HTTPS URLs

If you want to crawl HTTPS URLs, you need to build your own custom crawler.
This post explains how to build a Web Crawler application that crawls HTTPS URLs from BigInsights.

Step 1: Install Eclipse 4.2
Download Eclipse V4.2 from https://www.eclipse.org/downloads/

Step 2: Install BigInsights Plugin

a) There are two options for installing the BigInsights Eclipse tooling. We will use the first option and install it directly from the web server.



b) Launch Eclipse.

c) From Help-->Install New Software, click Add to add a repository. Then click OK to return to the Install page.



d) Select the URL that you just added, and select the IBM InfoSphere BigInsights category to install.



e) Restart Eclipse after the installation completes successfully.

Step 3: Create a BigInsights Server Connection

a) Go to the BigInsights perspective by clicking Window-->Open Perspective and selecting BigInsights.



b) In the BigInsights Servers view, right-click on BigInsights Servers, and click New.



c) Fill in the server information and click Finish to create a connection.



d) The newly created connection displays under the BigInsights Servers folder.




Step 4: Set up the Web Crawler Project in Eclipse

a) Download the Applications (Eclipse projects) from "Download client library and development software".



b) Download WebCrawlerProject_eclipse.zip and import it into your Eclipse workspace.





Step 5: Set the plugin.includes property to use protocol-httpclient so that HTTPS URLs can be fetched

a) Open NutchApp.java, located at /webcrawlerapp/src/main/java/com/ibm/biginsights/apps/nutch/NutchApp.java

b) Add the code below to the function updateNutchConf(String confName) and save the file.

// Set the Property plugin.includes to protocol-httpclient for getting the HTTPS URLs
    conf.set("plugin.includes",
            "protocol-httpclient|urlfilter-(regex|suffix)|parse-(text|html|js)|index-(basic|anchor)|" +
            "query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");

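As a quick sanity check before publishing, the value can be tested outside BigInsights. The sketch below assumes that Nutch treats plugin.includes as a regular expression that must match a plugin id in full; the class name PluginIncludesCheck and the method isIncluded are hypothetical helpers for illustration and are not part of the WebCrawler project. It uses only the standard Java regex classes, so it runs without Nutch or Hadoop on the classpath.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of NutchApp.java.
final class PluginIncludesCheck {

    // Same value as the one passed to conf.set("plugin.includes", ...) in Step 5.
    static final String PLUGIN_INCLUDES =
            "protocol-httpclient|urlfilter-(regex|suffix)|parse-(text|html|js)|index-(basic|anchor)|"
            + "query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is loaded only when its whole id matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(PLUGIN_INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-httpclient", "protocol-http", "parse-html", "urlfilter-regex"};
        for (String id : ids) {
            // Print each plugin id with whether the expression selects it.
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, protocol-httpclient is selected while the plain protocol-http plugin is not, which is the change this step is meant to make.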


Step 6: Publishing the Application to BigInsights Server

a) Right-click on the project and select BigInsights Application Publish



b) Select the server details and click Next



c) Provide an application name and click Next, then keep the default option.









d) On the Zip and Publish Application page, click Create Jar



e) Select the files, set the export destination path to webcrawlerapp\BIApp\workflow\lib\NutchCrawler.jar, and click Finish



f) Click Finish to publish the application to the server



Step 7: Deploy the Application in BigInsights Server

a) Log in to the Web Console, go to the Application tab, open the published app, and click Deploy





Step 8: Testing the Application

a) Open the Run tab, click the deployed application, supply the input parameters, and click the Run button




b) After the job completes, open the output directory /user/Biadmin/TestApp, view it in Sheets, and change the collection reader to Basic Crawl Data



If you want to customise the web crawling for your business use case, you need to build a custom Nutch application and integrate it with BigInsights.

There are three options for integrating your Nutch application with BigInsights.

    1) You can link the Nutch application through an Oozie workflow and run it from BigInsights.

    2) You can create a BigInsights application, integrate the Nutch application into it, and then deploy the BigInsights application.

    3) You can download WebCrawlerProject_eclipse.zip from the Console, modify it to suit your requirements, and redeploy it in BigInsights.

This post covered option 3. I will cover the other options in my next blog post.
