Tuesday, March 29, 2016

Automated Java code to read the BigInsights Web Crawler output

Several customers have asked for Java code that reads the output of the BigInsights WebCrawler application and converts it to a CSV file for further processing.

This post shows how to read the web crawler output using Java. The WebCrawler application writes its output as a Hadoop sequence file, so the code uses SequenceFile.Reader to read it.

1)  Create a Java project with the code below, then generate a runnable jar.

package com.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class NutchConverter {

    public void readData(String coreSitePath, String inputDirPath)
            throws Exception {

        Configuration conf = NutchConfiguration.create();
        // Load the cluster settings from the core-site.xml passed as the first argument.
        conf.addResource(new Path(coreSitePath));
        FileSystem fs = FileSystem.get(conf);
        // Every crawl segment keeps the fetched pages under content/part-*/data.
        FileStatus[] statuses = fs.globStatus(new Path(inputDirPath
                + "/segments/*/content/part-*/data"));

        for (FileStatus status : statuses) {
            System.out.println("Processing file: " + status.getPath());
            SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                    status.getPath(), conf);
            try {
                Text key = new Text();
                Content content = new Content();
                // Each record holds the page URL as the key and the fetched content as the value.
                while (reader.next(key, content)) {
                    System.out.println("Key :" + key);
                    System.out.println("content :" + content);
                }
            } finally {
                // Release the file handle even if reading fails.
                reader.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {

        System.out.println("Testing... .... ");
        NutchConverter sam = new NutchConverter();

        // args[0]: path to core-site.xml, args[1]: WebCrawler output folder
        String coreSitePath = args[0];
        String inputDirPath = args[1];
        sam.readData(coreSitePath, inputDirPath);
    }
}

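Compiling the class requires the Hadoop and Nutch jar files on the build path, because the code imports classes from org.apache.hadoop and org.apache.nutch.

Once the class compiles, generate the runnable jar.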


2) Running the jar from the BigInsights cluster

Copy the jar to the cluster, then run the command below:


hadoop jar /opt/Webcrawler/NutchConverter.jar $BIGINSIGHTS_HOME/hadoop-conf/core-site.xml /user/biadmin/IBMCrawlTest1

The general form of the command is:

hadoop jar <PathToRunnableJar> <PathToCoreSiteXML> <OutputFolderUsedInWebCrawlerApplication>

