org.matalon.pagerankhits.crawler
Class Crawler

java.lang.Object
  extended by org.matalon.pagerankhits.crawler.Crawler
All Implemented Interfaces:
java.lang.Runnable

public class Crawler
extends java.lang.Object
implements java.lang.Runnable

This class is responsible for crawling the web according to the given parameters.

Author:
Yonatan Matalon

Field Summary
private  int maxRetrievedUrlsNo
           
private  java.lang.String startingAddress
           
private  boolean stopCrawling
           
static Crawler theOnlyOne
           
private static Graph webGraph
           
 
Constructor Summary
private Crawler()
          Private constructor.
 
Method Summary
 void crawl(java.lang.String startingAddress, int maxRetrievedUrlsNo)
          This method crawls over web pages on the Internet and extracts all their links.
static Crawler getInstance()
           
 Graph getWebGraph()
           
 boolean isStopCrawling()
           
static java.lang.String makeShorter(java.lang.String string, int maxLength)
          Shortens the given string to a maximum of maxLength characters.
private  void prepareSummary(UrlsQueue visitedUrlsQueue)
          Prepares a summary of the crawler's progress; the summary is written to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE file.
private static void printUrl(java.io.PrintWriter printWriter, java.lang.String url, int urlNo, int maxUrlsNo)
          Prints the given URL to the given PrintWriter.
private static void printUrlDetails(java.io.PrintWriter printWriter, int urlNo, WebPageProperties webPageProperties)
          Prints the given URL details to the given PrintWriter.
 void run()
           
 void stopCrawling()
          Stops the Crawler.
private static void updateRealTimeProgress(UrlsQueue urls2Print, int currCrawledUrlNo, int maxUrlsNo, boolean lastUrl, boolean stopCrawling)
          Writes the crawler's progress in real time to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE_TAIL file.
private static boolean wasUrlVisited(java.lang.String url, java.util.Map visitedUrlsMap)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

theOnlyOne

public static Crawler theOnlyOne

webGraph

private static Graph webGraph

stopCrawling

private boolean stopCrawling

startingAddress

private java.lang.String startingAddress

maxRetrievedUrlsNo

private int maxRetrievedUrlsNo

Constructor Detail

Crawler

private Crawler()
Private constructor.

Method Detail

getInstance

public static Crawler getInstance()
Returns:
Returns the only instance of Crawler.
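
Given the private constructor and the static theOnlyOne field, getInstance presumably follows the classic singleton pattern. A minimal sketch, assuming lazy initialization (the actual body is not shown in this documentation):

    public static Crawler getInstance() {
        // Create the single shared instance lazily on first request.
        if (theOnlyOne == null) {
            theOnlyOne = new Crawler();
        }
        return theOnlyOne;
    }

Note that this lazy form is not thread-safe by itself; eager initialization or synchronization would be needed if getInstance can be called from several threads at once.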

crawl

public void crawl(java.lang.String startingAddress,
                  int maxRetrievedUrlsNo)
This method crawls over web pages on the Internet and extracts all their links. The crawling process starts from the web page at the given starting address and ends after retrieving the maximum number of URLs (also given as a parameter) or upon reaching a dead end (all queued pages have no inner/outer links, or all their inner/outer links have already been visited). The links are visited in a manner similar to the Breadth-First Search (BFS) algorithm: the crawler crawls over all the pages in level i and extracts all their links (in the order they appear in the page source); these links are considered links to pages in level i+1. Only after crawling all of the pages in level i does the crawler start to crawl over the pages in level i+1, one after the other, and so on. To achieve this, the links inside each page are queued for later use (rather than crawled over as they are found); thus, the links in level i are visited only after ALL the links in level i-1 have been crawled over. All retrieved data is saved to a graph whose vertices represent the visited web pages and whose edges represent the links between them.

The algorithm is therefore as follows:

1. Validate the given starting address and maximum number of retrieved URLs.
2. Enqueue the starting address in the URLs Queue.
3. While the URLs Queue is not empty and the number of retrieved URLs is smaller than the maximum number of retrieved URLs:
   3.1. Dequeue an address from the queue.
   3.2. If this URL was not already visited (check it against the Visited URLs map):
      3.2.1. Add this URL to the Visited URLs hash map.
      3.2.2. Save the Crawler's progress to a file.
      3.2.3. Read the web page source and extract all its inner/outer links.
      3.2.4. Enqueue these links in the URLs Queue.
   3.3. Add/update this URL in the Web Graph.
4. Prepare a summary file.
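
For illustration, steps 2-4 can be sketched with java.util collections standing in for the project's own UrlsQueue and visited-URLs map; extractLinks is a hypothetical helper for step 3.2.3, and the graph update and progress reporting are elided:

    java.util.Queue<String> urlsQueue = new java.util.LinkedList<String>();
    java.util.Map<String, Boolean> visitedUrlsMap = new java.util.HashMap<String, Boolean>();
    int retrievedUrlsNo = 0;

    urlsQueue.add(startingAddress);                      // step 2
    while (!urlsQueue.isEmpty()
           && retrievedUrlsNo < maxRetrievedUrlsNo) {    // step 3
        String url = urlsQueue.remove();                 // step 3.1
        if (!visitedUrlsMap.containsKey(url)) {          // step 3.2
            visitedUrlsMap.put(url, Boolean.TRUE);       // step 3.2.1
            retrievedUrlsNo++;
            // steps 3.2.3-3.2.4: read the page source, queue its links
            for (String link : extractLinks(url)) {      // hypothetical helper
                urlsQueue.add(link);
            }
        }
        // step 3.3: add/update this URL in the Web Graph (elided)
    }
    // step 4: prepare the summary file (see prepareSummary)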

Parameters:
startingAddress -
maxRetrievedUrlsNo -

run

public void run()
Specified by:
run in interface java.lang.Runnable
See Also:
Runnable.run()
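
Since Crawler implements Runnable, the crawl can run on its own thread while another thread stays free to request a stop. A plausible usage, assuming run() delegates to crawl using the stored startingAddress and maxRetrievedUrlsNo fields:

    Crawler crawler = Crawler.getInstance();
    Thread crawlerThread = new Thread(crawler);
    crawlerThread.start();      // executes run() on the new thread

    // ... later, e.g. from a UI handler:
    crawler.stopCrawling();     // request a cooperative stop
    crawlerThread.join();       // wait for the crawl to wind down
                                // (join throws InterruptedException)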

wasUrlVisited

private static final boolean wasUrlVisited(java.lang.String url,
                                           java.util.Map visitedUrlsMap)
Parameters:
url -
visitedUrlsMap -
Returns:
Returns true if the given URL was visited, false otherwise.
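
Given the signature, this is presumably no more than a constant-time lookup against the map of URLs crawled so far, roughly:

    private static final boolean wasUrlVisited(java.lang.String url,
                                               java.util.Map visitedUrlsMap) {
        // The map's keys are the URLs visited so far.
        return visitedUrlsMap.containsKey(url);
    }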

updateRealTimeProgress

private static final void updateRealTimeProgress(UrlsQueue urls2Print,
                                                 int currCrawledUrlNo,
                                                 int maxUrlsNo,
                                                 boolean lastUrl,
                                                 boolean stopCrawling)
Writes the crawler's progress in real time to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE_TAIL file.

Parameters:
urls2Print -
currCrawledUrlNo -
maxUrlsNo -
lastUrl -
stopCrawling -

printUrl

private static final void printUrl(java.io.PrintWriter printWriter,
                                   java.lang.String url,
                                   int urlNo,
                                   int maxUrlsNo)
Prints the given URL to the given PrintWriter.

Parameters:
printWriter -
url -
urlNo -
maxUrlsNo -
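
The exact output format is not documented; an illustrative sketch that prints the URL together with its position out of the maximum might look like:

    private static final void printUrl(java.io.PrintWriter printWriter,
                                       java.lang.String url,
                                       int urlNo,
                                       int maxUrlsNo) {
        // Hypothetical format, e.g. "URL 3/100: http://example.com"
        printWriter.println("URL " + urlNo + "/" + maxUrlsNo + ": " + url);
    }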

makeShorter

public static final java.lang.String makeShorter(java.lang.String string,
                                                 int maxLength)
Shortens the given string to a maximum of maxLength characters.

Parameters:
string -
maxLength -
Returns:
Returns the shortened string.
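
A straightforward implementation consistent with this description, assuming a plain cut at maxLength (whether the real method appends an ellipsis or marks the cut differently is not specified):

    public static final java.lang.String makeShorter(java.lang.String string,
                                                     int maxLength) {
        // Strings already within the limit are returned unchanged.
        if (string.length() <= maxLength) {
            return string;
        }
        return string.substring(0, maxLength);
    }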

prepareSummary

private final void prepareSummary(UrlsQueue visitedUrlsQueue)
Prepares a summary of the crawler's progress; the summary is written to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE file.

Parameters:
visitedUrlsQueue -

printUrlDetails

private static final void printUrlDetails(java.io.PrintWriter printWriter,
                                          int urlNo,
                                          WebPageProperties webPageProperties)
Prints the given URL details to the given PrintWriter.

Parameters:
printWriter -
urlNo -
webPageProperties -

getWebGraph

public Graph getWebGraph()
Returns:
Returns the webGraph.

isStopCrawling

public boolean isStopCrawling()
Returns:
Returns the stopCrawling flag.

stopCrawling

public void stopCrawling()
Stops the Crawler.
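
Together with isStopCrawling(), this suggests a cooperative shutdown: the method raises the stopCrawling flag, and the crawl loop polls the flag between URLs and winds down cleanly instead of being killed mid-request. A minimal sketch:

    public void stopCrawling() {
        // The crawl loop checks isStopCrawling() between URLs and
        // exits cleanly once the flag is raised.
        stopCrawling = true;
    }

In real code the flag would typically be declared volatile (or accessed under synchronization) so the crawling thread is guaranteed to see the update promptly.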