org.matalon.pagerankhits.crawler
Class Crawler

java.lang.Object
  extended by org.matalon.pagerankhits.crawler.Crawler
All Implemented Interfaces:
java.lang.Runnable

public class Crawler
extends java.lang.Object
implements java.lang.Runnable

This class is responsible for crawling the web according to the given parameters.

Author:
Yonatan Matalon

Field Summary
private  int maxRetrievedUrlsNo
           
private  java.lang.String startingAddress
           
private  boolean stopCrawling
           
static Crawler theOnlyOne
           
private static Graph webGraph
           
 
Constructor Summary
private Crawler()
          Private constructor.
 
Method Summary
 void crawl(java.lang.String startingAddress, int maxRetrievedUrlsNo)
          This method crawls over web pages on the Internet and extracts all their links.
static Crawler getInstance()
           
 Graph getWebGraph()
           
 boolean isStopCrawling()
           
static java.lang.String makeShorter(java.lang.String string, int maxLength)
          Shortens the given string to a maximum of maxLength characters.
private  void prepareSummary(UrlsQueue visitedUrlsQueue)
          Prepares a summary of the crawler's progress; the summary is written to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE file.
private static void printUrl(java.io.PrintWriter printWriter, java.lang.String url, int urlNo, int maxUrlsNo)
          Prints the given URL to the given PrintWriter.
private static void printUrlDetails(java.io.PrintWriter printWriter, int urlNo, WebPageProperties webPageProperties)
          Prints the given URL details to the given PrintWriter.
 void run()
           
 void stopCrawling()
          Stops the Crawler.
private static void updateRealTimeProgress(UrlsQueue urls2Print, int currCrawledUrlNo, int maxUrlsNo, boolean lastUrl, boolean stopCrawling)
          Writes the crawler's progress in real time to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE_TAIL file.
private static boolean wasUrlVisited(java.lang.String url, java.util.Map visitedUrlsMap)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

theOnlyOne

public static Crawler theOnlyOne

webGraph

private static Graph webGraph

stopCrawling

private boolean stopCrawling

startingAddress

private java.lang.String startingAddress

maxRetrievedUrlsNo

private int maxRetrievedUrlsNo

Constructor Detail

Crawler

private Crawler()
Private constructor.

Method Detail

getInstance

public static Crawler getInstance()
Returns:
Returns the only instance of Crawler.
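
Given the private constructor and the static theOnlyOne field, getInstance presumably follows the classic singleton pattern. A minimal sketch, assuming lazy initialization (the actual body is not shown in this documentation):

    public static Crawler getInstance() {
        // Create the single shared instance lazily on first request.
        if (theOnlyOne == null) {
            theOnlyOne = new Crawler();
        }
        return theOnlyOne;
    }

Note that this lazy form is not thread-safe by itself; eager initialization or synchronization would be needed if getInstance can be called from several threads at once.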

crawl

public void crawl(java.lang.String startingAddress,
                  int maxRetrievedUrlsNo)
This method crawls over web pages on the Internet and extracts all their links. The crawling process starts from the web page at the given starting address and ends after retrieving the maximum number of URLs (also given as a parameter) or upon reaching a dead end (all queued pages have no inner/outer links, or all their inner/outer links have already been visited). The links are visited in a manner similar to the Breadth-First Search (BFS) algorithm: the crawler crawls over all the pages in level i and extracts all their links (in the order they appear in the page source); these links are considered links to pages in level i+1. Only after crawling all of the pages in level i does the crawler start to crawl over the pages in level i+1, one after the other, and so on. To achieve this, the links inside each page are queued for later use (rather than crawled over as they are found); thus, the links in level i are visited only after ALL the links in level i-1 have been crawled over. All retrieved data is saved to a graph whose vertices represent the visited web pages and whose edges represent the links between them.

The algorithm is therefore as follows:

1. Validate the given starting address and maximum number of retrieved URLs.
2. Enqueue the starting address in the URLs Queue.
3. While the URLs Queue is not empty and the number of retrieved URLs is smaller than the maximum number of retrieved URLs:
   3.1. Dequeue an address from the queue.
   3.2. If this URL was not already visited (check it against the Visited URLs map):
      3.2.1. Add this URL to the Visited URLs hash map.
      3.2.2. Save the Crawler's progress to a file.
      3.2.3. Read the web page source and extract all its inner/outer links.
      3.2.4. Enqueue these links in the URLs Queue.
   3.3. Add/update this URL in the Web Graph.
4. Prepare a summary file.
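
For illustration, steps 2-4 can be sketched with java.util collections standing in for the project's own UrlsQueue and visited-URLs map; extractLinks is a hypothetical helper for step 3.2.3, and the graph update and progress reporting are elided:

    java.util.Queue<String> urlsQueue = new java.util.LinkedList<String>();
    java.util.Map<String, Boolean> visitedUrlsMap = new java.util.HashMap<String, Boolean>();
    int retrievedUrlsNo = 0;

    urlsQueue.add(startingAddress);                      // step 2
    while (!urlsQueue.isEmpty()
           && retrievedUrlsNo < maxRetrievedUrlsNo) {    // step 3
        String url = urlsQueue.remove();                 // step 3.1
        if (!visitedUrlsMap.containsKey(url)) {          // step 3.2
            visitedUrlsMap.put(url, Boolean.TRUE);       // step 3.2.1
            retrievedUrlsNo++;
            // steps 3.2.3-3.2.4: read the page source, queue its links
            for (String link : extractLinks(url)) {      // hypothetical helper
                urlsQueue.add(link);
            }
        }
        // step 3.3: add/update this URL in the Web Graph (elided)
    }
    // step 4: prepare the summary file (see prepareSummary)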

Parameters:
startingAddress -
maxRetrievedUrlsNo -

run

public void run()
Specified by:
run in interface java.lang.Runnable
See Also:
Runnable.run()
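
Since Crawler implements Runnable, the crawl can run on its own thread while another thread stays free to request a stop. A plausible usage, assuming run() delegates to crawl using the stored startingAddress and maxRetrievedUrlsNo fields:

    Crawler crawler = Crawler.getInstance();
    Thread crawlerThread = new Thread(crawler);
    crawlerThread.start();      // executes run() on the new thread

    // ... later, e.g. from a UI handler:
    crawler.stopCrawling();     // request a cooperative stop
    crawlerThread.join();       // wait for the crawl to wind down
                                // (join throws InterruptedException)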

wasUrlVisited

private static final boolean wasUrlVisited(java.lang.String url,
                                           java.util.Map visitedUrlsMap)
Parameters:
url -
visitedUrlsMap -
Returns:
Returns true if the given URL was visited, false otherwise.
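
Given the signature, this is presumably no more than a constant-time lookup against the map of URLs crawled so far, roughly:

    private static final boolean wasUrlVisited(java.lang.String url,
                                               java.util.Map visitedUrlsMap) {
        // The map's keys are the URLs visited so far.
        return visitedUrlsMap.containsKey(url);
    }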

updateRealTimeProgress

private static final void updateRealTimeProgress(UrlsQueue urls2Print,
                                                 int currCrawledUrlNo,
                                                 int maxUrlsNo,
                                                 boolean lastUrl,
                                                 boolean stopCrawling)
Writes the crawler's progress in real time to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE_TAIL file.

Parameters:
urls2Print -
currCrawledUrlNo -
maxUrlsNo -
lastUrl -
stopCrawling -

printUrl

private static final void printUrl(java.io.PrintWriter printWriter,
                                   java.lang.String url,
                                   int urlNo,
                                   int maxUrlsNo)
Prints the given URL to the given PrintWriter.

Parameters:
printWriter -
url -
urlNo -
maxUrlsNo -
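
The exact output format is not documented; an illustrative sketch that prints the URL together with its position out of the maximum might look like:

    private static final void printUrl(java.io.PrintWriter printWriter,
                                       java.lang.String url,
                                       int urlNo,
                                       int maxUrlsNo) {
        // Hypothetical format, e.g. "URL 3/100: http://example.com"
        printWriter.println("URL " + urlNo + "/" + maxUrlsNo + ": " + url);
    }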

makeShorter

public static final java.lang.String makeShorter(java.lang.String string,
                                                 int maxLength)
Shortens the given string to a maximum of maxLength characters.

Parameters:
string -
maxLength -
Returns:
Returns the shortened string.
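
A straightforward implementation consistent with this description, assuming a plain cut at maxLength (whether the real method appends an ellipsis or marks the cut differently is not specified):

    public static final java.lang.String makeShorter(java.lang.String string,
                                                     int maxLength) {
        // Strings already within the limit are returned unchanged.
        if (string.length() <= maxLength) {
            return string;
        }
        return string.substring(0, maxLength);
    }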

prepareSummary

private final void prepareSummary(UrlsQueue visitedUrlsQueue)
Prepares a summary of the crawler's progress; the summary is written to the GeneralSettings.CRAWLING_PROGRESS_REPORT_FILE file.

Parameters:
visitedUrlsQueue -

printUrlDetails

private static final void printUrlDetails(java.io.PrintWriter printWriter,
                                          int urlNo,
                                          WebPageProperties webPageProperties)
Prints the given URL details to the given PrintWriter.

Parameters:
printWriter -
urlNo -
webPageProperties -

getWebGraph

public Graph getWebGraph()
Returns:
Returns the webGraph.

isStopCrawling

public boolean isStopCrawling()
Returns:
Returns the stopCrawling flag.

stopCrawling

public void stopCrawling()
Stops the Crawler.
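
Together with isStopCrawling(), this suggests a cooperative shutdown: the method raises the stopCrawling flag, and the crawl loop polls the flag between URLs and winds down cleanly instead of being killed mid-request. A minimal sketch:

    public void stopCrawling() {
        // The crawl loop checks isStopCrawling() between URLs and
        // exits cleanly once the flag is raised.
        stopCrawling = true;
    }

In real code the flag would typically be declared volatile (or accessed under synchronization) so the crawling thread is guaranteed to see the update promptly.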