==Web Crawler==

===Concept===
The web crawler is the second module of the internet search application defined in the Objectives. The purpose of the web crawler module is to autonomously browse websites as explained in the [[#Background|Background Theory]] section. The web crawler module must interface with the pattern matcher module, passing web page content and accepting results to deliver to the centralised user interface. The requirements of the web crawler included:
# Java-coded to maintain compatibility with other modules
# Accept a seed URL from which to explore
# Download pages linked in the page being explored
# Not revisit pages already visited
# Abide by ethical web crawling practices.
Ethical web crawling practices refer to not overloading servers with repeated requests and not exploring sections of internet pages that the domain determines should be off-limits to web crawlers<ref name=Rodham>Rodham, Ken, Web Crawler, Brigham Young University Computer Science, http://faculty.cs.byu.edu/~rodham/cs240/crawler/index.html</ref>.

===Previous Work===
In 2010, the project group aimed to create a product similar to what this project set out to design: an application that could download internet pages and pattern-analyse them. The 2010 solution was a two-stage process: a web archive tool that mirrored copies of web pages in local directories, and a pattern matching module that analysed the stored pages at a later time. This was done using the open source software program [http://en.wikipedia.org/wiki/Httrack HTTrack]. In 2011, a slightly different approach was taken, aiming to create a real-time integrated solution that analysed web content as it browsed, with a more user-friendly layout and function.

===Technical Challenges===
Technical challenges encountered in the Web Crawler module included the following:
# Robots Exclusion Protocol
# Webpage content extraction
# Multithreading
# Crawling https (secure) sites
# Accessing the Internet through secured proxies that require a username and password for use.
The Robots Exclusion Protocol refers to the policy governing the ethical behaviour of web crawlers. The policy for each site is stored in a text file named robots.txt (available at <domain>/robots.txt). The challenge this presents to web crawlers is reading the robots.txt file and obeying any instructions limiting crawling across that domain; a minimal sketch of this check is given at the end of this section.

In interfacing to the pattern matcher module, the web crawler is required to supply the contents of each webpage it crawls. In addition, any HTML code should be stripped out and not passed on. The complex nature of web crawlers meant parallel processing was required through multithreading, which introduced the technical challenge of controlling threads externally. Providing the ability to access encrypted sessions, that is https (secure) sites, is a challenge that is yet to be overcome. Finally, the web crawler is currently unable to function in an operating environment where access to the internet is secured by a proxy that requires a username and password; an example of this type of domain is the University of Adelaide student computer accounts. An update to the web crawler has been designed to remedy this and will be implemented prior to project closeout.
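To make the Robots Exclusion Protocol challenge concrete, the sketch below fetches a site's robots.txt file and collects the Disallow rules that apply to all crawlers. It is illustrative only: the class and method names are hypothetical, the parsing is deliberately simplified, and the off-the-shelf crawler adopted in the Implementation section performs this check internally rather than through code like this.

<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal robots.txt check (illustrative only). Fetches <domain>/robots.txt
 * and collects the Disallow rules listed under "User-agent: *".
 */
public class RobotsCheck {

    /** Returns the Disallow path prefixes that apply to all crawlers. */
    public static List<String> disallowedPaths(String domain) {
        List<String> disallowed = new ArrayList<>();
        boolean appliesToAllAgents = false;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(domain + "/robots.txt").openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToAllAgents = line.substring(11).trim().equals("*");
                } else if (appliesToAllAgents && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        } catch (Exception e) {
            // No robots.txt (or it is unreachable): treat the site as unrestricted.
        }
        return disallowed;
    }

    /** A URL may be crawled if its path does not start with a disallowed prefix. */
    public static boolean mayCrawl(String url, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (URI.create(url).getPath().startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
</syntaxhighlight>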
===Design===
The design of the web crawler process is explained by the diagram below.

<center>[[Image:Web crawler process.png| Web Crawler process.]]</center>
<center>'''Figure 12 - Web crawler software operational process'''</center>

The design starts with a seed URL from which the crawler attempts to load the protocol definition file robots.txt. If robots.txt does not exist, or exists but does not limit crawling, the web page content is loaded and the embedded links are extracted from it. These links are added to the link queue via a sub-process while the thread returns to loading the next web page in the queue.

===Implementation===
An early realisation by project members Steven and Patrick was that there was not enough time to implement a complete web crawler given the wide scope of the entire project. Thus an off-the-shelf (OTS) solution that met the requirements was sought. After extensive research, primarily by Steven, and investigative testing by Patrick, the Java-based crawler [http://code.google.com/p/crawler4j/ crawler4j] was chosen. This choice was made because crawler4j met the specified requirements and was free, open-source software with no legal implications.

===Testing===
Given that the OTS crawler4j provides fully functioning support, the main testing required was ensuring it met the requirements listed above. Due to the OTS nature of the crawler, a black-box testing approach was used with an exploratory slant. The requirement of a Java-coded crawler was trivially satisfied, and the ability to accept seed websites was also confirmed. Investigation revealed that crawler4j supported polite web crawling, obeying the robots.txt file and providing parameters to limit request rates. An in-built function in the OTS crawler that returns webpage content with all HTML code removed was also validated. crawler4j functioned as research indicated it should, meeting all requirements. A screenshot of the command-line web crawler running over some linked pages during the testing phase is shown in the figure below.

<center>[[File:Web crawler working.png|Web crawler working]]</center>
<center>'''Figure 13 - Web crawler software deployment test'''</center>

Testing also revealed some additional useful functionality provided by the crawler4j interface, including parameters to limit the link depth explored, the number of pages searched and the domain crawled. These features proved especially useful in the testing phase of the following [[Final Report 2011#System Integration|System Integration]] section and have also been incorporated into the product for use by the customer. A sketch of how these limits can be configured is given below.
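For illustration, the sketch below shows how a crawler of this kind can be configured and run with crawler4j: a seed URL, a politeness delay, and limits on link depth, page count and domain, with HTML-stripped text extracted in the <code>visit</code> callback. It follows the configuration-object API of later crawler4j releases, so method names and signatures (for example <code>setMaxDepthOfCrawling</code> and <code>shouldVisit(Page, WebURL)</code>) may differ from the 2011 release used in the project; the seed URL, storage folder, limits and class names are placeholder values, not the project's actual settings.

<syntaxhighlight lang="java">
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class CrawlerDemo {

    /** Crawler that restricts the crawl to one domain and extracts plain text. */
    public static class TextCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Limit the crawl to a single (placeholder) domain, as in the deployment test.
            return url.getURL().toLowerCase().startsWith("http://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                // getText() returns the page content with HTML markup removed,
                // which is the form handed on to the pattern matcher module.
                String text = html.getText();
                System.out.println(page.getWebURL().getURL() + " : " + text.length() + " characters");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // folder for intermediate crawl data
        config.setPolitenessDelay(1000);              // request-rate limit (ms between requests)
        config.setMaxDepthOfCrawling(3);              // limit link depth explored
        config.setMaxPagesToFetch(500);               // limit number of pages searched

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");   // seed URL to explore from
        controller.start(TextCrawler.class, 4);          // 4 crawler threads
    }
}
</syntaxhighlight>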