Progress Report 2010
===Web Crawling===
An open-source, Java-based web crawling framework called Arachnid<ref>http://arachnid.sourceforge.net/</ref> has so far been the main focus of the work on producing the web crawler. This crawler is proving quite useful for the project due to the following traits:
*Java based
*A basic example is provided, allowing for a gentler learning curve than becoming familiar with a crawler entirely from scratch
*Specific case-handling methods are provided; methods for some URL cases are supplied (without handling code, but they give a starting point for the solution)
*Highly modifiable

After modifying the supplied crawler and making use of the methods demonstrated in the supplied example, the crawler is capable of the following (a simple sketch of these URL checks is given at the end of this section):
*Compiling successfully (a huge bonus, as it allows actual tests to take place)
*Accepting a base URL from the console
*Testing for the correct protocol (only HTTP websites are of interest for now)
*Testing for a null URL
*Testing for non-HTML URLs
*Testing for URLs that link to external websites
*Running the handleLink() method
*Iterating through the website and producing a list of the URLs found and their associated names, along with an incrementing counter

The screenshot below shows the command-line output printed at the beginning of a crawl of www.adelaide.edu.au, the university website. As can be seen, the results display each found URL with a number and a title. The "title = null" lines are explained later in this report.

[[Image:arachnidscreenprint.jpg|600px|centre]]

While Arachnid is proving quite useful, some issues have also been encountered (see the "Approach for Remainder of the Project" section). Because of this, alternative crawling methods have been researched in parallel with Arachnid. The main alternative being investigated for retrieving data from websites is Wget<ref>http://www.gnu.org/software/wget/</ref>. This program would be highly useful for the project for the following reasons:
*Freely available
*Supports downloading via HTTP
*Able to act as a web crawler, downloading recursively until reaching a set limit or the end of the website

The third point is extremely valuable to the goal of this project. Wget is able to mirror websites for offline viewing, which will also allow for easier extraction and searching of data.
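To make the capability list above more concrete, the following is a minimal, hypothetical Java sketch of the kinds of URL checks described (protocol, non-HTML and external-link tests). It does not use Arachnid's actual classes or method names; the class, helper methods and sample URL list are invented purely for illustration.

<syntaxhighlight lang="java">
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Hypothetical sketch of the URL checks described in this section.
 * This is NOT Arachnid's API; names here are illustrative only.
 */
public class UrlFilterSketch {

    /** True if the URL uses the HTTP protocol (other protocols are skipped for now). */
    static boolean isHttp(URL url) {
        return url != null && "http".equalsIgnoreCase(url.getProtocol());
    }

    /** True if the URL appears to point at an HTML page rather than, say, a PDF or image. */
    static boolean looksLikeHtml(URL url) {
        String path = url.getPath().toLowerCase();
        return path.isEmpty() || path.endsWith("/") || path.endsWith(".html") || path.endsWith(".htm");
    }

    /** True if the URL links to a host outside the site being crawled. */
    static boolean isExternal(URL url, URL base) {
        return !url.getHost().equalsIgnoreCase(base.getHost());
    }

    public static void main(String[] args) throws MalformedURLException {
        // In the real crawler the base URL is read from the console.
        URL base = new URL("http://www.adelaide.edu.au/");

        // Invented sample links standing in for URLs found while crawling.
        String[] found = {
            "http://www.adelaide.edu.au/courses/index.html",
            "https://www.adelaide.edu.au/secure/",     // wrong protocol, skipped
            "http://www.adelaide.edu.au/brochure.pdf", // non-HTML, skipped
            "http://www.google.com/"                   // external site, skipped
        };

        int counter = 0;
        for (String link : found) {
            URL url = new URL(link);
            if (!isHttp(url) || !looksLikeHtml(url) || isExternal(url, base)) {
                continue; // corresponds to the protocol/non-HTML/external tests above
            }
            counter++;
            // The real crawler also prints the page title next to each URL.
            System.out.println(counter + ": " + url);
        }
    }
}
</syntaxhighlight>

In the project's crawler, equivalent checks are carried out through the case-handling methods that Arachnid supplies (such as handleLink()), with each accepted URL printed alongside its counter and page title as shown in the screenshot above.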