==Approach for Remainder of the Project==

===Web Crawler Plans===
With the web crawler design based on the Arachnid crawler, there are currently two major issues to be resolved:
*Character encoding differences with HTML URLs
*Traversal method: breadth first vs depth first

The image below shows a cut-out from the command prompt displaying the current character encoding issue. The text under the numeral "1" is from a successful URL discovery, where the blue underline shows the URL 'type' and the red underline shows the found 'title' of the URL. In the second section, under the numeral "2", the type is shown as UTF-8 and the title being displayed is null, so no valid information is retrieved from the URL found at that point.

[[Image:charencoding.jpg|400px|center]]

It is important to overcome this problem because UTF-8 is the international "standard" for encoding URLs and appears to be the most common type found. The plan is to first convert each URL to UTF-16 before handing it to Arachnid. Arachnid currently handles UTF-16 quite well, so it may be simpler to convert each discovered URL to the encoding that already works than to alter the crawler to deal with a whole new character encoding. To do this, the Java URLEncoder class will be experimented with (a sketch of this experiment is given at the end of this section).

The other issue, traversal order, is less critical but will be important for data collection and storage. The websites of interest are information sites containing large amounts of text, most likely arranged on the site in a structured, hierarchical manner. It is therefore important to be able to move down one avenue of a site's information and store similar items of interest together, so that they can easily be collected and analysed later. Breadth-first traversal is the simplest approach to filtering through a website, but it stores the data it finds in an unorganised manner with respect to the aim of the project. Depth-first is slightly harder to implement but, if it can be achieved, will result in all results of a similar nature being stored in an ordered fashion. The following figures show the difference between breadth-first and depth-first traversal and the resulting order of stored data. Each node can be thought of as a different web page on the same website, where the numbers give the order in which each page would be searched.

[[Image:breadthfirst.jpg|800px|center]]

It is clear from the diagrams that depth first allows data from the same category in a hierarchy to be searched in order, so that the patterns and details found are also stored in a usable order. Depth-first traversal is the method most useful to the project and will be implemented as best as possible (the second sketch at the end of this section shows how small the difference between the two traversals is in code).

In order to minimise any large delays in the project due to issues with Arachnid, more time will be spent with Wget, using it as an effective web crawler. As Wget is written in C, there may be some issues in adapting the pattern searching code to use data extracted by Wget. Researching Wget and experimenting with it as soon as possible is therefore important, in order to leave as much time as possible for combining the web crawler and the pattern matching algorithm.
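As a starting point for the URLEncoder experiment described above, the following is a minimal sketch; the class name, method name and example input are invented for illustration, and the exact point where Arachnid hands over a URL is still to be determined. The raw bytes of a discovered URL are decoded as UTF-8 into a Java String (which is UTF-16 internally, so this is the "convert to UTF-16" step), then percent-encoded so only ASCII-safe text reaches the crawler:

<syntaxhighlight lang="java">
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlNormaliser {

    public static String normalise(byte[] rawUrlBytes)
            throws UnsupportedEncodingException {
        // Java Strings are UTF-16 internally, so decoding the raw
        // bytes as UTF-8 is the "convert to UTF-16" step.
        String asUtf16 = new String(rawUrlBytes, "UTF-8");
        // Percent-encode so only ASCII reaches the crawler. Note that
        // URLEncoder is intended for individual query components, so a
        // full URL would need to be encoded piece by piece.
        return URLEncoder.encode(asUtf16, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Invented example: a fragment containing a non-ASCII character.
        byte[] raw = "caf\u00e9-menu".getBytes("UTF-8");
        System.out.println(normalise(raw)); // prints caf%C3%A9-menu
    }
}
</syntaxhighlight>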
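To make the traversal comparison concrete, the sketch below shows that the two orders differ only in which end of the frontier the next page is taken from: the front gives breadth first, the back gives depth first. The store and fetchLinks methods are hypothetical stand-ins for Arachnid's actual page handling, not real Arachnid calls.

<syntaxhighlight lang="java">
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Traversal {

    // Taking pages from the front of the frontier gives breadth-first
    // order; taking them from the back gives depth-first order, which
    // keeps pages from the same branch of a site together.
    public static void crawl(String startPage, boolean depthFirst) {
        Deque<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add(startPage);
        while (!frontier.isEmpty()) {
            String page = depthFirst ? frontier.removeLast()
                                     : frontier.removeFirst();
            if (!visited.add(page)) {
                continue; // page already searched and stored
            }
            store(page);
            for (String link : fetchLinks(page)) {
                frontier.addLast(link);
            }
        }
    }

    // Hypothetical stand-ins for the crawler's real fetch/store steps.
    static void store(String page) { System.out.println(page); }
    static List<String> fetchLinks(String page) { return Collections.emptyList(); }
}
</syntaxhighlight>

Because only the removal end of the frontier changes, switching from breadth first to depth first should be a small, localised change, provided Arachnid's traversal queue turns out to be accessible.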
===Pattern Algorithm Plans===
The major remaining areas of the pattern matching algorithm are the implementation of ignoring HTML, and enabling the algorithm to search for exact and 'similar' initialisms as described in the Stage 1 Design. Ignoring HTML will likely be done by ignoring characters in the text file from the point a "<" is encountered, then resuming character comparison after a ">" is found (sketched below). The method for finding 'similar' initialisms has not yet been identified, although some ideas have been formed; the approach will likely be similar to the FindInitialism method, relying on careful, logical coding rather than a simple isWhitespace check. We would also like to implement wildcard searches when ' * ' is used. This will likely be done by incrementing the swCount variable whenever the ' * ' character is compared to any character in the text file string (also sketched below). Further implementation could include saving the found initialisms into another text file for later inspection. Beyond that, more testing and debugging should be done, and any unforeseen errors fixed as soon as possible. Once that is completed, the code will be ready to be integrated into the web crawler. This area of the project is almost finished, and the future lies with the combined web crawler and pattern matching algorithm.

===Obtaining and Using Results===
The main sites of interest will be websites with large amounts of raw data that can easily be parsed by the pattern matching algorithm. If possible, a variety of different sites should be searched to widen the results as much as possible. Using both the web crawler and the pattern matching algorithm, we can then produce a list of potential initialisms for the Somerton man's code. From this list we can create frequency graphs to narrow it down and, hopefully, solve the mystery code.
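The tag-skipping idea from the Pattern Algorithm Plans above can be sketched as follows. This is a minimal illustration rather than the project's actual code, and it does not yet handle awkward cases such as ">" appearing inside attribute values or script blocks, which testing would need to cover:

<syntaxhighlight lang="java">
public class HtmlStripper {

    // Drop every character between '<' and the next '>', so that the
    // pattern matcher only ever compares the visible page text.
    public static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;   // start ignoring characters
            } else if (c == '>') {
                insideTag = false;  // resume comparing characters
            } else if (!insideTag) {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>MLIAO <b>PANP</b></p>"));
        // prints: MLIAO PANP
    }
}
</syntaxhighlight>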
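The wildcard idea could look like the sketch below. Since the FindInitialism code is not reproduced in this report, the sketch assumes ' * ' stands for exactly one arbitrary character, and uses a local counter in the role the plan assigns to swCount:

<syntaxhighlight lang="java">
public class WildcardMatch {

    // Compare a pattern against a candidate initialism of the same
    // length, counting '*' as a match against any character.
    public static boolean matches(String pattern, String candidate) {
        if (pattern.length() != candidate.length()) {
            return false;
        }
        int matchCount = 0; // plays the role of swCount
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p == '*' || p == candidate.charAt(i)) {
                matchCount++; // '*' is counted as a match unconditionally
            }
        }
        return matchCount == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("M*IAO", "MLIAO")); // true
        System.out.println(matches("M*IAO", "MTBIM")); // false
    }
}
</syntaxhighlight>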
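Finally, a sketch of the tallying step that would sit behind the frequency graphs mentioned above; the input initialisms here are invented examples, not real crawler output:

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

public class FrequencyCount {

    // Tally how often each candidate initialism was found; these
    // counts are the raw data behind a frequency graph.
    public static Map<String, Integer> tally(String[] found) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String initialism : found) {
            Integer current = counts.get(initialism);
            counts.put(initialism, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] found = { "MLIAO", "ITTMT", "MLIAO" };
        System.out.println(tally(found));
        // e.g. {MLIAO=2, ITTMT=1} (HashMap iteration order varies)
    }
}
</syntaxhighlight>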