=Progress Report 2010=
==Progress So Far==

===Verification of Past Results===
[[Image:randomletter.jpg|thumb|right|Completed Random Letter Sample form.]]
In order to decide on a direction for the use of the web crawler, it was first important to examine the previous students' work and decide whether their conclusions should be accepted. The verification of their results was undertaken in two parts:
*Collection of random letter samples for comparison with the Somerton man code
*Analysis and review of their code

A template for collecting random letter samples was created and used to collect 45 samples from a variety of subjects; 45 was chosen because it is roughly the number of letters in the Somerton man code. The following details were taken from each subject, and can be seen in the completed form to the right:
*Intox. yes/no – the physical state of each subject was recorded so that samples taken from intoxicated subjects could be kept separate from those taken from subjects who were, to the best of our knowledge, not under the influence of any substance
*Age – each subject was required to supply their age for future use
*M/F – the sex of each subject was recorded for future use

The graphs below show the recently collected samples next to the samples collected by the previous students and the letter frequency plot of the Somerton man code.

[[Image:Comparefreq.JPG|center|800px]]

Although the results are not identical to the previous students' results, both differ markedly from the graph depicting the Somerton man code: in both sets of collected samples, no letter has a frequency of less than one. This leads us to agree with the previous conclusion that the code was not produced randomly. So far there have not been enough samples from intoxicated subjects to produce a useful result; this collection is ongoing and will be finalised when there are enough samples to compile a graph that can be considered useful.
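For concreteness, the letter-frequency count behind comparison plots of this kind can be sketched in Java as below. This is an illustrative sketch only, not the project's actual tooling, and the class, method and sample names are invented:

```java
import java.util.Locale;

// Minimal sketch of a letter-frequency count of the kind used to build
// the comparison plots above. Not the project's actual tooling; the
// class and method names are illustrative only.
public class LetterFrequency {

    // Returns counts[0..25] for the letters 'a'..'z', ignoring case and
    // any non-alphabetic characters in the sample.
    public static int[] count(String sample) {
        int[] counts = new int[26];
        for (char c : sample.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical short sample, standing in for one collected
        // random letter form.
        int[] counts = count("ABABDMRGO");
        for (int i = 0; i < 26; i++) {
            if (counts[i] > 0) {
                System.out.println((char) ('A' + i) + ": " + counts[i]);
            }
        }
    }
}
```

Comparing such counts across the 45 collected samples is what produces the frequency bars shown in the plots.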
It was initially planned to use the collected age and sex data to subdivide the results into categories, in order to determine whether the letters in the Somerton man code correlate better with a specific age group or sex. It was then realised that this would require the collection of many more samples than are needed to verify the past students' results, and it has therefore not been included in the results. Collecting more samples and splitting the data into categories may provide useful results for future studies into the case.

The previous year's code was written to analyse the Somerton man's code using several well-known cipher techniques. Upon reviewing the software modules, it was decided that the algorithms used would perform the proper functions in attempting to prove or disprove their hypotheses. Their code was then compiled and run to generate a number of output files, which were examined and compared with the output files that the previous project group had generated. None of these outputs contained any legible words, from which it can be assumed that the Somerton man's code is not a Vigenère, Playfair or Caesar cipher. Rather, the results suggest that the mystery code resembles a set of initialisms, which may or may not use substituted letters.

===Text Pattern Algorithms===
The most recent revision of the pattern finding algorithm and testing text file can be found [[Media:FindMatch.rar|here]]. The text pattern algorithm was written from scratch in Java. Its main purpose is to search through a text file and report whether a word or initialism is found. This was done because we were not aware of what the web crawlers were capable of using their initial functions (that is, using a web crawler without modifying any of its code).
To start, a flow chart of the basic function of the algorithm was made to determine how to begin coding in the most efficient way possible. This is shown below:

[[Image:FindMatchFlowChart.jpg|500px|centre]]

Put as simply as possible, the basic algorithm was designed to:
*Accept a user-entered input from the command prompt and place it into a string
*Open a text file and place a line from that file into a string
*Compare the file string with the user's string character by character, printing an output if the entire user-entered string is found
*Repeat the comparison process for every line until the end of the file is reached

The algorithm uses the Scanner class from the util package as well as the File and FileReader classes from the io package. These classes are important because they allow the algorithm to open the needed text file and scan through it; they were easily found in the Java API<ref>http://java.sun.com/j2se/1.5.0/docs/api/</ref> and are fairly simple to use. For this algorithm, the output reports the line and character number of the first letter of the matched word found in the text file. To simplify the searching process, both the file string and the user string are converted to lower case. This improves the robustness of the algorithm because it removes the cases where two characters are the same letter but differ only in case; since case does not matter for our results, ignoring it has no negative effect on the final outcome of the project.

To implement searching for initialisms, the area circled in red was modified so that the swCount variable only changes if the previous character in the file string is a white space. The output reports the line and character number as in the previous method, as well as the sentence that the initialism represents.
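The exact-match flow in the bullet points above can be sketched as follows. This is a hedged reconstruction for illustration, not the project's actual FindMatch code, and the class and method names are invented:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Sketch of the exact-match flow described above: read a file line by
// line with the Scanner class and report where a word occurs. This is a
// reconstruction for illustration, not the project's actual FindMatch code.
public class FindExactSketch {

    // Case-insensitive index of word within line, or -1 if absent.
    // Both strings are lowered first, mirroring the simplification
    // described in the report.
    public static int indexOfIgnoreCase(String line, String word) {
        return line.toLowerCase().indexOf(word.toLowerCase());
    }

    // Scans the file at path and prints the line and character number
    // (both 1-based) of each line's first match of word.
    public static void search(String path, String word) throws FileNotFoundException {
        Scanner in = new Scanner(new File(path));
        int lineNo = 0;
        while (in.hasNextLine()) {
            lineNo++;
            int idx = indexOfIgnoreCase(in.nextLine(), word);
            if (idx >= 0) {
                System.out.println("\"" + word + "\" found at line " + lineNo
                        + ", character " + (idx + 1));
            }
        }
        in.close();
    }
}
```

Lowering both strings before comparison is what makes searches for "TEST" and "test" behave identically.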
Initialism searching was implemented because the aim of the project is to find what the initialisms of the Somerton code could possibly represent. This seemed to work, although several errors in the output were observed in some special cases. For example, if the file string contained two white spaces in a row, the output would be incorrect: the result would print the wrong character position and the wrong set of words matching the initialism. Problems also occurred when the input string contained spaces, as the method incorrectly assumed that each space marked part of the initialism. Several attempts were made to debug and fix these problems. The final solution was to create another method, used by the initialism finding algorithm, to remove the white spaces from the input string. One more method was also created to print the initialism sentence properly for every case that was thought of.

For testing purposes, a main method was created which again uses the Scanner class. The method accepts a user-entered input to determine which algorithm will be used for the current search (FindExact or FindInitialism), and then accepts another user-entered string representing the word or initialism to be searched for. It is currently unknown whether this will be used as the interface for the final web crawler, although the final interface is likely to have a similar function. Some of the simulations and results from testing can be found below.

[[Image:javaresults2.jpg|560px|centre]]

The image above demonstrates the FindExact method. From the word being searched for and the testing text file used, it can be seen that the algorithm works as intended. As stated earlier, the code treats upper and lower case characters as the same; this is illustrated above by the searches for "TEST" and "test" returning the same result.
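One way to realise the initialism search just described, including the whitespace-stripping fix, is sketched below. Again this is an illustrative reconstruction under assumptions, not the project's code, and all names are invented:

```java
import java.util.Arrays;

// Sketch of the initialism search described above, including the fix of
// stripping white space from the user's input string. Illustrative
// reconstruction only; names do not come from the project's code.
public class FindInitialismSketch {

    // The debugging fix described in the report: remove all white space
    // from the user-entered initialism before matching.
    public static String stripSpaces(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns the first run of words in line whose initial letters spell
    // the initialism (case-insensitive), or null if there is none.
    public static String findInitialism(String line, String initialism) {
        String target = stripSpaces(initialism).toLowerCase();
        String[] words = line.trim().split("\\s+");
        for (int start = 0; start + target.length() <= words.length; start++) {
            boolean match = true;
            for (int i = 0; i < target.length(); i++) {
                String w = words[start + i];
                if (w.isEmpty()
                        || Character.toLowerCase(w.charAt(0)) != target.charAt(i)) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return String.join(" ",
                        Arrays.copyOfRange(words, start, start + target.length()));
            }
        }
        return null;
    }
}
```

Splitting the line on runs of white space (`\\s+`) also sidesteps the double-space bug noted above, since consecutive spaces collapse into a single word boundary.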
The FindExact method correctly determines the line and character number of the beginning of the word. Furthermore, it treats the input " hello" as "hello" (without the leading spaces) and can correctly find its location in the text file. The code ignores spaces at the start of the search string but keeps those within it, so " t e s t" is treated as "t e s t": spaces between characters are preserved, while spaces before the first non-whitespace character are removed. Since the FindExact method can search for multiple words, this behaviour was considered a requirement. To achieve it, an algorithm deletes white spaces from the user input string until a non-whitespace character is met. Additionally, the main method returns an input error when a letter other than 'i' or 'e' is entered.

[[Image:javaresults1.jpg|560px|centre]]

This image shows that the FindInitialism method returns the same result for every variation of the initialism "test" (with spaces in different places in the string). A considerable amount of debugging and modification was required for this to work correctly. Extensive use of the isWhitespace() method helped to achieve a working algorithm: it was used to determine how many words of the initialism had been printed so far, and whether a character was the beginning of a word. FindInitialism is the method that will be developed further for the final web crawler, as its function is specific to the project's final outcome: generating a list of potential sentences for an inputted initialism by reading through the source code of a web site.

===Web Crawling===
An open source Java-based web crawler called Arachnid<ref>http://arachnid.sourceforge.net/</ref> has so far been the main focus in producing the web crawler.
This crawler is so far proving quite useful for the project due to the following traits:
*Java based
*Basic example provided, allowing for a reasonably gentle learning curve rather than having to become familiar with a crawler from scratch
*Specific case-handling methods provided: methods for some URL cases are supplied (with no handling code, but they provide a starting point for the solution)
*Highly modifiable

After modifying the supplied crawler and exploiting methods used in the supplied example, the crawler is so far capable of the following:
*Compiling successfully (a huge bonus, as it allows actual tests to take place)
*Taking a base URL from the console
*Testing for the correct protocol (we only want to look at HTTP websites for now)
*Testing for a null URL
*Testing for non-HTML URLs
*Testing for URLs that link to external websites
*Running a handleLink() method
*Iterating through the website and producing a list of the URLs found and their associated names, along with an incrementing counter

The screen print below shows the command lines printed out at the beginning of a search on www.adelaide.edu.au, the university website. As can be seen, the results display each found URL with a number and a title. The "title = null" lines are explained later in this report.

[[Image:arachnidscreenprint.jpg|600px|centre]]

While Arachnid is proving quite useful, some issues have also been encountered (see the "Approach for Remainder of the Project" section). Because of this, alternative crawling methods have been researched in parallel with Arachnid. The main alternative being investigated for retrieving data from websites is Wget<ref>http://www.gnu.org/software/wget/</ref>.
Wget would be highly useful for the project for the following reasons:
*Freely available
*Supports downloading via HTTP
*Able to act as a web crawler, downloading recursively until reaching a set limit or the end of the web site

The third point is extremely valuable to the goal of this project: Wget can mirror websites for offline viewing, which will allow for easier extraction and searching of data.
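As an illustration of the URL checks listed for the modified Arachnid crawler (null, protocol, non-HTML and external-site tests), a simplified stand-alone version might look like the sketch below. This is not Arachnid's actual handling code, and the file-extension list is an assumption made for the example:

```java
import java.net.URL;

// Simplified, stand-alone version of the URL checks listed above for the
// modified Arachnid crawler. This is a sketch, not Arachnid's actual
// handling code; the file-extension list is an illustrative assumption.
public class UrlFilterSketch {

    // Decides whether a discovered link should be crawled, relative to
    // the base URL the crawl started from.
    public static boolean shouldCrawl(URL base, URL link) {
        if (link == null) {
            return false;                                   // null URL test
        }
        if (!"http".equals(link.getProtocol())) {
            return false;                                   // HTTP-only test
        }
        String path = link.getPath().toLowerCase();
        if (path.endsWith(".jpg") || path.endsWith(".gif")
                || path.endsWith(".pdf") || path.endsWith(".zip")) {
            return false;                                   // non-HTML test
        }
        if (!link.getHost().equalsIgnoreCase(base.getHost())) {
            return false;                                   // external website test
        }
        return true;
    }
}
```

A crawler loop would call a filter like this on every link found in a page before adding it to the list of URLs to visit.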