Editing Progress Report 2010 (section)

===Text Pattern Algorithms===

The most recent revision of the pattern finding algorithm and testing text file can be found [[Media:FindMatch.rar|here]].

The text pattern algorithm was written from scratch using java.  The main purpose of the algorithm is to search through a text file and return an output saying if a word or initialism is found. This was done because we weren’t aware of what the web crawler’s were capable of using the initial functions (that is, using the web crawler without modifying any of the code).

To start, a flow chart of the basic function of the algorithm was made to determine how to begin coding in the most efficient way possible. This is shown below:


[[Image:FindMatchFlowChart.jpg|500px|centre]]


To explain the function as simply as possible, the basic algorithm was designed to:
*Accept a user entered input from the command prompt and place it into a string
*Open a text file and place a line from that file into a string
*Compare the file string with the user’s string character by character, printing an output if the entire user entered string was found
*Repeat comparison process for every line until the end of the file is reached.

The algorithm utilises the scanner class from the util package as well as the File and FileReader classes from the io package. They are important because these classes allow the algorithm to open the needed text file and scan through it. They were easily found on the java API<ref>http://java.sun.com/j2se/1.5.0/docs/api/</ref> and are fairly simple to use. For this algorithm, the output would report the line and character number of the first letter of the matched word found in the text file.

To simplify the searching process, both the text file string and the user string were converted to lower case letters. It was felt that this helped improve the functionality of the algorithm because it involved fewer cases where characters would be the same letter but not the same. Since it doesn’t really matter if the characters are upper case or lower case for the results, there will really be no negative effect on the final outcome of the project.

To implement searching for initialisms, the area circled in red was modified to only change the swCount variable if the previous character in the file string was a white space. The results of the output would report the line and character number like the previous method, as well as the sentence that the initialism represents. This was done because the aim of the project is to find what the initialisms of the Somerton code could possibly represent. This seemed to work, although several errors in the output were observed when some special cases were done.

For example in the file string, if there were two white spaces in a row, the output would be incorrect (the result would print out the wrong character position and the wrong set of words that matched that initialism). Additionally, problems occurred when the input string had spaces in it as well (the method incorrectly assumed that the space represented an initialism as well).
Several attempts were made to debug and fix these problems. The final solution was to create another method (to be used by the initialism finding algorithm) to remove the white spaces in the input string. Also, one more method was created to print the initialism sentence properly for every case that was thought of.

For testing purposes, a main method was created which utilises the scanner class again. For this, the method accepts a user entered input to determine which algorithm will be used for the current search (FindExact or FindInitialism) and then accepts another user inputted string that represents the word or initialism that will be searched for. It is currently unknown whether this will be used as the interface for the final web crawler, although it is likely that the final interface will have a similar function.

Some of the simulations and results from testing can be found below.


[[Image:javaresults2.jpg|560px|centre]]


The image above demonstrates the FindExact method. From observing the word being searched for, and the testing text file used, it is possible to see that the algorithm works in the assumed manner. As stated earlier, the code considers upper and lower case characters the same – this is illustrated above with the searches for “TEST” and “test” returning the same thing. The method correctly determines the line and character number of the beginning of the word. Furthermore, the method determines the word “  hello” as “hello” (without the spaces at the start) and can correctly find its location in the text file.

Although the code ignores spaces in the search string, it can correctly determine the words “  t e s t” as “t e s t” with spaces in between, but not before the first non-whitespace character. Since the FindExact method can search for multiple words, it was felt that this specification was a requirement in the code. To do this, an algorithm was used to delete all of the white spaces in the user input string until a non-whitespace character was met.

Additionally, the main method returns an input error when a letter other than ‘i’ or ‘e’ is used.


[[Image:javaresults1.jpg|560px|centre]]


This image shows that the FindInitialism method returns the same result for every different search of the initialism test (with spaces in different places in the string). Incredible amounts of debugging and modification was done in order for this to work correctly. Extensive use of the isWhitespace() method helped to achieve a working algorithm. This method was very important because it was used to determine how many words of the initialism were currently printed as well as determining if a character was the beginning of a word or not.

The FindInitialism method is the method that will be developed for the final web crawler. It has a function specific to the final outcome for the project: to generate a list of potential sentences for an inputted initialism by reading through the source code of a web site.