Final Report 2010
==Methodology==

===The Text Matching Algorithm===

====Function====
The text parser is a piece of code written in Java that parses through a text or HTML file and attempts to find specific word and pattern segments within that file. It is divided into several parts and provides a means of parsing through a large file directory relatively quickly. The parser makes use of readily available and easy-to-use packages from the Java API<ref>http://java.sun.com/j2se/1.5.0/docs/api/ The Java API</ref>. In particular, it utilises the Scanner class from the util package, as well as the File and FileReader classes from the IO package.

====Implementation====
First, a main method takes the user's input: which mode to use, the directory to look through, and the pattern, word or initialism to search for. Once these inputs have been taken, the parser calls a method that finds all of the files in the given directory, calling itself recursively whenever it encounters a subfolder. Once the file list has been determined, the code calls a different parsing method on each file in the list, depending on which mode was chosen in the main method. These are FindExact, which searches for an exact word in a file; FindInitialism, which searches for the initial letters of a word; and FindPattern, which searches for all initialisms that match a given pattern.

It is important to understand the difference between FindInitialism and FindPattern. FindInitialism takes the initialism itself as input (e.g. "abab") and parses through a text searching for it. FindPattern, on the other hand, takes a pattern as input (e.g. "#@#@" or "##@#") and parses through a text to collect every initialism that matches that pattern (so for "#@#@", FindPattern will find "abab", "acac", "xzxz", etc.).

The parsing code works by reading the text file in as strings, one line at a time.
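The FindPattern check described above can be sketched as follows. This is a hypothetical reconstruction, not the project's actual code: the class and method names are illustrative, and the requirement that two different pattern symbols map to two different letters (so "#@#@" would not match "aaaa") is an assumption on our part.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a FindPattern-style test: does an initialism such as "abab"
// fit a pattern such as "#@#@"? Positions sharing a pattern symbol must
// share a letter; distinct symbols must take distinct letters (assumed).
public class PatternMatcher {
    public static boolean matches(String initialism, String pattern) {
        if (initialism.length() != pattern.length()) return false;
        Map<Character, Character> symbolToLetter = new HashMap<>();
        Map<Character, Character> letterToSymbol = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++) {
            char sym = pattern.charAt(i);
            char ch = Character.toLowerCase(initialism.charAt(i));
            Character boundLetter = symbolToLetter.putIfAbsent(sym, ch);
            Character boundSymbol = letterToSymbol.putIfAbsent(ch, sym);
            // Symbol already bound to a different letter, or vice versa.
            if (boundLetter != null && boundLetter != ch) return false;
            if (boundSymbol != null && boundSymbol != sym) return false;
        }
        return true;
    }
}
```

With this check, "abab", "acac" and "xzxz" all match "#@#@", while "abab" fails "##@#" because its first two letters differ.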
The string is then examined character by character to determine whether a match or a pattern occurs. For the initialism and pattern searches, however, we are only interested in the initial letters of words, so specific conditional statements are needed to determine whether a character meets certain criteria before comparing it with the initialism or pattern being searched for. Generally, any whitespace or punctuation such as periods or commas (but not apostrophes) is treated as the beginning or end of a word, so the characters that follow are the ones of interest.

If the parser does find a match, it produces output both in the command prompt and in a results text file, stating the line number and the character position within that line where the match was first found. Depending on the mode, it also prints the pattern or the actual initialism. After it has gone through all of the files in the directory, a summary of the results is printed. For FindExact, it prints the number of times the word was found and in how many files. For FindInitialism, as well as the number of times the initialism was found, it also prints the expected proportion of the initialism across the entire set of files together with the proportion that actually occurred; from this we can determine whether a set of files contained an initialism more or less often than expected. Finally, FindPattern prints an array of which initialisms occurred and how often.

Because the parser is designed to handle text from HTML files, it is also programmed to ignore HTML code. This improves both accuracy and speed, since less text has to be parsed. Additionally, when entering the word or initialism to look for with FindExact or FindInitialism, the user can type an asterisk (*) and the code will treat that letter as a wild card (matching any letter).
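The word-boundary rule and the wild-card comparison described above can be sketched like this. Again, this is an illustrative reconstruction under stated assumptions, not the project's source: the class and method names are invented, and case-insensitive comparison is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-line scanning described above: whitespace and
// punctuation (but not apostrophes) delimit words, and only each word's
// first letter feeds the initialism search. The '*' wild card used by
// FindExact/FindInitialism is also shown.
public class LineScanner {
    // True if c begins/ends a word: whitespace, or punctuation other than an apostrophe.
    static boolean isBoundary(char c) {
        return Character.isWhitespace(c)
                || (!Character.isLetterOrDigit(c) && c != '\'');
    }

    // Collect the initial letter of every word on one line of text.
    public static List<Character> initials(String line) {
        List<Character> out = new ArrayList<>();
        boolean atWordStart = true;
        for (char c : line.toCharArray()) {
            if (isBoundary(c)) {
                atWordStart = true;           // next letter starts a new word
            } else {
                if (atWordStart && Character.isLetter(c)) {
                    out.add(Character.toLowerCase(c));
                }
                atWordStart = false;
            }
        }
        return out;
    }

    // Comparison with '*' in the query treated as a single-letter wild card.
    public static boolean matchesWithWildcard(String text, String query) {
        if (text.length() != query.length()) return false;
        for (int i = 0; i < text.length(); i++) {
            char q = Character.toLowerCase(query.charAt(i));
            if (q != '*' && q != Character.toLowerCase(text.charAt(i))) return false;
        }
        return true;
    }
}
```

For example, the line "All bears are brown." yields the initials a, b, a, b, and the query "b*ar" matches "bear" with the second letter wild.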
While this has no direct use in this project, it is useful in general, such as when you do not know how to spell the word you are searching for.

At the moment, there are some limitations to the code. Firstly, the code can only parse through text, HTML and other similar files. It cannot handle complex formats such as Microsoft Office documents or PDF files; although this limits the number of results we can obtain, many places offer eBooks and similar material in text-file format, so it is not a serious problem. Also, due to how the code was written, FindPattern is currently limited to finding initialisms that span only a single line of text. FindInitialism does not suffer from this limitation; however, there are some issues in printing an initialism that spans three or more lines. That is, the initialism is detected correctly but printed incorrectly, so this only affects the accuracy of the reported sentences and does not affect the frequency calculations in the results. The latest version of the software can be found in the appendix.

===The Web Crawler===

====Function====
The basic function of the web-crawling portion of the project is to access text on the internet and pass it directly to the pattern matching algorithm. This allows reasonably fast access to large quantities of raw text that can be processed thoroughly and used for statistical analysis.

====Implementation====
Several different approaches were tried in order to find a crawling method that was both effective and simple to use. After experimenting with available open-source crawlers such as Arachnid<ref>http://arachnid.sourceforge.net/ The Arachnid Web Crawler</ref> and JSpider<ref>http://j-spider.sourceforge.net/ The JSpider Web Crawler</ref>, we turned our attention to a simpler solution that could be operated directly from the command prompt.
Such a program would, we hoped, allow us to input a website or list of websites of interest, collect the relevant data, and then give us some control over the pattern matching methods used to produce useful results. After much searching and experimenting, we came across an open-source crawler called HTTrack<ref>http://www.httrack.com/</ref>. HTTrack was chosen for the following reasons:
*It is free.
*It is simple to use. Both a GUI version and a command-line version come with the standard package, which let us become familiar with the program visually and then translate that experience into scripted commands.
*It allows full website mirroring. The text from the websites is stored on the computer and can be used both offline and for multiple searches, without needing to access and search the internet every time.
*It has a huge number of customisation options. These give control over things such as search depth (how many links deep into a website to go), whether to access external websites or just one (avoiding jumps to websites containing irrelevant data), and search criteria (only text is downloaded; no images, movies or other unwanted files that waste downloads).
*It abides by the Robots Exclusion Protocol (individual access rights customised by the owner of each website).
*It has a command-prompt option, which allows a user-friendly approach and integration with the pattern matching algorithm.
To keep the whole project user friendly, a batch file was created that follows this process:
#Takes in a URL or list of URLs pre-saved in a text file at a known location on the computer.
#Prompts the user to enter a destination on the computer to store the data retrieved from the website.
#Accesses HTTrack and performs a predetermined search on the provided URL(s).
#Once the website mirroring is complete, moves to the predetermined location containing the pattern matching code.
#Compiles and runs the pattern matching code.
For a simple video demonstration of the web crawler working in conjunction with the pattern matching software, see [http://www.youtube.com/watch?v=Pk6AaKM8wd4 Web Crawler Demo]. The demonstration video shows the use of the file-input batch file; there is also a version that takes in URLs directly from the user. All software is available from the appendix.
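The batch-file workflow above can be sketched in Java as simple command assembly. This is a hypothetical illustration only: the Parser class name and its argument order are invented, and it is assumed that the httrack executable is on the PATH (the real HTTrack command line takes a URL and an output directory via -O).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the crawl-then-parse pipeline: build the HTTrack mirroring
// command (steps 1-3), then the pattern-matcher command (steps 4-5).
public class CrawlAndParse {
    // Mirror url into destDir with HTTrack; "-O" selects the output path.
    public static List<String> mirrorCommand(String url, String destDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("httrack");   // assumes httrack is on the PATH
        cmd.add(url);
        cmd.add("-O");
        cmd.add(destDir);
        return cmd;
    }

    // Run the pattern matching code over the mirrored text.
    // "Parser", the mode name and argument order are all hypothetical.
    public static List<String> parseCommand(String destDir, String mode, String query) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("Parser");
        cmd.add(mode);        // e.g. FindInitialism
        cmd.add(destDir);
        cmd.add(query);
        return cmd;
    }

    // Execute one assembled command, sharing the console with the user.
    public static void run(List<String> cmd) throws Exception {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

In use, one would call run(mirrorCommand(...)) followed by run(parseCommand(...)), mirroring the two halves of the batch file.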