Editing Final Report 2010 (section)

====Implementation====

First, there’s a main method that takes the user’s input of which mode to use, the directory to look through, and the pattern, word or initialism to search for. Once these inputs have been taken, the parser then calls a method to find all of the files in the inputted directory as well as recursively calling itself when it finds a folder in the given directory. 

Once the file list has been determined the code then calls a different parsing method depending on which mode was chosen in the main method for each file in the list. These are FindExact, which searches for an exact word in a file; FindInitialism which searches for the initial letters of a word; and FindPattern which searches for all initialisms that match a given pattern.

It is important to understand the difference between FindInitialism and FindPattern. FindInitialism takes the initialism as an input (ie “abab”) and then parses through a text and searches for it. On the other hand, FindPattern takes a pattern as an input (ie “#@#@” or “##@#”) and parsers through a text to get every initialism that matches that pattern (so for the case of “#@#@”, FindPattern will find “abab”, “acac”, “xzxz”, etc).

The parsing code works by taking in segments of the text file as a string one line at a time. The string is then observed character by character to determine if a match or a pattern occurs.  However for the initialism and pattern search, we are only interested in the initial letters of a word, so specific conditional statements are needed to determine if a character meets a certain criteria before observing it for comparison with the initialism or pattern we are searching for. Generally, any whitespace or punctuation such as periods or commas (but not apostrophes) is considered the beginnings and ends of words, so the subsequent characters are the ones we are interested in.

If the parser does find a match, it produces an output into the command prompt as well as a results text file stating the line and character number of the line where it was initially found. As well, depending on the mode, it prints the pattern or the actual initialism.

After it has gone through all of the files in the directory, a summary of the results is printed. For FindExact, it prints the number of times it was found and in how many files. For FindInitialism, as well as the amount of times it was found, it also prints the expected proportion to find the initialism in the entire set of files as well as the proportion that actually occurred. From this we can determine whether a set of files contained an initialism more or less than expected. Finally, FindPattern prints out an array of which initialisms occurred as well as how often.

Because it is designed to parse through text from HTML files, it is also programmed to ignore HTML code. This has the added advantage of improving accuracy as well as the speed because less text has to be parsed. Additionally, when inputting the word or pattern to look for while using FindExact or FindInitialism, the user can type in an asterisk(*) and the code will consider that letter to be a wild card(any and every letter). While this does not have any real use for this project, it has non-specific general use such as when you do not know how to spell the word you are searching for.

At the moment, there are some limitations to the code. Firstly, the code can only parse through text, HTML and other similar files. It cannot handle complex files such as Microsoft Office documents or PDF files and although this limits the amount of results we can obtain, there are many places that offer eBooks and such in text file format so it is not really a problem. Also, due to how the code was written, FindPattern is currently limited to finding initialisms that span only a single line of text. FindInitialism does not suffer from this limitation; however there are some issues in printing the initialism if it spans 3 or more lines. That is, the initialism is properly detected but not printed properly, so it is only really a problem for sentence accuracy and doesn’t really affect the frequency calculations for the results.

The latest version of the software can be found in the appendix.