Final Report/Thesis 2015
=====Search Engine=====
The initial n-gram search code was written in Python and submitted to our GitHub repository for review. On the advice of project supervisor Dr Berryman, it became clear that although the code worked on a small data set, our full data set (1.79 tebibytes when compressed) was far too large for it, so the code was modified to fit the suggested workflow and run in parallel on Amazon instances.

The gram database provided grams up to a maximum length of n = 5, so groups of at most five-letter grams could be processed at a time. Python code was written to generate all possible 5-gram ''initialisms'' from all code variants, including the crossed-out line, and output them to a corresponding text file. The same was done for 4-, 3-, 2- and 1-grams, each stored in its own text file. These served as the input files from which the search engine performed its queries against the database.

The search engine code was also written in Python. It read the ''initialism'' combinations of length n from the text files created by the initialism generator and stored them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in line by line, and an ''initialism'' was generated from the gram on each line. If the ''initialism'' generated from a line matched an entry in the dictionary of ''initialisms'' of interest, the full gram was written to a text file of results of length n. This code was copied and modified for each gram length from n = 1 to 5. A simplified flowchart of the code is shown in Figure 23, and the full code is given in Appendix A.

[[File:Search_Engine_Flow_Chart.png|thumb|600px|centre|'''Fig. 23:''' Search Engine Flowchart]]

Running our code on the Google N-Gram database stored in the i2.xlarge instances, in parallel for each group of n-gram inputs from n = 1 to 5, took approximately two weeks. The raw results were then small enough to store and process locally, so the Amazon EC2 service was no longer required. A frequency for each n-gram was then obtained with Python code that counted the unique entries for each gram; this was implemented to reduce the time needed to produce a frequency with which to rank the popularity of each gram. A bug unfortunately caused the per-year frequency of occurrence of each gram to be lost, so the count of the number of years in which each gram occurred was used as the measure of frequency instead.
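The initialism-generation step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the example variant string and the file-naming scheme are assumptions, and the real generator handled every transcription variant of the code.

```python
def generate_initialisms(code_line, n):
    """Return all contiguous n-letter windows ('initialisms') from a code line."""
    letters = [c for c in code_line.upper() if c.isalpha()]
    return ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]

# Illustrative code variant only, not an actual Somerton Man transcription.
variant = "MRGOABABD"

# Write one input file per gram length, mirroring the 1- to 5-gram files above.
for n in range(1, 6):
    with open(f"initialisms_{n}gram.txt", "w") as f:
        for gram in generate_initialisms(variant, n):
            f.write(gram + "\n")
```

Sliding a window of length n over the letters gives every possible n-letter initialism for that variant, which is what the search engine later treats as an "initialism of interest".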
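The matching logic of the search engine might look like the sketch below. It assumes the tab-separated layout of the Google Books N-Gram files (gram, year, match count, volume count); the function names and sample lines are illustrative, not taken from the thesis code.

```python
def gram_initialism(gram):
    """First letter of each word in a gram, upper-cased."""
    return "".join(word[0].upper() for word in gram.split())

def search_grams(gram_lines, initialisms_of_interest):
    """Yield full database lines whose gram's initialism is of interest.

    Each line is assumed to look like:
    'word1 word2 ... wordn<TAB>year<TAB>match_count<TAB>volume_count'
    """
    for line in gram_lines:
        gram = line.split("\t")[0]
        if gram_initialism(gram) in initialisms_of_interest:
            yield line

# Hypothetical example: one 3-gram initialism of interest, two database lines.
interest = {"TQT"}
lines = ["the quick test\t1999\t12\t3", "another long gram\t2001\t5\t2"]
matches = list(search_grams(lines, interest))
```

Because membership testing against a set (or dictionary keys, as the report describes) is constant-time, each database line costs only one initialism computation and one lookup, which is what made the full parallel scan feasible.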
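The fallback frequency measure, counting the distinct years in which each gram appears rather than summing its per-year occurrence counts, could be sketched as below. The input format again assumes the tab-separated N-Gram layout, and the sample lines are invented for illustration.

```python
from collections import defaultdict

def year_counts(result_lines):
    """Count the number of distinct years each gram appears in."""
    years = defaultdict(set)
    for line in result_lines:
        fields = line.split("\t")
        gram, year = fields[0], fields[1]
        years[gram].add(year)
    return {gram: len(ys) for gram, ys in years.items()}

# Hypothetical result lines: the same gram in 1999 and twice in 2000.
lines = [
    "the quick test\t1999\t12\t3",
    "the quick test\t2000\t7\t2",
    "the quick test\t2000\t1\t1",
]
ranking = year_counts(lines)  # counts 2 distinct years for this gram
```

Using a set per gram deduplicates repeated years, so the result is a year count usable as a popularity ranking even after the per-year occurrence figures were lost.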