====Method====
Research was conducted on the available databases, and it was found that the two largest databases best suited to our needs were 'Microsoft Web N-Gram Services' <ref>Microsoft Research. (2015). Microsoft Web N-Gram Services [Online]. Available: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx</ref> and Google N-Gram <ref>Google Research Blog. (2006, August 3). All Our N-Gram are Belong to You [Online]. Available: http://googleresearch.blogspot.com.au/2006/08/all-our-n-gram-are-belong-to-you.html</ref>. The Microsoft Web N-Gram Service was initially considered more favourable due to its larger database <ref>C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [Online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf</ref>, better documentation and tutorials <ref>No Author. Microsoft Web N-Gram Service Quick Start [Online]. Available: http://weblm.research.microsoft.com/info/QuickStart.htm</ref>, and lower initial cost (see the Budget section). Despite these advantages, further research into the Microsoft Web N-Gram Service showed that it could only be used for our purposes via a best-first dynamic programming search, requiring a very large number of calls to the N-gram service's generate method. In consultation with our project supervisor, Dr Berryman, there was also a concern that a combinatorial explosion would occur: for a 5-gram search over a vocabulary of, say, 50,000 words, the search engine would have to make 50,000<sup>5</sup> + 50,000<sup>4</sup> + 50,000<sup>3</sup> + 50,000<sup>2</sup> + 50,000 calls to the database to complete the search. Because of the number of network calls this method required, the Microsoft Web N-Gram Service was deemed unfit for purpose in our application, since it would not be fast enough to complete all of the searches we required. The Google alternative was therefore considered instead.

=====Programming Language Selection=====
The search engine was initially to be implemented as a web application using developer tools such as Microsoft Visual Studio or Eclipse [14], for use with Microsoft's Web N-Gram Service. The application could have been programmed in languages including Visual Basic, Visual C#, Visual C++, Visual F# and JScript <ref>No Author. Visual Studio Languages [Online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx</ref>. Instead, on the advice of Dr Berryman, Python was chosen for use with the Google N-Gram database because of its ease of use and efficiency for text processing. Having decided to use the Google N-Gram database, a choice had to be made between purchasing the University of Pennsylvania's Linguistic Data Consortium version and obtaining the raw database for free directly from Google. The Linguistic Data Consortium's version was initially preferred because of its concise and clean format, with all total n-gram frequencies summed and collated. After discussion with Dr Berryman, however, it was decided that purchasing the cleaned version was not worthwhile, since we could extract the data we needed from the raw database and clean up the outputs easily enough with some simple Python code, as sketched below. The free database provided by Google was therefore used (see the Budget section for further details).
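For illustration, a minimal sketch of the kind of clean-up code referred to above is shown below. It is not the project code, and it assumes the raw Google N-gram files are tab-separated in the form n-gram, year, match count, volume count; the file names used are placeholders only.

<syntaxhighlight lang="python">
# Minimal sketch: reduce a raw Google N-gram file to "n-gram <tab> total count" lines.
# Assumes each raw line is tab-separated as: ngram, year, match_count, volume_count.
from collections import defaultdict

def summarise_raw_ngrams(in_path, out_path):
    totals = defaultdict(int)
    with open(in_path, encoding='utf-8') as raw:
        for line in raw:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3:
                continue                      # skip malformed lines
            ngram, year, match_count = fields[0], fields[1], fields[2]
            totals[ngram] += int(match_count)  # sum counts over all years
    with open(out_path, 'w', encoding='utf-8') as out:
        for ngram, count in sorted(totals.items()):
            out.write('%s\t%d\n' % (ngram, count))

# Placeholder file names for illustration only.
summarise_raw_ngrams('googlebooks-eng-all-5gram-sample.txt', '5gram-totals.txt')
</syntaxhighlight>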
=====Storage and Processing Implementation=====
Although the database could have been stored locally, the local computing power available would have been insufficient to run the search engine code over the database within the time frame of the project. Instead, a cloud-based computing service with greater processing power was sought so that the database search could be completed within the project's time restrictions. After considering a number of options, 'Amazon Elastic Compute Cloud' (EC2) was chosen for its robust storage and processing options <ref>Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/</ref> and for Dr Berryman's prior experience with the service. The Amazon EC2 free tier was assessed but has a 30 GB storage limit<ref>Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/</ref>, which was insufficient to store the Google N-gram database. In addition, the free tier only provides t2.micro instances, which offer 1 vCPU and 1 GiB of RAM, with only 20% of each vCPU available for use <ref>Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/</ref>. On this basis, it was estimated that running the search on the free tier would have taken approximately 20 months, far too long for the project timeframe. Instead, the high input/output Amazon i2 tier was proposed to provide the performance needed to store and process the database. After some experimentation with different tiers, two i2.xlarge instances run on Amazon EC2 were selected, each providing 4 vCPUs, 30.5 GiB of RAM, and 2 x 800 GB of SSD storage<ref>Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/</ref>. Using this tier allowed for parallelisation by running a separate process for each group of n-gram inputs from n = 1-5, i.e. five separate instances of the search engine code.

=====Search Engine=====
The initial n-gram search code was written in Python and submitted to our GitHub repository for review. On the advice of project supervisor Dr Berryman, it was established that although the code would work on a small data set, our data set was so large (1.79 tebibytes when compressed) that the code had to be modified to fit the suggested workflow and run in parallel on the Amazon instances. The maximum n provided by the N-gram database is five, so groups of at most five letters from the code could be processed at a time. A Python script was therefore written to generate all possible 5-gram ''initialisms'' from all code variants, including the crossed-out line, and output them into a corresponding text file; a sketch of such a generator is shown below. The same was done for 4, 3, 2 and 1-grams, stored in their respective text files. These files were used as the inputs from which the search engine performed its queries against the database.
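A minimal sketch of the kind of ''initialism'' generator described above is given below; it is not the project code (the full code is in Appendix A). It assumes each code variant is supplied as a string of letters and that every contiguous window of n letters is emitted; the variant strings shown are placeholders, not the actual Somerton Man code.

<syntaxhighlight lang="python">
# Minimal sketch of an initialism generator: for each code variant, emit every
# contiguous window of n letters to a text file (one window per line).
# The variant strings below are placeholders, not the actual code.
code_variants = ['ABCDEFG', 'ABXDEFG']

def write_initialisms(variants, n, out_path):
    with open(out_path, 'w') as out:
        seen = set()
        for variant in variants:
            for i in range(len(variant) - n + 1):
                window = variant[i:i + n]
                if window not in seen:        # avoid writing duplicate windows
                    seen.add(window)
                    out.write(window + '\n')

for n in range(1, 6):                         # 1-grams up to 5-grams
    write_initialisms(code_variants, n, '%d-gram_initialisms.txt' % n)
</syntaxhighlight>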
The search engine code was also written in Python. It took the length-n ''initialism'' combinations from the Somerton Man code, read from the text files created by the initialism generator code, and stored them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in by a reader and ''initialisms'' were generated from the grams in each line. If the ''initialism'' generated from a line matched an ''initialism'' in the dictionary of ''initialisms'' of interest, the full gram was output to a corresponding text file containing results of length n. This code was copied and modified for each gram length from n = 1-5. A simplified diagram of how the code works is given in the flowchart in Figure 23, the full code can be seen in Appendix A, and a minimal sketch of the matching step is shown below.

[[File:Search_Engine_Flow_Chart.png|thumb|600px|centre|'''Fig. 23:''' Search Engine Flowchart]]

Running our code on the Google N-Gram database stored in the i2.xlarge instances, in parallel for each group of n-gram inputs from n = 1-5, took approximately two weeks. The raw results were then small enough to be stored and processed locally, so the Amazon EC2 service was no longer required. A frequency for each n-gram was then obtained with Python code that counted the number of unique entries for each gram, in order to reduce the time needed to produce a frequency for ranking the popularity of each gram. Unfortunately, a bug in this step caused the frequency of occurrence of grams in each year to be lost, and so the count of the number of years in which each gram occurred was used as the measure of frequency instead.
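As an illustration of the matching step described above, a minimal sketch follows; it is not the project code (the full code is in Appendix A). It assumes the ''initialisms'' of interest have already been written one per line by the generator, that each database line starts with the n-gram itself as space-separated words before any tab-separated counts, and it stores the ''initialisms'' in a set rather than a dictionary; file names are placeholders.

<syntaxhighlight lang="python">
# Minimal sketch of the matching step: read 5-grams from a raw database file,
# form the initialism from the first letters of the five words, and write out
# any full database line whose initialism is one of the initialisms of interest.
def load_initialisms_of_interest(path):
    with open(path) as f:
        return set(line.strip().upper() for line in f if line.strip())

def search_ngram_file(db_path, initialisms, out_path, n=5):
    # Append mode so that several database files can be searched into one results file.
    with open(db_path, encoding='utf-8') as db, open(out_path, 'a', encoding='utf-8') as out:
        for line in db:
            words = line.split('\t')[0].split()   # the n-gram itself, before any counts
            if len(words) != n:
                continue
            initialism = ''.join(w[0] for w in words).upper()
            if initialism in initialisms:
                out.write(line)                    # keep the full database line as a result

# Placeholder file names for illustration only.
interest = load_initialisms_of_interest('5-gram_initialisms.txt')
search_ngram_file('googlebooks-eng-all-5gram-sample.txt', interest, '5-gram_results.txt')
</syntaxhighlight>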
=====Processing Search Results=====
In the raw results, some grams contained words followed by an underscore and the word's lexical category (i.e. noun, verb, adverb, adjective, pronoun, etc.). As these tags were not wanted, another Python script was written to remove everything but the words themselves from each line of the results. Once the lexical categories had been removed, the output contained multiple identical results, each with its own frequency (the number of years in which it occurred): entries that the database had treated as unique became identical once their lexical categories were stripped. This was rectified by another Python script that processed the cleaned results, combining identical entries and summing their year frequencies. This script was then duplicated and modified into two versions: the first output the results in alphabetical order, and the second in order of year frequency from highest to lowest. The alphabetically sorted outputs were compared against the cleaned inputs, which were also sorted alphabetically, to check that the code was functioning correctly. The frequency-sorted outputs were more useful, as they were used to generate a condensed list of the top 30 most popular ''initialisms'' that could be generated from the letters of all variations of the Somerton Man code, shown in the results section in Figure 24.

=====Combinations of Search Results=====
Finally, a Python script was written to generate all possible combinations of the top two 5-gram group results for each variant of the code, where the top two results were based on the frequency of years in which they occurred. This was achieved using a non-overlapping sliding window of five letters. The way this code works is more easily explained with an example. For simplicity, using 2-grams and the code ABAC: if the top two 2-grams for AB are "Absolute Bargain" and "American Beagle", and the top two 2-grams for AC are "Air Conditioning" and "Alternating Current", then all possible combinations for the code are "Absolute Bargain Air Conditioning", "Absolute Bargain Alternating Current", "American Beagle Air Conditioning" and "American Beagle Alternating Current". This code was implemented as an exercise to see whether any interesting or useful results could be obtained with this simple method. Unfortunately, it produced nonsensical results due to the disjoint between each 5-gram group's search results; a sample of these can be seen in the results section in Figure 25. Due to the time constraints of the project, the code could not be developed further, but the code and its results can be used as a first step towards obtaining meaningful combinations of n-grams from the results produced by the search engine developed throughout this project. The code could be improved by using a sliding window that advances by fewer than five letters per search; for example, a step size of one letter would create the maximum possible overlap of four letters between each input gram group. More information on this and other suggested improvements can be found in the Future Work section.
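A minimal sketch of this combination step, using the ABAC example above, is given below; it is an illustration only, and assumes the top-ranked expansions for each window are already available as lists.

<syntaxhighlight lang="python">
# Minimal sketch of the combination step: split the code into non-overlapping
# windows, look up the top expansions for each window, and take the Cartesian
# product of those lists (itertools.product) to form every combination.
from itertools import product

def combinations_for_code(code, top_results, n=2):
    windows = [code[i:i + n] for i in range(0, len(code), n)]   # non-overlapping windows
    options = [top_results[w] for w in windows]                  # top expansions per window
    return [' '.join(choice) for choice in product(*options)]

# The 2-gram example from the text (top two results per window).
top_results = {
    'AB': ['Absolute Bargain', 'American Beagle'],
    'AC': ['Air Conditioning', 'Alternating Current'],
}
print(combinations_for_code('ABAC', top_results, n=2))
# ['Absolute Bargain Air Conditioning', 'Absolute Bargain Alternating Current',
#  'American Beagle Air Conditioning', 'American Beagle Alternating Current']
</syntaxhighlight>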