Final Report/Thesis 2015
=====Search Engine=====
The initial n-gram search code was written in Python and submitted to our GitHub repository for review. On the advice of project supervisor Dr Berryman, it became clear that although the code worked on a small data set, our full data set (1.79 tebibytes when compressed) was far too large for it, so the code was modified to fit the suggested workflow and run in parallel on Amazon instances.

The gram database provided grams up to a maximum length of n = 5, so groups of at most five-letter grams could be processed at a time. Python code was written to generate all possible 5-gram ''initialisms'' from all code variants, including the crossed-out line, and output them to a corresponding text file. The same was done for 4-, 3-, 2- and 1-grams, each stored in its own text file. These served as the input files from which the search engine performed its queries against the database.

The search engine code was also written in Python. It read the ''initialism'' combinations of length n from the text files created by the initialism generator and stored them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in line by line, and an ''initialism'' was generated from the gram on each line. If the ''initialism'' generated from a line matched an entry in the dictionary of ''initialisms'' of interest, the full gram was written to a text file of results of length n. This code was copied and modified for each gram length from n = 1 to 5. A simplified flowchart of the code is shown in Figure 23, and the full code is given in Appendix A.

[[File:Search_Engine_Flow_Chart.png|thumb|600px|centre|'''Fig. 23:''' Search Engine Flowchart]]

Running our code on the Google N-Gram database stored in the i2.xlarge instances, in parallel for each group of n-gram inputs from n = 1 to 5, took approximately two weeks. The raw results were then small enough to store and process locally, so the Amazon EC2 service was no longer required. A frequency for each n-gram was then obtained with Python code that counted the unique entries for each gram; this was implemented to reduce the time needed to produce a frequency with which to rank the popularity of each gram. A bug unfortunately caused the per-year frequency of occurrence of each gram to be lost, so the count of the number of years in which each gram occurred was used as the measure of frequency instead.
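The initialism-generation step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the example variant string and the file-naming scheme are assumptions, and the real generator handled every transcription variant of the code.

```python
def generate_initialisms(code_line, n):
    """Return all contiguous n-letter windows ('initialisms') from a code line."""
    letters = [c for c in code_line.upper() if c.isalpha()]
    return ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]

# Illustrative code variant only, not an actual Somerton Man transcription.
variant = "MRGOABABD"

# Write one input file per gram length, mirroring the 1- to 5-gram files above.
for n in range(1, 6):
    with open(f"initialisms_{n}gram.txt", "w") as f:
        for gram in generate_initialisms(variant, n):
            f.write(gram + "\n")
```

Sliding a window of length n over the letters gives every possible n-letter initialism for that variant, which is what the search engine later treats as an "initialism of interest".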
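The matching logic of the search engine might look like the sketch below. It assumes the tab-separated layout of the Google Books N-Gram files (gram, year, match count, volume count); the function names and sample lines are illustrative, not taken from the thesis code.

```python
def gram_initialism(gram):
    """First letter of each word in a gram, upper-cased."""
    return "".join(word[0].upper() for word in gram.split())

def search_grams(gram_lines, initialisms_of_interest):
    """Yield full database lines whose gram's initialism is of interest.

    Each line is assumed to look like:
    'word1 word2 ... wordn<TAB>year<TAB>match_count<TAB>volume_count'
    """
    for line in gram_lines:
        gram = line.split("\t")[0]
        if gram_initialism(gram) in initialisms_of_interest:
            yield line

# Hypothetical example: one 3-gram initialism of interest, two database lines.
interest = {"TQT"}
lines = ["the quick test\t1999\t12\t3", "another long gram\t2001\t5\t2"]
matches = list(search_grams(lines, interest))
```

Because membership testing against a set (or dictionary keys, as the report describes) is constant-time, each database line costs only one initialism computation and one lookup, which is what made the full parallel scan feasible.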
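The fallback frequency measure, counting the distinct years in which each gram appears rather than summing its per-year occurrence counts, could be sketched as below. The input format again assumes the tab-separated N-Gram layout, and the sample lines are invented for illustration.

```python
from collections import defaultdict

def year_counts(result_lines):
    """Count the number of distinct years each gram appears in."""
    years = defaultdict(set)
    for line in result_lines:
        fields = line.split("\t")
        gram, year = fields[0], fields[1]
        years[gram].add(year)
    return {gram: len(ys) for gram, ys in years.items()}

# Hypothetical result lines: the same gram in 1999 and twice in 2000.
lines = [
    "the quick test\t1999\t12\t3",
    "the quick test\t2000\t7\t2",
    "the quick test\t2000\t1\t1",
]
ranking = year_counts(lines)  # counts 2 distinct years for this gram
```

Using a set per gram deduplicates repeated years, so the result is a year count usable as a popularity ranking even after the per-year occurrence figures were lost.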