Final Report/Thesis 2015: Difference between revisions
Revision as of 20:34, 19 October 2015
Abstract
The project involves the mysterious case of a dead man found at Somerton Beach, South Australia. There was no evidence of the man’s identity or cause of death; however, 5 lines of letters were found on a scrap of paper in the dead man’s trouser pocket. It was later discovered that the scrap of paper was torn from a book known as ‘The Rubaiyat of Omar Khayyam’. These letters are considered vital to the case, as it is speculated that they may be a code or cipher of some sort. The case remains unsolved today, and so this project has been undertaken in order to uncover further case evidence. The aims and objectives of the project include using various computational techniques to statistically analyse the likely language of origin of the code, designing and implementing software in order to decipher the code, and ultimately attempting to solve the cold case.
Executive Summary
This project involves the design and implementation of software in order to decipher the code associated with the Somerton Man murder mystery. This document is a Final Report and Thesis outlining the key aims, methods, results, evaluations and justifications of the specific tasks involved in the Code Cracking: Who Murdered The Somerton Man? Honours project. The report also covers the motivation, previous studies, aims, objectives, and significance of the project, as well as technical background, knowledge gaps, technical challenges, specific project tasks and project management resources.
Write key findings etc. from conclusion
Acknowledgements
- Project supervisor Prof. Derek Abbott for providing the inspiration and motivation in undertaking this project, as well as advice on specific tasks to be completed to further the research into uncovering the mystery of the Somerton Man code.
- Project supervisor Dr. Matthew Berryman for all advice and assistance with statistical methods and software programming.
Introduction
Motivation

On the 1st of December, 1948, a dead body was found at Somerton Beach, South Australia [1]. There was no evidence of the man’s identity or the cause of death [2]; however, 5 lines of capital letters, with the second line struck out, were found on a scrap of paper in the dead man’s trouser pocket [3]. A photo of the paper containing the letters can be seen in Figure 1. It was later discovered that the scrap of paper was torn from a book known as the 'Rubaiyat of Omar Khayyam' [4]. These letters are considered vital to the case, as it is speculated that they may be a code or cipher of some sort. As engineers, we have the ability to help investigators in solving the case. With that in mind, this project is being undertaken to attempt to decrypt the code in order to help solve the cold case.
The South Australian Police stand to benefit from this project, not only from the decoding technology developed for this case, but also because that technology may be applicable to similar cases. Historians may be interested in gaining further historical information from this project, since the case occurred during the heightened tension of the Cold War and it is speculated that the case may be related to it in some way [5]. Pathologists may also be interested, as the cause of death may have been an unknown or undetectable poison [6]. This project stands to benefit the wider community, as well as the extended family of the unknown man, by providing closure to the mysterious case. Professor Derek Abbott also stands to benefit, as he has been working closely with honours project students for the past seven years in an attempt to decipher the Somerton Man code.
Previous Studies/Related Work

Previous professional attempts to decipher the code were limited since they did not use modern techniques or have access to modern databases. Another limitation is that some of the characters in the code appear to be ambiguous, and previous attempts made fixed assumptions about particular characters [7]. The Australian Navy’s response was that the letters were “neither a code nor a cipher” [8]. The Australian Department of Defence analysed the characters in 1978 using the computer technology available in that era and concluded:
- a) “There are insufficient symbols to provide a pattern”
- b) “The symbols could be a complex substitute code or the meaningless response to a disturbed mind”
- c) “It is not possible to provide a satisfactory answer” [9]
Other previous studies into deciphering the code include Honours Projects at the University of Adelaide from 2009-2013. The previous work undertaken by these groups includes: multiple evolutions of letter frequency analysis of the code on a variety of base texts in a number of languages, initial letter and sentence letter probabilities, the probabilities of known cypher techniques, the likelihood of the code being an initialism of a poem, the use of various one-time pad techniques, the design and implementation of a web crawler, the analysis of text type and genre of the code’s likely plaintext, the implementation of pattern matching software into the web crawler, a 3D generated reconstruction of the bust of the Somerton Man (see Figure 2) and the analysis of mass spectrometer data taken from the Somerton Man’s hair [10] [11] [12] [13] [14]. The main conclusions that past groups have come to in their projects are: that the letters are unlikely to be random, the code is unlikely to be an initialism, it is likely that the Rubaiyat of Omar Khayyam was used as a one-time pad, the language of the code is likely to be English, the code is unlikely to be an initialism of a poem and that the Rubaiyat of Omar Khayyam was not used as a straight substitution one-time pad [15] [16] [17] [18] [19]. The analysis and extension upon specific elements of previous work that are directly related to the 2015 group’s project are discussed in the 'Method – Specific Tasks' section.
Aims and Objectives
The key aims and objectives in this project included the aim to statistically analyse the likely language of the plaintext of the code. Another aim was to design and implement software in order to try and decipher the code. This was to be implemented by using the 'Rubaiyat of Omar Khayyam' as a one-time pad in conjunction with a new key technique, and by developing a search engine to try to discover possible n-grams contained within the code. The third aim was to analyse mass spectrometer isotope concentration data of the Somerton Man’s hair. Finally, the ultimate aim was to decrypt the code in order to solve the mystery, however this was somewhat unrealistic as the code has remained uncracked for many years. Despite this, computational techniques were to be utilised to attempt the decryption, and at the very least, the past research into the case was to be furthered for future Honours students.
Significance
Considered “one of Australia’s most profound mysteries” at the time [20], this case remains unsolved today. As decoder technology and the related knowledge progress, this project offers the opportunity to uncover further case evidence. The skills developed in undertaking this project were also of great significance in a broader sense, as they are transferable to possible future career paths. The techniques developed include: software and programming skills, information theory, probability, statistics, encryption and decryption, data mining and database trawling. The job areas and industries that these skills can be applied to are: computer security, communications, digital forensics, computational linguistics, defence, software, e-finance, e-security, telecommunications, search engines and information technology. Some possible job examples include working at: Google, ASIO, ASIS and ASD [21].
Technical Background
P-Value Theorem Explanation
Chi-Squared Test Explanation
Universal Declaration of Human Rights Explanation
Project Gutenberg Explanation
N-Gram Model Explanation
One-Time Pad Explanation
Knowledge Gaps and Technical Challenges
In order to complete the specific tasks proposed within this project, the group members were required to develop new skills in text processing and programming in a variety of languages. These languages included Java, MATLAB, and Python. The group was also required to research and learn how to implement and evaluate statistical techniques including chi-squared testing, p-value calculation and hypothesis testing. Another necessary skill to be developed was the use of Microsoft Excel software to perform statistical analyses. The technical challenges that were to be faced within the project were directly associated with these knowledge gaps.
Method - Specific Tasks
Task 1: Statistical Frequency Analysis of Letters
Aim
- Critical review of previous results
- Use statistical methods to verify whether the most likely language of the Somerton Man code is English
- Universal Declaration of Human Rights as base text
- Find out how common letters are in each language
- Calculate the P-value between Somerton Man code and other languages
2013 Statistical Frequency Analysis Review
A critical review of the statistical frequency analysis of the letters from the 2013 group has been conducted to determine possible extensions to be undertaken by the 2015 group. The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the Universal Declaration of Human Rights. This document was chosen since it has been translated into over 400 languages. The 2015 group is to analyse this choice of base document using statistical techniques. Only 266 languages were analysed, since many of the translations were not text files, but paper scans. This seems a reasonable omission given the time constraints of the project and the fact that the common European languages are all available in text form. The analysis included accented letters normalised to their ‘parent’ characters, for example, considering 'ǎ' as 'a'. The languages were then analysed using a variety of combinations of the ambiguous letters within the code. The 2013 group’s statistical analysis was then refined to the top 20 closest European languages. These included more uncommon European languages such as Scots and Vepsian. The graphs of the results from the 2013 group’s analysis can be seen in Figure 1.1. The most likely language of the Somerton Man code and control text was found to be Scots. This is unlikely, since Scots is a Scottish dialect rather than a common European language [22]. Due to this, Scots and other less common European languages will be omitted from the 2015 group’s statistical analysis. Instead, only the most common European languages will be included in order to further refine the statistical process. Despite the Scots result, the 2013 group concluded that the most likely language was English, since Scots and English shared very similar initial letters. In order to test this conclusion, the 2013 group used the English translation of the Rubaiyat of Omar Khayyam as an English control text.
The first 44 words from the text were used since this is the same number as the number of letters in the Somerton Man code. The results showed, once again, that Scots was the most likely language, followed by English (see Figure 1.1). This control text was chosen by the 2013 group as a matter of convenience. This may not have been the best decision, as the Rubaiyat of Omar Khayyam was originally written in Persian and has since been translated into English. The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text, as the translation may use uncommon words or expressions. Another reason to suspect uncommon words or expressions is that the text is a book made up of four-line poems, meaning that it may not accurately represent commonly used words or letters in such a small sample size of 44 words. The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel, originally written in the most likely language.

Base Text
The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java was undertaken for use in compiling and editing the code. The text files from the Universal Declaration of Human Rights used by the 2013 group for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an Excel spreadsheet including the language, the numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, in MATLAB and by hand.
A spreadsheet containing the letter frequency data from the Universal Declaration of Human Rights was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency Java code. This spreadsheet was used in the statistical calculations comparing the letter frequency in the code with that of each language in the Universal Declaration of Human Rights.
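The tabulation described above can be illustrated with a minimal Python sketch. This is not the 2013 group's Java code; the function names are hypothetical, and the accent handling (Unicode decomposition to a 'parent' letter) is an assumption about how the normalisation could be done.

```python
import string
import unicodedata
from collections import Counter

LETTERS = set(string.ascii_uppercase)


def normalise_letter(ch):
    """Map an accented letter to its unaccented 'parent', e.g. 'ǎ' -> 'A'."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.upper()


def initial_letter_counts(text):
    """Count the first alphabetic character of every word in the text."""
    counts = Counter()
    for word in text.split():
        for ch in word:
            if ch.isalpha():
                letter = normalise_letter(ch)
                # Only letters that reduce to A-Z are tabulated in this sketch.
                if letter in LETTERS:
                    counts[letter] += 1
                break
    return counts


def proportions(counts):
    """Convert raw counts to proportional frequencies over A-Z."""
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in string.ascii_uppercase}
    return {letter: counts[letter] / total for letter in string.ascii_uppercase}
```

Calling `proportions(initial_letter_counts(text))` on the text of one translation would give one row of the kind of table described, with one column per letter.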
Statistical Methods
Both parametric testing and non-parametric testing were considered for Task 1.
Method
Assumptions
Before commencing statistical calculation, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text will be considered for analysis, and not every letter. This assumption has been made since in Task 2: The Web Crawler Re-design, the crawler will search for English grams using each letter in the Somerton Man code as the first letter of each word in the gram phrase. In addition, in Task 3: The Rubaiyat of Omar Khayyam as a One-time Pad, the deciphering technique used relies on the initial letters of words in the pad text. Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.
The letters M and W in the Somerton Man code are very similar in appearance, so the group considered three versions of the code. In the first version, all 6 ambiguous M or W letters are M. In the second version, the 6 ambiguous letters are 4 Ms and 2 Ws. In the third version, they are 5 Ms and 1 W, which is how the code appears in the original police report from 1949.
Parametric Testing
Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.
Non-Parametric Testing
Because the parametric methods, which assume normally distributed data, proved unusable, a number of non-parametric tests were researched, including the Kolmogorov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test. These 3 tests were researched and considered, but ultimately the Chi-Squared test was chosen based on a cryptography reference that uses the Chi-Squared test to compare a code to a particular language [23], as well as advice from supervisor Dr Matthew Berryman. An extension for future Honours groups could be to attempt the analysis with other non-parametric tests such as the Kolmogorov-Smirnov Test or Mann-Whitney U Test. The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights. The sample texts chosen were the first 44 letters from the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark [24], as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.
Results And Evaluation
Parametric Testing
ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Neither of these methods produced valid p-values, as they compare total means and variances. Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes. This was computed using MATLAB, but also produced unusable p-values: the method uses the mean and standard deviation of the differences between samples to calculate the p-value, so letter and text type data is lost in the process and the method is not applicable. The p-values calculated using these methods fluctuated depending on the type of data used. If the raw letter counts for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) were used, then the p-value became very small, since the total means of each text were vastly different. If instead the frequency of each letter as a percentage of the total number of letters in each text was used, this gave a p-value of 1, since the means became the same. Thus this method of mean comparison did not work for letter frequency analysis.
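The reason the proportional data gave identical means can be seen in a short Python sketch. The two letter-count dictionaries are hypothetical illustrations, not data from the project: once counts are converted to proportions over the 26 letters, every text has a mean of exactly 1/26, so a comparison of means cannot distinguish between texts.

```python
import string
from statistics import mean


def proportional_frequencies(counts):
    """Return the 26 letter proportions (A-Z) for a dictionary of counts."""
    total = sum(counts.values())
    return [counts.get(letter, 0) / total for letter in string.ascii_uppercase]


# Hypothetical counts: a 44-letter sample and a 1500-letter reference text.
sample_counts = {"A": 10, "T": 20, "S": 14}
reference_counts = {"D": 500, "E": 300, "M": 700}

# Raw counts: the means differ only because the texts differ in length.
raw_mean_sample = sum(sample_counts.values()) / 26
raw_mean_reference = sum(reference_counts.values()) / 26

# Proportions: both means equal 1/26, whatever the letter distribution.
mean_sample = mean(proportional_frequencies(sample_counts))
mean_reference = mean(proportional_frequencies(reference_counts))
```

Here the raw means differ by a factor of about 34, purely reflecting text length, while the proportional means are equal even though the two letter distributions have nothing in common.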
Non-Parametric Testing

During the Chi-Squared value calculation, a serious problem was encountered. Some letters appear 0 times in a particular language in the base text from the Universal Declaration of Human Rights. According to Figure 1.2, this causes the denominator of the Chi-Squared value equation to be 0, which leaves the value undefined. Some reasonable assumptions therefore had to be made in order to continue using Chi-Squared testing for Task 1. The first assumption was to replace a letter count of 0 in a language with 1, the closest positive integer to 0. The second assumption was to replace a frequency of 0 with 0.001 or 0.0001 rather than the number 1. The first assumption did not have a significant enough effect on the results to cause a skew in the data. The second assumption, however, resulted in very large Chi-Squared values (in the order of 10^10) and very small p-values (approaching 0).
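The effect of the two substitutions can be sketched in Python. The equation in Figure 1.2 is not reproduced in this text, so the sketch assumes the standard Pearson form, the sum over letters of (observed - expected)^2 / expected. The function name, the `zero_substitute` parameter and the renormalisation after substitution are illustrative assumptions, not the group's actual Excel or MATLAB procedure.

```python
import string


def chi_squared(observed_counts, reference, zero_substitute=1):
    """Pearson chi-squared statistic of a sample against a reference text.

    `reference` may hold raw counts (use zero_substitute=1) or proportional
    frequencies (use e.g. zero_substitute=0.0001). Zero entries are replaced
    so that no expected value, and hence no denominator, is zero.
    """
    letters = string.ascii_uppercase
    sample_size = sum(observed_counts.values())

    adjusted = {}
    for letter in letters:
        value = reference.get(letter, 0)
        adjusted[letter] = value if value > 0 else zero_substitute
    reference_total = sum(adjusted.values())

    statistic = 0.0
    for letter in letters:
        # Expected count if the sample followed the reference distribution.
        expected = sample_size * adjusted[letter] / reference_total
        observed = observed_counts.get(letter, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic
```

Under these assumptions, a sample letter that never occurs in the reference contributes a term of roughly 1 / expected, so a substitute of 0.0001 inflates the statistic far more than a substitute of 1, which is in line with the very large values reported for the second assumption.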
Top 20 European Language Comparison
The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the Chi-Squared test against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run to check how similar our results were to the 2013 group’s results. The results are shown for both of the zero-replacement assumptions described above.
Results

Figure 1.3 shows the Chi-Squared values of the closest 20 languages based on the 2013 group’s Squared Difference with 5 Ms and 1 W in the code (left graph) versus the 2013 group’s result (right graph). In this case, letter counts of 0 were replaced with 1.

Figure 1.4 shows the Chi-Squared values of the closest 20 languages based on the 2013 group’s Standard Deviation with 5 Ms and 1 W in the code (left graph) versus the 2013 group’s result (right graph). In this case, letter counts of 0 were replaced with 1.

Figure 1.5 shows the Chi-Squared values of the closest 20 languages based on the 2013 group’s Squared Difference with 5 Ms and 1 W in the code (left graph) versus the 2013 group’s result (right graph). In this case, letter frequencies of 0 were replaced with 0.0001.

Figure 1.6 shows the Chi-Squared values of the closest 20 languages based on the 2013 group’s Standard Deviation with 5 Ms and 1 W in the code (left graph) versus the 2013 group’s result (right graph). In this case, letter frequencies of 0 were replaced with 0.0001.
Evaluation
The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.
Top 20 European Languages based on Estimated Number of Native Speakers
In extension to the 2013 group’s work, the 2015 group conducted a Chi-Squared test of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers [25]. This test included all versions of the Somerton Man code including the versions with 6 Ms, 4 Ms 2 Ws, Average and the police report code (5 Ms and 1 W).
Results

Figure 1.7 shows the Chi-Squared values of the Top 20 European Languages based on Estimated Number of Speakers with 6 Ms in the code (left graph) and 4 Ms & 2 Ws (right graph). In this case, letter counts of 0 were replaced with 1.

Figure 1.8 shows the Chi-Squared values of the Top 20 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in the code (left graph) and the Average of the 3 cases (right graph). In this case, letter counts of 0 were replaced with 1.

Figure 1.9 shows the Chi-Squared values of the Top 21 European Languages based on Estimated Number of Speakers with 6 Ms in the code (left graph) and 4 Ms & 2 Ws (right graph). In this case, letter frequencies of 0 were replaced with 0.0001.

Figure 1.10 shows the Chi-Squared values of the Top 21 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in the code (left graph) and the Average of the 3 cases (right graph). In this case, letter frequencies of 0 were replaced with 0.0001.
Evaluation
The results show that for 2 of the 3 code versions and their average value, English was the closest language to the Somerton Man code. Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms. However, Kurdish is not a widely spoken language worldwide. English was also found to be the closest language to the code as in the original police report.
Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample
A comparison of the English control text (a 44 letter sample of Thomas Hardy’s Tess of the d’Urbervilles) against the top 20 European Languages based on Estimated Number of Native Speakers was conducted in order to test the ability of the method to identify an English sample text among the top 20 European languages.
Results

Figure 1.11 shows the Chi-Squared values of the Top 20 European Languages based on Estimated Number of Speakers against the Thomas Hardy Sample. In this case, letter counts of 0 were replaced with 1.

Figure 1.12 shows the Chi-Squared values of the Top 20 European Languages based on Estimated Number of Speakers against the Thomas Hardy Sample. In this case, letter frequencies of 0 were replaced with 0.0001.
Evaluation
The results of this test show that English was the closest text to the English sample text. This result is desirable as it successfully verified the ability of the Chi-Squared test to distinguish a 44 letter English sample out of the top 20 European languages, and can be used to back up the results obtained from comparing the same 20 languages against the Somerton Man code.
Evaluation and Justification
Evaluation of Parametric Testing
None of these methods compared individual data entries from one group to their corresponding entries in another group. Instead, they used the sample size, sum, mean, and variance of each group to compare it to the other group. Because of this, these methods could not be used when comparing letter frequency between languages. For example, the frequency of the letter A in one text must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.
Evaluation Of Task 1
The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the validity of the method, as well as the validity of the Universal Declaration of Human Rights as a base text. The group was then to extend the analysis by calculating the p-values for the Somerton Man code when compared to the most common European languages and perform hypothesis testing based on the results. The 2015 group’s statistical analysis has achieved almost all of its proposed goals. A slight divergence from the initially proposed method was decided upon once it was found that p-values useful for comparison or hypothesis testing could not be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated Chi-Squared values. All assumptions outlined in the proposal were followed, with the addition of the modification of the frequency data to account for the small sample size of the base text.
Conclusion
The results of all of the Chi-Squared testing lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written assuming it is an initialism.
Task 2: N-Gram Search
Aim
The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man code. Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section). Based on this, for this task an assumption was made that the code is an initialism. A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases. This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section). Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded. This was to drastically increase the speed at which the gram combinations on the web could be found. The search engine was required to output a list of possible grams from the input letter combinations. An assumption that was to be made in order to complete this task was that the letters in the code, and thus the words in the grams, were order relevant. Another assumption was that all variants of ambiguous letters in the code were to be included.
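The matching idea behind the search engine can be sketched in Python. The function names are hypothetical, and the letter string used in the usage note is an arbitrary placeholder rather than the Somerton Man code. The sketch assumes that a gram matches when the initial letters of its words, in order, equal a contiguous run of letters in the code.

```python
def gram_initials(gram):
    """Return the upper-case initial letters of the words in a gram."""
    return "".join(word[0].upper() for word in gram.split())


def match_positions(code_letters, gram):
    """Return every position at which the gram's initials occur in the code.

    The comparison is order relevant: the initials must appear as a
    contiguous, in-order run of letters.
    """
    initials = gram_initials(gram)
    n = len(initials)
    if n == 0:
        return []
    return [
        i
        for i in range(len(code_letters) - n + 1)
        if code_letters[i:i + n] == initials
    ]
```

For the placeholder letters "ABCAB", `match_positions("ABCAB", "a big cat")` returns `[0]` and `match_positions("ABCAB", "a big")` returns `[0, 3]`, while a gram whose initials do not occur in order returns an empty list. Ambiguous letters could be handled by running the same check once per code variant.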
Method
Research into the available databases was conducted, and it was found that the two largest databases that best suited our needs were 'Microsoft Web N-Gram Services' [26] and Google N-Gram [27]. The use of Microsoft Web N-Gram Services was initially considered more favourable due to its larger database [28], better documentation and tutorials [29], and lower initial cost (see Budget Section). Despite these advantages, upon further research and after consulting the Microsoft Web N-Gram, it was discovered that it could be used for our purposes only through a best-first DP search, with a very large number of calls to the N-gram service's generate method. Also, through consulting our project supervisor Dr. Berryman, there was a concern that a combinatorial explosion would occur, meaning that for a 5-gram search, if the Microsoft database contained, say, 50,000 words, the search engine would have to make 50,000^5 + 50,000^4 + 50,000^3 + 50,000^2 + 50,000 calls to the database to complete the search. Due to all the network calls required by this method, the Microsoft Web N-Gram Service was deemed unfit for purpose in our application, since the method would not be fast enough to complete all of the searches that we required. Due to this, the Google alternative was considered.
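The scale of the combinatorial explosion in the example above can be checked with a few lines of Python. The 50,000-word vocabulary is the illustrative figure from the text, not a measured property of the Microsoft database, and the function name is hypothetical.

```python
def worst_case_calls(vocabulary_size, max_n):
    """Sum of vocabulary_size**n for n = 1..max_n, as in the 5-gram example."""
    return sum(vocabulary_size ** n for n in range(1, max_n + 1))


# The figure quoted above: 50,000^5 + 50,000^4 + 50,000^3 + 50,000^2 + 50,000.
total_calls = worst_case_calls(50_000, 5)
print(f"{total_calls:.3e} calls")
```

The total is a little over 3.1 × 10^23 calls, dominated almost entirely by the 50,000^5 term, which illustrates why a search built on individual network calls was considered too slow.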
Programming Language Selection
The search engine was initially to be implemented as a web application using developer tools such as Microsoft Visual Studio or Eclipse [14], for use with Microsoft's Web N-Gram Service. The application could be programmed in native languages including Visual Basic, Visual C#, Visual C++, Visual F# and JScript [30]. Instead, Python was chosen for use in conjunction with the Google N-Gram database, on advice from Dr Berryman and due to its ease of use and efficiency with text processing.
Upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google. The Linguistic Data Consortium's version of the database was initially chosen as an alternative since it had the advantage of being concise and clean, with all total n-gram frequencies summed and collated. Upon discussion with supervisor Dr. Berryman, it was decided that it was not worth purchasing the cleaned version of the database, since we could extract the data we needed from the raw database and the outputs could be cleaned up easily enough by writing some simple Python code. Due to this, it was decided that the database provided by Google was to be used for our purposes (see Budget Section for further details).
Storage and Processing Implementation
Although the database could physically be stored locally, the local computing power available would have been insufficient to run the search engine code through the database within the time frame of the project. Instead, a cloud-based computing service with greater processing power was sought in order to complete the database search within the time restrictions of the project. After considering a number of options, it was decided that 'Amazon Elastic Compute Cloud' was to be used due to its robust storage and processing options [31] and Dr Berryman's prior experience in using this service. The Amazon EC2 free tier was assessed for use but had a 30 GB storage limit [32], which was insufficient to store the Google N-Gram database. In addition, the instances provided by the Amazon EC2 free tier were t2.micro instances, which provided 1 vCPU and 1 GiB of RAM, and only 20% of each vCPU could be used [33]. Based on this, it was estimated that the search would have taken approximately 20 months using this version of the Amazon Elastic Compute Cloud, which was far too long for the project timeframe. Instead, it was proposed to use the high input/output Amazon i2 tier to provide the performance needed to store and process the database. After some experimentation with different tiers, two i2.xlarge instances run on Amazon EC2 were proposed to be used, providing two sets of instances, each containing 4 vCPUs, 30.5 GiB of RAM, and 2 x 800 GB SSD storage [34]. Using this tier allowed for parallelisation by running separate processes for each group of n-gram inputs from n=1-5 using 5 separate instances of the search engine code.
Search Engine
The initial n-gram search code was written in Python and submitted to our GitHub repository for review. Based on advice from project supervisor Dr Berryman, it was discovered that the code would work on a small data set, but since our data set was so large (1.79 Tebibytes when compressed), the code was modified to fit the suggested workflow and run in parallel on Amazon instances.
The largest value of n provided by the n-gram database was five. Consequently, groups of at most five letters from the code could be processed at a time. This was achieved by writing Python code to generate all possible 5-gram initialisms from all code variants, including the crossed-out line, and output them into a corresponding text file. The same was done for 4, 3, 2 and 1-grams, which were stored in their respective text files. These were to be used as input files from which the search engine was able to query the database.
The search engine code was also written in Python. It functioned by taking in the initialism combinations of length n from the Somerton Man code, read from the text files created by the initialism generator code, and storing them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in by a reader, and an initialism was generated from the gram in each line of the reader. If the initialism generated from a line matched an initialism in the dictionary of initialisms of interest, the full gram was output into a corresponding text file containing results of length n. This code was copied and modified to be used for each gram length from n=1-5. A simplified diagram of the way the code works can be seen in the flowchart in Figure X, and the full code can be seen in Appendix X.
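The generation of input initialisms described above can be sketched as follows. This is a minimal illustration rather than the project's generator: it assumes a sliding window applied to each line separately, and the two single-line variants in the usage example (the line MRGOABABD read with an M or a W as its first letter) are purely hypothetical inputs.

```python
def windows(line, n):
    """Return every run of n consecutive letters in one line of the code."""
    return [line[i:i + n] for i in range(len(line) - n + 1)]


def initialisms(variants, n):
    """Collect the distinct n-letter initialisms across all code variants.

    `variants` is a list of variants, each given as a list of lines.
    Assumption: windows never run across a line break.
    """
    found = set()
    for variant in variants:
        for line in variant:
            found.update(windows(line, n))
    return sorted(found)


# Hypothetical variants of a single line, for illustration only.
print(initialisms([["MRGOABABD"], ["WRGOABABD"]], 5))
```

Using a set removes the windows shared by several variants, so each initialism only needs to be searched for once.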
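The matching step described above can be sketched in a few lines. This is not the code from the appendix: the tab-separated line layout, the sample rows and the function names are assumptions made for illustration, and a set is used here in place of the dictionary mentioned in the text.

```python
def initialism_of(gram):
    """Build an initialism from the first character of each word in a gram."""
    return "".join(word[0].upper() for word in gram.split())


def search(lines, initialisms_of_interest):
    """Yield every gram whose initialism is one of the initialisms of interest.

    Assumption: each line is tab-separated with the gram in the first field.
    """
    for line in lines:
        gram = line.rstrip("\n").split("\t")[0]
        if gram and initialism_of(gram) in initialisms_of_interest:
            yield gram


# Made-up rows for illustration only.
rows = [
    "absolute bargain\t1990\t12\t3",
    "big cat\t2001\t5\t2",
    "air conditioning\t1985\t40\t9",
]
print(list(search(rows, {"AB", "AC"})))
```

Because each line is tested with a single set lookup, the cost of the search grows with the size of the database rather than with the number of initialisms of interest.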

Running our code on the Google N-Gram database stored in the i2.xlarge instances in parallel for each group of n-gram inputs from n=1-5 took approximately two weeks. These raw results were then small enough to be stored and processed locally and so the Amazon EC2 service was no longer required.
The frequency of each n-gram was then obtained using Python code that counted the number of unique entries for each gram. This approach was implemented in order to reduce the time needed to obtain a frequency with which to rank the popularity of each gram. Unfortunately, it introduced a bug that caused the frequency of occurrence of grams in each year to be lost, and so the count of the number of years in which each gram occurred was used as the measure of frequency.
Processing Search Results
Once the raw results were obtained, it was found that some grams contained words followed by an underscore and the corresponding lexical category of the word (i.e. noun, verb, adverb, adjective, pronoun etc.). These tags needed to be removed, and so further Python code was written to remove everything but the words themselves from each line in the results.
After the lexical categories were removed, the output contained multiple identical results, each with its own frequency for the number of years in which it occurred. This came about because the database had previously treated these entries as unique results, but with the lexical categories removed, some results became identical. This was rectified by writing further Python code to process the cleaned results, combining identical entries and summing the number of years in which they occurred. This code was then duplicated and modified into two versions: the first output the results sorted in alphabetical order, and the second in order of the number of years in which each result occurred, from highest to lowest. The alphabetically sorted outputs were used as a means of comparison with the cleaned inputs, since these were also sorted alphabetically, in order to check that the code was functioning correctly. The frequency-sorted outputs were more useful since they could be used to generate a condensed list of the top 30 most popular initialisms that could be generated from the letters of all variations of the Somerton Man code, seen in the results section in Figure X.
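A tag-removal step of this kind might look like the sketch below. The tag shape (an underscore followed by capital letters at the end of a word, as in run_VERB) is an assumption about the raw data, and the function name is illustrative.

```python
import re

# Assumed tag shape: underscore plus capital letters at the end of a word.
TAG = re.compile(r"_[A-Z]+$")


def strip_tags(gram):
    """Remove a trailing lexical-category tag from each word in a gram."""
    return " ".join(TAG.sub("", word) for word in gram.split())


print(strip_tags("air_NOUN conditioning_NOUN"))  # air conditioning
```

Words without a tag pass through unchanged, so the same function can be applied to every line of the raw results.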
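The merging and the two sort orders can be sketched with a counter. The gram and year-count pairs below are invented for illustration, and the function names are not those of the project code.

```python
from collections import Counter


def merge_counts(rows):
    """Sum the year counts of rows that share the same cleaned gram."""
    totals = Counter()
    for gram, years in rows:
        totals[gram] += years
    return totals


def by_alphabet(totals):
    """Return (gram, count) pairs in alphabetical order."""
    return sorted(totals.items())


def by_frequency(totals):
    """Return (gram, count) pairs, highest count first, ties alphabetical."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts for illustration only.
rows = [("air conditioning", 30), ("alternating current", 45), ("air conditioning", 12)]
totals = merge_counts(rows)
print(by_alphabet(totals))
print(by_frequency(totals))
```

Taking the first 30 entries of the frequency-sorted list would give a condensed ranking of the kind described above.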
Combinations of Search Results
Finally, Python code was written to generate all possible combinations of the top two results for each 5-gram group in each variant of the code, where the top two results were ranked by the number of years in which they occurred. This was achieved using a non-overlapping sliding window 5 letters in length. The way this code worked can be more easily explained using the following example:
For simplicity, using 2-grams and the code ABAC: If the top 2 2-grams for AB are Absolute Bargain and American Beagle, and the top 2 2-grams for AC are Air Conditioning and Alternating Current, then all possible combinations for the code are: Absolute Bargain Air Conditioning, Absolute Bargain Alternating Current, American Beagle Air Conditioning and American Beagle Alternating Current.
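The example above corresponds to a Cartesian product over the candidate lists of each window. A minimal sketch using the standard library, with an illustrative function name:

```python
from itertools import product


def combine(top_results):
    """Join one candidate from each window, in code order, in every possible way."""
    return [" ".join(choice) for choice in product(*top_results)]


top_results = [
    ["Absolute Bargain", "American Beagle"],      # top 2 results for AB
    ["Air Conditioning", "Alternating Current"],  # top 2 results for AC
]
for phrase in combine(top_results):
    print(phrase)
```

With two candidates per window, a code split into k windows yields 2^k combined phrases, so the output grows quickly as the code gets longer.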
This code was implemented as an exercise to see if any interesting or useful results could come about using this simple method. Unfortunately, it produced nonsensical results because the search results of each 5-gram group were unrelated to those of its neighbours; a sample of these can be seen in the results section in Figure X. Due to the time constraints of the project, the code could not be developed any further, but the code and the results it provides can be used as a first step towards obtaining meaningful or useful combinations of n-grams from the results of the search engine developed in this project. The code could be improved by using a sliding window that advances by fewer than 5 letters for each search. For example, a step size of 1 letter would create the maximum possible overlap of 4 letters between consecutive input gram groups. More information on this and other suggested improvements can be found in the future work section.
Results
File:Combined Top 30.pdf File:Top 2 5-Gram Results Combined.pdf
Evaluation and Justification
Unfortunately, the database used in the search engine was smaller than the preferred Microsoft option[35]. The Microsoft option could have provided a larger number of search results for future analysis, but since the workflow required to access it did not suit the application proposed by this project, it was deemed unusable. Another limitation of the Google N-Gram database was that it only includes grams that have appeared more than 40 times across the corpus[36]. There is room for improvement in this task by fixing the bug in the search engine code so that the frequency of occurrence of grams in each year is recorded and the total frequency of each gram can be calculated. As mentioned, another improvement could be to optimise the gram combination code in order to obtain more useful or interesting results. Overall, I believe Task 2 was successful in achieving its aim to create a search engine that found regular expressions that could be linked to the Somerton Man code by using the initial letters of the code to find commonly used English phrases. In addition, the search time relative to the number of results produced has clearly been greatly improved when compared to the web crawler developed in previous years.
Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad
Aim
The aim of Task 3 was to use the Rubaiyat of Omar Khayyam as a one-time pad in an attempt to decode the Somerton Man code and find any meaningful messages after decryption. This task investigated the possibility that the letters had been substituted for others using a one-time pad technique. The Somerton Man code was to act as the ciphertext; the numerical value of letter positions within words, counted from the first letter of each word, was to act as the key; and the decoded messages were to act as the plaintext. The first letter of each word in the one-time pad had numerical value 0, the second letter had numerical value 1, and so on.
The difference in aim between our group and the 2013 honours group is that the key used to decode the message is based on letter position within each word, rather than on numbers assigned to each letter of the alphabet to perform alphabetic shifts, as implemented in the 'Decoding Toolkit - One Time Pad' software [37].
Method
Before commencing Task 3, we had to choose a programming language in which to implement the code. My first attempt used Java, because this language had been used by all previous groups for their work. I tried to reuse the base code from previous groups and extend it to satisfy our Task 3 requirements; however, the aim of the 2013 group was significantly different to ours, so their code had little in common with what we needed. In addition, I was not proficient in Java, having never learnt it before. My second attempt used C and C++. I completed some functions for the code, but encountered a problem in loading text files. Because we used the Rubaiyat as a one-time pad, we needed to convert the Rubaiyat into a text file so that the code could be run to search for words in it. I found the process of loading data in C++ too complex and decided to abandon this approach. My final attempt used Matlab, in which loading text files is relatively simple. In addition, I had studied Matlab in previous university courses and found it easier to use. I therefore chose Matlab as the implementation language for Task 3.
Direct substitution of the letters using the pad was to be used. For instance, if we chose the line MRGOABABD from the Somerton Man code as the encoded message, the first letter of the code, M, was to be used first. The program was then to search the Rubaiyat from beginning to end until it found the first word that begins with M. The M was then to be decoded to the second letter in that word. This process was then to be repeated for the second letter in the code, R: the program was to find the first word in the Rubaiyat that starts with R and decode the R to the second letter in that word. This method was then to be repeated with successive letter positions until there were too few words long enough to decode all of the letters in the code. After this point, one final decode was to be attempted in which, rather than a fixed letter position, the last letter of each word was to be used. The output of the software was to be possible words or phrases made up of letter substitutions in place of the letters in the Somerton Man code.
Design

Figure 3.1 above shows my design and thinking process for Task 3. The code requires two inputs. One is the Somerton Man code, which is the encoded message; the other is the letter position, which acts as the key (the letter position is an integer starting from 1, following Matlab's indexing convention). These inputs supply a letter from the Somerton Man code and a number n from the letter position. A function called matching then receives the letter from the Somerton Man code and searches the Rubaiyat from beginning to end for the first word with the same initial letter. At the same time, a function called finding receives the number n. The function matching then sends the matched word from the Rubaiyat to finding, which selects the correct letter in the matched word based on the letter position n. For example, if n equals 2, finding selects the second letter of the matched word. Finally, finding outputs the chosen letter to a function called recordMessage, which records all the letters from finding in order. At this point the program has finished decoding one letter of the Somerton Man code, and it repeats the process for the remaining letters. After decoding all letters, the program outputs the original message.
Example of One-Time Pad


Figure 3.2 is an excerpt from the Rubaiyat; the word AWAKE is the first word in the Rubaiyat.
- Assume we have the encoded message AFM and use the second letter position to decode it.
- The program will search the Rubaiyat from beginning to end, until it finds the first word that begins with A. From figure 3.2, the word should be AWAKE. This will then be decoded to the second letter in the same word which is W.
- The program will repeat step 2 for the remaining letters in the code. The second and third words should be FOR and MORNING based on the Rubaiyat.
- The output should be WOO.
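The steps above can be sketched in Python (the project itself used Matlab). The sketch rests on several assumptions: the pad is only the opening quatrain of FitzGerald's translation, upper-cased with punctuation removed, standing in for the whole poem; the search for each letter resumes just after the previously matched word; and words too short for the chosen position are skipped. The function name is illustrative.

```python
# Stand-in pad (assumption): opening quatrain only, upper-cased, no punctuation.
PAD = (
    "AWAKE FOR MORNING IN THE BOWL OF NIGHT HAS FLUNG THE STONE THAT PUTS "
    "THE STARS TO FLIGHT AND LO THE HUNTER OF THE EAST HAS CAUGHT THE "
    "SULTANS TURRET IN A NOOSE OF LIGHT"
).split()


def decode(code, position, pad=PAD):
    """Replace each code letter with the letter at `position` (1-based) of the
    next pad word that starts with that code letter.

    Assumption: the search resumes after the previously matched word, and
    words shorter than `position` are skipped.
    """
    message = []
    start = 0
    for letter in code:
        for index in range(start, len(pad)):
            word = pad[index]
            if word[0] == letter and len(word) >= position:
                message.append(word[position - 1])
                start = index + 1
                break
        else:
            raise ValueError(f"no usable pad word for {letter!r}")
    return "".join(message)


print(decode("AFM", 2))  # WOO
```

Resuming the search after each match is one reading of the method; it reproduces the AFM example here because AWAKE, FOR and MORNING are the first three words of the pad.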
Figure 3.3 is the Matlab output of the function called multi, which shows all possible decoded messages according to the letter position n. For example, if n equals 2, the output displays the original encoded message AFM first, followed by the decoded message WOO.
Results


Figure 3.4 shows the result of using the second letter position as the key to decode each line of the Somerton Man code. The Rubaiyat acts as the one-time pad without any formatting (the whole Rubaiyat was placed in a text file without removing any symbols or punctuation).
Figure 3.5 also shows the result of using the second letter position as the key to decode each line of the Somerton Man code, but with the Rubaiyat formatted before acting as the one-time pad (the whole Rubaiyat was placed in a text file and all symbols, punctuation and non-letter parts were removed).


Figure 3.6 shows the result of using the third letter position as the key to decode each line of the Somerton Man code. The Rubaiyat acts as the one-time pad without any formatting (the whole Rubaiyat was placed in a text file without removing any symbols or punctuation).
Figure 3.7 also shows the result of using the third letter position as the key to decode each line of the Somerton Man code, but with the Rubaiyat formatted before acting as the one-time pad (the whole Rubaiyat was placed in a text file and all symbols, punctuation and non-letter parts were removed).


Figure 3.8 shows the result of using the last letter of each word as the key to decode each line of the Somerton Man code. The Rubaiyat acts as the one-time pad without any formatting (the whole Rubaiyat was placed in a text file without removing any symbols or punctuation).
Figure 3.9 also shows the result of using the last letter of each word as the key to decode each line of the Somerton Man code, but with the Rubaiyat formatted before acting as the one-time pad (the whole Rubaiyat was placed in a text file and all symbols, punctuation and non-letter parts were removed).
Evaluation and Justification
Verification


A verification method was used to test that the Matlab code works effectively and correctly. Assume we have the original message GUN and use the third letter position to encode the message based on the Rubaiyat. Figure 3.10 shows part of the formatted Rubaiyat. The encoding words should be NIGHT, FLUNG and HUNTER, which are emphasised in the figure. We then take the first letter of each word, so the code should be NFH.
If we use NFH as the code and choose the third letter position as the key, the one-time pad method should give GUN as the output. Figure 3.11 displays the output of the Matlab code for the input NFH and the third letter position.
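The encoding half of this check can be sketched in Python. As before, this is not the Matlab code: the pad is only the opening quatrain of FitzGerald's translation (upper-cased, punctuation removed) as a stand-in for the formatted Rubaiyat, the search is assumed to resume after each matched word, and the function name is hypothetical.

```python
# Stand-in pad (assumption): opening quatrain only, upper-cased, no punctuation.
PAD = (
    "AWAKE FOR MORNING IN THE BOWL OF NIGHT HAS FLUNG THE STONE THAT PUTS "
    "THE STARS TO FLIGHT AND LO THE HUNTER OF THE EAST HAS CAUGHT THE "
    "SULTANS TURRET IN A NOOSE OF LIGHT"
).split()


def encode(message, position, pad=PAD):
    """For each message letter, find the next pad word whose letter at
    `position` (1-based) equals it, and emit that word's first letter.

    Assumption: the search resumes after the previously matched word.
    """
    code = []
    start = 0
    for letter in message:
        for index in range(start, len(pad)):
            word = pad[index]
            if len(word) >= position and word[position - 1] == letter:
                code.append(word[0])
                start = index + 1
                break
        else:
            raise ValueError(f"no usable pad word for {letter!r}")
    return "".join(code)


print(encode("GUN", 3))  # NFH
```

Under these assumptions the words selected are NIGHT, FLUNG and HUNTER, matching the words named in the verification.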
Task 4: Statistical Frequency of Letters Reanalysis
Aim
Towards the end of the project, a decision was made that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus our efforts on reanalysing the letter frequencies of various European languages. This was decided because our initial analysis performed in Task 1 produced inconsistent and varied results. The cause was the limited sample size of the Universal Declaration of Human Rights as a base text, which meant that particular letters appeared 0 times in particular languages. These letter frequencies therefore had to be altered by choosing arbitrary numbers in order to perform our chi-squared testing, which reduced the accuracy and validity of the test's results.
In addition, the limited sample size caused the chi-squared values for all languages, including English, to be reasonably large. This caused the resulting calculated p-values to be extremely small numbers, or in most cases 0. Because of this, these chi-squared values could not be used to calculate p-values for our initially proposed hypothesis testing from Task 1.
This caused us to question the validity of the Universal Declaration of Human Rights as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.
Method
It was decided that for the reanalysis, we would use Project Gutenberg to increase the sample size for as many of the 21 most popular European languages used in Task 1 as possible by collecting novels from the time before the Somerton Man's death. This was chosen to be used as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages. Novels in each language were concatenated and their letter frequencies were determined, until each letter appeared at least once in each language.
The 2013 group's decoding toolkit and initial letter frequency count code could be reused for this task. The decoding toolkit's 'format texts' function was used to remove all non-letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts in order to obtain the data we needed to perform our statistical analysis. All statistical calculations and graphs were generated using Microsoft Excel.
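For illustration, the formatting and counting steps can be approximated in Python. This is not the 2013 toolkit; it is a minimal sketch that assumes every character other than an unaccented letter or whitespace is simply deleted before the words are split, and the function name is hypothetical.

```python
import re
from collections import Counter
from string import ascii_uppercase


def initial_letter_counts(text):
    """Count how often each letter A-Z starts a word.

    Assumption: every character other than an unaccented letter or
    whitespace is deleted before the text is split into words.
    """
    cleaned = re.sub(r"[^A-Za-z\s]", "", text)
    counts = Counter(word[0].upper() for word in cleaned.split())
    return {letter: counts.get(letter, 0) for letter in ascii_uppercase}


counts = initial_letter_counts("Awake! for Morning in the Bowl of Night")
print(counts["A"], counts["O"], counts["Z"])  # 1 1 0
```

Letters that never start a word are reported with a count of 0, which is the situation that forced the arbitrary adjustments in Task 1 and that the larger corpus was intended to avoid.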
Initial Validation
First, the test initially run in Task 1 as a baseline check on the statistics obtained from the English translation of the Universal Declaration of Human Rights (with letters of frequency 0 modified to 0.0001) was also run on the new statistics gathered from the novel The Life of the Spider by J. Henri Fabre[38], found on Project Gutenberg and used as our English base text, as a means of comparison between the base texts. The Somerton Man code, 44-letter samples from a Thomas Hardy novel acting as an English control [39], and a French sample, a German sample and a Zapoteco sample from the Universal Declaration of Human Rights were all compared to both sets of data, and the results can be seen in Figure X.
European Language Comparison
Next, once it was found that the English text from Project Gutenberg provided lower chi-squared values than the Universal Declaration for all samples in the initial test, the chi-squared testing on European languages could commence. This involved the same procedure as was used in Task 1, but of the top 21 most popular European languages from Task 1, only 12 could be used in the reanalysis due to insufficient usability or availability of texts on Project Gutenberg. The languages used in the analysis can be seen in the graph in Figure X. The omitted languages were Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukrainian, Belarusian and Kazakh. The texts used for this analysis can be seen in Appendix X.
Significance Level Calculation
The chi-squared values and p-values calculated showed that English was the closest language to the Somerton Man code. From this, hypothesis testing could be performed based on the English results. Upon consultation with Prof. Abbott and Dr. Berryman, rather than choosing an arbitrary significance level such as the typically used p = 0.05, it was decided that a significance level could be calculated from the p-value found using real English texts. This would serve as what we deemed an acceptable significance level at which we could confidently say that the most likely language of origin of the Somerton Man code is English. This was achieved by collecting 20 excerpts of 44 letters each from English novels on Project Gutenberg (see Appendix X), performing the chi-squared testing for these samples against the English Project Gutenberg novel used as our English base text, taking an average of the chi-squared values, and from this calculating a p-value. This result was then compared to the results obtained from the English portion of the chi-squared testing performed on the variants of the code, and was plotted as seen in Figure X.
The same testing was then also run on the English samples and code variants against the original English translation of the Universal Declaration of Human Rights as a means of comparison between the two base texts. Significance levels could not be calculated using the Universal Declaration of Human Rights since the chi-squared values were too large, causing the calculated p-values to be too small (approaching 0). The results can be seen in Figure X.
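The calculation described above was carried out in Microsoft Excel; the Python sketch below only illustrates the same idea using the standard library. The number of degrees of freedom is not stated in the text, so it is left as a parameter (25, one fewer than the 26 letters, would be one assumption), and the function names are hypothetical. The p-value uses the standard closed-form series for the upper tail of a chi-squared distribution with an integer number of degrees of freedom.

```python
import math


def chi_squared(observed, expected_proportions):
    """Pearson's statistic for observed counts against reference proportions."""
    total = sum(observed)
    return sum((o - total * p) ** 2 / (total * p)
               for o, p in zip(observed, expected_proportions))


def p_value(statistic, dof):
    """Upper-tail probability of a chi-squared distribution with integer `dof`."""
    half = statistic / 2.0
    if dof % 2 == 0:
        term, series = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            series += term
        return math.exp(-half) * series
    root = math.sqrt(statistic)
    term, series = root, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= statistic / (2 * r - 1)
        series += term
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    return math.erfc(root / math.sqrt(2)) + 2 * density * series


def significance_level(benchmark_statistics, dof):
    """p-value of the mean benchmark statistic; `dof` is an assumption."""
    mean = sum(benchmark_statistics) / len(benchmark_statistics)
    return p_value(mean, dof)
```

A code variant whose own p-value lies above the level returned by significance_level would then be judged at least as close to the English base text as the average benchmark sample.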
It was unnecessary to extend the analysis to collect benchmarks and perform the hypothesis testing on the other European languages against the code since chi-squared values produced were too large, and so the p-values calculated were unusable.
Increased Sample Size Testing
It was then decided that, in order to increase our confidence in the calculated significance level, we would increase the sample size of our English base text from Project Gutenberg beyond the point at which each letter appeared at least once, by concatenating 20 English novels from the time before the Somerton Man's death to be used as our base English corpus (see Appendix X). It was first checked whether this would have an effect on the chi-squared values against the code variants when compared to other languages. We could then also increase our English benchmark sample size by taking 100 samples of 44 letters each from this corpus using code written in Python, and performing the same testing as performed on our smaller English base text. The results from this testing can be seen in Figure X.
Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation and European Language Comparison, and so these graphs have been excluded. A closer look at the changes to the chi-squared values and p-values for the Somerton Man code variants caused by this increased sample size can be seen by comparing Figures X and X. Increasing the number of 44-letter English samples from 20 to 100, however, did have an effect on the chi-squared value and p-value calculated to be used as our significance level, the results of which can be seen in Figure X. This increase in the number of samples had very little effect on the graph of the Universal Declaration of Human Rights significance level calculation, and so this has also been omitted.
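The sampling code itself is not reproduced in this section, so the following is only a sketch of one way such samples could be drawn. It assumes that each sample is a run of 44 consecutive word-initial letters, that start positions are chosen at random, and that samples may overlap; the seed and function name are illustrative.

```python
import random


def draw_samples(initials, sample_length=44, sample_count=100, seed=0):
    """Draw random runs of consecutive initial letters from a corpus.

    `initials` is the string of word-initial letters of the corpus, in order.
    Assumption: samples are contiguous and are allowed to overlap.
    """
    if len(initials) < sample_length:
        raise ValueError("corpus is shorter than one sample")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    last_start = len(initials) - sample_length
    starts = [rng.randint(0, last_start) for _ in range(sample_count)]
    return [initials[s:s + sample_length] for s in starts]


samples = draw_samples("ABCDEFGHIJ" * 50)
print(len(samples), len(samples[0]))  # 100 44
```

Fixing the seed makes the benchmark reproducible, which matters when the resulting average chi-squared value is used as a significance level.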
Results





Evaluation and Justification
The results of the initial validation, seen in Figure X, show that using the Project Gutenberg novel as an English reference text provided lower chi-squared values for all test cases, and thus it was deemed to be a more suitable base text than the modified version of the Universal Declaration of Human Rights.
The results from the European Language Comparison in Figure X show that English had the lowest chi-squared value of all languages in our Project Gutenberg corpus, and thus was the closest language to the Somerton Man code.
The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure X) were found to be much higher than those calculated using the English Project Gutenberg novel (Figure X). The large difference in results, and the fact that real English samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus its validity as a base text for chi-squared analysis is questionable. In addition, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, hypothesis testing could not be completed because the large chi-squared values produced very small p-values (approaching 0) for both the code variants and the significance level.
The results from the initial significance level calculation in Figure X show that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher p-values than the calculated significance level. From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an initialism.
Upon increasing the sample size, the significance level calculation in Figure X shows that only 1 of the 3 code variants now achieved a p-value higher than the calculated significance level. This caused the average result to fall below the significance level. Our statement therefore had to be modified to say that overall the null hypothesis was rejected and the alternative hypothesis accepted, meaning that we cannot confidently say that the language of origin of the Somerton Man code is English for all variants. Despite this, the null hypothesis could be accepted, and English is the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's and 2 W's and is an initialism.
Regardless of the choice to accept or reject the null hypothesis, the similarities in the chi-squared values and p-values calculated for real 44-letter English samples and for all variants of the Somerton Man code using the Project Gutenberg base text reinforce the notion that the language of origin of the code is indeed English.
Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing hypothesis testing, the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5[40]. This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased. Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.
Project Management - Planning and Feasibility
Work Breakdown/Deliverables
The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows and are as follows: the Research Proposal and Progress Report depends on the Draft Research Proposal, which in turn depends on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, YouTube video and dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up onwards.
Timeline
The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure X.

The final Gantt Chart after all revisions and updates can be seen in Figure X.

Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis. Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results took longer than expected, and so the second part of Task 2 was extended. Task 3 was also extended so that Jikai was able to complete this task. Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early. As a result, Task 4 was completed in parallel with Tasks 2 and 3 towards the end of the project timeline. The dump of final work and project YouTube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors. Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.
Task Allocation
The workload for the tasks within this project was allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure X. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster, while Jikai Yang undertook the use of the Rubaiyat of Omar Khayyam as a One-Time Pad and the project YouTube video. The allocations did not require changing throughout the project life cycle, apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.

Management Strategy
A number of management strategies were adopted for use throughout the project. One was frequent face-to-face contact through regular meetings every 2-3 weeks. Another was regular communication between group members via text message and email. Collaboration was another useful strategy: if one member required assistance on a particular task, the other was able to step in and help. This was achieved through the use of flexible task allocation. The group was able to make use of collaborative software including Google Drive for working together on project documents, and a GitHub repository for working together on code for software development. The project Wiki page was updated in real time, including the weekly progress section, to monitor and review work completed by each member every week, as well as to plan tasks for the upcoming week. Finally, a Gantt chart was used as a management strategy to incorporate clearly defined tasks and goals and to establish a critical path through the use of task dependencies.
Budget
The project budget for this honours group was set at 500 dollars at the commencement of the project. The budget was initially expected to depend on the n-gram database chosen for the search engine in Task 2. As discussed in the Method section of Task 2: N-Gram Search, a variety of options were considered, and the two largest databases were found to be Microsoft Web N-Gram Services [41] and Google N-Gram [42].
The Microsoft alternative was found to be free to use for academic purposes after applying for a user token. It is hosted for free on Microsoft’s web server, so there was no need to purchase storage for the database [43].
The Google alternative was available for free as a raw dataset, or at a cost of 150 dollars for a student license when purchased from the University of Pennsylvania Linguistic Data Consortium [44]. Unlike the Microsoft alternative, had the Google N-Gram option been chosen, a portion of the budget would have had to be dedicated to storing the database. It was initially proposed to store the database on a hard drive at a cost of approximately 100 dollars.
The proposed budget can be seen in the tables highlighting the key costs of each option in Figure X.

For reasons discussed in the Method section of Task 2: N-Gram Search, the Google N-Gram database was chosen, and a decision then had to be made whether to purchase the University of Pennsylvania Linguistic Data Consortium version or to obtain the data for free directly from Google. The free database provided by Google was selected, as spending 150 dollars on the processed data from the Linguistic Data Consortium was not deemed justifiable when the raw dataset could be cleaned up by writing software.
The initial budget was based on the assumption that the Google N-Gram database could be stored locally. Although this was feasible in its compressed form, the local computing power available would have been insufficient to run the search engine code through the database within the time frame of the project. As discussed in the Method section of Task 2: N-Gram Search, a cloud-based computing service called ‘Amazon Elastic Compute Cloud’ (Amazon EC2) was used to store and process the database. The free tier was considered but did not provide the specifications required for our task, so instances on Amazon EC2 were hired at a rate of 0.853 dollars per hour [45]. After storing the initial full database, running our search code and downloading the results generated by the code, the total cost of using the service came to 576 dollars, which caused the project to exceed the initially proposed budget of 500 dollars. The additional expenditure arose because, despite our efforts, it was difficult to predict the precise time it would take to upload, store and process the database on the cloud service. The initially proposed budget did not include the Amazon server, since the need for it could not reasonably have been foreseen at the start of the project: it was initially thought that the Microsoft N-Gram Service would be suitable for the needs of the project and, if it was not, that the Google N-Gram alternative could be stored locally.
The final revised budget including total project expenditure can be seen in Figure X.

In conclusion, although the project went over budget, the additional funds were kindly provided by the School of Electrical and Electronic Engineering after we submitted an application for funding that included justification of our purchases. The project work benefited from the purchase of the Amazon service, since we were able to complete a search of specific n-gram combinations of the code on the full Google N-Gram database. This provided us with results to present as part of our thesis and allowed us to meet the requirements set out in the aim of Task 2.
Risk Analysis
A risk assessment was undertaken for this project, covering risk identification, analysis, evaluation and treatment strategies, using the Adelaide University risk matrix procedure [34][46]. This can be seen in Figure X. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred because the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1. It was rectified by making use of the flexibility in our schedule and replacing the initially proposed Task 4: Mass Spectrometer Data Analysis with a new Task 4: Statistical Frequency of Letters Reanalysis. Another risk that occurred during the project was illness, which was dealt with relatively easily by working from home for a short period of time. Minor misunderstandings of project tasks occurred on a few occasions, but these were clarified by scheduling meetings with group members and supervisors. Bugs in code were reduced to the best of our ability through thorough testing and debugging. Finally, the inability to decipher the Somerton Man code was a risk estimated to have an almost certain likelihood. Although this risk could not be avoided, its effects were considered negligible, and the group was still able to complete all work to the best of its ability and to further the research into the decryption of the code, not only for future honours groups but also for the wider community, by publishing our results on our Wiki.

Conclusions
The work undertaken throughout the project has fulfilled the key aims and objectives of the project, including statistical analysis of the likely language of origin of the Somerton Man code, the design and implementation of software to test the 'Rubaiyat of Omar Khayyam' as a one-time pad in conjunction with a new key technique, and the development of a search engine to discover possible n-grams contained within the code. The group was successful in completing all tasks outlined in the proposal, with the exception of the proposed extension task to analyse the mass spectrometer data of the Somerton Man's hair.
Through this, the group was able to critically review and further the statistical analysis of the likely language of origin of the Somerton Man code conducted by previous groups. The 2015 group has improved upon the search for n-grams conducted by previous groups by increasing the number of results and the search speed. In addition, the results collected through implementing the search engine are valuable for future groups to analyse for useful grams that could be linked to the Somerton Man code. The group has also furthered the exploration by previous groups into the possibility that the 'Rubaiyat of Omar Khayyam' was used as a one-time pad to encrypt the Somerton Man code by testing a new key technique.
The skills developed through undertaking this project include text processing and programming in a variety of languages including Java, MATLAB and Python. The group has also thoroughly researched and learnt how to implement and evaluate statistical techniques including chi-squared testing, p-value calculation and hypothesis testing, and has developed skills in using Microsoft Excel to perform statistical analyses.
The main conclusions drawn from the project work are as follows. The Somerton Man code was not created using the 'Rubaiyat of Omar Khayyam' as a one-time pad with the proposed method of using letter position within words as the key. Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search. The Universal Declaration of Human Rights has too small a sample of words in each language to accurately represent the initial letter frequency of that language for use in chi-squared testing. Finally, although the results from the hypothesis testing were somewhat inconclusive, the results of all of the chi-squared testing have led to the conclusion that we can now say more confidently than ever that English was the most likely language in which the Somerton Man code was written, assuming it is an initialism.
Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.
Future Work
The search engine in Task 2 contained a bug: the code collected the number of years in which each gram occurred, rather than the actual frequency of occurrence in each year. This creates an opportunity to reanalyse the database, in particular the 5-gram data, using the raw count, which would require only a minor modification to the search engine code.
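The difference between the two measures can be illustrated with a minimal Python sketch. It is not the group's search engine code. It assumes each row of the dataset is tab-separated as ngram, year, match_count, volume_count, with one row per gram per year, and the function name `summarise` and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarise(rows):
    """Return {ngram: (years_seen, total_matches)} for tab-separated rows."""
    years_seen = defaultdict(int)
    total_matches = defaultdict(int)
    for row in rows:
        # Assumed row layout: ngram, year, match_count, volume_count.
        ngram, _year, match_count, _volume_count = row.rstrip("\n").split("\t")
        # One row per year, so counting rows counts years (the buggy measure).
        years_seen[ngram] += 1
        # Summing match_count gives the raw frequency of occurrence.
        total_matches[ngram] += int(match_count)
    return {gram: (years_seen[gram], total_matches[gram]) for gram in years_seen}


# Hypothetical sample rows, invented purely for illustration.
sample = [
    "my time is up now\t1948\t3\t2\n",
    "my time is up now\t1949\t5\t4\n",
    "in a very good cause\t1948\t40\t31\n",
]
summary = summarise(sample)
```

In this invented sample the first gram appears in more years but the second has the higher raw count, so ranking by years and ranking by raw count would order the two grams differently.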
There is also the potential to run a limited search on the search engine results using more sophisticated code than that used by the 2015 group. This could generate more useful gram combinations and help find commonly used English expressions or phrases that could be linked to the Somerton Man code.
Future students could extend the statistical analysis to perform hypothesis testing on all European languages in Project Gutenberg. However, an alternative to the chi-squared testing performed here would have to be used, since the chi-squared values for the code against all other languages were too high to produce usable p-values.
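For readers unfamiliar with the statistic, the following Python sketch shows a chi-squared goodness-of-fit calculation of the general kind described. It is an illustration under stated assumptions rather than the group's Excel workflow: the function name `chi_squared` is hypothetical, the three-letter alphabet and its frequencies are invented, and every expected frequency is assumed to be greater than zero.

```python
from collections import Counter


def chi_squared(observed_text, expected_freq):
    """Chi-squared statistic of letter counts against expected proportions.

    expected_freq maps each letter to its expected proportion (all > 0,
    summing to 1). Characters outside the mapping are ignored.
    """
    letters = [c for c in observed_text.upper() if c in expected_freq]
    n = len(letters)
    counts = Counter(letters)
    statistic = 0.0
    for letter, proportion in expected_freq.items():
        expected = n * proportion  # expected count for this letter
        statistic += (counts.get(letter, 0) - expected) ** 2 / expected
    return statistic


# Toy three-letter "language", invented purely for illustration.
toy_freq = {"A": 0.5, "B": 0.25, "C": 0.25}
perfect_fit = chi_squared("AABC", toy_freq)
poor_fit = chi_squared("AAAA", toy_freq)
```

A text that matches the expected proportions exactly gives a statistic of zero, and the statistic grows as the observed counts move away from the expected ones. Converting the statistic to a p-value uses the chi-squared distribution with one fewer degrees of freedom than the number of letter categories, so a very large statistic gives a p-value that is effectively zero.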
Another option would be to focus on English as the most likely language and statistically analyse the code against genres, as conducted by the 2013 group [47], but using the chi-squared method rather than the squared difference and standard deviation methods they adopted, in order to consolidate or refute the conclusions drawn from their results.
Future groups could also extend the 2013 group's analysis of the mass spectrometer data collected using laser ablation of the Somerton Man's hair [48]. The hair was collected from the plaster bust of the Somerton Man made after his autopsy. The concentration of isotopes found in the hair can be used to infer the environment in which the Somerton Man lived in the lead-up to his death. Since the 2013 group's analysis, additional data have been taken from a separate hair, along with the concentration of isotopes in the plaster. A future group could use these to cross-check the data between separate hairs, and to cross-check the isotopes in the plaster against those in the hair to see whether isotopes from the plaster may have diffused into the hair.
Appendices
- Appendix X: Full Search Engine Code
- Appendix X: Project Gutenberg European Language Comparison Text References
File:Gutenberg European Language Comparison Text References.pdf
- Appendix X: Project Gutenberg 20 English 44 Letter Text File References
File:Gutenberg 20 English 44 Letter Text File References.pdf
- Appendix X: Project Gutenberg English Corpus 20 Novels References
File:Gutenberg English Corpus 20 Novels References.pdf
References
- The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [Online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
- The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [Online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
- The Advertiser. (2005, March 9). Death riddle of a man with no name [Online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser_mar2005.pdf
- The Advertiser. (1949, June 9). Cryptic Note on Body [Online]. Available: http://trove.nla.gov.au/ndp/del/article/36371152
- Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [Online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
- Cleland. (1949). Coroner's Inquest [Online]. Available: http://trove.nla.gov.au/ndp/del/article/130195091
- A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F
- Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [Online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
- YouTube ABC. Somerton Beach Mystery 1978 [Online]. Available: https://www.youtube.com/watch?v=ieczsZRQnu8
- A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [Online]. Available:
- K. Ramirez and L-V. Michael. (2010). Final Report 2010 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
- S. Maxwell and P. Johnson. (2011). Final Report 2011 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
- A. Duffy and T. Stratfold. (2012). Final Report 2012 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
- L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
- A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [Online]. Available:
- K. Ramirez and L-V. Michael. (2010). Final Report 2010 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
- S. Maxwell and P. Johnson. (2011). Final Report 2011 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
- A. Duffy and T. Stratfold. (2012). Final Report 2012 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
- L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
- The Advertiser. (1949, June 10). Tamam Shud [Online]. Available: http://trove.nla.gov.au/ndp/del/article/36371416
- N. Gencarelli and J. K. Yang. (2015, March 15). Cipher Cracking 2015 [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cipher_Cracking_2015
- S. L. Center. (2015). What is Scots? [Online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F
- No Author. (2014). Using Chi Squared to Crack Codes [Online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/
- T. Hardy. (2005). Tess of the d’Urbervilles (11th Edition) [Online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt
- No Author. (2015). List of languages by number of native speakers [Online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1
- Microsoft Research. (2015). Microsoft Web N-Gram Services [Online]. Available: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
- Google Research Blog. (2006, August 3). All Our N-Gram are Belong to You [Online]. Available: http://googleresearch.blogspot.com.au/2006/08/all-our-n-gram-are-belong-to-you.html
- C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [Online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
- No Author. Microsoft Web N-Gram Service Quick Start [Online]. Available: http://weblm.research.microsoft.com/info/QuickStart.htm
- No Author. Visual Studio Languages [Online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx
- Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
- Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
- Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
- Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
- C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [Online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
- Google Books. (2012, July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
- L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
- J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887
- T. Hardy. (2005). Tess of the d’Urbervilles (11th Edition) [Online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt
- Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP
- C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [Online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
- Google Books. (2012, July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
- C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [Online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
- T. Brants and A. Franz. (2006). Web 1T 5-gram Version 1 [Online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
- Amazon Web Services. (2015). Amazon EC2 Pricing [Online]. Available: https://aws.amazon.com/ec2/pricing/
- No Author. Risk Management Handbook [Online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf
- L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
- L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [Online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
Glossary and Symbols
- ASIO: Australian Security Intelligence Organisation
- ASIS: Australian Secret Intelligence Service
- ASD: Australian Signals Directorate
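- P-value theorem: The p-value is the probability, calculated under the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one observed. It gives researchers a measure of the strength of evidence against the null hypothesis [1].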
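- Chi-Squared Test: A statistical test that compares observed frequencies with the frequencies expected under a hypothesis. In this project it was used to compare the letter frequencies of the Somerton Man code with those of candidate languages.
- Universal Declaration of Human Rights: A document adopted by the United Nations General Assembly in 1948 and translated into many languages. In this project it was used as a sample text for the initial letter frequencies of each language.
- Project Gutenberg: An online library of free eBooks. In this project it was used as a source of texts in European languages.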
- N-gram model: An n-gram is a contiguous sequence of n items taken from a given sequence of phonemes, syllables, letters, words or base pairs [2].
- One-time pad: An encryption technique that cannot be cracked provided the key is truly random, at least as long as the message, kept secret and never reused [3].
- Initialism: A group of letters formed using the initial letters of a group of words or a phrase [4].
- Plaintext: The original message, which is what one hopes to recover by deciphering the ciphertext [5].
- Ciphertext: The encoded format of a message [6].
- Key: What is needed to convert the ciphertext into the plaintext using the one-time pad [7].
- B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol. 468(3), pp. 885-892, 2010; G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol. 87(5), pp. 1337-1343, 2009; No Author. p-value [Online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1
- A. Z. Broder et al., “Syntactic clustering of the web,” Computer Networks and ISDN Systems, Vol. 29(8), pp. 1157–1166; No Author. Video Lectures [Online]. Available: https://class.coursera.org/nlp/lecture/17
- S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [Online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract
- No Author. Initialism [Online]. Available: http://dictionary.reference.com/browse/initialism
- No Author. (2011). Topic 1: Cryptography [Online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
- No Author. (2011). Topic 1: Cryptography [Online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
- No Author. (2011). Topic 1: Cryptography [Online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf