Final Report/Thesis 2015: Difference between revisions

From Derek
Jump to navigation Jump to search
A1614395 (talk | contribs)
A1628026 (talk | contribs)
Line 103: Line 103:
====Aim====
====Aim====
A Critical review of the statistical frequency analysis of the letters from the 2013 group was to be conducted.  It was then proposed that the 2015 group was to repeat the statistical tests done by the 2013 group. Like the 2013 group, the Universal Declaration of Human Rights was to be used as the base text based on advice from Professor Abbott, but the validity of this text being used as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be used on 44 letters out of pieces of text from the most likely languages. The 2013 group’s analysis was then to be extended by the 2015 group by calculating p-values and implementing hypothesis testing. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run p-value tests on benchmark pieces of text from the most likely languages and see if the p-values suggested that the letters are indeed from those languages. The use of benchmark pieces of text were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the Universal Declaration of Human Rights as a base text. Once the most likely language was determined, the group was to process an additional number of benchmarks of that language and obtain a mean p-value. Next, a hypothesis test could be performed based on the mean p-value obtained from the benchmarks, when compared to, the p-value of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this hypothesis testing method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.
A Critical review of the statistical frequency analysis of the letters from the 2013 group was to be conducted.  It was then proposed that the 2015 group was to repeat the statistical tests done by the 2013 group. Like the 2013 group, the Universal Declaration of Human Rights was to be used as the base text based on advice from Professor Abbott, but the validity of this text being used as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be used on 44 letters out of pieces of text from the most likely languages. The 2013 group’s analysis was then to be extended by the 2015 group by calculating p-values and implementing hypothesis testing. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run p-value tests on benchmark pieces of text from the most likely languages and see if the p-values suggested that the letters are indeed from those languages. The use of benchmark pieces of text were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the Universal Declaration of Human Rights as a base text. Once the most likely language was determined, the group was to process an additional number of benchmarks of that language and obtain a mean p-value. Next, a hypothesis test could be performed based on the mean p-value obtained from the benchmarks, when compared to, the p-value of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this hypothesis testing method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.


====Method====
====Method====
Line 113: Line 112:


=====Assumptions=====
=====Assumptions=====
Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letter.  Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.  
Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letter.  Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. This assumption entailed the use of versions of the code including 6Ms, 4Ms and 2Ws, and the code as it appeared in the original police report from 1949<ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.  These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.  


Because the letter M and W in the Somerton code are very similar. Our group make 3 assumptions for this case. First assumption is All 6 M or W letters in the code are M. Second assumption is All 6 M OR W letters in the code are 4 Ms and 2 Ws. Third assumption is All 6 M OR W letters in the code are 5 Ms and 1 W. The last assumption we make as it appears in the original police report from 1949.
=====Base Text=====
=====Base Text=====
The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.
The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.
Line 122: Line 120:


=====Statistical Methods=====
=====Statistical Methods=====
We prepare to use both Parametric Testing and Non-Parametric Testing for Task 1.
======Parametric Testing======
Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.  


======Parametric Testing Results======
ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency.  Both of these methods did not produce valid p-values as they used the comparison of total means and variances.
Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes.  This was computed using MATLAB, but also produced unusable p-values since the method uses mean and standard deviation of the differences between samples to calculate the p-value, thus letter and text type data is lost in the process and so is not applicable.
The p-values calculated using these methods fluctuated depending on the type of data used.  If the raw number of letter frequencies for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) was used, then the p-value became very small since the total means of each text were vastly different.  Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a p-value of 1 since the means became the same.  Thus this method of mean comparison did not work for letter frequency analysis.


=====Parametric Testing=====
======Parametric Testing Evaluation======
 
All of these methods did not compare individual data entries from one group to their corresponding entry in another group.  Instead they used the sample size, sum, mean, and variance of each group to compare to the other group. Because of this, this method could not be used when comparing letter frequency between languages.  For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.
Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.  


=====Non-Parametric Testing=====
=====Non-Parametric Testing=====
Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test.
These 3 tests were researched and considered but ultimately the chi-squared test was chosen based on a cryptography reference found that uses the chi-squared test to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref>, as well as advice from supervisor Dr. Berryman.  The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights.  The sample texts chosen were the first 44 letters from a the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.


Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test.
====Non-Parametric Testing Initial Results And Evaluation====
These 3 tests were researched and considered but ultimately the chi-squared test was chosen based on a cryptography reference found that uses the chi-squared test to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref>, as well as advice from supervisor Dr Matthew Berryman.  An extension for future Honours groups could be to attempt the analysis with other non-parametric tests such as the Kolmogrov-Smirnov Test or Mann-Whitney U Test.
Initially, a result was unable to be computed for the Chi-Squared value or p-value using the Chi-Squared test method. It was soon discovered that in the calculation of the Chi-Squared value, if the frequency of a letter in the reference text (a particular language from the Declaration of Human Rights) was 0, this caused the denominator of the Chi-Squared value equation to be 0 and thus the equation became invalid (See Figure 1.2).  An assumption in the method of the Chi-Squared testing had to be made in an attempt to rectify the issue.  The count for letters in the reference text that appeared 0 times, were altered to 1, a small number chosen to be the closest integer value to 0 in the hope that it would not have significant enough effect on the results to cause a skew in the data. This assumption was made since the sample size for the reference text was too small, and so not every letter appeared at least once.
The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights.  The sample texts chosen were the first 44 letters from a the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.


====Results And Evaluation====
[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 1.2:''' Task 1 Chi-squared Formula]]
=====Parametric Testing=====


ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Both of these methods did not produce valid p-values as they used the comparison of total means and variances. Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes. This was computed using MATLAB, but also produced unusable p-values since the method uses mean and standard deviation of the differences between samples to calculate the p-value, thus letter and text type data is lost in the process and so is not applicable. The p-values calculated using these methods fluctuated depending on the type of data used. If the raw number of letter frequencies for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) was used, then the p-value became very small since the total means of each text were vastly different. Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a p-value of 1 since the means became the same. Thus this method of mean comparison did not work for letter frequency analysis.  
The chi-squared test resulted in large Chi-Squared values that were brought about by a small text size sample of 44 letters. This effectively caused the p-value results to be very small numbers and therefore were unable to be used as a means of comparison and caused the null hypothesis to be rejected in every case. However, this test could still be used as a measure of similarity since the Chi-Squared Values for each text comparison can be compared based on the fact that the smaller the Chi-Squared value, the more similar the two texts are.  Thus, despite being unable to obtain valid p-values and perform hypothesis testing, the texts could still be ranked based on the Chi-Squared value.


=====Non-Parametric Testing=====
Show initial test results for eng, germ, fre, zap etc. somewhere here


[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 1.2:''' Task 1 Chi-squared Formula]]


During our Chi-squared value calculation, we meet a serious problem. There are some letters appear 0 times in one particular language in our base text from Declaration of Human Rights. According to figure 1.2, it will cause the denominator of the Chi-Squared value equation to be 0 and as we all know the denominator cannot be 0 in the equation.  We have to make some reasonable assumptions so that we are able to continue using Chi-squared testing for our task 1. First assumption is we use number 1 which is the closest integer to 0 to instead of 0 when the letter counts 0 in this language.  Second assumption is we use 0.001 and 0.0001 to instead of 0 frequencies in same case rather than the number 1. For the first assumption, it would not have significant enough effect on the results to cause a skew in the data. However, for the second assumption, this caused the result to very large Chi-Squared values (in the order of 10^10) and very small p-values (approaching 0).
During our Chi-squared value calculation, we meet a serious problem. There are some letters appear 0 times in one particular language in our base text from Declaration of Human Rights. According to figure 1.2, it will cause the denominator of the Chi-Squared value equation to be 0 and as we all know the denominator cannot be 0 in the equation.  We have to make some reasonable assumptions so that we are able to continue using Chi-squared testing for our task 1. First assumption is we use number 1 which is the closest integer to 0 to instead of 0 when the letter counts 0 in this language.  Second assumption is we use 0.001 and 0.0001 to instead of 0 frequencies in same case rather than the number 1. For the first assumption, it would not have significant enough effect on the results to cause a skew in the data. However, for the second assumption, this caused the result to very large Chi-Squared values (in the order of 10^10) and very small p-values (approaching 0).


=====Top 20 European Language Comparison=====
=====Top 20 European Language Comparison=====

Revision as of 15:00, 20 October 2015

Abstract

The project involves the mysterious case of a dead man found at Somerton Beach, South Australia. There was no evidence to show the man’s identification or cause of death, however, there were 5 lines of letters that were found on a scrap of paper in the dead man’s trouser pocket. It was later discovered that the scrap of paper was torn from a book known as ‘The Rubaiyat of Omar Khayyam’. These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. The case still remains unsolved today, and so this project has been undertaken in order to uncover further case evidence. The aims and objectives of the project include using various computational techniques to statistically analyse the likely language of origin of the code, designing and implementing software in order to decipher the code, and ultimately attempting to solve the cold case.

Executive Summary

This project involves the design and implementation of software in order to decipher the code associated with the Somerton Man murder mystery. This document is a Final Report and Thesis outlining the key aims, methods, results and evaluations and justifications of the specific tasks involved in the Code Cracking: Who Murdered The Somerton Man? Honours project. The information in the report also includes the motivation, previous studies, aims, objectives, and significance of the project, as well as technical background, knowledge gaps, technical challenges, specific project tasks and project management resources.

Write key findings etc. from conclusion

Acknowledgements

  • Project supervisor Prof. Derek Abbott for providing the inspiration and motivation in undertaking this project, as well as advice on specific tasks to be completed to further the research into uncovering the mystery of the Somerton Man code.
  • Project supervisor Dr. Matthew Berryman for all advice and assistance with statistical methods and software programming.

Introduction

Motivation

Fig. 1: Photograph of the letters making up the Somerton Man Code

On the 1st of December, 1948, there was a dead body found at Somerton Beach, South Australia [1]. There was no evidence to show the man’s identification and the cause of death [2], however, there were 5 lines of capital letters, with the second line struck out, that were found on a scrap of paper in the dead man’s trouser pocket [3]. A photo of the paper containing the letters can be seen in Figure 1. It was later discovered that the scrap of paper was torn from a book known as the 'Rubaiyat of Omar Khayyam' [4]. These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. As engineers, we have the ability to help investigators in solving the case. With that in mind, this project is being undertaken to attempt to decrypt the code in order to help solve the cold case.

The South Australian Police stand to benefit from this project not only from the decoding technology developed for this case, but it also may be able to be applied to solve similar cases. Historians may be interested in gaining further historical information from this project since the case occurred during the heightened tension of the Cold War, and it is speculated that this case may be related in some way [5]. Pathologists may also be interested as the cause of death may have been an unknown or undetectable poison [6]. This project stands to benefit the wider community as well as extended family of the unknown man to provide closure to the mysterious case. Professor Derek Abbot also stands to benefit as he has been working closely with honours project students for the past seven years in an attempt to decipher the Somerton Man code.

Previous Studies/Related Work

Fig. 2: 3D generated reconstruction of bust of Somerton Man from 2012 Final Report

Previous professional attempts to decipher the code were limited since they did not use modern techniques or have access to modern databases. Another limitation is that some of the characters in the code appear to be ambiguous and previous attempts made fixed assumptions on particular characters [7]. The Australian Navy’s response was that the letters were “neither a code nor a cipher” [8]. The Australian Department of Defence analysed the characters in 1978 using the computer technology available of that era and concluded:

  • a) “There are insufficient symbols to provide a pattern”
  • b) “The symbols could be a complex substitute code or the meaningless response to a disturbed mind”
  • c) “It is not possible to provide a satisfactory answer” [9]

Other previous studies into deciphering the code include Honours Projects at the University of Adelaide from 2009-2013. The previous work undertaken by these groups includes: multiple evolutions of letter frequency analysis of the code on a variety of base texts in a number of languages, initial letter and sentence letter probabilities, the probabilities of known cypher techniques, the likelihood of the code being an initialism of a poem, the use of various one-time pad techniques, the design and implementation of a web crawler, the analysis of text type and genre of the code’s likely plaintext, the implementation of pattern matching software into the web crawler, a 3D generated reconstruction of the bust of the Somerton Man (see Figure 2) and the analysis of mass spectrometer data taken from the Somerton Man’s hair [10] [11] [12] [13] [14]. The main conclusions that past groups have come to in their projects are: that the letters are unlikely to be random, the code is unlikely to be an initialism, it is likely that the Rubaiyat of Omar Khayyam was used as a one-time pad, the language of the code is likely to be English, the code is unlikely to be an initialism of a poem and that the Rubaiyat of Omar Khayyam was not used as a straight substitution one-time pad [15] [16] [17] [18] [19]. The analysis and extension upon specific elements of previous work that are directly related to the 2015 group’s project are discussed in the 'Method – Specific Tasks' section.

Aims and Objectives

The key aims and objectives in this project included the aim to statistically analyse the likely language of the plaintext of the code. Another aim was to design and implement software in order to try and decipher the code. This was to be implemented by using the 'Rubaiyat of Omar Khayyam' as a one-time pad in conjunction with a new key technique, and by developing a search engine to try to discover possible n-grams contained within the code. The third aim was to analyse mass spectrometer isotope concentration data of the Somerton Man’s hair. Finally, the ultimate aim was to decrypt the code in order to solve the mystery, however this was somewhat unrealistic as the code has remained uncracked for many years. Despite this, computational techniques were to be utilised to attempt the decryption, and at the very least, the past research into the case was to be furthered for future Honours students.

Significance

Considered “one of Australia’s most profound mysteries” at the time [20], this case still remains unsolved today. As the development of decoder technology and the related knowledge progresses, this project poses the opportunity to uncover further case evidence. The skills developed in undertaking this project were also of great significance in a broader sense, as these can be transferrable to possible future career paths. The techniques developed include: software and programming skills, information theory, probability, statistics, encryption and decryption, datamining and database trawling. The job areas and industries that these skills can be applied to are: computer security, communications, digital forensics, computational linguistics, defence, software, e-finance, e-security, telecommunications, search engines and information technology. Some possible job examples include working at: Google, ASIO, ASIS and ASD [21].

Technical Background

P-Value Theorem Explanation

The p-value is the probability to obtain an effect equal to or more extreme than the one observed, presuming the null hypothesis of no effect is true. It gives researchers a measure of the strength of evidence against the null hypothesis.

In statistics, the p-value is used for testing a statistical hypothesis by observing sample results. P-value is an effective method to presume whether the null hypothesis of no effect is true. In this project, we will use p-value to do some simple tests to determine whether English was the language in the original message [49]B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.

In this case, we make an assumption that English was not the language in the original message. Thus we should get a larger p-value which is higher than 0.05 in the simple test. This means the observed data point is in the range of “more likely observation”, mentioned in Figure 1 below. Oppositely, if we get a smaller p-value which is lower than 0.05 in the simple test, this means the observed data point is in the range of “very un-likely observations”. This indicates our assumption is wrong and that English was the language in the original message [50]G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.

Fig. TB.1: Technical Backgroud P-value

As the graph shows, the horizontal coordinate is the set of possible results, and the vertical coordinate is the probability density. The p-value is in the right side of observed data point, which is between the curve and horizontal coordinate [51]No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/Pvalue# cite_note-nature506-1.

Chi-Squared Test Explanation

Chi-square is a statistical test commonly used to compare observed data with expected data. It requires than there is no relationship between the observe data and expected data. It means the observed data should not affect the expected data and they are independent respectively. [52] N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000.


Since it is a non-parametric test, it doesn’t care about distribution of samples. It means there is no requirement that expected and observed data should follow normal distribution.

There is a test called ‘Goodness of Fit’ in Chi-squared test, meaning it measures how closely one group of data is related to another. [53] N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944

Fig. TB.2: Technical Backgroud Chi-Squared

Figure TB.2 is a quick example that how we use Chi-squared test for our task 1.

Firstly, we should calculate the Chi-squared value for letter A by using the equation. The observed value is the count of letter A in the Somerton Man code, the expected value is the total number of letters in Somerton Man code which is 44 multiply the frequency of letter A in the sample language we used.

Secondly, we repeat this step to calculate the Chi-squared value for rest letters from B to Z.

Finally, we sum the Chi-squared value for all letters. This is the Chi-squared value between sample language and Somerton Man code. Lower Chi-squared value means the Somerton Man code is likely to be this language.

Universal Declaration of Human Rights Explanation

Project Gutenberg Explanation

N-Gram Model Explanation

One-Time Pad Explanation

Knowledge Gaps and Technical Challenges

In order to complete the specific tasks proposed within this project, the group members were required to develop new skills in text processing and programming in a variety of languages. These languages included Java, MATLAB, and Python. The group was also required to research and learn how to implement and evaluate statistical techniques including chi-squared testing, p-value calculation and hypothesis testing. Another necessary skill to be developed was the use of Microsoft Excel software to perform statistical analyses. The technical challenges that were to be faced within the project were directly associated with these knowledge gaps.

Method - Specific Tasks

Task 1: Statistical Frequency Analysis of Letters

Aim

A Critical review of the statistical frequency analysis of the letters from the 2013 group was to be conducted. It was then proposed that the 2015 group was to repeat the statistical tests done by the 2013 group. Like the 2013 group, the Universal Declaration of Human Rights was to be used as the base text based on advice from Professor Abbott, but the validity of this text being used as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be used on 44 letters out of pieces of text from the most likely languages. The 2013 group’s analysis was then to be extended by the 2015 group by calculating p-values and implementing hypothesis testing. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run p-value tests on benchmark pieces of text from the most likely languages and see if the p-values suggested that the letters are indeed from those languages. The use of benchmark pieces of text were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the Universal Declaration of Human Rights as a base text. Once the most likely language was determined, the group was to process an additional number of benchmarks of that language and obtain a mean p-value. Next, a hypothesis test could be performed based on the mean p-value obtained from the benchmarks, when compared to, the p-value of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this hypothesis testing method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.

Method

2013 Statistical Frequency Analysis Review

A critical review of the statistical frequency analysis of the letters from the 2013 group has been conducted to determine possible extensions to be undertaken by the 2015 group. The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the Universal Declaration of Human Rights. This document was chosen since it is translated in over 400 languages. The 2015 group is to analyse this choice of base document using statistical techniques. 266 languages were analysed, since many of the translations were not text files, but paper scans. This seems like a reasonable omission due to the time constraints of the project and the common European languages are all in text form. The analysis included accented letters normalised to their ‘parent’ characters, for example, considering 'ǎ' as 'a'. The languages were then analysed using a variety of combinations of the ambiguous letters within the code. The 2013 group’s statistical analysis was then refined to the top 20 closest European languages. These included more uncommon European languages such as Scots and Vepsian. The graphs of the results from the 2013 group’s analysis can be seen in Figure 1.1. The most likely language of the Somerton Man code and control text was found to be Scots. This is unlikely since it is not a common European language as it is a Scottish dialect [22]. Due to this, Scots, and other less common European languages will be omitted in the 2015 group’s statistical analysis. Instead, only the most common European languages will be included in order to further refine the statistical process. Despite this, a conclusion was made that the most likely language was English since Scots and English shared very similar initial letters. In order to test this conclusion, the 2013 group used the English translation of the Rubaiyat of Omar Khayyam as an English control text. The first 44 words from the text were used since this is the same number as the number of letters in the Somerton Man code. The results showed, once again, that Scots was the most likely language, followed by English (see Figure 1.1). Analysing this choice of control text, the choice was made by the 2013 group as a matter of convenience. This may not have been the best decision to make as the Rubaiyat of Omar Khayyam was originally written in Persian, and has since been translated into English. The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text as the translation may use uncommon words or expressions. Another reason for the suspected use of uncommon words or expressions is the fact that the text is a book made up of four line poems, meaning that it may not accurately represent commonly used words or letters in such a small sample size of 44 words. The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel, originally written in the most likely language.

Fig. 1.1: 2013 Statistical Language Analysis Graphed Results
Assumptions

Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letter. Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. This assumption entailed the use of versions of the code including 6Ms, 4Ms and 2Ws, and the code as it appeared in the original police report from 1949[23]. These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.

Base Text

The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.

A spread sheet containing the letter frequency data from the Universal Declaration of Human Rights was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency java code. This data spread sheet was used for statistical calculations for comparison of the letter frequency in the code versus each language in the Universal Declaration of Human Rights.

Statistical Methods
Parametric Testing

Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.

Parametric Testing Results

ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Both of these methods did not produce valid p-values as they used the comparison of total means and variances. Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes. This was computed using MATLAB, but also produced unusable p-values since the method uses mean and standard deviation of the differences between samples to calculate the p-value, thus letter and text type data is lost in the process and so is not applicable. The p-values calculated using these methods fluctuated depending on the type of data used. If the raw number of letter frequencies for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) was used, then the p-value became very small since the total means of each text were vastly different. Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a p-value of 1 since the means became the same. Thus this method of mean comparison did not work for letter frequency analysis.

Parametric Testing Evaluation

All of these methods did not compare individual data entries from one group to their corresponding entry in another group. Instead they used the sample size, sum, mean, and variance of each group to compare to the other group. Because of this, this method could not be used when comparing letter frequency between languages. For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.

Non-Parametric Testing

Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test. These 3 tests were researched and considered but ultimately the chi-squared test was chosen based on a cryptography reference found that uses the chi-squared test to compare a code to a particular language [24], as well as advice from supervisor Dr. Berryman. The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights. The sample texts chosen were the first 44 letters from a the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark [25], as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.

Non-Parametric Testing Initial Results And Evaluation

Initially, a result was unable to be computed for the Chi-Squared value or p-value using the Chi-Squared test method. It was soon discovered that in the calculation of the Chi-Squared value, if the frequency of a letter in the reference text (a particular language from the Declaration of Human Rights) was 0, this caused the denominator of the Chi-Squared value equation to be 0 and thus the equation became invalid (See Figure 1.2). An assumption in the method of the Chi-Squared testing had to be made in an attempt to rectify the issue. The count for letters in the reference text that appeared 0 times, were altered to 1, a small number chosen to be the closest integer value to 0 in the hope that it would not have significant enough effect on the results to cause a skew in the data. This assumption was made since the sample size for the reference text was too small, and so not every letter appeared at least once.

Fig. 1.2: Task 1 Chi-squared Formula

The chi-squared test resulted in large Chi-Squared values that were brought about by a small text size sample of 44 letters. This effectively caused the p-value results to be very small numbers and therefore were unable to be used as a means of comparison and caused the null hypothesis to be rejected in every case. However, this test could still be used as a measure of similarity since the Chi-Squared Values for each text comparison can be compared based on the fact that the smaller the Chi-Squared value, the more similar the two texts are. Thus, despite being unable to obtain valid p-values and perform hypothesis testing, the texts could still be ranked based on the Chi-Squared value.

Show initial test results for eng, germ, fre, zap etc. somewhere here


During our Chi-squared value calculation, we meet a serious problem. There are some letters appear 0 times in one particular language in our base text from Declaration of Human Rights. According to figure 1.2, it will cause the denominator of the Chi-Squared value equation to be 0 and as we all know the denominator cannot be 0 in the equation. We have to make some reasonable assumptions so that we are able to continue using Chi-squared testing for our task 1. First assumption is we use number 1 which is the closest integer to 0 to instead of 0 when the letter counts 0 in this language. Second assumption is we use 0.001 and 0.0001 to instead of 0 frequencies in same case rather than the number 1. For the first assumption, it would not have significant enough effect on the results to cause a skew in the data. However, for the second assumption, this caused the result to very large Chi-Squared values (in the order of 10^10) and very small p-values (approaching 0).


Top 20 European Language Comparison

The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the Chi-Squared test against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run as a method of comparison to test how similar our results were to the 2013 group’s results. We will display the results for our two assumptions mentioned before.

Results
Fig. 1.3: Squared Difference versus 2013 Number

Figure 1.3 shows the Chi-squared value of closest 20 languages base on 2013 group’s Squared Difference with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use number 1 to instead of counts 0 of letters.

Fig. 1.4: Standard Deviation versus 2013 Number

Figure 1.4 shows the Chi-squared value of closest 20 languages base on 2013 group’s Standard Deviation with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use number 1 to instead of counts 0 of letters.

Fig. 1.5: Squared Difference versus 2013 Frequency

Figure 1.5 shows the Chi-squared value of closest 20 languages base on 2013 group’s Squared Difference with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.

Fig. 1.6: Standard Deviation versus 2013 Frequency

Figure 1.6 shows the Chi-squared value of closest 20 languages base on 2013 group’s Standard Deviation with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.


Evaluation

The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.


Top 20 European Languages based on Estimated Number of Native Speakers

In extension to the 2013 group’s work, the 2015 group conducted a Chi-Squared test of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers [26]. This test included all versions of the Somerton Man code including the versions with 6 Ms, 4 Ms 2 Ws, Average and the police report code (5 Ms and 1 W).

Results
Fig. 1.7: European Languages versus Code 1 Number

Figure 1.7 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers with 6 Ms in code (the left graph) and 4 Ms & 2 Ws (the right graph). In this case, we use number 1 to instead of counts 0 of letters.

Fig. 1.8: European Languages versus Code 2 Number

Figure 1.8 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in code (the left graph) and Average of 3 cases (the right graph). In this case, we use number 1 to instead of counts 0 of letters.

Fig. 1.9: European Languages versus Code 1 Frequency

Figure 1.9 shows the Chi-squared value of Top 21 European Languages based on Estimated Number of Speakers with 6 Ms in code (the left graph) and 4 Ms & 2 Ws (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.

Fig. 1.10: European Languages versus Code 2 Frequency

Figure 1.10 shows the Chi-squared value of Top 21 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in code (the left graph) and Average of 3 cases (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.

Evaluation

The results show that for 2 of the 3 code versions and their average value, English was the closest language to the Somerton Man code. Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms. However, consider Kurdish is not a common and popular language in the world. English was also found to be the closest language to the code as in the original police report.

Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample

A comparison of the English control text (a 44 letter sample of Thomas Hardy’s Tess of the d’Urbervilles) against the top 20 European Languages based on Estimated Number of Native Speakers was conducted in order to test the ability of the method being able to identify an English sample text among the top 20 European languages.


Results
Fig. 1.11: European Languages versus Thomas 1 Number

Figure 1.11 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers against Thomas Hardy Sample. In this case, we use number 1 to instead of counts 0 of letters.


Fig. 1.12: European Languages versus Thomas 1 Frequency


Figure 1.12 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers against Thomas Hardy Sample. In this case, we use 0.0001 to instead of frequency 0 of letters.

Evaluation

The results of this test show that English was the closest text to the English sample text. This result is desirable as it successfully verified the ability of the Chi-Squared test to distinguish a 44 letter English sample out of the top 20 European languages, and can be used to back up the results obtained from comparing the same 20 languages against the Somerton Man code.


Evaluation and Justification

Evaluation of Parametric Testing

All of these methods did not compare individual data entries from one group to their corresponding entry in another group. Instead they used the sample size, sum, mean, and variance of each group to compare to the other group. Because of this, this method could not be used when comparing letter frequency between languages. For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.

Evaluation Of Task 1

The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the validity of the method as well as the Universal Declaration of Human Rights as a base text. The group was then to extend the analysis by calculating the p-values for the Somerton Man code when compared to the most common European languages and perform hypothesis testing based on the results. The group was also to use benchmark texts to test the statistical accuracy of the method as well as the validity of the Universal Declaration of Human Rights as a base text. The 2015 group’s statistical analysis has achieved almost all of its proposed goals. A slight diverge from the initially proposed method was decided upon once it was found that p-values useful for comparison or hypothesis testing were unable to be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated Chi-Value. All assumptions outlined in the proposal were followed, with the addition of the modification of the frequency data to account for the small sample size of the base text.

Conclusion

The results of all of the Chi-Squared testing lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written assuming it is an initialism.

Task 2: N-Gram Search

Aim

The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man code. Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section). Based on this, for this task an assumption was made that the code is an initialism. A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases. This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section). Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded. This was to drastically increase the speed at which the gram combinations on the web could be found. The search engine was required to output a list of possible gams from the input letter combinations.

An assumption that was made in order to complete this task was that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption was made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this, a second assumption was made that the letters in the code, and thus the words in the grams, were order relevant. A final assumption was that all variants of ambiguous letters in the code were to be included.

Method

Research upon the available databases was conducted and it was found that the two largest databases that most suited our needs were 'Microsoft Web N-Gram Services' [27], and Google N-Gram [28]. The use of Microsoft Web N-Gram Services was initially considered more favourable due to its larger database [29] , better provided documentation and tutorials [30], and lower initial cost (see Budget Section). Despite the Microsoft alternative's advantages, upon further research and after consulting the Microsoft Web N-Gram, it was discovered that it could be used for our purposes using a best-first DP search, with a very large number of calls to the N-gram service's generate method. Also, through consulting our project supervisor Dr. Berryman, there was a concern that a combinational explosion would occur, meaning that for a 5-gram search, if the Microsoft database was say 50,000 words, the search engine would have to make 50,000^5 + 50,000^4 + 50,000^3 + 50,000^2 + 50,000 calls to the database to complete the search. Due to all the network calls required to use this method, the Microsoft Web N-Gram Service was deemed unfit for purpose in our application, since the method would not be fast enough to complete all of the searches that we required. Due to this, the Google alternative was considered.

Programming Language Selection

The search engine was initially to be implemented via web application using developer tools such as Microsoft Visual Studio or Eclipse [14], for use with Microsoft's Web N-Gram Service. The application could be programmed in native languages including Visual Basic, Visual C#, Visual C++, Visual F# and Jscript [31]. Instead, the programming language of Python was chosen to be used in conjunction with the Google N-Gram database through advice from Dr Berryman and due to its ease of use and efficiency with text processing.

Upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google. The Linguistic Data Consortium's version of the database was initially chosen as an alternative since it had the advantage of a concise and clean nature, with all total n-gram frequencies summed and collated. Upon discussion with supervisor Dr. Berryman, it was decided that it was not worth purchasing the cleaned version of the database since we could extract the data we needed from the raw database and the outputs could be cleaned up easily enough through writing some simple codes in Python. Due to this, it was decided that the database provided by Google was to be used for our purposes (See budget section for further details).

Storage and Processing Implementation

Due to the size of the database, although being able to be physically stored locally, the local computing power available would have been insufficient to run the search engine code through the database within the time frame of the project. Instead, a cloud based computing service with increased processing power was sought out to be able to complete the database search within the time restrictions of the project. Upon considering a number of options, it was decided that 'Amazon Elastic Compute Cloud' was to be used due to its robust storage and processing options [32] and Dr Berryman's prior experience in using this service. The Amazon EC2 free tier was assessed for use but had a 30GB storage limit[33], which was insufficient to store the Google N-gram database on. In addition to this, the instance sizes provided by the Amazon EC2 free tier were t2.micro instances, meaning that they provided 1 vCPU, 1 Gib of RAM and only 20% of each vCPU could be used [34]. Based on this, it was estimated that using this version of the Amazon Elastic Compute Cloud would have taken approximately 20 months to complete, which was far too long to complete within the project timeframe. Instead, it was proposed to use the high input/output Amazon i2 tier to provide the performance needed to store and process the database. After some experimentation with different tiers, two i2.xlarge instances run on Amazon EC2 were proposed to be used, providing two sets of instances, each containing 4 vCPUs, 30.5 GiB of RAM, and 2 x 800 GB SSD Storage[35]. Using this tier allowed for parallelisation by running separate processes for each group of n-gram inputs from n=1-5 using 5 separate instances of the search engine code.

Search Engine

The initial n-gram search code was written in Python and submitted to our GitHub repository for review. Based on advice from project supervisor Dr Berryman, it was discovered that the code would work on a small data set, but since our data set was so large (1.79 Tebibytes when compressed), the code was modified to fit the suggested workflow and run in parallel on Amazon instances.

The maximum number of n provided by the gram database was five. Due to this, a maximum of five letter gram groups from the code could be processed at a time. This was achieved by writing a code in Python to generate all possible 5-gram initialisms from all code variants, including the crossed out line, and output them into a corresponding text file. The same was also done for 4, 3, 2 and 1-grams and stored in their respective text files. These were to be used as input files from which the search engine was able to perform searches to query the database.

The search engine code was also written in Python. It functioned by taking in the initialism combinations from the Somerton Man code of length n, from text files created by the intiialism generator code, and stored them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in by a reader and initialisms were generated from the grams in each line of the reader. If the initialism generated from the line of the reader matched an initialism in the dictionary containing the initialisms of interest, the full gram was output into a corresponding text file containing results of length n. This code was copied and modified to be used for each gram length from n=1-5. A simplified diagram of the way the code works can be seen in the flowchart in Figure X, and the full code can be seen in Appendix X.

Fig. X: Search Engine Flowchart

Running our code on the Google N-Gram database stored in the i2.xlarge instances in parallel for each group of n-gram inputs from n=1-5 took approximately two weeks. These raw results were then small enough to be stored and processed locally and so the Amazon EC2 service was no longer required.

The frequency for each n-gram was then taken using Python code to count the number unique entries for each gram. This was implemented in order to speed up the time in which to obtain a frequency to be used to rank the popularity of each gram. This was a bug that unfortunately caused the frequency of occurrence of grams in each year to be lost, and so the count of the number of years in which each gram occurred was used as measure of frequency.

Processing Search Results

Once the raw results were obtained, some grams contained words followed by an underscore and the corresponding lexical category of the word (ie. noun, verb, adverb, adjective, pronoun etc.). This was desired to be removed and so another python code was written to remove everything but the words themselves from each line in the results.

Upon processing the raw results, the output of the lexical category removal results showed multiple identical results with individual frequencies for the numbers of years in which they occurred. This was brought about since previously the database considered these entries to be unique results, but now with the lexical categories removed, some results became identical. This was rectified by writing another code in Python to process these cleaned results to combine identical entries and sum their frequency of years in which they occurred. This code was then duplicated and modified into two codes, the first output the results sorted alphabetical order, and the second in order of frequency of years in which each result occurred from highest to lowest. The alphabetically sorted outputs were used as a means of comparison to the cleaned inputs since these were also sorted alphabetically, in order to check that the code was functioning correctly. The frequency sorted outputs were more useful since they able to be used to generate a condensed list of the top 30 most popular initialisms that could be generated from the letters from all variations in the Somerton Man code, seen in the results section in Figure X.

Combinations of Search Results

Finally, a code was written in Python to generate all possible combinations of the top 2 5-gram group results for each variant of the code, where the top 2 results were based on frequency of years in which they occurred. This was achieved using a non-overlapping sliding window of 5 letters in length. The way this code worked can be more easily explained using the following example:

For simplicity, using 2-grams and the code ABAC: If the top 2 2-grams for AB are Absolute Bargain and American Beagle, and the top 2 2-grams for AC are Air Conditioning and Alternating Current, then all possible combinations for the code are: Absolute Bargain Air Conditioning, Absolute Bargain Alternating Current, American Beagle Air Conditioning and American Beagle Alternating Current.

This code was implemented as an exercise to see if any interesting or useful results could come about using this simple method. Unfortunately, this produced nonsensical results due to the disjoint between each 5-gram group's search results, a sample of these can be seen in the results section in Figure X. Due to the time constraints of the project, the code was not able to be developed any further, but the code and the results it provides can be used as a first step towards obtaining meaningful or useful combinations of n-grams from the results obtained using the search engine developed throughout this project. This code could be improved by using a sliding window that progresses by less than 5 letters for each search, for example, using a step size of 1 letter would create the maximum possible overlap of 4 letters between each input gram group. More information on this and other suggested improvements can be found in the future work section.

Results

File:Combined Top 30.pdf
File:Top 2 5-Gram Results Combined.pdf

Evaluation and Justification

Unfortunately, the database used in the search engine was smaller in size than the preferred Microsoft option[36]. This could have provided a larger number of search results to be used for future analysis, but since the workflow utilised to access this database did not suit the application proposed by this project, the Microsoft option was deemed unusable. Another limitation of using the Google N-Gram database was that it only includes grams that have appeared more than 40 times across the corpus[37]. There is room for improvement in this task by potentially fixing the bug in the search engine code to record frequency of occurrence of grams in each year to calculate the total frequency of each gram. As mentioned, another improvement could be to optimise the gram combination code in order to obtain more useful or interesting results. Overall, I believe Task 2 was successful in achieving its aims to crate a search engine that found regular expressions that could be linked to the Somerton Man code by using the initial letters of the code to find commonly used English phrases. In addition to this, the search time relative to the number of results produced has clearly been largely improved when compared to the web crawler developed in previous years.

Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad

Aim

The aim of Task 3 was to use Rubaiyat of Omar Khayyam as a one-time pad to attempt to decode the Somerton Man code to find any meaningful messages after decrypting. This task involved the investigation that the letters had been substituted for others using a one-time pad technique. This group was to use the Somerton Man code to act as the cipher text, the numerical value of letter positions within words, with respect to the first letter of each word to act as the key, and the decoded messages to act as the plaintext of the code. The first letter of each word in the one-time pad had numerical value 0, the second letter had numerical value 1, and so on.

The difference in aim between our group and the 2013 honours group is that the key used to decode the message is based on letter position within each word rather than using numbers assigned to each letter in the alphabet to perform alphabetic shits, as implemented in the 'Decoding Toolkit - One Time Pad' software [38].

Method

Before commencing Task 3, we had to choose a computer programming language to implement the code. My first attempt was to use Java, because this language was used by all previous groups to implement for their work. I tried to reuse the base code from previous groups and extend it to satisfy our Task 3 requirements, however, the aim of 2013 was significantly different to ours, so there were not too many similar points in the code. On the other hand, I am not good at Java language because I never learnt it before. My second attempt was to use the C and C++ language to implement the code. I completed some functions for the code, but there was a problem in loading text files. Because we used the Rubaiyat as a one-time pad, we were trying to translate Rubaiyat into a text file so that we were able to run the code and search for words in the Rubaiyat. C++ is an old programming language and the process of loading data was too complex. I decided to give up this method. My final attempt was to use Matlab, it is a relatively new programming language, and loading text files is relatively simple. In addition, I have studied Matlab in previous university courses and find it easier to use. Finally, I chose Matlab as implement language for our Task 3.

The direct substitution of the letters in the pad was to be used. For instance, if we chose a line: MRGOABABD, from the Somerton Man code as the encoded message, the first letter of the code is to be used, which is an M. The program was then to search the Rubaiyat from beginning to end, until it finds the first word that begins with M. This was then to be decoded to the second letter in the same word. This process was then to be repeated for the second letter in the code, being R. It was to find the first word in the Rubaiyat that starts with R, and decodes it to the second letter in that word. This method was then to be repeated until there were too few words long enough to decode all of the letters in the code. After this point, one final decode was to be attempted, where rather than using letter position, the last letter of each word was to be used. The output of the software was to be possible words or phrases made up of letter substitutions in place of the letters in the Somerton Man code.

Design
Fig. 3.1: Design for task3

The Figure 3.1 shown above include my design and thinking process for Task 3. We needed 2 inputs for the code: One was the Somerton Man code which is the encoded message, other was the letter position as the key (the range of the letter position is above 0 and starts with integer 1 based on the MATLAB system). The function was to output a letter from the Somerton Man code and a number n from the letter position. Then, a function called matching was then to receive the letter output from the Somerton Man code. It was to search the Rubaiyat from beginning to end to match the first word with the same initial letter. At the same time, a function called finding was to receive the number output by letter position. Afterwards, the function matching was to send the matched word from Rubaiyat to the function 'finding'. This function was then to choose the correct letter in the matched word based on the letter position n. For example, if n equaled to 2, the 'finding' function would choose the second letter of the matched word. Finally, the function wouldl output the chosen letter to a function called 'recordMessage', and this function would record all the letters from 'finding' in order. The program would have now completed decoding for one letter of Somerton Man code, and would then repeat for the rest of the letters in the code. After decoding all letters, the program would output the original message.

One-Time Pad Example

The Figure 3.2 is an excerpt form the Rubaiyat of Omar Khayyam and the word AWAKE is the first word in the Rubaiyat. Assuming we have the encoded message AFM and we want to use the second letter position to decode it. The program will search the Rubaiyat from beginning to end, until it finds the first word that begins with A. From Figure 3.2, the word should be AWAKE. This will then be decoded to the second letter in the same word which is W. The program will repeat step 2 for the rest of the letters in the code. The second word and third word should be FOR and MORNING based on the Rubaiyat of Omar Khayyam. The output should be WOO. Figure 3.3 shows the output from MATLAB, the function called multi, which shows all possible decoded messages according to letter position n. For example, if n equals 2, the output will display the first original encoded message AFM and the second decoded message WOO as well.

Fig. 3.2: Example of task3 1
Fig. 3.3: Example of task3 2
Verification

A simple verification method was used to test the MATLAB code to be functioning correctly. One group member encoded the original message GUN using the proposed key with the third letter position, and Rubaiyat of Omar Khayyam as the one-time pad. Based on Figure 3.4, which is an excerpt from the formatted version of the Rubaiyat of Omar Khayyam, the encoded words used were NIGHT, FLUNG and HUNTER. Taking the first letter of each word produces the letters NFH as the encoded message.

The other group member then took the letters NFH as the code and ran the MATLAB script to test all letter positions until discovering that by using the third letter position as the key, the original message 'GUN' was displayed as the output as expected. This can be see in Figure 3.5.

Fig. 3.4: Task 3 Verification
Fig. 3.5: Task 3 Verification Result

Results

Fig. 3.6: Task 3 result 21
Fig. 3.7: Task 3 result 22

Figures 3.6 and 3.7 show the results for using second letter position as the key to decode each line of Somerton Man code with the Rubaiyat of Omar Khayyam acting as the one-time pad both with and without formatting respectively. Formatting refers to whether or not sumbols, punctuation and non-letter parts were removed from the text file.

Fig. 3.8: Task 3 result 31
Fig. 3.9: Task 3 result 32

Figures 3.8 and 3.9 show the results for using third letter position as the key to decode each line of Somerton Man code with the Rubaiyat of Omar Khayyam acting as the one-time pad both with and without formatting respectively.

Fig. 3.10: Task 3 result l1
Fig. 3.11: Task 3 result l2

Figures 3.10 and 3.11 show the results for using last letter position as the key to decode each line of Somerton Man code with the Rubaiyat of Omar Khayyam acting as the one-time pad both with and without formatting respectively.

Evaluation and Justification

Despite correctly implementing the newly proposed key method to use the Rubaiyat of Omar Khayyam as a one-time pad through the use of software, no useful results were able to be generated by using all variants of the code and all letter positions as keys. The main conclusion to be drawn from this was that the Rubaiyat of Omar Khayyam was not used as a one-time pad to generate the Somerton Man code with the proposed method of using letter position within words as the key.

Task 4: Statistical Frequency of Letters Reanalysis

Aim

Towards the end of the project, a decision was made that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus our efforts on reanalysing the letter frequencies of various European languages. This was decided upon since our initial analysis performed in Task 1 produced inconsistent and varied results. This was brought about due to the limited sample size of the Universal Declaration of Human Rights as a base text, causing the frequency of particular letters to appear 0 times in particular languages. Due to this, these letter frequencies had to be altered by choosing arbitrary numbers for their frequency in order to perform our chi-Squared testing and thus reduced the accuracy and validity of the test's results.

the limited sample size caused the chi-squared values for all languages, including English, to be reasonably large. This caused the resulting calculated p-values to be extremely small numbers, or in most cases 0. Because of this, these chi-squared values were not usable to use p-values to perform our initially proposed hypothesis testing from Task 1.

This caused us to question the validity of the Universal Declaration of Human Rights as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.

Method

It was decided that for the reanalysis, we would use Project Gutenberg to increase the sample size for as many of the 21 most popular European languages used in Task 1 as possible by collecting novels from the time before the Somerton Man's death. This was chosen to be used as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages. Novels in each language were concatenated and their letter frequencies were determined, until each letter appeared at least once in each language.

The 2013 group’s decoding toolkit and initial letter frequency count code were able to be utilised for this task. The decoding toolkit's 'format texts' function was used to remove all non letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts in order to obtain the data we needed to perform our statistical analysis. All statistical calculations and graphs were generated using Mircosoft Excel.

Initial Validation

First of all, the same test that was initially run in Task 1 on the statistics obtained from the English translation of the Universal Declaration of Human Rights (with letters with frequency 0 modified to 0.0001) as a baseline check were also run on the new statistics gathered from the novel: The Life of the Spider by J. Henri Fabre[39], used as our English base text found on Project Gutenberg, as a means of comparison between the base texts. The Somerton man code, 44 letter samples from a Thomas Hardy novel acting as an English control [40] as well as a French sample, German sample, and Zapoteco sample from the Universal Declaration of Human Rights were all compared to both sets of data and the results can be seen in Figure X.

European Language Comparison

Next, once it was found that the English text from Project Gutenberg provided lower chi-Squared values than the Universal Declaration for all samples in the initial test, the chi-Squared testing on European languages could be commenced. This involved the same procedure as was used in Task 1, but of the top 21 most popular European languages from Task 1, only 12 of the languages were able to be used in the reanalysis due to insufficient usability or availability of texts on Project Gutenberg. The languages used in the analysis can be seen in the graph in Figure X. The omitted languages included Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukranian, Belarusian and Kazakh. The texts used for this analysis can be seen in Appendix X.

Significance Level Calculation

The chi-squared and p-values calculated showed that English was the closest language to the Somerton Man code. From this, hypothesis testing could be performed based on the English results. Upon consultation with Prof. Abbott and Dr. Berryman, rather than choosing an arbitrary value of significance level such as the typically used P=0.05, it was decided a significance level could be calculated using the p-value found using real English texts to be used as what we deemed to be an acceptable significance level for which we would confidently be able to say that the most likely language of origin of the Somerton Man code is English. This was achieved by collecting 20 44 letter excerpts from English novels from Project Gutenberg (see Appendix X), performing the chi-squared testing for these samples against the English project Gutenberg novel used as our English base text, taking an average of the chi-squared values, and from this calculating a p-value. This result was then compared to the results obtained from the English portion of the chi-squared testing performed on the variants of the code, and was plotted as seen in Figure X.

This same testing was then also run on the English samples and code variants against the original English translation of the Universal Declaration of Human rights as a means of comparison between the two base texts. Significance levels were unable to be calculated using the Universal Declaration of Human Rights since the chi-squared values were too large, causing the calculated p-values to be too small (approaching 0). The results can be seen in Figure X.

It was unnecessary to extend the analysis to collect benchmarks and perform the hypothesis testing on the other European languages against the code since chi-squared values produced were too large, and so the p-values calculated were unusable.

Increased Sample Size Testing

It was then decided that in order to increase our confidence in the calculated significance level, we would increase the sample size for our English base text from Project Gutenberg to not only large enough such that each letter appeared at least once, but to concatenate 20 English novels from the time before the Somerton Man's death to be used as our base English Corpus (See Appendix X). It was first confirmed whether this would have an affect on the chi-squared values against the code variants when compared to other languages. We could then also increase our English benchmark sample size by taking 100 44 letter samples from this corpus using code written in Python, and performing the same testing as performed on our smaller English base text. The results from this testing can be seen in Figure X.

Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation, European Language Comparison and so these graphs have been excluded. A closer look at the changes to the chi-squared and p-values for the Somerton Man code variants caused by this increased sample size can be seen through comparing Figures X and X. Increasing the number of 44 letter English samples from 20 to 100 however, did have an effect on the chi-squared value and p-value calculated to be used as our significance level, the results of which can be seen in Figure X. This increase in number of samples had very little effect on the graph of the Universal Declaration of Human Rights significance level calculation, and so this has also been omitted.

Results

Fig. X: Graph of
Fig. X: Graph of Average Chi-Squared values from Project Gutenberg Base Texts versus Somerton Man Code Variants
Fig. X: Graph of
Fig. X: Graph of
Fig. X: Graph of

Evaluation and Justification

The results of the initial validation seen in Figure X, show that using the Project Gutenberg novel as an English reference text provided lower chi-squared values for all test cases and thus it was deemed to be a more suitable base text than the modified version of the Universal Declaration of Human Rights.

The results from the European Language Comparison in Figure X, show that English had the lowest chi-squared value when compared to all languages in our Project Gutenberg corpus, and thus was the closest language to the Somerton Man code.

The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure X) were found to be much higher than those calculated using the English Project Gutenberg novel (Figure X). The large difference in results, and the fact that real english samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus the validity of this as a base text has been proven to be questionable when used as part of a chi-squared analysis. In addition to this, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, hypothesis testing could not be completed due to the large chi-squared values producing very small p-values for the code variants and significance level (approaching 0).

The results from the initial significance level calculation in Figure X shows that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher p-values than the calculated significance level. From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an initialism.

Upon increasing the sample size, the significance level calculation in Figure X shows that now only 1 of the 3 code variants achieved a p-value higher than the calculated significance level. This caused the average result to fall below the significance level. Due to this, our statement had to be modified to say that overall the null hypothesis was rejected and alternative hypothesis accepted, meaning that we can not confidently say that the language of origin of the Somerton Man code is english for all variants. Despite this, the null hypothesis could be accepted and English is the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's, 2 W's and is an initialism.

Regardless of the choice to accept or reject the null hypothesis, the similarities in chi-squared and p-values calculated between real 44 letter English Samples and all variants of the Somerton Man code using the Project Gutenberg base text reinforces the notion that the language of origin of the code is indeed English.

Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing hypothesis testing, the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5[41]. This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased. Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.

Project Management - Planning and Feasibility

Work Breakdown/Deliverables

The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.

Timeline

The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure X.

Fig. X: Proposed Project Gantt Chart

The final Gantt Chart after all revisions and updates can be seen in Figure X.

Fig. X: Final Project Gantt Chart

Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis. Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended. Task 3 was also extended so that Jikai was able to complete this task. Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early. Due to this, Task 4 was completed in parallel with Tasks 2 and 3 towards the end of the project timeline. The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors. Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.

Task Allocation

The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure X. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the Rubaiyat of Omar Khayyam as a One-time Pad, and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.

Fig. X: Table of Task Allocation

Management Strategy

A number of management strategies were adopted for use throughout the project. One of which was frequent face-to-face contact through regular meetings every 2-3 weeks. Another was regular communication between group members via text message and email. Collaboration is another strategy that was useful, if one member required assistance on a particular task, the other was able to step in and help. This was achieved through the use of flexible task allocation. The group was able to make use of collaborative software including Google Drive for working together on project documents, and Git Hub repository for working together on code for software development. The project Wiki page was updated in real time including the weekly progress section to monitor and review work completed by each member every week, as well as plan tasks for the upcoming week. Finally, the use of a Gantt chart was used as a management strategy to incorporate clearly defined task and goals and established a critical path through use of task dependencies.

Budget

The project budget for this honours group was set at 500 dollars at the commencement of the project. It was initially proposed for the budget to depend on the n-gram database chosen to be used for the search engine in Task 2. As discussed in the Method section of Task 2: N-Gram Search, a variety of options were considered and the main two largest databases were found to be Microsoft Web N-Gram Services[42], and Google N-Gram [43].

The Microsoft alternative was found to be free to use for academic purposes after applying for a user token, and is stored for free on Microsoft’s web server, hence there was no need to purchase storage upon which to store the database[44].

The Google alternative was available for free when obtaining the raw dataset, or at a cost of 150 dollars for a student license when purchased from the University of Pennsylvania Linguistic Data Consortium [45]. Unlike the Microsoft alternative, if the Google N-Gram option was chosen, a portion of the budget would have had to be dedicated to storing the database. It was initially proposed to store the database on a hard drive at a cost of approximately 100 dollars.

The proposed budget can be seen in the tables highlighting the key costs of each option in Figure X.

Fig. X: Proposed Budgeting Table

For reasons discussed in the Method section of Task 2: N-Gram Search, upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google. A decision was made to utilise the free database provided by Google as it was not deemed justifiable to spend $150 on the processed data from the Linguistic Data Consortium since it was proposed that the raw dataset could be cleaned up through writing software.

The initial budget was based on the assumption that the Google N-Gram database could be stored locally, although this was feasibly possible in its compressed form, the local computing power available would have been insufficient to run the search engine code through the database within a the time frame of the project. As discussed in the Method section of Task 2: N-Gram Search, a cloud based computing service called ‘Amazon Elastic Compute Cloud’ was utilised to store and process the database. The free tier was considered but did not provide the specifications required to meet the needs of our task, and so instances on Amazon EC2 were hired at a rate of 0.853 dollars per hour [46]. Upon storing the initial full database, running our search code, and downloading our results generated from the outputs of the code, the total cost of utilising the service came to 576 dollars. This caused our project to exceed the initially proposed budget. The reason for the additional project expenditure was that despite our efforts, it was difficult to predict the precise time that it would take to upload, store and process the database on the cloud service. The initially proposed budget did not include the need or costing for the Amazon server since this was not something that could be reasonably foreseen at the start of the project since it was initially thought that the Microsoft N-Gram Service would be suitable for the needs of the project, and if this was not suitable, that the Google N-gram alternative would be able to be stored locally.

The final revised budget including total project expenditure can be seen in Figure X.

Fig. X: Final Budgeting Table

In conclusion, despite going over budget, the additional funds were kindly provided by the school of Electrical and Electronic engineering upon sending an application for funding including justification of our purchases. The project work has benefited through the purchase of the Amazon service since we were able to complete a search of specific n-gram combinations of the code on the full Google N-Gram database. It has provided us with results to present as part of our thesis and allowed us to meet the requirements set out in the aim of Task 2.

Risk Analysis

A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure [34][47]. This can be seen in Figure X. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1. This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis. Another risk that occurred throughout the project was Illness. This was able to be dealt with relatively easily through working from home for a short period of time. The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors. Bugs in code were reduced to the best of our ability through thorough testing and debugging of code. Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood. Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.

Fig. 6: Table of Risk Assessment

Conclusions

The work undertaken throughout the project has fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the 'Rubaiyat of Omar Khayyam as a one-time pad in conjunction with a new key technique, and developed a search engine to discover possible n-grams contained within the code. The group was successful in completing all tasks outlined in the proposal, with the exception of the proposed extension task to analyse the mass spectrometer data of the Somerton Man's hair.

Through this, the group was able to critically review and further the statistical analysis of the likely language of origin of the Somerton Man code conducted by previous groups. The 2015 group has improved upon the search for n-grams conducted by previous groups by increasing number of results and search speed. In addition to this, the results collected through implementing the search engine are valuable for future groups to analyse for useful grams that could be linked to the Somerton Man code. The group has also furthered the exploration from previous groups into the possibility that the 'Rubaiyat of Omar Khayyam' was used as a one-time pad to encrypt the Somerton Man code by testing a new key technique.

The skills developed through undertaking this project include text processing and programming in a variety of languages including Java, MATLAB and Python. The group has also thoroughly researched and learnt how to implement and evaluate statistical tehcniques including chi-squared testing, p-value calculation and hypothesis testing and developed skills in using Microsoft Excel software to perform statistical analyses.

The main conclusions drawn from the project work include that the Somerton Man code was not created using the 'Rubaiyat of Omar Khayam' as one-time pad and the proposed method of using letter position within words as the key. Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search. The Universal Declaration of Human rights has too small a sample size of words in each language to accurately represent the initial letter frequency in each language for use in chi-squared testing. Finally, although the results from the hypothesis testing were somewhat inconclusive, the results of all of the chi-squared testing have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an initialism.

Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.

Future Work

The search engine in Task 2 contained a bug in the code that collected the number of years in which each gram occurred, rather than actual frequency of occurrence in each year. This brings about a potential for reanalysis of the database, in particular the 5-gram data, using raw count through minor modification to the search engine code.

There is also the potential to run a limited search on the search engine results using a more sophisticated code than the one used by the 2015 group. This could be used to generate more useful gram combinations to find commonly used English expressions or phrases that could be linked to the Somerton Man code.

Future students could extend the statistical analysis to perform hypothesis testing on all European languages in Project Gutenberg, but an alternate method to the chi-squared testing performed would have to be utilised since the chi-squared values for the code against all other languages were too high to produce usable p-values.

Another option would be to focus on English as the most likely language and statistically analyse the code against genres as conducted by the 2013 group [48], but using the chi-squared method as opposed to the squared difference and standard deviation methods adopted to consolidate or refute the conclusions drawn from their results .

Future groups could also extend the 2013 groups analysis of the mass spectrometer data collected using laser ablation of the Somerton Man's hair [49]. The hair was collected from the plaster bust of the Somerton Man made after his autopsy. The concentration of isotopes found in the hair can be used to find out the environment the Somerton Man lived in in the lead up to his death. Additional data from a separate hair has been taken since the 2013 group's analysis, as well as the concentration of isotopes in the plaster. This could be used by a future group to crosscheck the data between separate hairs, as well as crosscheck the isotopes in the plaster versus the hair to see if the isotopes from the plaster may have diffused into the hair.

Appendices

  • Appendix X: Full Search Engine Code

File:Search Engine Code.pdf

  • Appendix X: Project Gutenberg European Language Comparison Text References

File:Gutenberg European Language Comparison Text References.pdf

  • Appendix X: Project Gutenberg 20 English 44 Letter Text File References

File:Gutenberg 20 English 44 Letter Text File References.pdf

  • Appendix X: Project Gutenberg English Corpus 20 Novels References.pdf

File:Gutenberg English Corpus 20 Novels References.pdf

References

  1. The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
  2. The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
  3. The Advertiser. (2005, March 9). Death riddle of a man with no name [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser_mar2005.pdf
  4. The Advertiser. (1949, June 9). Cryptic Note on Body [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371152
  5. Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
  6. Cleland. (1949). Coroner's Inquest [online]. Available: http://trove.nla.gov.au/ndp/del/article/130195091
  7. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F
  8. Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
  9. YouTube ABC. Somerton Beach Mystery 1978 [online]. Available: https://www.youtube.com/watch?v=ieczsZRQnu8
  10. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:
  11. K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
  12. S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
  13. A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
  14. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  15. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:
  16. K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
  17. S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
  18. A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
  19. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  20. The Advertiser. (1949, June 10). Tamam Shud [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371416
  21. N. Gencarelli and J. K. Yang. (2015, March 15). Cipher Cracking 2015 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cipher_Cracking_2015
  22. S. L. Center. (2015). What is Scots? [online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F
  23. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  24. No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.
  25. T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.
  26. No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.
  27. Microsoft Research. (2015). Microsoft Web N-Gram Services [Online]. Available: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
  28. Google Research Blog. (2006, August 3). All Our N-Gram are Belong to You [Online]. Available: http://googleresearch.blogspot.com.au/2006/08/all-our-n-gram-are-belong-to-you.html
  29. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  30. No Author. Microsoft Web N-Gram Service Quick Start [online]. Available: http://weblm.research.microsoft.com/info/QuickStart.htm
  31. No Author. Visual Studio Languages [online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx
  32. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  33. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  34. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  35. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  36. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  37. Google Books. (2012 July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  38. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  39. J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887
  40. T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.
  41. Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP.
  42. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  43. Google Books. (2012 July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  44. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  45. T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
  46. Amazon Web Services. (2015). Amazon EC2 Pricing [Online]. Available: https://aws.amazon.com/ec2/pricing/
  47. No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf
  48. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  49. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking

Glossary and Symbols

  • ASIO: Australian Security Intelligence Organisation
  • ASIS: Australian Secret Intelligence Service
  • ASD: Australian Signals Directorate
  • P-value theorem: The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis [1].
  • Chi-Squared Test:
  • Universal Declaration of Human Rights:
  • Project Gutenberg:
  • N-gram model: The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs [2].
  • One-time pad: The one-time pad is a decoder technology which cannot be cracked if the correct key is used [3].
  • Initialism: A group of letters formed using the initial letters of a group of words or a phrase [4].
  • Plaintext: The information of an original message, which is desired to be deciphered from the ciphertext [5].
  • Ciphertext: The encoded format of a message [6].
  • Key: What is needed to convert the ciphertext into the plaintext using the one-time pad [7].
  1. B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010. [25] G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009. [26] No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1
  2. A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166. [28] No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17
  3. S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract
  4. No Author. Initialism [online]. Available: http://dictionary.reference.com/browse/initialism
  5. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
  6. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
  7. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf