Final Report/Thesis 2015: Difference between revisions

From Derek
Jump to navigation Jump to search
A1628026 (talk | contribs)
A1673685 (talk | contribs)
No edit summary
 
(114 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==Abstract==
==Abstract==
The project involves the mysterious case of a dead man found at Somerton Beach, South Australia. There was no evidence to show the man’s identification or cause of death, however, there were 5 lines of letters that were found on a scrap of paper in the dead man’s trouser pocket. It was later discovered that the scrap of paper was torn from a book known as ‘The Rubaiyat of Omar Khayyam’.  These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. The case still remains unsolved today, and so this project has been undertaken in order to uncover further case evidence.
The project involves the mysterious case of a dead man found at Somerton Beach, South Australia. There was no evidence to show the man’s identification or cause of death, however, there were 5 lines of letters that were found on a scrap of paper in the dead man’s trouser pocket. It was later discovered that the scrap of paper was torn from a book known as ''The Rubaiyat of Omar Khayyam''.  These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. The case still remains unsolved today, and so this project has been undertaken in order to uncover further case evidence.
The aims and objectives of the project include using various computational techniques to statistically analyse the likely language of origin of the code, designing and implementing software in order to decipher the code, and ultimately attempting to solve the cold case.  
The aims and objectives of the project include using various computational techniques to statistically analyse the likely language of origin of the code, designing and implementing software in order to decipher the code, and ultimately attempting to solve the cold case.  


Line 6: Line 6:
This project involves the design and implementation of software in order to decipher the code associated with the Somerton Man murder mystery. This document is a Final Report and Thesis outlining the key aims, methods, results and evaluations and justifications of the specific tasks involved in the Code Cracking: Who Murdered The Somerton Man? Honours project.  The information in the report also includes the motivation, previous studies, aims, objectives, and significance of the project, as well as technical background, knowledge gaps, technical challenges, specific project tasks and project management resources.
This project involves the design and implementation of software in order to decipher the code associated with the Somerton Man murder mystery. This document is a Final Report and Thesis outlining the key aims, methods, results and evaluations and justifications of the specific tasks involved in the Code Cracking: Who Murdered The Somerton Man? Honours project.  The information in the report also includes the motivation, previous studies, aims, objectives, and significance of the project, as well as technical background, knowledge gaps, technical challenges, specific project tasks and project management resources.


Write key findings etc. from conclusion
The project fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'' in conjunction with a new key technique, and developed a search engine to discover possible n-grams contained within the code.
 
The conclusions drawn from the project work include that the Somerton Man code was not created using the ''Rubaiyat of Omar Khayam'' as a ''one-time pad'' and the proposed method of using letter position within words as the key. Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search engine developed.  Finally, the results of the chi-squared testing have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an ''initialism''.
 
Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.


==Acknowledgements==
==Acknowledgements==
Line 15: Line 19:


===Motivation===
===Motivation===
[[File:Somerton_Man_Code_Photo.png|thumb|200px|right|'''Fig. 1:''' Photograph of the letters making up the Somerton Man Code]]
[[File:Somerton_Man_Code_Photo.png|thumb|200px|right|'''Fig. 1:''' Photograph of the letters making up the Somerton Man Code <ref>J. Fettes. (2013). Professor’s 15-year search for answers seeks to crack the secret code to the death of the ‘Somerton man’ found on an Adelaide Beach [online]. Available: http://www.heraldsun.com.au/news/law-order/portrait-may- hold-key-to-somerton-man-beach-mystery/story-fni0ffnk-1226674957043</ref>]]
On the 1st of December, 1948, there was a dead body found at Somerton Beach, South Australia <ref>The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161</ref>.  There was no evidence to show the man’s identification and the cause of death <ref>The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161</ref>, however, there were 5 lines of capital letters, with the second line struck out, that were found on a scrap of paper in the dead man’s trouser pocket <ref>The Advertiser. (2005, March 9). Death riddle of a man with no name [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser_mar2005.pdf</ref>.  A photo of the paper containing the letters can be seen in Figure 1.  It was later discovered that the scrap of paper was torn from a book known as the 'Rubaiyat of Omar Khayyam' <ref>The Advertiser. (1949, June 9). Cryptic Note on Body [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371152</ref>.  These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort.  As engineers, we have the ability to help investigators in solving the case.  With that in mind, this project is being undertaken to attempt to decrypt the code in order to help solve the cold case.
On the 1st of December, 1948, the body of a man was found at Somerton Beach, South Australia <ref>The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161</ref>.  There was no evidence to show the man’s identification and the cause of death <ref>The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161</ref>, however, there were 5 lines of capital letters, with the second line struck out, that were found on a scrap of paper in the dead man’s trouser pocket <ref>The Advertiser. (2005, March 9). Death riddle of a man with no name [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser_mar2005.pdf</ref>.  A photo of the paper containing the letters can be seen in Figure 1.  It was later discovered that the scrap of paper was torn from a book known as the ''Rubaiyat of Omar Khayyam'' <ref>The Advertiser. (1949, June 9). Cryptic Note on Body [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371152</ref>.  These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort.  As engineers, we have the ability to help investigators in solving the case.  With that in mind, this project is being undertaken to attempt to decrypt the code in order to help solve the cold case.


The South Australian Police stand to benefit from this project not only from the decoding technology developed for this case, but it also may be able to be applied to solve similar cases.  Historians may be interested in gaining further historical information from this project since the case occurred during the heightened tension of the Cold War, and it is speculated that this case may be related in some way <ref>Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case</ref>.  Pathologists may also be interested as the cause of death may have been an unknown or undetectable poison <ref>Cleland. (1949). Coroner's Inquest [online]. Available: http://trove.nla.gov.au/ndp/del/article/130195091</ref>.  This project stands to benefit the wider community as well as extended family of the unknown man to provide closure to the mysterious case.  Professor Derek Abbot also stands to benefit as he has been working closely with honours project students for the past seven years in an attempt to decipher the Somerton Man code.
The South Australian Police stand to benefit from this project not only from the decoding technology developed for this case, but it also may be able to be applied to solve similar cases.  Historians may be interested in gaining further historical information from this project since the case occurred during the heightened tension of the Cold War, and it is speculated that this case may be related in some way <ref>Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case</ref>.  Pathologists may also be interested as the cause of death may have been an unknown or undetectable poison <ref>Cleland. (1949). Coroner's Inquest [online]. Available: http://trove.nla.gov.au/ndp/del/article/130195091</ref>.  This project stands to benefit the wider community as well as extended family of the unknown man to provide closure to the mysterious case.  Professor Derek Abbot also stands to benefit as he has been working closely with honours project students for the past seven years in an attempt to decipher the Somerton Man code.


===Previous Studies/Related Work===
===Previous Studies/Related Work===
[[File:3D_generated_reconstruction_of_bust_of_Somerton_Man_from_2012_Final_Report.png|thumb|200px|right|'''Fig. 2:''' 3D generated reconstruction of bust of Somerton Man from 2012 Final Report]]
[[File:3D_generated_reconstruction_of_bust_of_Somerton_Man_from_2012_Final_Report.png|thumb|200px|right|'''Fig. 2:''' 3D generated reconstruction of bust of Somerton Man from 2012 Final Report <ref>A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Re port_2012</ref>]]
Previous professional attempts to decipher the code were limited since they did not use modern techniques or have access to modern databases.  Another limitation is that some of the characters in the code appear to be ambiguous and previous attempts made fixed assumptions on particular characters <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F</ref>.
Previous professional attempts to decipher the code were limited since they did not use modern techniques or have access to modern databases.  Another limitation is that some of the characters in the code appear to be ambiguous and previous attempts made fixed assumptions on particular characters <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F</ref>.
The Australian Navy’s response was that the letters were “neither a code nor a cipher” <ref>Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case</ref>.
The Australian Navy’s response was that the letters were “neither a code nor a cipher” <ref>Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case</ref>.
Line 28: Line 32:
*b) “The symbols could be a complex substitute code or the meaningless response to a disturbed mind”
*b) “The symbols could be a complex substitute code or the meaningless response to a disturbed mind”
*c) “It is not possible to provide a satisfactory answer” <ref>YouTube ABC. Somerton Beach Mystery 1978 [online]. Available: https://www.youtube.com/watch?v=ieczsZRQnu8</ref>  
*c) “It is not possible to provide a satisfactory answer” <ref>YouTube ABC. Somerton Beach Mystery 1978 [online]. Available: https://www.youtube.com/watch?v=ieczsZRQnu8</ref>  
Other previous studies into deciphering the code include Honours Projects at the University of Adelaide from 2009-2013.  The previous work undertaken by these groups includes: multiple evolutions of letter frequency analysis of the code on a variety of base texts in a number of languages, initial letter and sentence letter probabilities, the probabilities of known cypher techniques, the likelihood of the code being an initialism of a poem, the use of various one-time pad techniques,  the design and implementation of a web crawler, the analysis of text type and genre of the code’s likely plaintext, the implementation of pattern matching software into the web crawler, a 3D generated reconstruction of the bust of the Somerton Man (see Figure 2) and the analysis of mass spectrometer data taken from the Somerton Man’s hair <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:</ref> <ref>K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref> <ref>S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011</ref> <ref>A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012</ref> <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.
Other previous studies into deciphering the code include Honours Projects at the University of Adelaide from 2009-2013.  The previous work undertaken by these groups includes: multiple evolutions of letter frequency analysis of the code on a variety of base texts in a number of languages, initial letter and sentence letter probabilities, the probabilities of known cypher techniques, the likelihood of the code being an ''initialism'' of a poem, the use of various ''one-time pad'' techniques,  the design and implementation of a web crawler, the analysis of text type and genre of the code’s likely ''plaintext'', the implementation of pattern matching software into the web crawler, a 3D generated reconstruction of the bust of the Somerton Man (see Figure 2) and the analysis of mass spectrometer data taken from the Somerton Man’s hair <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:</ref> <ref>K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref> <ref>S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011</ref> <ref>A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012</ref> <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.
The main conclusions that past groups have come to in their projects are: that the letters are unlikely to be random, the code is unlikely to be an initialism, it is likely that the Rubaiyat of Omar Khayyam was used as a one-time pad, the language of the code is likely to be English, the code is unlikely to be an initialism of a poem and that the Rubaiyat of Omar Khayyam was not used as a straight substitution one-time pad <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:</ref> <ref>K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref> <ref>S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011</ref> <ref>A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012</ref> <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.
The main conclusions that past groups have come to in their projects are: that the letters are unlikely to be random, the code is unlikely to be an ''initialism'', it is likely that the ''Rubaiyat of Omar Khayyam'' was used as a ''one-time pad'', the language of the code is likely to be English, the code is unlikely to be an ''initialism'' of a poem and that the ''Rubaiyat of Omar Khayyam'' was not used as a straight substitution ''one-time pad'' <ref>A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:</ref> <ref>K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref> <ref>S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011</ref> <ref>A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012</ref> <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.
The analysis and extension upon specific elements of previous work that are directly related to the 2015 group’s project are discussed in the 'Method – Specific Tasks' section.
The analysis and extension upon specific elements of previous work that are directly related to the 2015 group’s project are discussed in the 'Method – Specific Tasks' section.


===Aims and Objectives===
===Aims and Objectives===
The key aims and objectives in this project included the aim to statistically analyse the likely language of the plaintext of the code.  Another aim was to design and implement software in order to try and decipher the code.  This was to be implemented by using the 'Rubaiyat of Omar Khayyam' as a one-time pad in conjunction with a new key technique, and by developing a search engine to try to discover possible n-grams contained within the code.  The third aim was to analyse mass spectrometer isotope concentration data of the Somerton Man’s hair.  Finally, the ultimate aim was to decrypt the code in order to solve the mystery, however this was somewhat unrealistic as the code has remained uncracked for many years.  Despite this, computational techniques were to be utilised to attempt the decryption, and at the very least, the past research into the case was to be furthered for future Honours students.
The key aims and objectives in this project included the aim to statistically analyse the likely language of the ''plaintext'' of the code.  Another aim was to design and implement software in order to try and decipher the code.  This was to be implemented by using the ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'' in conjunction with a new ''key'' technique, and by developing a search engine to try to discover possible n-grams contained within the code.  The third aim was to analyse mass spectrometer isotope concentration data of the Somerton Man’s hair.  Finally, the ultimate aim was to decrypt the code in order to solve the mystery, however this was somewhat unrealistic as the code has remained uncracked for many years.  Despite this, computational techniques were to be utilised to attempt the decryption, and at the very least, the past research into the case was to be furthered for future Honours students.


===Significance===
===Significance===
Considered “one of Australia’s most profound mysteries” at the time <ref>The Advertiser. (1949, June 10). Tamam Shud [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371416</ref>, this case still remains unsolved today.  As the development of decoder technology and the related knowledge progresses, this project poses the opportunity to uncover further case evidence.  The skills developed in undertaking this project were also of great significance in a broader sense, as these can be transferrable to possible future career paths.  The techniques developed include: software and programming skills, information theory, probability, statistics, encryption and decryption, datamining and database trawling.  The job areas and industries that these skills can be applied to are: computer security, communications, digital forensics, computational linguistics, defence, software, e-finance, e-security, telecommunications, search engines and information technology.  Some possible job examples include working at: Google, ASIO, ASIS and ASD <ref>N. Gencarelli and J. K. Yang. (2015, March 15). Cipher Cracking 2015 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cipher_Cracking_2015</ref>.
Considered “one of Australia’s most profound mysteries” at the time <ref>The Advertiser. (1949, June 10). Tamam Shud [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371416</ref>, this case still remains unsolved today.  As the development of decoder technology and the related knowledge progresses, this project poses the opportunity to uncover further case evidence.  The skills developed in undertaking this project were also of great significance in a broader sense, as these can be transferrable to possible future career paths.  The techniques developed include: software and programming skills, information theory, probability, statistics, encryption and decryption, datamining and database trawling.  The job areas and industries that these skills can be applied to are: computer security, communications, digital forensics, computational linguistics, defence, software, e-finance, e-security, telecommunications, search engines and information technology.  Some possible job examples include working at: Google, ''ASIO'', ''ASIS'' and ''ASD'' <ref>N. Gencarelli and J. K. Yang. (2015, March 15). Cipher Cracking 2015 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cipher_Cracking_2015</ref>.


==Technical Background==
==Technical Background==


===P-Value Theorem Explanation===
===P-Value Theorem Explanation===
The ''p-value'' is the probability of an effect equal to or greater than the one observed, presuming the null hypothesis of no effect is true. It is a measure of evidence against a null hypothesis <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation
for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3),
pp.885-892 2010.</ref>.
In statistics, the ''p-value'' is used for testing a statistical hypothesis by observing
sample results. ''P-value'' testing is an effective method to presume whether the null
hypothesis of no effect is true. In this project, we will use ''p-values'' to do some
simple tests to determine whether English was the language in the original message
<ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation
for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3),
pp.885-892 2010.</ref>.
In this case, we make an assumption that English was not the language in the
original message. Thus we should get a larger ''p-value'' which is higher than 0.05 in
the simple test. This means the observed data point is in the range of “more likely
observation”, mentioned in Figure 1 below. Oppositely, if we get a smaller ''p-value''
which is lower than 0.05 in the simple test, this means the observed data point is in
the range of “very un-likely observations”. This indicates our assumption is wrong
and that English was the language in the original message <ref>G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic
Surgery, Vol.87(5), pp.1337-1343 2009.</ref>.
[[File: Technical_Background_P-value.png |thumb|700px|centre|'''Fig. 3:''' Example of a ''p-value'' computation <ref>No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P- value#cite_note-nature506-1</ref>]]
As the graph in Figure 3 shows, the horizontal coordinate is the set of possible results, and the
vertical coordinate is the probability density. The ''p-value'' is in the right side of
observed data point, which is between the curve and horizontal coordinate <ref>No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/Pvalue#
cite_note-nature506-1.</ref>.


===Chi-Squared Test Explanation===
===Chi-Squared Test Explanation===
The ''chi-squared test'' is a statistical test commonly used to compare observed data with expected data. It requires than there is no relationship between the observed data and expected data. It means the observed data should not affect the expected data and they are independent respectively. <ref> N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000. </ref>.
The ''chi-squared test'' a ‘goodness of fit’ test, meaning it measures how closely one group of data is related to another. Since it is a non-parametric test, it doesn’t care about distribution of samples. It means there is no requirement that expected and observed data should follow a normal distribution <ref> N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944 </ref>.
[[File: Technical_Background_Chi-squared.png |thumb|1000px|centre|'''Fig. 4:''' Chi-squared example]]
Figure 4 is a simple example to show how our group is to use the ''chi-squared test'' for Task 1.  Firstly, we can calculate the chi-squared value for the letter A by using the equation on the right of Figure 4. The observed value is the count of letter A in the Somerton Man code, the expected value is the total number of letters in Somerton Man code which is 44, multiplied by the frequency of letter A in the base text language used.  This is repeated to calculate the chi-squared value for the rest of the letters from B to Z.  Finally, the chi-squared value for all letters are summed. This is the chi-squared value between the base text language and Somerton Man code. Lower Chi-squared values mean the Somerton Man code is more likely to be this language.


===Universal Declaration of Human Rights Explanation===
===Universal Declaration of Human Rights Explanation===
The ''Universal Declaration of Human Rights'' was created in 1948 and is translated into over 400 different languages <ref>Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml </ref>.  The group used translations of this declaration as the base text for Task 1: Statistical Frequency Analysis of Letters.
[[File: Technical_Background_UN_Declaration.png |thumb|300px|centre|'''Fig. 5:''' ''Universal Declaration of Human Rights'' From British Library<ref>The British Library Board. Taking Liberties [online]. Available: http://www.bl.uk/onlinegallery/takingliberties/staritems/645universaldeclarationhumanrightspic.html</ref>]]


===Project Gutenberg Explanation===
===Project Gutenberg Explanation===
[[File: Technical_Background_Project_Gutenberg.png |thumb|300px|right|'''Fig. 6:''' ''Project Gutenberg'' Logo <ref>Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/</ref>]]
''Project Gutenberg'' was established in 1971. It offers over 50,000 free ebooks available in a range of different formats including TXT, HTML, PDF and EBOOK and a number of languages including Chinese, Danish, Dutch, English, Esperanto, Finnish,  French, German, Greek, Hungarian, Italian, Latin, Portuguese, Spanish, Swedish and Tagalog <ref> Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/ </ref>.  The database was used to provide base texts for Task 4: Statistical Frequency of Letters Reanalysis due to the increased sample size of letter frequencies in each language when compared to the ''Universal Declaration of Human Rights'' as a base text.


===N-Gram Model Explanation===
===N-Gram Model Explanation===
The ''n-gram model'' is a sequence of n items from a given sequence of phonemes,
syllables, letters, words or base pairs. The n-grams typically are collected from
articles or books. If the elements are words, n-grams may also be called shingles
<ref> A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and
ISDN Systems 29 (8), pp.1157–1166. </ref>.
Size n=1 of the ''n-gram model'' is called a "unigram", size 2=2 of the ''n-gram model'' is called a "bigram", size n=23 of the ''n-gram mode''l is called a "trigram", and so on.
An ''n-gram model'' is a type of probabilistic language model for predicting the next item. In a word, the item n is only related to item (n-1) regardless of any other items in the sequence <ref> No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17 </ref>.  The group will use an n-gram database to find common groups of words for a variety of
''initialisms''.
In Figure 7 shown below, we use one line from the Somerton Man code as a sample sequence to produce the ''n-gram model''.
[[File: Technical_Background_N-gram.png |thumb|600px|centre|'''Fig. 7:''' ''N-gram model'' applied to first line of Somerton Man code]]


===One-Time Pad Explanation===
===One-Time Pad Explanation===
The ''one-time pad'' is a decoder technology which cannot be cracked if the correct
''key'' is used.  For example, if we have a ''ciphertext'', we need the ''key'' to decode it
and get correct message <ref> S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad
[online]. Available:
http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/0161
1194.2011.583711#abstract </ref>.
[[File: Technical_Background_One-Time_Pad.png |thumb|1000px|centre|'''Fig. 8:''' ''One-Time Pad'' Example <ref>No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads</ref>]]
As in Figure 8 <ref> No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads </ref> shown above, we have the cipher text: EQNVZ. E is the fifth letter in the alphabet, so we label it as 4 (initially we label the first letter, A, as 0).  Similarly, the letter Q is 16, N is 13, V is 21 and Z is 25. If we have a ''key'' which is: XMCKL, we can label X as 23, M as 12, C as 2, K as 10 and L as 11 using the same method as above. Then we use E minus X and process all cipher text elements one by one. Negative numbers are not used, so E minus X equals -19, then we add 26 and get the number 7, which represents letter H. Finally, we find the original message to be HELLO. The ''one-time pad'' technique will be used as another method to attempt to decipher the Somerton Man code.


==Knowledge Gaps and Technical Challenges==
==Knowledge Gaps and Technical Challenges==
In order to complete the specific tasks proposed within this project, the group members were required to develop new skills in text processing and programming in a variety of languages. These languages included Java, MATLAB, and Python.  The group was also required to research and learn how to implement and evaluate statistical techniques including chi-squared testing, p-value calculation and hypothesis testing.  Another necessary skill to be developed was the use of Microsoft Excel software to perform statistical analyses.  The technical challenges that were to be faced within the project were directly associated with these knowledge gaps.
In order to complete the specific tasks proposed within this project, the group members were required to develop new skills in text processing and programming in a variety of languages. These languages included Java, MATLAB, and Python.  The group was also required to research and learn how to implement and evaluate statistical techniques including ''chi-squared testing'', ''p-value'' calculation and ''hypothesis testing''.  Another necessary skill to be developed was the use of Microsoft Excel software to perform statistical analyses.  The technical challenges that were to be faced within the project were directly associated with these knowledge gaps.


==Method - Specific Tasks==
==Method - Specific Tasks==
Line 59: Line 124:
===Task 1: Statistical Frequency Analysis of Letters===
===Task 1: Statistical Frequency Analysis of Letters===
====Aim====
====Aim====
A Critical review of the statistical frequency analysis of the letters from the 2013 group was to be conducted.  It was then proposed that the 2015 group was to repeat the statistical tests done by the 2013 group. Like the 2013 group, the ''Universal Declaration of Human Rights'' was to be used as the base text based on advice from Professor Abbott, but the validity of this text being used as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be used on 44 letters out of pieces of text from the most likely languages. The 2013 group’s analysis was then to be extended by the 2015 group by calculating ''p-values'' and implementing ''hypothesis testing''. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run ''p-value'' tests on benchmark pieces of text from the most likely languages and see if the ''p-values'' suggested that the letters are indeed from those languages. The use of benchmark pieces of text were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the ''Universal Declaration of Human Rights'' as a base text. Once the most likely language was determined, the group was to process an additional number of benchmarks of that language and obtain a mean ''p-value''. Next, a ''hypothesis test'' could be performed based on the mean ''p-value'' obtained from the benchmarks, when compared to, the ''p-value'' of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this ''hypothesis testing'' method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.


*Critical review of previous results
====Method====
*Use statistical methods to verify whether the most possible language of Somerton Code is English
=====2013 Statistical Frequency Analysis Review=====
*Universal Declaration of Human Rights as base text
A critical review of the statistical frequency analysis of the letters from the 2013 group has been conducted to determine possible extensions to be undertaken by the 2015 group.  The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the ''Universal Declaration of Human Rights''.  This document was chosen since it is translated in over 400 languages.  The 2015 group is to analyse this choice of base document using statistical techniques.  266 languages were analysed, since many of the translations were not text files, but paper scans.  This seems like a reasonable omission due to the time constraints of the project and the common European languages are all in text form.  The analysis included accented letters normalised to their ‘parent’ characters, for example, considering 'ǎ' as 'a'.  The languages were then analysed using a variety of combinations of the ambiguous letters within the code.  The 2013 group’s statistical analysis was then refined to the top 20 closest European languages.  These included more uncommon European languages such as Scots and Vepsian.  The graphs of the results from the 2013 group’s analysis can be seen in Figure 9.  The most likely language of the Somerton Man code and control text was found to be Scots.  This is unlikely since it is not a common European language as it is a Scottish dialect <ref>S. L. Center. (2015). What is Scots? [online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F</ref>.  Due to this, Scots, and other less common European languages will be omitted in the 2015 group’s statistical analysis.  Instead, only the most common European languages will be included in order to further refine the statistical process.  Despite this, a conclusion was made that the most likely language was English since Scots and English shared very similar initial letters.
*Find out how common letters are in each language
In order to test this conclusion, the 2013 group used the English translation of the ''Rubaiyat of Omar Khayyam'' as an English control text.  The first 44 words from the text were used since this is the same number as the number of letters in the Somerton Man code.  The results showed, once again, that Scots was the most likely language, followed by English (see Figure 9).
*Calculate the P-value between Somerton Man code and other languages
Analysing this choice of control text, the choice was made by the 2013 group as a matter of convenience.  This may not have been the best decision to make as the ''Rubaiyat of Omar Khayyam'' was originally written in Persian, and has since been translated into English.  The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text as the translation may use uncommon words or expressions.  Another reason for the suspected use of uncommon words or expressions is the fact that the text is a book made up of four line poems, meaning that it may not accurately represent commonly used words or letters in such a small sample size of 44 words.  The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel, originally written in the most likely language.
[[File:2013_Statistical_Language_Analysis_Graphed_Results.png|thumb|700px|centre|'''Fig. 9:''' 2013 Statistical Language Analysis Graphed Results <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester _B_Final_Report_2013_-_Cipher_cracking</ref>]]


=====2013 Statistical Frequency Analysis Review=====
=====Assumptions=====
A critical review of the statistical frequency analysis of the letters from the 2013 group has been conducted to determine possible extensions to be undertaken by the 2015 group. The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the Universal Declaration of Human Rights. This document was chosen since it is translated in over 400 languages.  The 2015 group is to analyse this choice of base document using statistical techniques.  266 languages were analysed, since many of the translations were not text files, but paper scans. This seems like a reasonable omission due to the time constraints of the project and the common European languages are all in text form.  The analysis included accented letters normalised to their ‘parent’ characters, for example, considering 'ǎ' as 'a'The languages were then analysed using a variety of combinations of the ambiguous letters within the code. The 2013 group’s statistical analysis was then refined to the top 20 closest European languages.  These included more uncommon European languages such as Scots and Vepsian.  The graphs of the results from the 2013 group’s analysis can be seen in Figure 1.1.  The most likely language of the Somerton Man code and control text was found to be Scots.  This is unlikely since it is not a common European language as it is a Scottish dialect <ref>S. L. Center. (2015). What is Scots? [online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F</ref>.  Due to this, Scots, and other less common European languages will be omitted in the 2015 group’s statistical analysis. Instead, only the most common European languages will be included in order to further refine the statistical process.  Despite this, a conclusion was made that the most likely language was English since Scots and English shared very similar initial letters.
Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an ''initialism'', meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letterFurther assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. This assumption entailed the use of versions of the code including 6Ms, 4Ms and 2Ws, and the code as it appeared in the original police report from 1949<ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.  These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.  
In order to test this conclusion, the 2013 group used the English translation of the Rubaiyat of Omar Khayyam as an English control text.  The first 44 words from the text were used since this is the same number as the number of letters in the Somerton Man code. The results showed, once again, that Scots was the most likely language, followed by English (see Figure 1.1).
Analysing this choice of control text, the choice was made by the 2013 group as a matter of convenience.  This may not have been the best decision to make as the Rubaiyat of Omar Khayyam was originally written in Persian, and has since been translated into English.  The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text as the translation may use uncommon words or expressions.  Another reason for the suspected use of uncommon words or expressions is the fact that the text is a book made up of four line poems, meaning that it may not accurately represent commonly used words or letters in such a small sample size of 44 words.  The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel, originally written in the most likely language.
[[File:2013_Statistical_Language_Analysis_Graphed_Results.png|thumb|700px|centre|'''Fig. 1.1:''' 2013 Statistical Language Analysis Graphed Results]]


=====Base Text=====
=====Base Text=====
The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.
The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.


A spread sheet containing the letter frequency data from the Universal Declaration of Human Rights was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency java code. This data spread sheet was used for statistical calculations for comparison of the letter frequency in the code versus each language in the Universal Declaration of Human Rights.
A spread sheet containing the letter frequency data from the ''Universal Declaration of Human Rights'' was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency java code. This data spread sheet was used for statistical calculations for comparison of the letter frequency in the code versus each language in the ''Universal Declaration of Human Rights''.  
 
=====Statistical Methods=====
We prepare to use both Parametric Testing and Non-Parametric Testing for Task 1.
 
====Method====
=====Assumptions=====
Before commencing statistical calculation, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text will be considered for analysis, and not every letter. This assumption has been made since in Task 2: The Web Crawler Re-design, the crawler will search for English grams using each letter in the Somerton Man code as the first letter of each word in the gram phrase. In addition, in Task 3: The Rubaiyat of Omar Khayyam as a One-time Pad, the deciphering technique used relies on the initial letters of words in the pad text. Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.
 
Because the letter M and W in the Somerton code are very similar. Our group make 3 assumptions for this case. First assumption is All 6 M or W letters in the code are M. Second assumption is All 6 M OR W letters in the code are 4 Ms and 2 Ws. Third assumption is All 6 M OR W letters in the code are 5 Ms and 1 W. The last assumption we make as it appears in the original police report from 1949.


=====Parametric Testing=====
=====Parametric Testing=====
Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the ''Universal Declaration of Human Rights'', as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the ''p-value'' calculation method used was tested using various pieces of software. A test case with known ''p-value'' result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same ''p-value'' using the test case with known ''p-value'' and so the ANOVA method was verified to be functioning correctly.


Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.
======Parametric Testing Results======
 
ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Both of these methods did not produce valid ''p-values'' as they used the comparison of total means and variances.
=====Non-Parametric Testing=====
Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes.  This was computed using MATLAB, but also produced unusable ''p-values'' since the method uses mean and standard deviation of the differences between samples to calculate the ''p-value'', thus letter and text type data is lost in the process and so is not applicable.
 
The ''p-values'' calculated using these methods fluctuated depending on the type of data usedIf the raw number of letter frequencies for the sample text or code (44 letters) versus the ''Universal Declaration of Human Rights'' reference text (1000-2000 letters depending on language) was used, then the ''p-value'' became very small since the total means of each text were vastly different. Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a ''p-value'' of 1 since the means became the same.  Thus this method of mean comparison did not work for letter frequency analysis.
Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test.
These 3 tests were researched and considered but ultimately the chi-squared test was chosen based on a cryptography reference found that uses the chi-squared test to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref>, as well as advice from supervisor Dr Matthew BerrymanAn extension for future Honours groups could be to attempt the analysis with other non-parametric tests such as the Kolmogrov-Smirnov Test or Mann-Whitney U Test.
The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights.  The sample texts chosen were the first 44 letters from a the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.
 
====Results And Evaluation====
=====Parametric Testing=====


ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Both of these methods did not produce valid p-values as they used the comparison of total means and variances. Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes. This was computed using MATLAB, but also produced unusable p-values since the method uses mean and standard deviation of the differences between samples to calculate the p-value, thus letter and text type data is lost in the process and so is not applicable. The p-values calculated using these methods fluctuated depending on the type of data used. If the raw number of letter frequencies for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) was used, then the p-value became very small since the total means of each text were vastly different. Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a p-value of 1 since the means became the same. Thus this method of mean comparison did not work for letter frequency analysis.  
======Parametric Testing Evaluation======
All of these methods did not compare individual data entries from one group to their corresponding entry in another group.  Instead they used the sample size, sum, mean, and variance of each group to compare to the other group.  Because of this, this method could not be used when comparing letter frequency between languages. For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.


=====Non-Parametric Testing=====
=====Non-Parametric Testing=====
Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and ''Chi-Squared test''.
These 3 tests were researched and considered but ultimately the ''chi-squared test'' was chosen based on a cryptography reference found that uses the ''chi-squared test'' to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref>, as well as advice from supervisor Dr. Berryman.  The base text chosen for testing was the full English translation of the ''Universal Declaration of Human Rights''.  The sample texts chosen were the first 44 letters from a the novel ''Tess of the d’Urbervilles'' by Thomas Hardy as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, as well as the first 44 letters from the German, French and Zapoteco translations of the ''Universal Declaration of Human Rights''.


[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 1.2:''' Task 1 Chi-squared Formula]]
======Non-Parametric Testing Initial Results And Evaluation======
 
Initially, a result was unable to be computed for the Chi-Squared value or ''p-value'' using the ''chi-squared test'' method.  It was soon discovered that in the calculation of the Chi-Squared value, if the frequency of a letter in the reference text (a particular language from the Declaration of Human Rights) was 0, this caused the denominator of the Chi-Squared value equation to be 0 and thus the equation became invalid (See Figure 10).
During our Chi-squared value calculation, we meet a serious problem. There are some letters appear 0 times in one particular language in our base text from Declaration of Human Rights. According to figure 1.2, it will cause the denominator of the Chi-Squared value equation to be 0 and as we all know the denominator cannot be 0 in the equation.  We have to make some reasonable assumptions so that we are able to continue using Chi-squared testing for our task 1. First assumption is we use number 1 which is the closest integer to 0 to instead of 0 when the letter counts 0 in this language.  Second assumption is we use 0.001 and 0.0001 to instead of 0 frequencies in same case rather than the number 1. For the first assumption, it would not have significant enough effect on the results to cause a skew in the data. However, for the second assumption, this caused the result to very large Chi-Squared values (in the order of 10^10) and very small p-values (approaching 0).


=====Top 20 European Language Comparison=====
[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 10:''' Chi-Squared Formula]]


The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the Chi-Squared test against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run as a method of comparison to test how similar our results were to the 2013 group’s results. We will display the results for our two assumptions mentioned before.
An assumption in the method of the ''chi-squared testing'' had to be made in an attempt to rectify the issue.  The count for letters in the reference text that appeared 0 times, were altered to 1, a small number chosen to be the closest integer value to 0 in the hope that it would not have significant enough effect on the results to cause a skew in the data.  This assumption was made since the sample size for the reference text was too small, and so not every letter appeared at least once.  After doing this, and computing all results, it was realised by the group that despite the initial results when changing the count from 0 to 1 looking promising, the frequencies of letters that actually appeared once had the same frequency as letters that did not appear at all and thus reduced the accuracy of the data used for the analysis. This was rectified by altering the frequencies of letters appearing 0 times to 0.0001, since the frequencies for letters occurring 1 time had a frequency of approximately 0.0005, and so we had to select a number lower than the lowest occurrence.  The results of both methods are compared and contrasted in the following sections.


======Results======
{|style="margin: 0 auto;"
[[File: Squared_Difference_versus_2013_Number.png |thumb|1000px|centre|'''Fig. 1.3:''' Squared Difference versus 2013 Number]]
| [[File: 0_to_1_Validation.png |thumb|400px|centre|'''Fig. 11:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Count from 0 to 1)]]
| [[File: 0_to_0.0001_Validaition.png |thumb|400px|centre|'''Fig. 12:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Frequency from 0 to 0.0001)]]
|}


Figure 1.3 shows the Chi-squared value of closest 20 languages base on 2013 group’s Squared Difference with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use number 1 to instead of counts 0 of letters.
The initial ''chi-squared test'' using both methods resulted in large chi-squared values that were brought about by a small text size sample of 44 letters (See Figures 11 and 12). This effectively caused the ''p-value'' results to be very small numbers and therefore were unable to be used as a means of comparison and caused the null hypothesis to be rejected in every case.  However, this test could still be used as a measure of similarity since the Chi-Squared Values for each text comparison can be compared based on the fact that the smaller the Chi-Squared value, the more similar the two texts are.  Thus, despite being unable to obtain valid ''p-values'' and perform ''hypothesis testing'', the texts could still be ranked based on the Chi-Squared value in the full analysis.  Comparing the results from the count 0 to 1, versus the frequency 0 to 0.0001 shows that the chi-squared values were reduced overall in the second analysis, except for the chi-squared values calculated for the Somerton Man code.


[[File: Standard_Deviation_versus_2013_Number.png |thumb|1000px|centre|'''Fig. 1.4:''' Standard Deviation versus 2013 Number]]
======Top 20 European Language Comparison======
The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the ''chi-squared test'' against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run as a method of comparison to test how similar our results were to the 2013 group’s results. The results based on the two assumptions mentioned will be displayed.


Figure 1.4 shows the Chi-squared value of closest 20 languages base on 2013 group’s Standard Deviation with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use number 1 to instead of counts 0 of letters.
======Top 20 European Language Comparison Results======
[[File: Squared_Difference_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 13:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Count from 0 to 1)]]


[[File: Squared_Difference_versus_2013_Frequency.png |thumb|1000px|centre|'''Fig. 1.5:''' Squared Difference versus 2013 Frequency]]
[[File: Standard_Deviation_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 14:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Count from 0 to 1)]]


Figure 1.5 shows the Chi-squared value of closest 20 languages base on 2013 group’s Squared Difference with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.
[[File: Squared_Difference_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 15:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Frequency from 0 to 0.0001)]]


[[File: Standard_Deviation_versus_2013_Frequency.png |thumb|1000px|centre|'''Fig. 1.6:''' Standard Deviation versus 2013 Frequency]]
[[File: Standard_Deviation_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 16:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Frequency from 0 to 0.0001)]]
 
Figure 1.6 shows the Chi-squared value of closest 20 languages base on 2013 group’s Standard Deviation with 5 Ms and 1 W in code (the left graph) versus 2013 group’s result (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.
 
 
======Evaluation======


======Top 20 European Language Comparison Evaluation======
The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.
The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.


======Top 20 European Languages based on Estimated Number of Native Speakers======
In extension to the 2013 group’s work, the 2015 group conducted a ''chi-squared test'' of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers <ref>No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.</ref>.  This test included all versions of the Somerton Man code including the versions with 6 Ms, 4 Ms and 2 Ws, 5 Ms and 1 W and then the average of these results were also plotted.


=====Top 20 European Languages based on Estimated Number of Native Speakers=====
======Top 20 European Languages based on Estimated Number of Native Speakers Results======
 
[[File: European _ Languages _ versus _ Code_1_Number.png |thumb|1000px|centre|'''Fig. 17:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Count 0 to 1)]]
In extension to the 2013 group’s work, the 2015 group conducted a Chi-Squared test of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers <ref>No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.</ref>.  This test included all versions of the Somerton Man code including the versions with 6 Ms, 4 Ms 2 Ws, Average and the police report code (5 Ms and 1 W).
 
======Results======
[[File: European _ Languages _ versus _ Code_1_Number.png |thumb|1000px|centre|'''Fig. 1.7:''' European Languages versus Code 1 Number]]
 
Figure 1.7 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers with 6 Ms in code (the left graph) and 4 Ms & 2 Ws (the right graph). In this case, we use number 1 to instead of counts 0 of letters.
 
[[File: European _ Languages _ versus _ Code_2_Number.png |thumb|1000px|centre|'''Fig. 1.8:''' European Languages versus Code 2 Number]]


Figure 1.8 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in code (the left graph) and Average of 3 cases (the right graph). In this case, we use number 1 to instead of counts 0 of letters.
[[File: European _ Languages _ versus _ Code_2_Number.png |thumb|1000px|centre|'''Fig. 18:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Count 0 to 1)]]


[[File: European _ Languages _ versus _ Code_1_Frequency.png |thumb|1000px|centre|'''Fig. 1.9:''' European Languages versus Code 1 Frequency]]
[[File: European _ Languages _ versus _ Code_1_Frequency.png |thumb|1000px|centre|'''Fig. 19:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Frequency 0 to 0.0001)]]


Figure 1.9 shows the Chi-squared value of Top 21 European Languages based on Estimated Number of Speakers with 6 Ms in code (the left graph) and 4 Ms & 2 Ws (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.  
[[File: European _ Languages _ versus _ Code_2_Frequency.png |thumb|1000px|centre|'''Fig. 20:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Frequency 0 to 0.0001)]]


[[File: European _ Languages _ versus _ Code_2_Frequency.png |thumb|1000px|centre|'''Fig. 1.10:''' European Languages versus Code 2 Frequency]]
======Top 20 European Languages based on Estimated Number of Native Speakers Evaluation======
The results show that when changing the count from 0 to 1, for two of the three code versions and their average value, English was the closest language to the Somerton Man code. Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms, however, considering Kurdish is not a common language, and based on the average results, it is safe to say that English was the closest language to the code using this method.


Figure 1.10 shows the Chi-squared value of Top 21 European Languages based on Estimated Number of Speakers with 5 Ms & 1 W in code (the left graph) and Average of 3 cases (the right graph). In this case, we use 0.0001 to instead of frequency 0 of letters.
When changing the frequency from 0 to 0.0001, the results deviated from those obtained using a count of 1. This caused English to produce a higher chi-squared value and for other languages to produce lower-chi squared values, causing Kurdish to have the lowest chi-squared value in all cases, and for English to become the third closest language in the average results. This may be evidence to suggest that upon further inspection, the ''Universal Declaration of Human Rights'' may not be a suitable base text for use with the chi-squared method, irrespective of insufficient sample size.


======Evaluation======
======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample======
A comparison of the English control text (a 44 letter sample of Thomas Hardy’s Tess of the d’Urbervilles) against the top 20 European Languages based on Estimated Number of Native Speakers was conducted in order to test the ability of the method being able to identify an actual English sample text among the top 20 European languages.  Assumptions of altering the count and frequency of letters that appeared 0 times to 1 and 0.0001 respectively were both used and can be seen in Figures 21 and 22.


The results show that for 2 of the 3 code versions and their average value, English was the closest language to the Somerton Man code. Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms. However, consider Kurdish is not a common and popular language in the world. English was also found to be the closest language to the code as in the original police report.
======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Results======


=====Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample=====
{|style="margin: 0 auto;"
 
| [[File: European _ Languages _ versus _ Thomas_1_Number.png |thumb|1000px|centre|'''Fig. 21:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Count 0 to 1)]]
A comparison of the English control text (a 44 letter sample of Thomas Hardy’s Tess of the d’Urbervilles) against the top 20 European Languages based on Estimated Number of Native Speakers was conducted in order to test the ability of the method being able to identify an English sample text among the top 20 European languages.
| [[File: European _ Languages _ versus _ Thomas_1_Frequency.png |thumb|1000px|centre|'''Fig. 22:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Frequency 0 to 0.0001)]]
 
|}
 
======Results======
 
[[File: European _ Languages _ versus _ Thomas_1_Number.png |thumb|1000px|centre|'''Fig. 1.11:''' European Languages versus Thomas 1 Number]]
 
Figure 1.11 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers against Thomas Hardy Sample. In this case, we use number 1 to instead of counts 0 of letters.
 
 
[[File: European _ Languages _ versus _ Thomas_1_Frequency.png |thumb|1000px|centre|'''Fig. 1.12:''' European Languages versus Thomas 1 Frequency]]
 
 
Figure 1.12 shows the Chi-squared value of Top 20 European Languages based on Estimated Number of Speakers against Thomas Hardy Sample. In this case, we use 0.0001 to instead of frequency 0 of letters.
 
======Evaluation======
 
The results of this test show that English was the closest text to the English sample text. This result is desirable as it successfully verified the ability of the Chi-Squared test to distinguish a 44 letter English sample out of the top 20 European languages, and can be used to back up the results obtained from comparing the same 20 languages against the Somerton Man code.


======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Evaluation======
The results of this test show that English was the closest language to the English sample text in both cases. This result is desirable as it successfully verified the ability of the ''chi-squared test'' to distinguish a 44 letter English sample out of the top 20 European languages.  This could have been used to back up the results obtained from comparing the same 20 languages against the Somerton Man code, but unlike the same test performed on the Somerton Man code, adjusting the method from count 0 to 1, to frequency 0 to 0.0001 caused no effect on English being the closest language to the sample text.


====Evaluation and Justification====
====Evaluation and Justification====
The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the validity of the method as well as the ''Universal Declaration of Human Rights'' as a base text. The group was then to extend the analysis by calculating the ''p-values'' for the Somerton Man code when compared to the most common European languages and perform ''hypothesis testing'' based on the results. The group was also to use benchmark texts to test the statistical accuracy of the method as well as the validity of the ''Universal Declaration of Human Rights'' as a base text. The 2015 group’s statistical analysis had achieved almost all of its proposed goals. A slight diverge from the initially proposed method was decided upon once it was found that ''p-values'' useful for comparison or ''hypothesis testing'' were unable to be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated chi-squared value. All assumptions outlined in the proposal were followed, with the addition of the modification of the count and frequency data to account for the small sample size of the base text.


=====Evaluation of Parametric Testing=====
Previous results using the ''Universal Declaration of Human rights'' as a base text were confirmed when using the assumption to adjust letters with count 0 to 1. These were not consistent when using the second assumption to adjust the letters with frequency 0 to 0.0001. These inconclusive results left more to be desired from the analysis and the validity of ''Universal Declaration of Human Rights'' as a base text was found to be questionable due to its limited sample size. English was found to have the lowest chi-squared value for a number of the calculations, meaning that the Somerton Man code is most likely to be formed from the English language, however since a reasonable ''p-value'' for any language could not be obtained, there was still a potential for reanalysis.
 
All of these methods did not compare individual data entries from one group to their corresponding entry in another group. Instead they used the sample size, sum, mean, and variance of each group to compare to the other group. Because of this, this method could not be used when comparing letter frequency between languages. For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.
 
=====Evaluation Of Task 1=====
 
The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the validity of the method as well as the Universal Declaration of Human Rights as a base text. The group was then to extend the analysis by calculating the p-values for the Somerton Man code when compared to the most common European languages and perform hypothesis testing based on the results. The group was also to use benchmark texts to test the statistical accuracy of the method as well as the validity of the Universal Declaration of Human Rights as a base text. The 2015 group’s statistical analysis has achieved almost all of its proposed goals. A slight diverge from the initially proposed method was decided upon once it was found that p-values useful for comparison or hypothesis testing were unable to be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated Chi-Value. All assumptions outlined in the proposal were followed, with the addition of the modification of the frequency data to account for the small sample size of the base text.
 
=====Conclusion=====
 
The results of all of the Chi-Squared testing lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written assuming it is an initialism.


===Task 2: N-Gram Search===
===Task 2: N-Gram Search===
====Aim====
====Aim====
The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man code.  Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section).  Based on this, for this task an assumption was made that the code is an initialism.  A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases.  This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section).  Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded.  This was to drastically increase the speed at which the gram combinations on the web could be found.  The search engine was required to output a list of possible gams from the input letter combinations. An assumption that was to be made in order to complete this task was that the letters in the code, and thus the words in the grams, were order relevant.  Another assumption was that all variants of ambiguous letters in the code were to be included.
The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man code.  Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English ''initialism'' (see Previous Studies/Related Work section).  Based on this, for this task an assumption was made that the code is an ''initialism''.  A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases.  This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section).  Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded.  This was to drastically increase the speed at which the gram combinations on the web could be found.  The search engine was required to output a list of possible gams from the input letter combinations.  
 
An assumption that was made in order to complete this task was that the code is an ''initialism'', meaning that it is made up of letters that represent the first letters of an ordered series of words.  This assumption was made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott.  Due to this, a second assumption was made that the letters in the code, and thus the words in the grams, were order relevant.  A final assumption was that all variants of ambiguous letters in the code were to be included.


====Method====
====Method====
Line 214: Line 241:
   
   
The maximum number of n provided by the gram database was five.  Due to this, a maximum of five letter gram groups from the code could be processed at a time.
The maximum number of n provided by the gram database was five.  Due to this, a maximum of five letter gram groups from the code could be processed at a time.
This was achieved by writing a code in Python to generate all possible 5-gram initialisms from all code variants, including the crossed out line, and output them into a corresponding text file.  The same was also done for 4, 3, 2 and 1-grams and stored in their respective text files.  These were to be used as input files from which the search engine was able to perform searches to query the database.
This was achieved by writing a code in Python to generate all possible 5-gram ''initialisms'' from all code variants, including the crossed out line, and output them into a corresponding text file.  The same was also done for 4, 3, 2 and 1-grams and stored in their respective text files.  These were to be used as input files from which the search engine was able to perform searches to query the database.


The search engine code was also written in Python. It functioned by taking in the initialism combinations from the Somerton Man code of length n, from text files created by the intiialism generator code, and stored them in a dictionary labelled 'initialisms of interest'.  The grams from the database were read in by a reader and initialisms were generated from the grams in each line of the reader.  If the initialism generated from the line of the reader matched an initialism in the dictionary containing the initialisms of interest, the full gram was output into a corresponding text file containing results of length n.  This code was copied and modified to be used for each gram length from n=1-5.  A simplified diagram of the way the code works can be seen in the flowchart in Figure X, and the full code can be seen in Appendix X.
The search engine code was also written in Python. It functioned by taking in the ''initialism'' combinations from the Somerton Man code of length n, from text files created by the intiialism generator code, and stored them in a dictionary labelled 'initialisms of interest'.  The grams from the database were read in by a reader and ''initialisms'' were generated from the grams in each line of the reader.  If the ''initialism'' generated from the line of the reader matched an ''initialism'' in the dictionary containing the ''initialisms'' of interest, the full gram was output into a corresponding text file containing results of length n.  This code was copied and modified to be used for each gram length from n=1-5.  A simplified diagram of the way the code works can be seen in the flowchart in Figure 23, and the full code can be seen in Appendix A.


[[File:Search_Engine_Flow_Chart.png|thumb|600px|centre|'''Fig. X:''' Search Engine Flowchart]]
[[File:Search_Engine_Flow_Chart.png|thumb|600px|centre|'''Fig. 23:''' Search Engine Flowchart]]


Running our code on the Google N-Gram database stored in the i2.xlarge instances in parallel for each group of n-gram inputs from n=1-5 took approximately two weeks.  These raw results were then small enough to be stored and processed locally and so the Amazon EC2 service was no longer required.
Running our code on the Google N-Gram database stored in the i2.xlarge instances in parallel for each group of n-gram inputs from n=1-5 took approximately two weeks.  These raw results were then small enough to be stored and processed locally and so the Amazon EC2 service was no longer required.
Line 227: Line 254:
Once the raw results were obtained, some grams contained words followed by an underscore and the corresponding lexical category of the word (ie. noun, verb, adverb, adjective, pronoun etc.).  This was desired to be removed and so another python code was written to remove everything but the words themselves from each line in the results.
Once the raw results were obtained, some grams contained words followed by an underscore and the corresponding lexical category of the word (ie. noun, verb, adverb, adjective, pronoun etc.).  This was desired to be removed and so another python code was written to remove everything but the words themselves from each line in the results.


Upon processing the raw results, the output of the lexical category removal results showed multiple identical results with individual frequencies for the numbers of years in which they occurred.  This was brought about since previously the database considered these entries to be unique results, but now with the lexical categories removed, some results became identical.  This was rectified by writing another code in Python to process these cleaned results to combine identical entries and sum their frequency of years in which they occurred.  This code was then duplicated and modified into two codes, the first output the results sorted alphabetical order, and the second in order of frequency of years in which each result occurred from highest to lowest.  The alphabetically sorted outputs were used as a means of comparison to the cleaned inputs since these were also sorted alphabetically, in order to check that the code was functioning correctly. The frequency sorted outputs were more useful since they able to be used to generate a condensed list of the top 30 most popular initialisms that could be generated from the letters from all variations in the Somerton Man code, seen in the results section in Figure X.
Upon processing the raw results, the output of the lexical category removal results showed multiple identical results with individual frequencies for the numbers of years in which they occurred.  This was brought about since previously the database considered these entries to be unique results, but now with the lexical categories removed, some results became identical.  This was rectified by writing another code in Python to process these cleaned results to combine identical entries and sum their frequency of years in which they occurred.  This code was then duplicated and modified into two codes, the first output the results sorted alphabetical order, and the second in order of frequency of years in which each result occurred from highest to lowest.  The alphabetically sorted outputs were used as a means of comparison to the cleaned inputs since these were also sorted alphabetically, in order to check that the code was functioning correctly. The frequency sorted outputs were more useful since they able to be used to generate a condensed list of the top 30 most popular ''initialisms'' that could be generated from the letters from all variations in the Somerton Man code, seen in the results section in Figure 24.


=====Combinations of Search Results=====
=====Combinations of Search Results=====
Line 234: Line 261:
For simplicity, using 2-grams and the code ABAC: If the top 2 2-grams for AB are Absolute Bargain and American Beagle, and the top 2 2-grams for AC are Air Conditioning and Alternating Current, then all possible combinations for the code are: Absolute Bargain Air Conditioning, Absolute Bargain Alternating Current, American Beagle Air Conditioning and American Beagle Alternating Current.
For simplicity, using 2-grams and the code ABAC: If the top 2 2-grams for AB are Absolute Bargain and American Beagle, and the top 2 2-grams for AC are Air Conditioning and Alternating Current, then all possible combinations for the code are: Absolute Bargain Air Conditioning, Absolute Bargain Alternating Current, American Beagle Air Conditioning and American Beagle Alternating Current.


This code was implemented as an exercise to see if any interesting or useful results could come about using this simple method.  Unfortunately, this produced nonsensical results due to the disjoint between each 5-gram group's search results, a sample of these can be seen in the results section in Figure X.  Due to the time constraints of the project, the code was not able to be developed any further, but the code and the results it provides can be used as a first step towards obtaining meaningful or useful combinations of n-grams from the results obtained using the search engine developed throughout this project.  This code could be improved by using a sliding window that progresses by less than 5 letters for each search, for example, using a step size of 1 letter would create the maximum possible overlap of 4 letters between each input gram group.  More information on this and other suggested improvements can be found in the future work section.
This code was implemented as an exercise to see if any interesting or useful results could come about using this simple method.  Unfortunately, this produced nonsensical results due to the disjoint between each 5-gram group's search results, a sample of these can be seen in the results section in Figure 25.  Due to the time constraints of the project, the code was not able to be developed any further, but the code and the results it provides can be used as a first step towards obtaining meaningful or useful combinations of n-grams from the results obtained using the search engine developed throughout this project.  This code could be improved by using a sliding window that progresses by less than 5 letters for each search, for example, using a step size of 1 letter would create the maximum possible overlap of 4 letters between each input gram group.  More information on this and other suggested improvements can be found in the future work section.


====Results====
====Results====
[[File:Combined_Top_30.pdf]]
* '''Fig. 24:''' Top 30 Most Popular Initialisms from Somerton Man Code [[File:Combined_Top_30.pdf]]
[[File:Top_2_5-Gram_Results_Combined.pdf]]
* '''Fig. 25:''' Top 2 5-Gram Combinations from Somerton Man Code Sample [[File:Top_2_5-Gram_Results_Combined.pdf]]


====Evaluation and Justification====
====Evaluation and Justification====
Line 245: Line 272:
===Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad===
===Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad===
====Aim====
====Aim====
The aim of Task 3 was to use Rubaiyat of Omar Khayyam as a one-time pad to attempt to decode the Somerton Man code to find any meaningful messages after decrypting. This task involved the investigation that the letters had been substituted for others using a one-time pad technique. This group was to use the Somerton Man code to act as the cipher text, the numerical value of letter positions within words, with respect to the first letter of each word to act as the key, and the decoded messages to act as the plaintext of the code.  The first letter of each word in the one-time pad had numerical value 0, the second letter had numerical value 1, and so on.  
The aim of Task 3 was to use ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'' to attempt to decode the Somerton Man code to find any meaningful messages after decrypting. This task involved the investigation that the letters had been substituted for others using a ''one-time pad'' technique. This group was to use the Somerton Man code to act as the cipher text, the numerical value of letter positions within words, with respect to the first letter of each word to act as the ''key'', and the decoded messages to act as the ''plaintext'' of the code.  The first letter of each word in the ''one-time pad'' had numerical value 0, the second letter had numerical value 1, and so on.  


The difference in aim between our group and the 2013 honours group is that the key used to decode the message is based on letter position within each word rather than using numbers assigned to each letter in the alphabet to perform alphabetic shits, as implemented in the 'Decoding Toolkit - One Time Pad' software <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.
The difference in aim between our group and the 2013 honours group is that the ''key'' used to decode the message is based on letter position within each word rather than using numbers assigned to each letter in the alphabet to perform alphabetic shits, as implemented in the 'Decoding Toolkit - One Time Pad' software <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.


====Method====
====Method====
Before commencing Task 3, we had to choose a computer programming language to implement the code. My first attempt was to use Java, because this language was used by all previous groups to implement for their work. I tried to reuse the base code from previous groups and extend it to satisfy our Task 3 requirements, however, the aim of 2013 was significantly different to ours, so there were not too many similar points in the code. On the other hand, I am not good at Java language because I never learnt it before. My second attempt was to use the C and C++  language to implement the code. I completed some functions for the code, but there was a problem in loading text files. Because we used the Rubaiyat as a one-time pad, we were trying to translate Rubaiyat into a text file so that we were able to run the code and search for words in the Rubaiyat. C++ is an old programming language and the process of loading data was too complex. I decided to give up this method. My final attempt was to use Matlab, it is a relatively new programming language, and loading text files is relatively simple. In addition, I have studied Matlab in previous university courses and find it easier to use. Finally, I chose Matlab as implement language for our Task 3.
Before commencing Task 3, we had to choose a computer programming language to implement the code. My first attempt was to use Java, because this language was used by all previous groups to implement for their work. I tried to reuse the base code from previous groups and extend it to satisfy our Task 3 requirements, however, the aim of 2013 was significantly different to ours, so there were not too many similar points in the code. On the other hand, I am not good at Java language because I never learned it before. My second attempt was to use the C and C++  language to implement the code. I completed some functions for the code, but there was a problem in loading text files. Because we used the ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'', we were trying to translate the ''Rubaiyat'' into a text file so that we were able to run the code and search for words in the ''Rubaiyat''. C++ is an old programming language and the process of loading data was too complex. I decided to give up this method. My final attempt was to use MATLAB, it is a relatively new programming language, and loading text files is relatively simple. In addition, I have studied MATLAB in previous university courses and find it easier to use. Finally, I chose MATLAB as the programming language to implement language for Task 3.


The direct substitution of the letters in the pad was to be used. For instance, if we chose a line: MRGOABABD, from the Somerton Man code as the encoded message, the first letter of the code is to be used, which is an M. The program was then to search the Rubaiyat from beginning to end, until it finds the first word that begins with M.  This was then to be decoded to the second letter in the same word. This process was then to be repeated for the second letter in the code, being R. It was to find the first word in the Rubaiyat that starts with R, and decodes it to the second letter in that word.  This method was then to be repeated until there were too few words long enough to decode all of the letters in the code. After this point, one final decode was to be attempted, where rather than using letter position, the last letter of each word was to be used. The output of the software was to be possible words or phrases made up of letter substitutions in place of the letters in the Somerton Man code.
The direct substitution of the letters in the pad was to be used. For instance, if we chose a line: MRGOABABD, from the Somerton Man code as the encoded message, the first letter of the code is to be used, which is an M. The program was then to search the ''Rubaiyat'' from beginning to end, until it finds the first word that begins with M.  This was then to be decoded to the second letter in the same word. This process was then to be repeated for the second letter in the code, being R. It was to find the first word in the ''Rubaiyat'' that starts with R, and decodes it to the second letter in that word.  This method was then to be repeated until there were too few words long enough to decode all of the letters in the code. After this point, one final decode was to be attempted, where rather than using letter position, the last letter of each word was to be used. The output of the software was to be possible words or phrases made up of letter substitutions in place of the letters in the Somerton Man code.


=====Design=====
=====Design=====
[[File: Design_for_task3.png |thumb|1000px|centre|'''Fig. 3.1:''' Design for task3]]
[[File: Design_for_task3.png |thumb|1000px|centre|'''Fig. 26:''' Design Flowchart for Task 3]]


The Figure 3.1 shown above include my design and thinking process for Task 3.  We needed 2 inputs for the code: One was the Somerton Man code which is the encoded message, other was the letter position as the key (the range of the letter position is above 0 and starts with integer 1 based on the MATLAB system). The function was to output a letter from the Somerton Man code and a number n from the letter position. Then, a function called matching was then to receive the letter output from the Somerton Man code. It was to search the Rubaiyat from beginning to end to match the first word with the same initial letter. At the same time, a function called finding was to receive the number output by letter position. Afterwards, the function matching was to send the matched word from Rubaiyat to the function 'finding'. This function was then to choose the correct letter in the matched word based on the letter position n. For example, if n equaled to 2, the 'finding' function would choose the second letter of the matched word. Finally, the function wouldl output the chosen letter to a function called 'recordMessage', and this function would record all the letters from 'finding' in order. The program would have now completed decoding for one letter of Somerton Man code, and would then repeat for the rest of the letters in the code. After decoding all letters, the program would output the original message.
The Figure 26 shown above demonstrates my design and thinking process for Task 3.  We needed 2 inputs for the code: One was the Somerton Man code which is the encoded message, other was the letter position as the ''key'' (the range of the letter position is above 0 and starts with integer 1 based on the MATLAB system). The function was to output a letter from the Somerton Man code and a number n from the letter position. Then, a function called matching was then to receive the letter output from the Somerton Man code. It was to search the ''Rubaiyat'' from beginning to end to match the first word with the same initial letter. At the same time, a function called finding was to receive the number output by letter position. Afterwards, the function matching was to send the matched word from ''Rubaiyat'' to the function 'finding'. This function was then to choose the correct letter in the matched word based on the letter position n. For example, if n equaled to 2, the 'finding' function would choose the second letter of the matched word. Finally, the function would output the chosen letter to a function called 'recordMessage', and this function would record all the letters from 'finding' in order. The program would have now completed decoding for one letter of Somerton Man code, and would then repeat for the rest of the letters in the code. After decoding all letters, the program would output the original message.


=====One-Time Pad Example=====
=====One-Time Pad Example=====
[[File: Example_of_task3_1.png |thumb|1000px|left|'''Fig. 3.2:''' Example of task3 1]]


The Figure 3.2 is an excerpt form the Rubaiyat of Omar Khayyam and the word AWAKE is the first word in the Rubaiyat.  Assuming we have the encoded message AFM and we want to use the second letter position to decode it.  The program will search the Rubaiyat from beginning to end, until it finds the first word that begins with A. From Figure 3.2, the word should be AWAKE. This will then be decoded to the second letter in the same word which is W.  The program will repeat step 2 for the rest of the letters in the code. The second word and third word should be FOR and MORNING based on the Rubaiyat of Omar Khayyam.  The output should be WOO.  Figure 3.3 shows the output from MATLAB, the function called multi, which shows all possible decoded messages according to letter position n. For example, if n equals 2, the output will display the first original encoded message AFM and the second decoded message WOO as well.
Figure 27 contains an excerpt form the ''Rubaiyat of Omar Khayyam'' and the word AWAKE is the first word in the ''Rubaiyat''.  Assuming we have the encoded message AFM and we want to use the second letter position to decode it.  The program will search the ''Rubaiyat'' from beginning to end, until it finds the first word that begins with A. From Figure 27, the word should be AWAKE. This will then be decoded to the second letter in the same word which is W.  The program will repeat step 2 for the rest of the letters in the code. The second word and third word should be FOR and MORNING based on the ''Rubaiyat of Omar Khayyam''.  The output should be WOO.  Figure 28 shows the output from MATLAB, the function called multi, which shows all possible decoded messages according to letter position n. For example, if n equals 2, the output will display the first original encoded message AFM and the second decoded message WOO as well.


[[File: Example_of_task3_2.png |thumb|1000px|right|'''Fig. 3.3:''' Example of task3 2]]
{|style="margin: 0 auto;"
| [[File: Example_of_task3_1.png |thumb|1000px|centre|'''Fig. 27:''' Example of Task 3 (Rubaiyat Excerpt)]]
| [[File: Example_of_task3_2.png |thumb|1000px|centre|'''Fig. 28:''' Example of Task 3 (MATLAB Output)]]
|}


====Results====
=====Verification=====
 
A simple verification method was used to test the MATLAB code to be functioning correctly. One group member encoded the original message GUN using the proposed ''key'' with the third letter position, and ''Rubaiyat of Omar Khayyam'' as the ''one-time pad''. Based on Figure 29, which is an excerpt from the formatted version of the ''Rubaiyat of Omar Khayyam'', the encoded words used were NIGHT, FLUNG and HUNTER. Taking the first letter of each word produces the letters NFH as the encoded message.
[[File: Task_3_result_21.png |thumb|300px|left|'''Fig. 3.4:''' Task 3 result 21]]
[[File: Task_3_result_22.png |thumb|300px|right|'''Fig. 3.5:''' Task 3 result 22]]
 
 
 
The figure 3.4 shows the result for using second letter position as key to decode each line of Somerton Man code. The Rubaiyat is acting as one-time pad and without any formatting (We put whole Rubaiyat into text file without removing any symbols and punctuations)
 
 
 
 
 
 
 
 
 
The figure 3.5 shows the result for using second letter position as key to decode each line of Somerton Man code as well. But the Rubaiyat is acting as one-time pad and with formatting (We put whole Rubaiyat into text file and remove all symbols, punctuations and non-letter parts)
 
 
 
[[File: Task_3_result_31.png |thumb|300px|left|'''Fig. 3.6:''' Task 3 result 31]]
[[File: Task_3_result_32.png |thumb|300px|right|'''Fig. 3.7:''' Task 3 result 32]]
 
 
 
 
 
 
 
The figure 3.6 shows the result for using third letter position as key to decode each line of Somerton Man code. The Rubaiyat is acting as one-time pad and without any formatting (We put whole Rubaiyat into text file without removing any symbols and punctuations)
 
 
 
 
 
 
 
 
 
The figure 3.7 shows the result for using third letter position as key to decode each line of Somerton Man code as well. But the Rubaiyat is acting as one-time pad and with formatting (We put whole Rubaiyat into text file and remove all symbols, punctuations and non-letter parts)
 
 
[[File: Task_3_result_l1.png |thumb|300px|left|'''Fig. 3.8:''' Task 3 result l1]]
[[File: Task_3_result_l2.png |thumb|300px|right|'''Fig. 3.9:''' Task 3 result l2]]


The other group member then took the letters NFH as the code and ran the MATLAB script to test all letter positions until discovering that by using the third letter position as the ''key'', the original message 'GUN' was displayed as the output as expected.  This can be see in Figure 30.


{|style="margin: 0 auto;"
| [[File: Task_3_Verification.png |thumb|300px|left|'''Fig. 29:''' Task 3 Verification (Rubaiyat Excerpt)]]
| [[File: Task_3_Verification_Result.png |thumb|300px|right|'''Fig. 30:''' Task 3 Verification (Result)]]
|}


====Results====


{|style="margin: 0 auto;"
| [[File: Task_3_result_21.png |thumb|300px|centre|'''Fig. 31:''' Task 3 Results using Second Letter Position (Unformatted)]]
| [[File: Task_3_result_22.png |thumb|300px|centre|'''Fig. 32:''' Task 3 Results using Second Letter Position (Formatted)]]
|}


{|style="margin: 0 auto;"
| [[File: Task_3_result_31.png |thumb|300px|centre|'''Fig. 33:''' Task 3 Results using Third Letter Position (Unformatted)]]
| [[File: Task_3_result_32.png |thumb|300px|centre|'''Fig. 34:''' Task 3 Results using Third Letter Position (Formatted)]]
|}


 
{|style="margin: 0 auto;"
The figure 3.8 shows the result for using last letter position for each word as key to decode each line of Somerton Man code. The Rubaiyat is acting as one-time pad and without any formatting (We put whole Rubaiyat into text file without removing any symbols and punctuations)
| [[File: Task_3_result_l1.png |thumb|300px|centre|'''Fig. 35:''' Task 3 Results using Last Letter Position (Unformatted)]]
 
| [[File: Task_3_result_l2.png |thumb|300px|centre|'''Fig. 36:''' Task 3 Results using Last Letter Position (Formatted)]]
 
|}
 
 
 
 
 
The figure 3.9 shows the result for using last letter position for each word as key to decode each line of Somerton Man code as well. But the Rubaiyat is acting as one-time pad and with formatting (We put whole Rubaiyat into text file and remove all symbols, punctuations and non-letter parts)


====Evaluation and Justification====
====Evaluation and Justification====
 
As can be seen in the Figures in the Results section, despite correctly implementing the newly proposed ''key'' method to use the ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'' through the use of software, no useful results were able to be generated by using all variants of the code and all letter positions as ''keys''. The main conclusion to be drawn from this was that the ''Rubaiyat of Omar Khayyam'' was not used as a ''one-time pad'' to generate the Somerton Man code with the proposed method of using letter position within words as the ''key''.
=====Verification=====
[[File: Task_3_Verification.png |thumb|300px|left|'''Fig. 3.10:''' Task 3 Verification]]
[[File: Task_3_Verification_Result.png |thumb|300px|right|'''Fig. 3.11:''' Task 3 Verification Result]]
 
 
 
A verification method is to be used to test the Matlab code to be working effectively and correctly. Assuming we have origin message GUN, and use third letter position to encode the message based on the Rubaiyat. Based on figure 3.10 which is a part from formatted Rubaiyat. The encoded words should be NIGHT, FLUNG and HUNTER which are emphasizing in the figure. Then we take the first letter of each word. The code should be NFH.
 
 
If we use NFH as code and choose third letter position as key, by using One-Time pad method. We should get GUN as the output. The figure 3.11 displays the output of Matlab code by using input NFH and third letter position.


===Task 4: Statistical Frequency of Letters Reanalysis===
===Task 4: Statistical Frequency of Letters Reanalysis===
====Aim====
====Aim====
Towards the end of the project, a decision was made that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus our efforts on reanalysing the letter frequencies of various European languages.  This was decided upon since our initial analysis performed in Task 1 produced inconsistent and varied results.  This was brought about due to the limited sample size of the Universal Declaration of Human Rights as a base text, causing the frequency of particular letters to appear 0 times in particular languages.  Due to this, these letter frequencies had to be altered by choosing arbitrary numbers for their frequency in order to perform our chi-Squared testing and thus reduced the accuracy and validity of the test's results.
Towards the end of the project, a decision was made that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus our efforts on reanalysing the letter frequencies of various European languages.  This was decided upon since our initial analysis performed in Task 1 produced inconsistent and varied results.  This was brought about due to the limited sample size of the ''Universal Declaration of Human Rights'' as a base text, causing the frequency of particular letters to appear 0 times in particular languages.  Due to this, these letter frequencies had to be altered by choosing arbitrary numbers for their frequency in order to perform our ''chi-Squared testing'' and thus reduced the accuracy and validity of the test's results.


the limited sample size caused the chi-squared values for all languages, including English, to be reasonably large.  This caused the resulting calculated p-values to be extremely small numbers, or in most cases 0.  Because of this, these chi-squared values were not usable to use p-values to perform our initially proposed hypothesis testing from Task 1.
the limited sample size caused the chi-squared values for all languages, including English, to be reasonably large.  This caused the resulting calculated ''p-values'' to be extremely small numbers, or in most cases 0.  Because of this, these chi-squared values were not usable to use ''p-values'' to perform our initially proposed ''hypothesis testing'' from Task 1.


This caused us to question the validity of the Universal Declaration of Human Rights as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.   
This caused us to question the validity of the ''Universal Declaration of Human Rights'' as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.   


====Method====
====Method====
It was decided that for the reanalysis, we would use Project Gutenberg to increase the sample size for as many of the 21 most popular European languages used in Task 1 as possible by collecting novels from the time before the Somerton Man's death.  This was chosen to be used as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages.  Novels in each language were concatenated and their letter frequencies were determined, until each letter appeared at least once in each language.   
It was decided that for the reanalysis, we would use ''Project Gutenberg'' to increase the sample size for as many of the 21 most popular European languages used in Task 1 as possible by collecting novels from the time before the Somerton Man's death.  This was chosen to be used as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages.  Novels in each language were concatenated and their letter frequencies were determined, until each letter appeared at least once in each language.   


The 2013 group’s decoding toolkit and initial letter frequency count code were able to be utilised for this task.  The decoding toolkit's 'format texts' function was used to remove all non letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts in order to obtain the data we needed to perform our statistical analysis.  All statistical calculations and graphs were generated using Mircosoft Excel.
The 2013 group’s decoding toolkit and initial letter frequency count code were able to be utilised for this task.  The decoding toolkit's 'format texts' function was used to remove all non letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts in order to obtain the data we needed to perform our statistical analysis.  All statistical calculations and graphs were generated using Mircosoft Excel.


=====Initial Validation=====
=====Initial Validation=====
First of all, the same test that was initially run in Task 1 on the statistics obtained from the English translation of the Universal Declaration of Human Rights (with letters with frequency 0 modified to 0.0001) as a baseline check were also run on the new statistics gathered from the novel: The Life of the Spider by J. Henri Fabre<ref>J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887</ref>, used as our English base text found on Project Gutenberg, as a means of comparison between the base texts.  The Somerton man code, 44 letter samples from a Thomas Hardy novel acting as an English control <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref> as well as a French sample, German sample, and Zapoteco sample from the Universal Declaration of Human Rights were all compared to both sets of data and the results can be seen in Figure X.
First of all, the same test that was initially run in Task 1 on the statistics obtained from the English translation of the ''Universal Declaration of Human Rights'' (with letters with frequency 0 modified to 0.0001) as a baseline check were also run on the new statistics gathered from the novel: ''The Life of the Spider'' by J. Henri Fabre<ref>J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887</ref>, used as our English base text found on ''Project Gutenberg'', as a means of comparison between the base texts.  The Somerton man code, 44 letter samples from a Thomas Hardy novel acting as an English control <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref> as well as a French sample, German sample, and Zapoteco sample from the ''Universal Declaration of Human Rights'' were all compared to both sets of data and the results can be seen in Figure 12.


=====European Language Comparison=====
=====European Language Comparison=====
Next, once it was found that the English text from Project Gutenberg provided lower chi-Squared values than the Universal Declaration for all samples in the initial test, the chi-Squared testing on European languages could be commenced.  This involved the same procedure as was used in Task 1, but of the top 21 most popular European languages from Task 1, only 12 of the languages were able to be used in the reanalysis due to insufficient usability or availability of texts on Project Gutenberg.  The languages used in the analysis can be seen in the graph in Figure X.  The omitted languages included Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukranian, Belarusian and Kazakh.  The texts used for this analysis can be seen in Appendix X.
Next, once it was found that the English text from ''Project Gutenberg'' provided lower chi-Squared values than the Universal Declaration for all samples in the initial test, the ''chi-Squared testing'' on European languages could be commenced.  This involved the same procedure as was used in Task 1, but of the top 21 most popular European languages from Task 1, only 12 of the languages were able to be used in the reanalysis due to insufficient usability or availability of texts on ''Project Gutenberg''.  The languages used in the analysis can be seen in the graph in Figure 38.  The omitted languages included Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukranian, Belarusian and Kazakh.  The texts used for this analysis can be seen in Appendix B.


=====Significance Level Calculation=====
=====Significance Level Calculation=====
The chi-squared and p-values calculated showed that English was the closest language to the Somerton Man code.  From this, hypothesis testing could be performed based on the English results.  Upon consultation with Prof. Abbott and Dr. Berryman, rather than choosing an arbitrary value of significance level such as the typically used P=0.05, it was decided a significance level could be calculated using the p-value found using real English texts to be used as what we deemed to be an acceptable significance level for which we would confidently be able to say that the most likely language of origin of the Somerton Man code is English.  This was achieved by collecting 20 44 letter excerpts from English novels from Project Gutenberg (see Appendix X), performing the chi-squared testing for these samples against the English project Gutenberg novel used as our English base text, taking an average of the chi-squared values, and from this calculating a p-value.  This result was then compared to the results obtained from the English portion of the chi-squared testing performed on the variants of the code, and was plotted as seen in Figure X.
The chi-squared and ''p-values'' calculated showed that English was the closest language to the Somerton Man code.  From this, ''hypothesis testing'' could be performed based on the English results.  Upon consultation with Prof. Abbott and Dr. Berryman, rather than choosing an arbitrary value of significance level such as the typically used p=0.05, it was decided a significance level could be calculated using the ''p-value'' found using real English texts to be used as what we deemed to be an acceptable significance level for which we would confidently be able to say that the most likely language of origin of the Somerton Man code is English.  This was achieved by collecting 20 44 letter excerpts from English novels from ''Project Gutenberg'' (see Appendix C), performing the ''chi-squared testing'' for these samples against the English ''Project Gutenberg'' novel used as our English base text, taking an average of the chi-squared values, and from this calculating a ''p-value''.  This result was then compared to the results obtained from the English portion of the ''chi-squared testing'' performed on the variants of the code, and was plotted as seen in Figure 40.


This same testing was then also run on the English samples and code variants against the original English translation of the Universal Declaration of Human rights as a means of comparison between the two base texts.  Significance levels were unable to be calculated using the Universal Declaration of Human Rights since the chi-squared values were too large, causing the calculated p-values to be too small (approaching 0).  The results can be seen in Figure X.
This same testing was then also run on the English samples and code variants against the original English translation of the ''Universal Declaration of Human rights'' as a means of comparison between the two base texts.  Significance levels were unable to be calculated using the ''Universal Declaration of Human Rights'' since the chi-squared values were too large, causing the calculated ''p-values'' to be too small (approaching 0).  The results can be seen in Figure 39.


It was unnecessary to extend the analysis to collect benchmarks and perform the hypothesis testing on the other European languages against the code since chi-squared values produced were too large, and so the p-values calculated were unusable.
It was unnecessary to extend the analysis to collect benchmarks and perform the ''hypothesis testing'' on the other European languages against the code since chi-squared values produced were too large, and so the ''p-values'' calculated were unusable.


=====Increased Sample Size Testing=====
=====Increased Sample Size Testing=====
It was then decided that in order to increase our confidence in the calculated significance level, we would increase the sample size for our English base text from Project Gutenberg to not only large enough such that each letter appeared at least once, but to concatenate 20 English novels from the time before the Somerton Man's death to be used as our base English Corpus (See Appendix X).  It was first confirmed whether this would have an affect on the chi-squared values against the code variants when compared to other languages.  We could then also increase our English benchmark sample size by taking 100 44 letter samples from this corpus using code written in Python, and performing the same testing as performed on our smaller English base text.  The results from this testing can be seen in Figure X.
It was then decided that in order to increase our confidence in the calculated significance level, we would increase the sample size for our English base text from ''Project Gutenberg'' to not only large enough such that each letter appeared at least once, but to concatenate 20 English novels from the time before the Somerton Man's death to be used as our base English Corpus (See Appendix D).  It was first confirmed whether this would have an affect on the chi-squared values against the code variants when compared to other languages.  We could then also increase our English benchmark sample size by taking 100 44 letter samples from this corpus using code written in Python, and performing the same testing as performed on our smaller English base text.  The results from this testing can be seen in Figure 41.


Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation, European Language Comparison and so these graphs have been excluded.  A closer look at the changes to the chi-squared and p-values for the Somerton Man code variants caused by this increased sample size can be seen through comparing Figures X and X.  Increasing the number of 44 letter English samples from 20 to 100 however, did have an effect on the chi-squared value and p-value calculated to be used as our significance level, the results of which can be seen in Figure X.  This increase in number of samples had very little effect on the graph of the Universal Declaration of Human Rights significance level calculation, and so this has also been omitted.
Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation, European Language Comparison and so these graphs have been excluded.  A closer look at the changes to the chi-squared and ''p-values'' for the Somerton Man code variants caused by this increased sample size can be seen through comparing Figures 40 and 41.  Increasing the number of 44 letter English samples from 20 to 100 however, did have an effect on the chi-squared value and ''p-value'' calculated to be used as our significance level, the results of which can be seen in Figure 41.  This increase in number of samples had very little effect on the graph of the ''Universal Declaration of Human Rights'' significance level calculation, and so this has also been omitted.


====Results====
====Results====
[[File:Gutenberg_vs_Declaration_against_Samples.png|thumb|500px|centre|'''Fig. X:''' Graph of ]]
{|style="margin: 0 auto;"
| [[File:Gutenberg_vs_Declaration_against_Samples.png|thumb|500px|centre|'''Fig. 37:''' Graph of Initial Validation Chi-Squared Values Comparison between English Declaration and English Gutenberg (Frequency 0 to 0.0001) ]]
| [[File:Gutenberg_vs_Code.png|thumb|500px|centre|'''Fig. 38:''' Graph of Average Chi-Squared values from ''Project Gutenberg'' Base Texts versus Somerton Man Code Variants]]
|}


[[File:Gutenberg_vs_Code.png|thumb|500px|centre|'''Fig. X:''' Graph of Average Chi-Squared values from Project Gutenberg Base Texts versus Somerton Man Code Variants]]
[[File:20_English_Samples_Declaration_Significance.png|thumb|500px|centre|'''Fig. 39:''' Graph of Comparison of Chi-Squared Values of 20 English Samples and Code Variants against English Declaration Base Text (Frequency 0 to 0.0001) ]]


[[File:20_English_Samples_Declaration_Significance.png|thumb|500px|centre|'''Fig. X:''' Graph of ]]
[[File:20_English_Samples_Significance.png|thumb|800px|centre|'''Fig. 40:''' Graphs of Comparison of Chi-Squared and P-Values of 20 English Samples and Code Variants against English Gutenberg Corpus Base Text]]


[[File:20_English_Samples_Significance.png|thumb|800px|centre|'''Fig. X:''' Graph of ]]
[[File:100_English_Samples_Significance.png|thumb|800px|centre|'''Fig. 41:'' Graphs of Comparison of Chi-Squared and P-Values of 100 English Samples and Code Variants against English Gutenberg Corpus Base Text]]
 
[[File:100_English_Samples_Significance.png|thumb|800px|centre|'''Fig. X:''' Graph of ]]


====Evaluation and Justification====
====Evaluation and Justification====
The results of the initial validation seen in Figure X, show that using the Project Gutenberg novel as an English reference text provided lower chi-squared values for all test cases and thus it was deemed to be a more suitable base text than the modified version of the Universal Declaration of Human Rights.
The results of the initial validation seen in Figure 37, show that using the ''Project Gutenberg'' novel as an English reference text provided lower chi-squared values for all test cases and thus it was deemed to be a more suitable base text than the modified version of the ''Universal Declaration of Human Rights''.


The results from the European Language Comparison in Figure X, show that English had the lowest chi-squared value when compared to all languages in our Project Gutenberg corpus, and thus was the closest language to the Somerton Man code.
The results from the European Language Comparison in Figure 38, show that English had the lowest chi-squared value when compared to all languages in our ''Project Gutenberg'' corpus, and thus was the closest language to the Somerton Man code.


The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure X) were found to be much higher than those calculated using the English Project Gutenberg novel (Figure X).  The large difference in results, and the fact that real english samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus the validity of this as a base text has been proven to be questionable when used as part of a chi-squared analysis.  In addition to this, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, hypothesis testing could not be completed due to the large chi-squared values producing very small p-values for the code variants and significance level (approaching 0).
The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure 39) were found to be much higher than those calculated using the English ''Project Gutenberg'' novel (Figure 40).  The large difference in results, and the fact that real english samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus the validity of this as a base text has been proven to be questionable when used as part of a chi-squared analysis.  In addition to this, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, ''hypothesis testing'' could not be completed due to the large chi-squared values producing very small ''p-values'' for the code variants and significance level (approaching 0).


The results from the initial significance level calculation in Figure X shows that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher p-values than the calculated significance level.  From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an initialism.
The results from the initial significance level calculation in Figure 40 shows that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher ''p-values'' than the calculated significance level.  From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an ''initialism''.


Upon increasing the sample size, the significance level calculation in Figure X shows that now only 1 of the 3 code variants achieved a p-value higher than the calculated significance level.  This caused the average result to fall below the significance level.  Due to this, our statement had to be modified to say that overall the null hypothesis was rejected and alternative hypothesis accepted, meaning that we can not confidently say that the language of origin of the Somerton Man code is english for all variants.  Despite this, the null hypothesis could be accepted and English is the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's, 2 W's and is an initialism.
Upon increasing the sample size, the significance level calculation in Figure 41 shows that now only 1 of the 3 code variants achieved a ''p-value'' higher than the calculated significance level.  This caused the average result to fall below the significance level.  Due to this, our statement had to be modified to say that overall the null hypothesis was rejected and alternative hypothesis accepted, meaning that we can not confidently say that the language of origin of the Somerton Man code is english for all variants.  Despite this, the null hypothesis could be accepted and English is the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's, 2 W's and is an ''initialism''.


Regardless of the choice to accept or reject the null hypothesis, the similarities in chi-squared and p-values calculated between real 44 letter English Samples and all variants of the Somerton Man code using the Project Gutenberg base text reinforces the notion that the language of origin of the code is indeed English.
Regardless of the choice to accept or reject the null hypothesis, the similarities in chi-squared and ''p-values'' calculated between real 44 letter English Samples and all variants of the Somerton Man code using the ''Project Gutenberg'' base text reinforces the notion that the language of origin of the code is indeed English.


Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing hypothesis testing, the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5<ref>Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP.</ref>.  This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased.  Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.
Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing ''hypothesis testing'', the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5<ref>Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP.</ref>.  This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased.  Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.


==Project Management - Planning and Feasibility==
==Project Management - Planning and Feasibility==


===Work Breakdown/Deliverables===
===Work Breakdown/Deliverables===
The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.
The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The ''key'' deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.


===Timeline===
===Timeline===
The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure X.
The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure 42.
[[File:Project_Gantt_Chart.png|thumb|1000px|centre|'''Fig. X:''' Proposed Project Gantt Chart]]
[[File:Project_Gantt_Chart.png|thumb|1000px|centre|'''Fig. 42:''' Proposed Project Gantt Chart]]


The final Gantt Chart after all revisions and updates can be seen in Figure X.
The final Gantt Chart after all revisions and updates can be seen in Figure 43.
[[File:Gantt_Chart_Final.png|thumb|1000px|centre|'''Fig. X:''' Final Project Gantt Chart]]
[[File:Gantt_Chart_Final.png|thumb|1000px|centre|'''Fig. 43:''' Final Project Gantt Chart]]


Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis.  Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended.  Task 3 was also extended so that Jikai was able to complete this task.  Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early.  Due to this, Task 4 was completed in parallel with Tasks 2 and 3 towards the end of the project timeline.  The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors.  Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.
Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis.  Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended.  Task 3 was also extended so that Jikai was able to complete this task.  Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early.  Due to this, Task 4 was completed in parallel with Tasks 2 and 3 towards the end of the project timeline.  The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors.  Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.


===Task Allocation===
===Task Allocation===
The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure X. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the Rubaiyat of Omar Khayyam as a One-time Pad, and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.
The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure 44. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the ''Rubaiyat of Omar Khayyam'' as a ''One-time Pad'', and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.
[[File:Table_of_Task_Allocation_Final.png|thumb|500px|centre|'''Fig. X:''' Table of Task Allocation]]
[[File:Table_of_Task_Allocation_Final.png|thumb|500px|centre|'''Fig. 44:''' Table of Task Allocation]]


===Management Strategy===
===Management Strategy===
Line 426: Line 411:
The Google alternative was available for free when obtaining the raw dataset, or at a cost of 150 dollars for a student license when purchased from the University of Pennsylvania Linguistic Data Consortium <ref>T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13</ref>.  Unlike the Microsoft alternative, if the Google N-Gram option was chosen, a portion of the budget would have had to be dedicated to storing the database.  It was initially proposed to store the database on a hard drive at a cost of approximately 100 dollars.  
The Google alternative was available for free when obtaining the raw dataset, or at a cost of 150 dollars for a student license when purchased from the University of Pennsylvania Linguistic Data Consortium <ref>T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13</ref>.  Unlike the Microsoft alternative, if the Google N-Gram option was chosen, a portion of the budget would have had to be dedicated to storing the database.  It was initially proposed to store the database on a hard drive at a cost of approximately 100 dollars.  


The proposed budget can be seen in the tables highlighting the key costs of each option in Figure X.     
The proposed budget can be seen in the tables highlighting the key costs of each option in Figure 45.     
[[File:Proposed_Budget.png|thumb|500px|centre|'''Fig. X:''' Proposed Budgeting Table]]
[[File:Proposed_Budget.png|thumb|500px|centre|'''Fig. 45:''' Proposed Budgeting Table]]


For reasons discussed in the Method section of Task 2: N-Gram Search, upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google.  A decision was made to utilise the free database provided by Google as it was not deemed justifiable to spend $150 on the processed data from the Linguistic Data Consortium since it was proposed that the raw dataset could be cleaned up through writing software.
For reasons discussed in the Method section of Task 2: N-Gram Search, upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google.  A decision was made to utilise the free database provided by Google as it was not deemed justifiable to spend $150 on the processed data from the Linguistic Data Consortium since it was proposed that the raw dataset could be cleaned up through writing software.
Line 433: Line 418:
The initial budget was based on the assumption that the Google N-Gram database could be stored locally, although this was feasibly possible in its compressed form, the local computing power available would have been insufficient to run the search engine code through the database within a the time frame of the project.  As discussed in the Method section of Task 2: N-Gram Search, a cloud based computing service called ‘Amazon Elastic Compute Cloud’ was utilised to store and process the database.  The free tier was considered but did not provide the specifications required to meet the needs of our task, and so instances on Amazon EC2 were hired at a rate of 0.853 dollars per hour <ref>Amazon Web Services. (2015). Amazon EC2 Pricing [Online]. Available: https://aws.amazon.com/ec2/pricing/</ref>.  Upon storing the initial full database, running our search code, and downloading our results generated from the outputs of the code, the total cost of utilising the service came to 576 dollars.  This caused our project to exceed the initially proposed budget.  The reason for the additional project expenditure was that despite our efforts, it was difficult to predict the precise time that it would take to upload, store and process the database on the cloud service.  The initially proposed budget did not include the need or costing for the Amazon server since this was not something that could be reasonably foreseen at the start of the project since it was initially thought that the Microsoft N-Gram Service would be suitable for the needs of the project, and if this was not suitable, that the Google N-gram alternative would be able to be stored locally.   
The initial budget was based on the assumption that the Google N-Gram database could be stored locally, although this was feasibly possible in its compressed form, the local computing power available would have been insufficient to run the search engine code through the database within a the time frame of the project.  As discussed in the Method section of Task 2: N-Gram Search, a cloud based computing service called ‘Amazon Elastic Compute Cloud’ was utilised to store and process the database.  The free tier was considered but did not provide the specifications required to meet the needs of our task, and so instances on Amazon EC2 were hired at a rate of 0.853 dollars per hour <ref>Amazon Web Services. (2015). Amazon EC2 Pricing [Online]. Available: https://aws.amazon.com/ec2/pricing/</ref>.  Upon storing the initial full database, running our search code, and downloading our results generated from the outputs of the code, the total cost of utilising the service came to 576 dollars.  This caused our project to exceed the initially proposed budget.  The reason for the additional project expenditure was that despite our efforts, it was difficult to predict the precise time that it would take to upload, store and process the database on the cloud service.  The initially proposed budget did not include the need or costing for the Amazon server since this was not something that could be reasonably foreseen at the start of the project since it was initially thought that the Microsoft N-Gram Service would be suitable for the needs of the project, and if this was not suitable, that the Google N-gram alternative would be able to be stored locally.   


The final revised budget including total project expenditure can be seen in Figure X.  
The final revised budget including total project expenditure can be seen in Figure 46.  
[[File:Final_Budget.png|thumb|500px|centre|'''Fig. X:''' Final Budgeting Table]]
[[File:Final_Budget.png|thumb|500px|centre|'''Fig. 46:''' Final Budgeting Table]]


In conclusion, despite going over budget, the additional funds were kindly provided by the school of Electrical and Electronic engineering upon sending an application for funding including justification of our purchases.  The project work has benefited through the purchase of the Amazon service since we were able to complete a search of specific n-gram combinations of the code on the full Google N-Gram database.  It has provided us with results to present as part of our thesis and allowed us to meet the requirements set out in the aim of Task 2.
In conclusion, despite going over budget, the additional funds were kindly provided by the school of Electrical and Electronic engineering upon sending an application for funding including justification of our purchases.  The project work has benefited through the purchase of the Amazon service since we were able to complete a search of specific n-gram combinations of the code on the full Google N-Gram database.  It has provided us with results to present as part of our thesis and allowed us to meet the requirements set out in the aim of Task 2.


===Risk Analysis===
===Risk Analysis===
A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure [34]<ref>No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf</ref>. This can be seen in Figure X. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1.  This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis.  Another risk that occurred throughout the project was Illness.  This was able to be dealt with relatively easily through working from home for a short period of time.  The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors.  Bugs in code were reduced to the best of our ability through thorough testing and debugging of code.  Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood.  Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.
A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure <ref>No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf</ref>. This can be seen in Figure 47. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1.  This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis.  Another risk that occurred throughout the project was Illness.  This was able to be dealt with relatively easily through working from home for a short period of time.  The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors.  Bugs in code were reduced to the best of our ability through thorough testing and debugging of code.  Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood.  Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.


[[File:Risk_Assessment_Final.png|thumb|500px|centre|'''Fig. 6:''' Table of Risk Assessment]]
[[File:Risk_Assessment_Final.png|thumb|500px|centre|'''Fig. 47:''' Table of Risk Assessment]]


==Conclusions==
==Conclusions==


The work undertaken throughout the project has fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the 'Rubaiyat of Omar Khayyam as a one-time pad in conjunction with a new key technique, and developed a search engine to discover possible n-grams contained within the code.  The group was successful in completing all tasks outlined in the proposal, with the exception of the proposed extension task to analyse the mass spectrometer data of the Somerton Man's hair.
The work undertaken throughout the project has fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the ''Rubaiyat of Omar Khayyam'' as a ''one-time pad'' in conjunction with a new ''key'' technique, and developed a search engine to discover possible n-grams contained within the code.  The group was successful in completing all tasks outlined in the proposal, with the exception of the proposed extension task to analyse the mass spectrometer data of the Somerton Man's hair.


Through this, the group was able to critically review and further the statistical analysis  of the likely language of origin of the Somerton Man code conducted by previous groups.  The 2015 group has improved upon the search for n-grams conducted by previous groups by increasing number of results and search speed.  In addition to this, the results collected through implementing the search engine are valuable for future groups to analyse for useful grams that could be linked to the Somerton Man code.  The group has also furthered the exploration from previous groups into the possibility that the 'Rubaiyat of Omar Khayyam' was used as a one-time pad to encrypt the Somerton Man code by testing a new key technique.
Through this, the group was able to critically review and further the statistical analysis  of the likely language of origin of the Somerton Man code conducted by previous groups.  The 2015 group has improved upon the search for n-grams conducted by previous groups by increasing number of results and search speed.  In addition to this, the results collected through implementing the search engine are valuable for future groups to analyse for useful grams that could be linked to the Somerton Man code.  The group has also furthered the exploration from previous groups into the possibility that the ''Rubaiyat of Omar Khayyam'' was used as a ''one-time pad'' to encrypt the Somerton Man code by testing a new ''key'' technique.


The skills developed through undertaking this project include text processing and programming in a variety of languages including Java, MATLAB and Python.
The skills developed through undertaking this project include text processing and programming in a variety of languages including Java, MATLAB and Python.
The group has also thoroughly researched and learnt how to implement and evaluate statistical tehcniques including chi-squared testing, p-value calculation and hypothesis testing and developed skills in using Microsoft Excel software to perform statistical analyses.
The group has also thoroughly researched and learnt how to implement and evaluate statistical tehcniques including ''chi-squared testing'', ''p-value'' calculation and ''hypothesis testing'' and developed skills in using Microsoft Excel software to perform statistical analyses.


The main conclusions drawn from the project work include that the Somerton Man code was not created using the 'Rubaiyat of Omar Khayam' as one-time pad and the proposed method of using letter position within words as the key.  Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search.  The Universal Declaration of Human rights has too small a sample size of words in each language to accurately represent the initial letter frequency in each language for use in chi-squared testing.  Finally, although the results from the hypothesis testing were somewhat inconclusive, the results of all of the chi-squared testing have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an initialism.
The main conclusions drawn from the project work include that the Somerton Man code was not created using the ''Rubaiyat of Omar Khayam'' as a ''one-time pad'' and the proposed method of using letter position within words as the ''key''.  Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search.  The ''Universal Declaration of Human rights'' has too small a sample size of words in each language to accurately represent the initial letter frequency in each language for use in ''chi-squared testing''.  Finally, although the results from the ''hypothesis testing'' were somewhat inconclusive, the results of all of the ''chi-squared testing'' have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an ''initialism''.


Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.
Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.
Line 461: Line 446:
There is also the potential to run a limited search on the search engine results using a more sophisticated code than the one used by the 2015 group.  This could be used to generate more useful gram combinations to find commonly used English expressions or phrases that could be linked to the Somerton Man code.
There is also the potential to run a limited search on the search engine results using a more sophisticated code than the one used by the 2015 group.  This could be used to generate more useful gram combinations to find commonly used English expressions or phrases that could be linked to the Somerton Man code.


Future students could extend the statistical analysis to perform hypothesis testing on all European languages in Project Gutenberg, but an alternate method to the chi-squared testing performed would have to be utilised since the chi-squared values for the code against all other languages were too high to produce usable p-values.
According to previous group’s results and our results by using One-time pad method for task 3. Maybe One-time pad is not an appropriate method to decipher the Somerton Man code and Rubaiyat. Future group may use some new decryption methods instead of using One-time pad for task 3. 
 
Future students could extend the statistical analysis to perform ''hypothesis testing'' on all European languages in ''Project Gutenberg'', but an alternate method to the ''chi-squared testing'' performed would have to be utilised since the chi-squared values for the code against all other languages were too high to produce usable ''p-values''.


Another option would be to focus on English as the most likely language and statistically analyse the code against genres as conducted by the 2013 group <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>, but using the chi-squared method as opposed to the squared difference and standard deviation methods adopted to consolidate or refute the conclusions drawn from their results .
Another option would be to focus on English as the most likely language and statistically analyse the code against genres as conducted by the 2013 group <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>, but using the chi-squared method as opposed to the squared difference and standard deviation methods adopted to consolidate or refute the conclusions drawn from their results .
Line 468: Line 455:


==Appendices==
==Appendices==
* Appendix X: Full Search Engine Code
* Appendix A: Full Search Engine Code
[[File:Search_Engine_Code.pdf]]
[[File:Search_Engine_Code.pdf]]


* Appendix X: Project Gutenberg European Language Comparison Text References
* Appendix B: ''Project Gutenberg'' European Language Comparison Text References
[[File:Gutenberg_European_Language_Comparison_Text_References.pdf]]
[[File:Gutenberg_European_Language_Comparison_Text_References.pdf]]


* Appendix X: Project Gutenberg 20 English 44 Letter Text File References
* Appendix C: ''Project Gutenberg'' 20 English 44 Letter Text File References
[[File:Gutenberg_20_English_44_Letter_Text_File_References.pdf]]
[[File:Gutenberg_20_English_44_Letter_Text_File_References.pdf]]


* Appendix X: Project Gutenberg English Corpus 20 Novels References.pdf
* Appendix D: ''Project Gutenberg'' English Corpus 20 Novels References.pdf
[[File:Gutenberg_English_Corpus_20_Novels_References.pdf]]
[[File:Gutenberg_English_Corpus_20_Novels_References.pdf]]


Line 487: Line 474:
* '''ASIS:''' Australian Secret Intelligence Service
* '''ASIS:''' Australian Secret Intelligence Service
* '''ASD:''' Australian Signals Directorate
* '''ASD:''' Australian Signals Directorate
* '''P-value theorem:''' The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.
* '''P-value:''' The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.
[25] G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.
[25] G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.
[26] No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1</ref>.
[26] No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1</ref>.
* '''Chi-Squared Test:'''  
* '''Chi-Squared Test:''' A 'goodness of fit' statistical test used to compare how closely observed data is related to expected data. <ref> N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000. </ref> <ref> N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944 </ref>.
* '''Universal Declaration of Human Rights:'''  
* '''Hypothesis Test:'''  The formal procedure used by statisticians to accept or reject statistical hypotheses <ref>Stat Trek. Hypothesis Tests. [online]. Available: http://stattrek.com/hypothesis-test/hypothesis-testing.aspx</ref>.
* '''Project Gutenberg:'''  
* '''Universal Declaration of Human Rights:''' A text that has been translated into over 400 languages <ref>Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml </ref>.
* '''Project Gutenberg:''' A website containing a large number of free ebooks in a wide range of languages <ref> Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/ </ref>.
* '''N-gram model:''' The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs <ref>A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166.
* '''N-gram model:''' The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs <ref>A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166.
[28] No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17</ref>.
[28] No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17</ref>.
* '''One-time pad:''' The one-time pad is a decoder technology which cannot be cracked if the correct key is used <ref>S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract</ref>.
* '''One-time pad:''' The one-time pad is a decoder technology which cannot be cracked if the correct key is used <ref>S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract</ref>.
* '''Rubaiyat of Omar Khayyam:'''
* '''Initialism:''' A group of letters formed using the initial letters of a group of words or a phrase <ref>No Author. Initialism [online]. Available: http://dictionary.reference.com/browse/initialism</ref>.
* '''Initialism:''' A group of letters formed using the initial letters of a group of words or a phrase <ref>No Author. Initialism [online]. Available: http://dictionary.reference.com/browse/initialism</ref>.
* '''Plaintext:''' The information of an original message, which is desired to be deciphered from the ciphertext <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
* '''Plaintext:''' The information of an original message, which is desired to be deciphered from the ciphertext <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
* '''Ciphertext:''' The encoded format of a message <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
* '''Ciphertext:''' The encoded format of a message <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
* '''Key:''' What is needed to convert the ciphertext into the plaintext using the one-time pad  <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
* '''Key:''' What is needed to convert the ciphertext into the plaintext using the one-time pad  <ref>No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.

Latest revision as of 18:47, 3 November 2016

Abstract[edit]

The project involves the mysterious case of a dead man found at Somerton Beach, South Australia. There was no evidence to show the man’s identification or cause of death, however, there were 5 lines of letters that were found on a scrap of paper in the dead man’s trouser pocket. It was later discovered that the scrap of paper was torn from a book known as The Rubaiyat of Omar Khayyam. These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. The case still remains unsolved today, and so this project has been undertaken in order to uncover further case evidence. The aims and objectives of the project include using various computational techniques to statistically analyse the likely language of origin of the code, designing and implementing software in order to decipher the code, and ultimately attempting to solve the cold case.

Executive Summary[edit]

This project involves the design and implementation of software in order to decipher the code associated with the Somerton Man murder mystery. This document is a Final Report and Thesis outlining the key aims, methods, results and evaluations and justifications of the specific tasks involved in the Code Cracking: Who Murdered The Somerton Man? Honours project. The information in the report also includes the motivation, previous studies, aims, objectives, and significance of the project, as well as technical background, knowledge gaps, technical challenges, specific project tasks and project management resources.

The project fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the Rubaiyat of Omar Khayyam as a one-time pad in conjunction with a new key technique, and developed a search engine to discover possible n-grams contained within the code.

The conclusions drawn from the project work include that the Somerton Man code was not created using the Rubaiyat of Omar Khayam as a one-time pad and the proposed method of using letter position within words as the key. Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search engine developed. Finally, the results of the chi-squared testing have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an initialism.

Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.

Acknowledgements[edit]

  • Project supervisor Prof. Derek Abbott for providing the inspiration and motivation in undertaking this project, as well as advice on specific tasks to be completed to further the research into uncovering the mystery of the Somerton Man code.
  • Project supervisor Dr. Matthew Berryman for all advice and assistance with statistical methods and software programming.

Introduction[edit]

Motivation[edit]

Fig. 1: Photograph of the letters making up the Somerton Man Code [1]

On the 1st of December, 1948, the body of a man was found at Somerton Beach, South Australia [2]. There was no evidence to show the man’s identification and the cause of death [3], however, there were 5 lines of capital letters, with the second line struck out, that were found on a scrap of paper in the dead man’s trouser pocket [4]. A photo of the paper containing the letters can be seen in Figure 1. It was later discovered that the scrap of paper was torn from a book known as the Rubaiyat of Omar Khayyam [5]. These letters are considered vital to the case as it is speculated that they may be a code or cipher of some sort. As engineers, we have the ability to help investigators in solving the case. With that in mind, this project is being undertaken to attempt to decrypt the code in order to help solve the cold case.

The South Australian Police stand to benefit from this project not only from the decoding technology developed for this case, but it also may be able to be applied to solve similar cases. Historians may be interested in gaining further historical information from this project since the case occurred during the heightened tension of the Cold War, and it is speculated that this case may be related in some way [6]. Pathologists may also be interested as the cause of death may have been an unknown or undetectable poison [7]. This project stands to benefit the wider community as well as extended family of the unknown man to provide closure to the mysterious case. Professor Derek Abbot also stands to benefit as he has been working closely with honours project students for the past seven years in an attempt to decipher the Somerton Man code.

Previous Studies/Related Work[edit]

Fig. 2: 3D generated reconstruction of bust of Somerton Man from 2012 Final Report [8]

Previous professional attempts to decipher the code were limited since they did not use modern techniques or have access to modern databases. Another limitation is that some of the characters in the code appear to be ambiguous and previous attempts made fixed assumptions on particular characters [9]. The Australian Navy’s response was that the letters were “neither a code nor a cipher” [10]. The Australian Department of Defence analysed the characters in 1978 using the computer technology available of that era and concluded:

  • a) “There are insufficient symbols to provide a pattern”
  • b) “The symbols could be a complex substitute code or the meaningless response to a disturbed mind”
  • c) “It is not possible to provide a satisfactory answer” [11]

Other previous studies into deciphering the code include Honours Projects at the University of Adelaide from 2009-2013. The previous work undertaken by these groups includes: multiple evolutions of letter frequency analysis of the code on a variety of base texts in a number of languages, initial letter and sentence letter probabilities, the probabilities of known cypher techniques, the likelihood of the code being an initialism of a poem, the use of various one-time pad techniques, the design and implementation of a web crawler, the analysis of text type and genre of the code’s likely plaintext, the implementation of pattern matching software into the web crawler, a 3D generated reconstruction of the bust of the Somerton Man (see Figure 2) and the analysis of mass spectrometer data taken from the Somerton Man’s hair [12] [13] [14] [15] [16]. The main conclusions that past groups have come to in their projects are: that the letters are unlikely to be random, the code is unlikely to be an initialism, it is likely that the Rubaiyat of Omar Khayyam was used as a one-time pad, the language of the code is likely to be English, the code is unlikely to be an initialism of a poem and that the Rubaiyat of Omar Khayyam was not used as a straight substitution one-time pad [17] [18] [19] [20] [21]. The analysis and extension upon specific elements of previous work that are directly related to the 2015 group’s project are discussed in the 'Method – Specific Tasks' section.

Aims and Objectives[edit]

The key aims and objectives in this project included the aim to statistically analyse the likely language of the plaintext of the code. Another aim was to design and implement software in order to try and decipher the code. This was to be implemented by using the Rubaiyat of Omar Khayyam as a one-time pad in conjunction with a new key technique, and by developing a search engine to try to discover possible n-grams contained within the code. The third aim was to analyse mass spectrometer isotope concentration data of the Somerton Man’s hair. Finally, the ultimate aim was to decrypt the code in order to solve the mystery, however this was somewhat unrealistic as the code has remained uncracked for many years. Despite this, computational techniques were to be utilised to attempt the decryption, and at the very least, the past research into the case was to be furthered for future Honours students.

Significance[edit]

Considered “one of Australia’s most profound mysteries” at the time [22], this case still remains unsolved today. As the development of decoder technology and the related knowledge progresses, this project poses the opportunity to uncover further case evidence. The skills developed in undertaking this project were also of great significance in a broader sense, as these can be transferrable to possible future career paths. The techniques developed include: software and programming skills, information theory, probability, statistics, encryption and decryption, datamining and database trawling. The job areas and industries that these skills can be applied to are: computer security, communications, digital forensics, computational linguistics, defence, software, e-finance, e-security, telecommunications, search engines and information technology. Some possible job examples include working at: Google, ASIO, ASIS and ASD [23].

Technical Background[edit]

P-Value Theorem Explanation[edit]

The p-value is the probability of an effect equal to or greater than the one observed, presuming the null hypothesis of no effect is true. It is a measure of evidence against a null hypothesis [24].

In statistics, the p-value is used for testing a statistical hypothesis by observing sample results. P-value testing is an effective method to presume whether the null hypothesis of no effect is true. In this project, we will use p-values to do some simple tests to determine whether English was the language in the original message [25].

In this case, we make an assumption that English was not the language in the original message. Thus we should get a larger p-value which is higher than 0.05 in the simple test. This means the observed data point is in the range of “more likely observation”, mentioned in Figure 1 below. Oppositely, if we get a smaller p-value which is lower than 0.05 in the simple test, this means the observed data point is in the range of “very un-likely observations”. This indicates our assumption is wrong and that English was the language in the original message [26].

Fig. 3: Example of a p-value computation [27]

As the graph in Figure 3 shows, the horizontal coordinate is the set of possible results, and the vertical coordinate is the probability density. The p-value is in the right side of observed data point, which is between the curve and horizontal coordinate [28].

Chi-Squared Test Explanation[edit]

The chi-squared test is a statistical test commonly used to compare observed data with expected data. It requires than there is no relationship between the observed data and expected data. It means the observed data should not affect the expected data and they are independent respectively. [29].

The chi-squared test a ‘goodness of fit’ test, meaning it measures how closely one group of data is related to another. Since it is a non-parametric test, it doesn’t care about distribution of samples. It means there is no requirement that expected and observed data should follow a normal distribution [30].

Fig. 4: Chi-squared example

Figure 4 is a simple example to show how our group is to use the chi-squared test for Task 1. Firstly, we can calculate the chi-squared value for the letter A by using the equation on the right of Figure 4. The observed value is the count of letter A in the Somerton Man code, the expected value is the total number of letters in Somerton Man code which is 44, multiplied by the frequency of letter A in the base text language used. This is repeated to calculate the chi-squared value for the rest of the letters from B to Z. Finally, the chi-squared value for all letters are summed. This is the chi-squared value between the base text language and Somerton Man code. Lower Chi-squared values mean the Somerton Man code is more likely to be this language.

Universal Declaration of Human Rights Explanation[edit]

The Universal Declaration of Human Rights was created in 1948 and is translated into over 400 different languages [31]. The group used translations of this declaration as the base text for Task 1: Statistical Frequency Analysis of Letters.

Fig. 5: Universal Declaration of Human Rights From British Library[32]

Project Gutenberg Explanation[edit]

Fig. 6: Project Gutenberg Logo [33]

Project Gutenberg was established in 1971. It offers over 50,000 free ebooks available in a range of different formats including TXT, HTML, PDF and EBOOK and a number of languages including Chinese, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Hungarian, Italian, Latin, Portuguese, Spanish, Swedish and Tagalog [34]. The database was used to provide base texts for Task 4: Statistical Frequency of Letters Reanalysis due to the increased sample size of letter frequencies in each language when compared to the Universal Declaration of Human Rights as a base text.

N-Gram Model Explanation[edit]

The n-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs. The n-grams typically are collected from articles or books. If the elements are words, n-grams may also be called shingles [35].

Size n=1 of the n-gram model is called a "unigram", size 2=2 of the n-gram model is called a "bigram", size n=23 of the n-gram model is called a "trigram", and so on.

An n-gram model is a type of probabilistic language model for predicting the next item. In a word, the item n is only related to item (n-1) regardless of any other items in the sequence [36]. The group will use an n-gram database to find common groups of words for a variety of initialisms.

In Figure 7 shown below, we use one line from the Somerton Man code as a sample sequence to produce the n-gram model.

Fig. 7: N-gram model applied to first line of Somerton Man code

One-Time Pad Explanation[edit]

The one-time pad is a decoder technology which cannot be cracked if the correct key is used. For example, if we have a ciphertext, we need the key to decode it and get correct message [37].

Fig. 8: One-Time Pad Example [38]

As in Figure 8 [39] shown above, we have the cipher text: EQNVZ. E is the fifth letter in the alphabet, so we label it as 4 (initially we label the first letter, A, as 0). Similarly, the letter Q is 16, N is 13, V is 21 and Z is 25. If we have a key which is: XMCKL, we can label X as 23, M as 12, C as 2, K as 10 and L as 11 using the same method as above. Then we use E minus X and process all cipher text elements one by one. Negative numbers are not used, so E minus X equals -19, then we add 26 and get the number 7, which represents letter H. Finally, we find the original message to be HELLO. The one-time pad technique will be used as another method to attempt to decipher the Somerton Man code.

Knowledge Gaps and Technical Challenges[edit]

In order to complete the specific tasks proposed within this project, the group members were required to develop new skills in text processing and programming in a variety of languages. These languages included Java, MATLAB, and Python. The group was also required to research and learn how to implement and evaluate statistical techniques including chi-squared testing, p-value calculation and hypothesis testing. Another necessary skill to be developed was the use of Microsoft Excel software to perform statistical analyses. The technical challenges that were to be faced within the project were directly associated with these knowledge gaps.

Method - Specific Tasks[edit]

Task 1: Statistical Frequency Analysis of Letters[edit]

Aim[edit]

A Critical review of the statistical frequency analysis of the letters from the 2013 group was to be conducted. It was then proposed that the 2015 group was to repeat the statistical tests done by the 2013 group. Like the 2013 group, the Universal Declaration of Human Rights was to be used as the base text based on advice from Professor Abbott, but the validity of this text being used as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be used on 44 letters out of pieces of text from the most likely languages. The 2013 group’s analysis was then to be extended by the 2015 group by calculating p-values and implementing hypothesis testing. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run p-value tests on benchmark pieces of text from the most likely languages and see if the p-values suggested that the letters are indeed from those languages. The use of benchmark pieces of text were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the Universal Declaration of Human Rights as a base text. Once the most likely language was determined, the group was to process an additional number of benchmarks of that language and obtain a mean p-value. Next, a hypothesis test could be performed based on the mean p-value obtained from the benchmarks, when compared to, the p-value of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this hypothesis testing method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.

Method[edit]

2013 Statistical Frequency Analysis Review[edit]

A critical review of the statistical frequency analysis of the letters from the 2013 group has been conducted to determine possible extensions to be undertaken by the 2015 group. The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the Universal Declaration of Human Rights. This document was chosen since it is translated in over 400 languages. The 2015 group is to analyse this choice of base document using statistical techniques. 266 languages were analysed, since many of the translations were not text files, but paper scans. This seems like a reasonable omission due to the time constraints of the project and the common European languages are all in text form. The analysis included accented letters normalised to their ‘parent’ characters, for example, considering 'ǎ' as 'a'. The languages were then analysed using a variety of combinations of the ambiguous letters within the code. The 2013 group’s statistical analysis was then refined to the top 20 closest European languages. These included more uncommon European languages such as Scots and Vepsian. The graphs of the results from the 2013 group’s analysis can be seen in Figure 9. The most likely language of the Somerton Man code and control text was found to be Scots. This is unlikely since it is not a common European language as it is a Scottish dialect [40]. Due to this, Scots, and other less common European languages will be omitted in the 2015 group’s statistical analysis. Instead, only the most common European languages will be included in order to further refine the statistical process. Despite this, a conclusion was made that the most likely language was English since Scots and English shared very similar initial letters. In order to test this conclusion, the 2013 group used the English translation of the Rubaiyat of Omar Khayyam as an English control text. The first 44 words from the text were used since this is the same number as the number of letters in the Somerton Man code. The results showed, once again, that Scots was the most likely language, followed by English (see Figure 9). Analysing this choice of control text, the choice was made by the 2013 group as a matter of convenience. This may not have been the best decision to make as the Rubaiyat of Omar Khayyam was originally written in Persian, and has since been translated into English. The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text as the translation may use uncommon words or expressions. Another reason for the suspected use of uncommon words or expressions is the fact that the text is a book made up of four line poems, meaning that it may not accurately represent commonly used words or letters in such a small sample size of 44 words. The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel, originally written in the most likely language.

Fig. 9: 2013 Statistical Language Analysis Graphed Results [41]
Assumptions[edit]

Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European Language. This assumption was made based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption that was made is that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption has also been made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letter. Further assumptions were the inclusion of all accented letters in base texts and all combinations of ambiguous letters in the code. This assumption entailed the use of versions of the code including 6Ms, 4Ms and 2Ws, and the code as it appeared in the original police report from 1949[42]. These assumptions were made to increase the robustness of the analysis. A final assumption that was made was the choice not to include the crossed out line of letters in the code. This was made since from observation, one can see that this line is very similar to the third line and is most likely to be a mistake.

Base Text[edit]

The 2013 group’s frequency analysis code was obtained. Initially, some background research in Java code was undertaken for use in compiling and editing the code. The text files used by the 2013 group from the Universal Declaration of Human Right for the statistical frequency analysis were also obtained. A selection of these were then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an excel spread sheet including language, numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.

A spread sheet containing the letter frequency data from the Universal Declaration of Human Rights was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency java code. This data spread sheet was used for statistical calculations for comparison of the letter frequency in the code versus each language in the Universal Declaration of Human Rights.

Parametric Testing[edit]

Firstly, a group of test texts were prepared, using 3 groups of 44 letters from the English, French and German languages from the Universal Declaration of Human Rights, as well as the Somerton Man Code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the p-value calculation method used was tested using various pieces of software. A test case with known p-value result from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, as well as using built-in MATLAB functions found in Engineering Maths IIA notes, and by hand. All of these methods produced the same p-value using the test case with known p-value and so the ANOVA method was verified to be functioning correctly.

Parametric Testing Results[edit]

ANOVA One Way analysis was used in Microsoft Excel to compare letter frequency between texts, as well as ANOVA Two Way Analysis of Variance, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency. Both of these methods did not produce valid p-values as they used the comparison of total means and variances. Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA Notes. This was computed using MATLAB, but also produced unusable p-values since the method uses mean and standard deviation of the differences between samples to calculate the p-value, thus letter and text type data is lost in the process and so is not applicable. The p-values calculated using these methods fluctuated depending on the type of data used. If the raw number of letter frequencies for the sample text or code (44 letters) versus the Universal Declaration of Human Rights reference text (1000-2000 letters depending on language) was used, then the p-value became very small since the total means of each text were vastly different. Whereas, if the frequency of each letter as a percentage of the total number of letters in each text were used, this gave a p-value of 1 since the means became the same. Thus this method of mean comparison did not work for letter frequency analysis.

Parametric Testing Evaluation[edit]

All of these methods did not compare individual data entries from one group to their corresponding entry in another group. Instead they used the sample size, sum, mean, and variance of each group to compare to the other group. Because of this, this method could not be used when comparing letter frequency between languages. For example the frequency of the letter A in one text, must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.

Non-Parametric Testing[edit]

Due to the lack of usability of the normally distributed statistical methods, a number of non-parametric tests were researched including the Kolmogrov-Smirnov Test, Mann-Whitney U Test and Chi-Squared test. These 3 tests were researched and considered but ultimately the chi-squared test was chosen based on a cryptography reference found that uses the chi-squared test to compare a code to a particular language [43], as well as advice from supervisor Dr. Berryman. The base text chosen for testing was the full English translation of the Universal Declaration of Human Rights. The sample texts chosen were the first 44 letters from a the novel Tess of the d’Urbervilles by Thomas Hardy as an English benchmark [44], as well as the first 44 letters from the German, French and Zapoteco translations of the Universal Declaration of Human Rights.

Non-Parametric Testing Initial Results And Evaluation[edit]

Initially, a result was unable to be computed for the Chi-Squared value or p-value using the chi-squared test method. It was soon discovered that in the calculation of the Chi-Squared value, if the frequency of a letter in the reference text (a particular language from the Declaration of Human Rights) was 0, this caused the denominator of the Chi-Squared value equation to be 0 and thus the equation became invalid (See Figure 10).

Fig. 10: Chi-Squared Formula

An assumption in the method of the chi-squared testing had to be made in an attempt to rectify the issue. The count for letters in the reference text that appeared 0 times, were altered to 1, a small number chosen to be the closest integer value to 0 in the hope that it would not have significant enough effect on the results to cause a skew in the data. This assumption was made since the sample size for the reference text was too small, and so not every letter appeared at least once. After doing this, and computing all results, it was realised by the group that despite the initial results when changing the count from 0 to 1 looking promising, the frequencies of letters that actually appeared once had the same frequency as letters that did not appear at all and thus reduced the accuracy of the data used for the analysis. This was rectified by altering the frequencies of letters appearing 0 times to 0.0001, since the frequencies for letters occurring 1 time had a frequency of approximately 0.0005, and so we had to select a number lower than the lowest occurrence. The results of both methods are compared and contrasted in the following sections.

Fig. 11: Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Count from 0 to 1)
Fig. 12: Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Frequency from 0 to 0.0001)

The initial chi-squared test using both methods resulted in large chi-squared values that were brought about by a small text size sample of 44 letters (See Figures 11 and 12). This effectively caused the p-value results to be very small numbers and therefore were unable to be used as a means of comparison and caused the null hypothesis to be rejected in every case. However, this test could still be used as a measure of similarity since the Chi-Squared Values for each text comparison can be compared based on the fact that the smaller the Chi-Squared value, the more similar the two texts are. Thus, despite being unable to obtain valid p-values and perform hypothesis testing, the texts could still be ranked based on the Chi-Squared value in the full analysis. Comparing the results from the count 0 to 1, versus the frequency 0 to 0.0001 shows that the chi-squared values were reduced overall in the second analysis, except for the chi-squared values calculated for the Somerton Man code.

Top 20 European Language Comparison[edit]

The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the chi-squared test against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run as a method of comparison to test how similar our results were to the 2013 group’s results. The results based on the two assumptions mentioned will be displayed.

Top 20 European Language Comparison Results[edit]
Fig. 13: Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Count from 0 to 1)
Fig. 14: Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Count from 0 to 1)
Fig. 15: Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Frequency from 0 to 0.0001)
Fig. 16: Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Frequency from 0 to 0.0001)
Top 20 European Language Comparison Evaluation[edit]

The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.

Top 20 European Languages based on Estimated Number of Native Speakers[edit]

In extension to the 2013 group’s work, the 2015 group conducted a chi-squared test of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers [45]. This test included all versions of the Somerton Man code including the versions with 6 Ms, 4 Ms and 2 Ws, 5 Ms and 1 W and then the average of these results were also plotted.

Top 20 European Languages based on Estimated Number of Native Speakers Results[edit]
Fig. 17: Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Count 0 to 1)
Fig. 18: Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Count 0 to 1)
Fig. 19: Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Frequency 0 to 0.0001)
Fig. 20: Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Frequency 0 to 0.0001)
Top 20 European Languages based on Estimated Number of Native Speakers Evaluation[edit]

The results show that when changing the count from 0 to 1, for two of the three code versions and their average value, English was the closest language to the Somerton Man code. Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms, however, considering Kurdish is not a common language, and based on the average results, it is safe to say that English was the closest language to the code using this method.

When changing the frequency from 0 to 0.0001, the results deviated from those obtained using a count of 1. This caused English to produce a higher chi-squared value and for other languages to produce lower-chi squared values, causing Kurdish to have the lowest chi-squared value in all cases, and for English to become the third closest language in the average results. This may be evidence to suggest that upon further inspection, the Universal Declaration of Human Rights may not be a suitable base text for use with the chi-squared method, irrespective of insufficient sample size.

Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample[edit]

A comparison of the English control text (a 44 letter sample of Thomas Hardy’s Tess of the d’Urbervilles) against the top 20 European Languages based on Estimated Number of Native Speakers was conducted in order to test the ability of the method being able to identify an actual English sample text among the top 20 European languages. Assumptions of altering the count and frequency of letters that appeared 0 times to 1 and 0.0001 respectively were both used and can be seen in Figures 21 and 22.

Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Results[edit]
Fig. 21: Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Count 0 to 1)
Fig. 22: Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Frequency 0 to 0.0001)
Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Evaluation[edit]

The results of this test show that English was the closest language to the English sample text in both cases. This result is desirable as it successfully verified the ability of the chi-squared test to distinguish a 44 letter English sample out of the top 20 European languages. This could have been used to back up the results obtained from comparing the same 20 languages against the Somerton Man code, but unlike the same test performed on the Somerton Man code, adjusting the method from count 0 to 1, to frequency 0 to 0.0001 caused no effect on English being the closest language to the sample text.

Evaluation and Justification[edit]

The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the validity of the method as well as the Universal Declaration of Human Rights as a base text. The group was then to extend the analysis by calculating the p-values for the Somerton Man code when compared to the most common European languages and perform hypothesis testing based on the results. The group was also to use benchmark texts to test the statistical accuracy of the method as well as the validity of the Universal Declaration of Human Rights as a base text. The 2015 group’s statistical analysis had achieved almost all of its proposed goals. A slight diverge from the initially proposed method was decided upon once it was found that p-values useful for comparison or hypothesis testing were unable to be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated chi-squared value. All assumptions outlined in the proposal were followed, with the addition of the modification of the count and frequency data to account for the small sample size of the base text.

Previous results using the Universal Declaration of Human rights as a base text were confirmed when using the assumption to adjust letters with count 0 to 1. These were not consistent when using the second assumption to adjust the letters with frequency 0 to 0.0001. These inconclusive results left more to be desired from the analysis and the validity of Universal Declaration of Human Rights as a base text was found to be questionable due to its limited sample size. English was found to have the lowest chi-squared value for a number of the calculations, meaning that the Somerton Man code is most likely to be formed from the English language, however since a reasonable p-value for any language could not be obtained, there was still a potential for reanalysis.

Task 2: N-Gram Search[edit]

Aim[edit]

The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man code. Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section). Based on this, for this task an assumption was made that the code is an initialism. A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases. This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section). Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded. This was to drastically increase the speed at which the gram combinations on the web could be found. The search engine was required to output a list of possible gams from the input letter combinations.

An assumption that was made in order to complete this task was that the code is an initialism, meaning that it is made up of letters that represent the first letters of an ordered series of words. This assumption was made based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this, a second assumption was made that the letters in the code, and thus the words in the grams, were order relevant. A final assumption was that all variants of ambiguous letters in the code were to be included.

Method[edit]

Research upon the available databases was conducted and it was found that the two largest databases that most suited our needs were 'Microsoft Web N-Gram Services' [46], and Google N-Gram [47]. The use of Microsoft Web N-Gram Services was initially considered more favourable due to its larger database [48] , better provided documentation and tutorials [49], and lower initial cost (see Budget Section). Despite the Microsoft alternative's advantages, upon further research and after consulting the Microsoft Web N-Gram, it was discovered that it could be used for our purposes using a best-first DP search, with a very large number of calls to the N-gram service's generate method. Also, through consulting our project supervisor Dr. Berryman, there was a concern that a combinational explosion would occur, meaning that for a 5-gram search, if the Microsoft database was say 50,000 words, the search engine would have to make 50,000^5 + 50,000^4 + 50,000^3 + 50,000^2 + 50,000 calls to the database to complete the search. Due to all the network calls required to use this method, the Microsoft Web N-Gram Service was deemed unfit for purpose in our application, since the method would not be fast enough to complete all of the searches that we required. Due to this, the Google alternative was considered.

Programming Language Selection[edit]

The search engine was initially to be implemented via web application using developer tools such as Microsoft Visual Studio or Eclipse [14], for use with Microsoft's Web N-Gram Service. The application could be programmed in native languages including Visual Basic, Visual C#, Visual C++, Visual F# and Jscript [50]. Instead, the programming language of Python was chosen to be used in conjunction with the Google N-Gram database through advice from Dr Berryman and due to its ease of use and efficiency with text processing.

Upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google. The Linguistic Data Consortium's version of the database was initially chosen as an alternative since it had the advantage of a concise and clean nature, with all total n-gram frequencies summed and collated. Upon discussion with supervisor Dr. Berryman, it was decided that it was not worth purchasing the cleaned version of the database since we could extract the data we needed from the raw database and the outputs could be cleaned up easily enough through writing some simple codes in Python. Due to this, it was decided that the database provided by Google was to be used for our purposes (See budget section for further details).

Storage and Processing Implementation[edit]

Due to the size of the database, although being able to be physically stored locally, the local computing power available would have been insufficient to run the search engine code through the database within the time frame of the project. Instead, a cloud based computing service with increased processing power was sought out to be able to complete the database search within the time restrictions of the project. Upon considering a number of options, it was decided that 'Amazon Elastic Compute Cloud' was to be used due to its robust storage and processing options [51] and Dr Berryman's prior experience in using this service. The Amazon EC2 free tier was assessed for use but had a 30GB storage limit[52], which was insufficient to store the Google N-gram database on. In addition to this, the instance sizes provided by the Amazon EC2 free tier were t2.micro instances, meaning that they provided 1 vCPU, 1 Gib of RAM and only 20% of each vCPU could be used [53]. Based on this, it was estimated that using this version of the Amazon Elastic Compute Cloud would have taken approximately 20 months to complete, which was far too long to complete within the project timeframe. Instead, it was proposed to use the high input/output Amazon i2 tier to provide the performance needed to store and process the database. After some experimentation with different tiers, two i2.xlarge instances run on Amazon EC2 were proposed to be used, providing two sets of instances, each containing 4 vCPUs, 30.5 GiB of RAM, and 2 x 800 GB SSD Storage[54]. Using this tier allowed for parallelisation by running separate processes for each group of n-gram inputs from n=1-5 using 5 separate instances of the search engine code.

Search Engine[edit]

The initial n-gram search code was written in Python and submitted to our GitHub repository for review. Based on advice from project supervisor Dr Berryman, it was discovered that the code would work on a small data set, but since our data set was so large (1.79 Tebibytes when compressed), the code was modified to fit the suggested workflow and run in parallel on Amazon instances.

The maximum number of n provided by the gram database was five. Due to this, a maximum of five letter gram groups from the code could be processed at a time. This was achieved by writing a code in Python to generate all possible 5-gram initialisms from all code variants, including the crossed out line, and output them into a corresponding text file. The same was also done for 4, 3, 2 and 1-grams and stored in their respective text files. These were to be used as input files from which the search engine was able to perform searches to query the database.

The search engine code was also written in Python. It functioned by taking in the initialism combinations from the Somerton Man code of length n, from text files created by the intiialism generator code, and stored them in a dictionary labelled 'initialisms of interest'. The grams from the database were read in by a reader and initialisms were generated from the grams in each line of the reader. If the initialism generated from the line of the reader matched an initialism in the dictionary containing the initialisms of interest, the full gram was output into a corresponding text file containing results of length n. This code was copied and modified to be used for each gram length from n=1-5. A simplified diagram of the way the code works can be seen in the flowchart in Figure 23, and the full code can be seen in Appendix A.

Fig. 23: Search Engine Flowchart

Running our code on the Google N-Gram database stored in the i2.xlarge instances in parallel for each group of n-gram inputs from n=1-5 took approximately two weeks. These raw results were then small enough to be stored and processed locally and so the Amazon EC2 service was no longer required.

The frequency for each n-gram was then taken using Python code to count the number unique entries for each gram. This was implemented in order to speed up the time in which to obtain a frequency to be used to rank the popularity of each gram. This was a bug that unfortunately caused the frequency of occurrence of grams in each year to be lost, and so the count of the number of years in which each gram occurred was used as measure of frequency.

Processing Search Results[edit]

Once the raw results were obtained, some grams contained words followed by an underscore and the corresponding lexical category of the word (ie. noun, verb, adverb, adjective, pronoun etc.). This was desired to be removed and so another python code was written to remove everything but the words themselves from each line in the results.

Upon processing the raw results, the output of the lexical category removal results showed multiple identical results with individual frequencies for the numbers of years in which they occurred. This was brought about since previously the database considered these entries to be unique results, but now with the lexical categories removed, some results became identical. This was rectified by writing another code in Python to process these cleaned results to combine identical entries and sum their frequency of years in which they occurred. This code was then duplicated and modified into two codes, the first output the results sorted alphabetical order, and the second in order of frequency of years in which each result occurred from highest to lowest. The alphabetically sorted outputs were used as a means of comparison to the cleaned inputs since these were also sorted alphabetically, in order to check that the code was functioning correctly. The frequency sorted outputs were more useful since they able to be used to generate a condensed list of the top 30 most popular initialisms that could be generated from the letters from all variations in the Somerton Man code, seen in the results section in Figure 24.

Combinations of Search Results[edit]

Finally, a code was written in Python to generate all possible combinations of the top 2 5-gram group results for each variant of the code, where the top 2 results were based on frequency of years in which they occurred. This was achieved using a non-overlapping sliding window of 5 letters in length. The way this code worked can be more easily explained using the following example:

For simplicity, using 2-grams and the code ABAC: If the top 2 2-grams for AB are Absolute Bargain and American Beagle, and the top 2 2-grams for AC are Air Conditioning and Alternating Current, then all possible combinations for the code are: Absolute Bargain Air Conditioning, Absolute Bargain Alternating Current, American Beagle Air Conditioning and American Beagle Alternating Current.

This code was implemented as an exercise to see if any interesting or useful results could come about using this simple method. Unfortunately, this produced nonsensical results due to the disjoint between each 5-gram group's search results, a sample of these can be seen in the results section in Figure 25. Due to the time constraints of the project, the code was not able to be developed any further, but the code and the results it provides can be used as a first step towards obtaining meaningful or useful combinations of n-grams from the results obtained using the search engine developed throughout this project. This code could be improved by using a sliding window that progresses by less than 5 letters for each search, for example, using a step size of 1 letter would create the maximum possible overlap of 4 letters between each input gram group. More information on this and other suggested improvements can be found in the future work section.

Results[edit]

Evaluation and Justification[edit]

Unfortunately, the database used in the search engine was smaller in size than the preferred Microsoft option[55]. This could have provided a larger number of search results to be used for future analysis, but since the workflow utilised to access this database did not suit the application proposed by this project, the Microsoft option was deemed unusable. Another limitation of using the Google N-Gram database was that it only includes grams that have appeared more than 40 times across the corpus[56]. There is room for improvement in this task by potentially fixing the bug in the search engine code to record frequency of occurrence of grams in each year to calculate the total frequency of each gram. As mentioned, another improvement could be to optimise the gram combination code in order to obtain more useful or interesting results. Overall, I believe Task 2 was successful in achieving its aims to crate a search engine that found regular expressions that could be linked to the Somerton Man code by using the initial letters of the code to find commonly used English phrases. In addition to this, the search time relative to the number of results produced has clearly been largely improved when compared to the web crawler developed in previous years.

Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad[edit]

Aim[edit]

The aim of Task 3 was to use Rubaiyat of Omar Khayyam as a one-time pad to attempt to decode the Somerton Man code to find any meaningful messages after decrypting. This task involved the investigation that the letters had been substituted for others using a one-time pad technique. This group was to use the Somerton Man code to act as the cipher text, the numerical value of letter positions within words, with respect to the first letter of each word to act as the key, and the decoded messages to act as the plaintext of the code. The first letter of each word in the one-time pad had numerical value 0, the second letter had numerical value 1, and so on.

The difference in aim between our group and the 2013 honours group is that the key used to decode the message is based on letter position within each word rather than using numbers assigned to each letter in the alphabet to perform alphabetic shits, as implemented in the 'Decoding Toolkit - One Time Pad' software [57].

Method[edit]

Before commencing Task 3, we had to choose a computer programming language to implement the code. My first attempt was to use Java, because this language was used by all previous groups to implement for their work. I tried to reuse the base code from previous groups and extend it to satisfy our Task 3 requirements, however, the aim of 2013 was significantly different to ours, so there were not too many similar points in the code. On the other hand, I am not good at Java language because I never learned it before. My second attempt was to use the C and C++ language to implement the code. I completed some functions for the code, but there was a problem in loading text files. Because we used the Rubaiyat of Omar Khayyam as a one-time pad, we were trying to translate the Rubaiyat into a text file so that we were able to run the code and search for words in the Rubaiyat. C++ is an old programming language and the process of loading data was too complex. I decided to give up this method. My final attempt was to use MATLAB, it is a relatively new programming language, and loading text files is relatively simple. In addition, I have studied MATLAB in previous university courses and find it easier to use. Finally, I chose MATLAB as the programming language to implement language for Task 3.

The direct substitution of the letters in the pad was to be used. For instance, if we chose a line: MRGOABABD, from the Somerton Man code as the encoded message, the first letter of the code is to be used, which is an M. The program was then to search the Rubaiyat from beginning to end, until it finds the first word that begins with M. This was then to be decoded to the second letter in the same word. This process was then to be repeated for the second letter in the code, being R. It was to find the first word in the Rubaiyat that starts with R, and decodes it to the second letter in that word. This method was then to be repeated until there were too few words long enough to decode all of the letters in the code. After this point, one final decode was to be attempted, where rather than using letter position, the last letter of each word was to be used. The output of the software was to be possible words or phrases made up of letter substitutions in place of the letters in the Somerton Man code.

Design[edit]
Fig. 26: Design Flowchart for Task 3

The Figure 26 shown above demonstrates my design and thinking process for Task 3. We needed 2 inputs for the code: One was the Somerton Man code which is the encoded message, other was the letter position as the key (the range of the letter position is above 0 and starts with integer 1 based on the MATLAB system). The function was to output a letter from the Somerton Man code and a number n from the letter position. Then, a function called matching was then to receive the letter output from the Somerton Man code. It was to search the Rubaiyat from beginning to end to match the first word with the same initial letter. At the same time, a function called finding was to receive the number output by letter position. Afterwards, the function matching was to send the matched word from Rubaiyat to the function 'finding'. This function was then to choose the correct letter in the matched word based on the letter position n. For example, if n equaled to 2, the 'finding' function would choose the second letter of the matched word. Finally, the function would output the chosen letter to a function called 'recordMessage', and this function would record all the letters from 'finding' in order. The program would have now completed decoding for one letter of Somerton Man code, and would then repeat for the rest of the letters in the code. After decoding all letters, the program would output the original message.

One-Time Pad Example[edit]

Figure 27 contains an excerpt form the Rubaiyat of Omar Khayyam and the word AWAKE is the first word in the Rubaiyat. Assuming we have the encoded message AFM and we want to use the second letter position to decode it. The program will search the Rubaiyat from beginning to end, until it finds the first word that begins with A. From Figure 27, the word should be AWAKE. This will then be decoded to the second letter in the same word which is W. The program will repeat step 2 for the rest of the letters in the code. The second word and third word should be FOR and MORNING based on the Rubaiyat of Omar Khayyam. The output should be WOO. Figure 28 shows the output from MATLAB, the function called multi, which shows all possible decoded messages according to letter position n. For example, if n equals 2, the output will display the first original encoded message AFM and the second decoded message WOO as well.

Fig. 27: Example of Task 3 (Rubaiyat Excerpt)
Fig. 28: Example of Task 3 (MATLAB Output)
Verification[edit]

A simple verification method was used to test the MATLAB code to be functioning correctly. One group member encoded the original message GUN using the proposed key with the third letter position, and Rubaiyat of Omar Khayyam as the one-time pad. Based on Figure 29, which is an excerpt from the formatted version of the Rubaiyat of Omar Khayyam, the encoded words used were NIGHT, FLUNG and HUNTER. Taking the first letter of each word produces the letters NFH as the encoded message.

The other group member then took the letters NFH as the code and ran the MATLAB script to test all letter positions until discovering that by using the third letter position as the key, the original message 'GUN' was displayed as the output as expected. This can be see in Figure 30.

Fig. 29: Task 3 Verification (Rubaiyat Excerpt)
Fig. 30: Task 3 Verification (Result)

Results[edit]

Fig. 31: Task 3 Results using Second Letter Position (Unformatted)
Fig. 32: Task 3 Results using Second Letter Position (Formatted)
Fig. 33: Task 3 Results using Third Letter Position (Unformatted)
Fig. 34: Task 3 Results using Third Letter Position (Formatted)
Fig. 35: Task 3 Results using Last Letter Position (Unformatted)
Fig. 36: Task 3 Results using Last Letter Position (Formatted)

Evaluation and Justification[edit]

As can be seen in the Figures in the Results section, despite correctly implementing the newly proposed key method to use the Rubaiyat of Omar Khayyam as a one-time pad through the use of software, no useful results were able to be generated by using all variants of the code and all letter positions as keys. The main conclusion to be drawn from this was that the Rubaiyat of Omar Khayyam was not used as a one-time pad to generate the Somerton Man code with the proposed method of using letter position within words as the key.

Task 4: Statistical Frequency of Letters Reanalysis[edit]

Aim[edit]

Towards the end of the project, a decision was made that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus our efforts on reanalysing the letter frequencies of various European languages. This was decided upon since our initial analysis performed in Task 1 produced inconsistent and varied results. This was brought about due to the limited sample size of the Universal Declaration of Human Rights as a base text, causing the frequency of particular letters to appear 0 times in particular languages. Due to this, these letter frequencies had to be altered by choosing arbitrary numbers for their frequency in order to perform our chi-Squared testing and thus reduced the accuracy and validity of the test's results.

the limited sample size caused the chi-squared values for all languages, including English, to be reasonably large. This caused the resulting calculated p-values to be extremely small numbers, or in most cases 0. Because of this, these chi-squared values were not usable to use p-values to perform our initially proposed hypothesis testing from Task 1.

This caused us to question the validity of the Universal Declaration of Human Rights as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.

Method[edit]

It was decided that for the reanalysis, we would use Project Gutenberg to increase the sample size for as many of the 21 most popular European languages used in Task 1 as possible by collecting novels from the time before the Somerton Man's death. This was chosen to be used as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages. Novels in each language were concatenated and their letter frequencies were determined, until each letter appeared at least once in each language.

The 2013 group’s decoding toolkit and initial letter frequency count code were able to be utilised for this task. The decoding toolkit's 'format texts' function was used to remove all non letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts in order to obtain the data we needed to perform our statistical analysis. All statistical calculations and graphs were generated using Mircosoft Excel.

Initial Validation[edit]

First of all, the same test that was initially run in Task 1 on the statistics obtained from the English translation of the Universal Declaration of Human Rights (with letters with frequency 0 modified to 0.0001) as a baseline check were also run on the new statistics gathered from the novel: The Life of the Spider by J. Henri Fabre[58], used as our English base text found on Project Gutenberg, as a means of comparison between the base texts. The Somerton man code, 44 letter samples from a Thomas Hardy novel acting as an English control [59] as well as a French sample, German sample, and Zapoteco sample from the Universal Declaration of Human Rights were all compared to both sets of data and the results can be seen in Figure 12.

European Language Comparison[edit]

Next, once it was found that the English text from Project Gutenberg provided lower chi-Squared values than the Universal Declaration for all samples in the initial test, the chi-Squared testing on European languages could be commenced. This involved the same procedure as was used in Task 1, but of the top 21 most popular European languages from Task 1, only 12 of the languages were able to be used in the reanalysis due to insufficient usability or availability of texts on Project Gutenberg. The languages used in the analysis can be seen in the graph in Figure 38. The omitted languages included Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukranian, Belarusian and Kazakh. The texts used for this analysis can be seen in Appendix B.

Significance Level Calculation[edit]

The chi-squared and p-values calculated showed that English was the closest language to the Somerton Man code. From this, hypothesis testing could be performed based on the English results. Upon consultation with Prof. Abbott and Dr. Berryman, rather than choosing an arbitrary value of significance level such as the typically used p=0.05, it was decided a significance level could be calculated using the p-value found using real English texts to be used as what we deemed to be an acceptable significance level for which we would confidently be able to say that the most likely language of origin of the Somerton Man code is English. This was achieved by collecting 20 44 letter excerpts from English novels from Project Gutenberg (see Appendix C), performing the chi-squared testing for these samples against the English Project Gutenberg novel used as our English base text, taking an average of the chi-squared values, and from this calculating a p-value. This result was then compared to the results obtained from the English portion of the chi-squared testing performed on the variants of the code, and was plotted as seen in Figure 40.

This same testing was then also run on the English samples and code variants against the original English translation of the Universal Declaration of Human rights as a means of comparison between the two base texts. Significance levels were unable to be calculated using the Universal Declaration of Human Rights since the chi-squared values were too large, causing the calculated p-values to be too small (approaching 0). The results can be seen in Figure 39.

It was unnecessary to extend the analysis to collect benchmarks and perform the hypothesis testing on the other European languages against the code since chi-squared values produced were too large, and so the p-values calculated were unusable.

Increased Sample Size Testing[edit]

It was then decided that in order to increase our confidence in the calculated significance level, we would increase the sample size for our English base text from Project Gutenberg to not only large enough such that each letter appeared at least once, but to concatenate 20 English novels from the time before the Somerton Man's death to be used as our base English Corpus (See Appendix D). It was first confirmed whether this would have an affect on the chi-squared values against the code variants when compared to other languages. We could then also increase our English benchmark sample size by taking 100 44 letter samples from this corpus using code written in Python, and performing the same testing as performed on our smaller English base text. The results from this testing can be seen in Figure 41.

Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation, European Language Comparison and so these graphs have been excluded. A closer look at the changes to the chi-squared and p-values for the Somerton Man code variants caused by this increased sample size can be seen through comparing Figures 40 and 41. Increasing the number of 44 letter English samples from 20 to 100 however, did have an effect on the chi-squared value and p-value calculated to be used as our significance level, the results of which can be seen in Figure 41. This increase in number of samples had very little effect on the graph of the Universal Declaration of Human Rights significance level calculation, and so this has also been omitted.

Results[edit]

Fig. 37: Graph of Initial Validation Chi-Squared Values Comparison between English Declaration and English Gutenberg (Frequency 0 to 0.0001)
Fig. 38: Graph of Average Chi-Squared values from Project Gutenberg Base Texts versus Somerton Man Code Variants
Fig. 39: Graph of Comparison of Chi-Squared Values of 20 English Samples and Code Variants against English Declaration Base Text (Frequency 0 to 0.0001)
Fig. 40: Graphs of Comparison of Chi-Squared and P-Values of 20 English Samples and Code Variants against English Gutenberg Corpus Base Text
'Fig. 41: Graphs of Comparison of Chi-Squared and P-Values of 100 English Samples and Code Variants against English Gutenberg Corpus Base Text

Evaluation and Justification[edit]

The results of the initial validation seen in Figure 37, show that using the Project Gutenberg novel as an English reference text provided lower chi-squared values for all test cases and thus it was deemed to be a more suitable base text than the modified version of the Universal Declaration of Human Rights.

The results from the European Language Comparison in Figure 38, show that English had the lowest chi-squared value when compared to all languages in our Project Gutenberg corpus, and thus was the closest language to the Somerton Man code.

The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure 39) were found to be much higher than those calculated using the English Project Gutenberg novel (Figure 40). The large difference in results, and the fact that real english samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus the validity of this as a base text has been proven to be questionable when used as part of a chi-squared analysis. In addition to this, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, hypothesis testing could not be completed due to the large chi-squared values producing very small p-values for the code variants and significance level (approaching 0).

The results from the initial significance level calculation in Figure 40 shows that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher p-values than the calculated significance level. From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an initialism.

Upon increasing the sample size, the significance level calculation in Figure 41 shows that now only 1 of the 3 code variants achieved a p-value higher than the calculated significance level. This caused the average result to fall below the significance level. Due to this, our statement had to be modified to say that overall the null hypothesis was rejected and alternative hypothesis accepted, meaning that we can not confidently say that the language of origin of the Somerton Man code is english for all variants. Despite this, the null hypothesis could be accepted and English is the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's, 2 W's and is an initialism.

Regardless of the choice to accept or reject the null hypothesis, the similarities in chi-squared and p-values calculated between real 44 letter English Samples and all variants of the Somerton Man code using the Project Gutenberg base text reinforces the notion that the language of origin of the code is indeed English.

Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing hypothesis testing, the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5[60]. This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased. Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.

Project Management - Planning and Feasibility[edit]

Work Breakdown/Deliverables[edit]

The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.

Timeline[edit]

The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure 42.

Fig. 42: Proposed Project Gantt Chart

The final Gantt Chart after all revisions and updates can be seen in Figure 43.

Fig. 43: Final Project Gantt Chart

Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis. Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended. Task 3 was also extended so that Jikai was able to complete this task. Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early. Due to this, Task 4 was completed in parallel with Tasks 2 and 3 towards the end of the project timeline. The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors. Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.

Task Allocation[edit]

The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure 44. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the Rubaiyat of Omar Khayyam as a One-time Pad, and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.

Fig. 44: Table of Task Allocation

Management Strategy[edit]

A number of management strategies were adopted for use throughout the project. One of which was frequent face-to-face contact through regular meetings every 2-3 weeks. Another was regular communication between group members via text message and email. Collaboration is another strategy that was useful, if one member required assistance on a particular task, the other was able to step in and help. This was achieved through the use of flexible task allocation. The group was able to make use of collaborative software including Google Drive for working together on project documents, and Git Hub repository for working together on code for software development. The project Wiki page was updated in real time including the weekly progress section to monitor and review work completed by each member every week, as well as plan tasks for the upcoming week. Finally, the use of a Gantt chart was used as a management strategy to incorporate clearly defined task and goals and established a critical path through use of task dependencies.

Budget[edit]

The project budget for this honours group was set at 500 dollars at the commencement of the project. It was initially proposed for the budget to depend on the n-gram database chosen to be used for the search engine in Task 2. As discussed in the Method section of Task 2: N-Gram Search, a variety of options were considered and the main two largest databases were found to be Microsoft Web N-Gram Services[61], and Google N-Gram [62].

The Microsoft alternative was found to be free to use for academic purposes after applying for a user token, and is stored for free on Microsoft’s web server, hence there was no need to purchase storage upon which to store the database[63].

The Google alternative was available for free when obtaining the raw dataset, or at a cost of 150 dollars for a student license when purchased from the University of Pennsylvania Linguistic Data Consortium [64]. Unlike the Microsoft alternative, if the Google N-Gram option was chosen, a portion of the budget would have had to be dedicated to storing the database. It was initially proposed to store the database on a hard drive at a cost of approximately 100 dollars.

The proposed budget can be seen in the tables highlighting the key costs of each option in Figure 45.

Fig. 45: Proposed Budgeting Table

For reasons discussed in the Method section of Task 2: N-Gram Search, upon deciding to use the Google N-Gram database, a decision was to be made whether to purchase the University of Pennsylvania's Linguistic Data Consortium version or to obtain it for free directly from Google. A decision was made to utilise the free database provided by Google as it was not deemed justifiable to spend $150 on the processed data from the Linguistic Data Consortium since it was proposed that the raw dataset could be cleaned up through writing software.

The initial budget was based on the assumption that the Google N-Gram database could be stored locally, although this was feasibly possible in its compressed form, the local computing power available would have been insufficient to run the search engine code through the database within a the time frame of the project. As discussed in the Method section of Task 2: N-Gram Search, a cloud based computing service called ‘Amazon Elastic Compute Cloud’ was utilised to store and process the database. The free tier was considered but did not provide the specifications required to meet the needs of our task, and so instances on Amazon EC2 were hired at a rate of 0.853 dollars per hour [65]. Upon storing the initial full database, running our search code, and downloading our results generated from the outputs of the code, the total cost of utilising the service came to 576 dollars. This caused our project to exceed the initially proposed budget. The reason for the additional project expenditure was that despite our efforts, it was difficult to predict the precise time that it would take to upload, store and process the database on the cloud service. The initially proposed budget did not include the need or costing for the Amazon server since this was not something that could be reasonably foreseen at the start of the project since it was initially thought that the Microsoft N-Gram Service would be suitable for the needs of the project, and if this was not suitable, that the Google N-gram alternative would be able to be stored locally.

The final revised budget including total project expenditure can be seen in Figure 46.

Fig. 46: Final Budgeting Table

In conclusion, despite going over budget, the additional funds were kindly provided by the school of Electrical and Electronic engineering upon sending an application for funding including justification of our purchases. The project work has benefited through the purchase of the Amazon service since we were able to complete a search of specific n-gram combinations of the code on the full Google N-Gram database. It has provided us with results to present as part of our thesis and allowed us to meet the requirements set out in the aim of Task 2.

Risk Analysis[edit]

A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure [66]. This can be seen in Figure 47. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1. This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis. Another risk that occurred throughout the project was Illness. This was able to be dealt with relatively easily through working from home for a short period of time. The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors. Bugs in code were reduced to the best of our ability through thorough testing and debugging of code. Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood. Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.

Fig. 47: Table of Risk Assessment

Conclusions[edit]

The work undertaken throughout the project has fulfilled the key aims and objectives of the project including statistical analysis of likely language of origin of the Somerton Man code, the design and implementation of software to test the Rubaiyat of Omar Khayyam as a one-time pad in conjunction with a new key technique, and developed a search engine to discover possible n-grams contained within the code. The group was successful in completing all tasks outlined in the proposal, with the exception of the proposed extension task to analyse the mass spectrometer data of the Somerton Man's hair.

Through this, the group was able to critically review and further the statistical analysis of the likely language of origin of the Somerton Man code conducted by previous groups. The 2015 group has improved upon the search for n-grams conducted by previous groups by increasing number of results and search speed. In addition to this, the results collected through implementing the search engine are valuable for future groups to analyse for useful grams that could be linked to the Somerton Man code. The group has also furthered the exploration from previous groups into the possibility that the Rubaiyat of Omar Khayyam was used as a one-time pad to encrypt the Somerton Man code by testing a new key technique.

The skills developed through undertaking this project include text processing and programming in a variety of languages including Java, MATLAB and Python. The group has also thoroughly researched and learnt how to implement and evaluate statistical tehcniques including chi-squared testing, p-value calculation and hypothesis testing and developed skills in using Microsoft Excel software to perform statistical analyses.

The main conclusions drawn from the project work include that the Somerton Man code was not created using the Rubaiyat of Omar Khayam as a one-time pad and the proposed method of using letter position within words as the key. Further analysis is required to obtain meaningful or useful combinations of grams from the results of the n-gram search. The Universal Declaration of Human rights has too small a sample size of words in each language to accurately represent the initial letter frequency in each language for use in chi-squared testing. Finally, although the results from the hypothesis testing were somewhat inconclusive, the results of all of the chi-squared testing have lead to the conclusion that we can now say more confidently than ever that English was the most likely language from which the Somerton Man code was written, assuming it is an initialism.

Despite being unable to decipher the Somerton Man code, the 2015 group has designed and implemented software that has furthered past work into the investigation and provided useful tools and resources to be utilised by future Honours students.

Future Work[edit]

The search engine in Task 2 contained a bug in the code that collected the number of years in which each gram occurred, rather than actual frequency of occurrence in each year. This brings about a potential for reanalysis of the database, in particular the 5-gram data, using raw count through minor modification to the search engine code.

There is also the potential to run a limited search on the search engine results using a more sophisticated code than the one used by the 2015 group. This could be used to generate more useful gram combinations to find commonly used English expressions or phrases that could be linked to the Somerton Man code.

According to previous group’s results and our results by using One-time pad method for task 3. Maybe One-time pad is not an appropriate method to decipher the Somerton Man code and Rubaiyat. Future group may use some new decryption methods instead of using One-time pad for task 3.

Future students could extend the statistical analysis to perform hypothesis testing on all European languages in Project Gutenberg, but an alternate method to the chi-squared testing performed would have to be utilised since the chi-squared values for the code against all other languages were too high to produce usable p-values.

Another option would be to focus on English as the most likely language and statistically analyse the code against genres as conducted by the 2013 group [67], but using the chi-squared method as opposed to the squared difference and standard deviation methods adopted to consolidate or refute the conclusions drawn from their results .

Future groups could also extend the 2013 groups analysis of the mass spectrometer data collected using laser ablation of the Somerton Man's hair [68]. The hair was collected from the plaster bust of the Somerton Man made after his autopsy. The concentration of isotopes found in the hair can be used to find out the environment the Somerton Man lived in in the lead up to his death. Additional data from a separate hair has been taken since the 2013 group's analysis, as well as the concentration of isotopes in the plaster. This could be used by a future group to crosscheck the data between separate hairs, as well as crosscheck the isotopes in the plaster versus the hair to see if the isotopes from the plaster may have diffused into the hair.

Appendices[edit]

  • Appendix A: Full Search Engine Code

File:Search Engine Code.pdf

  • Appendix B: Project Gutenberg European Language Comparison Text References

File:Gutenberg European Language Comparison Text References.pdf

  • Appendix C: Project Gutenberg 20 English 44 Letter Text File References

File:Gutenberg 20 English 44 Letter Text File References.pdf

  • Appendix D: Project Gutenberg English Corpus 20 Novels References.pdf

File:Gutenberg English Corpus 20 Novels References.pdf

References[edit]

  1. J. Fettes. (2013). Professor’s 15-year search for answers seeks to crack the secret code to the death of the ‘Somerton man’ found on an Adelaide Beach [online]. Available: http://www.heraldsun.com.au/news/law-order/portrait-may- hold-key-to-somerton-man-beach-mystery/story-fni0ffnk-1226674957043
  2. The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
  3. The News. (1948, December 1). Dead Man Found Lying on Somerton Beach [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
  4. The Advertiser. (2005, March 9). Death riddle of a man with no name [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser_mar2005.pdf
  5. The Advertiser. (1949, June 9). Cryptic Note on Body [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371152
  6. Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
  7. Cleland. (1949). Coroner's Inquest [online]. Available: http://trove.nla.gov.au/ndp/del/article/130195091
  8. A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Re port_2012
  9. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F
  10. Hub Pages Author. (2014, August 30). The Body on the Beach: The Somerton Man - Taman Shud Case [online]. Available: http://brokenmeadows.hubpages.com/hub/The-Mystery-of-the-Somerton-Man-Taman-Shud-Case
  11. YouTube ABC. Somerton Beach Mystery 1978 [online]. Available: https://www.youtube.com/watch?v=ieczsZRQnu8
  12. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:
  13. K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
  14. S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
  15. A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
  16. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  17. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available:
  18. K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
  19. S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
  20. A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
  21. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  22. The Advertiser. (1949, June 10). Tamam Shud [online]. Available: http://trove.nla.gov.au/ndp/del/article/36371416
  23. N. Gencarelli and J. K. Yang. (2015, March 15). Cipher Cracking 2015 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cipher_Cracking_2015
  24. B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.
  25. B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.
  26. G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.
  27. No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P- value#cite_note-nature506-1
  28. No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/Pvalue# cite_note-nature506-1.
  29. N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000.
  30. N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944
  31. Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml
  32. The British Library Board. Taking Liberties [online]. Available: http://www.bl.uk/onlinegallery/takingliberties/staritems/645universaldeclarationhumanrightspic.html
  33. Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/
  34. Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/
  35. A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166.
  36. No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17
  37. S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/0161 1194.2011.583711#abstract
  38. No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads
  39. No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads
  40. S. L. Center. (2015). What is Scots? [online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F
  41. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester _B_Final_Report_2013_-_Cipher_cracking
  42. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  43. No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.
  44. T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.
  45. No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.
  46. Microsoft Research. (2015). Microsoft Web N-Gram Services [Online]. Available: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
  47. Google Research Blog. (2006, August 3). All Our N-Gram are Belong to You [Online]. Available: http://googleresearch.blogspot.com.au/2006/08/all-our-n-gram-are-belong-to-you.html
  48. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  49. No Author. Microsoft Web N-Gram Service Quick Start [online]. Available: http://weblm.research.microsoft.com/info/QuickStart.htm
  50. No Author. Visual Studio Languages [online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx
  51. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  52. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  53. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  54. Amazon Web Services. (2015). Amazon EC2 Instances [Online]. Available: https://aws.amazon.com/ec2/instance-types/
  55. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  56. Google Books. (2012 July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  57. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  58. J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887
  59. T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.
  60. Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP.
  61. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  62. Google Books. (2012 July). Ngram Viewer [Online]. Available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
  63. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  64. T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
  65. Amazon Web Services. (2015). Amazon EC2 Pricing [Online]. Available: https://aws.amazon.com/ec2/pricing/
  66. No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf
  67. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  68. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking

Glossary and Symbols[edit]

  • ASIO: Australian Security Intelligence Organisation
  • ASIS: Australian Secret Intelligence Service
  • ASD: Australian Signals Directorate
  • P-value: The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis [1].
  • Chi-Squared Test: A 'goodness of fit' statistical test used to compare how closely observed data is related to expected data. [2] [3].
  • Hypothesis Test: The formal procedure used by statisticians to accept or reject statistical hypotheses [4].
  • Universal Declaration of Human Rights: A text that has been translated into over 400 languages [5].
  • Project Gutenberg: A website containing a large number of free ebooks in a wide range of languages [6].
  • N-gram model: The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs [7].
  • One-time pad: The one-time pad is a decoder technology which cannot be cracked if the correct key is used [8].
  • Rubaiyat of Omar Khayyam:
  • Initialism: A group of letters formed using the initial letters of a group of words or a phrase [9].
  • Plaintext: The information of an original message, which is desired to be deciphered from the ciphertext [10].
  • Ciphertext: The encoded format of a message [11].
  • Key: What is needed to convert the ciphertext into the plaintext using the one-time pad [12].
  1. B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010. [25] G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009. [26] No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1
  2. N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000.
  3. N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944
  4. Stat Trek. Hypothesis Tests. [online]. Available: http://stattrek.com/hypothesis-test/hypothesis-testing.aspx
  5. Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml
  6. Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/
  7. A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166. [28] No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17
  8. S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract
  9. No Author. Initialism [online]. Available: http://dictionary.reference.com/browse/initialism
  10. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
  11. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf
  12. No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf