
===Task 1: Statistical Frequency Analysis of Letters===
====Aim====
A critical review of the 2013 group’s statistical frequency analysis of letters was to be conducted, after which the 2015 group was to repeat the statistical tests done by the 2013 group. As in the 2013 group’s work, the ''Universal Declaration of Human Rights'' was to be used as the base text on advice from Professor Abbott, but the validity of this text as the base text was to be statistically tested. The group was to find out how common each letter of the Somerton Man code is in each popular European language. The statistical results from this analysis were then to be applied to 44-letter samples taken from pieces of text in the most likely languages. The 2013 group’s analysis was then to be extended by calculating ''p-values'' and implementing ''hypothesis testing''. Like the 2013 group, the 2015 group was also to use Microsoft Excel to compute the statistical analysis and produce output graphs. The group was to run ''p-value'' tests on benchmark pieces of text from the most likely languages to see whether the ''p-values'' suggested that the letters are indeed from those languages. The benchmark texts were also to be used to test the statistical accuracy of the method of analysis as well as the validity of the ''Universal Declaration of Human Rights'' as a base text. Once the most likely language was determined, the group was to process additional benchmarks of that language and obtain a mean ''p-value''. A ''hypothesis test'' could then be performed by comparing this mean ''p-value'' with the ''p-value'' of the letters in the Somerton Man code against the base text. The null hypothesis was to be that ‘The group of letters are from the English language’, and the alternative hypothesis was to be that ‘The group of letters are from another language’. 
The alternative hypothesis could be altered if it was found that the most likely language from the statistical analysis was one other than English. Using this ''hypothesis testing'' method, the 2015 group was hoping to be able to more confidently determine which language the letters in the code are from.

====Method====
=====2013 Statistical Frequency Analysis Review=====
A critical review of the 2013 group’s statistical frequency analysis of letters has been conducted to determine possible extensions to be undertaken by the 2015 group.  The 2013 group based their statistical frequency analysis of letters on the translations and transliterations of the ''Universal Declaration of Human Rights''.  This document was chosen since it has been translated into over 400 languages.  The 2015 group is to analyse this choice of base document using statistical techniques.  Only 266 languages were analysed, since many of the translations were available only as paper scans rather than text files.  This seems a reasonable omission given the time constraints of the project and the fact that the common European languages are all available in text form.  The analysis normalised accented letters to their ‘parent’ characters, for example, treating 'ǎ' as 'a'.  The languages were then analysed using a variety of combinations of the ambiguous letters within the code.  The 2013 group’s statistical analysis was then refined to the top 20 closest European languages.  These included less common European languages such as Scots and Vepsian.  The graphs of the results from the 2013 group’s analysis can be seen in Figure 9.  The most likely language of the Somerton Man code and control text was found to be Scots.  This is unlikely, since Scots is a Scottish dialect rather than a common European language <ref>S. L. Center. (2015). What is Scots? [online]. Available: http://www.scotslanguage.com/What_is_Scots%3F_uid2/What_is_Scots_%3F</ref>.  For this reason, Scots and other less common European languages will be omitted from the 2015 group’s statistical analysis.  Instead, only the most common European languages will be included in order to further refine the statistical process.  Despite Scots ranking first, the 2013 group concluded that the most likely language was English, since Scots and English share very similar initial letters.
In order to test this conclusion, the 2013 group used the English translation of the ''Rubaiyat of Omar Khayyam'' as an English control text.  The first 44 words from the text were used, since this matches the number of letters in the Somerton Man code.  The results showed, once again, that Scots was the most likely language, followed by English (see Figure 9).
The 2013 group chose this control text as a matter of convenience.  This may not have been the best decision, as the ''Rubaiyat of Omar Khayyam'' was originally written in Persian and has since been translated into English.  The use of a translated text as the English control may have skewed the statistical results of the analysis of the control text, as the translation may use uncommon words or expressions.  Uncommon words or expressions are also likely because the text is a book of four-line poems, meaning that a sample as small as 44 words may not accurately represent commonly used words or letters.  The 2015 group is to attempt to counteract this possible skew by using 44 words out of a popular novel originally written in the most likely language.
[[File:2013_Statistical_Language_Analysis_Graphed_Results.png|thumb|700px|centre|'''Fig. 9:''' 2013 Statistical Language Analysis Graphed Results <ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>]]

=====Assumptions=====
Before commencing statistical calculations, a number of initial assumptions were made for this task. One assumption was that the language used as the basis for the code is a European language. This assumption was based on the European appearance of the Somerton Man, results from previous groups concluding that the most likely language is English (see Previous Studies/Related Work section), and advice from Professor Abbott. Another assumption was that the code is an ''initialism'', meaning that it is made up of the first letters of an ordered series of words. This assumption was also based on conclusions made by previous groups (see Previous Studies/Related Work section) and advice from Professor Abbott. Due to this assumption, only the first letters of words in each language in the base text were to be considered for analysis, and not every letter.  Further assumptions were the inclusion of all accented letters in base texts and of all combinations of ambiguous letters in the code. The latter entailed the use of versions of the code including 6Ms, 4Ms and 2Ws, and the code as it appeared in the original police report from 1949<ref>L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking</ref>.  These assumptions were made to increase the robustness of the analysis. A final assumption was the choice not to include the crossed-out line of letters in the code. This choice was made because, from observation, this line is very similar to the third line and is most likely a mistake. 

=====Base Text=====
The 2013 group’s frequency analysis code was obtained. Initially, some background research into Java was undertaken in order to compile and edit the code. The text files of the ''Universal Declaration of Human Rights'' used by the 2013 group for the statistical frequency analysis were also obtained. A selection of these was then processed in Java, using a modified version of the 2013 group’s code, and the results were tabulated in an Excel spreadsheet including the language, the numerical frequency of each letter in each language and the proportional frequency based on the total number of initial letters in each language. These results were then used to test our statistical method of analysis in Microsoft Excel, MATLAB and by hand.

A spreadsheet containing the letter frequency data from the ''Universal Declaration of Human Rights'' was obtained from the 2013 group. These results were verified by processing the original text files through the initial letter frequency Java code. This spreadsheet was used for statistical calculations comparing the letter frequency in the code with that of each language in the ''Universal Declaration of Human Rights''. 
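
The initial letter counting described above can be sketched as follows. This is a minimal Python illustration, not the 2013 group's Java code: the function names are hypothetical, and the accent handling (Unicode decomposition followed by removal of combining marks) is an assumption about how accented letters might be mapped to their ‘parent’ characters.

```python
import unicodedata
from collections import Counter


def normalise_letter(ch):
    """Map an accented letter to its unaccented parent, e.g. 'ǎ' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


def initial_letter_counts(text):
    """Count the first letter of every whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        # Skip leading punctuation or digits so that '"Hello' counts as 'h'.
        for ch in word:
            if ch.isalpha():
                counts[normalise_letter(ch)] += 1
                break
    return counts


def proportions(counts):
    """Convert raw counts to proportions of the total number of initials."""
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


# Opening words of Article 1 of the English Universal Declaration of Human Rights.
sample = "All human beings are born free and equal in dignity and rights."
counts = initial_letter_counts(sample)
shares = proportions(counts)
```

Here `counts` holds the numerical frequency of each initial letter and `shares` the proportional frequency, which correspond to the two kinds of column tabulated in the spreadsheet.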

=====Parametric Testing=====
Firstly, a group of test texts was prepared, using 3 groups of 44 letters from the English, French and German languages from the ''Universal Declaration of Human Rights'', as well as the Somerton Man code. The test texts were analysed using ANOVA in Excel. Once these results were obtained, the ''p-value'' calculation method was tested using various pieces of software. A test case with a known ''p-value'' from the Engineering Maths IIA notes was run through ANOVA in Microsoft Excel, through built-in MATLAB functions found in the Engineering Maths IIA notes, and by hand. All of these methods produced the same ''p-value'' for the test case, and so the ANOVA method was verified to be functioning correctly. 

======Parametric Testing Results======
One-way ANOVA was used in Microsoft Excel to compare letter frequency between texts, as well as two-way ANOVA, where Factor 1 was the letter, Factor 2 was the text and the response was the letter frequency.  Neither of these methods produced valid ''p-values'', as they compare total means and variances.
Paired data analysis using a paired sample t-test was researched and attempted based on the Engineering Maths IIA notes.  This was computed using MATLAB, but also produced unusable ''p-values'': the method uses the mean and standard deviation of the differences between samples to calculate the ''p-value'', so letter and text type information is lost in the process and the method is not applicable.
The ''p-values'' calculated using these methods fluctuated depending on the type of data used.  If the raw letter counts for the sample text or code (44 letters) and the ''Universal Declaration of Human Rights'' reference text (1000-2000 letters depending on language) were used, the ''p-value'' became very small, since the total means of each text were vastly different.  If instead the frequency of each letter as a percentage of the total number of letters in each text was used, the ''p-value'' was 1, since the means became the same.  Thus this method of mean comparison did not work for letter frequency analysis.

======Parametric Testing Evaluation======
None of these methods compared individual data entries from one group to their corresponding entries in another group.  Instead they used the sample size, sum, mean, and variance of each group to compare it to the other group.  Because of this, these methods could not be used when comparing letter frequency between languages.  For example, the frequency of the letter A in one text must be compared with the frequency of the letter A in another text, rather than the frequency of all letters in one text being compared to the frequency of all letters in another text.
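
The behaviour described above can be reproduced with a small sketch. The Python below computes the one-way ANOVA F statistic by hand for two groups; the function name and all data values are hypothetical and chosen only for illustration. When both texts are expressed as proportions over the same letters, each group sums to 1, so the group means are identical, the between-group variation is zero and F is 0, which corresponds to a ''p-value'' of 1 regardless of how different the letter distributions are.

```python
def one_way_anova_f(group_a, group_b):
    """F statistic of a one-way ANOVA for two groups of equal size."""
    n = len(group_a)
    mean_a = sum(group_a) / n
    mean_b = sum(group_b) / n
    grand_mean = (mean_a + mean_b) / 2
    # Between-group sum of squares (1 degree of freedom for two groups).
    ss_between = n * ((mean_a - grand_mean) ** 2 + (mean_b - grand_mean) ** 2)
    # Within-group sum of squares (2n - 2 degrees of freedom).
    ss_within = sum((x - mean_a) ** 2 for x in group_a) + sum(
        (x - mean_b) ** 2 for x in group_b
    )
    return (ss_between / 1) / (ss_within / (2 * n - 2))


# Hypothetical proportions over four letters; very different shapes,
# but both sum to 1, so the group means are equal.
proportions_a = [0.5, 0.25, 0.125, 0.125]
proportions_b = [0.125, 0.125, 0.25, 0.5]
f_proportions = one_way_anova_f(proportions_a, proportions_b)

# Hypothetical raw counts: a 44-letter sample against a 1000-letter
# reference with a similar shape. The means differ because of text size.
counts_sample = [22, 11, 6, 5]
counts_reference = [500, 250, 125, 125]
f_counts = one_way_anova_f(counts_sample, counts_reference)
```

In the first case F is zero even though the two distributions are mirror images of each other, and in the second case F is large even though the two texts have nearly the same shape, which is the opposite of what a language comparison needs.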

=====Non-Parametric Testing=====
Since the statistical methods that assume normally distributed data were not usable, a number of non-parametric tests were researched, including the Kolmogorov-Smirnov test, the Mann-Whitney U test and the ''chi-squared test''.
These 3 tests were considered, but ultimately the ''chi-squared test'' was chosen based on a cryptography reference that uses the ''chi-squared test'' to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref>, as well as advice from supervisor Dr. Berryman.  The base text chosen for testing was the full English translation of the ''Universal Declaration of Human Rights''.  The sample texts chosen were the first 44 letters from the novel ''Tess of the d’Urbervilles'' by Thomas Hardy as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, as well as the first 44 letters from the German, French and Zapoteco translations of the ''Universal Declaration of Human Rights''.

======Non-Parametric Testing Initial Results And Evaluation======
Initially, neither the chi-squared value nor the ''p-value'' could be computed using the ''chi-squared test'' method.  It was soon discovered that, in the calculation of the chi-squared value, a letter frequency of 0 in the reference text (a particular language from the ''Universal Declaration of Human Rights'') caused the denominator of the chi-squared equation to be 0, making the equation invalid (see Figure 10).  

[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 10:''' Chi-Squared Formula]]

An assumption had to be made in the ''chi-squared testing'' method to rectify this issue.  The counts of letters that appeared 0 times in the reference text were altered to 1, the closest positive integer to 0, in the hope that this would not affect the results enough to skew the data.  This assumption was needed because the sample size of the reference text was too small for every letter to appear at least once.  Although the initial results from changing the count from 0 to 1 looked promising, the group realised after computing all results that letters that actually appeared once now had the same frequency as letters that did not appear at all, which reduced the accuracy of the data used for the analysis.  This was rectified by instead altering the frequencies of letters appearing 0 times to 0.0001: letters occurring once had a frequency of approximately 0.0005, so a number lower than the lowest observed frequency had to be selected.  The results of both methods are compared and contrasted in the following sections.
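
The two adjustments can be sketched in Python as below. This is an illustration under stated assumptions rather than the group's Excel workbook: the function names are hypothetical, the statistic is taken to be the usual Pearson form (the sum over letters of the squared difference between observed and expected counts, divided by the expected count), expected counts are assumed to be the reference proportions scaled to the 44-letter sample, and the report does not state whether proportions were renormalised after adjustment.

```python
import string

ALPHABET = string.ascii_lowercase


def expected_counts_count_fix(reference_counts, sample_size):
    """Zero reference counts become 1 before converting to expected counts."""
    adjusted = {l: max(reference_counts.get(l, 0), 1) for l in ALPHABET}
    total = sum(adjusted.values())
    return {l: sample_size * adjusted[l] / total for l in ALPHABET}


def expected_counts_frequency_fix(reference_counts, sample_size, floor=0.0001):
    """Zero reference proportions become `floor` (0.0001 in the text)."""
    total = sum(reference_counts.get(l, 0) for l in ALPHABET)
    expected = {}
    for l in ALPHABET:
        proportion = reference_counts.get(l, 0) / total
        expected[l] = sample_size * (proportion if proportion > 0 else floor)
    return expected


def chi_squared(observed_counts, expected):
    """Pearson statistic: sum over letters of (O - E)^2 / E."""
    return sum(
        (observed_counts.get(l, 0) - expected[l]) ** 2 / expected[l]
        for l in ALPHABET
    )


# Hypothetical reference text: 'x' never appears, 'q' appears once.
reference = {l: 40 for l in ALPHABET}
reference["x"] = 0
reference["q"] = 1
# Hypothetical 44-letter sample.
observed = {"a": 30, "t": 14}

by_count = expected_counts_count_fix(reference, 44)
by_frequency = expected_counts_frequency_fix(reference, 44)
chi_count = chi_squared(observed, by_count)
chi_frequency = chi_squared(observed, by_frequency)
```

Under the count adjustment, `by_count["x"]` equals `by_count["q"]`, which is the loss of accuracy noted above; under the frequency adjustment the absent letter keeps a smaller expected value than the letter that appeared once.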

{|style="margin: 0 auto;"
| [[File: 0_to_1_Validation.png |thumb|400px|centre|'''Fig. 11:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Count from 0 to 1)]]
| [[File: 0_to_0.0001_Validaition.png |thumb|400px|centre|'''Fig. 12:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Frequency from 0 to 0.0001)]]
|}

The initial ''chi-squared test'' using both methods resulted in large chi-squared values, brought about by the small sample size of 44 letters (see Figures 11 and 12).  As a result, the ''p-values'' were too small to be used as a means of comparison, and the null hypothesis was rejected in every case.  However, the test could still be used as a measure of similarity, since the smaller the chi-squared value, the more similar the two texts are.  Thus, despite being unable to obtain valid ''p-values'' and perform ''hypothesis testing'', the texts could still be ranked by chi-squared value in the full analysis.  Comparing the results from the count 0 to 1 method with those from the frequency 0 to 0.0001 method shows that the chi-squared values were reduced overall in the second analysis, except for those calculated for the Somerton Man code.
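
A short sketch of ranking by chi-squared value is given below. The function names are hypothetical and the numbers are placeholders, not the report's results. The critical value assumes 26 letter categories (25 degrees of freedom) and a 5% significance level, for which standard tables give 37.652; the report does not state the degrees of freedom used.

```python
# Upper 5% critical value of the chi-squared distribution with 25 degrees
# of freedom (26 letter categories minus 1), taken from standard tables.
CRITICAL_VALUE_5_PERCENT_DF25 = 37.652


def rank_by_similarity(chi_squared_values):
    """Return (name, value) pairs ordered from most to least similar."""
    return sorted(chi_squared_values.items(), key=lambda item: item[1])


def rejects_null(chi_squared_value, critical_value=CRITICAL_VALUE_5_PERCENT_DF25):
    """True when the statistic exceeds the critical value."""
    return chi_squared_value > critical_value


# Placeholder values for illustration only; they are not the report's results.
example = {"text_a": 120.0, "text_b": 85.5, "text_c": 310.2}
ranking = rank_by_similarity(example)
all_rejected = all(rejects_null(value) for value in example.values())
```

Even when every comparison exceeds the critical value, so that a hypothesis test cannot separate the texts, the ordering in `ranking` still identifies which text is closest to the reference.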

======Top 20 European Language Comparison======
The top 20 closest European languages by squared difference and standard deviation from the 2013 Honours group were then run through the ''chi-squared test'' against the Somerton Man code as it appears in the original police report (5 Ms and 1 W). This test was run to check how similar our results were to the 2013 group’s results. The results under both assumptions for letters appearing 0 times are displayed.

======Top 20 European Language Comparison Results======
[[File: Squared_Difference_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 13:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Count from 0 to 1)]]

[[File: Standard_Deviation_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 14:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Count from 0 to 1)]]

[[File: Squared_Difference_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 15:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Frequency from 0 to 0.0001)]]

[[File: Standard_Deviation_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 16:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Frequency from 0 to 0.0001)]]

======Top 20 European Language Comparison Evaluation======
The results show that the two closest languages to the Somerton Man code are Scots, followed by English in all cases. The main conclusion to draw from the results of this comparison was that the Chi-Squared method appeared to be functioning correctly, and so the 2015 group could now further extend the analysis.

======Top 20 European Languages based on Estimated Number of Native Speakers======
In extension to the 2013 group’s work, the 2015 group conducted a ''chi-squared test'' of the Somerton Man code against the top 20 European languages based on the estimated number of native speakers <ref>No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.</ref>.  This test included all versions of the Somerton Man code, namely the versions with 6 Ms, with 4 Ms and 2 Ws, and with 5 Ms and 1 W, and the average of these results was also plotted.
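
The averaging across code versions can be sketched as follows. The function name, the language labels and the numbers are all placeholders for illustration and are not the report's results; a simple arithmetic mean across the three versions is assumed.

```python
from statistics import mean


def average_over_versions(results_by_version):
    """Average each language's chi-squared value across code versions."""
    versions = list(results_by_version.values())
    # Only keep languages that were scored for every version.
    languages = set(versions[0]).intersection(*versions[1:])
    return {
        language: mean(v[language] for v in versions)
        for language in sorted(languages)
    }


# Placeholder chi-squared values for illustration only, not the report's results.
results = {
    "6 Ms": {"language_a": 90.0, "language_b": 60.0},
    "4 Ms and 2 Ws": {"language_a": 70.0, "language_b": 80.0},
    "5 Ms and 1 W": {"language_a": 80.0, "language_b": 70.0},
}
averages = average_over_versions(results)
closest = min(averages, key=averages.get)
```

Averaging in this way reduces the influence of any single reading of the ambiguous letters on which language is ranked closest.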

======Top 20 European Languages based on Estimated Number of Native Speakers Results======
[[File: European _ Languages _ versus _ Code_1_Number.png |thumb|1000px|centre|'''Fig. 17:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Count 0 to 1)]]

[[File: European _ Languages _ versus _ Code_2_Number.png |thumb|1000px|centre|'''Fig. 18:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Count 0 to 1)]]

[[File: European _ Languages _ versus _ Code_1_Frequency.png |thumb|1000px|centre|'''Fig. 19:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Frequency 0 to 0.0001)]]

[[File: European _ Languages _ versus _ Code_2_Frequency.png |thumb|1000px|centre|'''Fig. 20:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Frequency 0 to 0.0001)]]

======Top 20 European Languages based on Estimated Number of Native Speakers Evaluation======
The results show that when changing the count from 0 to 1, English was the closest language to the Somerton Man code for two of the three code versions and for their average.  Kurdish, a language spoken in some parts of Turkey, was found to be the closest language to the code version with 6 Ms.  However, considering that Kurdish is not a common language, and based on the average results, it is reasonable to conclude that English was the closest language to the code using this method.

When changing the frequency from 0 to 0.0001, the results deviated from those obtained using a count of 1.  English produced a higher chi-squared value and other languages produced lower chi-squared values, so that Kurdish had the lowest chi-squared value in all cases and English became the third closest language in the average results.  This may be evidence that the ''Universal Declaration of Human Rights'' is not a suitable base text for use with the chi-squared method, irrespective of insufficient sample size.

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample======
The English control text (a 44 letter sample of Thomas Hardy’s ''Tess of the d’Urbervilles'') was compared against the top 20 European languages based on estimated number of native speakers in order to test whether the method could identify an actual English sample text among those languages.  Both assumptions, altering the count of letters that appeared 0 times to 1 and altering their frequency to 0.0001, were used, and the results can be seen in Figures 21 and 22.

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Results======

{|style="margin: 0 auto;"
| [[File: European _ Languages _ versus _ Thomas_1_Number.png |thumb|1000px|centre|'''Fig. 21:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Count 0 to 1)]]
| [[File: European _ Languages _ versus _ Thomas_1_Frequency.png |thumb|1000px|centre|'''Fig. 22:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Frequency 0 to 0.0001)]]
|}

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Evaluation======
The results of this test show that English was the closest language to the English sample text in both cases. This result is desirable as it verified the ability of the ''chi-squared test'' to distinguish a 44 letter English sample out of the top 20 European languages.  This could have been used to back up the results obtained from comparing the same 20 languages against the Somerton Man code; however, unlike in that test, changing the method from count 0 to 1 to frequency 0 to 0.0001 had no effect on English being the closest language to the sample text.

====Evaluation and Justification====
The original proposal suggested that the group repeat the statistical analysis from the 2013 group and use benchmark texts to statistically assess the accuracy of the method as well as the validity of the ''Universal Declaration of Human Rights'' as a base text. The group was then to extend the analysis by calculating the ''p-values'' for the Somerton Man code when compared to the most common European languages and perform ''hypothesis testing'' based on the results. The 2015 group’s statistical analysis achieved almost all of its proposed goals. A slight divergence from the initially proposed method was decided upon once it was found that ''p-values'' useful for comparison or ''hypothesis testing'' could not be obtained using any attempted statistical method. Instead, the texts were ranked using their calculated chi-squared values. All assumptions outlined in the proposal were followed, with the addition of the modification of the count and frequency data to account for the small sample size of the base text.

Previous results using the ''Universal Declaration of Human Rights'' as a base text were confirmed when letters with count 0 were adjusted to 1.  They were not consistently confirmed when letters with frequency 0 were adjusted to 0.0001.  These inconclusive results left more to be desired from the analysis, and the validity of the ''Universal Declaration of Human Rights'' as a base text was found to be questionable due to its limited sample size.  English was found to have the lowest chi-squared value for a number of the calculations, suggesting that the Somerton Man code is most likely formed from the English language; however, since a reasonable ''p-value'' could not be obtained for any language, there was still potential for reanalysis.