=====Non-Parametric Testing=====
Because statistical methods that assume normally distributed data could not be applied, a number of non-parametric tests were researched, including the Kolmogorov-Smirnov test, the Mann-Whitney U test and the ''chi-squared test''.
All three tests were considered, but the ''chi-squared test'' was ultimately chosen, based on a cryptography reference that uses the ''chi-squared test'' to compare a code to a particular language <ref>No Author. 2014. Using Chi Squared to Crack Codes [online]. Available: http://ibmathsresources.com/2014/06/15/using-chi-squared-to-crack-codes/.</ref> and on advice from supervisor Dr. Berryman.  The base text chosen for testing was the full English translation of the ''Universal Declaration of Human Rights''.  The sample texts were the first 44 letters of the novel ''Tess of the d’Urbervilles'' by Thomas Hardy, used as an English benchmark <ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, and the first 44 letters of the German, French and Zapoteco translations of the ''Universal Declaration of Human Rights''.
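
The extraction of a 44-letter sample can be sketched as follows. This is a minimal illustration rather than the group's actual code: it assumes a 26-letter Latin alphabet, ignores case, and skips any character that is not A-Z after upper-casing (so accented letters are dropped). The function name <code>letter_counts</code> is hypothetical.

```python
from collections import Counter
import string


def letter_counts(text, n_letters=44):
    """Count A-Z occurrences in the first n_letters letters of text.

    Assumption: characters outside A-Z (after upper-casing) are skipped.
    """
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    sample = letters[:n_letters]
    counts = Counter(sample)
    # Return a count for every letter, including those that never appear.
    return {letter: counts.get(letter, 0) for letter in string.ascii_uppercase}


if __name__ == "__main__":
    example = "All human beings are born free and equal in dignity and rights."
    counts = letter_counts(example)
    print(sum(counts.values()))  # number of letters kept in the sample
```

Letters that do not occur in the sample are kept with a count of 0, which matters later when the counts are compared against a reference text.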

======Non-Parametric Testing Initial Results And Evaluation======
Initially, neither the chi-squared value nor the ''p-value'' could be computed using the ''chi-squared test''.  It was soon discovered that if the frequency of a letter in the reference text (a particular language version of the Declaration of Human Rights) was 0, the denominator in the chi-squared equation became 0 and the equation was undefined (see Figure 10).

[[File: Task_1_Chi-squared_Formula.png |thumb|1000px|centre|'''Fig. 10:''' Chi-Squared Formula]]
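
The failure can be reproduced with a short sketch. It assumes that Figure 10 shows the standard Pearson form, in which each letter contributes (observed - expected)² / expected, and that the expected values come from the reference text. The function name <code>chi_squared</code> and the example values are illustrative only.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all letters.

    Assumption: observed and expected are dicts keyed by the same letters.
    """
    total = 0.0
    for letter, obs in observed.items():
        exp = expected[letter]
        # Raises ZeroDivisionError when a letter has an expected value of 0.
        total += (obs - exp) ** 2 / exp
    return total


if __name__ == "__main__":
    print(chi_squared({"A": 3, "B": 1}, {"A": 2.0, "B": 2.0}))
    try:
        chi_squared({"A": 3, "Q": 1}, {"A": 4.0, "Q": 0})
    except ZeroDivisionError:
        print("undefined: a letter has an expected value of 0")
```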

To rectify the issue, an assumption had to be made in the ''chi-squared test'' method.  The count for each letter that appeared 0 times in the reference text was changed to 1, the closest positive integer to 0, in the hope that the change would be too small to skew the results.  The assumption was needed because the reference text was too small for every letter to appear at least once.  After all results had been computed, the group realised that, although the initial results from changing the count from 0 to 1 looked promising, letters that did not appear at all now had the same frequency as letters that actually appeared once, which reduced the accuracy of the data used for the analysis.  This was rectified by instead setting the frequency of letters appearing 0 times to 0.0001: letters occurring once had a frequency of approximately 0.0005, so a value lower than the lowest observed frequency had to be selected.  The results of both methods are compared and contrasted in the following sections.
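
The two adjustments can be sketched as below. This is an illustration under stated assumptions, not the group's implementation: expected values are taken to be the reference frequency multiplied by the sample size, the count-based variant divides by the adjusted total, and the frequency-based variant does not renormalise. Both function names are hypothetical.

```python
def expected_with_count_floor(ref_counts, sample_size):
    """Replace zero counts with 1, then scale frequencies to the sample size."""
    adjusted = {k: (v if v > 0 else 1) for k, v in ref_counts.items()}
    total = sum(adjusted.values())  # assumption: divide by the adjusted total
    return {k: v / total * sample_size for k, v in adjusted.items()}


def expected_with_frequency_floor(ref_counts, sample_size, floor=0.0001):
    """Replace zero frequencies with a small floor value (0.0001 in the text)."""
    total = sum(ref_counts.values())
    freqs = {k: (v / total if v > 0 else floor) for k, v in ref_counts.items()}
    # Assumption: frequencies are not renormalised after applying the floor.
    return {k: f * sample_size for k, f in freqs.items()}


if __name__ == "__main__":
    ref = {"A": 3, "B": 1, "C": 0}  # toy counts, not project data
    by_count = expected_with_count_floor(ref, 10)
    by_freq = expected_with_frequency_floor(ref, 10)
    print(by_count["B"] == by_count["C"])  # unseen letter matches a letter seen once
    print(by_freq["C"] < by_freq["B"])     # unseen letter now stays below it
```

With the toy counts, the count-based adjustment gives the unseen letter the same expected value as a letter seen once, which is the problem described above; the frequency floor keeps the unseen letter below every observed one.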

{|style="margin: 0 auto;"
| [[File: 0_to_1_Validation.png |thumb|400px|centre|'''Fig. 11:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Count from 0 to 1)]]
| [[File: 0_to_0.0001_Validaition.png |thumb|400px|centre|'''Fig. 12:''' Initial Chi-Squared Test Results for English Declaration vs Sample Texts (Frequency from 0 to 0.0001)]]
|}

Both methods produced large chi-squared values in the initial ''chi-squared test'', a consequence of the small sample size of 44 letters (see Figures 11 and 12).  The resulting ''p-values'' were therefore very small: the null hypothesis was rejected in every case, and the ''p-values'' could not be used as a means of comparison.  The test could still be used as a measure of similarity, however, because the smaller the chi-squared value, the more similar the two texts are.  Thus, despite valid ''p-values'' and ''hypothesis testing'' being unavailable, the texts could still be ranked by chi-squared value in the full analysis.  Comparing the results for the count changed from 0 to 1 with those for the frequency changed from 0 to 0.0001 shows that the chi-squared values were reduced overall in the second analysis, except for those calculated for the Somerton Man code.
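
Ranking by chi-squared value, rather than by ''p-value'', can be sketched as follows. The function name <code>rank_by_chi_squared</code> is hypothetical, and the scores are placeholder numbers chosen only to show the ordering; they are not results from this project.

```python
def rank_by_chi_squared(scores):
    """Return (label, value) pairs sorted from most to least similar.

    A smaller chi-squared value is treated as indicating greater similarity.
    """
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    scores = {"text_a": 120.5, "text_b": 48.2, "text_c": 75.0}
    for label, value in rank_by_chi_squared(scores):
        print(label, value)
```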

======Top 20 European Language Comparison======
The top 20 closest European languages by squared difference and by standard deviation, as found by the 2013 Honours group, were then run through the ''chi-squared test'' against the Somerton Man code as in the original police report (5 Ms and 1 W). This test was run to check how similar our results were to the 2013 group’s results. Results under both of the assumptions described above are displayed.

======Top 20 European Language Comparison Results======
[[File: Squared_Difference_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 13:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Count from 0 to 1)]]

[[File: Standard_Deviation_versus_2013_Number.png |thumb|1200px|centre|'''Fig. 14:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Count from 0 to 1)]]

[[File: Squared_Difference_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 15:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Squared Difference (Frequency from 0 to 0.0001)]]

[[File: Standard_Deviation_versus_2013_Frequency.png |thumb|1200px|centre|'''Fig. 16:''' Comparison of Results of Chi-Squared Values of Closest 20 Languages based on 2013 Standard Deviation (Frequency from 0 to 0.0001)]]

======Top 20 European Language Comparison Evaluation======
The results show that the closest language to the Somerton Man code is Scots, followed by English, in all cases. The main conclusion drawn from this comparison was that the chi-squared method appeared to be functioning correctly, so the 2015 group could extend the analysis further.

======Top 20 European Languages based on Estimated Number of Native Speakers======
Extending the 2013 group’s work, the 2015 group conducted a ''chi-squared test'' of the Somerton Man code against the top 20 European languages by estimated number of native speakers <ref>No Author. 2015. List of languages by number of native speakers [online]. Available: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers#cite_note-Nationalencyklopedin-1.</ref>.  The test covered all versions of the Somerton Man code (6 Ms; 4 Ms and 2 Ws; 5 Ms and 1 W), and the average of these results was also plotted.
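
Averaging across the code versions can be sketched as below, assuming the average is a simple arithmetic mean of each language's chi-squared value over the three versions. The function name <code>average_scores</code>, the variant keys and the numbers are placeholders for illustration, not project results.

```python
def average_scores(scores_by_variant):
    """Average each language's chi-squared value across code variants.

    Assumption: every variant holds a score for the same set of languages.
    """
    languages = next(iter(scores_by_variant.values())).keys()
    n = len(scores_by_variant)
    return {
        lang: sum(variant[lang] for variant in scores_by_variant.values()) / n
        for lang in languages
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    scores_by_variant = {
        "6M": {"language_a": 30.0, "language_b": 60.0},
        "4M_2W": {"language_a": 40.0, "language_b": 50.0},
        "5M_1W": {"language_a": 50.0, "language_b": 40.0},
    }
    print(average_scores(scores_by_variant))
```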

======Top 20 European Languages based on Estimated Number of Native Speakers Results======
[[File: European _ Languages _ versus _ Code_1_Number.png |thumb|1000px|centre|'''Fig. 17:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Count 0 to 1)]]

[[File: European _ Languages _ versus _ Code_2_Number.png |thumb|1000px|centre|'''Fig. 18:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Count 0 to 1)]]

[[File: European _ Languages _ versus _ Code_1_Frequency.png |thumb|1000px|centre|'''Fig. 19:''' Top 20 European Languages based on Estimated Number of Speakers with 6Ms in Code (left) and 4Ms and 2Ws in Code (right) (Frequency 0 to 0.0001)]]

[[File: European _ Languages _ versus _ Code_2_Frequency.png |thumb|1000px|centre|'''Fig. 20:''' Top 20 European Languages based on Estimated Number of Speakers with 5Ms and 1W in Code (left) and the Average (right) (Frequency 0 to 0.0001)]]

======Top 20 European Languages based on Estimated Number of Native Speakers Evaluation======
The results show that, when the count was changed from 0 to 1, English was the closest language to the Somerton Man code for two of the three code versions and for their average.  Kurdish, a language spoken in some parts of Turkey, was the closest language to the code version with 6 Ms.  However, considering that Kurdish is not a common language, and based on the average results, English can reasonably be regarded as the closest language to the code using this method.

When the frequency was changed from 0 to 0.0001, the results deviated from those obtained using a count of 1.  English produced a higher chi-squared value and other languages produced lower chi-squared values, so that Kurdish had the lowest chi-squared value in all cases and English became the third closest language in the average results.  This suggests that the ''Universal Declaration of Human Rights'' may not be a suitable base text for use with the chi-squared method, irrespective of the insufficient sample size.

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample======
The English control text (a 44-letter sample of Thomas Hardy’s ''Tess of the d’Urbervilles'') was compared against the top 20 European languages by estimated number of native speakers, to test whether the method could identify an actual English sample text among those languages.  Both assumptions were used, changing the count of letters that appeared 0 times to 1 and changing their frequency to 0.0001; the results can be seen in Figures 21 and 22.

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Results======

{|style="margin: 0 auto;"
| [[File: European _ Languages _ versus _ Thomas_1_Number.png |thumb|1000px|centre|'''Fig. 21:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Count 0 to 1)]]
| [[File: European _ Languages _ versus _ Thomas_1_Frequency.png |thumb|1000px|centre|'''Fig. 22:''' Top 20 European Languages based on Estimated Number of Speakers versus Thomas Hardy Sample (Frequency 0 to 0.0001)]]
|}

======Top 20 European Languages based on Estimated Number of Native Speakers against Thomas Hardy Sample Evaluation======
The results of this test show that English was the closest language to the English sample text in both cases. This result is desirable, as it verifies the ability of the ''chi-squared test'' to pick out a 44-letter English sample from among the top 20 European languages.  This could have been used to support the results obtained from comparing the same 20 languages against the Somerton Man code; however, unlike in that test, changing the method from count 0 to 1 to frequency 0 to 0.0001 had no effect on English being the closest language to the sample text.