
===Task 4: Statistical Frequency of Letters Reanalysis===
====Aim====
Towards the end of the project, we decided that for Task 4, rather than analysing the mass spectrometer data from the Somerton Man's hair, we would focus on reanalysing the letter frequencies of various European languages.  This was because the initial analysis performed in Task 1 had produced inconsistent and varied results.  The cause was the limited sample size of the ''Universal Declaration of Human Rights'' as a base text, which left particular letters with a frequency of 0 in particular languages.  These zero frequencies had to be replaced with arbitrarily chosen values in order to perform our ''chi-squared testing'', which reduced the accuracy and validity of the test's results.

The limited sample size also caused the chi-squared values for all languages, including English, to be large.  As a result, the calculated ''p-values'' were extremely small, or in most cases 0, and so could not be used to perform the ''hypothesis testing'' initially proposed in Task 1.

This caused us to question the validity of the ''Universal Declaration of Human Rights'' as a base text and so we sought to increase our sample size using alternate base texts and extend our original statistical analysis.  

====Method====
For the reanalysis, we decided to use ''Project Gutenberg'' to increase the sample size for as many as possible of the 21 most popular European languages used in Task 1, by collecting novels from the time before the Somerton Man's death.  These novels were chosen as our base corpus in an attempt to obtain a more accurate representation of the initial letter frequencies of words in these languages.  Novels in each language were concatenated and their letter frequencies determined until each letter appeared at least once in each language.

The 2013 group’s decoding toolkit and initial letter frequency count code were reused for this task.  The decoding toolkit's 'format texts' function was used to remove all non-letter characters and symbols as well as punctuation and accented letters, and the initial letter frequency counter was run on all of our base and benchmark sample texts to obtain the data needed for our statistical analysis.  All statistical calculations and graphs were produced using Microsoft Excel.
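The counting step described above can be sketched in Python as follows. This is only an illustration, not the 2013 group's toolkit: the function names <code>format_text</code> and <code>initial_letter_frequencies</code> are hypothetical, and the sketch assumes that accented letters are simply dropped along with every other non A-Z character.

```python
import re
from collections import Counter
from string import ascii_uppercase


def format_text(text):
    """Hypothetical stand-in for the toolkit's 'format texts' step.

    Assumption: every character other than an unaccented letter or
    whitespace (digits, punctuation, symbols, accented letters) is dropped.
    """
    return re.sub(r"[^A-Za-z\s]", "", text).upper()


def initial_letter_frequencies(text):
    """Return the relative frequency of each letter A-Z as a word-initial letter."""
    words = format_text(text).split()
    if not words:
        raise ValueError("text contains no words after formatting")
    counts = Counter(word[0] for word in words)
    total = sum(counts.values())
    # Letters that never start a word get frequency 0.0.
    return {letter: counts[letter] / total for letter in ascii_uppercase}


if __name__ == "__main__":
    freqs = initial_letter_frequencies("The quick brown fox, the lazy dog!")
    print(freqs["T"], freqs["Z"])
```

Concatenating further novels and re-running the count until no letter has frequency 0 would mirror the stopping rule described in this section.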

=====Initial Validation=====
First, the baseline check initially run in Task 1 on the statistics obtained from the English translation of the ''Universal Declaration of Human Rights'' (with letter frequencies of 0 modified to 0.0001) was also run on new statistics gathered from the novel ''The Life of the Spider'' by J. Henri Fabre<ref>J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887</ref>, which was found on ''Project Gutenberg'' and used as our English base text, to allow a comparison between the two base texts.  The Somerton Man code, 44-letter samples from a Thomas Hardy novel acting as an English control<ref>T. Hardy. (2005). Tess of the d’Urbervilles (11th Edition) [Online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, and a French sample, a German sample and a Zapoteco sample from the ''Universal Declaration of Human Rights'' were all compared to both sets of data.  The results can be seen in Figure 37.

=====European Language Comparison=====
Once it was found that the English text from ''Project Gutenberg'' provided lower chi-squared values than the Universal Declaration for all samples in the initial test, the ''chi-squared testing'' on European languages could commence.  This followed the same procedure as Task 1, but of the 21 most popular European languages from Task 1, only 12 could be used in the reanalysis due to insufficient usability or availability of texts on ''Project Gutenberg''.  The languages used in the analysis can be seen in the graph in Figure 38.  The omitted languages were Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukrainian, Belarusian and Kazakh.  The texts used for this analysis can be seen in Appendix B.
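For reference, the chi-squared statistic used in this kind of comparison can be sketched as below. The calculations in this project were carried out in Excel, so this Python version is only an illustration. The function name <code>chi_squared</code> is hypothetical, and the sketch assumes that the 0.0001 replacement value for zero frequencies is applied as a proportion.

```python
from collections import Counter
from string import ascii_uppercase


def chi_squared(sample, base_freqs, zero_floor=0.0001):
    """Pearson chi-squared statistic of a letter sample against base frequencies.

    sample      -- string of letters, e.g. the 44 letters of the code
    base_freqs  -- dict mapping 'A'..'Z' to relative frequencies in the base text
    zero_floor  -- assumed replacement for a base frequency of 0
    """
    observed = Counter(sample.upper())
    n = sum(observed[letter] for letter in ascii_uppercase)
    statistic = 0.0
    for letter in ascii_uppercase:
        # A zero (or missing) base frequency would give a zero expected count
        # and a division by zero, so it is replaced by the floor value.
        p = base_freqs.get(letter, 0.0) or zero_floor
        expected = n * p
        statistic += (observed[letter] - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    uniform = {letter: 1 / 26 for letter in ascii_uppercase}
    print(chi_squared("AAAA", uniform))
```

A lower value indicates that the sample's letter counts sit closer to those expected from the base text, which is how the languages are ranked in this section.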

=====Significance Level Calculation=====
The chi-squared and ''p-values'' calculated showed that English was the closest language to the Somerton Man code.  From this, ''hypothesis testing'' could be performed based on the English results.  Upon consultation with Prof. Abbott and Dr. Berryman, it was decided that rather than choosing an arbitrary significance level such as the typically used p=0.05, we would calculate one from the ''p-value'' obtained using real English texts.  We deemed this an acceptable significance level at which we could confidently say that the most likely language of origin of the Somerton Man code is English.  This was achieved by collecting 20 excerpts of 44 letters each from English novels on ''Project Gutenberg'' (see Appendix C), performing the ''chi-squared testing'' for these samples against the English ''Project Gutenberg'' novel used as our English base text, taking the average of the chi-squared values, and calculating a ''p-value'' from this average.  This result was then compared to the results obtained from the English portion of the ''chi-squared testing'' performed on the variants of the code, and was plotted as seen in Figure 40.

The same testing was then also run on the English samples and code variants against the original English translation of the ''Universal Declaration of Human Rights'' as a means of comparison between the two base texts.  Significance levels could not be calculated using the ''Universal Declaration of Human Rights'' since the chi-squared values were too large, causing the calculated ''p-values'' to be too small (approaching 0).  The results can be seen in Figure 39.

It was unnecessary to extend the analysis to collect benchmarks and perform the ''hypothesis testing'' on the other European languages against the code, since the chi-squared values produced were too large and so the calculated ''p-values'' were unusable.
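The step from an average chi-squared value to a ''p-value'' can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Excel workflow actually used: the function names are hypothetical, and the default of 25 degrees of freedom (26 letters minus 1) is an assumption, since the degrees of freedom used are not stated here. The upper-tail probability is computed from the standard closed forms for integer degrees of freedom, using only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-squared distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer power terms.
    total = math.erfc(math.sqrt(half))
    for i in range((df - 1) // 2):
        total += math.exp(-half) * half ** (i + 0.5) / math.gamma(i + 1.5)
    return total


def significance_level(benchmark_chi2_values, df=25):
    """p-value of the mean chi-squared value of the English benchmark samples.

    df=25 is an assumption (26 letter categories minus 1).
    """
    mean_chi2 = sum(benchmark_chi2_values) / len(benchmark_chi2_values)
    return chi2_sf(mean_chi2, df)


if __name__ == "__main__":
    # Placeholder chi-squared values, not results from this project.
    print(significance_level([30.0, 35.0, 40.0]))
```

A code variant whose own ''p-value'' lies above the value returned by <code>significance_level</code> would then be treated as consistent with real English samples.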

=====Increased Sample Size Testing=====
To increase our confidence in the calculated significance level, we then decided to increase the sample size of our English base text from ''Project Gutenberg''.  Rather than using a text only large enough for each letter to appear at least once, we concatenated 20 English novels from the time before the Somerton Man's death to form our base English corpus (see Appendix D).  We first checked whether this would have an effect on the chi-squared values against the code variants when compared to other languages.  We then also increased our English benchmark sample size by taking 100 samples of 44 letters each from this corpus using code written in Python, and performed the same testing as on our smaller English base text.  The results from this testing can be seen in Figure 41.

Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation and European Language Comparison, and so these graphs have been excluded.  A closer look at the changes to the chi-squared and ''p-values'' for the Somerton Man code variants caused by this increased sample size can be obtained by comparing Figures 40 and 41.  Increasing the number of 44-letter English samples from 20 to 100, however, did have an effect on the chi-squared value and ''p-value'' calculated for use as our significance level, the results of which can be seen in Figure 41.  This increase in the number of samples had very little effect on the graph of the ''Universal Declaration of Human Rights'' significance level calculation, and so this has also been omitted.
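A minimal sketch of how such benchmark samples could be drawn is given below. It is not the code written for this project: the function name <code>draw_samples</code> is hypothetical, and the sketch assumes that each sample is a contiguous run of 44 word-initial letters starting at a random position in the corpus, which is one plausible sampling scheme among several.

```python
import random


def draw_samples(initials, n_samples=100, length=44, seed=None):
    """Draw contiguous runs of `length` letters from a string of initial letters.

    initials -- string of word-initial letters extracted from the corpus
    seed     -- optional seed so that the same samples can be reproduced
    """
    if len(initials) < length:
        raise ValueError("corpus is shorter than one sample")
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Choose a start position that leaves room for a full-length sample.
        start = rng.randrange(len(initials) - length + 1)
        samples.append(initials[start:start + length])
    return samples


if __name__ == "__main__":
    toy_corpus = "TQBFJOTLD" * 100  # placeholder letters, not real corpus data
    samples = draw_samples(toy_corpus, seed=1)
    print(len(samples), len(samples[0]))
```

Samples drawn this way may overlap. Drawing non-overlapping runs would be an equally reasonable choice if independence between samples is a concern.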

====Results====
{|style="margin: 0 auto;"
| [[File:Gutenberg_vs_Declaration_against_Samples.png|thumb|500px|centre|'''Fig. 37:''' Graph of Initial Validation Chi-Squared Values Comparison between English Declaration and English Gutenberg (Frequency 0 to 0.0001) ]]
| [[File:Gutenberg_vs_Code.png|thumb|500px|centre|'''Fig. 38:''' Graph of Average Chi-Squared values from ''Project Gutenberg'' Base Texts versus Somerton Man Code Variants]]
|}

[[File:20_English_Samples_Declaration_Significance.png|thumb|500px|centre|'''Fig. 39:''' Graph of Comparison of Chi-Squared Values of 20 English Samples and Code Variants against English Declaration Base Text (Frequency 0 to 0.0001) ]]

[[File:20_English_Samples_Significance.png|thumb|800px|centre|'''Fig. 40:''' Graphs of Comparison of Chi-Squared and P-Values of 20 English Samples and Code Variants against English Gutenberg Corpus Base Text]]

[[File:100_English_Samples_Significance.png|thumb|800px|centre|'''Fig. 41:''' Graphs of Comparison of Chi-Squared and P-Values of 100 English Samples and Code Variants against English Gutenberg Corpus Base Text]]

====Evaluation and Justification====
The results of the initial validation seen in Figure 37 show that using the ''Project Gutenberg'' novel as an English reference text provided lower chi-squared values for all test cases, and thus it was deemed to be a more suitable base text than the modified version of the ''Universal Declaration of Human Rights''.

The results from the European Language Comparison in Figure 38 show that English had the lowest chi-squared value of all languages in our ''Project Gutenberg'' corpus, and thus was the closest language to the Somerton Man code.

The chi-squared values calculated using the English translation of the Declaration of Human Rights (Figure 39) were found to be much higher than those calculated using the English ''Project Gutenberg'' novel (Figure 40).  The large difference in results, and the fact that real English samples obtained such high chi-squared values, show that the Declaration may not provide an accurate representation of letter frequencies in the English language, and thus its validity as a base text for a chi-squared analysis has been shown to be questionable.  In addition, despite the chi-squared values calculated using the Somerton Man code variants being much lower in all cases, ''hypothesis testing'' could not be completed because the large chi-squared values produced very small ''p-values'' (approaching 0) for the code variants and significance level.

The results from the initial significance level calculation in Figure 40 show that 2 of the 3 Somerton Man code variants, and thus the average result, achieved higher ''p-values'' than the calculated significance level.  From this we could deduce that our preliminary results showed that our null hypothesis was accepted and that English is the most likely language of origin of the code, assuming that it is an ''initialism''.

Upon increasing the sample size, the significance level calculation in Figure 41 shows that only 1 of the 3 code variants now achieved a ''p-value'' higher than the calculated significance level.  This caused the average result to fall below the significance level.  Due to this, our statement had to be modified to say that overall the null hypothesis was rejected and the alternative hypothesis accepted, meaning that we cannot confidently say that the language of origin of the Somerton Man code is English for all variants.  Despite this, the null hypothesis could be accepted, and English taken as the most likely language of origin of the Somerton Man code, assuming that it contains 4 M's and 2 W's and is an ''initialism''.

Regardless of the choice to accept or reject the null hypothesis, the similarities in the chi-squared and ''p-values'' calculated between real 44-letter English samples and all variants of the Somerton Man code using the ''Project Gutenberg'' base text reinforce the notion that the language of origin of the code is indeed English.

Although we were able to find a base text with frequency greater than 0 for each letter, suitable for reanalysis and performing ''hypothesis testing'', the chi-squared method used was still not entirely mathematically accurate since the expected value of the number of sample observations for each letter in the code should have been a minimum of 5<ref>Stat Trek. (2015). Chi-Square Goodness of Fit Test [Online]. Available: http://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP.</ref>.  This was unavoidable since we had limited letter frequencies provided by the Somerton Man code and thus this sample size could not be increased.  Since this was constant when comparing across all languages, the method was still able to be used as a means of comparing the 'goodness of fit' of letters in each language.
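As a worked illustration of this constraint (in notation introduced here only for the example, with n the number of letters in the code and p_i the base-text frequency of letter i), the minimum expected count of 5 translates into a minimum letter frequency:

```latex
E_i = n\,p_i \ge 5
\quad\Longrightarrow\quad
p_i \ge \frac{5}{n} = \frac{5}{44} \approx 0.114
```

In other words, with only 44 letters available, a letter would need to account for at least roughly 11.4% of initial letters in the base text for its expected count to reach 5.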