
====Method====
For the reanalysis, we used ''Project Gutenberg'' to increase the sample size for as many as possible of the 21 most popular European languages used in Task 1, collecting novels from the time before the Somerton Man's death. These novels served as our base corpus, with the aim of obtaining a more accurate representation of the initial letter frequencies of words in these languages. For each language, novels were concatenated and their letter frequencies determined until each letter appeared at least once.

The 2013 group’s decoding toolkit and initial letter frequency counting code were reused for this task. The decoding toolkit's 'format texts' function was used to remove all non-letter characters, including symbols, punctuation and accented letters. The initial letter frequency counter was then run on all of our base and benchmark sample texts to obtain the data needed for our statistical analysis. All statistical calculations and graphs were produced using Microsoft Excel.
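As a rough illustration of the formatting and counting steps described above, the following Python sketch strips non-letter characters and tallies the initial letters of words. It is not the 2013 group's toolkit: the function names <code>format_text</code>, <code>initial_letter_counts</code> and <code>missing_initials</code> are hypothetical, and dropping accented letters outright (rather than transliterating them) is an assumption based on the description of the 'format texts' function. The sketch also assumes that "each letter appeared at least once" refers to initial letters.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def format_text(raw):
    """Keep only unaccented A-Z letters and whitespace, in upper case.

    Hypothetical stand-in for the toolkit's 'format texts' function:
    digits, punctuation, symbols and accented letters are all dropped.
    """
    kept = []
    for ch in raw:
        if ch in string.ascii_letters:
            kept.append(ch.upper())
        elif ch.isspace():
            kept.append(ch)
    return "".join(kept)


def initial_letter_counts(text):
    """Count how often each letter A-Z begins a word in the text."""
    words = format_text(text).split()
    counts = Counter(word[0] for word in words)
    return {letter: counts.get(letter, 0) for letter in ALPHABET}


def initial_letter_frequencies(text):
    """Return initial letter frequencies as proportions that sum to 1."""
    counts = initial_letter_counts(text)
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def missing_initials(text):
    """List the letters that never begin a word (assumed coverage check)."""
    counts = initial_letter_counts(text)
    return [letter for letter in ALPHABET if counts[letter] == 0]
```

Under these assumptions, novels for a language would be concatenated until <code>missing_initials</code> returns an empty list.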

=====Initial Validation=====
First, as a baseline check, the test initially run in Task 1 on the statistics obtained from the English translation of the ''Universal Declaration of Human Rights'' (with letter frequencies of 0 replaced by 0.0001) was repeated on new statistics gathered from the novel ''The Life of the Spider'' by J. Henri Fabre<ref>J. H. Fabre. (2005, March 22). The Life of the Spider [Online]. Available: https://www.gutenberg.org/ebooks/1887</ref>, which served as our English base text from ''Project Gutenberg''. This allowed a comparison between the two base texts. The Somerton Man code, 44-letter samples from a Thomas Hardy novel acting as an English control<ref>T Hardy. 2005. Tess of the d’Urbervilles (11th Edition) [online]. Available: https://ia801409.us.archive.org/24/items/tessofthedurberv00110gut/110-8.txt.</ref>, and French, German and Zapoteco samples from the ''Universal Declaration of Human Rights'' were all compared against both sets of data. The results can be seen in Figure 12.
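The comparison of a 44-letter sample against a base text can be sketched as follows. This is a minimal illustration only, assuming the standard Pearson form of the chi-squared statistic and assuming that the base frequencies are proportions summing to 1; the report does not state whether the 0.0001 replacement value was a proportion or a percentage. The function name <code>chi_squared</code> is hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase
ZERO_FLOOR = 0.0001  # replacement for zero frequencies (units are an assumption)


def chi_squared(sample_letters, base_freqs, floor=ZERO_FLOOR):
    """Pearson chi-squared statistic of a letter sample against base frequencies.

    sample_letters: string of upper-case letters, e.g. the 44-letter code.
    base_freqs: dict mapping letters to expected proportions in the base text.
    """
    n = len(sample_letters)
    statistic = 0.0
    for letter in ALPHABET:
        observed = sample_letters.count(letter)
        proportion = base_freqs.get(letter, 0.0)
        if proportion <= 0.0:
            proportion = floor  # avoid dividing by zero for unseen letters
        expected = proportion * n
        statistic += (observed - expected) ** 2 / expected
    return statistic
```

Under these assumptions, a single sample letter that never occurs as an initial in the base text contributes a very large term, because its expected count is close to zero.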

=====European Language Comparison=====
Once it was found that the English text from ''Project Gutenberg'' produced lower chi-squared values than the ''Universal Declaration of Human Rights'' for all samples in the initial test, the ''chi-squared testing'' on European languages could commence. This followed the same procedure as Task 1. However, of the 21 most popular European languages from Task 1, only 12 could be used in the reanalysis, because texts for the remaining languages were either unavailable or unusable on ''Project Gutenberg''. The languages used in the analysis can be seen in the graph in Figure 38. The omitted languages were Greek, Russian, Serbian, Kurdish, Uzbek, Turkish, Ukrainian, Belarusian and Kazakh. The texts used for this analysis can be seen in Appendix B.

=====Significance Level Calculation=====
The chi-squared values and ''p-values'' calculated showed that English was the closest language to the Somerton Man code. ''Hypothesis testing'' could therefore be performed based on the English results. After consultation with Prof. Abbott and Dr. Berryman, it was decided that, rather than choosing an arbitrary significance level such as the commonly used p=0.05, a significance level would be calculated from the ''p-value'' obtained for real English texts. We deemed this an acceptable level at which we could confidently say that the most likely language of origin of the Somerton Man code is English. To do this, we collected 20 excerpts of 44 letters each from English novels on ''Project Gutenberg'' (see Appendix C), performed the ''chi-squared testing'' for these samples against the English ''Project Gutenberg'' novel used as our English base text, took the average of the chi-squared values, and calculated a ''p-value'' from this average. This result was then compared with the results from the English portion of the ''chi-squared testing'' performed on the variants of the code, and plotted as seen in Figure 40.
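The step of averaging the benchmark chi-squared values and converting the average to a ''p-value'' can be sketched in Python as below. The report's calculations were done in Microsoft Excel, so this is only an illustrative equivalent. The choice of 25 degrees of freedom (26 letter categories minus one) is an assumption, as the report does not state the value used, and the names <code>chi2_sf</code> and <code>significance_level</code> are hypothetical. The upper-tail probability uses the standard closed-form series for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    # Standard normal density evaluated at chi.
    z = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    if df % 2 == 1:
        # Odd df: 2Q(chi) + 2Z(chi) * sum of chi^(2r-1) / (1*3*...*(2r-1)).
        total = math.erfc(chi / math.sqrt(2.0))
        term = chi
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += 2.0 * z * term
        return total
    # Even df: exp(-x/2) * (1 + sum of x^r / (2*4*...*(2r))).
    total = 1.0
    term = 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        total += term
    return math.sqrt(2.0 * math.pi) * z * total


def significance_level(chi2_values, df=25):
    """p-value of the mean benchmark chi-squared value (df=25 is an assumption)."""
    mean_chi2 = sum(chi2_values) / len(chi2_values)
    return chi2_sf(mean_chi2, df)
```

Here <code>chi2_values</code> would hold the chi-squared values of the benchmark English samples against the English base text.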

The same testing was then run on the English samples and code variants against the original English translation of the ''Universal Declaration of Human Rights'' as a means of comparison between the two base texts. Significance levels could not be calculated using the ''Universal Declaration of Human Rights'', since the chi-squared values were too large, causing the calculated ''p-values'' to be too small (approaching 0). The results can be seen in Figure 39.

It was not necessary to collect benchmarks and perform the ''hypothesis testing'' for the other European languages against the code, since the chi-squared values produced were too large and the resulting ''p-values'' were therefore unusable.

=====Increased Sample Size Testing=====
It was then decided that, to increase our confidence in the calculated significance level, we would enlarge our English base text from ''Project Gutenberg''. Rather than using a text only large enough for each letter to appear at least once, we concatenated 20 English novels from the time before the Somerton Man's death to form our base English corpus (see Appendix D). We first checked whether this would have an effect on the chi-squared values against the code variants when compared to other languages. We then also increased our English benchmark sample size by taking 100 samples of 44 letters each from this corpus, using code written in Python, and performing the same testing as on our smaller English base text. The results from this testing can be seen in Figure 41.
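The sampling step might look like the following Python sketch. It is not the code used in the project. It assumes that a 44-letter sample means the initial letters of 44 consecutive words, and that samples start at random positions in the corpus and may overlap; the report does not specify either point. The names <code>word_initials</code> and <code>draw_samples</code> are hypothetical.

```python
import random
import string


def word_initials(text):
    """Return the first unaccented A-Z letter of each word, in upper case."""
    initials = []
    for word in text.split():
        letters = [ch.upper() for ch in word if ch in string.ascii_letters]
        if letters:
            initials.append(letters[0])
    return initials


def draw_samples(corpus_text, n_samples=100, sample_len=44, seed=None):
    """Draw n_samples runs of sample_len consecutive word initials.

    Start positions are chosen at random (an assumption), so samples
    may overlap. Pass a seed to make the draw reproducible.
    """
    rng = random.Random(seed)
    initials = word_initials(corpus_text)
    if len(initials) < sample_len:
        raise ValueError("corpus has fewer words than the sample length")
    samples = []
    for _ in range(n_samples):
        start = rng.randrange(len(initials) - sample_len + 1)
        samples.append("".join(initials[start:start + sample_len]))
    return samples
```

Each returned string could then be tested against the base corpus in the same way as the 20 benchmark samples.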

Increasing the sample size of the English base text had very little effect on the graphs produced in the Initial Validation and European Language Comparison sections, and so these graphs have been excluded. A closer look at the changes to the chi-squared values and ''p-values'' for the Somerton Man code variants caused by this increased sample size can be obtained by comparing Figures 40 and 41. Increasing the number of 44-letter English samples from 20 to 100, however, did have an effect on the chi-squared value and ''p-value'' calculated for use as our significance level, the results of which can be seen in Figure 41. This increase in the number of samples had very little effect on the graph of the ''Universal Declaration of Human Rights'' significance level calculation, and so this graph has also been omitted.