
==Verification of Past Results==
To establish guidelines for the project, such as which language and cipher method the code most likely uses, it was important to re-run and analyse the results obtained by the previous students. This was done in two ways:

*Collection and compilation of random letter samples 
*Re-run and verification of past algorithms

===Random Letters===
[[Image:randomletter.jpg|thumb|right|Completed Random Letter Sample form.]]
A total of 46 samples (23 sober, 23 intoxicated) were collected from a range of subjects.  The samples were compiled and compared to the letter distribution found in the mystery code.  As in last year’s study, samples were taken from both sober and intoxicated subjects in order to also analyse the likelihood of the mystery code having been produced randomly by a person under the influence of a poison.
Due to the moral and ethical issues that would arise from actually poisoning people, alcohol was used as a substitute to impair each subject’s judgment.
A template for collecting random letter samples was created, and each subject was asked to write 45 random letters. 45 was chosen as it is roughly the number of letters in the Somerton man code. The details recorded for each subject are listed below and can be seen in the completed form to the right.

*Intox. Yes/no - the physical state of each subject was recorded to keep samples taken from intoxicated subjects separate from those taken from subjects who were, to the best of our knowledge, not under the influence of any substance.

*Age – each subject was required to supply their age for future use 

*M/F – the sex of each subject was recorded for future use

As the sample form to the right shows, it was initially planned to use the age and sex data to subdivide the results into categories, in order to determine whether the letters in the Somerton man code correlate better with a specific age group or sex.  After the results had been compiled, it was realised that this information would only prove fruitful with a much larger number of random samples.  These extra details have therefore not been used in the results of this project.

====Discussion====
As can be seen, the results collected are not identical to the previous students’ results; however, both sets are very different to the graph depicting the code from the Somerton man. In both sets of results there is not a single letter with a frequency of less than 1%. This leads us to agree with the previous conclusion that the code was not produced randomly in a sober attempt to deceive or create a diversion.

A very similar result was obtained with the intoxicated samples.  This suggests that it is also unlikely that the mystery code has been written in a random fashion by an intoxicated or delusional individual.
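The frequency comparison described above can be sketched in a few lines. This is a minimal illustration and not the tooling used in this project: the function names <code>letter_percentages</code> and <code>rare_letters</code> are hypothetical, and the sample strings in the usage lines are made-up placeholders rather than collected data.

```python
from collections import Counter
import string


def letter_percentages(samples):
    """Return the percentage frequency of each letter A-Z over all samples."""
    letters = [c for s in samples for c in s.upper() if c in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    return {
        ch: (100.0 * counts[ch] / total if total else 0.0)
        for ch in string.ascii_uppercase
    }


def rare_letters(percentages, threshold=1.0):
    """List the letters whose frequency is below `threshold` percent."""
    return sorted(ch for ch, pct in percentages.items() if pct < threshold)


# Placeholder strings only; these are NOT samples collected for the project.
placeholder_samples = ["QWKZPLMXA", "zkqjhgfds"]
pct = letter_percentages(placeholder_samples)
print(rare_letters(pct))
```

Running the same two functions on the pooled random samples and on the mystery code would give the two distributions being compared, with <code>rare_letters</code> listing the letters that fall below the 1% mark.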

It is important that the two sets of results (this year’s and last year’s) differ.  This suggests that the number of samples collected each time is most likely not enough to give a definitive answer as to whether or not the code was produced randomly by an individual.  Future studies could collect a much larger number of samples to see whether the letter distribution varies greatly from those already collected, and thereby reach a more definitive conclusion.

Another important detail is the small number of letters in the mystery code itself (about 45).  If a given letter of the alphabet is only expected to occur, say, 1% of the time when produced randomly by a person, then it is more likely than not that it would not occur at all in a sample of only 45 letters.  If the code was produced randomly, this could explain why some letters, such as X, Y, and J, are not present in it at all.  However, the high frequency of the letters K and Z in the collected samples, neither of which appears in the mystery code, suggests that this is not the case.
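The chance of a letter being absent from a short sample can be checked with a short calculation. The sketch below assumes each of the 45 letters is drawn independently with the same probability, which is a simplification of how a person actually writes "random" letters; the function name <code>prob_letter_absent</code> is hypothetical.

```python
def prob_letter_absent(p, n):
    """Probability that a letter with per-position probability p
    never appears in n independent positions."""
    return (1.0 - p) ** n


# A letter expected 1% of the time, in a 45-letter sample.
print(round(prob_letter_absent(0.01, 45), 3))
```

Under this assumption a 1% letter is missing from a 45-letter sample roughly 64% of the time, while a letter written 5% of the time would be missing only about 10% of the time. This is why the absence of letters that subjects wrote frequently is more informative than the absence of rare ones.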

In summary, from the results obtained, it is reasonable to conclude that the mystery code was not produced randomly by somebody in a state of intoxication or as a form of deception; however, this could be further established with a larger sample base.


[[Image:Random_letter_samples_pic.jpg|center]]

Below are the results obtained by the 2009 students.
[[Image:compare.png|center]]

===Verification of Past Algorithms===

Last year’s project group attempted to selectively rule out different possibilities for the code’s meaning in order to get a better idea of what the code is. Specifically, they tested the code against transposition ciphers, tested it as an initialism, and tested several different cipher schemes in both English and other languages.

One of the tests last year’s group performed examined the possibility of the code being a one-time pad. Using the Somerton Man’s code as the ciphertext and segments of the Bible as the cipher key, a candidate plaintext was obtained. This was repeated using each of the poems in the Rubaiyat, and in all of these cases the resultant plaintext revealed no sentences or English words at all. The resultant output for one of the lines in the Rubaiyat is shown below (the rest can be found [[Media:Past Year Results.rar|here]]).

 Ciphertext:  MTBIMPANETP
 Cipherkey:  ANDLOTHEHUNTEROFTHEEASTHASCAUGHT
 Resultant Plaintext:  ESICAUMGKLK

Because the cipher key is longer than the ciphertext in each line, the key never repeats, so each line could be deciphered independently as a Vigenere cipher.
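The deciphering step can be sketched as follows. This is a generic textbook Vigenere decryption and not the 2009 group's program: it assumes the convention A=0 with plaintext = (ciphertext − key) mod 26, and the function name <code>vigenere_decrypt</code> is hypothetical. The letter-numbering convention used in 2009 is not stated here, so the output of this sketch for the line above may differ from theirs.

```python
import string

ALPHABET = string.ascii_uppercase


def vigenere_decrypt(ciphertext, key):
    """Decrypt letter by letter, assuming A=0 and P = (C - K) mod 26.

    The key must be at least as long as the ciphertext, so it is never
    repeated (the one-time-pad case). Inputs must contain letters only.
    """
    if len(key) < len(ciphertext):
        raise ValueError("key must be at least as long as the ciphertext")
    plain = []
    for c, k in zip(ciphertext.upper(), key.upper()):
        shift = ALPHABET.index(k)
        plain.append(ALPHABET[(ALPHABET.index(c) - shift) % 26])
    return "".join(plain)


# Standard textbook example (not project data).
print(vigenere_decrypt("LXFOPVEFRNHR", "LEMONLEMONLE"))  # ATTACKATDAWN
```

Only the first <code>len(ciphertext)</code> letters of the key are used, which matches the case described above where each key line is longer than the ciphertext line.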


Last year’s group also used [http://en.wikipedia.org/wiki/Markov_chain Markov chains] to estimate how likely the code is to be in a certain language or to use a specific cipher. In this project, the Markov chain<ref>http://en.wikipedia.org/wiki/Markov_chain</ref> models a line of text as a random process in which the next character depends only on a fixed number of previous characters in the text. Due to the complexity of these chains, only the first- and second-order probabilities were calculated, using the following equations:

 MP(firstorder) = p(X1)p(X2 | X1)p(X3 | X2)...p(Xn | Xn − 1) 
 MP(secondorder) = p(X1)p(X2 | X1)p(X3 | X2,X1)...p(Xn | Xn − 1,Xn − 2)

For example, the first-order Markov probability is the probability of getting letter 1, times the probability of getting letter 2 given that the previous letter was letter 1, times the probability of getting letter 3 given that the previous letter was letter 2, and so on.
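The first-order product described above can be sketched as below. This is an illustrative sketch under stated assumptions, not the 2009 group's implementation: the probabilities are estimated from a training text by counting letters and letter pairs, add-one smoothing is used so that unseen pairs do not give a zero probability, and the result is accumulated as a base-2 logarithm to avoid numerical underflow. The function name <code>first_order_log2_prob</code> and the smoothing choice are assumptions of this sketch.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def first_order_log2_prob(text, training):
    """log2 of p(X1) * p(X2|X1) * ... * p(Xn|Xn-1), estimated from `training`."""
    train = [c for c in training.upper() if c in ALPHABET]
    seq = [c for c in text.upper() if c in ALPHABET]
    if not seq:
        return 0.0
    k = len(ALPHABET)
    unigrams = Counter(train)                 # counts for p(X1)
    bigrams = Counter(zip(train, train[1:]))  # counts of (previous, next) pairs
    firsts = Counter(train[:-1])              # times each letter starts a pair

    # p(X1), with add-one smoothing (an assumption of this sketch)
    logp = math.log2((unigrams[seq[0]] + 1) / (len(train) + k))
    for prev, cur in zip(seq, seq[1:]):
        # p(cur | prev), with add-one smoothing
        logp += math.log2((bigrams[(prev, cur)] + 1) / (firsts[prev] + k))
    return logp


# Placeholder training text; a real run would use a large corpus.
print(first_order_log2_prob("THE", "THE CAT SAT ON THE MAT"))
```

A second-order version would follow the same pattern, counting triples of letters and conditioning each letter on the two before it.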

Because these probabilities are extremely small (of the order of 10^-60), they were normalised using this equation:

 HMMER Score = log2(MP/(1/26)^44)

Here (1/26)^44 is the probability of a 44-letter sequence when each letter of the alphabet has a 1/26 chance of occurring at each of the 44 positions of the code sequence.
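The normalisation can be sketched as a log-odds calculation against the uniform model. The sketch works with the base-2 logarithm of the Markov probability rather than the probability itself, since log2(MP/(1/26)^44) = log2(MP) + 44·log2(26). The function name <code>normalised_score</code> is hypothetical, and the 10^-60 input in the usage line is only an illustration of the order of magnitude mentioned above, not a measured result.

```python
import math


def normalised_score(log2_mp, length=44, alphabet_size=26):
    """log2(MP / (1/alphabet_size)**length), given log2(MP).

    A score of 0 means the sequence is exactly as likely as under a
    uniform random model; a positive score means it is more likely.
    """
    return log2_mp + length * math.log2(alphabet_size)


# Illustrative input only: a Markov probability of the order of 10^-60.
print(normalised_score(math.log2(1e-60)))
```

A sequence scored under the uniform model itself gives a score of exactly zero, so the sign of the score shows at a glance whether a language or cipher model explains the code better or worse than random letters.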

This process was applied to various texts that had been enciphered with the Playfair and Vigenere ciphers. The resultant Markov probabilities were very low, and it was determined to be very unlikely that these ciphers were used. The process was also used to obtain Markov probabilities for the code being an initialism in different languages by analysing texts in various languages. It was determined that the code was most likely to be English if it was an initialism<ref>https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F#Initial_Letters_of_a_sentence</ref>.

Their results indicate that the Somerton Man’s code is not a one-time pad or a Playfair or Vigenere cipher, but rather resembles a set of initialisms which may or may not be substituted. From reviewing the previous year's software and attempting to confirm their results, we were able to come to the same conclusion. Based on this, our software analyses initialisms drawn from English text in order to further narrow down what the mystery code could be.