
==Results==
===Exact Initialism===
====Method====
The table below lists the relative frequencies of the first letters of words in the English language, taken from [http://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_the_first_letters_of_a_word_in_the_English_language Wikipedia]<ref>http://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_the_first_letters_of_a_word_in_the_English_language</ref>. We chose random 3- and 4-character segments of the Somerton Man’s code and used this table to obtain the expected probability of each segment occurring in regular English text, treating the initial letter of each word as independent of its neighbours.

<center>
{|class="wikitable sortable" border="1"
|-
!Letter
!Frequency
|-
|'''a'''||11.602%
|-
|'''b'''||4.702%
|-
|'''c'''||3.511%
|-
|'''d'''||2.670%
|-
|'''e'''||2.000%
|-
|'''f'''||3.779%
|-
|'''g'''||1.950%
|-
|'''h'''||7.232%
|-
|'''i'''||6.286%
|-
|'''j'''||0.631%
|-
|'''k'''||0.690%
|-
|'''l'''||2.705%
|-
|'''m'''||4.374%
|-
|'''n'''||2.365%
|-
|'''o'''||6.264%
|-
|'''p'''||2.545%
|-
|'''q'''||0.173%
|-
|'''r'''||1.653%
|-
|'''s'''||7.755%
|-
|'''t'''||16.671%
|-
|'''u'''||1.487%
|-
|'''v'''||0.619%
|-
|'''w'''||6.661%
|-
|'''x'''||0.005%
|-
|'''y'''||1.620%
|-
|'''z'''||0.050%
|}
</center>
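
As a minimal sketch of how the expected probability can be obtained from the table, assuming the initial letters of successive words are independent, the frequencies of the letters in a segment are simply multiplied together. The names <code>FIRST_LETTER_PERCENT</code> and <code>expected_probability</code> are illustrative only, and just the letters needed for the example segments are copied from the table.

```python
import math

# First-letter frequencies (percent) copied from the table above.
# Only the letters needed for this example are included (assumption:
# a full run would use all 26 rows).
FIRST_LETTER_PERCENT = {
    "b": 4.702,
    "g": 1.950,
    "i": 6.286,
    "m": 4.374,
    "r": 1.653,
    "t": 16.671,
}


def expected_probability(segment):
    """Probability of the segment, assuming independent initial letters."""
    return math.prod(FIRST_LETTER_PERCENT[ch] / 100.0 for ch in segment.lower())


for segment in ("MRG", "ITT", "MTBI"):
    print(segment, expected_probability(segment))
```

Under this independence assumption a segment and its reverse have the same expected probability, because multiplication does not depend on order.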


Several different types of text were divided into categories and then tested in groups. The categories are The Bible, The Rubaiyat, Novels, Science Texts, Long Poems, and works by William Shakespeare. The complete list of texts used can be found in the [[Final_Report_2010#Texts_Used_in_Analysis|appendix]]. The expected probability was then compared with the actual proportion of occurrences in each category, to determine which text type the mystery code could have come from. The actual proportion was calculated using the following formula:

  P(actual) = Total Number of Occurrences / (Total Words in Text – n + 1)

where n is the number of letters in the segment being searched for, so that the denominator is the number of n-word windows in the text.
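
The formula can be sketched as follows. This is an illustration under stated assumptions, not the text parser used in the project: the tokenisation rule (a word is a run of letters and apostrophes that starts with a letter) and the function names are hypothetical.

```python
import re


def word_initials(text):
    """Lower-case first letter of every word in the text."""
    return [w[0].lower() for w in re.findall(r"[A-Za-z][A-Za-z']*", text)]


def actual_proportion(text, segment):
    """Occurrences of the segment divided by (total words - n + 1)."""
    initials = "".join(word_initials(text))
    segment = segment.lower()
    n = len(segment)
    windows = len(initials) - n + 1  # number of n-word windows in the text
    if n == 0 or windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if initials[i:i + n] == segment)
    return hits / windows


sample = "The quick brown fox. The quiet bear!"
print(actual_proportion(sample, "TQB"))        # segment as written
print(actual_proportion(sample, "TQB"[::-1]))  # reversed segment
```

Overlapping occurrences are counted separately, which matches the sliding-window denominator in the formula.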

Using the results, several tables and graphs were generated to simplify viewing. The raw tables and Excel sheets can be found in the appendix; some of the graphs are shown below. The graphs show the proportion of each sequence found in the selected texts: red bars show the sequence as it appears in the Somerton Man's code, and blue bars show the reverse of the sequence (where the sequence is a palindrome, a blue bar is displayed for both). The far-left bar of each graph shows the expected probability of the sequence occurring in regular English text (found using the table above), and the remaining bars show the actual proportions found in each of the text types. This makes it easy to see whether the sequence was found more or less often than expected for each text type.


[[Image:mrg grm.jpg|left|530px]][[Image:aba.jpg|right|530px]]
[[Image:itt tti.jpg|center|530px]]
[[Image:mtbi ibtm.jpg|left|530px]][[Image:ttmt tmtt.jpg|right|530px]]
[[Image:tgab bagt.jpg|center|530px]]

The remaining direct initialism graphs can be found on [[Direct Initialism Graphs|this page]].


A raw comparison of the actual number of hits can be seen in the tables below.


[[Image:3 letter greatest hits.jpg|center|1050px]]


[[Image:4 letter greatest hits.jpg|center|1050px]]



The tables show the total number of forward and backward initialisms found in each text, as well as the total number of words in the text. This matters because longer texts have a greater chance of containing more initialisms, so to compare texts properly we have to determine how many initialisms were found relative to the length of each text (the row titled "Ratio Relative to Total"). This is done for forward and backward initialisms separately as well as together.

From the tables, we can see that the text with the greatest number of initialisms found was the Revised Standard Version (RSV) of the Bible. However, this is likely to be the case only because the RSV Bible text files contain far more words than the Rubaiyat or the poems. Relative to the length of the texts tested, The Rubaiyat has the most hits for 3-letter initialisms, whilst the science texts have the most hits for 4-letter initialisms.

====Discussion====
Our results indicate that most of the code segments occur at around the expected frequency for each of the text types, so we cannot determine whether the code belongs to any of the tested text types. However, from inspection of the results, the code is more likely to be divided into 3-letter segments than 4-letter segments. We also considered that the code could be written backwards. Typically, the results favoured forward segments over backward segments, although on occasion the backward segment appeared more often than the forward segment.

Of particular note was that hardly any of the 3- or 4-character code segments appeared in the Rubaiyat. This seems highly suspect: it is very unlikely that almost none of the code segments would be found in the Rubaiyat by chance, so it is possible that someone intentionally constructed the mystery code so that this would happen. More testing is needed to support this, by testing more poems and possibly truncating other poems to the same size as the Rubaiyat.

To test if The Rubaiyat results were meaningful or just an artifact of length, the same assorted poetry texts were truncated to a size of 400 to 500 lines down from several thousand (a similar size to The Rubaiyat text file) and then the same tests were run again. The results for this are shown in the graphs below.

[[Image:compare poems1.jpg|center|530px]]


[[Image:compare poems2.jpg|left|530px]]
[[Image:compare poems3.jpg|right|530px]]

<br style="clear:both" />

[[Image:compare poems4.jpg|center|530px]]

The graphs show the expected probability of finding each sequence in English text on the left, followed by the actual proportion that occurred in The Rubaiyat and in the truncated poetry texts. For the truncated poetry texts, there are some cases where the initialism is not found at all, but more often than not the initialism sequence is found at close to the expected rate. These results suggest that the abnormal results found for The Rubaiyat are unlikely to be an artifact of length and are more likely to be intentional.

These exact initialism results suggest that the Somerton Man’s code could be a substitution cipher of initialisms found in the Rubaiyat, since it is very unlikely that almost none of the tested code segments would be found in the Rubaiyat by chance. To test this theory further, we began testing for patterns of initialisms to see whether there was any correlation between texts, in an attempt to narrow down what the code has been substituted from.

===Pattern Initialism===

The pattern matching algorithm was used in a very similar manner to the initialism algorithm discussed above. Patterns that contain two symbols and appear in the mystery code were highlighted and then used as the basis for a wildcard pattern search on the different types of English texts.

====Pattern Selection====

As shown in the following table, the letters of the mystery code were spread out in order to single out the patterns that actually occur. The different pattern combinations are highlighted in yellow and are broken down further in the next table.

As shown, patterns that occur both horizontally and vertically were used for the testing.  It has since been decided that the vertical patterns are unlikely to be of any interest as a simple inspection of the actual mystery code shows very little structure or consistency in a vertical manner compared to the horizontal. 



<center>
{| class="wikitable" border="1" cellpadding="2" style="text-align:center"
|+ Apparent patterns in the Somerton Man Code
|-
|bgcolor="yellow"|M
|R
|G
|O
|bgcolor="yellow"|A
|bgcolor="yellow"|B
|bgcolor="yellow"|A
|bgcolor="yellow"|B
|D
|
|
|
|
|-
|bgcolor="yellow"|M
|bgcolor="yellow"|T
|B
|I
|M
|P
|bgcolor="yellow"|A
|N
|E
|bgcolor="yellow"|T
|P
|
|
|-
|bgcolor="yellow"|M
|bgcolor="yellow"|L
|I
|A
|B
|O
|bgcolor="yellow"|A
|bgcolor="yellow"|I
|bgcolor="yellow"|A
|bgcolor="yellow"|Q
|C
|
|
|-
|bgcolor="yellow"|I
|bgcolor="yellow"|T
|bgcolor="yellow"|T
|bgcolor="yellow"|M
|bgcolor="yellow"|T
|S
|bgcolor="yellow"|A
|M
|S
|bgcolor="yellow"|T
|G
|A
|B
|}
</center>

The highlighted letters are all, in some form, part of a two symbol pattern of length 3 or 4. As shown below, patterns occurring both forwards (left to right) and backwards (right to left) were used for the testing.  This assumes that there is a possibility that the code was written backwards.

From these patterns the individual wildcard code is derived, giving the patterns of interest that were used on the different English texts.  For example where ABAB occurs in the mystery code, a pattern of @#@# was searched to reveal any combination of letters that match the ‘pattern’ of ABAB.

A full breakdown of the patterns of interest and their associated combination can be seen in the following table.



<center>
{| class="wikitable" border="1"
|+ Interesting Pattern Symbol Breakdown
|-
! scope="col" | Interesting 4 Symbol Sequences
! scope="col" | Interesting 3 Symbol Sequences
|-
! scope="col" colspan = "2"| Horizontal
|-
| ABAB - @#@# || ABA - @#@ 
|-
| TTMT - @@#@ || AIA - @#@ 
|-
| TMTT - @#@@ || ITT  - @##
|-
| || TTI - @@#
|-
| || TTM - @@#
|-
| || MTT - @##
|-
| || TMT - @#@
|-
| || BAB - @#@
|-
! scope="col" colspan = "2"| Vertical
|-
| MMMI - @@@# || TLT - @#@
|-
| IMMM - @### || MMM - @@@
|-
| AAAA - @@@@ || AAA - @@@
|-
|  || TQT - @#@
|-
|  || MMI - @@#
|-
|  || IMM - @##
|-
! scope="col" colspan = "2"| Patterns of Interest
|-
| @#@# || @#@
|-
| @@#@ || @##
|-
| @#@@ || @@#
|-
| @@@# || @@@
|-
| @### || 
|-
| @@@@ || 
|-
|}
</center>
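
The wildcard notation in the table above can be sketched in code. The approach below is one possible reading, not the project's actual implementation: each window of word initials is reduced to its symbol pattern (first distinct letter becomes @, second becomes #), and windows whose pattern equals the pattern of interest are tallied by letter combination. The function names and the tokenisation rule are assumptions.

```python
import re
from collections import Counter

SYMBOLS = "@#"  # the table above uses at most two distinct symbols


def pattern_of(letters):
    """Return the wildcard pattern of a letter sequence, or None if the
    sequence needs more than two distinct symbols."""
    mapping = {}
    out = []
    for ch in letters.lower():
        if ch not in mapping:
            if len(mapping) == len(SYMBOLS):
                return None
            mapping[ch] = SYMBOLS[len(mapping)]
        out.append(mapping[ch])
    return "".join(out)


def pattern_matches(text, pattern):
    """Count each combination of word initials that fits the pattern."""
    words = re.findall(r"[A-Za-z][A-Za-z']*", text)
    initials = "".join(w[0].lower() for w in words)
    n = len(pattern)
    counts = Counter()
    for i in range(len(initials) - n + 1):
        window = initials[i:i + n]
        if pattern_of(window) == pattern:
            counts[window] += 1
    return counts


print(pattern_of("ABAB"), pattern_of("TTMT"), pattern_of("IMMM"))
print(pattern_matches("To be, or not to be: that is the question", "@#@"))
```

With this reading, @ and # always stand for different letters, so a run such as AAA is classified as @@@ rather than @#@, consistent with the separate rows in the table.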



Each of the pattern combinations found in the mystery code was then used to search through different types of text so that any differences between the initial letters used in the different text types would be clear.  The result of these tests would then hopefully suggest the most likely origin of the mystery code if it is in fact initialism.


The types of texts used were:
*Poems (a variety of long and short poems from a large variety of poets/archives)
*Novels 
*Science texts (a selection of textbooks - chemistry, maths and physics)
*Shakespeare (entire collection)
*The Bible (separately tested the King James and Revised Standard Versions)
*Rubaiyat (book of poems associated with the murder itself)

====Discussion====
The results, for each pattern possibility, have been tabulated and can be found in the appendix. A few example results tables are shown below.  The highlighted cells depict the patterns that are an exact match to the pattern sourced from the mystery code.


[[Image:Patternpic1.JPG|center]]


[[Image:Patternpic2.JPG|center]]


[[Image:Patternpic3.JPG|center]]


The first thing that is apparent from the results is the similarities between the different types of text.  This suggests that the general occurrence of the order of initial letters in English is independent of the source.  That is, it does not really matter where the raw text is from as it is all English and will produce very similar results.


The second interesting point is the highlighted results that relate to the letters taken directly from the mystery code. Having these results occur near the top of the list implies that it is still technically possible that the code is an initialism without substitution. However, apart from the single-letter patterns (@@@@ and @@@), there is a very large spread of results, with the top result usually accounting for only about 5% of the total number of matches found. The top 26 matches are shown for each search; however, in the full list some patterns returned up to 500 different match variations. This suggests that if the mystery code is in fact an initialism, there is a good chance that it also uses some type of substitution method. That is, each letter actually represents a different letter.


The very low number of matches that link the Rubaiyat directly to the mystery code is also important. This book, linked so closely to the murder, would be the most obvious source of the code. The results suggest that if the code has been made as an initialism from the Rubaiyat, then it almost certainly also incorporates substitution.

===Further Research===

Our results were intended to identify the type of text that the Somerton Man’s code is from; however, no single text type showed occurrences much greater than expected. In fact, other than The Rubaiyat, no probabilistic irregularities were found. To determine whether this is significant to unravelling the Somerton Man’s code, further testing is required. This can be done by running the text parser on more code segments, as well as on other texts.

Additionally, further testing with poems is needed. Only a small number of the poem texts tested match the format of the Rubaiyat (four lines per poem), so tests of other four-line poems are needed to determine whether the exact initialism results are specific to the Rubaiyat or apply to all or most poems of a similar size. Another method would be to truncate the poem texts to the same size. The Rubaiyat is about 84 poems of 4 lines each, so we can truncate the other poem texts to either 84 poems, or possibly 300-400 lines, and then determine whether there are any differences in the results. Although this truncation has been applied to the tested set of poetry texts, the results would be more reliable with a larger set of poetry texts.
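
A minimal sketch of the truncation step is given below, assuming each poem text is plain text with one line of verse per line and that blank lines should not count towards the limit. The function names are hypothetical.

```python
def truncate_to_lines(text, max_lines):
    """Keep only the first max_lines non-blank lines of a poem text."""
    if max_lines <= 0:
        return ""
    kept = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines between stanzas
            kept.append(line)
            if len(kept) == max_lines:
                break
    return "\n".join(kept)


def truncate_to_poems(text, max_poems, lines_per_poem=4):
    """Keep the first max_poems poems, assuming a fixed number of lines each."""
    return truncate_to_lines(text, max_poems * lines_per_poem)


poem = "line one\nline two\n\nline three\nline four\n\nline five\n"
print(truncate_to_lines(poem, 3))
```

The truncated text can then be passed through the same tests as the full texts so that the results are compared at a similar length.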

Also, at the moment, our results assume that all of the words in the text and in the code are independent of each other. We know this isn’t actually the case and that there is some sort of dependence between initial letters of text. Our results could be expanded by taking this into account and using Markov chains similar to last year’s project group<ref>https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Markov_models</ref>.
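
As a rough illustration of how dependence between initial letters could be taken into account, the sketch below estimates a first-order Markov model of word initials from a text and uses it to score a segment. This is an assumption-laden example, not the previous project group's method: the function names are hypothetical, probabilities are raw relative frequencies, and no smoothing is applied, so any unseen transition gives a probability of zero.

```python
import re
from collections import Counter, defaultdict


def word_initials(text):
    """Lower-case first letter of every word in the text."""
    return [w[0].lower() for w in re.findall(r"[A-Za-z][A-Za-z']*", text)]


def train_first_order(initials):
    """Count single initials and initial-to-next-initial transitions."""
    start = Counter(initials)
    transitions = defaultdict(Counter)
    for prev, nxt in zip(initials, initials[1:]):
        transitions[prev][nxt] += 1
    return start, transitions


def markov_probability(segment, start, transitions):
    """P(first letter) times the product of P(next letter | previous letter)."""
    segment = segment.lower()
    total = sum(start.values())
    if total == 0 or not segment:
        return 0.0
    prob = start[segment[0]] / total
    for prev, nxt in zip(segment, segment[1:]):
        row = transitions.get(prev)
        row_total = sum(row.values()) if row else 0
        if row_total == 0:
            return 0.0
        prob *= row[nxt] / row_total
    return prob


start, transitions = train_first_order(word_initials("the cat the dog the cat"))
print(markov_probability("TC", start, transitions))
print(markov_probability("CD", start, transitions))
```

Unlike the independence model, a segment and its reverse can receive different probabilities here, which may be relevant when comparing forward and backward readings of the code.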