==Technical Background==

===P-Value Theorem Explanation===
The ''p-value'' is the probability of observing an effect equal to or greater than the one actually observed, presuming the null hypothesis of no effect is true. It is a measure of evidence against the null hypothesis <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.</ref>.

In statistics, the ''p-value'' is used to test a statistical hypothesis against observed sample results. It provides a way of judging whether the null hypothesis of no effect is plausible. In this project, we use ''p-values'' in some simple tests to determine whether English was the language of the original message <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.</ref>.

In this case, our null hypothesis is that English was not the language of the original message. If this assumption holds, the simple test should give a ''p-value'' higher than 0.05, meaning the observed data point lies in the range of “more likely observation” shown in Figure 3 below. Conversely, if the test gives a ''p-value'' lower than 0.05, the observed data point lies in the range of “very un-likely observations”. This suggests that our assumption is wrong and that English was the language of the original message <ref>G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.</ref>.

[[File: Technical_Background_P-value.png |thumb|700px|centre|'''Fig. 3:''' Example of a ''p-value'' computation <ref>No Author. p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1</ref>]]

In the graph in Figure 3, the horizontal axis is the set of possible results and the vertical axis is the probability density. The ''p-value'' is the area between the curve and the horizontal axis to the right of the observed data point <ref>No Author. p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1.</ref>.
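
The right-tail area described above can be sketched in a few lines of Python. This is only an illustration, not the test used in the project: it assumes the test statistic follows a standard normal distribution, which the text does not state, and the function names <code>right_tail_p_value</code> and <code>reject_null</code> are hypothetical.

```python
import math

ALPHA = 0.05  # significance threshold quoted in the text


def right_tail_p_value(z: float) -> float:
    """Area under the standard normal density to the right of z.

    Assumption: the test statistic is standard normal. Other
    distributions need a different tail function.
    """
    return 0.5 * math.erfc(z / math.sqrt(2))


def reject_null(p_value: float, alpha: float = ALPHA) -> bool:
    """Return True when the p-value falls below the threshold."""
    return p_value < alpha


if __name__ == "__main__":
    for z in (0.0, 1.0, 1.96, 3.0):
        p = right_tail_p_value(z)
        print(f"z = {z:4.2f}  p = {p:.4f}  reject null: {reject_null(p)}")
```

An observation far into the right tail gives a small area and therefore a small ''p-value'', which corresponds to the “very un-likely observations” region of Figure 3.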

===Chi-Squared Test Explanation===
The ''chi-squared test'' is a statistical test commonly used to compare observed data with expected data. It requires that there is no relationship between the observed data and the expected data: the observed data should not affect the expected data, and the two should be independent of each other <ref> N Turner, “Chi-squared test” Journal of clinical nursing, Vol.9 (1), pp.93 2000. </ref>.

The ''chi-squared test'' is a ‘goodness of fit’ test, meaning it measures how closely one group of data matches another. Since it is a non-parametric test, it makes no assumption about the distribution of the samples: there is no requirement that the expected and observed data follow a normal distribution <ref> N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944 </ref>.

[[File: Technical_Background_Chi-squared.png |thumb|1000px|centre|'''Fig. 4:''' Chi-squared example]]

Figure 4 is a simple example showing how our group uses the ''chi-squared test'' for Task 1. First, we calculate the chi-squared value for the letter A using the equation on the right of Figure 4. The observed value is the count of the letter A in the Somerton Man code; the expected value is the total number of letters in the Somerton Man code, which is 44, multiplied by the frequency of the letter A in the base text language used. This is repeated to calculate the chi-squared value for each of the remaining letters from B to Z. Finally, the chi-squared values for all letters are summed. The sum is the chi-squared value between the base text language and the Somerton Man code. A lower chi-squared value means the Somerton Man code is more likely to be in that language.
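
The per-letter calculation and summation described above might be sketched as follows. The equation in Figure 4 is not reproduced in the text, so this sketch assumes the standard Pearson form, (observed − expected)² / expected. The function name <code>chi_squared</code> and the frequency table passed to it are hypothetical, and skipping letters whose expected count is zero is a choice made here to avoid division by zero, not something the text specifies.

```python
import string
from collections import Counter
from typing import Mapping


def chi_squared(code: str, letter_freq: Mapping[str, float]) -> float:
    """Sum the per-letter chi-squared values between a code and a language.

    code        -- the text under test (non-letters are ignored)
    letter_freq -- relative frequency of each letter A-Z in the base text
                   (hypothetical input; values are expected to sum to about 1)
    """
    letters = [c for c in code.upper() if c in string.ascii_uppercase]
    total = len(letters)
    observed = Counter(letters)

    score = 0.0
    for letter in string.ascii_uppercase:
        expected = total * letter_freq.get(letter, 0.0)
        if expected == 0:
            # Assumption: letters absent from the base text are skipped.
            continue
        score += (observed[letter] - expected) ** 2 / expected
    return score


if __name__ == "__main__":
    # Toy two-letter "language" used purely for illustration.
    toy_freq = {"A": 0.5, "B": 0.5}
    print(chi_squared("AABB", toy_freq))  # observed matches expected
    print(chi_squared("AAAA", toy_freq))  # observed differs from expected
```

Running the function once per candidate language and comparing the results reproduces the ranking step: the language with the lowest value is the closest fit.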

===Universal Declaration of Human Rights Explanation===
The ''Universal Declaration of Human Rights'' was created in 1948 and has been translated into over 400 different languages <ref>Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml </ref>. The group used translations of this declaration as the base text for Task 1: Statistical Frequency Analysis of Letters.
[[File: Technical_Background_UN_Declaration.png |thumb|300px|centre|'''Fig. 5:''' ''Universal Declaration of Human Rights'' From British Library<ref>The British Library Board. Taking Liberties [online]. Available: http://www.bl.uk/onlinegallery/takingliberties/staritems/645universaldeclarationhumanrightspic.html</ref>]]

===Project Gutenberg Explanation===
[[File: Technical_Background_Project_Gutenberg.png |thumb|300px|right|'''Fig. 6:''' ''Project Gutenberg'' Logo <ref>Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/</ref>]]
''Project Gutenberg'' was established in 1971. It offers over 50,000 free ebooks available in a range of different formats including TXT, HTML, PDF and EBOOK and a number of languages including Chinese, Danish, Dutch, English, Esperanto, Finnish,  French, German, Greek, Hungarian, Italian, Latin, Portuguese, Spanish, Swedish and Tagalog <ref> Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/ </ref>.  The database was used to provide base texts for Task 4: Statistical Frequency of Letters Reanalysis due to the increased sample size of letter frequencies in each language when compared to the ''Universal Declaration of Human Rights'' as a base text.

===N-Gram Model Explanation===
An ''n-gram'' is a contiguous sequence of n items taken from a given sequence of phonemes, syllables, letters, words or base pairs. N-grams are typically collected from articles or books. If the items are words, n-grams may also be called shingles <ref> A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166. </ref>.

An ''n-gram'' of size n=1 is called a "unigram", one of size n=2 is called a "bigram", one of size n=3 is called a "trigram", and so on.

An ''n-gram model'' is a type of probabilistic language model for predicting the next item. In short, the next item is assumed to depend only on the preceding (n-1) items, regardless of any earlier items in the sequence <ref> No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17 </ref>. The group will use an n-gram database to find common groups of words for a variety of ''initialisms''.
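
As a rough illustration of the prediction idea, the sketch below estimates the simplest case, a bigram model, from raw counts. It is not the n-gram database used by the group; the function name <code>bigram_probabilities</code> and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_probabilities(tokens):
    """Estimate P(next item | previous item) from counts of adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1

    probabilities = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        probabilities[prev] = {nxt: c / total for nxt, c in counts.items()}
    return probabilities


if __name__ == "__main__":
    # Hypothetical sample text, used only to show the shape of the output.
    words = "the cat sat on the mat".split()
    for prev, following in bigram_probabilities(words).items():
        print(prev, following)
```

In this toy sentence the word "the" is followed once by "cat" and once by "mat", so each receives an estimated probability of one half.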

In Figure 7 below, we use one line from the Somerton Man code as a sample sequence to illustrate the ''n-gram model''.
[[File: Technical_Background_N-gram.png |thumb|600px|centre|'''Fig. 7:''' ''N-gram model'' applied to first line of Somerton Man code]]
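
Extracting unigrams, bigrams and trigrams from a letter sequence can be sketched as below. The function name <code>ngrams</code> is hypothetical, and the sample string is a placeholder rather than an actual line of the Somerton Man code.

```python
def ngrams(sequence, n):
    """Return every contiguous run of n items in the sequence, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


if __name__ == "__main__":
    sample = "HELLO"  # placeholder input, not taken from the code itself
    print(ngrams(sample, 1))  # unigrams
    print(ngrams(sample, 2))  # bigrams
    print(ngrams(sample, 3))  # trigrams
```

Because the function relies only on slicing, it works equally on a string of letters or a list of words.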

===One-Time Pad Explanation===
The ''one-time pad'' is an encryption technique that, when used correctly, cannot be cracked without the ''key''. For example, if we have a ''ciphertext'', we need the ''key'' to decode it and recover the correct message <ref> S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract </ref>.

[[File: Technical_Background_One-Time_Pad.png |thumb|1000px|centre|'''Fig. 8:''' ''One-Time Pad'' Example <ref>No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads</ref>]]

As shown in Figure 8 above <ref> No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads </ref>, we have the ciphertext EQNVZ. Labelling the first letter of the alphabet, A, as 0, the letter E is the fifth letter and is labelled 4. Similarly, Q is 16, N is 13, V is 21 and Z is 25. Given the ''key'' XMCKL, we label X as 23, M as 12, C as 2, K as 10 and L as 11 using the same method. We then subtract each key letter from the corresponding ciphertext letter, one by one. Negative results are not allowed, so when E minus X gives 4 - 23 = -19, we add 26 to get 7, which represents the letter H. Processing the remaining letters in the same way, we find the original message to be HELLO. The ''one-time pad'' technique will be used as another method to attempt to decipher the Somerton Man code.
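
The worked example above can be checked with a short Python sketch. The function name <code>otp_decrypt</code> is hypothetical, and the sketch assumes both the ciphertext and the key contain only the letters A to Z.

```python
import string

ALPHABET = string.ascii_uppercase  # A is labelled 0, B is 1, ..., Z is 25


def otp_decrypt(ciphertext: str, key: str) -> str:
    """Subtract each key letter from the matching ciphertext letter, modulo 26."""
    if len(key) < len(ciphertext):
        raise ValueError("the key must be at least as long as the ciphertext")

    plaintext = []
    for c, k in zip(ciphertext.upper(), key.upper()):
        # The modulo wraps negative differences, e.g. 4 - 23 = -19 becomes 7.
        value = (ALPHABET.index(c) - ALPHABET.index(k)) % 26
        plaintext.append(ALPHABET[value])
    return "".join(plaintext)


if __name__ == "__main__":
    print(otp_decrypt("EQNVZ", "XMCKL"))  # HELLO
```

Taking the difference modulo 26 has the same effect as the "add 26 when negative" step in the text, so the two descriptions agree.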