Final Report/Thesis 2015
==Technical Background==
===P-Value Theorem Explanation===
The ''p-value'' is the probability of observing an effect equal to or greater than the one observed, presuming the null hypothesis of no effect is true. It is a measure of evidence against a null hypothesis <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol. 468 (3), pp. 885-892, 2010.</ref>. In statistics, the ''p-value'' is used to test a statistical hypothesis against observed sample results. ''P-value'' testing is an effective method for assessing whether the null hypothesis of no effect is plausible.

In this project, we will use ''p-values'' in some simple tests to determine whether English was the language of the original message <ref>B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol. 468 (3), pp. 885-892, 2010.</ref>. We take as our null hypothesis the assumption that English was not the language of the original message. If this assumption holds, the simple test should give a ''p-value'' higher than 0.05, which means the observed data point lies in the range of “more likely observation” shown in Figure 3 below. Conversely, a ''p-value'' lower than 0.05 means the observed data point lies in the range of “very un-likely observations”. This indicates that our assumption is wrong and that English was the language of the original message <ref>G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol. 87 (5), pp. 1337-1343, 2009.</ref>.

[[File: Technical_Background_P-value.png |thumb|700px|centre|'''Fig. 3:''' Example of a ''p-value'' computation <ref>No Author. p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1</ref>]]

In the graph in Figure 3, the horizontal axis is the set of possible results and the vertical axis is the probability density. The ''p-value'' is the area to the right of the observed data point, between the curve and the horizontal axis <ref>No Author. p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1</ref>.

===Chi-Squared Test Explanation===
The ''chi-squared test'' is a statistical test commonly used to compare observed data with expected data. It requires that there is no relationship between the observed data and the expected data; that is, the observed data should not affect the expected data, and the two should be independent of each other <ref>N. Turner, “Chi-squared test,” Journal of Clinical Nursing, Vol. 9 (1), pp. 93, 2000.</ref>. The ''chi-squared test'' is a ‘goodness of fit’ test, meaning it measures how closely one group of data fits another. Since it is a non-parametric test, it makes no assumption about the distribution of the samples: there is no requirement that the expected and observed data follow a normal distribution <ref>N. Balakrishnan et al., Chi-squared Goodness of Fit Tests with Applications [online]. Available: http://www.sciencedirect.com.proxy.library.adelaide.edu.au/science/book/9780123971944</ref>.

[[File: Technical_Background_Chi-squared.png |thumb|1000px|centre|'''Fig. 4:''' Chi-squared example]]

Figure 4 is a simple example showing how our group uses the ''chi-squared test'' for Task 1. First, we calculate the chi-squared value for the letter A using the equation on the right of Figure 4. The observed value is the count of the letter A in the Somerton Man code; the expected value is the total number of letters in the Somerton Man code, which is 44, multiplied by the frequency of the letter A in the base text language used.
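The per-letter calculation can be sketched in Python, extended to all 26 letters and summed. This is a minimal sketch, not the group's actual implementation: it assumes the equation in Figure 4 is the standard Pearson form, (observed - expected)² / expected, and the function name <code>chi_squared_score</code> and the frequency dictionary it takes are hypothetical names introduced here for illustration.

```python
from collections import Counter
import string


def chi_squared_score(code_letters, language_freqs):
    """Sum (observed - expected)^2 / expected over the letters A to Z.

    code_letters:   the text to test (non-letters are ignored).
    language_freqs: hypothetical dict mapping 'A'..'Z' to the relative
                    frequency of that letter in the base text language.
    """
    letters = [c for c in code_letters.upper() if c in string.ascii_uppercase]
    total = len(letters)
    observed = Counter(letters)

    score = 0.0
    for letter in string.ascii_uppercase:
        expected = total * language_freqs.get(letter, 0.0)
        if expected == 0:
            # Assumption: letters the base text never uses are skipped,
            # since the term would otherwise divide by zero.
            continue
        score += (observed[letter] - expected) ** 2 / expected
    return score


# Toy check with made-up frequencies (not real language data):
# 4 letters, expected A = 2 and B = 2, observed A = 4 and B = 0.
toy_freqs = {"A": 0.5, "B": 0.5}
print(chi_squared_score("AAAA", toy_freqs))  # prints 4.0
```

In practice the frequency dictionary would be built from a base text in each candidate language, and the language with the lowest score would be the closest fit.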
This is repeated to calculate the chi-squared value for each of the remaining letters from B to Z. Finally, the chi-squared values for all letters are summed. The result is the chi-squared value between the base text language and the Somerton Man code. A lower chi-squared value means the Somerton Man code is more likely to be written in that language.

===Universal Declaration of Human Rights Explanation===
The ''Universal Declaration of Human Rights'' was created in 1948 and has been translated into over 400 different languages <ref>Author Unknown. The Universal Declaration of Human Rights [online]. Available: http://www.un.org/en/documents/udhr/history.shtml</ref>. The group used translations of this declaration as the base text for Task 1: Statistical Frequency Analysis of Letters.

[[File: Technical_Background_UN_Declaration.png |thumb|300px|centre|'''Fig. 5:''' ''Universal Declaration of Human Rights'' from the British Library <ref>The British Library Board. Taking Liberties [online]. Available: http://www.bl.uk/onlinegallery/takingliberties/staritems/645universaldeclarationhumanrightspic.html</ref>]]

===Project Gutenberg Explanation===
[[File: Technical_Background_Project_Gutenberg.png |thumb|300px|right|'''Fig. 6:''' ''Project Gutenberg'' Logo <ref>Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/</ref>]]
''Project Gutenberg'' was established in 1971. It offers over 50,000 free ebooks in a range of formats, including TXT, HTML, PDF and EBOOK, and in a number of languages, including Chinese, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Hungarian, Italian, Latin, Portuguese, Spanish, Swedish and Tagalog <ref>Author Unknown. Free ebooks by Project Gutenberg [online]. Available: https://www.gutenberg.org/</ref>.
The database was used to provide base texts for Task 4: Statistical Frequency of Letters Reanalysis, because it gives a larger sample for estimating letter frequencies in each language than the ''Universal Declaration of Human Rights'' does as a base text.

===N-Gram Model Explanation===
An ''n-gram'' is a contiguous sequence of n items taken from a given sequence of phonemes, syllables, letters, words or base pairs. The n-grams are typically collected from articles or books. If the elements are words, n-grams may also be called shingles <ref>A. Z. Broder et al., “Syntactic clustering of the web,” Computer Networks and ISDN Systems, Vol. 29 (8), pp. 1157–1166.</ref>. An n-gram of size n=1 is called a "unigram", size n=2 a "bigram", size n=3 a "trigram", and so on. An ''n-gram model'' is a type of probabilistic language model for predicting the next item. In its simplest (bigram) form, item n is assumed to depend only on item (n-1), regardless of any other items in the sequence <ref>No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17</ref>. The group will use an n-gram database to find common groups of words for a variety of ''initialisms''. In Figure 7 below, we use one line from the Somerton Man code as a sample sequence to produce the ''n-gram model''.

[[File: Technical_Background_N-gram.png |thumb|600px|centre|'''Fig. 7:''' ''N-gram model'' applied to first line of Somerton Man code]]

===One-Time Pad Explanation===
The ''one-time pad'' is an encryption technique that cannot be cracked without the correct ''key''. For example, given a ''ciphertext'', we need the ''key'' to decode it and recover the correct message <ref>S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract</ref>.
[[File: Technical_Background_One-Time_Pad.png |thumb|1000px|centre|'''Fig. 8:''' ''One-Time Pad'' Example <ref>No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads</ref>]]

In the example shown in Figure 8 <ref>No Author. One-time pad [online]. Available: http://enc.slider.com/Enc/OneTimePads</ref>, the ''ciphertext'' is EQNVZ. E is the fifth letter of the alphabet, so we label it 4 (the first letter, A, is labelled 0). Similarly, Q is 16, N is 13, V is 21 and Z is 25. Given the ''key'' XMCKL, we label X as 23, M as 12, C as 2, K as 10 and L as 11 in the same way. We then subtract each key letter from the corresponding ciphertext letter, one position at a time. For the first letter, E minus X gives 4 - 23 = -19; since negative numbers are not used, we add 26 to get 7, which represents the letter H. The remaining positions give Q minus M = 4 (E), N minus C = 11 (L), V minus K = 11 (L) and Z minus L = 14 (O), so the original message is HELLO. The ''one-time pad'' technique will be used as another method to attempt to decipher the Somerton Man code.
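The subtraction steps above can be sketched in a few lines of Python. This is only an illustrative sketch of the worked example, not the project's decoding software: the function name <code>otp_decrypt</code> is hypothetical, and the code assumes both the ciphertext and the key contain only the letters A to Z.

```python
def otp_decrypt(ciphertext, key):
    """Subtract each key letter from the matching ciphertext letter, modulo 26.

    Letters are labelled A = 0 ... Z = 25, as in the worked example.
    Assumption: both inputs contain only the letters A to Z.
    """
    if len(key) < len(ciphertext):
        raise ValueError("key must be at least as long as the ciphertext")

    plain = []
    for c, k in zip(ciphertext.upper(), key.upper()):
        diff = (ord(c) - ord("A")) - (ord(k) - ord("A"))
        # Python's % always returns a non-negative result for a positive
        # modulus, which is equivalent to adding 26 to a negative difference.
        plain.append(chr(diff % 26 + ord("A")))
    return "".join(plain)


print(otp_decrypt("EQNVZ", "XMCKL"))  # prints HELLO
```

Using the modulo operator replaces the explicit "add 26 if negative" step: -19 % 26 evaluates to 7, the label for H.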