Difference between revisions of "Final Report/Thesis 2018"

(The University of Adelaide project groups)
(References)
 
(20 intermediate revisions by the same user not shown)
Line 47: Line 47:
 
With these conclusions, this project will look in further detail into what the mysterious code is [9] [10] [11] [12] [13] [14] [15].
 
  
==== Previous study on mass spectral analysis ====
+
==== Mass Spectrometer Previous Work ====
In addition to the police investigation, there have already been several academic studies. Prof. Derek Abbott has worked on this case for more than 2000 hours [23]. His team took hairs from the Somerton Man’s plaster cast and recorded element data using ICP-MS. Since 2009 the case has been a final-year project for University of Adelaide students, and several students have taken part in it. In 2013, the project group plotted the element content of the Somerton Man’s hair and compared it with control samples [12], analysing the differences between the two. However, all of that hair data was obtained from the glass test. In this project, we redo the analysis using data obtained from the quartz test and compare the two sets of figures. The new data should not be compared with the old data directly, because the drift values differ. The new data will therefore be multiplied by a constant obtained from the glass-test remainder, a small piece of the Somerton Man’s hair that was left unburned. The different years of the plaster casts also need to be considered.
+
Previous project groups have also carried out studies with a mass spectrometer. The 2013 project group obtained some of the Somerton Man’s hair, plotted the elements found in it and compared them with control samples, analysing the elements that differed between the two. This was done using a glass slide, which contains impurities [13]. The 2016 project group repeated the analysis using a quartz slide, which does not contain impurities. They concluded that the Somerton Man’s hair had abnormally high readings for some elements, one of which is strontium [15]. In this project, the strontium level will be examined in greater detail and used to indicate how long the Somerton Man was in Adelaide before his death.
  
==Previous Studies/Related Work==
+
===Aims and Objectives===
==== Australian Department of Defence ====
+
The first task is to determine whether the mysterious code represents a collective object (horse names, Adelaide street names, Australian beaches, etc.). This will be done using hypothesis testing. An extension of this task, involving the Rubaiyat of Omar Khayyam, will also be completed.
In response to a request from journalist Stuart Littlemore, the Australian Department of Defence worked on cracking the code left in the Tamám Shud case. Unfortunately, after some time the cryptographers declared the code impossible to crack. The code was said either to “have insufficient symbols” or to be a meaningless product of a “disturbed mind” [6].
+
The second task uses a mass spectrometer. Control hair samples will be compared with the Somerton Man’s hair, element by element, to estimate how long he was in Adelaide before his death.
==== Previous Groups in the University of Adelaide ====
+
The third task uses DNA data. The data will be degraded using software tools until it becomes unidentifiable. This can then be used to determine how much DNA is needed from the Somerton Man for further research to proceed.
As Professor Derek Abbott has been following this case for a long time, several project groups have worked on this unsolved case from 2009 to the present. The conclusions drawn by previous groups are: the code is unlikely to have been generated randomly; the code is unlikely to be initial letters from words; the Rubaiyat of Omar Khayyam is unlikely to have been used as a one-time pad for encryption; the original language of the code is likely to be English; the code is unlikely to be initialisms extracted from poems; the Rubaiyat of Omar Khayyam was not used as a straight-substitution one-time pad for encryption; and the code was not created using the Rubaiyat of Omar Khayyam as a one-time pad [7][8][9][10][11][12].
+
  
 
==Technical Background==
 
=== Hamming Distance and Levenshtein Distance ===
+
=== P-value Theorem ===
Both concepts come from information theory, and both describe the amount of difference between two strings. The Levenshtein Distance is applied to two strings of letters, while the Hamming Distance is used to compare two binary strings of the same length.
+
A p-value is the probability, calculated under the assumption that the null hypothesis is true, of observing an effect equal to or larger than the one actually observed. In statistics, the p-value helps determine the significance of a hypothesis test from the sampled results: it indicates how likely such results would be if they were due to chance alone rather than to the experimental conditions. It therefore measures the strength of the evidence against the null hypothesis [16].
The Hamming distance measures the minimum number of substitutions required to transform string A into string B. For example, the Hamming distance between “1010” and “0010” is 1, as it requires substituting the first bit '1' in the first string with '0'. Because the calculation here is based on pure binary strings, the Hamming distance can be expressed simply as the number of 1 bits in a XOR b.
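As a minimal sketch of the binary case described above, the Hamming distance can be computed by XOR-ing the two bit strings and counting the 1 bits in the result. The function name `hamming_distance` is chosen here for illustration and does not come from the project code.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    # XOR sets a 1 wherever the two inputs differ; counting the 1s
    # in the result gives the distance.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(hamming_distance("1010", "0010"))  # 1
```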
+
In this project, the main use of p-values is to determine whether the mysterious code represents local horse names. The null hypothesis is ‘The group of letters are horse names’ and the alternative hypothesis is ‘The group of letters are not horse names’.
The Levenshtein Distance, also known as edit distance, is a generalisation of the Hamming Distance. It counts not only substitutions but also insertions and deletions. The Levenshtein distance between two words is the smallest number of substitutions, insertions and deletions of symbols needed to transform one string into the other.
+
For the null hypothesis not to be rejected, the p-value must be larger than 0.05; this indicates that the observed data point lies in the ‘most likely observation’ range, as seen in Figure 4. If the p-value is lower than 0.05, the collected results are statistically significant and the observed data point lies in the ‘very unlikely observations’ range. In that case the null hypothesis can be rejected, which means that the mysterious code is not horse names [17].
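The decision rule above can be illustrated with a small sketch. The report does not say which test statistic was used, so this example assumes a chi-square statistic and estimates the p-value by Monte Carlo simulation; the function names, the number of trials and the fixed seed are all illustrative choices, not the project's method.

```python
import random


def chi_square(observed, expected_probs):
    """Chi-square statistic of observed counts against expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_probs))


def monte_carlo_p_value(observed, expected_probs, trials=2000, seed=0):
    """Estimate P(statistic >= observed statistic) under the null hypothesis.

    Assumes every expected proportion is greater than zero.
    """
    rng = random.Random(seed)
    n = sum(observed)
    categories = range(len(expected_probs))
    stat = chi_square(observed, expected_probs)
    hits = 0
    for _ in range(trials):
        # Draw a sample of the same size from the null distribution.
        draws = rng.choices(categories, weights=expected_probs, k=n)
        simulated = [draws.count(c) for c in categories]
        if chi_square(simulated, expected_probs) >= stat:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


p = monte_carlo_p_value([100, 0, 0, 0], [0.25, 0.25, 0.25, 0.25])
print("reject null hypothesis" if p < 0.05 else "do not reject null hypothesis")
```

A p-value below 0.05 leads to rejecting the null hypothesis, matching the threshold used in the text.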
The following example demonstrates the calculation of the Levenshtein distance, where s marks a substitution, d a deletion and i an insertion. String 1: INTENSION, string 2: EXECUSION.
+
  
            INTE#NSION
+
[[File:ZFigure4.jpg|thumb|500px|center|Figure4. Computation of a p-value]]
            ||||||||||
+
In Figure 4, the y-axis is the probability and the x-axis is the set of possible results.
            #EXECUSION
+
            ||||||||||
+
            dss-is----
+
        Table1. Levenshtein distance
+
According to Table 1, the minimum cost of turning string 1 into string 2 is 5: 3 substitutions, 1 deletion and 1 insertion. As the Levenshtein Distance considers three kinds of operations, its complexity is inherently higher than that of the Hamming Distance.
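The edit distance for this example can be checked with a short dynamic-programming sketch. This is the standard textbook recurrence with unit cost for each operation, written here for illustration; it is not taken from the project's Matlab code.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("INTENSION", "EXECUSION"))  # 5
```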
+
  
=== Simhash Algorithms ===
+
=== Mass Spectrometer ===
 +
Mass spectrometry is an analytical technique that can detect unknown compounds in a given sample. The mass spectrometer produces ions from the sample, separates them by their mass-to-charge ratio and records the quantity of each ion type [19].
 +
The components of a typical mass spectrometer are shown in Figure 5. The three major components are the ion source, the analyser and the ion detector system.
 +
[[File:ZFigure5.jpg|thumb|500px|center|Figure 5. The Main Components of a Mass Spectrometer [19]]]
 +
The ion source produces gaseous ions from the sample. The analyser then sorts the ions according to their mass-to-charge ratio. The ion detector system detects the sorted ions, records the quantity of each ion type and converts it into an electrical signal [20].
 +
The Inductively Coupled Plasma Mass Spectrometer (ICP-MS) is the type of mass spectrometer used in this project. The ICP-MS is faster, more precise and more sensitive at detecting different ions than other types of mass spectrometer [21].
 +
In this project, the sample used is the shaft of the hair.
  
==== Brief Introduction of Simhash ====
+
=== Single Nucleotide Polymorphism (SNP) ===
The Simhash algorithm was invented by Moses Charikar to estimate the similarity of large volumes of data [13]. Google later applied Simhash as its duplicate-removal algorithm for dealing with massive data. Google’s paper showed Charikar’s algorithm to be practically useful for identifying near-duplicates in web documents belonging to a multi-billion-page repository [14]. The idea behind Simhash is very compact; it is even simpler than the algorithm, described in the same paper, for finding all fingerprints with a Hamming Distance less than k. As the algorithm performs well in similarity checks, it is adapted here to calculate the similarity of 2-gram strings in the tests.
+
Single Nucleotide Polymorphisms (SNPs) are one of the most common and widely discussed types of genetic variation between humans [22]. Each variation occurs at a single nucleotide (a single building block of DNA), and SNPs occur roughly once in every 300 base pairs [23].
 +
In this project, SNPs will be removed from a DNA sample.
  
==== Why Simhash? ====
+
==Knowledge Gaps and Technical Challenges==
Traditional similarity-check algorithms use the Vector Space Model to separate documents into individual terms and map these terms to vectors in a multidimensional space, where each dimension corresponds to an individual term. The values of these vectors are calculated by a specific algorithm, mostly based on the occurrence frequency of each term, so that high-frequency terms are assigned relatively larger values. The algorithms may differ, but the Vector Space Model itself stays the same. After modelling, the text is transformed into a set of vectors.
+
The technical challenges that will be encountered in this project are related to all the knowledge gaps mentioned.
[[File:SFigure4.jpg|thumb|500px|center|Figure4. Space Vector Model]]
+
To complete each task within the project, further development of programming skills, such as Matlab, was required. P-value calculation and hypothesis testing needed some revision to ensure a satisfactory understanding of the concepts. The ability to use Microsoft Excel to perform statistical analysis on the p-value was also required, as was learning how to use a mass spectrometer correctly and interpret its results; this was addressed by building up knowledge in several ways before the trials.
Figure 4 illustrates an example of the Vector Space Model, where d1 and d2 represent two texts that have already been modelled. The similarity of the two documents can be found by calculating the cosine similarity of their corresponding vectors (the dot product of the two vectors divided by the product of their Euclidean lengths):
+
Similarity(d1, d2) = cos θ(d1, d2) = (d1 · d2) / (|d1| |d2|).
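A minimal sketch of this calculation is given below, assuming that each document is modelled as a vector of raw term frequencies with whitespace-separated words as terms. The weighting scheme and the function name `cosine_similarity` are illustrative assumptions; other Vector Space Model implementations weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(doc1: str, doc2: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    v1 = Counter(doc1.lower().split())
    v2 = Counter(doc2.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(v1[term] * v2[term] for term in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm1 * norm2)
```

Every distinct term adds one dimension to the vectors, which is why the cost grows quickly for long texts.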
+
+
Note that two dimensions are used here for illustration only. In practice the number of dimensions can be much higher, but the principle is the same.
+
The Vector Space Model has been shown to give accurate estimates of similarity, but this comes at the expense of exceptionally high complexity in both time and space. Modelling long texts as vectors takes up a large volume of storage; on top of that, each similarity value requires calculating the cosine of the angle between two vectors. It is therefore not a wise choice when dealing with data that contains a large number of terms. Using the Simhash algorithm can reduce the complexity significantly while preserving accurate estimates of similarity.
+
  
Before explaining Simhash, it is necessary to introduce the hash function. A hash function maps data of arbitrary size to hash values of a fixed size. After hash mapping, each term is allocated a hash value that serves as its fingerprint. A well-designed hash function should be collision-resistant, meaning that it is infeasible to find two inputs that generate the same hash value. The hash function should also be sensitive to trivial changes: even if the input changes by only one bit, the hash value will be totally different.
 
 
The Simhash algorithm is explained in detail here. The figure below demonstrates the general process of calculating the Simhash fingerprint of a given document.
 
1. First, the original document (big blue box) is separated into n individual terms (small blue bars). The separation rule can be customised for different cases. In the tests that follow, the document is separated into 2-gram letter groups.


2. For each term, calculate its corresponding weight (w1, w2, … wn). Normally the weight is determined by the frequency of occurrence of each term, but it can also be customised to meet specific requirements.


3. For each term and its corresponding weight, apply hash mapping to the term (the length n of the hash mapping can be adjusted by adjusting the segmentation method in step 1; Google used a 64-bit hash mapping in its web page duplicate-removal program), then multiply the n-bit hash mapping result by the corresponding weight, treating each '0' in the hash mapping result as '-1' and leaving each '1' unchanged.


4. Finally, add together all the results generated in the previous step. The result is a sequence of n numbers, and the final fingerprint is an n-bit binary string: for each number in the sequence, if it is positive, set the corresponding bit of the final result to '1'; otherwise set the bit to '0'.


5. The similarity of two strings is then obtained by calculating the Hamming Distance between their two Simhash strings.
 
[[File:SFigure5.jpg|thumb|500px|center|Figure5 Simhash Procedure]]
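The five steps can be sketched in a few lines of Python. Several details are assumptions made for illustration only, since the report does not specify them: the 2-grams are taken as overlapping letter pairs, the per-term hash is an MD5 digest truncated to `bits` bits, and the weight is the raw 2-gram frequency. The fingerprints it produces will therefore not match those from the project's own implementation.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> str:
    """Simhash fingerprint of a text as a string of '0' and '1' characters."""
    # Step 1: split the text into overlapping 2-grams (assumption).
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    # Step 2: weight each distinct term by its frequency of occurrence.
    weights = Counter(grams)
    totals = [0] * bits
    for gram, weight in weights.items():
        # Step 3: hash the term (MD5 truncated to `bits` bits, an assumption;
        # bits must not exceed 128) and add +weight for a 1 bit, -weight for a 0 bit.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") >> (128 - bits)
        for k in range(bits):
            bit = (h >> (bits - 1 - k)) & 1
            totals[k] += weight if bit else -weight
    # Step 4: positive sums become '1', everything else becomes '0'.
    return "".join("1" if t > 0 else "0" for t in totals)


def fingerprint_distance(fp1: str, fp2: str) -> int:
    """Step 5: Hamming distance between two equal-length fingerprints."""
    return sum(a != b for a, b in zip(fp1, fp2))
```

A small distance between two fingerprints suggests that the underlying texts share most of their weighted 2-grams.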
 
 
This is how the Simhash algorithm maps a given long text to an n-bit fingerprint. From the illustration above, the geometric meaning of the algorithm is quite explicit. First, it maps the long text into an n-dimensional space, where each individual term is transformed into a vector. Hash mapping ensures that the terms are transformed into a set of vectors that can be regarded as nearly uniformly distributed. Multiplying by the weights and summing gives a result that can be called a “sum-vector”. The “sum-vector” is then compressed by mapping positive values to 1 and all others to 0; this operation preserves the quadrant information of the “sum-vector”. Assuming n = 64, a 64-bit fingerprint can express as many as 2<sup>64</sup> quadrants, which seems to be enough to represent a specific document.
 
Theoretically the algorithm is reasonable, but the reason why the n-bit fingerprint is able to reveal similarities between documents is still unknown, nor did Charikar (the inventor of the Simhash algorithm) give any justification. Nevertheless, the tests that have been done illustrate that the Simhash algorithm does a good job of estimating similarity. A simple example of similarity estimation based on the Simhash algorithm and the Hamming Distance is presented here.
 
 
String A (initialism of the lyrics of “Bohemian Rhapsody” by Queen), String B (copied from String A with 3 letters modified) and String C (an arbitrary string) are strings being tested in the Simhash algorithm.
 
 
String A:  ITTRLITJFCIALNEFROYELUTTSASIJAPBINNSBIECEGALHLLATWBDRMTMTM
 
String B:  ITTRQITJFCIALNEFRQYELUTTSASIJAPBINNSBIECEZALHLLATWBDRMTMTM
 
String C:         KAYCDRKBPQOGVTAACDUQKXJNZNZMXCBNUKPHVODWUUSQGJZFFYUKHBDMFY
 
 
After the 2-gram separation, the Simhash of each string is presented below (red characters indicate the differences):
 
SimHash of A: 100111001000100001111000011010110111100011011110001001 
 
SimHash of B: 100111001100100001111000011110110111100011011110001001
 
 
SimHash of A: 100111001000100001111000011010110111100011011110001001
 
SimHash of C: 100110101011010101101100011001001110000111111111001010
 
 
By inspection, the Hamming Distance between the Simhash strings of A and B is 3, which is low enough to indicate that the two strings are extremely similar. They are indeed extremely similar (only 3 letters out of 59 are different). The Hamming Distance between A and C, by contrast, is 21, indicating that strings A and C are unlikely to have any similarity.
 
 
===Data analysis===
 
Hair data is recorded by an Inductively Coupled Plasma Mass Spectrometer (ICP-MS) and presented in the form of Excel tables. Matlab, which has graphing capabilities suited to engineering plots [3], is used to plot figures that show the element comparison results clearly. In this project, the large volume of hair element data is plotted with the Matlab command ‘scatter’. The command ‘hold on’ is then used to put the Somerton Man’s and the control samples’ hair data on the same figure to make the comparison clear.
 
 
==Task 1: Code Analysis==
 
There are two main types of test in this project: the Direct Comparison Test and the 2-grams Comparison Test. Nine different languages are selected and tested: English, Italian, French, German, Portuguese, Latin, Spanish, Turkish and Polish.
+
=== Aim ===
As this section is long, Figure 6 presents a simple chart for navigation.
+
The aim of this task is to determine whether the mysterious code represents some collective object. The collective objects used are horse names, Australian beaches and cities, South Australian street names, and the book The Rubaiyat of Omar Khayyam. It is assumed that each letter in the mysterious code is the initial of a word. The Somerton Man had many associations with horses, so further research goes into the assumption that the mysterious code represents horse names. The null hypothesis is ‘The group of letters are horse names’ and the alternative hypothesis is ‘The group of letters are not horse names’.
[[File:SFigure6.jpg|thumb|500px|center|Test Layout]]
+
  
=== Levenshtein Distance Test ===
+
=== Preparation ===
The comparison method in this section is to calculate the Levenshtein Distance between two initialism strings directly. The materials used for comparison, the string length and the number of comparisons may vary with the purpose of each test.
+
Before starting this task, the details of the whole case were reviewed, including who found the body, where it was located and when it was discovered. This gave a better understanding of which types of collective object, listed above, should be researched.
 +
As mentioned before, horse names are a major factor because the body was found near Morphettville Racecourse, as seen in Figure 6.
 +
[[File:ZFigure6.jpg|thumb|500px|center|Figure 6. Location of Somerton Beach and Morphettville Racecourse]]
 +
It should also be noted that the people who found the body were racehorse jockeys. Hence we assumed that the mysterious code consists of the initials of horse names from the year 1948.
  
==== Data Processing ====
+
=== Method ===
All the materials for this test were converted into initialism format, with all punctuation and spaces removed, for example:
+
Whether the mysterious code represents a collective object is determined by calculating the p-value and carrying out hypothesis testing. In the case of the horse names, no website directly provided horse names from 1948, so the names were gathered from relevant newspapers and articles. This was done using ‘Trove’, a search engine for Australian resources, which in this case was used to obtain articles and newspapers from 1948. The other collective objects mentioned above were found using South Australian government websites. Extensive cross-checking was needed to make sure that the lists to be used were correct.
Original Text: “Snow is falling, all around me, children playing, having fun…”
+
Matlab was used as the software tool. Only the initial letter of each word in a collective object was needed, and code was written to extract it. Where a collective object had multiple words, each of its words was included, and any extra punctuation was removed. Figure 7 shows the input and output for the horse names case, together with the frequency of each letter.
Initialisms Format:    “SIFAAMCPHF…”
+
[[File:ZFigure7.jpg|thumb|500px|center|Figure 7. Matlab Results]]
The initialisms were then divided into segments of a given length. Each segment is referred to as a ‘unit’ in the following sections.
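The data processing described above can be sketched as follows. The function names are illustrative, and dropping an incomplete trailing unit is an assumption, as the report does not say how leftover letters were handled.

```python
import re


def initialism(text: str) -> str:
    """Upper-case initial letter of every word, ignoring punctuation and spaces."""
    # Remove apostrophes so that a contraction counts as a single word,
    # then pick out runs of letters (including accented letters).
    words = re.findall(r"[^\W\d_]+", text.replace("'", ""))
    return "".join(word[0].upper() for word in words)


def split_into_units(letters: str, unit_length: int) -> list[str]:
    """Cut an initialism string into non-overlapping units of equal length."""
    # Assumption: letters left over after the last full unit are discarded.
    usable = len(letters) - len(letters) % unit_length
    return [letters[i:i + unit_length] for i in range(0, usable, unit_length)]


lyrics = "Snow is falling, all around me, children playing, having fun"
print(initialism(lyrics))  # SIFAAMCPHF
```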
+
Excel was used to produce the statistical results from the Matlab output. The letter frequency was obtained by dividing the count of each letter by the total number of letters, after which a p-value test was performed and a comparison graph was produced.
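The letter-frequency step can be illustrated with a short sketch. The project performed this step with Matlab and Excel; the Python function below and the three example names are hypothetical placeholders, not data from the 1948 lists.

```python
import string
from collections import Counter


def initial_letter_frequencies(names):
    """Relative frequency of each letter A-Z among the initials of all words."""
    counts = Counter()
    for name in names:
        for word in name.split():
            word = word.strip(string.punctuation)  # drop surrounding punctuation
            if word and word[0].upper() in string.ascii_uppercase:
                counts[word[0].upper()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Frequency = count of the letter divided by the total number of initials.
    return {letter: counts[letter] / total for letter in string.ascii_uppercase}


# Hypothetical names, used only to show the calculation.
frequencies = initial_letter_frequencies(["Sea Breeze", "Red Gum", "Bold Lad"])
print(round(frequencies["B"], 3))  # 0.333
```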
  
==== Preparation Test ====
+
=== Results ===
It is always necessary to check that a test method is effective before starting any tests. The purpose of this section is therefore to show that the Levenshtein distance algorithm is able to reveal the differences between languages.
+
Each collective object was compared with the mysterious code by the frequency of each letter. In each graph, the x-axis represents the alphabet and the y-axis represents the letter frequency for the two objects being compared.
Units of a fixed length extracted from the English version and eight other language versions of War and Peace were compared. To make the test comprehensive, there are six groups of tests with unit lengths varying from 50 letters/unit to 800 letters/unit, and each length group contains 100 comparisons (10 units of one language compared one by one with 10 units of another language using a double ‘for’ loop). Each unit was selected from the texts without any overlap. The results of these tests are normalised to a unit length of 50.
+
The p-value test was also carried out to verify the results; a p-value of less than 0.05 shows that the mysterious code is very unlikely to represent the collective object.
  
The results of the tests are presented as box plots in the figure below (language names are abbreviated: En-English, It-Italian, Fr-French, Ge-German, Pq-Portuguese, La-Latin, Sp-Spanish, Tr-Turkish and Po-Polish). Each box is the set of Levenshtein Distances output for two texts. For example, the leftmost box represents the results of 10 English units versus another 10 English units; the same rule applies to the others.
+
====Horse Names====
 +
The comparison of horse names to the mysterious code is seen in Figure 8.
 +
[[File:ZFigure8.jpg|thumb|500px|center|Figure 8. Comparison of Mysterious Code with Horse Names]]
 +
There was a sample of 69 horse names. The graph shows that, for many letters, the letter frequencies of the horse names do not correlate with those of the mysterious code. This was supported by the p-value, which was lower than 0.05, so the null hypothesis is rejected.
  
[[File:Group1.jpg|thumb|600px|center|Levenshtein Test Result]]
+
====Australian Beaches====
 +
The comparison of Australian beaches to the mysterious code is seen in Figure 9.
 +
[[File:ZFigure9.jpg|thumb|500px|center|Figure 9. Comparison of Mysterious Code with Australian Beach Names]]
 +
There was a sample of 114 beach names. The graph shows that the letter frequencies do correlate with those of the mysterious code. As the results appeared genuine, a hypothesis test was carried out on these values. It gave a p-value greater than 0.05, which indicates that the mysterious code could be Australian beach names.
  
By analyzing the five box plots above in the Figure Group 1, the following conclusions can be inferred:
+
====South Australia Street Names====
1. In all five figures, the leftmost boxes are significantly lower than the others, which means that the Levenshtein distances between English strings are much lower than those between English strings and strings of other languages. Hence, the Levenshtein Distance is able to reveal the difference between languages.
+
The comparison of South Australian street names to the mysterious code is seen in Figure 10.
 +
[[File:ZFigure10.jpg|thumb|500px|center|Figure 10. Comparison of Mysterious Code with South Australian Street Names]]
 +
There was a sample of 447 South Australian street names. The graph shows that the letter frequencies are not similar to those of the mysterious code. This was supported by the p-value, which was lower than 0.05.
  
2. As the unit length grows, the boxes become lower and more compact (the median value and standard deviation shrink as the unit length goes up). Some of them even converge to a specific value. In addition, the difference between the last two plots (unit length = 400 and unit length = 800) is trivial. This means that the Levenshtein distances tend to stabilise as the length of the unit increases.
+
====Australian Cities====
 +
The comparison of Australian city names to the mysterious code is seen in Figure 11.
 +
[[File:ZFigure11.jpg|thumb|500px|center|Figure 11. Comparison of Mysterious Code with Australian City Names]]
 +
There was a sample of 90 Australian city names. The graph shows that the frequencies of some letters are similar to those of the mysterious code. A hypothesis test was then carried out to check the results, and the p-value obtained was less than 0.05.
  
3. As the unit size increases, the trends of the nine language groups are highly similar: the median values of the Levenshtein Distances decrease. This implies that the performance of the Levenshtein Distance algorithm is independent of the language.
+
====The Rubaiyat of Omar Khayyam book====
 +
The comparison of the Rubaiyat of Omar Khayyam book to the mysterious code is seen in Figure 12.
 +
[[File:ZFigure12.jpg|thumb|500px|center|Figure 12. Comparison of Mysterious Code with The Rubaiyat of Omar Khayyam book]]
 +
There was a sample of 852 words from the book. The graph shows that the letter frequencies are not similar to those of the mysterious code. This was supported by the p-value, which was lower than 0.05.
 +
An extension of this task was also carried out, analysing The Rubaiyat of Omar Khayyam more carefully. Previous project groups stated that the mysterious code does not correlate with the book. Each paragraph in the book has four lines (see Figure 13), and the mysterious code also has four lines. Still assuming that each letter in the mysterious code is the initial of a word, the two can be compared.
 +
[[File:ZFigure13.jpg|thumb|500px|center|Figure 13. A page from The Rubaiyat of Omar Khayyam]]
 +
The task was to count how many words are in each line of the book and compare the counts with the mysterious code. In the first paragraph in Figure 13 (outlined with a red square), the first line has 9 words, the second line has 7 words and the last two lines have 8 words each. Lines 1 to 4 of the mysterious code have 9, 11, 11 and 13 letters respectively. The first paragraph alone already suggests that there may be no correlation between the mysterious code and the book.
 +
Counting every line by hand would be very time-consuming, so a text file of The Rubaiyat of Omar Khayyam was used in conjunction with Matlab to count the words in each line. Then, using Excel, a graph was plotted with error bars and compared with the number of letters in each line of the mysterious code. This can be seen in Figure 14.
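The word-counting step could be sketched as below. The project used Matlab and Excel; this Python version assumes that paragraphs in the text file are separated by blank lines and that the error bars are one sample standard deviation either side of the mean, neither of which is stated in the report.

```python
import re
from statistics import mean, stdev


def words_per_line_stats(text: str, lines_per_paragraph: int = 4):
    """Mean and standard deviation of the word count at each line position."""
    # Assumption: paragraphs are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    counts = []
    for block in blocks:
        lines = [line for line in block.splitlines() if line.strip()]
        if len(lines) == lines_per_paragraph:  # skip titles and partial paragraphs
            counts.append([len(line.split()) for line in lines])
    # Transpose so that each column holds the counts for one line position.
    columns = list(zip(*counts))
    return [(mean(col), stdev(col) if len(col) > 1 else 0.0) for col in columns]


def within_error_bars(code_counts, stats):
    """True for each line where the code's letter count lies within mean ± std."""
    return [m - s <= c <= m + s for c, (m, s) in zip(code_counts, stats)]
```

Calling `within_error_bars([9, 11, 11, 13], stats)` with the letter counts of the four code lines then shows, line by line, whether the code falls inside the error bars.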
 +
[[File:ZFigure14.jpg|thumb|500px|center|Figure 14. Error Bars against the book]]
 +
The x-axis represents the line number within the paragraph and the y-axis represents the number of words in that line. On line 1, the mysterious code falls within the error bars; on the remaining lines it falls outside them. This indicates that the mysterious code is not from The Rubaiyat of Omar Khayyam and further supports the previous groups’ findings that the book is not the source of the mysterious code.
  
==== Main Test ====
+
====P-value====
 +
A summary of the p-values is shown in Table 1.
 +
[[File:ZFigure15.jpg|thumb|500px|center|Table 1. P-value Results]]
 +
It can be observed that the only collective object with a p-value above 0.05 is Australian beaches, which indicates that the mysterious code could be Australian beaches.
  
The previous section showed that the Levenshtein Distance can reveal the differences between languages. Based on this, the test in this section aims to find the difference between the mysterious code and texts in various languages.
+
===Conclusion===
In the experimental group (line plots in the figures), two versions of the mysterious code were compared with 100 units of different languages extracted from War and Peace. As the code is only 50 letters long, there is no need to carry out the test with different unit lengths.
+
Overall, the results show that the mysterious code is unlikely to represent horse names, as hypothesised. It is also unlikely to be South Australian street names, Australian city names or The Rubaiyat of Omar Khayyam, even after extensive analysis of the book. Somewhat surprisingly, it is possible that the mysterious code represents Australian beach names, as the p-value was above 0.05. Further research will be required to obtain more meaningful results.
In the comparison group (box plots in figures), units extracted from the UDHR were compared with the same materials used in the experimental group. Like the preparation test, this test was divided into 6 different length groups. 100 comparisons were made, and the 10 vs 10 double for loop structure was preserved here.
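For reference, the Levenshtein Distance and the 10 vs 10 double loop can be sketched as below. This is a generic textbook implementation offered as an illustration under the assumption that every unit in one set is compared with every unit in the other; the function names are hypothetical and it is not the code used in the project.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def pairwise_distances(units_a, units_b):
    """Double loop: every unit in units_a against every unit in units_b."""
    return [levenshtein(u, v) for u in units_a for v in units_b]

print(levenshtein("kitten", "sitting"))
```

With 10 units in each set, `pairwise_distances` returns the 100 comparisons mentioned above.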
+
Latin was not tested because a Latin version of War and Peace could not be found, and the lack of material for the control group test would reduce confidence in the result.
+
Results are presented in the following figures. The horizontal line indicates the mean (average value) of the experimental group’s result and the boxes indicate results of the comparison group. For example, in the first plot: the six boxes represent the test result of UDHR units in English versus War and Peace units in English in six different length groups. Two lines indicate the test result of two versions of code versus units of War and Peace in English. The “Codes vs English” test results (red and green lines) appear in every plot as a reference.
+
[[File:Group2.jpg|thumb|600px|center|Levenshtein Main Test Result]]
+
The following observations can be made from the figure above:
+
  
1. Observing within each plot: both the median value and the standard deviation fall as the unit length increases. This trend fits reasonably well with the trend of the preparation test (the one mentioned in the 2nd conclusion of section 4.1.2).
+
==Task 2 Hair Analysis==
2.  Observing from the second plot to the last plot: there are distinct gaps between the Levenshtein Distances of codes versus English text (red and green lines), and the Levenshtein Distances of codes versus other languages (blue and purple lines).
+
===Background Theory===
3. Among all the languages tested, except English and Turkish, the results of the experimental group are consistently higher than the results of the comparison group. In addition, the experimental group’s result in Turkish was significantly higher than the one in English.
+
Analysing the elements in hair can reveal a great deal of useful information about a person’s recent life, including lifestyle imbalances, living environment and dietary problems. Mineral levels in hair are also about ten times more robust than those in blood [26]. These results could reveal the last place the Somerton Man had been or even the last activity he had done, which could provide useful evidence for solving this case.
4. Observing the English test plot and the Turkish test plot: both groups of lines are located around the median value of the test result in the 50-letter length group.
+
Knowing how fast hair grows is very important for this project. Hair grows by about 1cm per month [27]. This means the newest hair is at the root.
Conclusion:
+
  
From the observations above, the following inferences can be drawn:
+
===Aim===
1. The similarity of the two trends reveals that the Levenshtein Distance algorithm is not sensitive to different testing materials. On the other hand, the relatively higher standard deviation in the 50-letter group implies higher randomness in small length tests. Notice that the results of the experimental group all belong to the 50-letter group. This reduces the credibility of the whole test.
+
The aim of this task is to identify the different isotopes present in the hair of several different people. The element of particular concern is strontium. Adelaide has high levels of strontium in the soil compared to the rest of Australia. With this knowledge, the task is to test hair samples from people who have left Adelaide within the past month and compare them with hair samples from people who have not left Adelaide for at least a year, to see how the strontium values change. The results will then be compared with the strontium levels in the Somerton Man’s hair, which can help determine how long he was in Adelaide before his death. ICP-MS is the approach used to determine the different isotopes within the hair, and it returns a spectral analysis of the hair. The spectral analysis will be completed by laser ablation, where the hair is ablated with a laser and the spectral elements are recorded.
2. Within the scope of the experimental group, the mysterious code has relatively smaller Levenshtein Distance with English text compared to those with other languages.
+
3. The differences between results of experimental and comparison groups in six languages (German, Italian, French, Portuguese, Spanish and Polish) imply that the code is unlikely to be one of these languages.
+
4. In English and Turkish test results, the overlapped results from experimental and comparison groups imply the consistencies of the Levenshtein Distances between ‘Code vs War and Peace’ and ‘UDHR vs War and Peace’.
+
  
==== Conclusion of the Levenshtein Distance Test ====
+
===Preparation===
According to the previous analysis, the code is more likely to be initialisms of English or Turkish text than of text in the other languages. In addition, considering the higher Levenshtein Distance in the “code vs Turkish” test, initialisms of English are more likely than initialisms of Turkish.
+
====Hair Elements====
On the other hand, the credibility of the test was challenged by the existence of randomness in small length group test. Hence the conclusion that the code is more likely to be English should be considered cautiously.  
+
Before the ICP-MS could be used to analyse the hair samples, the isotopes for the ICP-MS to detect had to be chosen. Research was completed to find the most common chemical elements in human hair, which included carbon, hydrogen, oxygen, sulphur, phosphorus and zinc. Further research then identified which additional elements the ICP-MS should detect, which included toxic elements and elements commonly found in food. In total, 24 isotopes were recorded by the ICP-MS, and these are shown in Table 2.
 +
[[File:ZFigure16.jpg|thumb|500px|center|Table 2. Isotopes chosen for the ICP-MS]]
  
=== 2-grams Simhash Test ===
+
====Hair Samples====
Although the previous test has produced some useful conclusions, its lack of credibility is still unacceptable. As a result, a 2-grams test with another algorithm (the SimHash algorithm) has been designed and implemented.
+
In this project, hair samples were obtained from five different individuals for more reliable and accurate results. Two of these samples were from people who left Adelaide for a period of time and the other three were from people who stayed in Adelaide. All the samples are anonymous for privacy reasons and are identified as A, B, C, D and E. A summary of each hair sample is shown in Table 3, which includes the sample identifier, sex, the date the hair was obtained and a brief description of the sample.
The test is expected to support the conclusion drawn in the previous Levenshtein Distance Test. At the same time, it is expected to have better credibility.
+
[[File:ZFigure17.jpg|thumb|500px|center|Table 3. Summary of the samples]]
 +
The length of time spent outside Adelaide is important because it determines how much of the hair grew in Adelaide and how much grew elsewhere. Sample A left Adelaide for 7 days. Assuming 30 days in a month and a growth rate of 1cm per month, about 2.33mm of the hair grew in Japan. Similarly, sample E left Adelaide for 13 days, so about 4.33mm of the hair grew in Bali. The 2 days between returning to Adelaide and the hair being collected must also be taken into account, which corresponds to 0.67mm. An example of the length of the hairs is shown in Figure 15.
 +
[[File:ZFigure18.jpg|thumb|500px|center|Figure 15. Example of the hair length]]
 +
The red displays the 2 days before obtaining the hair, when both samples came back to Adelaide. The orange shows the length of sample A and the green shows the length of sample E, when both samples were away from Adelaide.
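The hair length arithmetic above can be reproduced with a short Python sketch. It uses only the assumptions stated in the text (1cm of growth per month and 30 days per month); the function and constant names are hypothetical.

```python
GROWTH_MM_PER_MONTH = 10.0  # 1cm per month, as assumed in the text
DAYS_PER_MONTH = 30         # simplifying assumption from the text

def hair_length_mm(days):
    """Length of hair grown over the given number of days."""
    return days * GROWTH_MM_PER_MONTH / DAYS_PER_MONTH

def away_segment_mm(days_since_return, days_away):
    """Start and end of the 'away' segment, measured from the root."""
    start = hair_length_mm(days_since_return)
    end = hair_length_mm(days_since_return + days_away)
    return round(start, 2), round(end, 2)

print(round(hair_length_mm(7), 2))   # sample A, time in Japan
print(round(hair_length_mm(13), 2))  # sample E, time in Bali
print(away_segment_mm(2, 7))         # where sample A's Japan segment sits on the hair
```

Measuring from the root means the most recent growth (the 2 days back in Adelaide) comes first, followed by the segment grown while away.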
  
==== Data Processing ====
+
====Quartz Slide====
Another difference was that texts for this test were rearranged into the 2-grams format based on the initialisms format in the previous test. For example:
+
The hair samples will be placed on a slide made of pure quartz. A pure quartz slide is used rather than an ordinary glass slide because glass slides contain many impurities that would contaminate the result, whereas a pure quartz slide does not. The elements in a glass slide and a quartz slide are shown in Table 4.
Original String: “The visions dancing in my mind the early dawn the shades of time …”
+
[[File:ZFigure19.jpg|thumb|500px|center|Table 4. Comparison of elements in glass and quartz slide]]
Initialisms format:      “TVDIMMTEDTSOT ...”
+
The hair samples were stuck to the quartz slide with double-sided sticky tape. The tape introduces some contamination, which has to be dealt with appropriately.
2-grams format:         “TV VD DI IM MM MT TE ED DT TS SO OT T…”
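The conversion shown in the example above can be sketched in Python. This is an illustration of the format only, with hypothetical function names; it is assumed that tokens not starting with a letter are skipped.

```python
def initialism(text):
    """Upper-case first letter of every word; tokens not starting with a letter are skipped."""
    return "".join(word[0].upper() for word in text.split() if word[0].isalpha())

def two_grams(letters):
    """Overlapping pairs of adjacent letters, separated by spaces."""
    return " ".join(letters[i:i + 2] for i in range(len(letters) - 1))

text = "The visions dancing in my mind the early dawn the shades of time"
letters = initialism(text)
print(letters)
print(two_grams(letters))
```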
+
  
==== Preparation Test ====
+
===Mass Spectrometer Experiment===
Just like before, the preparation test was designed to check the performance of algorithm. Unlike the Levenshtein Algorithm, the grouping of test based on unit length is meaningless. This is because of the nature of the Simhash algorithm. For details please refer to section 3.2.2.
+
====Capturing the data====
The preparation test was based on the UDHR. The first test compared 50 units extracted from the English version of the UDHR with the whole text of the UDHR (also 2-grams formatted) in both English and other languages, by first turning each unit into its corresponding Simhash string and then calculating the Hamming Distance between the two Simhash strings. In the second test, the same method and layout were used to compare 50 units with the whole text of the UDHR in the same language. In short, test 1 is a cross-language test while test 2 is a same-language test. Results are presented in the two box plots below (the x-axis is not in the same order in both plots, so please make comparisons according to the column names under each box):
+
The quartz slide was placed into two stabilisers, which were then installed into the machine that performs the laser ablation on the hair samples. Figure 16 displays the quartz slide with the hair samples installed in the laser ablation system, while Figure 17 displays the enlarged view of the quartz slide on the monitor.
[[File:Group3.jpg|thumb|600px|center|Simhash Preparation Test Result]]
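A minimal Python sketch of the Simhash and Hamming Distance steps is given below. The report does not state which hash function or fingerprint length was used, so the MD5-based token hash and the 64-bit fingerprint here are assumptions made purely for illustration, and the function names are hypothetical.

```python
import hashlib

def simhash(tokens, bits=64):
    """Simhash fingerprint of a token list (assumption: MD5 token hash, 64 bits)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")

unit = "TV VD DI IM MM MT".split()
text = "TV VD DI IM MM MT TE ED DT TS SO OT".split()
print(hamming_distance(simhash(unit), simhash(text)))
```

Because the fingerprint has a fixed length regardless of input size, a short unit can be compared directly with a whole text, which is consistent with the remark that grouping by unit length is unnecessary for this algorithm.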
+
[[File:ZFigure20.jpg|thumb|500px|center|Figure 16. Hair samples installed into the laser ablation system]]
 +
[[File:ZFigure21.jpg|thumb|500px|center|Figure 17. Laser ablation system with monitor]]
 +
Next to the laser ablation system, there are two monitors which are used for the laser ablation task. The left monitor records the data of the laser ablation and the right monitor controls the laser ablation system, as can be seen in Figure 18. The next step involved choosing how many spots of the hair to ablate and the distance between these spots. A spot is where the laser ablates the hair and the isotopes are recorded for that spot. It was decided to ablate roughly half of the hair length (1cm), which gave 11 spots about 800 micrometres apart. Two of these spots were located in the root. The mass spectrometer then records the chosen isotopes. The complete set-up of the experiment is shown in Figure 19, with the laser ablation system on the left and the mass spectrometer on the right.
 +
[[File:ZFigure22.jpg|thumb|500px|center|Figure 18. Recording the data (left) and controlling the laser ablation machine (right)]]
 +
[[File:ZFigure23.jpg|thumb|500px|center|Figure 19. Agilent 7900x mass spectrometer with attached New Wave NWR213 laser ablation system]]
  
Comparing inside the cross-language group (Figure 20): the data set of “English versus English” (the more transparent box located at the bottom-right of the first figure) is considerably lower than other cross-language data sets (colored boxes in the first figure).
+
====Processing the Data====
 +
The data from the ICP-MS was then loaded into a software package called Iolite, which was used to manage the data and remove any anomalies [29]. The overall waveform of the isotopes in the hair sample for several spots can be seen in Figure 20. An enlarged view of one of the spots is shown in Figure 21.
 +
[[File:ZFigure23.jpg|thumb|500px|center|Figure 20. Waveform of the isotopes in several spots]]
 +
[[File:ZFigure24.jpg|thumb|500px|center|Figure 21. Enhanced and zoomed into a spot]]
 +
The composition of the waveform is briefly explained here. Each waveform is an individual isotope, and the top waveform is the average of all the other waveforms. The first 20 seconds of the waveform show a relatively flat line, which is the laser ablation system calibrating itself. This information is not useful. The waveform then peaks (marked with a square box), which is when the laser first hits the surface of the hair. After that the waveform is flat, which is when the laser is ablating the inside of the hair. This is the important information. In some cases the waveform peaks again at the end, which is the laser ablating through the hair and onto the sticky tape and quartz slide.
 +
To compare the results more precisely, two sets of data were captured: the surface of the hair and the inside of the hair. Capturing only these regions meant there was no contamination from the sticky tape or the calibration set-up. Lastly, the results were output into one comprehensive Excel file with all the necessary data. A portion of the data can be seen in Figure 22.
 +
[[File:ZFigure25.jpg|thumb|500px|center|Figure 22. Portion of the data]]
  
Comparing across the two figures (Figure 20 and 21): the cross-language group has generally higher results than the same-language group (the colour schemes in the two figures are not the same, so please refer to the column names when making comparisons). Median values of boxes in the same-language group are all significantly lower than those in the cross-language group. All third quartiles (Q3) in the same-language group are lower than the first quartiles (Q1) of the corresponding language group in the cross-language group.
+
===Data Analysis===
 +
Analysis of the data showed that several isotopes were present in the hair. These isotopes are shown in Table 5.
 +
[[File:ZFigure26.jpg|thumb|500px|center|Table 5. Isotopes found in the hair]]
  
In addition, data sets in figures above are less biased compared to those in Levenshtein Test. Distributions of data in each boxplots are quite compact. The Simhash Algorithm is not sensitive to different languages. These facts give extra credibility to the Simhash algorithm.  
+
====Strontium====
Based on the aforementioned observations, it is reasonable to draw out the conclusion that the Simhash Algorithm has an excellent ability of distinguishing different kinds of languages.
+
The comparison of the strontium in the different samples can be seen in Figure 23.
 +
[[File:ZFigure27.jpg|thumb|500px|center|Figure 23. Strontium in different samples]]
 +
The x-axis represents the distance along the hair that was ablated and the y-axis is the amount of strontium in the hair at that distance. Samples A and E are the samples from people who left Adelaide and came back.
 +
For sample E, about the first 4000um of hair corresponds to their time in Bali, except for the very first point, which corresponds to their return to Adelaide. In the figure, sample E has two high peaks in strontium level, one at about 1500um and the other at about 3200um, which indicates that strontium levels are high in Bali. The remaining strontium values, from when sample E was in Adelaide, are low.
 +
For sample A, about the first 2000um corresponds to their time in Japan, except for the very first point, which corresponds to their return to Adelaide. In the figure, there is a peak at about 1500um, indicating higher strontium values than when sample A was in Adelaide.
 +
Overall, these results indicate that strontium levels in Adelaide are lower than those in Bali and Japan.
  
==== Main Test ====
+
===Conclusion===
Here, the two versions of the mysterious code are tested against the UDHR in different languages. For each version of code, it was compared to different language versions of the UDHR. Results are presented in the first bar chart.  Pink represents the first version of the code and blue represents the other one. (Each bar actually consists of two sub-bars, length of sub-bar represents each result, and there is no overlapping between sub-bars).
+
The strontium level is lower in Adelaide than in Bali and Japan. Previous groups performed ICP-MS on the Somerton Man’s hair and obtained a high strontium reading. This suggests that the Somerton Man had been in another place with a high strontium level before he came to Adelaide.
As there is little difference between pink and blue bars, the average value in each language group is taken. For comparison purpose, these average values are combined with the same-language test result, and presented in the second figure below (bars in the left figure are turned to horizontal lines in the right figure, which represent the mean value of the Hamming Distance set between code and a specific language):
+
[[File:Group4.jpg|thumb|600px|center|Simhash Test Result]]
+
Considering that the comparison between two versions of code and the UDHR texts may not be sufficient to draw out any convincing conclusion, the two versions of code are compared again with 12000 letters long War and Peace text in each language using the same Simhash method. As the two versions of code have extremely similar test results against a specific language, the two versions’ results are added together in each language group. Result of the test is shown below:
+
[[File:Last.jpg|thumb|600px|center|Simhash Test Result]]
+
Figure 22 clearly showed that the Hamming Distance between English and the code is the minimum; figure 24 also presents the same characteristic.
+
In Figure 23, two horizontal lines lie inside their corresponding boxes: English and Portuguese. Compared with Portuguese, however, the red line that represents the result of code vs English is relatively closer to the median value of the English vs English test result set.
+
  
==== Conclusion of the Simhash Test ====
+
==Task 3 DNA Analysis==
Based on these facts, the conclusion is that the mysterious code is more likely to be English than any of the other eight languages tested.
+
===Background Theory===
 +
Deoxyribonucleic acid (DNA) is the hereditary material in humans and all other organisms. It is mostly located in the cell nucleus, but a small portion can be found in the mitochondria [30]. DNA is made up of four chemical bases, adenine (A), guanine (G), cytosine (C) and thymine (T), and the information in DNA is stored as a code made of these bases [30]. More than 99% of these bases are the same between individuals, and the variation in the remaining DNA determines the different personal characteristics an individual has [30]. One form of this DNA variation is called a single-nucleotide polymorphism (SNP), which is an important aspect of this task.
  
=== Summary of Tests ===
+
===Aim===
In Section 4, there are generally two groups of tests presented: the Levenshtein Test and the 2-grams Simhash Test.
+
The aim of this task is to discover how much DNA can be degraded before it becomes unidentifiable. This information will also be used to indicate how much DNA is needed from the Somerton Man. DNA degradation will be simulated by removing SNPs from a DNA sample.
The Levenshtein Test suggested that the mysterious code is more likely to be English initialisms, but its credibility was challenged by the randomness that appeared in the 50-length group test. On the other hand, the suggestion inferred from the Levenshtein Test is strongly supported by the result of the 2-grams Simhash Test.
+
The secondary suspected languages were Turkish in the Levenshtein Test and Portuguese in the 2-grams Simhash Test. Comparing the results of the two tests removes this suspicion: the Turkish result in the 2-grams Simhash Test was not as conspicuously high as the one in the Levenshtein Test, and the Portuguese result in the Levenshtein Test was not abnormal.
+
There is now enough evidence to support the final conclusion that the code consists of initialisms of English.
+
  
==Task 2 Mass Spectrometer isotope Data Analysis==
+
===Preparation===
===Relative research===
+
====Obtaining DNA data====
1.Hair elements
+
A DNA sample kit was ordered from an ancestry service called 23andMe. This is a privately held personal genomics and biotechnology company that uses saliva to test DNA [31]. The DNA sample was tested and the results were received. These can be seen in Figure 24.
Hair analysis can show the levels of heavy metals and essential elements in the human body. [19] Differences in the values of some essential elements in hair, such as calcium (Ca), chromium (Cr), zinc (Zn), copper (Cu) and selenium (Se), may be correlated with nutritional status and diseases. High values of elements such as lead (Pb), arsenic (As), mercury (Hg) and cadmium (Cd) in hair can indicate poisoning. [19]
+
[[File:ZFigure28.jpg|thumb|500px|center|Figure 24. DNA results from 23andme]]
This analysis method can reveal useful information about the Somerton man’s health condition and may help to find the cause of death.
+
Figure 24 shows details of the DNA report. It can be seen that this person is 65.1% Eastern European, indicating that at least one of their parents is European. Other useful information is the Maternal Haplogroup and the Paternal Haplogroup. A haplogroup is a group of similar haplotypes that share a common ancestor but have different SNPs [32]. The Maternal Haplogroup traces back to a single common ancestor and can be used for both males and females. In this case it is H11. The Paternal Haplogroup uses Y-chromosomes to define a specific set of shared mutations. It can only be used for males, as females do not have Y-chromosomes, and in this case it is R-CTS11962.1.
2.The relation between living environment and hair elements
+
The living environment can affect the levels of elements in the human body. This effect is caused not only by polluted water and food, but also by injury and radiation. Trace element values in hair can reflect behaviours including dietary habits, smoking and medication. Humans absorb pollutants in different ways, such as orally, by breathing, by injection and through radiation. These contaminants, stored in different body organs and tissues, can affect human health. Hair is a good indicator of environmental change. Normally, it shows the variation of trace elements in the human body as the hair grows.
+
  
=== Glass test data analysis ===
+
====Raw Data====
==== Data screening ====
+
23andMe also provided the raw data of the DNA sample, and this document was the basis of this task. Figure 25 shows a portion of the raw data.
The hair data was recorded by mass spectrometer and was obtained from the glass test. There are 44 elements obtained by laser ablation (shown in Table 1). Three of these elements need to be mentioned. The first is sulphur (S7), which must exist in human hair. The sulphur value can be used as a reference for the drift value of the environment. The second is lead (Pb206). A high level of lead is harmful to human health. The third is strontium (Sr88). This element has a high concentration in Adelaide’s soil compared with other areas, so analysing the trend of the strontium value will provide useful information about the Somerton man’s living environment before his death.
+
[[File:ZFigure29.jpg|thumb|500px|center|Figure 25. A segment of the raw data]]
Table 1 Elements table for glass test
+
Each line in the raw data represents one single SNP. Removal of these SNPs will be done at random. The raw data has over 600,000 lines.
  
[[File:code1.jpg|thumb|500px|center|Table 1 Elements table for glass test]]
+
====DNA database====
==== Element classification ====
+
After analysing and understanding the raw data, it was imported into a DNA database website called ‘GEDmatch Genesis’ to compare with other DNA samples. A fraction of the results for the original sample, as displayed by the database, is shown in Figure 26.
Some of the recorded hair data contains toxic and heavy metal elements which are harmful to human health at high values. However, most are essential elements for the human body. Table 2 shows the toxic elements and essential elements in hair. The comparison of toxic element content in hair is an important research subject.[21]
+
[[File:ZFigure30.jpg|thumb|500px|center|Figure 26. Fraction of Gedmatch Genesis database results]]
Table 2 the Different Type Elements of Hair
+
For privacy reasons, the names and email addresses could not be displayed in the figure above.
[[File:somertontable2.png|center|Table 2 the Different Type Elements of Hair]]
+
Each column in Figure 26 is explained here. ‘Kit’ represents different DNA samples from various individuals. A segment refers to a section or block of contiguous SNPs. A matching segment is a section that is the same between two people. ‘Largest Seg’ is the largest matching segment. ‘cM’ (centimorgan) is a measure of genetic linkage, that is, a measure of DNA information within a chromosome. ‘Gen’ provides a rough estimate of the number of generations back to the ancestors that both matches share, where 3 indicates the two candidates share the same great-grandparents, 4 indicates they share the same great-great-grandparents, and so on. ‘Overlap’ is the number of positions that exist in common between both kits, without regard to whether they match or not. The amount of overlap, along with the largest cM amount, is usually a good indication of the relative quality of the match. Matches with low overlap are highlighted with a pink or red background, depending on the overlap value. ‘Date compared’ is the date the DNA sample was processed into the database. ‘Testing company’ is the company where the DNA sample was tested.
In the previous study, the hair data analysis was based on the glass test result. However, impurities in the glass influence the element values, so the data analysis in this project mainly relies on the quartz test rather than the glass test. Compared with a glass plate, a quartz plate has a much purer composition. Table 3 shows the main composition of glass and quartz.
+
Table 3 The different Elements of Glass and Quartz
+
[[File:somertontable3.png|center|Table 3 The different Elements of Glass and Quartz]]
+
  
==== Element content comparison ====
+
===DNA Degradation===
In order to investigate the cause of death and the living environment, the Somerton man’s hair data should be compared with control samples. In this experiment, hair from six ordinary people was selected as control samples. The large volume of hair data was then processed by software for a visual comparison.
+
Degradation in this case refers to the random removal of DNA sequences from the raw data. SNPs will be removed from the DNA at different percentages, and the results can then be used to discover how much DNA is required before it becomes unidentifiable. Matlab was used to remove different percentages of SNPs from the sample. Figure 27 shows the code that was used to remove the SNPs from the raw data.
Excel 
+
[[File:soFigure27.jpg|thumb|500px|center|Figure 27. Matlab code to remove SNPs]]
At first, the hair data of Somerton man was plotted by Excel (shown in Figure 6). The X axis demonstrates the scan time of mass spectrometer. The Y axis demonstrates the relative value of elements content.
+
A brief explanation of how the code works will be discussed. The code gets the raw data (which is a text file). A different portion of the raw data is shown in Figure 28, which will be used in a demonstration.
Figure 6 Somerton man’s hair data
+
[[File:soFigure28.jpg|thumb|500px|center|Figure 28. A portion of the raw data]]
[[File:somertontable4.png|thumb|500px|center|Figure 6 Somerton man’s hair data]]
+
The raw data has 638468 lines, and the code removes a certain percentage of those lines (the code is currently set to remove 90% of the lines). The next part of the code deletes the blank space left by the removed lines. The blank spaces need to be removed so that the data can be imported into GEDmatch Genesis. A demonstration of how the code works is shown in Figure 29.
However, only 28 elements were shown on the chart. Some data was missing due to the limitations of Excel. Besides, most elements are not clearly visible except sodium (Na23). Some traces are out of range or hard to see, such as the elements located at the bottom of the chart. These problems make comparison difficult.
+
[[File:soFigure29.jpg|thumb|500px|center|Figure 29. A portion of the raw data]]
The control samples’ hair data was plotted in Excel as well (shown in appendix A). Unfortunately, these charts all have the same shortcomings as the Somerton man’s data chart.
+
10 lines are present in the raw data (Left image), then 2 lines are removed (Middle image), lastly the blank spaces of the removed lines are also removed, leaving only 8 lines in total (Right Image).
Matlab
+
Lastly, the code converts the results back into a text file, which can be imported into GEDmatch Genesis. The removal of SNPs was done from 10% to 90%, and the whole set of experiments will be carried out five times to obtain an average result and reduce the effect of any outliers.
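The removal procedure described above can be sketched in Python as follows. This is not the Matlab code shown in Figure 27. It is a minimal illustration under two assumptions: that comment lines in the raw data start with `#` and should be kept, and that each remaining line is one SNP. The function name and the toy data are hypothetical.

```python
import random

def degrade(lines, fraction, seed=None):
    """Randomly remove the given fraction of SNP lines, keeping the original order.
    Assumption: lines starting with '#' are header comments and are always kept."""
    rng = random.Random(seed)
    header = [line for line in lines if line.startswith("#")]
    snps = [line for line in lines if not line.startswith("#")]
    keep_count = len(snps) - round(len(snps) * fraction)
    kept = sorted(rng.sample(range(len(snps)), keep_count))
    # Building a new list leaves no blank lines where SNPs were removed.
    return header + [snps[i] for i in kept]

# Toy raw data: one header line and 10 SNP lines (rsid, chromosome, position, genotype).
raw = ["# rsid\tchromosome\tposition\tgenotype"] + [f"rs{i}\t1\t{1000 + i}\tAA" for i in range(10)]
degraded = degrade(raw, 0.2, seed=1)
print(len(degraded) - 1)  # number of SNP lines left after removing 20%
```

Repeating the call with fractions from 0.1 to 0.9 and several different seeds would mirror the repeated runs described above, with the output lines written back to a text file for upload.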
To solve these problems, Matlab was used to plot the content distribution of each element. Comparison figures for all 44 elements have been plotted and are shown in appendix A. Some of the comparison figures display the results clearly, but in others it is difficult to find a trend. The classification of the element figures is shown in table 4.
+
Two different tasks were completed using the DNA, more specifically the degraded DNA. One task involved heritage analysis and the other involved comparison of DNA using the database.
Table 4 Classify elements by comparison results
+
[[File:code5.png|thumb|500px|center|Table 4 Classify elements by comparison results]]
+
The quality of the results is considered below. Four elements (Na23, Mg24, Si29 and Ca43) show a similar trend. These four elements were all contained in the glass material used in the previous study, which suggests that the recorded values of these elements were affected by the glass slide. Elements which do not exist in the glass will not be affected.
+
In table 4, “same value” means that the content of the element in the Somerton man and the control samples can be regarded as the same. “Different value” represents a significant difference in element content, where the trends are not similar. “No tendency” means that the comparison figure shows no clear pattern of change and the results cannot be compared. Figure 7 shows the lead (Pb206) relative value comparison between the Somerton man and the control samples.
+
Figure 7 Lead (Pb206) Value Comparison
+
[[File:code6.png|thumb|500px|center|Figure 7 Lead (Pb206) Value Comparison]]
+
The X axis shows the scan time of the mass spectrometer. The Y axis shows the relative value of lead content. The mass spectrometer scans from the root of the hair, so a smaller number on the X axis is closer to the time of death. In Figure 7, the content of lead in the control samples is very low and there is a slight upward trend only for a short period of time. However, a very high level of lead was found in the Somerton man’s hair compared with the control samples, especially at the end of his hair. This high lead content is worthy of attention. The distribution graph of lead content clearly shows that the lead value decreased visibly before his death, so high lead levels are unlikely to have been the main cause of his death. The high level of lead may instead be related to his living environment, and the rapid decline in lead content may represent a recent change of living environment. Thus, some geographic information from 1948 needs to be researched, such as soil analysis, petroleum quality, water quality and nuclear power plant locations.
+
Figure 8 shows the mercury (Hg202) relative value comparison between the Somerton man and the control samples.
+
Figure 8 Mercury (Hg202) Value Comparison
+
[[File:code7.png|thumb|500px|center|Figure 8 Mercury (Hg202) Value Comparison]]
+
The X axis shows the scan time of the mass spectrometer and the Y axis shows the relative value of mercury content. A higher level of mercury was found in the Somerton man’s hair than in the control samples, and the value is basically the same from the root to the end of the hair. Mercury is a highly toxic heavy metal pollutant. It is difficult for an organism to excrete, so it is a serious threat to human health. However, mercury is ubiquitous in nature, and trace amounts exist in plants, animals and food. Normally, the human body eliminates toxins through excretion and metabolism and keeps the amount at a level that does not affect health. This is why mercury is found in human hair. The main sources of mercury pollution are chlor-alkali, plastics, batteries, electronics and other industrial emissions. The Somerton man’s higher level of mercury may be due to the area in which he lived, so it should not be regarded as the cause of his death.
+
The strontium values in the hair of the Somerton man and the control samples are clearly different. Figure 9 shows the strontium (Sr88) relative value comparison between the Somerton man and the control samples.
+
Figure 9 Strontium (Sr88) Value Comparison
+
[[File:code8.png|thumb|500px|center|Figure 9 Strontium (Sr88) Value Comparison]]
+
The X axis shows the scan time of the mass spectrometer and the Y axis shows the relative value of strontium content. Strontium occurs in mineral water. It is sometimes described as beneficial to human health, for example in helping to prevent hardening of the arteries. At the root of the Somerton man’s hair, the strontium content is significantly high. This means that the area in which the Somerton man lived before his death contained a high level of strontium. In addition, the strontium value shows an increasing tendency over his last two weeks. However, strontium is not a toxic element, and this increase is not the cause of the Somerton man’s death. A reasonable inference is that the Somerton man changed his living environment two weeks before his death, which agrees with the speculation based on the change in lead content.
+
Moving-average Filter in Matlab
+
Some comparison figures cannot be shown clearly by direct plotting in Matlab, so a moving-average filter is applied. First, a window size is selected for the filter. The arithmetic mean of the data points within the window is then calculated and assigned to the centre point of the window. The window is then shifted along the data and the averaging is repeated until the whole series has been processed. This method suppresses periodic disturbances well and makes the figure smoother. It is suited to signals with high-frequency oscillation.
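The filtering steps described above can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code used in the project, which is not reproduced in this report. The function name <code>moving_average</code>, the odd window size and the shortened window at the edges are assumptions made for the example.

```python
def moving_average(values, window):
    """Centred moving average (hypothetical helper, not the project's code).

    Each output point is the arithmetic mean of the input points inside a
    window centred on it. Near the edges only the part of the window that
    fits inside the data is used. The window size is assumed to be odd.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # left edge of the window
        hi = min(len(values), i + half + 1)   # right edge (exclusive)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Illustrative values only, not measured data: a single spike is spread out.
noisy = [0, 0, 9, 0, 0]
print(moving_average(noisy, 3))
```

A larger window gives a smoother curve but also flattens short-lived peaks, so the window size has to be chosen with the length of the features of interest in mind.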
+
All of the element content figures have been re-plotted using the moving-average filter and are shown in Appendix B. After filtering, the comparison figures reveal some new findings: more elements with similar tendencies are found and some comparison figures become clearer. The details are shown in Table 5.
+
Table 5 Classify elements by comparison results 
+
[[File:code9.png|thumb|500px|center|Table 5 Classify elements by comparison results ]]
+
After filtering, As75, Ag107 and Cd111 are newly identified toxic elements that have a high level in the Somerton man’s hair. Their content distributions are shown below:
+
Figure 10 Cadmium (Cd111) Value Comparison
+
[[File:code10.png|thumb|500px|center|Figure 10 Cadmium (Cd111) Value Comparison ]]
+
Cadmium is an element present in nickel-cadmium batteries, and it is a toxic element that can create poisonous gas when boiled. The average cadmium content in the Somerton man is much higher than in the control samples, especially at a scan time of 1400 seconds. The trend of this element is similar to that of lead, and the cadmium content decreases towards the end of his life. Unlike lead, cadmium is more toxic, and it can be regarded as a clue to the death of the Somerton man. However, this kind of battery was not available at that time, more than 60 years ago, which indicates that such batteries are not the source of the cadmium. The figure also shows that the control hair samples contain cadmium as well. Thus, it can be assumed that this element may come from drinking water, food or other frequently used items.
+
Figure 11 Silver (Ag107) Value Comparison
+
[[File:code11.png|thumb|500px|center|Figure 11 Silver (Ag107) Value Comparison ]]
+
Figure 11 shows that the silver content of the Somerton man is higher than that of the control samples. The silver value in the Somerton man’s hair does not change noticeably and is stable overall, so this element should not be used to speculate about the Somerton man’s living conditions. In addition, silver is not harmful to human health and can only cause some change in skin colour. Thus, this element will not be researched further in this project.
+
Figure 12 Arsenic (As75) Value Comparison
+
[[File:code12.png|thumb|500px|center|Figure 12 Arsenic (As75) Value Comparison ]]
+
The arsenic content in the Somerton man increases during his last two weeks. At the tip of the Somerton man’s hair, the arsenic content is similar to that of the control samples, which means that this level is not harmful to humans and that his living area at that time can be assumed to be the same as that of the control samples. However, the arsenic level is significantly high in his last few days, and arsenic can be used to make poison. A change of living area or poisoning should therefore be considered. This comparison figure can be regarded as a valuable clue.
+
  
=== Quartz test data analysis ===
+
====Heritage====
==== Data screening ====
+
The first task completed was investigating the heritage of the DNA sample that was sent. Various genetic ancestry projects could have been chosen, so all of them were researched to make sure that the results would be as accurate as possible. Eurogenes seemed the most appropriate, considering that the sample is of European descent. The next step was to select the most appropriate model for calculating the heritage in the DNA. Again, significant research was completed, and the chosen model was Eurogenes K13, as this model is best for samples with mixed heritage.
Compared with the glass plate, the quartz plate is purer. As a result, some elements recorded in the glass experiment were not present in the quartz test, and some elements were recorded at a different atomic mass. The 35 elements recorded in the quartz test are shown in Table 6.
+
The heritage results for the original case (no SNPs removed) can be seen in Figure 30. The sample has strong heritage in the North Atlantic, Baltic and West Mediterranean regions.
[[File:code13.png|thumb|500px|center|Table 6 Recorded Elements on Quartz Test ]]
+
[[File:soFigure30.jpg|thumb|500px|center|Figure 30. Heritage results]]
==== Data comparison figures ====
+
The results considered only the North Atlantic and Baltic regions, as these had the highest percentages, 29.13% and 42.11% respectively.
All of the quartz data has been recorded. For comparison, the element data has been classified into four sets:
+
The other cases, with 10% to 90% of SNPs removed, were then completed. A graph was produced to visualise the effect of removing the SNPs, which can be seen in Figure 31.
• Control sample hairs: The file names are Hair02 to Hair08. Two sections of each control hair were analysed, each 500 µm in length, giving 0.7 cm of control sample in total. There are seven control persons, and each provided two hair samples.
+
[[File:soFigure31.jpg|thumb|500px|center|Figure 31. Heritage test, with SNPs removed]]
• Somerton Man: Sections of 500 µm were analysed, each subsequent transect beginning where the previous one finished, for a total of 96 sections (4.8 cm in total).
+
The x-axis represents the percentage of SNPs removed from the sample and the y-axis is the percentage of the sample attributed to that heritage. The graph shows that the results are relatively steady up to about 60% and then fluctuate from 70% onwards. This indicates that removing more than 70% of the SNPs will give inaccurate results. One test is not enough to conclude this, so five more tests were completed to see whether the trend was similar in each case. The figure below shows the average of the heritage tests completed.
• Standards: The files named NIST612-1 to NIST612-20 (20 transects) are standards. They were run to ensure that the instrument was not drifting too much during the analysis. Because they were run at intervals during the sample acquisition, they can be used to compensate for drift if need be.
+
[[File:soFigure32.jpg|thumb|500px|center|Figure 32. Average of the Heritage Test]]
• Background: Transects of 100 µm length across the quartz. The files are Glue-1 to Glue-4.
+
The average results are all relatively linear, except for some points at North Atlantic 40%, 80% and 90% and Baltic 80% and 90%. The error bars, however, get larger as a greater percentage of SNPs is removed. This indicates that the DNA sample starts to lose its structure as SNPs are removed.
For the quartz test data, three comparison figures are plotted for each element. The first compares the four sets of data: control samples, Somerton man, quartz background and standards. The second is an enlarged view of the first. The third compares the data of the fourteen control hair samples.
+
From 10% to 40% of SNPs removed, the standard deviation increases to approximately 1, which shows that the DNA is still robust. From 50% to 90% of SNPs removed, the results start to vary significantly, as shown by the large error bars and the much larger standard deviation.
Dots of different colours represent the four sets of data: blue for the control samples, red for the Somerton man, yellow for the standards and green for the quartz background. Figures 13 to 19 show the comparison of the four sets of data. In all figures, the X axis is the scan time of the laser and the Y axis is the relative value of the element in the four sets of data.
+
From these results it can be concluded that the DNA sample is robust until about 50% of SNPs are removed, and after that the DNA becomes unidentifiable.
[[File:code14.png|thumb|500px|center|Figure 13a Silver (Ag107) ]]
+
[[File:code15.png|thumb|500px|center|Figure 13b Silver (Ag107) ]]
+
[[File:code16.png|thumb|500px|center|Figure 13c Silver (Ag107) ]]
+
The comparison of the silver (Ag107) value between the four sets of data is shown in Figure 13. The yellow dots represent the drift values, which are basically stable and indicate that the drift is not significant. Thus, the recorded data is reliable. The green dots represent the silver content in the quartz. As can be seen from the figure, the silver content of the quartz is very small, so it has little effect on the test data. In general, the silver content in the Somerton man’s hair is higher than in the control hairs. In the quartz experiment, the relative silver content in the hair is a stable value, which is similar to the result of the glass experiment.
+
Figure 14 Carbon (C13)
+
[[File:code17.png|thumb|500px|center|Figure 14a Carbon (C13) ]]
+
[[File:code18.png|thumb|500px|center|Figure 14b Carbon (C13) ]]
+
[[File:code19.png|thumb|500px|center|Figure 14c Carbon (C13) ]]
+
Figure 15 Sulphur (S34)
+
[[File:code20.png|thumb|500px|center|Figure 15a Sulphur (S34) ]]
+
[[File:code21.png|thumb|500px|center|Figure 15b Sulphur (S34) ]]
+
[[File:code22.png|thumb|500px|center|Figure 15c Sulphur (S34) ]]
+
As with silver, the standard values for sulphur and carbon are constant, which means the instrument was not drifting too much. The relative values of carbon and sulphur are similar in the Somerton man and 12 of the control hairs. Only two samples differ significantly, and both come from the same person. Carbon and sulphur are essential elements in human hair, and a change in the content of these two elements can be a direct indicator of a person's living environment. Based on Figures 14 and 15, the living environment of the Somerton man is similar to that of six of the control persons but different from that of this one person. Thus, these results help to confirm the living environment of the Somerton man.
+
Figure 16 Calcium (Ca43)
+
[[File:code23.png|thumb|500px|center|Figure 16 Calcium (Ca43) ]]
+
Figure 17 Potassium (K39)
+
[[File:code24.png|thumb|500px|center|Figure 17 Potassium (K39) ]]
+
Figure 18 Strontium (Sr88)
+
[[File:code25.png|thumb|500px|center|Figure 18 Strontium (Sr88) ]]
+
Comparing the relative element values of the Somerton man and the control samples, Figures 16, 17 and 18 have some similarities. The relative values of calcium, potassium and strontium in the Somerton man’s hair are higher than in most control hairs. However, the content of these three elements in two samples is close to that of the Somerton man’s hair, and these two hair samples both come from the same person. Based on this result, it can be speculated that the levels of these three elements in the living environments of the Somerton man and of this person are similar.
+
Figure 19 Lead (Pb206)
+
[[File:code26.png|thumb|500px|center|Figure 19a Lead (Pb206) ]]
+
[[File:code27.png|thumb|500px|center|Figure 19b Lead (Pb206) ]]
+
[[File:code28.png|thumb|500px|center|Figure 19c Lead (Pb206) ]]
+
As mentioned in the previous chapter, lead is a toxic element, and the data from the glass plate experiment show that the lead content in the Somerton man’s hair is significantly higher than in the control hairs. This is also observed in the quartz plate test. Because the hair used for the quartz plate test was incomplete, the trend of the lead content in the hair before the Somerton man’s death cannot be shown. However, the lead content in the Somerton man’s hair is indeed higher than in the control hairs.
+
The difference in lead levels does not indicate that the death of the Somerton man was caused by a high level of lead in the body. The difference may be due to the fact that the lead content in the human living environment today is different from that of 60 years ago.
+
Figure 20 Comparison of the relative values of some elements in control hairs
+
[[File:code29.png|thumb|500px|center|Figure 20a Comparison of the relative values of some elements in control hairs ]]
+
[[File:code30.png|thumb|500px|center|Figure 20b Comparison of the relative values of some elements in control hairs ]]
+
For the control hairs, two sections of each hair have been analysed. There are fourteen control hairs from seven persons. Comparing these 14 samples shows that the elemental content differs between samples, especially for the first control person’s hair. Figure 20 shows that the relative values for the first sample are significantly higher than for the other samples. The elements concerned are Mg, Ca, K, Mn, Na, Li and Sr. In addition, the analysis and comparison of several previous graphs show that the elemental contents of this sample are similar to those of the Somerton man. From this result it can be inferred that this control person and the Somerton man have similarities in living environment or lifestyle.
+
  
=== Two data comparison ===
+
====Database====
==== Two tests data comparison table ====
+
The second task completed was comparing the DNA sample to a database of other people’s DNA samples. Using the database, we were able to identify efficiently how closely related the DNA sample was to other people. Refer back to Figure 26 to see how the database displays the results.
The glass plate test and the quartz plate test have 31 recorded elements in common, which are shown in Table 7.
+
The original sample is compared with the cases in which 10% to 90% of SNPs are removed. The objective is to find false positives and false negatives. A false positive in this case is a kit that appears in the matches for the SNP-removed case but not in those for the original sample, and a false negative is a kit that appears in the matches for the original sample but has disappeared from those for the SNP-removed case. An example is shown below for clarity.
Table 7 the Same Elements in Two Data Sets
+
Original: A B C D E
[[File:code31.png|thumb|500px|center|Table 7 the Same Elements in Two Data Sets ]]
+
10% of SNP removed: C D G H I
==== Comparison of Somerton man’s two hairs ====
+
false positive: G H I and false negative: A B E
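The example above can be reproduced with simple set operations. The following Python sketch is an illustration only and is not the procedure used with the database. The function name <code>compare_matches</code> and the kit labels A to I are hypothetical.

```python
def compare_matches(original, degraded):
    """Compare two lists of matched kits (hypothetical helper).

    original: kits matched by the unmodified DNA sample
    degraded: kits matched after a percentage of SNPs was removed
    """
    original_set, degraded_set = set(original), set(degraded)
    return {
        # kits found in both cases
        "matches": sorted(original_set & degraded_set),
        # kits that appear only after SNP removal
        "false_positives": sorted(degraded_set - original_set),
        # kits that were matched originally but disappeared
        "false_negatives": sorted(original_set - degraded_set),
    }


result = compare_matches(["A", "B", "C", "D", "E"],
                         ["C", "D", "G", "H", "I"])
print(result["false_positives"])  # ['G', 'H', 'I']
print(result["false_negatives"])  # ['A', 'B', 'E']
```

When both lists have the same length, as with a fixed sample size, the number of false positives always equals the number of false negatives, because every kit lost from the original list is replaced by a new one.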
In the last two chapters, the differences between the Somerton man and the control samples were compared and analysed for each of the two experiments. Next, the two sets of experimental results for the Somerton man’s two hairs are compared. The aim is to find data in the two experiments that follow the same trend but differ in value, so that one set of test data can be multiplied by a constant to obtain the other.
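If two series with the same trend were found, one way to estimate such a constant would be a least-squares fit through the origin. The Python sketch below assumes that both series are sampled at the same hair positions and have equal length. It is an illustration of the idea, not a method applied in this project, and the function name <code>scale_constant</code> and the numbers are hypothetical.

```python
def scale_constant(reference, target):
    """Least-squares constant k that minimises sum((target - k * reference)^2).

    Hypothetical helper: both series are assumed to be measured at the same
    positions along the hair and to have the same length.
    """
    if len(reference) != len(target) or not reference:
        raise ValueError("series must be non-empty and of equal length")
    denominator = sum(x * x for x in reference)
    if denominator == 0:
        raise ValueError("reference series must not be all zeros")
    return sum(x * y for x, y in zip(reference, target)) / denominator


# Illustrative numbers only, not measured data.
quartz_values = [1.0, 2.0, 3.0]
glass_values = [2.0, 4.0, 6.0]
k = scale_constant(quartz_values, glass_values)
print(k)  # 2.0
```

A single constant is only meaningful when the two series share the same trend. If the trends differ, the fit will still return a number, but multiplying by it will not reproduce the other data set.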
+
As there are thousands of kits to compare, a sample of 30 was taken. Figure 33 shows the false positives and false negatives for different percentages of SNPs removed.
Figure 21 (a) Silicon (Si) value comparison in quartz test
+
[[File:soFigure33.jpg|thumb|500px|center|Figure 33. False positives and false negatives]]
[[File:code32.png|thumb|500px|center|Figure 21 (a) Silicon (Si) value comparison in quartz test ]]
+
The x-axis represents how much SNP is removed in the sample and the y-axis is how many false positives or false negatives are found. Unfortunately, in this case the false positives and false negatives equalled each other.
Figure 21 (b) Comparison of the relative values of Silicon (Si) in Somerton man hair
+
The counts are all fairly high: at 10% there were only 6 matches with the original case. This indicates that altering the DNA by even a small amount can change the results significantly. As SNP removal approaches 50%, there are more false positives and false negatives and fewer matches to the original case. At 50% or more of SNPs removed there are 30 false positives and 30 false negatives, which means that no DNA sample matched the original case.
[[File:code33.png|thumb|500px|center|Figure 21 (b) Comparison of the relative values of Silicon (Si) in Somerton man hair]]
+
Figure 21 (a) is a comparison of the silicon data obtained in the quartz plate experiment. The silicon content in the standard decreases, and the silicon content in the hair data decreases with it. This shows that the test environment influences the test result. Figure 21 (b) shows that the two hairs of the Somerton man differ significantly in the relative value of silicon, and the trends of the two sets of data are also significantly different. Thus, the constant cannot be found from Figure 21.
+
Figures 22 to 25 compare different elements in the two Somerton man hairs. It is hard to find the same trend in these comparisons. Therefore, another suitable way to compare the two sets of data is needed.
+
Figure 22 Comparison of the relative values of Arsenic (As75) in Somerton man hair
+
[[File:code34.png|thumb|500px|center|Figure 22 Comparison of the relative values of Arsenic (As75) in Somerton man hair]]
+
Figure 23 Comparison of the relative values of Strontium (Sr88) in Somerton man hair
+
[[File:code35.png|thumb|500px|center|Figure 23 Comparison of the relative values of Strontium (Sr88) in Somerton man hair]]
+
Figure 24 Comparison of the relative values of Lead (Pb206) in Somerton man hair
+
[[File:code36.png|thumb|500px|center|Figure 24 Comparison of the relative values of Lead (Pb206) in Somerton man hair]]
+
Figure 25 Comparison of the relative values of Sulphur (S34) in Somerton man hair
+
[[File:code37.png|thumb|500px|center|Figure 25 Comparison of the relative values of Sulphur (S34) in Somerton man hair]]
+
  
==Project Management - Planning and Feasibility==
+
===Conclusion===
=== Timeline ===
+
The major finding in this task is that when 50% of the DNA is removed it becomes unrecognisable. This was shown in both the heritage task and the database task. It was also found that altering the DNA by only a small amount can result in a very significant change to the results. Relating this back to the Somerton Man, only 50% of the Somerton Man’s DNA is needed for further research.
The timeline Gantt chart is attached in Appendix E. It lists the key milestones of the project (shown in Table 7).
+
Table 7 Key Milestones of the Project
+
[[File:code38.png|thumb|500px|center|Table 7a Key Milestones of the Project(a)]]
+
[[File:code39.png|thumb|500px|center|Table 7b Key Milestones of the Project(b)]]
+
  
=== Work breakdown ===
+
==Project Management==
The key task for the project is plotting and comparing data figures, then finding clues from the comparison. Because the two group members have different study directions, the project work is individual.
+
=== Task allocation ===
+
The project has two directions, and the two group members worked on different tasks. This report is devoted to analysing the mass spectrometer data of the Somerton man’s hairs. The other group member worked on cracking the code and writing software.
+
=== Management Strategy ===
+
To ensure finishing project on time, the following strategies have been applied:
+
• Meeting with supervisors regularly.
+
• Receiving effective feedback from supervisors.
+
• Finishing tasks without delay
+
• Communicating with group member frequently.
+
 
=== Budget ===
 
=== Budget ===
No hardware was required in this project. All the software used in this project was either provided by the University of Adelaide or open source. As a result, the $500 budget was not used at all.
+
Each project member had $250 for the project, hence we had $500 for the entire project as two members were present in the group.
 +
The pure quartz slide was purchased for the mass spectrometer.
 +
The DNA genetic testing kit was purchased so we could test our DNA.
 +
A table of what items were purchased and how much each item cost is seen in Table 6.
 +
[[File:soTable6.jpg|thumb|500px|center|Table 6. Budget Table]]
 +
 
 
=== Risk Management ===
 
=== Risk Management ===
The risks of the project are listed in Table 8. The first risk to consider is misunderstanding the project tasks, so good communication with supervisors is essential. For this report, the crucial task is plotting figures in Matlab, and Matlab being unavailable would affect project progress. However, Matlab is available on the university computers, so this risk is negligible.
+
A risk assessment table for the project can be seen in Table 7. One of the major risks encountered was bugs in the code. This caused many issues throughout the project. One of the main issues was that it prevented the project from proceeding further, which led to delays in task completion. Another risk was the chance of the Human Ethics application not being approved. This would have meant that we could not do the mass spectrometer task and the DNA task, so we would have had to do more in-depth analysis on the first task and find additional tasks to do. Slight misunderstandings of project tasks happened throughout the project, but these were resolved through meetings with the supervisor.
Table 8 Project Risks
+
[[File:soTable7.jpg|thumb|500px|center|Table 7. Risk Management]]
[[File:code40.png|thumb|500px|center|Table 8 Project Risks]]
+
  
==Future work==
+
==Conclusion==
===Code cracking===
+
The main task was to find out whether the mysterious code represents horse names. The final conclusion of the code analysis task was that the mysterious code is unlikely to be horse names. It was also found that it is unlikely to be South Australian street names, Australian city names or text from The Rubaiyat of Omar Khayyam. Even with further studies on the book, it remained unlikely to be the source of the mysterious code. However, the results showed that it is possible that the mysterious code is Australian beach names.
For code cracking, the following tasks are expected to be completed next:
+
The second main task was ICP-MS on different hair samples. The main focus was strontium, comparing two samples from people who left Adelaide for a period of time with three samples from people who stayed in Adelaide. It was found that the level of strontium was higher in Bali and Japan than in Adelaide. This leads to the conclusion that the Somerton Man must have visited another country with high strontium before coming to Adelaide.
1. Extend the two tests to other languages which have not been tested.
+
The last task involved degrading DNA to discover when the DNA would become unidentifiable. This was done by removing SNPs. The main conclusion was that DNA becomes unidentifiable when 50% of the DNA is removed. This could be used to help find the Somerton Man’s relatives or his real identity.
2. Perform more n-grams test using the Simhash Algorithm.
+
Overall, discovering the identity of the Somerton Man has come a step closer. Hopefully, with this project completed, other groups can use this information to identify who the Somerton Man is.
3. Increase the number of tests in the 50-letters group Levenshtein test to see if the randomness can be avoided.  
+
  
===Hair data analysis===
+
==Future Project Discussion==
For the mass spectral analysis, the element data of the Somerton man’s hair obtained from the glass test and the quartz test have been plotted. The future work is as follows:
+
===Code Analysis===
1. Re-analyse the uncommon element values of the Somerton man’s hair.
+
Deciphering the mysterious code still needs to be completed. It is possible that the mysterious code is Australian beach names, so further investigation is required to determine whether this is true. This can be done by increasing the sample size or by locating which beaches the Somerton Man might have visited before his death. More collections of objects should also be considered and researched. Using different statistical methods would also enhance the results.
2. Find the Somerton man’s living environment clues
+
===Hair Analysis===
3. Find the same trend part from two Somerton man’s hair comparison figures.
+
Hair from different countries that the Somerton Man could have visited should be analysed and the results compared with the strontium levels found in this project.
4. Gain the constant number to multiply one test data
+
To confirm that a hair sample does not contain outliers, each individual should give two hairs, and the whole hair should be ablated rather than sampling spots along the hair. This will be more accurate, as there will be more data to process.
 +
===DNA Analysis===
 +
For the DNA task, only a small sample of false positives and false negatives was used, so an algorithm or program should be used to process all the false positives and false negatives from the database. This would give a more accurate result.
 +
Another task would be to obtain a second DNA kit from another individual and compare the two. This would show whether the two DNA results become unidentifiable at the same percentage.
 +
Extensive research on the case indicates that the Somerton Man has a granddaughter. Future groups can research how the DNA sequence differs between a grandfather and his granddaughter. This could help in understanding the differences between the two DNA sequences.
 +
Now that it is known that only 50% of the DNA is needed, future groups should obtain the Somerton Man’s DNA and undertake further research on it.
  
==Conclusions==
+
==References==
===Code cracking===
+
[1] J. Bineth. “Somerton Man cold case,” ABC news, 14th December 2017 [online]. Available: http://www.abc.net.au/news/2017-12-14/somerton-man-cold-case-could-be-one-step-closer-to-solved/9245512
In this project, two new algorithms have been researched and introduced to check the similarities between the mysterious code and texts in different languages. The algorithms have been implemented in Java. Text materials for each test have been gathered and arranged into suitable formats. An individual test for each algorithm has been designed and implemented. After analysing the test results, the conclusion was drawn that the code consists of English initialisms.
+
===Hair data analysis===
+
In this project, three important comparisons of experimental data were made. These three comparisons are related to glass test, quartz test and the comparison of glass and quartz test results.
+
In the comparison of the glass plate test data, the relative values of lead, mercury, arsenic and cadmium in the Somerton man’s hair are significantly higher than in the control samples during certain periods of time. However, the content of these elements had fallen to a safe range before his death. Thus, the high content of these elements cannot be regarded as the main cause of the Somerton man’s death, only as an indication of a change of living environment.
+
In the comparison of the quartz test data, the total number of recorded elements was reduced from 44 in the glass test to 35, which is due to the high purity of the quartz plate. Thus, some elements in the glass test are excluded. The comparison shows that the content of carbon and sulphur in the Somerton man’s hair differs greatly from that of the fourth sample. In addition, the content of potassium, strontium and calcium is similar to that of the first sample. Therefore, it can be assumed that the Somerton man’s living area is different from that of the fourth sample but similar to that of the first sample.
+
The data comparison between the glass and quartz tests did not achieve the desired results, so the experimental method may need to change in future work.
+
  
==Appendices==
+
[2] The News, “Dead Man Found Lying on Somerton Beach,” 1st December 1948, p. 1 [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161
* Appendix A: Glass test figures
+
[[File:Glass Test.pdf]]
+
  
* Appendix B: After Moving-average Filters figures
+
[3] The Advertiser, “Cryptic note on body,” 9th June 1949, p. 1 [online]. Available: https://trove.nla.gov.au/newspaper/article/36371152
[[File:After Moving-average.pdf]]
+
  
* Appendix C: Quartz data figures
+
[4] The Advertiser, “New Clue in Somerton Body Mystery,” 25th July 1949, p. 3 [online]. Available: https://trove.nla.gov.au/newspaper/article/36677719
[[File:Quartz Test.pdf]]
+
  
* Appendix D: Two test comparison figures
+
[5] “Tamam Shud/ Somerton Man,” [online]. Available: http://ciphermysteries.com/tamam-shud-somerton-man
[[File:Two test comparison.pdf]]
+
  
* Appendix E: The Timeline Gantt Chart
+
[6] The Advertiser, “Tamam Shud,” 10th June 1949, p. 2 [online]. Available: https://trove.nla.gov.au/newspaper/article/36371416
  
 +
[7] Maguire, S. “Death riddle of a man with no name,” The Advertiser, 9th March 2005, p. 28 [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser-mar2005.pdf
  
[[File:code42.png|thumb|500px|center|The Timeline Gantt Chart]]
+
[8] Inside Story, presented by Stuart Littlemore, ABC TV, screened at 8 pm, Thursday, 24th August, 1978
 
+
==References==
+
<references/>
+
[1]. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
+
  
[2]. Renato Castello, “New twist in Somerton Man mystery as fresh claims emerge,” Sunday Mail SA, November 23rd, 2013. Access via Internet:
+
[9] A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man
http://www.adelaidenow.com.au/news/south-australia/new-twist-in-somerton-man-mystery-as-fresh-claims-emerge/story-fni6uo1m-1226766905157
+
  
[3]. Lynton Grace, “Somerton Man mystery: New details revealed of Jo Thomson, nurse in the case”, The Advertiser, 29th May 2015. Access via Internet:
+
[10] K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
http://www.adelaidenow.com.au/news/south-australia/somerton-man-mystery-new-details-revealed-of-jo-thomson-nurse-in-the-case/news-story/4c6bccbd2318584ad0cc6daaf3d8abd4
+
  
[4]. From Wikipedia, the Taman Shud Case. Access via Internet:
+
[11] S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
https://en.wikipedia.org/wiki/Tamam_Shud_case
+
  
[5]. From Internet:  
+
[12] A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
http://ciphermysteries.com/wp-content/uploads/sites/6/2014/01/SomertonManCode-wikipedia.jpg
+
  
[6]. Inside Story, presented by Stuart Littlemore, ABC TV, screened at 8 pm, Thursday, August 24th, 1978.
+
[13] L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
  
[7]. A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man?  Available:
+
[14] N. Gencarelli and J. Yang. (2015). Semester B Final Report 2015 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report/Thesis_2015
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F
+
  
[8]. K. Ramirez and L-V. Michael. (2010). Final Report 2010 . Available:  
+
[15] Y. Li and Y. Ma. (2016). Semester B Final Report 2016 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report/Thesis_2016 
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010
+
  
[9]. S. Maxwell and P. Johnson. (2011). Final Report 2011 . Available:
+
[16] B. David. “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010.
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011
+
  
[10]. A. Duffy and T. Stratfold. (2012). Final Report 2012 . Available:
+
[17] G G. L et al. “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009.
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012
+
  
[11]. L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking . Available:  
+
[18] T. Hung. “The p-value,” 21st March 2016 [online] Available: https://www.students4bestevidence.net/p-value-in-plain-english-2/
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking
+
  
[12]. N. Gencarelli and J-K. Yang. (2015). Semester B Final Report 2015 – Cipher Cracking . Available:  
+
[19] “Mass Spectrometer,” Premier Biosoft [online]. Available: http://www.premierbiosoft.com/tech_notes/mass-spectrometry.html
https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report/Thesis_2015
+
  
[13]. MS. Charikar. (2002). Similarity estimation techniques from rounding algorithms. Available:
+
[20] R. Bakhtiar. “Biological mass spectrometry: A primer”. Mutagenesis pp.530-531 2000.
https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
+
  
[14]. G.S. Manku, A. Jain and A. Das Sarma. (2007). Detecting Near-Duplicates for Web Crawling. Conference on World Wide Web.
+
[21] S. Jackson et al. “The application of laser ablation-inductively coupled plasma-mass spectrometry to in situ U–Pb zircon geochronology,” Chemical Geology, Vol. 211, pp.47-69 2004.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.7794&rep=rep1&type=pdf
+
  
[15]. From Wikipedia, the Vector Space Model. Access via Internet: https://en.wikipedia.org/wiki/Vector_space_model#/media/File:Vector_space_model.jpg
+
[22] “What are single nucleotide polymorphisms?,” U.S National Library of Medicine [online] Available:  
 +
https://ghr.nlm.nih.gov/primer/genomicresearch/snp
  
[16]. From Internet:
+
[23] G. Shaw. “Polymorphism and Single nucleotide polymorphisms (SNPs)” Science Made Simple, Vol. 112, pp.664-665 2013.
http://static.oschina.net/uploads/img/201308/30125158_L1CI.jpg
+
  
[17] Adelaide Microscopy, ‘instrumentation’ viewed on 23 March 2016
+
[24] Google Maps, 2018. [online]. Available: https://www.google.com.au/maps/dir//Morphettville+Racecourse
<https://www.adelaide.edu.au/microscopy/instrumentation/icpms.html>
+
  
[18] ‘Australia can solve one of the world’s most intriguing mysteries by exhuming the body of The Somerton Man’ 2015, in NEWS.com.au, viewed on 15 March 2016.
+
[25] E. FitzGearld. “Rubaiyat of Omar Khayyam,” [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/W&T_rubaiyat_wells_copy.pdf
<http://www.news.com.au/national/crime/ >
+
  
[19]Batool,AI, Rehman, FU, Naveed, NH, Shaheen,A and Irfan,S 2010 ‘Hairs as biomonitors of hazardous metals present in a work environment’ in Full Length Research Paper, vol. 10, no. 18, pp.3602-3607.
+
[26] Dr. W. Lawrence. “HAIR MINERAL ANALYSIS – AN INTRODUCTION,” [online]. Available: https://www.drlwilson.com/articles/HA%20INTRO.htm#CONT
  
[20] ‘Dead Man Found Lying on Somerton Beach’ 1948, in The News, vol. 51, no. 7902, pp. 1.
+
[27] American Academy of Dermatology. “How Fast Does Hair Grow? Tips for Growth,” Healthline, 2018 [online]. Available: https://www.healthline.com/health/beauty-skin-care/grow-hair-faster
  
[21] Miekeley, N, Dias Carneiro, MTW, and Porto da Silveira, CL 1998, ‘How reliable are human hair reference intervals for trance element?’ in the Science of the Total Environment, vol. 218, no. 1998, pp. 9-17.
+
[28] L.Tirado-Lee. “The science of curls,” Helix [online]. Available: https://helix.northwestern.edu/blog/2014/05/science-curls
  
[22] School of Electrical and Electronic Engineering ‘Final Year Project Proposal’, 2014
+
[29] Paton, C., Hellstrom, J., Paul, B.,Woodhead, J. and Hergt, J., 2011. Iolite: Freeware for the visualisation and processing of mass spectrometric data. Journal of Analytical Atomic Spectrometry. doi:10.1039/c1ja10172b.
  
[23] ‘Somerton man’ viewed on 23 March 2016
+
[30] “What is DNA?,” Genetics Home Reference, 2018 [online]. Available: https://ghr.nlm.nih.gov/primer/basics/dna.
<https://en.wikipedia.org/wiki/Taman_Shud_Case>
+
  
[24] ‘Why The Somerton Man Endures As One Of Australia's Most Fascinating Cold Cases’ 2016, in GIZMODO, viewed on 15 March 2016.
+
[31] “About us,” 23andme [online]. Available: https://mediacenter.23andme.com/about-us/
<http://www.gizmodo.com.au>
+
  
==Glossary and Symbols==
+
[32] “Getting started with the haplogroup reports,” 23andme [online]. Available: https://customercare.23andme.com/hc/en-us/articles/235201447-Getting-Started-with-the-Haplogroup-Reports
* '''ICP-MS:''' Inductively Coupled Plasma Mass Spectrometer
+

Latest revision as of 18:05, 24 October 2018

Abstract

The body of an unidentified man was found at Somerton Beach, Adelaide, South Australia, on the 1st December 1948, and he has since been referred to as the Somerton Man. To this day, his identity and the cause of his death are unknown. This project is broken up into three tasks, each contributing towards the unsolved case. The first task is based around the piece of paper that was found inside his trousers pocket. This piece of paper was found to be part of a poetry book, the ‘Rubaiyat of Omar Khayyam’, and five lines of capital letters associated with it are thought to be some kind of code or cipher. Based on the work of previous years’ project groups, the letters of the mysterious code are thought to be the first letters of words. The body was found near Morphettville Racecourse, which led to the hypothesis that the letters of the mysterious code are the initials of horse names. However, statistical testing did not support this hypothesis. The second task used a mass spectrometer to analyse the isotopic signatures of hair samples. More specifically, hair shafts obtained from various people were ablated by a laser and the elements present were recorded. These results will be compared with the Somerton Man’s hair, to identify specific elements present and to estimate how long he was in Adelaide before his death. In the final task, different DNA samples were analysed. Using software tools, the samples were degraded until the DNA became unidentifiable. It was found that degrading 50% or more of the DNA sequences makes the DNA unidentifiable.

Acknowledgements

The project team would like to thank the project supervisor, Professor Derek Abbott, for his helpful and motivational advice and for the exceptional guidance he provided on each of the completed tasks. Thanks also go to the University of Adelaide Microscopy Centre for permitting the project team to use the mass spectrometer, and to Dr Sarah Gilbert for her assistance and suggestions whilst using the mass spectrometer. The team also thanks all the volunteers who gave their hair for use during the experiment. In addition, the project team would like to acknowledge the software support from the University of Adelaide, including Microsoft and Matlab, as well as Iolite from the Microscopy Centre.

Introduction

Motivation

During this project, a study was undertaken on an unsolved murder case. On December 1st, 1948, an unknown man was found deceased on Somerton Beach in Adelaide [1]. From then on, he was labelled ‘the Somerton Man’. No form of identification was found on him, and there is little information surrounding his death. The cause of his death is therefore still unknown to this day [2]. Figure 1 shows the deceased man.

Figure 1 The Somerton Man

A piece of paper with the words “Tamám Shud” printed on it was found rolled up inside his trousers pocket, as seen in Figure 2. The phrase can be translated from Persian as either “it is ended” or “it is finished”. This piece of paper was found to be part of a poetry book, the ‘Rubaiyat of Omar Khayyam’ [3]. The theme of the book is that one should live life to the full and have no regrets when it ends [4].

Figure 2 The Scrap of Paper

Because of this scrap of paper, the book is speculated to be related to the dead man, and the case is therefore known as the Tamam Shud case. Since the early stages of the police investigation, it has been considered to be “one of Australia's most profound mysteries” [6]. Capital letters were found scribbled in the back cover of the Rubaiyat of Omar Khayyam, as seen in Figure 3. These letters may be vital to the case, as it is speculated that they are a form of code or cipher.

Figure 3 The Mysterious Code

The code consists of five lines of capital letters, with a total of fifty letters altogether. The second and fourth lines are very similar in appearance. It is therefore believed that the Somerton Man may have made an error with the encryption when writing the second line, which is why it is struck out. It is unclear whether some of the letters are an “M” or a “W” [7]. For tasks referring to the code, it is assumed that the unclear letter is an “M”. His body was found near the Morphettville Racecourse, which leads to the belief that the letters of the mysterious code are related to horse names. It was also noted that the people who found the body of the Somerton Man were racehorse jockeys [2]. An Inductively Coupled Plasma Mass Spectrometer (ICP-MS) is used in this project to identify the isotopes in various people’s hair. The main element of concern is strontium, as Adelaide has high levels of strontium in the soil compared to the rest of Australia.

Previous Studies/Related Work

Australian Department of Defence

In 1978, journalist Stuart Littlemore sent a request to the Department of Defence cryptographers to analyse the code. The cryptographers were unable to crack the code and could not provide a satisfactory answer. They stated that the code had “insufficient symbols”, or that the meaningless code was generated by a “disturbed mind” [8].

The University of Adelaide project groups

There have been several Honours project groups at the University of Adelaide that have undertaken this project. The previous work these project groups have done includes:

  • Letter frequency analysis in different languages.
  • Initial letter and sentence letter probabilities.
  • The likelihood of the code being an initialism of a poem.
  • Different cipher techniques.
  • The design and implementation of a web crawler.
  • 3D generated reconstruction bust of the Somerton Man.

The main conclusions that these project groups have reached are:

  • It is unlikely that the mysterious code is created randomly.
  • There is strong evidence to believe the mysterious code is most likely to be in English.
  • It is not likely that the mysterious code consists of initialisms extracted from poems.
  • The Rubaiyat of Omar Khayyam was not used as a straight substitution one-time pad for encryption.
  • The Rubaiyat of Omar Khayyam was not created as a one-time pad for the mysterious code.

With these conclusions in mind, this project will look in further detail at what the mysterious code is [9] [10] [11] [12] [13] [14] [15].

Mass Spectrometer Previous Work

Project groups in previous years have also carried out studies with a mass spectrometer. The 2013 project group obtained some of the Somerton Man’s hair and plotted the different elements in the hair in comparison with control samples. Analysis was conducted on the differences in elements between the two sets of samples. This was done using a glass slide, which has impurities in it [13]. The 2016 project group repeated the analysis using a quartz slide, which does not have impurities. They concluded that the Somerton Man’s hair had abnormally high readings of some elements, one of which is strontium [15]. In this project, the strontium level will be examined in more detail and used to indicate how long the Somerton Man was in Adelaide before his death.

Aims and Objectives

The first task is to determine whether the letters of the mysterious code are the initials of a collective object (horse names, Adelaide street names, Australian beaches, etc.). This will be done using hypothesis testing. An extension of this task, which involves the Rubaiyat of Omar Khayyam, will also be completed. The second task uses a mass spectrometer. Control hair samples will be compared with the Somerton Man’s hair, to estimate how long he was in Adelaide before his death from the elements found in the hair. The third task uses DNA data. The data will be degraded using software tools until it becomes unidentifiable. This can then be used to estimate how much DNA is needed from the Somerton Man before further research can be undertaken.

Technical Background

P-value Theorem

A p-value is the probability of observing an effect equal to or larger than the one actually observed, assuming that the null hypothesis is true. In statistics, the p-value helps to determine the significance of the sampled results with respect to a statistical hypothesis, by indicating how likely such results would be if they arose by chance alone rather than from the experimental conditions. It therefore measures the strength of the evidence against the null hypothesis [16]. In this project, the main use of the p-value is to determine whether the mysterious code represents local horse names. The null hypothesis is ‘The group of letters are horse names’ and the alternative hypothesis is ‘The group of letters are not horse names.’ If the p-value is larger than 0.05, the null hypothesis is not rejected; this indicates that the observed data point is located in the ‘most likely observation’ range, as seen in Figure 4. If the p-value is lower than 0.05, the collected results are statistically significant and the observed data point is located in the ‘very unlikely observations’ range. In that case the null hypothesis can be rejected, which means that the mysterious code is unlikely to be horse names [17].

Figure 4. Computation of a p-value

In Figure 4, the y-axis is the probability and the x-axis is the set of possible results.
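The decision rule described above can be illustrated with a short sketch. This is Python rather than the Matlab and Excel workflow used in the project, and it is only one possible way to obtain a p-value: it assumes a chi-square statistic and a Monte Carlo simulation, neither of which is stated in this report. The two-letter reference distribution and the function names are hypothetical and exist purely for illustration.

```python
import random
from collections import Counter


def chi_square(observed, expected_probs, n):
    """Pearson chi-square statistic of observed counts against expected probabilities."""
    return sum(
        (observed.get(letter, 0) - n * p) ** 2 / (n * p)
        for letter, p in expected_probs.items()
        if p > 0
    )


def monte_carlo_p_value(code_letters, reference_probs, trials=2000, seed=0):
    """Estimate P(statistic >= observed statistic) when the null hypothesis is true."""
    n = len(code_letters)
    if n == 0:
        raise ValueError("code_letters must not be empty")
    # A code letter that never occurs in the reference is impossible under the null.
    if any(reference_probs.get(letter, 0) == 0 for letter in code_letters):
        return 0.0
    rng = random.Random(seed)
    letters = [letter for letter, p in reference_probs.items() if p > 0]
    weights = [reference_probs[letter] for letter in letters]
    observed_stat = chi_square(Counter(code_letters), reference_probs, n)
    hits = 0
    for _ in range(trials):
        # Draw a random "code" of the same length from the reference distribution.
        sample = rng.choices(letters, weights=weights, k=n)
        if chi_square(Counter(sample), reference_probs, n) >= observed_stat:
            hits += 1
    return hits / trials


# Toy two-letter reference distribution (an assumption, for illustration only).
reference = {"A": 0.5, "B": 0.5}
p_balanced = monte_carlo_p_value("ABABABABAB", reference)
p_skewed = monte_carlo_p_value("AAAAAAAAAA", reference)
for label, p in (("balanced", p_balanced), ("skewed", p_skewed)):
    print(label, p, "reject null" if p < 0.05 else "do not reject null")
```

In this toy case the balanced sequence matches the reference exactly, so the null hypothesis is not rejected, while the all-A sequence falls in the ‘very unlikely observations’ range.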

Mass Spectrometer

A mass spectrometer is an analytical instrument that can detect unknown compounds within a given sample. It ionises the sample, separates the resulting ions by their mass-to-charge ratio and records the quantity of each ion type [19]. The components of a typical mass spectrometer are shown in Figure 5. The three major components are the ion source, the analyser and the ion detector system.

Figure 5. The Main Components of a Mass Spectrometer [19]

The ion source produces gaseous ions from the sample. The analyser then sorts the ions according to their mass-to-charge ratio. The ion detector system detects the sorted ions, records the quantity of each ion type and converts it into an electrical signal [20]. The Inductively Coupled Plasma Mass Spectrometer (ICP-MS) is the type of mass spectrometer that was used in this project. The ICP-MS is faster, more precise and more sensitive at detecting ions than other types of mass spectrometer [21]. In this project, the sample being analysed is the shaft of the hair.

Single Nucleotide Polymorphism (SNP)

Single Nucleotide Polymorphisms (SNPs) are one of the most common and widely discussed types of genetic variation between humans [22]. Each variation occurs at a single nucleotide (a single building block of DNA), and SNPs occur roughly once in every 300 base pairs [23]. In this project, SNPs will be removed from a DNA sample.

Knowledge Gaps and Technical Challenges

The technical challenges encountered in this project are related to the knowledge gaps described here. To complete each task within the project, further development of programming skills, such as Matlab, was required. P-value calculation and hypothesis testing needed some revision, to ensure a satisfactory level of understanding of the concepts. The ability to use Microsoft Excel to perform statistical analysis for the p-value was also required. Finally, it was necessary to learn how to correctly use a mass spectrometer and interpret the results, which was addressed by building up knowledge from multiple sources before the trials.

Task 1: Code Analysis

Aim

The aim of this task is to determine whether the mysterious code represents some collective object. The collective objects that will be used are horse names, Australian beaches and cities, South Australian street names, and The Rubaiyat of Omar Khayyam book. The assumption is made that each letter in the mysterious code is the initial of a word. The Somerton Man had several associations with horses, so further research will go into the assumption that the mysterious code represents ‘horse names’. The null hypothesis is ‘The group of letters are horse names’ and the alternative hypothesis is ‘The group of letters are not horse names.’

Preparation

Before starting this task, the details of the whole case were reviewed. This review covered who found the body, where it was located and when it was discovered. This gave a better understanding of which types of collective object should be researched, namely those stated above. As mentioned before, horse names are a major candidate because his body was found near Morphettville Racecourse, as seen in Figure 6.

Figure 6. Location of Somerton Beach and Morphettville Racecourse

It should also be noted that the people who found the body were racehorse jockeys. This is why it was assumed that the letters of the mysterious code are the initials of horse names from the year 1948.

Method

To determine whether the mysterious code represents a collective object, the p-value was calculated and hypothesis testing was applied. In the case of the horse names, no website directly provided horse names from 1948, so the names were gathered from relevant newspapers and articles. This was done using ‘Trove’, a search engine for Australian resources, which in this case was used to obtain articles and newspapers from 1948. The other collective objects, as mentioned above, were found using South Australian government websites. Extensive cross-checking was carried out to make sure that each list to be used was correct. Matlab was used as the software tool. Only the initial letter of each collective object was needed, and code was written to extract it. Where a collective object had multiple words, the initial of each word was included. Any extra punctuation was removed. A demonstration is shown in Figure 7, which illustrates the input and output results for the horse names case and also displays the frequency of each letter.

Figure 7. Matlab Results

Excel was used to produce the statistical results from the Matlab output. The letter frequency was obtained by dividing the count of each letter by the total number of letters, after which a p-value test was performed and a comparison graph was produced.
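The initial-letter extraction and frequency calculation described above can be sketched as follows. This is a minimal Python illustration, not the Matlab code or Excel workbook used in the project; the function names and the example names are hypothetical placeholders, not entries from the 1948 lists.

```python
import string
from collections import Counter


def initials(name):
    """Return the upper-case initial of every word in a name, ignoring punctuation."""
    cleaned = "".join(ch for ch in name if ch.isalpha() or ch.isspace())
    return [word[0].upper() for word in cleaned.split()]


def initial_counts(names):
    """Count how often each initial letter occurs across a list of names."""
    counts = Counter()
    for name in names:
        counts.update(initials(name))
    return counts


def relative_frequencies(counts):
    """Divide each letter count by the total number of initials (A to Z)."""
    total = sum(counts.values())
    return {
        letter: (counts.get(letter, 0) / total if total else 0.0)
        for letter in string.ascii_uppercase
    }


# Placeholder names for illustration only.
example_names = ["Red Star", "Silver Gull", "Sea-Breeze", "O'Malley's Pride"]
counts = initial_counts(example_names)
freqs = relative_frequencies(counts)
print(counts.most_common(3))
print(round(freqs["S"], 3))
```

Multi-word names contribute one initial per word, and punctuation is stripped before splitting, which mirrors the rules stated in the Method. Note that removing a hyphen joins the two parts into a single word, so a hyphenated name contributes one initial in this sketch.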

Results

Each collective object was compared to the mysterious code by the frequency of each letter. In each graph, the x-axis represents the letters of the alphabet and the y-axis represents the letter frequency for the two objects being compared. A p-value test was also completed to verify the results, where a p-value of less than 0.05 shows that it is very unlikely that the mysterious code represents the collective object.

Horse Names

The comparison of horse names to the mysterious code is seen in Figure 8.

Figure 8. Comparison of Mysterious Code with Horse Names

There was a sample of 69 horse names. It can be seen on the graph that, for many letters, the horse names do not correlate with the mysterious code. This was supported by the p-value, which was lower than 0.05, so the null hypothesis is rejected.

Australian Beaches

The comparison of Australian beaches to the mysterious code is seen in Figure 9.

Figure 9. Comparison of Mysterious Code with Australian Beach Names

There was a sample of 114 beach names. From the graph, it can be seen that the letter frequencies do correlate with the mysterious code. As the match appeared genuine, a hypothesis test was carried out between the two sets of values. The test gave a p-value greater than 0.05, so the hypothesis that the mysterious code represents Australian beach names cannot be rejected.

South Australia Street Names

The comparison of South Australian street names to the mysterious code is seen in Figure 10.

Figure 10. Comparison of Mysterious Code with South Australian Street Names

There was a sample of 447 South Australian street names. From the graph, it can be seen that the letter frequencies are not similar to those of the mysterious code. This was supported by the p-value, which was lower than 0.05.

Australian Cities

The comparison of Australian city names to the mysterious code is seen in Figure 11.

Figure 11. Comparison of Mysterious Code with Australian City Names

There was a sample of 90 Australian city names. From the graph, it can be seen that the frequencies of some letters are similar to those of the mysterious code. A hypothesis test was then carried out to check the results. The p-value obtained was less than 0.05.

The Rubaiyat of Omar Khayyam book

The comparison of the Rubaiyat of Omar Khayyam book to the mysterious code is seen in Figure 12.

Figure 12. Comparison of Mysterious Code with The Rubaiyat of Omar Khayyam book

There was a sample of 852 words from the book. From the graph, it can be seen that the letter frequencies are not similar to those of the mysterious code. This was supported by the p-value, which was lower than 0.05. An extension of this task was also carried out, which involved analysing The Rubaiyat of Omar Khayyam book more carefully. Previous years’ groups stated that the mysterious code does not correlate with the book. Each paragraph in the book has four lines of words (see Figure 13), and the mysterious code, excluding the struck-out line, also has four lines. Still assuming that each letter in the mysterious code is the initial of a word, the two can be compared.

Figure 13. A page from The Rubaiyat of Omar Khayyam

The task was to count how many words are in each line of the book and compare the counts with the mysterious code. In the first paragraph in Figure 13 (outlined with a red square), the first line has 9 words, the second line has 7 words, and the last two lines have 8 words each. By comparison, lines 1 to 4 of the mysterious code have 9, 11, 11 and 13 letters respectively. The first paragraph alone already suggests that there may not be a correlation between the mysterious code and the book. Counting every line by hand would be very time consuming, so a text file of The Rubaiyat of Omar Khayyam was used in conjunction with Matlab to count the words in each line. Then, using Excel, the word counts were plotted with error bars against the number of letters in each line of the mysterious code. This can be seen in Figure 14.

Figure 14. Error Bars against the book

The x-axis represents the line number within the paragraph and the y-axis represents the number of words in that line. It can be seen that for line 1, the mysterious code lies within the error bars. For the remaining lines, it lies outside the error bars. This indicates that the mysterious code is not derived from The Rubaiyat of Omar Khayyam book and further supports the previous years’ findings that the book is not the source of the mysterious code.
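The word-count comparison above can be sketched in a few lines. This is a Python illustration rather than the Matlab and Excel procedure used in the project. It assumes that the text file separates four-line paragraphs with blank lines and that the error bars represent one sample standard deviation about the mean; neither detail is stated in this report. The placeholder text simply repeats the word "w" to produce lines of a chosen length.

```python
import re
import statistics


def words_per_line(text):
    """Return a list of [n1, n2, n3, n4] word counts, one entry per four-line paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line for line in block.splitlines() if line.strip()]
        if len(lines) == 4:  # skip headings or incomplete paragraphs
            paragraphs.append([len(line.split()) for line in lines])
    return paragraphs


def line_statistics(paragraphs):
    """Mean and sample standard deviation of the word count at each line position."""
    stats = []
    for position in range(4):
        counts = [p[position] for p in paragraphs]
        spread = statistics.stdev(counts) if len(counts) > 1 else 0.0
        stats.append((statistics.mean(counts), spread))
    return stats


def within_error_bars(code_lengths, stats):
    """True where a code line length lies within mean +/- one standard deviation."""
    return [abs(n - mean) <= spread for n, (mean, spread) in zip(code_lengths, stats)]


def fake_paragraph(counts):
    """Build placeholder text with the given number of words on each line."""
    return "\n".join(" ".join(["w"] * n) for n in counts)


# Letters per line of the mysterious code, as given in the text.
CODE_LINE_LENGTHS = [9, 11, 11, 13]

sample_text = fake_paragraph([9, 7, 8, 8]) + "\n\n" + fake_paragraph([8, 8, 7, 9])
paragraphs = words_per_line(sample_text)
flags = within_error_bars(CODE_LINE_LENGTHS, line_statistics(paragraphs))
print(paragraphs, flags)
```

With real data, `sample_text` would be replaced by the contents of the book's text file.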

P-value

A summary of the p-values is shown in Table 1.

Table 1. P-value Results

It can be observed that the only collective object with a p-value above 0.05 is Australian beaches, which indicates that the mysterious code could represent Australian beaches.

Conclusion

Overall, the results show that the stated hypothesis, that the mysterious code represents horse names, is unlikely to be true. They also show that the code is unlikely to represent South Australian street names, Australian city names or The Rubaiyat of Omar Khayyam book, even after more extensive analysis of the book. Somewhat surprisingly, it is possible that the mysterious code represents Australian beach names, as the p-value was above 0.05. Future research will be required to obtain more meaningful results.

Task 2: Hair Analysis

Background Theory

Analysing the elements in hair can reveal a great deal of useful information about a person’s recent life, including lifestyle imbalances, living environment and dietary problems. The mineral levels in hair are also about ten times more robust than those in blood [26]. These results could reveal the last place the Somerton Man had been, or even the last activity he had done, which could provide useful evidence towards solving this case. Knowing how fast hair grows is very important for this project. Hair grows by about 1 cm per month [27], and the newest hair is at the root.

Aim

The aim of this task is to identify the different isotopes present in several different people’s hair. More specifically, the element of concern is strontium, as Adelaide has high levels of strontium in the soil compared to the rest of Australia. With this knowledge, the task is to test hair samples from people who have left Adelaide within the past month and compare them to hair samples from people who have not left Adelaide for at least a year, to see how the strontium values change. These results will then be compared with the strontium levels in the Somerton Man’s hair, which can help determine how long he was in Adelaide before his death. The ICP-MS is used to determine the different isotopes within the hair and returns a spectral analysis of the hair. The spectral analysis is carried out by laser ablation, where the hair is ablated with a laser and the spectral elements are recorded.

Preparation

Hair Elements

Before the ICP-MS could be used to analyse the hair samples, the isotopes for the ICP-MS to detect had to be chosen. Research was completed to find the most common chemical elements inside human hair, which included carbon, hydrogen, oxygen, sulphur, phosphorus and zinc. Further research then identified additional elements for the ICP-MS to detect, which included toxic elements and elements common in food. In total, 24 isotopes were recorded by the ICP-MS and are shown in Table 2.

Table 2. Isotopes chosen for the ICP-MS

Hair Samples

In this project, hair samples were obtained from five different individuals, for more reliable and accurate results. Two of these samples were from people who left Adelaide for a period of time and the other three were from people who stayed in Adelaide. All the samples are anonymous for privacy reasons and are identified as A, B, C, D and E. A summary of each hair sample is shown in Table 3, which includes the sample identifier, sex, the date the hair was obtained and a brief description of the sample.

Table 3. Summary of the samples

The length of time spent out of Adelaide is very important, as it determines the length of hair that grew in Adelaide and the length of hair that grew outside of Adelaide. Sample A left Adelaide for 7 days; assuming there are 30 days in one month and hair grows 1 cm per month, 2.33 mm of hair grew in Japan. Similarly, sample E left Adelaide for 13 days, so 4.33 mm of hair grew in Bali. The 2 days between their return to Adelaide and the date the hair was obtained also have to be taken into consideration, which corresponds to 0.67 mm. An example of the length of the hairs is shown in Figure 15.

Figure 15. Example of the hair length

The red section shows the 2 days between both samples returning to Adelaide and the hair being obtained. The orange section shows the length of hair for sample A and the green section shows the length of hair for sample E that grew while each was away from Adelaide.
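The hair-length arithmetic above follows from a simple proportion, sketched here in Python. The growth rate (1 cm per month) and the 30-day month are the assumptions stated in the text; the function names, and the choice to measure distance from the root end, are illustrative only.

```python
GROWTH_MM_PER_MONTH = 10.0  # 1 cm per month, as assumed in the text
DAYS_PER_MONTH = 30         # simplification used in the text


def growth_mm(days):
    """Length of hair (in mm) grown over the given number of days."""
    return days * GROWTH_MM_PER_MONTH / DAYS_PER_MONTH


def away_segment_mm(days_away, days_since_return):
    """Start and end (in mm from the root end) of the hair grown while away."""
    start = growth_mm(days_since_return)
    return start, start + growth_mm(days_away)


print(round(growth_mm(7), 2))   # sample A, 7 days away
print(round(growth_mm(13), 2))  # sample E, 13 days away
print(round(growth_mm(2), 2))   # 2 days between return and sampling
print([round(x, 2) for x in away_segment_mm(7, 2)])
print([round(x, 2) for x in away_segment_mm(13, 2)])
```

Under these assumptions, the hair grown away from Adelaide ends about 3 mm from the root end for sample A and about 5 mm from the root end for sample E.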

Quartz Slide

The hair samples were placed on a slide made of pure quartz. A pure quartz slide was used rather than an ordinary glass slide because glass slides contain many impurities that would contaminate the results, whereas a pure quartz slide does not. The elements in a glass slide and a quartz slide are shown in Table 4.

Table 4. Comparison of elements in glass and quartz slide

The hair samples were stuck down on the quartz slide with double-sided sticky tape. The tape introduces some contamination, which has to be dealt with appropriately.

Mass Spectrometer Experiment

Capturing the data

The quartz slide was placed into two stabilisers, which were then installed into the machine that performed the laser ablation on the hair samples. Figure 16 displays the quartz slide with the hair samples installed in the laser ablation system, while Figure 17 displays the enlarged view of the quartz slide on the monitor.

Figure 16. Hair samples installed into the laser ablation system
Figure 17. Laser ablation system with monitor

Next to the laser ablation system there are two monitors, which are used for the laser ablation task. The left monitor records the data from the laser ablation and the right monitor controls the laser ablation system, as can be seen in Figure 18. The next step was to choose how many spots on the hair to ablate and the distance between these spots. A spot is a point where the laser ablates the hair and the isotopes are recorded for that point. We chose to ablate about half the length of the hair (1 cm), which gave 11 spots about 800 µm apart; two of these spots were located in the root. The mass spectrometer then records the chosen isotopes. The complete set-up of the experiment is shown in Figure 19, with the laser ablation system on the left and the mass spectrometer on the right.

Figure 18. Recording the data (left) and controlling the laser ablation machine (right)
Figure 19. Agilent 7900x mass spectrometer with attached New Wave NWR213 laser ablation system

Processing the Data

The data from the ICP-MS were then loaded into a software package called Iolite, which was used to manage the data and remove any anomalies [29]. The overall waveform of the isotopes in the hair sample for several spots can be seen in Figure 20. An enhanced, zoomed-in view of one of the spots is shown in Figure 21.

Figure 20. Waveform of the isotopes in several spots
Figure 21. Enhanced and zoomed into a spot

The composition of the waveform is briefly explained here. Each waveform is an individual isotope, and the top waveform is the average of all the other waveforms. For the first 20 seconds the waveform is a relatively flat line; this is the laser ablation system calibrating itself, and this information is not useful. The waveform then peaks (marked with a square box), which is when the laser first hits the surface of the hair. After that the waveform is flat again, which is when the laser is ablating the inside of the hair. This is the important information. In some cases the waveform peaks again at the end, which is the laser ablating through the hair and onto the sticky tape and quartz slide. To compare the results more precisely, two sets of data were captured: the surface of the hair and the inside of the hair. Keeping only these results meant that there was no contamination from the sticky tape or the calibration set-up. Lastly, the results were exported into one comprehensive Excel file with all the necessary data; a portion of the data can be seen in Figure 22.

Figure 22. Portion of the data
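The separation of a spot's waveform into a surface part and an inside part can be pictured with the short sketch below. It is only an illustration under stated assumptions and is not how Iolite works internally: the function name, the boundary times passed in and the synthetic waveform are all hypothetical, and only the 20 second lead-in comes from the description above.

```python
from statistics import mean


def surface_and_interior(times, counts, interior_start, interior_end,
                         surface_start=20.0):
    """Return the mean signal on the hair surface and inside the hair.

    times, counts  : equal-length lists for one isotope at one spot
    surface_start  : end of the flat lead-in (about 20 s in the text)
    interior_start : hypothetical time at which the surface peak has passed
    interior_end   : hypothetical time at which the laser reaches the tape
    """
    pairs = list(zip(times, counts))
    surface = [c for t, c in pairs if surface_start <= t < interior_start]
    interior = [c for t, c in pairs if interior_start <= t < interior_end]
    return mean(surface), mean(interior)


# Synthetic waveform, made up for illustration:
# flat lead-in, surface peak, interior plateau, peak from tape and slide.
times = list(range(60))
counts = [5] * 20 + [90] * 5 + [40] * 25 + [70] * 10

surface_mean, interior_mean = surface_and_interior(
    times, counts, interior_start=25, interior_end=50)
print(surface_mean, interior_mean)
```

Samples before the surface window and after the interior window are simply ignored, which mirrors the idea of discarding the lead-in and the signal from the tape.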

Data Analysis

Analysis of the data showed that several isotopes were present in the hair; these isotopes are listed in Table 5.

Table 5. Isotopes found in the hair

Strontium

The comparison of the strontium in the different samples can be seen in Figure 23.

Figure 23. Strontium in different samples

The x-axis represents the distance along the ablated hair and the y-axis is the amount of strontium in the hair at that distance. Samples A and E are the samples that left Adelaide and came back. For sample E, about the first 4000 µm of hair grew while they were in Bali, except for the very first point, which grew in Adelaide. In the figure, sample E has two high peaks in strontium level, one at about 1500 µm and the other at about 3200 µm, which indicates that strontium levels are high in Bali. The remaining strontium values, from before sample E left Adelaide, are low. For sample A, about the first 2000 µm grew while they were in Japan, except for the very first point, which grew in Adelaide. The figure shows a peak at about 1500 µm, indicating higher strontium values than when sample A was in Adelaide. Overall, these results indicate that strontium levels in Adelaide are lower than those in Bali and Japan.

Conclusion

The strontium level is lower in Adelaide than in Bali and Japan. This can be related to the Somerton Man, as previous groups performed ICP-MS on his hair and obtained a high strontium reading. This indicates that the Somerton Man must have visited another country with a high strontium level before he visited Adelaide.

Task 3 DNA Analysis

Background Theory

Deoxyribonucleic acid (DNA) is the hereditary material in humans and all other organisms. It is mostly located in the cell nucleus, but a small portion can be found in the mitochondria [30]. DNA is made up of four chemical bases, adenine (A), guanine (G), cytosine (C) and thymine (T), and the information in DNA is stored as a code made up of these bases [30]. More than 99% of these bases are the same between individuals; the remaining variation in DNA determines the different personal characteristics an individual has [30]. A variation at a single base position is called a single-nucleotide polymorphism (SNP), which is an important aspect of this task.

Aim

The aim of this task is to discover how much DNA can be degraded before it becomes unidentifiable. This information will also be used to indicate how much DNA is needed from the Somerton Man. The DNA is degraded by removing SNPs from a DNA sample.

Preparation

Obtaining DNA data

A DNA sample kit was ordered from an ancestry testing service called 23andMe. This is a privately held personal genomics and biotechnology company that uses saliva to test DNA [31]. The DNA sample was tested and the results were received. These can be seen in Figure 24.

Figure 24. DNA results from 23andMe

Figure 24 shows details of the DNA report. It can be seen that this person is 65.1% Eastern European, which indicates that at least one of their parents is European. Other useful pieces of information are the Maternal Haplogroup and the Paternal Haplogroup. A haplogroup is a group of similar haplotypes that share a common ancestor but have different SNPs [32]. The Maternal Haplogroup traces back to a single common ancestor and can be used for both males and females; in this case it is H11. The Paternal Haplogroup uses the Y chromosome to define a specific set of shared mutations. It can only be used for males, as females do not have a Y chromosome; in this case it is R-CTS11962.1.

Raw Data

23andMe also provided the raw data of the DNA sample, and this document was the basis of this task. Figure 25 shows a portion of the raw data.

Figure 25. A segment of the raw data

Each line in the raw data represents a single SNP. The SNPs are removed at random. The raw data has over 600,000 lines.
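To make the one-SNP-per-line structure concrete, the sketch below reads a few lines into tuples. It assumes the usual 23andMe raw-data layout of four tab-separated columns (rsid, chromosome, position, genotype) with header lines starting with '#'; the identifiers in the example are made up and the function name is hypothetical.

```python
def parse_snp_lines(lines):
    """Parse raw-data lines into (rsid, chromosome, position, genotype)."""
    snps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chromosome, position, genotype = line.split("\t")
        snps.append((rsid, chromosome, int(position), genotype))
    return snps


# Made-up example lines, for illustration only
example = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAA",
    "rs0000002\t1\t2000\tCT",
]

for snp in parse_snp_lines(example):
    print(snp)
```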

DNA database

After the raw data had been analysed and understood, it was imported into a DNA database website called ‘GEDmatch Genesis’ to compare it with other DNA samples. A fraction of the results for the original sample, as displayed by the database, is shown in Figure 26.

Figure 26. Fraction of GEDmatch Genesis database results

For privacy reasons the names and email addresses could not be displayed in the figure above. Each column in Figure 26 is explained here. ‘Kit’ represents different DNA samples from various individuals. A segment refers to a section or block of contiguous SNPs. A matching segment is a section that is the same between two people. ‘Largest Seg’ is the largest matching segment. ‘cM’ (centimorgan) is a measure of genetic linkage; it is a measure of DNA information within a chromosome. ‘Gen’ provides a rough estimate of the number of generations back to the ancestors that both matches share, where 3 indicates that the two candidates share the same great-grandparents, 4 indicates that they share the same great-great-grandparents, and so on. ‘Overlap’ is the number of positions that exist in common between both kits, without regard to whether they match or not. The amount of overlap, along with the largest cM amount, is usually a good indication of the relative quality of the match. Matches with low overlap are highlighted with a pink or red background, depending on the overlap value. ‘Date compared’ is the date the DNA sample was processed into the database. ‘Testing company’ is where the DNA sample was tested.

DNA Degradation

Degradation in this case refers to the random removal of SNPs from the raw data. SNPs are removed from the DNA at different percentages, and the results can then be used to discover how much DNA can be removed before it becomes unidentifiable. Matlab was used to remove different percentages of SNPs from the sample. Figure 27 shows the code that was used to remove the SNPs from the raw data.

Figure 27. Matlab code to remove SNPs

How the code works is briefly explained here. The code reads the raw data (which is a text file). A different portion of the raw data is shown in Figure 28, which is used in the demonstration.

Figure 28. A portion of the raw data

The raw data has 638,468 lines, and the code removes a certain percentage of those lines (the code is currently set to remove 90% of the lines). The next part of the code deletes the blank space left by the removed lines. The blank spaces needed to be removed before the data could be imported into GEDmatch Genesis. A demonstration of how the code works is shown in Figure 29.

Figure 29. A portion of the raw data

Ten lines are present in the raw data (left image), then two lines are removed (middle image), and lastly the blank spaces left by the removed lines are also removed, leaving only eight lines in total (right image). Finally, the code converts the results back into a text file, which can be imported into GEDmatch Genesis. SNPs were removed in steps from 10% to 90%, and the whole set of experiments was carried out five times to obtain an average result and reduce the effect of any outliers. Two different tasks were completed using the degraded DNA. One task involved heritage analysis and the other involved comparing the DNA using the database.
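The procedure described above can be sketched in a few lines. The version below is written in Python rather than Matlab and is not the code shown in Figure 27; it is a minimal sketch that assumes header lines start with '#', and the function name, the seed argument and the example data are hypothetical. Lines that survive keep their original order, and no blank lines are left behind.

```python
import random


def degrade(lines, fraction, seed=None):
    """Return the lines with about `fraction` of the SNP lines removed.

    Header lines (assumed to start with '#') are kept; the surviving SNP
    lines keep their original order and no blank lines remain.
    """
    header = [ln for ln in lines if ln.startswith("#")]
    snps = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    keep_count = len(snps) - round(len(snps) * fraction)
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    keep = sorted(rng.sample(range(len(snps)), keep_count))
    return header + [snps[i] for i in keep]


# Ten made-up SNP lines, mirroring the ten-line demonstration
raw = ["# header"] + [f"rs{i}\t1\t{1000 * i}\tAA" for i in range(1, 11)]
degraded = degrade(raw, 0.2, seed=1)
print(len(degraded) - 1, "SNP lines kept")
```

Writing the returned lines back to a text file, one per line, would give a file in the same layout as the input. Repeating the call with different seeds corresponds to repeating the experiment several times.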

Heritage

The first task was to investigate the heritage of the DNA sample. Various genetic ancestry projects could have been chosen, so each of them was researched to make sure that the results would be as accurate as possible. Eurogenes seemed the most appropriate, considering that the sample is of European descent. The next step was to select the most appropriate model for calculating the heritage in the DNA. Again, significant research was completed and the chosen model was Eurogenes K13, as this model is best for samples with mixed heritage. The heritage results for the original case (no SNPs removed) can be seen in Figure 30, which shows that the sample has strong heritage in the North Atlantic, Baltic and West Mediterranean regions.

Figure 30. Heritage results

The analysis considered only the North Atlantic and Baltic regions, as these had the highest percentages, 29.13% and 42.11% respectively. The other cases, with 10%-90% of SNPs removed, were then completed. A graph was produced to visualise the effect of removing the SNPs, which can be seen in Figure 31.

Figure 31. Heritage test, with SNPs removed

The x-axis represents the percentage of SNPs removed from the sample and the y-axis is the percentage of that heritage in the sample. The graph shows that the results are relatively steady up to about 60% and then fluctuate from 70% onwards. This suggests that removing more than 70% of the SNPs leads to inaccurate results. A single test is not enough to conclude this, so five more tests were completed to see whether the trend was similar in each case. The figure below shows the average of the heritage tests completed.

Figure 32. Average of the Heritage Test

It can be seen that the average results are all relatively linear, except for some points: North Atlantic at 40%, 80% and 90%, and Baltic at 80% and 90%. The error bars, however, get larger as the percentage of SNPs removed increases. This indicates that the DNA sample starts to lose its structure as SNPs are removed. With 10%-40% of SNPs removed, the standard deviation increases to approximately 1, which shows that the DNA is still robust. With 50%-90% of SNPs removed, the results start to vary significantly, as shown by the large error bars and the much larger standard deviation. It can be concluded from these results that the DNA sample is robust until about 50% of SNPs are removed, and after that the DNA becomes unidentifiable.
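The averages and error bars discussed above come from repeating each removal percentage several times. The sketch below shows one way such a summary could be computed. The numbers are placeholders and not results from this project, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def summarise(runs):
    """Map each removal percentage to (mean, sample standard deviation).

    runs: dict of removal percentage -> list of heritage percentages
          obtained from repeated runs at that removal level.
    """
    return {pct: (mean(vals), stdev(vals)) for pct, vals in sorted(runs.items())}


# Placeholder numbers, not results from the report
runs = {
    10: [42.0, 42.5, 41.5, 42.0, 42.0],
    90: [30.0, 50.0, 45.0, 38.0, 47.0],
}

summary = summarise(runs)
for pct, (avg, sd) in summary.items():
    print(f"{pct}% removed: mean {avg:.2f}, standard deviation {sd:.2f}")
```

A growing standard deviation at higher removal percentages is what appears as longer error bars in a plot of the averages.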

Database

The second task was to compare the DNA sample with a database of other people's DNA samples. Using the database, we were able to efficiently identify how closely related the DNA sample was to other people. Refer back to Figure 26 to see how the database displays the results. The original sample is compared with the cases in which 10%-90% of the SNPs are removed. The objective is to find false positives and false negatives. A false positive in this case means that a kit appears in the results for the degraded sample but not in the results for the original sample, and a false negative means that a kit that appears in the results for the original sample has disappeared from the results for the degraded sample. An example is given below for clarity.

Original: A B C D E
10% of SNPs removed: C D G H I
False positives: G H I
False negatives: A B E

As there are thousands of kits to compare, a sample size of 30 was used for the comparison. Figure 33 shows the false positives and false negatives for different percentages of SNPs removed.

Figure 33. False positives and false negatives

The x-axis represents the percentage of SNPs removed from the sample and the y-axis is the number of false positives or false negatives found. Unfortunately, in this case the numbers of false positives and false negatives were equal. The counts are all fairly high: at 10% removal only 6 kits matched the original case. This indicates that altering the DNA by even a small amount can change the results significantly. As the removal approaches 50% there are more false positives and false negatives, with fewer matches to the original case. At 50% of SNPs removed and above there are 30 false positives and 30 false negatives, which means that no kit matched the original case.
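The comparison in the A to I example can be written as two set differences. The sketch below is only an illustration; the function name is hypothetical, and it assumes that each result list is reduced to a collection of kit identifiers.

```python
def compare_kits(original, degraded):
    """Compare two lists of kit identifiers from the database results."""
    original, degraded = set(original), set(degraded)
    return {
        "matches": original & degraded,          # kits present in both lists
        "false_positives": degraded - original,  # new kits after degradation
        "false_negatives": original - degraded,  # kits lost after degradation
    }


result = compare_kits(["A", "B", "C", "D", "E"], ["C", "D", "G", "H", "I"])
for name, kits in result.items():
    print(name, sorted(kits))
```

When both lists are cut to the same length, as with the sample size of 30 used here, the number of false positives always equals the number of false negatives, because each equals the list length minus the number of matches. Under that assumption, 6 matches at 10% removal correspond to 24 of each.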

Conclusion

The major finding in this task is that when 50% of the DNA is removed it becomes unrecognisable. This was shown in both the heritage task and the database task. It was also found that altering the DNA by only a small amount can result in a very significant change to the results. Relating this back to the Somerton Man, only 50% of the Somerton Man's DNA is needed for further research.

Project Management

Budget

Each project member had $250 for the project, so with two members in the group we had $500 for the entire project. The pure quartz slide was purchased for the mass spectrometer task. The DNA genetic testing kit was purchased so that we could test our DNA. The items purchased and the cost of each item are listed in Table 6.

Table 6. Budget Table

Risk Management

A risk assessment table for the project can be seen in Table 7. One of the major risks encountered was bugs in the code. These caused many issues throughout the project; one of the main issues was that they held up further progress, which led to delays in completing tasks. Another risk was that the Human Ethics application might not be approved. This would have meant that we could not do the mass spectrometer task or the DNA task, so we would have had to do more in-depth analysis on the first task and find additional tasks to do. Slight misunderstandings of project tasks occurred throughout the project, but these were resolved through meetings with the supervisor.

Table 7. Risk Management

Conclusion

The main task was to find out whether horse names represent the mysterious code. The final conclusions of the code analysis task showed that the mysterious code is unlikely to be horse names. It was also found that it is unlikely to be South Australian street names, Australian city names or text from The Rubaiyat of Omar Khayyam. Even with further studies on the book, it was still unlikely to be the source of the mysterious code. However, the results showed that it is possible that the mysterious code is Australian beach names. The second main task was ICP-MS on different hair samples. The main focus was strontium, and two samples from people who left Adelaide for a period of time were compared with three samples from people who stayed in Adelaide. It was found that the level of strontium was higher in Bali and Japan than in Adelaide. This indicates that the Somerton Man must have visited another country with high strontium levels before coming to Adelaide. The last task involved degrading DNA to discover when the DNA would become unidentifiable. This was done by removing SNPs. The main conclusion was that DNA becomes unidentifiable when 50% of the DNA is removed. This could be used to help find the Somerton Man's relatives or his real identity. Overall, the identification of the Somerton Man has come a step closer. Hopefully, with this project completed, other groups can use this information to identify who the Somerton Man is.

Future Project Discussion

Code Analysis

The mysterious code still needs to be deciphered. It is possible that the mysterious code is Australian beach names, so further investigation is required to establish whether this is true. This can be done by increasing the sample size or by locating the beaches the Somerton Man might have visited before his death. More collections of names should also be considered and researched. Using different statistical methods would also enhance the results.

Hair Analysis

Future groups could analyse hair from different countries that the Somerton Man could have visited and compare those results with the strontium levels from this project. To confirm that a hair sample does not contain any outliers, each individual should give two hairs, and the whole hair should be ablated rather than only spots along the hair. This will be more accurate, as there will be more data to process.

DNA Analysis

In the DNA task only a small sample of false positives and false negatives was used, so an algorithm or code should be used to process all the false positives and false negatives from the database. This would give a more accurate result. Another task would be to obtain a second DNA kit from another individual and compare the two. This would show whether the two DNA results become unidentifiable at the same percentage. Extensive research on the case indicates that the Somerton Man has a granddaughter. Future groups can research how the DNA sequence differs between a grandfather and his granddaughter. This could help in understanding the differences between the two DNA sequences. Now that we know that only 50% of the DNA is needed, future groups should obtain the Somerton Man's DNA and undertake further research on it.

References

[1] J. Bineth. “Somerton Man cold case,” ABC news, 14th December 2017 [online]. Available: http://www.abc.net.au/news/2017-12-14/somerton-man-cold-case-could-be-one-step-closer-to-solved/9245512

[2] The News, “Dead Man Found Lying on Somerton Beach,” 1st December 1948, p. 1 [online]. Available: http://trove.nla.gov.au/ndp/del/article/129897161

[3] The Advertiser, “Cryptic note on body,” 9th June 1949, p. 1 [online]. Available: https://trove.nla.gov.au/newspaper/article/36371152

[4] The Advertiser, “New Clue in Somerton Body Mystery,” 25th July 1949, p. 3 [online]. Available: https://trove.nla.gov.au/newspaper/article/36677719

[5] “Tamam Shud/ Somerton Man,” [online]. Available: http://ciphermysteries.com/tamam-shud-somerton-man

[6] The Advertiser, “Tamam Shud,” 10th June 1949, p. 2 [online]. Available: https://trove.nla.gov.au/newspaper/article/36371416

[7] S. Maguire. “Death riddle of a man with no name,” The Advertiser, 9th March 2005, p. 28 [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/advertiser-mar2005.pdf

[8] Inside Story, presented by Stuart Littlemore, ABC TV, screened at 8 pm, Thursday, 24th August, 1978

[9] A. Turnbull and D. Bihari. (2009). Final Report 2009: Who killed the Somerton man? [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man

[10] K. Ramirez and L-V. Michael. (2010). Final Report 2010 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010

[11] S. Maxwell and P. Johnson. (2011). Final Report 2011 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2011

[12] A. Duffy and T. Stratfold. (2012). Final Report 2012 [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2012

[13] L. Griffith and P. Varsos. (2013). Semester B Final Report 2013 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2013_-_Cipher_cracking

[14] N. Gencarelli and J. Yang. (2015). Semester B Final Report 2015 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report/Thesis_2015

[15] Y. Li and Y. Ma. (2016). Semester B Final Report 2016 – Cipher Cracking [online]. Available: https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report/Thesis_2016

[16] B. David. “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol. 468 (3), pp. 885-892, 2010.

[17] G G. L et al. “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol. 87 (5), pp. 1337-1343, 2009.

[18] T. Hung. “The p-value,” 21st March 2016 [online]. Available: https://www.students4bestevidence.net/p-value-in-plain-english-2/

[19] “Mass Spectrometer,” Premier Biosoft [online]. Available: http://www.premierbiosoft.com/tech_notes/mass-spectrometry.html

[20] R. Bakhtiar. “Biological mass spectrometry: A primer,” Mutagenesis, pp. 530-531, 2000.

[21] S. Jackson et al. “The application of laser ablation-inductively coupled plasma-mass spectrometry to in situ U–Pb zircon geochronology,” Chemical Geology, Vol. 211, pp. 47-69, 2004.

[22] “What are single nucleotide polymorphisms?,” U.S. National Library of Medicine [online]. Available: https://ghr.nlm.nih.gov/primer/genomicresearch/snp

[23] G. Shaw. “Polymorphism and Single nucleotide polymorphisms (SNPs),” Science Made Simple, Vol. 112, pp. 664-665, 2013.

[24] Google Maps, 2018. [online]. Available: https://www.google.com.au/maps/dir//Morphettville+Racecourse

[25] E. FitzGerald. “Rubaiyat of Omar Khayyam,” [online]. Available: http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/W&T_rubaiyat_wells_copy.pdf

[26] Dr. W. Lawrence. “HAIR MINERAL ANALYSIS – AN INTRODUCTION,” [online]. Available: https://www.drlwilson.com/articles/HA%20INTRO.htm#CONT

[27] American Academy of Dermatology. “How Fast Does Hair Grow? Tips for Growth,” Healthline, 2018 [online]. Available: https://www.healthline.com/health/beauty-skin-care/grow-hair-faster

[28] L. Tirado-Lee. “The science of curls,” Helix [online]. Available: https://helix.northwestern.edu/blog/2014/05/science-curls

[29] C. Paton, J. Hellstrom, B. Paul, J. Woodhead and J. Hergt. “Iolite: Freeware for the visualisation and processing of mass spectrometric data,” Journal of Analytical Atomic Spectrometry, 2011. doi:10.1039/c1ja10172b

[30] “What is DNA?,” Genetics Home Reference, 2018 [online]. Available: https://ghr.nlm.nih.gov/primer/basics/dna

[31] “About us,” 23andMe [online]. Available: https://mediacenter.23andme.com/about-us/

[32] “Getting started with the haplogroup reports,” 23andMe [online]. Available: https://customercare.23andme.com/hc/en-us/articles/235201447-Getting-Started-with-the-Haplogroup-Reports