==Task 3 DNA Analysis==
===Background Theory===
Deoxyribonucleic acid (DNA) is the hereditary material in humans and almost all other organisms. It is mostly located in the cell nucleus, but a small portion is found in the mitochondria [30]. The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T) [30]. More than 99% of these bases are the same between individuals; the remaining variation determines the characteristics that differ from person to person [30]. A variation at a single base position is called a single-nucleotide polymorphism (SNP), and SNPs are central to this task.

===Aim===
The aim of this task is to discover how much DNA can be degraded before it becomes unidentifiable. This information will also be used to indicate how much DNA is needed from the Somerton Man. DNA degradation is simulated by removing SNPs from a DNA sample.

===Preparation===
====Obtaining DNA data====
A DNA testing kit was ordered from 23andMe, a privately held personal genomics and biotechnology company that analyses DNA from a saliva sample [31]. The sample was tested and the results are shown in Figure 24.
[[File:ZFigure28.jpg|thumb|500px|center|Figure 24. DNA results from 23andMe]]
Figure 24 shows details of the DNA report. This person is 65.1% Eastern European, indicating that at least one of their parents is European. Other useful information is the Maternal Haplogroup and the Paternal Haplogroup. A haplogroup is a group of similar haplotypes that share a common ancestor but have different SNPs [32]. The Maternal Haplogroup traces back to a single common ancestor and can be determined for both males and females; in this case it is H11. The Paternal Haplogroup uses the Y chromosome to define a specific set of shared mutations. It can only be determined for males, as females do not have a Y chromosome, and in this case it is R-CTS11962.1.

====Raw Data====
23andMe also provided the raw data of the DNA sample, and this file was the basis of this task. Figure 25 shows a portion of the raw data.
[[File:ZFigure29.jpg|thumb|500px|center|Figure 25. A segment of the raw data]]
Each line in the raw data represents a single SNP, and the file has over 600,000 lines. SNPs are removed from this file at random.
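
The line-per-SNP layout can be illustrated with a short Python sketch. It assumes the usual 23andMe export layout: header lines beginning with <code>#</code>, followed by tab-separated columns for rsid, chromosome, position and genotype. The function name <code>parse_raw_data</code> and the three records are hypothetical and are not taken from the sample used in this task.

```python
from collections import Counter


def parse_raw_data(text):
    """Split a raw-data export into header comment lines and SNP records.

    Assumes '#' marks a header line and that every other non-blank line
    holds four tab-separated fields: rsid, chromosome, position, genotype.
    """
    header, snps = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line)
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        snps.append((rsid, chromosome, int(position), genotype))
    return header, snps


# Hypothetical records for illustration only (not real sample data).
sample = "\n".join([
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAA",
    "rs0000002\t1\t2000\tCT",
    "rs0000003\t2\t1500\tGG",
])

header, snps = parse_raw_data(sample)
print(len(snps), "SNPs read")
print(Counter(chromosome for _, chromosome, _, _ in snps))
```

Counting the parsed records in this way is one possible check that a file still holds the expected number of SNPs before and after lines are removed.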

====DNA database====
After analysing and understanding the raw data, it was imported into a DNA database website called ‘GEDmatch Genesis’ to compare it with other DNA samples. Figure 26 shows a fraction of the results for the original sample and how the database displays them.
[[File:ZFigure30.jpg|thumb|500px|center|Figure 26. Fraction of GEDmatch Genesis database results]]
For privacy reasons the names and email addresses could not be displayed in the figure above.
Each column in Figure 26 is explained below. ‘Kit’ identifies a DNA sample from an individual. A segment is a section or block of contiguous SNPs, and a matching segment is a section that is the same between two people. ‘Largest Seg’ is the largest matching segment. ‘cM’ (centimorgan) is a unit of genetic linkage, used here to measure the amount of shared DNA within a chromosome. ‘Gen’ provides a rough estimate of the number of generations back to the ancestors that the two matches share, where 3 indicates that the two candidates share the same great-grandparents, 4 indicates that they share the same great-great-grandparents, and so on. ‘Overlap’ is the number of positions that exist in common between both kits, without regard to whether they match or not. The amount of overlap, along with the largest cM amount, is usually a good indication of the relative quality of the match. Matches with low overlap are highlighted with a pink or red background, depending on the overlap value. ‘Date compared’ is the date the DNA sample was processed into the database. ‘Testing company’ is the company that tested the DNA sample.

===DNA Degradation===
Degradation in this case refers to the random removal of SNPs from the raw data. SNPs are removed from the sample at different percentages, and the results are used to discover how much DNA can be removed before the sample becomes unidentifiable. Matlab was used to remove the SNPs, and Figure 27 shows the code that was used.
[[File:soFigure27.jpg|thumb|500px|center|Figure 27. Matlab code to remove SNPs]]
A brief explanation of how the code works follows. The code reads the raw data, which is a text file. A different portion of the raw data, used in the demonstration, is shown in Figure 28.
[[File:soFigure28.jpg|thumb|500px|center|Figure 28. A portion of the raw data]]
The raw data has 638,468 lines, and the code removes a chosen percentage of those lines (as shown, the code removes 90% of them). The next part of the code deletes the blank lines left behind by the removed SNPs, because these must be removed before the data can be imported into GEDmatch Genesis. A demonstration of how the code works is shown in Figure 29.
[[File:soFigure29.jpg|thumb|500px|center|Figure 29. A portion of the raw data]]
The raw data initially contains 10 lines (left image), then 2 lines are removed (middle image), and lastly the blank lines left by the removal are deleted, leaving 8 lines in total (right image).
Lastly, the code writes the result back to a text file, which can be imported into GEDmatch Genesis. SNPs were removed at percentages from 10% to 90%, and the whole set of experiments was carried out five times so that the results could be averaged to reduce the effect of outliers.
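
The removal procedure described above can be sketched in Python. This is not the Matlab code in Figure 27, only a minimal illustration under stated assumptions: header lines start with <code>#</code> and are always kept, the number of lines to remove is rounded to the nearest whole line, and the name <code>degrade_snps</code> and the <code>seed</code> parameter are hypothetical. Because the kept lines are collected into a new list, no blank lines are left behind.

```python
import random


def degrade_snps(lines, fraction_removed, seed=None):
    """Return the lines with a random fraction of the SNP lines removed.

    Header lines (starting with '#') are kept. SNP lines keep their
    original order, and no blank lines remain in the result.
    """
    rng = random.Random(seed)
    header = [line for line in lines if line.startswith("#")]
    snps = [line for line in lines if line.strip() and not line.startswith("#")]
    n_remove = round(len(snps) * fraction_removed)
    removed = set(rng.sample(range(len(snps)), n_remove))
    kept = [line for index, line in enumerate(snps) if index not in removed]
    return header + kept


# Ten hypothetical SNP lines, mirroring the 10-line demonstration above.
raw_lines = ["# rsid\tchromosome\tposition\tgenotype"] + [
    f"rs{n:07d}\t1\t{n * 1000}\tAA" for n in range(1, 11)
]

degraded = degrade_snps(raw_lines, fraction_removed=0.2, seed=1)
print(len(degraded) - 1, "SNP lines remain")  # 8 of the 10 SNP lines are kept

# Writing the result back to a text file (the file name is an assumption):
# with open("degraded.txt", "w") as handle:
#     handle.write("\n".join(degraded) + "\n")
```

Calling the function with a different <code>seed</code> for each repeat gives independent random removals, which is one way the five repeats at each percentage could be generated.
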
Two different tasks were completed using the degraded DNA. One task involved heritage analysis and the other involved comparing the DNA against the database.

====Heritage====
The first task was to investigate the heritage of the DNA sample. Several genetic ancestry projects were available, so each was researched to make sure the results would be as accurate as possible. Eurogenes seemed the most appropriate because the sample is of European descent. The next step was to select the model used to calculate the heritage of the DNA. After further research, Eurogenes K13 was chosen, as this model is best suited to samples with mixed heritage.
The heritage results for the original case (no SNPs removed) are shown in Figure 30. The sample has strong heritage in the North Atlantic, Baltic and West Mediterranean regions.
[[File:soFigure30.jpg|thumb|500px|center|Figure 30. Heritage results]]
The analysis only considered the North Atlantic and Baltic regions, as these had the highest percentages: 29.13% and 42.11% respectively.
The other cases, with 10% to 90% of SNPs removed, were then completed. A graph was produced to visualise the effect of removing the SNPs, which can be seen in Figure 31.
[[File:soFigure31.jpg|thumb|500px|center|Figure 31. Heritage test, with SNPs removed]]
The x-axis represents the percentage of SNPs removed from the sample and the y-axis is the percentage of the sample attributed to that heritage. The graph shows that the results are relatively steady up to about 60% and then fluctuate from 70% onwards. This suggests that removing 70% or more of the SNPs will give inaccurate results. A single run is not enough to conclude this, so five more tests were completed to see whether the trend was similar in each case. Figure 32 shows the average of the heritage tests completed.
[[File:soFigure32.jpg|thumb|500px|center|Figure 32. Average of the Heritage Test]]
The average results are all relatively steady, except for some points at North Atlantic 40%, 80% and 90% and Baltic 80% and 90%. The error bars, however, get larger as the percentage of SNPs removed increases. This indicates that the DNA sample starts to lose its structure as SNPs are removed.
From 10% to 40% of SNPs removed, the standard deviation increases to approximately 1, which shows that the DNA is still robust. From 50% to 90% of SNPs removed, the results start to vary significantly, as shown by the large error bars and the much larger standard deviation.
From these results it can be concluded that the DNA sample is robust until about 50% of SNPs are removed, after which it becomes unidentifiable.
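
The averages and error bars discussed above can be computed as in the following Python sketch. The values are made-up placeholders, not measurements from this task, and the use of the sample standard deviation (<code>statistics.stdev</code>) is an assumption, since the report does not state which estimator was used. The name <code>summarise_runs</code> is hypothetical.

```python
from statistics import mean, stdev


def summarise_runs(runs):
    """Map each removal percentage to (mean, sample standard deviation).

    `runs` maps a removal percentage to the heritage percentages
    obtained from the repeated tests at that removal level.
    """
    return {pct: (mean(values), stdev(values)) for pct, values in sorted(runs.items())}


# Placeholder values for illustration only (NOT the results of this task).
runs = {
    10: [42.0, 42.5, 41.5, 42.0, 42.0],
    50: [40.0, 44.0, 38.0, 46.0, 42.0],
}

summary = summarise_runs(runs)
for pct, (average, spread) in summary.items():
    print(f"{pct}% removed: mean {average:.2f}, standard deviation {spread:.2f}")
```

In this sketch both removal levels have the same mean but different spreads, which is the pattern that error bars reveal and an average alone would hide.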

====Database====
The second task was to compare the DNA sample with a database of other people's DNA samples. The database made it possible to identify efficiently how closely related the sample is to other people. Refer back to Figure 26 to see how the database displays the results.
The results for the original sample are compared with those for the cases with 10% to 90% of SNPs removed. The objective is to find false positives and false negatives. A false positive here is a kit that appears in the results for the degraded sample but not in the results for the original sample. A false negative is a kit that appears in the results for the original sample but is missing from the results for the degraded sample. An example is shown below.
Original: A B C D E
10% of SNP removed: C D G H I
false positive: G H I and false negative: A B E
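
The comparison in the example above amounts to two set differences, which can be sketched in Python as follows. The function name <code>compare_matches</code> is hypothetical, and the kit labels are the ones from the example rather than real kit numbers.

```python
def compare_matches(original, degraded):
    """Compare two lists of matching kits.

    A false positive is a kit found only in the degraded results.
    A false negative is a kit found only in the original results.
    """
    original, degraded = set(original), set(degraded)
    return {
        "matches": sorted(original & degraded),
        "false_positives": sorted(degraded - original),
        "false_negatives": sorted(original - degraded),
    }


result = compare_matches(
    original=["A", "B", "C", "D", "E"],
    degraded=["C", "D", "G", "H", "I"],
)
print(result["false_positives"])  # ['G', 'H', 'I']
print(result["false_negatives"])  # ['A', 'B', 'E']
```

When both lists contain the same number of kits, the two differences always have the same size, because each kit that leaves the list is replaced by exactly one new kit.
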
As there are thousands of kits to compare, a sample of 30 kits was taken for the comparison. Figure 33 shows the false positives and false negatives for different percentages of SNPs removed.
[[File:soFigure33.jpg|thumb|500px|center|Figure 33. False positives and false negatives]]
The x-axis represents the percentage of SNPs removed from the sample and the y-axis is the number of false positives or false negatives found. In this case the number of false positives equalled the number of false negatives at every percentage. This is to be expected when the lists being compared are the same size (30 kits each), because every kit that drops out of the list is replaced by a new one.
The counts are all fairly high: at 10% removal only 6 kits matched the original case. This indicates that altering the DNA by even a small amount can significantly change the matches returned. As the removal approaches 50%, there are more false positives and false negatives and fewer matches to the original case. At 50% or more of SNPs removed there are 30 false positives and 30 false negatives, which means that none of the kits matched the original case.

===Conclusion===
The major finding in this task is that when 50% of the DNA is removed it becomes unrecognisable. This was observed in both the heritage task and the database task. It was also found that altering the DNA by only a small amount can result in a very significant change to the results. Relating this back to the Somerton Man, at least about 50% of his DNA would need to be recovered for further research.