==Task 3: Investigation on ethnicity==

===Aims===
The first aim of this task is to investigate the ethnicity of the Somerton Man. As described in the previous section, the quality of the Somerton Man's DNA data is low, so the second aim is to assess how reliable an ethnicity estimate is when it is derived from low-quality DNA data.

===Methods===
An ethnicity tool called Eurogenes Ad-Mix Utilities was used. The tool is provided by GEDmatch and generates a report of ethnicity proportions for a given DNA kit. The Eurogenes K13 model was selected as the 'calculator' model. This model reports ethnicity proportions across 13 global regions, as shown in Figure 10, and is intended primarily for people of European background because it offers more sub-continental regions for Europe. The Somerton Man's DNA was selected as the input kit of the utility and the ethnicity report was generated.

[[File:ethnicity_sample.png|thumb|300px|center|Figure 10:  A sample report of Eurogenes Ad-Mix Utilities]]

In addition, to investigate the reliability of an ethnicity report generated from a low-quality DNA data file, several complete DNA samples were analysed. The project ordered 2 sets of complete DNA reference data from 23andMe, which provides data in the same format as the Somerton Man's file. A program, also written in C++, was developed to degrade a selected DNA file to different levels of completeness. The project team degraded each complete DNA sample file into 9 files by removing 10% of the SNPs, then 20%, and so on in steps of 10% up to 90%. For each complete sample, an extra file containing only the SNPs whose rsids also appear in the Somerton Man's DNA file was created and named degraded_DNA. These files were then uploaded to GEDmatch and the same ethnicity analysis was run as for the Somerton Man's raw DNA data. All ethnicity reports were recorded, and the change in ethnicity proportions across degradation levels was observed.
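
The rsid-matching step used to build degraded_DNA can be sketched as follows. This is a minimal illustration, not the project's actual program. It assumes the 23andMe raw data layout of tab-separated rsid, chromosome, position and genotype columns with comment lines starting with <code>#</code>, and the function names <code>rsidOf</code>, <code>collectRsids</code> and <code>filterByRsid</code> are hypothetical.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Return the rsid, i.e. the text before the first tab of a data line.
// (Assumed layout: rsid <TAB> chromosome <TAB> position <TAB> genotype.)
std::string rsidOf(const std::string& line) {
    return line.substr(0, line.find('\t'));
}

// Build the set of rsids present in a raw data file, given as its lines.
// Empty lines and '#' comment lines are skipped.
std::unordered_set<std::string> collectRsids(const std::vector<std::string>& lines) {
    std::unordered_set<std::string> ids;
    for (const std::string& line : lines) {
        if (line.empty() || line[0] == '#') continue;
        ids.insert(rsidOf(line));
    }
    return ids;
}

// Keep comment lines plus every SNP line whose rsid appears in `keep`.
std::vector<std::string> filterByRsid(const std::vector<std::string>& lines,
                                      const std::unordered_set<std::string>& keep) {
    std::vector<std::string> out;
    for (const std::string& line : lines) {
        if (line.empty()) continue;
        if (line[0] == '#' || keep.count(rsidOf(line)) > 0) out.push_back(line);
    }
    return out;
}
```

In this sketch, <code>collectRsids</code> would be called on the lines of the Somerton Man's file and the resulting set passed to <code>filterByRsid</code> together with the lines of a complete sample. Reading and writing the files themselves is left out.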

To provide stronger evidence on whether an ethnicity report from a low-quality DNA file is reliable, several degradation algorithms were introduced. In each case, n% is the percentage of SNPs to be removed. The first algorithm removed the first n% of SNPs in every group of 10 SNPs. The second did the opposite and removed the last n% of SNPs in every group of 10. The third and fourth algorithms removed the first and the last n% of SNPs in each chromosome, respectively.
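
The four strategies above can be sketched in C++ as below. This is an illustrative reconstruction under stated assumptions, not the project's code. The <code>Snp</code> struct, the <code>End</code> enum and both function names are hypothetical, SNPs are assumed to be stored in file order with each chromosome forming one contiguous run, and per-chromosome counts are rounded down.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal SNP record: only the fields these strategies need.
struct Snp {
    std::string rsid;
    std::string chromosome;
};

// Which end of a group the SNPs are removed from.
enum class End { First, Last };

// Strategies 1 and 2: within every consecutive group of 10 SNPs, drop the
// first (or last) percent/10 entries. `percent` must be a multiple of 10.
// A trailing group shorter than 10 is handled by position within the group.
std::vector<Snp> degradeByBlock(const std::vector<Snp>& snps, int percent, End end) {
    assert(percent >= 0 && percent <= 100 && percent % 10 == 0);
    const std::size_t drop = static_cast<std::size_t>(percent / 10);
    std::vector<Snp> kept;
    for (std::size_t i = 0; i < snps.size(); ++i) {
        const std::size_t pos = i % 10;  // position inside the current group
        const bool removed = (end == End::First) ? pos < drop : pos >= 10 - drop;
        if (!removed) kept.push_back(snps[i]);
    }
    return kept;
}

// Strategies 3 and 4: within each chromosome (a run of consecutive SNPs with
// the same label), drop the first (or last) percent% of its SNPs, rounded down.
std::vector<Snp> degradeByChromosome(const std::vector<Snp>& snps, int percent, End end) {
    assert(percent >= 0 && percent <= 100);
    std::vector<Snp> kept;
    std::size_t start = 0;
    while (start < snps.size()) {
        std::size_t stop = start;  // one past the end of this chromosome's run
        while (stop < snps.size() && snps[stop].chromosome == snps[start].chromosome) ++stop;
        const std::size_t drop = (stop - start) * static_cast<std::size_t>(percent) / 100;
        const std::size_t from = (end == End::First) ? start + drop : start;
        const std::size_t to = (end == End::First) ? stop : stop - drop;
        kept.insert(kept.end(),
                    snps.begin() + static_cast<std::ptrdiff_t>(from),
                    snps.begin() + static_cast<std::ptrdiff_t>(to));
        start = stop;
    }
    return kept;
}
```

For example, <code>degradeByBlock(snps, 30, End::First)</code> drops the first 3 SNPs of every group of 10, and <code>degradeByChromosome(snps, 30, End::Last)</code> drops the last 30% of each chromosome.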

===Results and discussion===
The ethnicity report for the Somerton Man's DNA is shown in Figure 11. As the pie chart shows, the two largest regions are North Atlantic, which contributes 36.21% of the chart, and Baltic, which contributes 20.44%.
[[File:ethnicity_SM.png|thumb|400px|center|Figure 11:  Ethnicity report of Somerton Man's DNA]]

According to the population averages table[15] for the Eurogenes K13 model, provided by its developer Davidski (Polako), both the Baltic and North Atlantic regions are in Europe. Figure 12 is a map of the Baltic region and Figure 13 shows the North Atlantic region.
<div style="text-align:center;"><ul> 
<li style="display: inline-block;"> [[File:Baltic.png|right|thumb|300px|Figure 12: Map of Baltic region]] </li>
<li style="display: inline-block;"> [[File:North_atlantic.png|right|thumb|323px|Figure 13: Map of North Atlantic region]] </li>
</ul></div>
To test whether the ethnicity report was reliable, 2 complete DNA files were obtained and degraded to the same level as the Somerton Man's DNA, which is 2% of SNPs remaining in the file. Sample DNA reference file 1 contained 613967 SNPs, of which 96.41% were not empty, and DNA reference file 2 contained 614009 SNPs, of which 97.68% were available for use. The ethnicity reports of the 2 complete sample DNA files are presented in Figures 14 and 16, and the reports of the corresponding degraded_DNA files are shown in Figures 15 and 17. According to these reports, after degradation the proportion of the largest ethnicity region of sample DNA file 1 changed from 83.13% to 78.66%, and that of the second largest region changed from 14.82% to 18.14%. The largest region therefore fell by 4.47 percentage points and the second largest rose by 3.32 percentage points. The degradation process affected the proportion of each ethnicity region for DNA sample 1, but the change is small and the same two regions remain the largest in the pie chart. A similar pattern appears in the ethnicity reports of DNA sample 2. The largest region grew by 2.33 percentage points, from 81.44% to 83.77%, and the second largest region grew by 0.28 percentage points, from 7.12% to 7.40%. These changes indicate that the proportions of the major ethnicity regions do not change greatly when a complete human DNA file is degraded to 2% of SNPs remaining.
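
The comparison made in this paragraph, namely whether the same two regions stay on top and by how many percentage points each one moves, can be expressed as a short C++ sketch. The <code>Report</code> type and the function names are hypothetical and are not part of GEDmatch or of the project's program. Region names would be copied by hand from the pie charts.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One ethnicity report as (region name, percentage) pairs. Hypothetical type.
using Report = std::vector<std::pair<std::string, double>>;

// Names of the two largest regions, in descending order of percentage.
std::vector<std::string> topTwo(Report report) {
    std::sort(report.begin(), report.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) { return a.second > b.second; });
    std::vector<std::string> names;
    for (std::size_t i = 0; i < report.size() && i < 2; ++i) names.push_back(report[i].first);
    return names;
}

// Percentage reported for `region`, or 0 if the region is absent.
double percentOf(const Report& report, const std::string& region) {
    for (const auto& entry : report) {
        if (entry.first == region) return entry.second;
    }
    return 0.0;
}

// Change in percentage points for `region` from `before` to `after`.
double pointChange(const Report& before, const Report& after, const std::string& region) {
    return percentOf(after, region) - percentOf(before, region);
}

// True when both reports rank the same two regions first and second.
bool sameTopTwo(const Report& before, const Report& after) {
    return topTwo(before) == topTwo(after);
}
```

With the sample 1 values entered under placeholder names, for instance <code>{{"Region A", 83.13}, {"Region B", 14.82}}</code> before and <code>{{"Region A", 78.66}, {"Region B", 18.14}}</code> after degradation, <code>sameTopTwo</code> returns true, and <code>pointChange</code> gives the signed change for each region.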
<div style="text-align:center;"><ul> 
<li style="display: inline-block;"> [[File:s1_eth.png|right|thumb|410px|Figure 14: Ethnicity reports of sample DNA file 1]] </li>
<li style="display: inline-block;"> [[File:degrade_s1_eth.png|right|thumb|400px|Figure 15: Ethnicity reports of sample DNA file 1 after degradation]] </li>
</ul></div>
<div style="text-align:center;"><ul> 
<li style="display: inline-block;"> [[File:s2_eth.png|right|thumb|430px|Figure 16: Ethnicity reports of sample DNA file 2]] </li>
<li style="display: inline-block;"> [[File:degrade_s2_eth.png|right|thumb|425px|Figure 17: Ethnicity reports of sample DNA file 2 after degradation]] </li>
</ul></div>
To provide further evidence for this finding, the degradation algorithms introduced in section 5.2 were applied, and the changes in ethnicity proportions under each degradation process were observed and recorded. The 2 sample DNA reference files were each degraded into 9 files at levels from 90% down to 10% of SNPs remaining. The proportions of the two largest ethnicity regions in each degraded file were plotted on line graphs. Figure 18 shows how the mean ethnicity proportions change with the degradation level, with standard errors shown as error bars. As the graph shows, each region's proportion fluctuates around a roughly constant level. For instance, the percentage of the first region of sample 1 fluctuates around 83%, which is close to the original proportion of 81.44%. However, the standard error of each region grows as more SNPs are removed, which indicates that the proportions presented in the ethnicity reports become less accurate. On the other hand, the highest standard errors for the first and second region proportions of samples 1 and 2 are 1.32%, 1.41%, 1.33% and 1.03%. None of these exceeds 1.5%, which can be regarded as an acceptable error. The project therefore concludes that removing a large number of SNPs from a set of DNA data does influence the resulting ethnicity report, but the results are still acceptable for identifying the owner's ethnicity.
[[File:eth_line_graph.png|thumb|600px|center|Figure 18: Line graph of means of ethnicity proportions vs degradation level]]
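
The mean and standard error plotted for each point can be computed as in the sketch below. The report does not state which values were averaged. This sketch assumes that, at each degradation level, the proportions obtained from the different degradation algorithms are averaged, and that the standard error is the sample standard deviation divided by the square root of the number of values. The <code>Summary</code> struct and the <code>summarise</code> function are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean and standard error of the mean for one point on the line graph.
struct Summary {
    double mean;
    double standardError;
};

// `values` holds one region's proportion (in %) from each degradation
// algorithm at a single degradation level (an assumption of this sketch).
Summary summarise(const std::vector<double>& values) {
    assert(values.size() >= 2);  // a sample variance needs at least 2 values
    const double n = static_cast<double>(values.size());

    double sum = 0.0;
    for (double v : values) sum += v;
    const double mean = sum / n;

    double squares = 0.0;
    for (double v : values) squares += (v - mean) * (v - mean);
    const double sampleVariance = squares / (n - 1.0);

    // Standard error of the mean: s / sqrt(n) = sqrt(s^2 / n).
    return {mean, std::sqrt(sampleVariance / n)};
}
```

Calling <code>summarise</code> once per region and per degradation level would give the points and error bars of a graph like Figure 18.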

===Conclusion===
According to the observed changes in ethnicity proportions during the degradation process, as more SNPs are removed from a complete human DNA reference file, the ethnicity report becomes less accurate, but the largest and second largest ethnicity regions in the report remain reliable. The top two ethnicity regions of the Somerton Man are therefore North Atlantic and Baltic, both of which lie mostly in Europe.