Final Report/Thesis 2019
Abstract
The Somerton Man case is one of the most mysterious cases of the last century. An unknown man was found dead on Somerton Beach, and both his identity and the circumstances of his death remain a mystery today. This project aims to investigate the identity of the Somerton Man using the DNA data provided. Unfortunately, the DNA data is corrupted and has a high drop rate, so the project team must use different strategies and techniques to recover and analyse the DNA and to determine any possible characteristics of the Somerton Man. To approach these goals, the team first evaluates the DNA data and attempts DNA analysis via different genetic services. In addition, the team studies the degradation of complete DNA data: by degrading a complete DNA file, the effect of degradation on DNA analysis results is observed and discussed.
Introduction
Motivation
The main topic of the project is human identification using software programming and genetic analysis techniques. The project investigates the identity of the victim in the Somerton Man case, one of the most mysterious cases of the last century. On December 1st 1948, a man was found dead on Somerton Beach in Adelaide [1]. He was clean-shaven and well dressed in a suit, and none of his belongings could prove his identity [7]. The man later became known as the Somerton Man. The figure below shows his appearance.

After more than half a century, the identity of the Somerton Man is still unknown. With the Somerton Man’s DNA data extracted from his hair, the project team may be able to conduct several DNA examinations and identify characteristics of the man. Unfortunately, the DNA data is corrupted due to degradation of the hair, but the team will do its best to investigate the data with modern techniques. Human identification techniques are useful in many areas of modern society, such as criminal investigation and searching for relatives. Most current identification techniques require high quality DNA samples, whereas this project focuses on identification techniques based on low quality genetic data. The project also concentrates on using engineering methods and problem solving skills to improve these techniques.
Objectives
The aim of the project is to investigate the identity of the Somerton Man. More specifically, the group aims to identify any possible relatives, physical characteristics, genetic diseases or ethnicity of the Somerton Man. To achieve these goals, the team uses software and genetic analysis techniques to work on the Somerton Man’s DNA data. In addition, the reliability of DNA analysis results obtained from low quality DNA data is investigated. This is approached by degrading several complete DNA data samples to different levels and conducting sets of genetic tests on them. The changes in the test results are then observed and discussed.
Background
DNA
DNA is the hereditary material that stores genetic information in humans [2]. There are two types of DNA in human beings: nuclear DNA, which is located in the cell nucleus, and mitochondrial DNA, which is located in the mitochondria. This project focuses only on the analysis of nuclear DNA. DNA stores genetic information as a sequence built from four types of nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T) [2]. A sugar molecule and a phosphate molecule are attached to each nitrogen base to form a molecule called a nucleotide. The bases pair up (A with T and C with G), and the nucleotides are arranged in two strands that form a spiral called a double helix [2]. In general, DNA is a genetic sequence formed by many base pairs, and the genetic instructions for building and maintaining an organism are determined by the order of these base pairs [2]. There are about 3 billion bases in human DNA, of which more than 99% are the same in all human beings; the physiological differences among people depend on the remaining less than 1% of the DNA.

Chromosome


A chromosome is an integrated package of DNA molecules. It has a thread-like structure in which DNA molecules are coiled around histone proteins [3]. A human body cell normally contains 23 pairs of chromosomes, 46 chromosomes in total. 22 pairs are called autosomes and are common to both males and females; the 23rd pair is the sex chromosomes, which differ between males and females. In this project, the DNA data analysis focuses only on autosomes [4].
SNP
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among human beings [5]. Each SNP represents a difference in a single nucleotide, the basic building block of DNA [6]. For instance, a SNP may replace a nucleotide carrying the base guanine (G) with one carrying the base cytosine (C). SNPs occur on average about once in every 1,000 nucleotides in a person’s DNA. Most SNPs do not affect the health of their carrier. However, some of these variations may be associated with diseases.
DNA reference file
A DNA reference file stores data for a set of SNPs from its owner’s DNA. The DNA reference files used in this project have the same format as the files produced by 23andMe, a company that aims to provide personal genetic information to its customers using advanced genetic analysis techniques and web-based interactive tools. A screenshot of a sample file is shown below.

As shown in the figure, the DNA reference file has 4 columns: rsid, chromosome, position and genotype. The rsid is a unique id used to identify a specific SNP [9]. An rsid starts with “rs” followed by a number (e.g. rs123456). These rsids are commonly used by researchers and databases. There is another special format of id that starts with “i” followed by a number (e.g. i123456). This “i” format is used internally by 23andMe to identify otherwise unnamed SNPs and cannot be looked up in public databases. The second column, chromosome, identifies which chromosome the SNP belongs to. The third column, position, gives the position of the SNP in the owner’s DNA sequence. The last column, genotype, gives the base pair of the variant, written with the letters A, T, G and C. Note that in some cases the genotype of a SNP cannot be determined, and “--” is displayed in the genotype column [9]. It is important to note that only SNPs with identified base pairs can be used for DNA analysis.
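The four-column layout described above can be illustrated with a short Python sketch. It assumes that the columns are tab-separated and that header or comment lines start with "#"; neither detail is stated in this report. The names `Snp`, `parse_line` and `is_called` are hypothetical, and the positions in the sample text are made up for illustration.

```python
from typing import NamedTuple, Optional


class Snp(NamedTuple):
    rsid: str        # "rs..." public id or 23andMe-internal "i..." id
    chromosome: str  # kept as text because sex chromosomes are not numbers
    position: int
    genotype: str    # two letters such as "AG", or "--" when not determined


def parse_line(line: str) -> Optional[Snp]:
    """Parse one data line; return None for blank lines and '#' comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    rsid, chromosome, position, genotype = line.split("\t")
    return Snp(rsid, chromosome, int(position), genotype)


def is_called(snp: Snp) -> bool:
    """Only SNPs whose genotype is not the '--' marker can be analysed."""
    return snp.genotype != "--"


# Made-up sample in the assumed layout (ids taken from the examples in the text).
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs123456\t1\t82154\tAA\n"
    "i123456\t1\t752566\t--\n"
)
snps = [s for s in map(parse_line, sample.splitlines()) if s is not None]
usable = [s for s in snps if is_called(s)]
```

Here `usable` keeps only the first record, since the second one carries the “--” marker.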
Task 1
Aims
The aim of this task is to gain a basic understanding of the DNA reference file and of DNA analysis techniques. The project provides a DNA reference file of the Somerton Man, which contains corrupted DNA data. A screenshot of the file is shown in figure 6.

The first goal of this task is to evaluate the quality of the file by counting the total number of SNPs and the number of available SNPs. The team then attempts to conduct DNA analysis on the DNA reference file.
Methods
To approach the first goal, the team develops a program that counts the total number of SNPs and the number of available SNPs (SNPs whose genotype is not “--”), and determines the percentage of available SNPs for chromosomes 1 to 22 of the Somerton Man’s raw DNA data. The program was developed in C++. A website called GEDmatch is then used to conduct DNA analysis. GEDmatch hosts an open-data personal genomics database and provides tools for DNA and genealogy research. The site became well known after law enforcement in California used it in the Golden State Killer case, and it has since been widely used by law enforcement in the United States [10]. The Somerton Man’s DNA reference file is uploaded to the website, and the analyses it provides are attempted.
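The counting step can be sketched as follows. The project's program was written in C++ and is not reproduced in this report; this is an independent Python sketch of the same idea, assuming a tab-separated file with "#" comment lines. The function names and the sample lines are hypothetical.

```python
from collections import Counter

AUTOSOMES = {str(n) for n in range(1, 23)}  # chromosomes 1 to 22


def count_snps(lines):
    """Return (total, available) SNP counts per autosome."""
    total = Counter()
    available = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _rsid, chromosome, _position, genotype = line.split("\t")
        if chromosome not in AUTOSOMES:
            continue  # sex chromosomes and other entries are ignored
        total[chromosome] += 1
        if genotype != "--":
            available[chromosome] += 1
    return total, available


def percent_available(total, available):
    """Percentage of available SNPs for every chromosome that has data."""
    return {c: 100.0 * available[c] / total[c] for c in total}


# Made-up lines for illustration only.
lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t100\tAA",
    "rs2\t1\t200\t--",
    "rs3\t2\t300\t--",
    "rs4\tX\t400\tGG",
]
total, available = count_snps(lines)
percentages = percent_available(total, available)
```

For a real file, the list `lines` would be replaced by an open file object, which can be iterated line by line in the same way.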
Results and discussion
The counting output for the Somerton Man's DNA data is presented in figure 7. As the figure shows, there are more than 0.6 million SNPs in the file, but only about 2% of them have determined base pairs. Only SNPs with available base pairs can be used in DNA analysis, and most genetic analysis techniques require a certain number of available SNPs.

The Somerton Man's DNA reference file was then uploaded to GEDmatch to use the one-to-many tool, the main service provided by GEDmatch. When a raw DNA reference file is uploaded, it is stored as a kit in the GEDmatch database and compared with the other kits in the public database. After the matching process finishes, the one-to-many tool shows how many kits in the database match the uploaded kit. Unfortunately, the website refused to process the Somerton Man's data with the one-to-many tool, since the file did not meet the minimum requirement of 2000 SNPs for each chromosome.
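The per-chromosome requirement quoted above can be checked locally before a file is uploaded. The sketch below assumes that the requirement refers to SNPs with a determined genotype on each of chromosomes 1 to 22; the function name and the example counts are hypothetical and are not taken from the Somerton Man's file.

```python
MIN_SNPS_PER_CHROMOSOME = 2000  # threshold quoted in the text


def chromosomes_below_minimum(available_counts, minimum=MIN_SNPS_PER_CHROMOSOME):
    """List the autosomes whose available-SNP count is below the minimum.

    `available_counts` maps a chromosome label ("1".."22") to its number of
    SNPs with a determined genotype; missing chromosomes count as zero.
    """
    return [
        str(n)
        for n in range(1, 23)
        if available_counts.get(str(n), 0) < minimum
    ]


# Made-up counts: every autosome has 2500 available SNPs except two.
counts = {str(n): 2500 for n in range(1, 23)}
counts["7"] = 1999
del counts["21"]
failing = chromosomes_below_minimum(counts)
```

In this made-up example, `failing` lists chromosomes 7 and 21, so the file would not meet the requirement.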

Conclusion
The quality of the Somerton Man's DNA reference file is lower than expected. Only about 2% of the 613905 SNPs in the file are available for use. GEDmatch does not accept such a low quality DNA file for DNA matching. To satisfy the minimum requirements of GEDmatch, data recovery work is required, which is introduced in the next task.
Task 2: Artificially recover DNA file
Aims
In this task, the project group aims to artificially recover the Somerton Man’s DNA file so that it satisfies the minimum SNP requirement (2000 SNPs for each chromosome) of GEDmatch’s one-to-many tool, and to find out how many people are related to the Somerton Man’s DNA kit.
Methods
The recovery work is done by developing several programs in C++. In general, the recovery work replaces a fixed number of empty SNPs, 2000 for each chromosome, with available SNPs. Several simple recovery algorithms are introduced and implemented. The first, called the random algorithm, replaces the genotype of empty SNPs with random base pairs. Replacing empty genotypes with a single homozygous pair (AA, GG, TT or CC) is also considered, which provides 4 further algorithms. In addition, if no DNA kit in the database matches the Somerton Man’s DNA, recovering more of the empty SNPs serves as a backup plan: with the recovery algorithms above, the project team can recover more SNPs in the Somerton Man’s DNA reference file and try the recovered DNA kits with the one-to-many tool.
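The five recovery strategies can be sketched in a few lines of Python. The project's own programs were written in C++ and are not shown in this report, so the function below is only an illustration of the idea; its name, its parameters and the choice to fill empty SNPs in file order are assumptions.

```python
import random

BASES = "ACGT"


def fill_empty(genotypes, limit, strategy, rng=None):
    """Return a copy of `genotypes` with up to `limit` empty entries filled.

    `strategy` is "random" (two independently drawn bases) or one of the
    homozygous pairs "AA", "CC", "GG", "TT". Entries that already have a
    genotype are never changed.
    """
    rng = rng or random.Random()
    filled = []
    replaced = 0
    for genotype in genotypes:
        if genotype == "--" and replaced < limit:
            if strategy == "random":
                genotype = rng.choice(BASES) + rng.choice(BASES)
            else:
                genotype = strategy
            replaced += 1
        filled.append(genotype)
    return filled


# Made-up genotypes for one chromosome.
chromosome_genotypes = ["AA", "--", "--", "--"]
homozygous = fill_empty(chromosome_genotypes, 2, "GG")
randomised = fill_empty(chromosome_genotypes, 3, "random", random.Random(0))
```

In practice the function would be applied to each chromosome in turn, with `limit` chosen so that the chromosome reaches the 2000 SNP requirement, or set to the number of empty SNPs when all of them are to be recovered.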
Results and discussion
With the developed programs, multiple artificial DNA kits with 2000 SNPs in each chromosome were created. Unfortunately, all of these DNA kits had 0 matches with other DNA in the public database, which means that no relatives of these artificial DNA kits could be found in the GEDmatch database.

Kits in which a larger number of empty SNPs were replaced with homozygous pairs or random pairs were then created, but none of these files found related DNA kits in the database. Even the DNA kits with all empty SNPs recovered could not find a matching DNA kit. It is important to note that all 5 recovery strategies were implemented. GEDmatch is the most commonly used public DNA database and contains a huge number of DNA kits. According to the website, the total number of kits managed by the GEDmatch database is 1363427157376. It is therefore very unlikely that no DNA kit in the database is related to the Somerton Man. This means that the quality of the Somerton Man’s DNA file is too low for the one-to-many tool, and that the simple recovery algorithms introduced here do not help.
Conclusion
There are clearly too many empty SNPs in the Somerton Man’s DNA reference file. The recovery algorithms introduced in section 4.1 are too simple and cannot increase the chance of finding the Somerton Man’s relatives in the GEDmatch database. With DNA data that contains only approximately 2% available SNPs, it is nearly impossible to find any DNA kits related to the Somerton Man.
Task 3: Investigation on ethnicity
Aims
The first aim of this task is to investigate the ethnicity of the Somerton Man using his DNA. As described in the previous sections, the quality of the Somerton Man’s DNA is low, so the second aim is to study the reliability of ethnicity examination results obtained from low quality DNA.
Methods
Firstly, an ethnicity tool called Eurogenes Ad-Mix Utilities is used. This tool is provided on GEDmatch and generates a report of ethnicity proportions for a given DNA kit. The Eurogenes K13 model is selected as the 'calculator' model. This model allows the utility to divide the ethnicity proportions among 13 different global regions, as figure 10 shows, and it is intended primarily for people of European background since it provides more sub-continental regions for Europe [13]. The Somerton Man’s DNA is selected as the input kit of the utility, and the ethnicity report is generated and evaluated.

In addition, to investigate the reliability of an ethnicity report generated from a low quality DNA data file, several complete DNA samples are required. The project orders 2 sets of complete DNA reference data from 23andMe, which are in the same format as the Somerton Man’s file. A program, also developed in C++, allows the user to degrade a selected DNA file to different levels. The project team degrades each complete DNA sample file into 9 files by removing 10% of the SNPs, 20% of the SNPs, and so on step by step up to 90% of the SNPs. For each complete DNA sample, an extra file named degraded_DNA is also generated, which contains only the SNPs whose rsids are the same as those in the Somerton Man’s DNA file. These files are then uploaded to GEDmatch and put through the same ethnicity analysis as the Somerton Man’s raw DNA data. All ethnicity reports are recorded, and the changes in the ethnicity proportions are observed.
To provide stronger evidence of whether the ethnicity report of a low quality DNA file is reliable, several different degradation algorithms are used, where n% is the percentage of SNPs to remove. The first algorithm removes the first n% of the SNPs in every group of 10 SNPs. The second algorithm does the opposite and removes the last n% of the SNPs in every group of 10 SNPs. The third and fourth algorithms remove the first and the last n% of the SNPs of each chromosome, respectively.
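The four degradation strategies can be sketched in Python as follows. The project's degradation program was written in C++ and is not reproduced here; this sketch assumes that "removing" a SNP means deleting its record from the file, and that an incomplete final group of fewer than 10 SNPs loses the same number of records as a full group. The function names are hypothetical.

```python
def degrade_blocks(snps, percent, from_end=False, block=10):
    """In every consecutive group of `block` SNPs, drop the first (or last) percent%."""
    drop = block * percent // 100
    kept = []
    for start in range(0, len(snps), block):
        chunk = snps[start:start + block]
        keep = max(0, len(chunk) - drop)
        # Keep the leading records when dropping from the end, else the trailing ones.
        kept.extend(chunk[:keep] if from_end else chunk[len(chunk) - keep:])
    return kept


def degrade_whole(snps, percent, from_end=False):
    """Drop the first (or last) percent% of the SNPs of a single chromosome."""
    drop = len(snps) * percent // 100
    return snps[:len(snps) - drop] if from_end else snps[drop:]


# Twenty placeholder records standing in for the SNPs of one chromosome.
records = list(range(20))
first_of_each_group = degrade_blocks(records, 30)
last_of_each_group = degrade_blocks(records, 30, from_end=True)
first_of_chromosome = degrade_whole(records, 10)
last_of_chromosome = degrade_whole(records, 10, from_end=True)
```

Calling each function with `percent` set to 10, 20, and so on up to 90 would produce the 9 degradation levels described above for one strategy.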
Results and discussion
The ethnicity report for the Somerton Man's DNA is shown in figure 11. As the pie chart shows, the 2 largest regions are the North Atlantic region, with a proportion of 36.21%, and the Baltic region, with a proportion of 20.44%.

According to the population averages table [15] for the Eurogenes K13 model provided by its developer Davidski (Polako), both the Baltic and North Atlantic regions are in Europe. Figure 12 is a map that indicates the area of the Baltic region, and figure 13 shows the North Atlantic region.
To test whether this ethnicity report is reliable, 2 complete DNA files were obtained and degraded to the same level as the Somerton Man’s DNA, that is, with about 2% of the SNPs remaining in the file. Sample DNA reference file 1 contains 613967 SNPs, of which 96.41% are not empty, and sample DNA reference file 2 contains 614009 SNPs, of which 97.68% are available for use. The ethnicity reports of the 2 complete sample DNA files are presented in figures 16 and 17, and the ethnicity reports of the degraded_DNA file for each complete sample are shown in figures 14 and 15. According to these reports, after degradation the proportion of the largest ethnicity region of sample DNA file 1 changed from 83.13% to 78.66%, and that of the second largest region changed from 14.82% to 18.14%. The largest region therefore decreased by 4.47 percentage points and the second region increased by 3.32 percentage points. The degradation process affects the proportion of each ethnicity region for DNA sample 1, but the change is small, and the first and second regions are still the 2 largest regions in the pie chart. A similar pattern is seen when comparing the ethnicity reports of DNA sample 2: the largest ethnicity region grew by 2.33 percentage points, from 81.44% to 83.77%, and the second largest region increased by 0.28 percentage points, from 7.12% to 7.40%. These changes suggest that the proportions of the major ethnicity regions do not change greatly, staying within 5 percentage points, when a complete human DNA file is degraded to a level where 2% of the SNPs remain.
To provide more evidence for this observation, the degradation algorithms introduced in section 5.2 were applied, and the changes in the ethnicity proportions during the different degradation processes were observed and recorded. The 2 sample DNA reference files were each degraded into 9 files at different levels, from 90% down to 10% of the SNPs remaining. The proportions of the 2 largest ethnicity regions of each degraded file were plotted on line graphs. Figure 18 is the line graph showing how the means of the ethnicity proportions change over the degradation process, with standard errors shown as error bars. As the graph shows, each region’s proportion fluctuates around a certain level. For instance, the percentage of the first region of sample 1 fluctuates around 83%, which is close to the original proportion of 81.44%. However, the standard errors of each region become larger as more SNPs are removed, which indicates that the proportions presented in the ethnicity reports become less accurate. On the other hand, the highest standard errors for the first and second region proportions of samples 1 and 2 are 1.32%, 1.41%, 1.33% and 1.03%. None of these exceeds 1.5%, which can be regarded as an acceptable error. The project therefore concludes that when a large number of SNPs are removed from a set of DNA data, the ethnicity report generated from that data is affected, but the results are still acceptable for identifying the owner’s ethnicity.
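The mean and standard error plotted for each degradation level can be computed as sketched below. This assumes that each mean is taken over the proportions reported for the same region at the same level under the different degradation algorithms, and that the standard error is the sample standard deviation divided by the square root of the number of values; the report does not spell out either detail. The input values are made up and are not results from the project.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for at least two values."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Made-up proportions (%) of one region at one degradation level,
# one value per degradation algorithm.
proportions = [82.0, 84.0, 83.0, 83.0]
mean, standard_error = mean_and_standard_error(proportions)
```

A standard error computed this way could then be compared against the 1.5% bound used in the discussion above.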

Conclusion
According to the observed changes in ethnicity during the degradation process, the ethnicity report becomes less accurate as more SNPs are removed from a complete human DNA reference file. However, the largest and second largest ethnicity regions in the report remain reliable. Therefore, the major ethnicity components of the Somerton Man are most likely North Atlantic and Baltic.
Genetic diseases search
Aims
In this task, the team focuses on searching for the clinical effects of each available SNP and on identifying any possible genetic diseases or physical characteristics that the Somerton Man could have had.
Methods
To search for the clinical effects of SNPs, the team develops a data mining program that collects information from a SNP database. Python is used for development since it is convenient for web development. The SNP database selected for the project is dbSNP, which is the largest database of nucleotide variations in the world and is managed by the National Center for Biotechnology Information (NCBI) [11]. Figure 5.1 shows the information provided by dbSNP. The team collects the clinical significance associated with each rsid in the Somerton Man’s file.
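The first step of such a data mining program, selecting the rsids to look up and forming a request for them, might look like the sketch below. The report does not say how the program queries dbSNP. As an assumption, the sketch builds a request URL for NCBI's public E-utilities ESummary service with `db=snp`, and it does not send the request or parse any response. The function names and the sample lines are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESummary endpoint (assumed access route, see lead-in).
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def called_rs_numbers(lines):
    """Collect the numeric part of every public 'rs' id that has a genotype.

    Internal 'i...' ids are skipped because they cannot be looked up in
    public databases, and SNPs marked '--' are skipped as unavailable.
    """
    numbers = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, _chromosome, _position, genotype = line.split("\t")
        if genotype != "--" and rsid.startswith("rs") and rsid[2:].isdigit():
            numbers.append(rsid[2:])
    return numbers


def esummary_url(rs_numbers):
    """Build one ESummary request URL for a batch of rs numbers."""
    query = urlencode({"db": "snp", "id": ",".join(rs_numbers), "retmode": "json"})
    return f"{EUTILS_ESUMMARY}?{query}"


# Made-up lines in the assumed tab-separated layout.
lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs123456\t1\t82154\tAA",
    "rs654321\t1\t90000\t--",
    "i123456\t2\t752566\tCT",
]
numbers = called_rs_numbers(lines)
url = esummary_url(numbers)
```

Batching several rs numbers into one request keeps the number of requests small, which matters when a file contains thousands of available SNPs.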
Conclusion
Task 1: The proportion of available SNPs in the Somerton Man’s DNA is too low for most DNA analysis services, but there are still some techniques that can be tested with it.
Task 3: The Somerton Man might be North Atlantic according to the ethnicity check on GEDmatch.
Genetic diseases search: There is no strong evidence to confirm his physical characteristics or genetic diseases, but several interesting results were discovered.





