Difference between revisions of "Final Report/Thesis 2019"
Revision as of 12:52, 6 November 2019
Abstract
The Somerton Man case is one of the most mysterious cases of the last century. An unknown man was found dead on Somerton Beach, and the identities of the victim and of any killer remain a mystery today. This project aims to investigate the identity of the Somerton Man using the DNA data provided. Unfortunately, the DNA data is corrupted and has a high drop rate, so the project team must use different strategies and techniques to recover and analyse the DNA, and then determine any possible characteristics of the Somerton Man. To achieve these goals, the team first evaluates the DNA data and attempts DNA analysis through different genetic services. In addition, the team studies the degradation of complete DNA data: by degrading a complete DNA file, the effect of degradation on DNA analysis results can be observed and discussed.
Introduction
Motivation
The main topic of the project is human identification using software programming and genetic analysis techniques. The project investigates the identity of the victim in the Somerton Man case, one of the most mysterious cases of the last century. On 1 December 1948, a well-dressed man was found dead on Somerton Beach in Adelaide [1]. He was clean-shaven and dressed in a suit, and none of his belongings could establish his identity [7]. The man later became known as the Somerton Man. The figure below shows his appearance.
After more than half a century, the identity of the Somerton Man is still unknown. With the Somerton Man’s DNA data extracted from his hair, the project team may be able to conduct several DNA examinations and identify characteristics of the man. Unfortunately, the DNA data is corrupted due to degradation of the hair, but the team will investigate the data as far as modern techniques allow. Human identification is the broader topic of the project. In modern society, human identification techniques are useful in many areas, such as criminal investigation and the search for relatives. Most current identification techniques require high quality DNA samples, whereas this project focuses on identification techniques based on low quality genetic data. The project also concentrates on using engineering methods and problem-solving skills to improve these techniques.
Objectives
The aim of the project is to investigate the identity of the Somerton Man. More specifically, the group aims to identify any possible relatives, physical characteristics, genetic diseases or ethnicity of the Somerton Man. To achieve these goals, the team uses software and genetic analysis techniques to work on the Somerton Man’s DNA data. In addition, the reliability of DNA analysis results obtained from low quality DNA data is investigated. This is done by degrading several complete DNA data samples to different levels and conducting a set of genetic tests on each of them. The changes in the test results are then observed and discussed.
Background
DNA
DNA is the hereditary material that stores genetic information in humans [2]. Human beings have two types of DNA: nuclear DNA, located in the cell nucleus, and mitochondrial DNA, located in the mitochondria. This project focuses only on the analysis of nuclear DNA. DNA stores genetic information as a sequence of four types of nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T) [2]. A sugar molecule and a phosphate molecule are attached to each nitrogen base to form a molecule called a nucleotide. The bases pair up (A with T and C with G), and the nucleotides are arranged in two strands that form a double helix, which looks like a spiral [2]. In general, DNA is a genetic sequence formed by many base pairs, and the genetic instructions for building and maintaining an organism are given by the order of these base pairs [2]. There are about 3 billion bases in human DNA. More than 99% of them are common to all human beings, so the physiological differences among people depend on the remaining share of less than 1%.
Chromosome
A chromosome is an integrated package of DNA molecules. It has a thread-like structure, which is formed by DNA molecules coiled around histone proteins [3]. A human cell contains 23 pairs of chromosomes, or 46 chromosomes in total. 22 pairs are called autosomes and are common to both males and females, while the 23rd pair consists of the sex chromosomes, which differ between males and females. In this project, the DNA data analysis focuses only on autosomes [4].
SNP
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among human beings [5]. Each SNP represents a difference in a single nucleotide, the basic building block of DNA [6]. For instance, a SNP may replace a nucleotide containing the base guanine (G) with one containing cytosine (C). On average, SNPs occur about once in every 1,000 nucleotides in a person’s DNA. Most SNPs do not affect the health of their owner. However, some of these variations may be associated with diseases.
DNA reference file
A DNA reference file stores SNP data from its owner’s DNA. The DNA reference files used in this project follow the file format of 23andMe, a company that aims to provide personal genetic information to its customers using advanced genetic analysis techniques and web-based interactive tools. A screenshot of a sample file is shown below.
As shown in the figure, the DNA reference file has 4 columns: rsid, chromosome, position and genotype. The rsid is a unique id used to identify a specific SNP [9]. An rsid starts with “rs” followed by a number (e.g. rs123456). These rsids are commonly used by researchers and databases. There is another special id format that starts with “i” followed by a number (e.g. i123456). This “i” format is used internally by 23andMe to identify unknown SNPs and cannot be used in public databases. The second column, chromosome, identifies which chromosome the SNP belongs to. The third column, position, indicates the position of the SNP in the owner’s DNA sequence. The last column, genotype, gives the base pair of the variant (combinations of A, T, G and C). Note that in some cases the genotype of a SNP cannot be determined, and “--” is displayed in the genotype column [9]. It is important to note that only SNPs with identified base pairs can be used for DNA analysis.
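The four-column layout described above can be read with a few lines of C++. The following is only a minimal sketch and not the project’s actual code: the names SnpRecord, parseLine and isAvailable are hypothetical, and it assumes that the columns are separated by whitespace and that header or comment lines start with “#”.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One row of a 23andMe-style DNA reference file (hypothetical struct name).
struct SnpRecord {
    std::string rsid;        // e.g. "rs123456" or the internal form "i123456"
    std::string chromosome;  // kept as text so values such as "X" are accepted
    long position = 0;       // position of the SNP in the DNA sequence
    std::string genotype;    // e.g. "AG", or "--" when no genotype is available
};

// Parses one line into `out`. Returns false for blank lines, comment lines
// (assumed to start with '#') and lines that do not contain the 4 columns.
bool parseLine(const std::string& line, SnpRecord& out) {
    if (line.empty() || line[0] == '#') {
        return false;
    }
    std::istringstream fields(line);
    SnpRecord record;
    if (!(fields >> record.rsid >> record.chromosome
                 >> record.position >> record.genotype)) {
        return false;
    }
    out = record;  // only overwrite the output after a complete parse
    return true;
}

// A SNP can be used for analysis only when its genotype is not "--".
bool isAvailable(const SnpRecord& record) {
    return record.genotype != "--";
}
```

Keeping the chromosome as text rather than a number is a deliberate choice in this sketch, so that rows for the sex chromosomes can still be read and then filtered out by later steps.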
Task 1
Aims
The aim of this task is to gain a basic understanding of the DNA reference file and of DNA analysis techniques. The project provides a DNA reference file of the Somerton Man, which contains corrupted DNA data. A screenshot of the file is shown in figure 6.
The first goal of this task is to evaluate the quality of the file by counting the total number of SNPs and the number of available SNPs. The team should then try to conduct some DNA analysis on the DNA reference file.
Methods
To achieve the first goal of this task, the team will develop a program that counts the total number of SNPs and the number of available SNPs (SNPs whose genotype is not “--”), and determines the percentage of available SNPs for chromosomes 1 to 22 of the Somerton Man’s raw DNA data. The program is developed in C++. A website called GEDmatch will then be used to conduct DNA analysis. GEDmatch is a website that has an open-data personal genomics database and provides tools for DNA and genealogy research. The site became well known after law enforcement in California used it in the Golden State Killer case, and it is now widely used by law enforcement in the United States [10]. The Somerton Man’s DNA reference file will be uploaded to the website, and the team will try to run several of the DNA analyses provided there.
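The counting step described above could look roughly as follows. This is a sketch under stated assumptions rather than the program used in the project: SnpCount, autosomeNumber and countSnps are hypothetical names, the input is assumed to be whitespace-separated with “#” comment lines, and rows that are not on chromosomes 1 to 22 are simply skipped.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Totals for one chromosome (hypothetical struct name).
struct SnpCount {
    long total = 0;      // all SNP rows seen for the chromosome
    long available = 0;  // rows whose genotype is not "--"

    double percentAvailable() const {
        assert(available <= total);
        return total == 0 ? 0.0 : 100.0 * available / total;
    }
};

// Returns 1..22 for an autosome label, or 0 for anything else (X, Y, MT, ...).
int autosomeNumber(const std::string& chromosome) {
    if (chromosome.empty() || chromosome.size() > 2) {
        return 0;
    }
    int value = 0;
    for (char ch : chromosome) {
        if (ch < '0' || ch > '9') {
            return 0;
        }
        value = value * 10 + (ch - '0');
    }
    return (value >= 1 && value <= 22) ? value : 0;
}

// Counts total and available SNPs per autosome from a 4-column file.
std::map<int, SnpCount> countSnps(std::istream& input) {
    std::map<int, SnpCount> counts;
    std::string line;
    while (std::getline(input, line)) {
        if (line.empty() || line[0] == '#') {
            continue;  // assumed comment or header line
        }
        std::istringstream fields(line);
        std::string rsid, chromosome, genotype;
        long position = 0;
        if (!(fields >> rsid >> chromosome >> position >> genotype)) {
            continue;  // malformed row
        }
        const int number = autosomeNumber(chromosome);
        if (number == 0) {
            continue;  // not one of chromosomes 1 to 22
        }
        SnpCount& entry = counts[number];
        ++entry.total;
        if (genotype != "--") {
            ++entry.available;
        }
    }
    return counts;
}
```

Because countSnps takes a std::istream, the same function can be fed from a std::ifstream opened on the DNA reference file or from an in-memory string when testing.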
Results and discussion
The counting output for the Somerton Man's DNA data is presented in figure 7. As the figure shows, there are more than 0.6 million SNPs in the file, but only about 2% of them have determined base pairs. In DNA analysis, only SNPs with available base pairs can be used, and most genetic analysis techniques require a certain number of available SNPs.
The Somerton Man's DNA reference file was then uploaded to GEDmatch in order to use the one-to-many tool, which is the main service provided by GEDmatch. When a raw DNA reference file is uploaded, it is stored as a kit in the GEDmatch database and compared with the other kits in the public database. After the matching process has finished, the one-to-many tool shows how many kits in the database match the kit that the user uploaded. Unfortunately, the website refused to process the Somerton Man's data with the one-to-many tool, since the file did not meet the minimum requirement of 2000 SNPs for each chromosome.
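The minimum requirement mentioned above can be checked locally before uploading a file. The sketch below only mirrors the rule as it is stated in this report (2000 available SNPs per chromosome); how GEDmatch applies the check internally is not known here, and the function name chromosomesBelowMinimum as well as the restriction to chromosomes 1 to 22 are assumptions.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Returns the autosomes (1..22) whose number of available SNPs is below the
// minimum. Chromosomes missing from the map are treated as having 0 SNPs.
std::vector<int> chromosomesBelowMinimum(
        const std::map<int, long>& availablePerChromosome,
        long minimumPerChromosome = 2000) {
    assert(minimumPerChromosome >= 0);
    std::vector<int> failing;
    for (int chromosome = 1; chromosome <= 22; ++chromosome) {
        const auto found = availablePerChromosome.find(chromosome);
        const long available =
            (found == availablePerChromosome.end()) ? 0 : found->second;
        if (available < minimumPerChromosome) {
            failing.push_back(chromosome);
        }
    }
    return failing;
}
```

An empty result means that every autosome satisfies the stated threshold; otherwise the returned list shows which chromosomes would need more available SNPs.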
Conclusion
The quality of the Somerton Man's DNA reference file is lower than expected. Only about 2% of the 613905 SNPs in the file are available for use. A DNA file of such low quality is not accepted by GEDmatch for DNA matching. To satisfy the minimum requirements of GEDmatch, data recovery work is required, which is introduced in the next task.
Task 2: Artificially recover DNA file
Aims
In this task, the project group aims to artificially recover the Somerton Man’s DNA file so that it satisfies the minimum SNP requirement (2000 SNPs for each chromosome) of GEDmatch’s one-to-many tool, and to find out how many people are related to the Somerton Man’s DNA kit.
Methods
The recovery work is done by developing several programs in C++. In general, the recovery work replaces a fixed number of empty SNPs with available SNPs so that each chromosome has 2000 available SNPs. Several simple recovery algorithms are introduced and implemented. The first algorithm, called the random algorithm, replaces empty genotypes with random base pairs. Replacing empty genotypes with a homozygous pair (AA, GG, TT or CC) is also considered, which provides 4 further algorithms. In addition, if no DNA kit in the database matches the Somerton Man’s DNA, recovering more of the empty SNPs in the Somerton Man’s DNA can serve as a backup plan. Using the recovery algorithms introduced above, the project team can recover more SNPs in the Somerton Man’s DNA reference file and then try the recovered DNA kits with the one-to-many tool.
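The five recovery strategies described above (one random and four homozygous) can be sketched as a single fill routine. This is an illustrative sketch, not the project’s implementation: FillMode, makeGenotype and fillEmptyGenotypes are hypothetical names, and it assumes that empty genotypes are filled in file order until a target number of available SNPs is reached for one chromosome.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// The five strategies: random base pairs or one fixed homozygous pair.
enum class FillMode { Random, AA, CC, GG, TT };

// Builds the replacement genotype for one empty SNP.
std::string makeGenotype(FillMode mode, std::mt19937& rng) {
    switch (mode) {
        case FillMode::AA: return "AA";
        case FillMode::CC: return "CC";
        case FillMode::GG: return "GG";
        case FillMode::TT: return "TT";
        case FillMode::Random: break;
    }
    static const char bases[] = {'A', 'C', 'G', 'T'};
    std::uniform_int_distribution<int> pick(0, 3);
    std::string genotype;
    genotype += bases[pick(rng)];
    genotype += bases[pick(rng)];
    return genotype;
}

// Replaces "--" genotypes of one chromosome, in order, until `target`
// genotypes are available. Returns the number available afterwards, which
// stays below `target` if the chromosome has too few SNP rows.
std::size_t fillEmptyGenotypes(std::vector<std::string>& genotypes,
                               std::size_t target,
                               FillMode mode,
                               std::mt19937& rng) {
    std::size_t available = 0;
    for (const std::string& genotype : genotypes) {
        if (genotype != "--") {
            ++available;
        }
    }
    for (std::string& genotype : genotypes) {
        if (available >= target) {
            break;
        }
        if (genotype == "--") {
            genotype = makeGenotype(mode, rng);
            ++available;
        }
    }
    return available;
}
```

Calling the routine with a target of 2000 corresponds to the minimum requirement, while a target equal to the number of rows corresponds to the backup plan of recovering every empty SNP. The genotypes produced this way are artificial and carry no information about the real DNA.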
Results and discussion
With the developed programs, multiple artificial DNA kits with 2000 SNPs in each chromosome were created. Unfortunately, all of these DNA kits had 0 matches with other DNA kits in the public database, which means that no relatives could be found for these artificial kits in the GEDmatch database.
Kits in which a larger number of empty SNPs were replaced with homozygous or random pairs were then created, but none of these files found related DNA kits in the database. Even the DNA kits with all empty SNPs recovered could not find a match. It is important to note that all 5 recovery strategies were implemented. As GEDmatch is the most commonly used public DNA database, it contains a huge number of DNA kits. According to the website, the total number of kits managed by the GEDmatch database is 1363427157376. It is therefore very unlikely that no DNA kit in the database is related to the Somerton Man, which means that the quality of the Somerton Man’s DNA file is too low for the one-to-many tool and that the simple recovery algorithms introduced above are of no use.
Conclusion
It is clear that there are too many empty SNPs in the Somerton Man’s DNA reference file. The recovery algorithms introduced in section 4.1 are too simple and do not increase the chance of finding the Somerton Man’s relatives in the GEDmatch database. With DNA data that contains only approximately 2% available SNPs, it is nearly impossible to find any DNA kits related to the Somerton Man.
Task 3: Investigation on ethnicity
Aims
The first aim of this task is to investigate the ethnicity indicated by the Somerton Man’s DNA. As described in the previous section, the quality of the Somerton Man’s DNA is low, so the second aim is to study the reliability of ethnicity examination results obtained from low quality DNA.
Methods
Firstly, an ethnicity tool called Eurogenes Ad-Mix Utilities will be used. This tool is provided on GEDmatch and generates a report of ethnicity proportions for a given DNA kit. The Eurogenes K13 model is selected as the ‘calculator’ model. This model lets the utility divide the ethnicity proportions among 13 different global regions, as shown in figure 5.1, and it is intended primarily for people of European background, since it provides more sub-continental regions for Europe [13]. The Somerton Man’s DNA will be selected as the input kit of the utility, and the ethnicity report will be generated and evaluated.
In addition, to investigate the reliability of an ethnicity report produced from a low quality DNA data file, several complete DNA samples are required. The project will order 2 sets of complete DNA reference data from 23andMe, which are provided in the same format as the Somerton Man’s file. A program, also written in C++, will be developed that allows the user to degrade a selected DNA file to different levels. The project team will degrade each complete DNA sample file into 9 files by removing 10% of the SNPs, 20% of the SNPs, and so on in steps of 10% up to 90%. For each complete sample, an extra file named degraded_DNA will also be generated that contains only the SNPs whose rsids appear in the Somerton Man’s DNA file. These files will then be uploaded to GEDmatch and put through the same ethnicity analysis as the Somerton Man’s raw DNA data. All ethnicity reports will be recorded, and the changes in the ethnicity proportions will be observed.
To provide stronger evidence on whether the ethnicity report of a low quality DNA file is reliable, several degradation algorithms will be applied. The first strategy removes the first n% of SNPs in every group of 10 SNPs, where n% is the percentage of SNPs to remove. The second strategy does the opposite and removes the last n% of SNPs in every group of 10. The third and fourth strategies remove the first n% and the last n% of the SNPs of each chromosome, respectively.
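The four degradation strategies can be expressed as one function that decides which SNP rows of a chromosome are kept. The code below is a sketch under assumptions and not the project’s degradation program: Strategy and keptIndices are hypothetical names, “removing” a SNP is taken to mean dropping its row, n is restricted to multiples of 10 as in the 10% to 90% steps, and an incomplete group of 10 at the end of a chromosome is handled by position within the group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The four degradation strategies described in the text.
enum class Strategy {
    FirstOfEveryTen,    // remove the first n% of every group of 10 SNPs
    LastOfEveryTen,     // remove the last n% of every group of 10 SNPs
    FirstOfChromosome,  // remove the first n% of the chromosome's SNPs
    LastOfChromosome    // remove the last n% of the chromosome's SNPs
};

// Returns the indices (0-based, in file order) of the SNPs that remain on a
// chromosome with `count` SNPs after removing `percent` percent of them.
std::vector<std::size_t> keptIndices(std::size_t count,
                                     int percent,
                                     Strategy strategy) {
    assert(percent >= 0 && percent <= 100 && percent % 10 == 0);
    // Number of SNPs removed from each group of 10, e.g. 3 for 30%.
    const std::size_t perTen = static_cast<std::size_t>(percent) / 10;
    // Number of SNPs removed from the whole chromosome (rounded down).
    const std::size_t perChromosome =
        count * static_cast<std::size_t>(percent) / 100;

    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < count; ++i) {
        bool removed = false;
        switch (strategy) {
            case Strategy::FirstOfEveryTen:
                removed = (i % 10) < perTen;
                break;
            case Strategy::LastOfEveryTen:
                removed = (i % 10) >= 10 - perTen;
                break;
            case Strategy::FirstOfChromosome:
                removed = i < perChromosome;
                break;
            case Strategy::LastOfChromosome:
                removed = i >= count - perChromosome;
                break;
        }
        if (!removed) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

For a chromosome with 20 SNPs and n = 30, each strategy keeps 14 SNPs but at different positions, which is what allows the ethnicity reports of the different strategies to be compared at the same degradation level.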
Conclusion
Task 1: The proportion of available SNPs in the Somerton Man’s DNA is too low for most DNA analysis services, but some techniques can still be tested with it.
Task 2: The Somerton Man might be North Atlantic according to the ethnicity check on GEDmatch.
Task 3: There is no strong evidence to confirm his physical characteristics or genetic diseases, but several interesting results were discovered.