Difference between revisions of "Final Report/Thesis 2019"
[[File:Dna_ref.png|center|300px|Figure 7: DNA kit not accepted due to low amount of SNPs]]
Revision as of 12:11, 6 November 2019
Abstract
The Somerton Man case is one of the most mysterious cases of the last century. An unknown man was murdered on Somerton Beach, and the identities of the killer and the victim remain a mystery today. The project aims to investigate the identity of the Somerton Man using the DNA data provided. Unfortunately, the DNA data is corrupted and has a high drop rate, so the project team had to use different strategies and techniques to recover and analyse the DNA and then find out any possible characteristics of the Somerton Man. To approach the goals of the project, the team first evaluated the DNA data and tried to conduct DNA analysis via different genetic services. In addition, the degradation of complete DNA data was studied: by degrading a complete DNA file, the effect of degradation on the DNA analysis results was observed and discussed.
Introduction
Motivation
The main topic of the project is human identification using software programming and genetic analysis techniques. The project investigates the identity of the victim in the Somerton Man case, which is one of the most mysterious cases of the last century. On December 1st 1948, a well-dressed male was found dead on Somerton Beach in Adelaide [1]. He was clean-shaven and dressed in a suit, and none of his belongings could prove his identity [7]. The man later became known as the Somerton Man. The figure below shows the appearance of the Somerton Man.
More than half a century later, the identity of the Somerton Man is still unknown. With the Somerton Man’s DNA data extracted from his hair, the project team may be able to conduct several DNA examinations and identify characteristics of the man. Unfortunately, the DNA data is corrupted due to degradation of the hair, but the project team will do its best to investigate the data with modern techniques. In modern society, human identification techniques are useful in multiple areas, such as criminal investigation or the search for relatives. Most current identification techniques require high quality DNA samples, whereas this project focuses on identification techniques based on low quality genetic data. The project also concentrates on using engineering methods and problem-solving skills to improve the identification techniques.
Objectives
The aim of the project is to investigate the identity of the Somerton Man. More specifically, the group aims to identify any possible relatives, physical characteristics, genetic diseases or ethnicity of the Somerton Man. To achieve these goals, the team uses software and genetic analysis techniques to work on the Somerton Man’s DNA data. In addition, the reliability of DNA analysis results obtained from low quality DNA data is investigated. This is approached by degrading several complete DNA data samples to different levels and conducting sets of genetic tests on them. The changes in the test results are then observed and discussed.
Background
DNA
DNA is the hereditary material which stores the genetic information in humans [2]. There are two types of DNA in human beings: nuclear DNA, which is located in the cell nucleus, and mitochondrial DNA, which is located in the mitochondria. This project only focuses on the analysis of nuclear DNA. DNA stores genetic information as a sequence built from four types of nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T) [2]. A sugar molecule and a phosphate molecule are attached to each nitrogen base to form a molecule called a nucleotide. The bases pair up (A with T and C with G), and the nucleotides are arranged in two strands that form a double helix, which looks like a spiral [2]. In general, DNA is a genetic sequence formed by multiple base pairs. The genetic instructions for building and maintaining an organism are obtained from the order of these base pairs [2]. There are about 3 billion bases in human DNA, of which more than 99% are common to all human beings, so the physiological differences among people depend on the remaining less than 1% of the DNA.
Chromosome
A chromosome is an integrated package of DNA molecules. It has a thread-like structure, which is formed by DNA molecules coiled around histone proteins [3]. There are 23 pairs of chromosomes in a human cell, which is 46 chromosomes in total. 22 pairs are called autosomes, which are common to both males and females, and the 23rd pair is the sex chromosomes, which differ between males and females. In this project, the DNA data analysis focuses only on autosomes [4].
SNP
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among human beings [5]. Each SNP represents a difference in a single nucleotide, which is a single DNA building block [6]. For instance, a SNP may replace a nucleotide with the base guanine (G) by one with cytosine (C). On average, SNPs occur almost once in every 1,000 nucleotides in a person’s DNA. Most SNPs do not affect the health of their owner. However, some of these variations may be associated with diseases.
DNA reference file
A DNA reference file stores a set of SNP data from the owner’s DNA. The DNA reference files used in this project have the same format as the files of 23andMe, a company that aims to provide personal genetic information to its customers by using advanced genetic analysis techniques and web-based interactive tools. A screenshot of a sample file is shown below.
As shown in the figure, the DNA reference file has 4 columns: rsid, chromosome, position and genotype. The rsid is a unique ID used to identify a specific SNP [9]. An rsid starts with “rs” followed by a number (e.g. rs123456). These rsids are commonly used by researchers and databases. There is another special format that starts with “i” followed by a number (e.g. i123456). This “i” format is used internally by 23andMe to identify unknown SNPs and cannot be used in public databases. The second column, chromosome, identifies which chromosome the SNP belongs to. The third column, position, indicates the position of the SNP in the owner’s DNA sequence. The last column, genotype, gives the base pairs of the variant (A, T, G or C). Note that in some cases the genotype of a SNP cannot be determined, and “--” is displayed in the genotype column [9]. Only SNPs with identified base pairs can be used for DNA analysis.
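The four-column layout described above can be illustrated with a minimal Python sketch. It assumes that columns are separated by whitespace (tabs in typical files) and that comment lines begin with “#”; the names `SnpRecord` and `parse_line` are hypothetical and are not part of the project’s own program.

```python
from typing import NamedTuple, Optional


class SnpRecord(NamedTuple):
    """One row of a 23andMe-style DNA reference file (hypothetical helper type)."""
    rsid: str        # e.g. "rs123456" or the internal "i123456" form
    chromosome: str  # kept as text so that non-numeric labels are possible
    position: int
    genotype: str    # e.g. "AA", "CT", or "--" when no result is available

    @property
    def is_available(self) -> bool:
        # The report treats "--" as a SNP without an identified genotype.
        return self.genotype != "--"


def parse_line(line: str) -> Optional[SnpRecord]:
    """Parse one data line; return None for comments, blanks or malformed rows."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) != 4 or not fields[2].isdigit():
        return None
    rsid, chromosome, position, genotype = fields
    return SnpRecord(rsid, chromosome, int(position), genotype)
```

For example, `parse_line("i123456\t2\t1000\t--")` yields a record whose `is_available` property is false, so that SNP would be excluded from analysis. The position value in this example is made up for illustration.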
Task 1
Aims
The aim of this task is to gain a basic understanding of the DNA reference file and DNA analysis techniques. The project provides a DNA reference file of the Somerton Man, which contains corrupted DNA data. A screenshot of the file is shown in figure 6.
The first goal of this task is to evaluate the quality of the file, including counting the total number of SNPs and the number of available SNPs. The team should then try to conduct some DNA analysis on the DNA reference file.
Methods
To approach the first goal of this task, the team developed a program that counts the total number of SNPs and the number of available SNPs (SNPs whose genotype is not “--”) and determines the percentage of available SNPs for chromosomes 1 to 22 of the Somerton Man’s raw DNA data. The program was written in C++. A website called GEDmatch was then used to conduct DNA analysis. GEDmatch is a website with an open-data personal genomics database that provides tools for DNA and genealogy research. The site became well known after law enforcement in California used it in the Golden State Killer case, and it is now commonly used by law enforcement in the United States [10]. The Somerton Man’s DNA reference file was uploaded to the website in an attempt to run several of the DNA analyses it provides.
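The team’s counting program was written in C++ and is not reproduced in this report. The following Python sketch only illustrates the kind of per-chromosome count described above. It assumes a whitespace-separated, four-column file with “#” comment lines, and the function name `count_snps` is hypothetical.

```python
from collections import Counter

# Chromosomes 1 to 22 (the autosomes), as text labels.
AUTOSOMES = [str(n) for n in range(1, 23)]


def count_snps(lines):
    """Return {chromosome: (total, available, percent_available)} for chromosomes 1-22."""
    total = Counter()
    available = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed rows
        _rsid, chromosome, _position, genotype = fields
        if chromosome not in AUTOSOMES:
            continue  # only chromosomes 1 to 22 are counted
        total[chromosome] += 1
        if genotype != "--":
            available[chromosome] += 1
    return {
        c: (total[c], available[c],
            100.0 * available[c] / total[c] if total[c] else 0.0)
        for c in AUTOSOMES
    }
```

Because the function accepts any iterable of lines, it can be applied directly to an open file object, for example `count_snps(open(path))` with a hypothetical `path` to the reference file.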
Results and discussion
The counting output for the Somerton Man’s DNA data is presented in figure 3.2. As the figure shows, there are more than 0.6 million SNPs in the file, but only about 2% of them have determined base pairs. In DNA analysis, only SNPs with available base pairs can be used, and most genetic analysis techniques require a certain number of available SNPs. The Somerton Man’s DNA reference file was then uploaded to GEDmatch in order to use the one-to-many tool, which is the main service provided by GEDmatch. When a raw DNA reference file is uploaded, it is stored as a kit in the GEDmatch database and compared with the other kits in the public database. After the matching process has finished, the one-to-many tool shows how many kits in the database match the kit that the user has uploaded. Unfortunately, the website refused to process the Somerton Man’s data with the one-to-many tool because the file did not meet the minimum requirement of 2000 SNPs for each chromosome.
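A pre-upload check against the threshold mentioned above can be sketched as follows. The value of 2000 SNPs per chromosome is taken from the text. How GEDmatch actually performs its check is not known here, so the function `chromosomes_below_minimum` is purely a hypothetical illustration.

```python
# Minimum number of available SNPs per chromosome, as reported in the text.
MIN_SNPS_PER_CHROMOSOME = 2000


def chromosomes_below_minimum(available_counts, minimum=MIN_SNPS_PER_CHROMOSOME):
    """List the chromosomes (1-22) whose available SNP count is below the minimum.

    available_counts maps a chromosome label such as "1" to the number of
    SNPs on it that have an identified genotype; missing labels count as 0.
    """
    return [
        str(n) for n in range(1, 23)
        if available_counts.get(str(n), 0) < minimum
    ]
```

An empty result would mean that every chromosome meets the threshold, while a non-empty list names the chromosomes that would cause the file to fall short.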
Task 2
The ethnicity check via GEDmatch shows that North Atlantic accounts for more than a quarter of the chart. The second largest section is Baltic, a region that is not far from the North Atlantic region.
The ethnicity regions change only slightly during the degradation process. Figure 5 shows that, for the two sample DNA files, the ethnicity proportions do not intersect with one another, which suggests that the degradation of DNA does not affect the proportions of ethnicity. Based on Figure 4, this in turn suggests that the Somerton Man’s origin lies around the North Atlantic countries and the Baltic region. The countries associated with these regions are shown in Figure 6.
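The report does not describe how the complete DNA files were degraded. One possible approach, shown here purely as an assumption, is to replace a random fraction of the genotypes with “--” so that the file mimics a higher drop rate. The function name `degrade` and its parameters are hypothetical, and tab-separated columns are assumed.

```python
import random


def degrade(lines, drop_rate, seed=None):
    """Return a copy of the file lines with some genotypes replaced by "--".

    drop_rate is the probability (0.0 to 1.0) that an available genotype is
    dropped; seed makes the random choice reproducible.
    """
    rng = random.Random(seed)
    degraded = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("#") or len(fields) != 4:
            degraded.append(line)  # keep comments and malformed rows unchanged
            continue
        if fields[3] != "--" and rng.random() < drop_rate:
            fields[3] = "--"  # simulate a lost genotype
        degraded.append("\t".join(fields))
    return degraded
```

Calling the function with several drop rates, for example 0.5, 0.9 and 0.98, would produce a series of files at different degradation levels that could each be submitted to the same analysis.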
Task 3
The Somerton Man’s DNA was analysed with dbSNP, and 575 potential genetic diseases were found to be associated with it. No result strongly supports the Somerton Man's known physical appearance, such as hair colour, teeth structure or eye colour. However, several interesting characteristics were discovered. One of the diseases found in his DNA is skin fragility woolly hair syndrome, which indicates that the Somerton Man might have had a woolly hair abnormality.
Conclusion
Task 1: The proportion of available SNPs in the Somerton Man’s DNA is too low for most DNA analysis services, but some techniques can still be tested with it.
Task 2: The Somerton Man might be North Atlantic according to the ethnicity check on GEDmatch.
Task 3: There is no strong evidence to confirm his physical characteristics or genetic diseases, but several interesting results were discovered.