Difference between revisions of "Cipher Cracking 2019"

From Derek
Jump to: navigation, search
(General project description)
(Semester 2)
 
(31 intermediate revisions by 3 users not shown)
Line 3: Line 3:
 
*[[Andrew G. Allison|Dr Andrew Allison]]
 
*[[Andrew G. Allison|Dr Andrew Allison]]
  
== Honours tudents ==
+
== Honours students ==
 
*[[Azizul Hakim]]
 
*[[Azizul Hakim]]
 
*[[Zihe Wang]]
 
*[[Zihe Wang]]
Line 13: Line 13:
 
In this project is about human identification. In times of war and natural disaster there is a high demand for the ability to identify human remains. Also in the area of policing crime, the need to forensically identify remains is critical.  Also human identification can extend to humans that are alive: for example reuniting an adopted child with a birth family. Children that do not know their origins for a number of reasons (eg. human trafficking of babies) can also benefit.   
 
In this project is about human identification. In times of war and natural disaster there is a high demand for the ability to identify human remains. Also in the area of policing crime, the need to forensically identify remains is critical.  Also human identification can extend to humans that are alive: for example reuniting an adopted child with a birth family. Children that do not know their origins for a number of reasons (eg. human trafficking of babies) can also benefit.   
  
The idea of this project is to advance the techniques of human identification by focussing on a difficult case that that is too hard for current techniques.  You can read the details about the dead body and the circumstances [http://en.wikipedia.org/wiki/Taman_Shud_Case]
+
The idea of this project is to advance the techniques of human identification by focusing on a difficult case that that is too hard for current techniques.  You can read the details about the dead body and the circumstances [http://en.wikipedia.org/wiki/Taman_Shud_Case here.]
  
We also want you to bring the skills of an electrical engineer to bear on the area of e-forensics and bioinfirmatics see if you can apply these to other areas of the case (eg. analyzing degraded DNA sequences).
+
We also want you to bring the skills of an electrical engineer to bear on the area of e-forensics and bioinformatics see if you can apply these to an important area of the case, ie. analyzing degraded DNA sequences.
  
 
==Specific tasks==
 
==Specific tasks==
  
Here are the remaining tasks resulting from previous work. You may want to focus on a subset of these:
+
We have supplied you with a file of about 700,000 rs and i numbers and the corresponding SNPs for the Somerton Man. Unfortunately there's a very high drop out rate.  The idea is to groundtruth this data, characterize it, and squeeze out of it any information you can.  The idea is to find out what the data can tell us and also what it definitely fails at telling us. We need to know both. Here are some ideas to get you started:
  
* Critically review the statistical analysis of the letters. See if you can extend it (eg. testing another language previous students missed and checking if they included all possibilities of ambiguous letters)Is the conclusion of previous team correct?  Because there are lots of interesting new tasks (see below) don't spend too long on this. Just spend enough time on this as a quick warm up exercise.
+
* Write a script to count the rs and i numbers in each chromosomeCount the base pairs in each chromosome. Tabulate these results. Do this quickly.
  
* Download the searchable pdf file of the copy of the [http://www.eleceng.adelaide.edu.au/personal/dabbott/tamanshud/W&T_rubaiyat_wells_copy.pdf Omar Khayyam] that closely matches the dead man's copy. Create an ascii file with the raw text. Use this as a one-time pad that directly substitutes the letters of the alphabet a-z. The book contains 75 quatrains (four sentence poems) each containing about 140 letters.  So the whole book contains about <math>75 \times 140 = 10,500\,</math> letters. As we don't know where in the book the one-time pad starts, start at the beginning and step through the whole book one letter at a time. You'll then end up with >10,000 decrypts. Write a software script to look for the most common top-20 words in all the decrypts to narrow down to a few possible results that can be examined by eye.
+
* Create a synthetic "random human" file to get some idea what random SNPs do. Upload this in research mode on Gedmatch Genesis and test its characteristics. Does it actually link to any humans?
  
* Extend the [[Media:Cipher GUI.rar|CipherGUI 2011]] software that was created by a previous teamSee if you can add more ciphers to the collection. Use it to eliminate more ciphers and enter your conclusions here: [[Cipher Cross-off List]]. Be critical and be prepared to question and recheck some of the items already on the list.
+
* Create a synthetic human that say has all A's or a short sequence of SNPs that is repeated over and over periodicallyHow do these files behave?
  
* A previous team created a webcrawler and search engine to search keywords with wild cards, as Google does not allow this. This is  to check for common repeated expressions on the WWW that may contain initial letters that are also in the code. If the code is an initialism, this will give us a clue as to some likely content. There are two things that need to be fixed: (i) we need you to write a convenient web-interface for this search engine, and (ii) we need to scrap the webcrawler as it takes too long. It is impossible to crawl the whole web with one little PC and therefore you need to interface the search engine to operate on an index that has already been created by a commercial webtrawler. Unfortunately, companies like Google won't let you access their index list.  Therefore you need to use another provider such as [http://yacy.de/en/index.html YaCy]. You may also want to check out a resource called [http://commoncrawl.org/get-started/ CommonCrawl]. To use CommonCrawl you'll need to sign up with both [https://commoncrawl.atlassian.net/wiki/display/CRWL/Quick+Start+-+Build+from+Github github] (to download the Java source code) and also [http://docs.aws.amazon.com/ElasticMapReduce/latest/GettingStartedGuide/Welcome.html?r=4524 Amazon] (to run your uploaded compiled code).  
+
* Upload the Somerton Man in research mode. Find out which chromosomes are under the minimum required based papers. Try many different ways of padding these base pairs with dummy data in order to meet the minimum.   You are looking to find the padding method that influences the results the least.
  
* Use computer graphics to reconstruct and undistort the face of the dead man. What would he look like if he were alive? To do this you need the data from the previous group that scanned the man's face from a plaster bust, at the Police Museum, with a [http://www.david-laserscanner.com/ 3D scanner]. An example of the type of graphics software you can use to manipulate the scanned image is [http://www.123dapp.com/catch 123D]. You may want to investigate other 3D rendering graphics software.
+
* Using a known good DNA reference file, deliberately drop out those SNPs that are missing in the Somerton Man. As you know the groundtruth in this case, you can investigate which padding methods failed and which produced partial good results.
  
* Use the departmental 3D printer to recreate a scaled down version of the bust, before and after your 3D rendering. The motivation for creating a 3D representation is so that we can create 2D pictures of the man at any angle. It will not be long before companies like Google release next generation search engines that search for faces on the web. So having multiple images at a number of angles will be of future importance for a large-scale image search.
+
* Another padding trick is you might introduce homozygous pairs (AA, GG, TT, CC) in various places to beef up the number of SNPs.  These segments might distort the relationship estimates, but if you keep them to smaller segments, they might not make a difference. You can test this out on the good DNA file that has been deliberately stripped.
  
* Investigate the prices and availability of more expensive 3D scanners than the [http://www.david-laserscanner.com/ David 3D Scanner] that we have. Can you find a 3D scanner that would have the resolution to pick up all the pores and texture on the bust of the Somerton Man?
+
* Further to padding by just the minimum amount, you can then investigate if you can pad by more by finding SNPs that are common to males or have a high liklihood in males. You can test this on good DNA files.
  
* Plot, present, and interpret the mass spectrometer data we have.
+
* You can go to a SNP database that is indexed by rs and i numbers and find the affect each SNP has (eg. if it affects eyecolour, a disease, or whatever). You can then create you own file containing all the descriptions for the Somerton Man. Then you can write your own software to datamine this file for interesting physiological features, characteristics, and diseases.
  
 
==Deliverables==
 
==Deliverables==
Line 42: Line 42:
 
* Start Project Work (Week 1)
 
* Start Project Work (Week 1)
 
* Proposal seminar (Week 5)
 
* Proposal seminar (Week 5)
#* [[File:141 Proposal Seminar.pdf]]
+
** [[File:141 Proposal Seminar2019.pdf]]
 
* Progress report (Week 12) - only one report needed in wiki format
 
* Progress report (Week 12) - only one report needed in wiki format
#* [[File:141 Progress Report.pdf]]
+
** [[File:141 Progress Report2019.pdf]]
  
 
===Semester 2===
 
===Semester 2===
  
 
* Final seminar (Week 10)
 
* Final seminar (Week 10)
#* [[File:141 Final Seminar.pdf]]
+
** [[File:141 Final Seminar2019.pdf]]
 
* Final report (Week 11) - only one report needed in wiki format
 
* Final report (Week 11) - only one report needed in wiki format
#* [[Final Report/Thesis 2018]]
+
** [[Final Report/Thesis 2019]]
 
* Poster (Week 12) - one poster only needed
 
* Poster (Week 12) - one poster only needed
#* [[File:141 Poster.pdf]]
+
** [[File:141 Poster2019.pdf]]
 
* Project exhibition 'expo'  (Week 12)  
 
* Project exhibition 'expo'  (Week 12)  
 
* CD or stick containing your whole project directories (Week 13)
 
* CD or stick containing your whole project directories (Week 13)
 
* YouTube video (Week 13) - add the URL to this wiki
 
* YouTube video (Week 13) - add the URL to this wiki
#* https://www.youtube.com/watch?v=w9E6JIGcqzs&feature=youtu.be
+
** https://youtu.be/FcJbR_diNMg
  
 
== Weekly progress and questions ==
 
== Weekly progress and questions ==
 
This is where you record your progress and ask questions. Make sure you update this every week.
 
This is where you record your progress and ask questions. Make sure you update this every week.
*[[Cipher cracking 2018 weekly progress]]
+
*[[Cipher cracking 2019 weekly progress]]
  
 
==Approach and methodology==
 
==Approach and methodology==
Line 67: Line 67:
  
 
==Possible extension==  
 
==Possible extension==  
If you knock off this project too easily and are looking for a harder code cracking problem to try your software out on, you can progress to analyzing another famous unsolved mystery: the [http://en.wikipedia.org/wiki/Voynich_manuscript Voynich Manuscript]
+
* If there is time we can get hold of the original FASTQ file and perform statistical tests on it.  And you can attempt statistical imputation to infer missing SNPs. Idea behind imputation is like what we do in electronic engineering when we do error correction in digital communication systems.  For example if a communication system transmits the sentence "the cat sat on the mat"  and we receive "th* c*t s*t *n th* m*t" with drop outs in information, we can still reconstruct the sentence.  You can think of a DNA file as being like a long message that also suffers drop outs, and so you can use statistical techniques to infer values. We electrical engineering we call this "error correction" but in bioinformatics this is called "imputation."
  
 
== Expectations ==  
 
== Expectations ==  
We don't really expect you to find the killer, though that would be cool if you do and you'll become very famous overnight. To get good marks we expect you to show a logical approach to trying to find the patterns from the code on the web, and any other attempts to crack the code. Running the webcrawler for many hours is unrealistic, so we'd like you to find a pre-indexed version of the web for you search engine.
+
We don't really expect you to find the killer or identify the Somerton Man, though that would be cool if you do and you'll become very famous overnight. To get good marks we expect you to show a logical engineering approach to squeezing information out of the data, and using known good files to always groundtruth each idea.
 +
 
 +
It is perfectly acceptable to have a long list of ideas that didn't work, provide they are carefully tested in a structured way. Finding things that don't work is part of the scientific process. When we don't know what is supposed to work or not, being able to eliminate ideas that don't work is still very exciting and worthwhile.
 +
 
 +
* It is important to regularly see your main supervisors. Don't let more than 2 week go by without them seeing your face briefly.
 +
 
 +
* You should be making at least one formal progress meeting with supervisors per month. It does not strictly have to be exactly a month, but roughly each month you should be in a position to show some progress and have some problems and difficulties to discuss. On the other hand the meetings can be very frequent in periods when you have a lot of activity and progress to show.
 +
 
 +
* The onus is on you to drive the meetings, make the appointments, and set them up.
  
 
== Relationship to possible career path==
 
== Relationship to possible career path==
Whilst the project is fascinating as you'll learn about a specific murder case—and we do want you to have a lot of fun with it—the project does have a hard-core serious engineering side. It will familiarize you with techniques in information theory, probability, statistics, encryption, decryption, and datamining. It will also improve your software skills. The project will also involve writing software code that trawls for patterns on the world wide web (exploiting it as a huge database). This will force you to learn about search engines and databases; and the new tools you develop may lead to new IP in the area of datamining and also make you rich/famous. The types of jobs out there where these skills are useful are in computer security, comms, or in digital forensics. The types of industries that will need you are: the software industry, e-finance industry, e-security, IT industry, Google, telecoms industry, [http://www.asio.gov.au/ ASIO], [http://www.asis.gov.au/ ASIS], defence industry (e.g. [http://www.dsd.gov.au/ DSD]), etc. So go ahead and have fun with this, but keep your eye on the bigger engineering picture and try to build up an appreciation of why these techniques are useful to our industry. Now go find that killer...this message will self-destruct in five seconds :-)
+
Whilst the project is fascinating as you'll learn about a specific murder case—and we do want you to have a lot of fun with it—the project does have a hard-core serious engineering side. It will familiarize you with techniques in information theory, probability, statistics, programming, bioinformatics, and datamining. It will also improve your software skills. So go ahead and have fun with this, but keep your eye on the bigger engineering picture and try to build up an appreciation of why these techniques are useful to our industry. Now go find that killer...this message will self-destruct in five seconds :-)
  
 
==See also==
 
==See also==
* [[Final Report/Thesis 2018]]
+
* [[Final Report/Thesis 2019]]
 +
*[https://github.com/ratherto/Who-killed-the-Somerton-Man-2019 Codes used in this project]
  
 
== References and useful resources==
 
== References and useful resources==
If you find any useful external links, list them here:
+
If you find any useful external links and resources, list them here:
* [http://en.wikipedia.org/wiki/Taman_Shud_Case The taman shud case]
+
* [http://en.wikipedia.org/wiki/Taman_Shud_Case The Tamam Shud case]
* [http://www.library.cornell.edu/colldev/mideast/okhym.htm Edward Fitzgerald's translation of رباعیات عمر خیام by  عمر خیام]
+
* The PALEOMIX protocol used by this project [[File:NPO_schubert2014.pdf]]
* [http://ebooks.adelaide.edu.au/ Adelaide Uni Library e-book collection]
+
* Guanchen Li's Master's thesis [[File:Thesis_a1652167_Guanchen_Li.pdf]]'
* [http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx UN Declaration of Human Rights - different languages]
+
* [https://genesis.gedmatch.com/login1.php Gedmatch Genesis]
* [http://www.fbi.gov/hq/lab/fsc/backissu/jan2000/olson.htm Analysis of criminal codes and ciphers]
+
* [https://hirisplex.erasmusmc.nl/ Hair and eye colour calculator]
* [http://www.fbi.gov/hq/lab/fsc/backissu/april2006/research/2006_04_research01.htm Code breaking in law enforcement: A 400-year history]
+
  
 
==Back==
 
==Back==

Latest revision as of 04:31, 11 November 2019

Supervisors

Honours students

Project guidelines

General project description

In this project is about human identification. In times of war and natural disaster there is a high demand for the ability to identify human remains. Also in the area of policing crime, the need to forensically identify remains is critical. Also human identification can extend to humans that are alive: for example reuniting an adopted child with a birth family. Children that do not know their origins for a number of reasons (eg. human trafficking of babies) can also benefit.

The idea of this project is to advance the techniques of human identification by focusing on a difficult case that that is too hard for current techniques. You can read the details about the dead body and the circumstances here.

We also want you to bring the skills of an electrical engineer to bear on the area of e-forensics and bioinformatics see if you can apply these to an important area of the case, ie. analyzing degraded DNA sequences.

Specific tasks

We have supplied you with a file of about 700,000 rs and i numbers and the corresponding SNPs for the Somerton Man. Unfortunately there's a very high drop out rate. The idea is to groundtruth this data, characterize it, and squeeze out of it any information you can. The idea is to find out what the data can tell us and also what it definitely fails at telling us. We need to know both. Here are some ideas to get you started:

  • Write a script to count the rs and i numbers in each chromosome. Count the base pairs in each chromosome. Tabulate these results. Do this quickly.
  • Create a synthetic "random human" file to get some idea what random SNPs do. Upload this in research mode on Gedmatch Genesis and test its characteristics. Does it actually link to any humans?
  • Create a synthetic human that say has all A's or a short sequence of SNPs that is repeated over and over periodically. How do these files behave?
  • Upload the Somerton Man in research mode. Find out which chromosomes are under the minimum required based papers. Try many different ways of padding these base pairs with dummy data in order to meet the minimum. You are looking to find the padding method that influences the results the least.
  • Using a known good DNA reference file, deliberately drop out those SNPs that are missing in the Somerton Man. As you know the groundtruth in this case, you can investigate which padding methods failed and which produced partial good results.
  • Another padding trick is you might introduce homozygous pairs (AA, GG, TT, CC) in various places to beef up the number of SNPs. These segments might distort the relationship estimates, but if you keep them to smaller segments, they might not make a difference. You can test this out on the good DNA file that has been deliberately stripped.
  • Further to padding by just the minimum amount, you can then investigate if you can pad by more by finding SNPs that are common to males or have a high liklihood in males. You can test this on good DNA files.
  • You can go to a SNP database that is indexed by rs and i numbers and find the affect each SNP has (eg. if it affects eyecolour, a disease, or whatever). You can then create you own file containing all the descriptions for the Somerton Man. Then you can write your own software to datamine this file for interesting physiological features, characteristics, and diseases.

Deliverables

Semester 1

Semester 2

Weekly progress and questions

This is where you record your progress and ask questions. Make sure you update this every week.

Approach and methodology

We expect you to take a structured approach to both the validation of last year's results, and the writing of the software. You should carefully design the big-picture high-level view of the software modules, and the relationships and interfaces between them. Think also about the data transformations needed.

Possible extension

  • If there is time we can get hold of the original FASTQ file and perform statistical tests on it. And you can attempt statistical imputation to infer missing SNPs. Idea behind imputation is like what we do in electronic engineering when we do error correction in digital communication systems. For example if a communication system transmits the sentence "the cat sat on the mat" and we receive "th* c*t s*t *n th* m*t" with drop outs in information, we can still reconstruct the sentence. You can think of a DNA file as being like a long message that also suffers drop outs, and so you can use statistical techniques to infer values. We electrical engineering we call this "error correction" but in bioinformatics this is called "imputation."

Expectations

We don't really expect you to find the killer or identify the Somerton Man, though that would be cool if you do and you'll become very famous overnight. To get good marks we expect you to show a logical engineering approach to squeezing information out of the data, and using known good files to always groundtruth each idea.

It is perfectly acceptable to have a long list of ideas that didn't work, provide they are carefully tested in a structured way. Finding things that don't work is part of the scientific process. When we don't know what is supposed to work or not, being able to eliminate ideas that don't work is still very exciting and worthwhile.

  • It is important to regularly see your main supervisors. Don't let more than 2 week go by without them seeing your face briefly.
  • You should be making at least one formal progress meeting with supervisors per month. It does not strictly have to be exactly a month, but roughly each month you should be in a position to show some progress and have some problems and difficulties to discuss. On the other hand the meetings can be very frequent in periods when you have a lot of activity and progress to show.
  • The onus is on you to drive the meetings, make the appointments, and set them up.

Relationship to possible career path

Whilst the project is fascinating as you'll learn about a specific murder case—and we do want you to have a lot of fun with it—the project does have a hard-core serious engineering side. It will familiarize you with techniques in information theory, probability, statistics, programming, bioinformatics, and datamining. It will also improve your software skills. So go ahead and have fun with this, but keep your eye on the bigger engineering picture and try to build up an appreciation of why these techniques are useful to our industry. Now go find that killer...this message will self-destruct in five seconds :-)

See also

References and useful resources

If you find any useful external links and resources, list them here:

Back