Semester A Progress Report 2013 - Cipher cracking
Executive Summary
The 2013 University of Adelaide Cipher Cracking Team, Lucy Griffith and Peter Varsos, are continuing the investigation into the mystery of the Somerton Man. During semester A the team has worked on extending the 2012 team's analyses of the frequency of letters in different languages, testing the Rubaiyat of Omar Khayyam as a one time pad for deciphering the code, and implementing cipher systems used in the 1940s in the Cipher GUI. So far the conclusions drawn by the 2013 team are as follows:
- The English language statistically fits the code best, assuming the code is the initial letters of words in a sentence or list.
- The Rubaiyat of Omar Khayyam was most likely not used as a one time pad to encrypt the code.
- There are a number of cipher systems still to be implemented in the Cipher GUI.
Still to come later this year:
- Re-analysis of letter frequency using the UN Declaration of Human Rights (available in up to 406 languages).
- Analysis of mass spectrometry data from the Somerton Man's hair.
- Reconstruction of the Somerton Man's 3D bust to better represent appearance before death.
- Production of an instructional video for the One Time Pad GUI created by Peter Varsos.
Introduction
Background
The Case
Known as the Taman Shud Case, the discovery of an unknown deceased man on Somerton Beach, Adelaide (subsequently known as the Somerton Man) in 1948 remains one of Australia's most baffling mysteries. Initially the case seemed a relatively run-of-the-mill suicide. The South Australian Police first tried to identify the man using the clues available to them; the train and bus tickets found on the man led them to a suitcase stored in the Adelaide Railway Station, but the man carried no formal identification. The Police then released his photo to the papers and his fingerprints to the army to check their records, but there was still no positive identification. The body had to be embalmed, and later a plaster cast was made of the victim's bust so that the body could be interred.

Six months after the discovery of the body, in a review of the evidence, a tiny piece of rolled paper was found in a pocket of the man's clothing. This piece of paper contained the words Tamam Shud. Police, with assistance from the Library of South Australia, concluded that this scrap of paper had come from a book of Persian poetry translated into English: originally written by the Persian philosopher Omar Khayyam, the text was titled The Rubaiyat of Omar Khayyam and was translated by Edward Fitzgerald. A photograph of the scrap released to the media led a Glenelg man to submit a copy of this very book as evidence. The book was a rare first edition and had been found on the back seat of a car parked along Jetty Road two weeks before the body was discovered. Police verified that the scrap had come from this very copy of the book, and here the mystery deepens.

Scrawled inside the back cover of the book were five lines of nonsensical letters and a telephone number. The phone number led Police to a local woman who denied any knowledge of the man or this copy of the book. The five lines of letters are the interesting part: they are believed by many to be an encrypted message written by the dead man. To this day the man has never been identified nor the message decrypted. Unfortunately the original copy of the book was lost in the 1950s. In 2013 our team was lucky enough to have access to a scanned copy of a first edition of The Rubaiyat of Omar Khayyam, the same edition connected to the Taman Shud Case.
The Code
WRGOABABD
MLIAOI
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTGAB
Five lines of letters from the Roman alphabet were found scribbled inside the back cover of a copy of The Rubaiyat of Omar Khayyam connected with the Taman Shud Case. Many people believe that it is an encoded message, but no one has been able to decode it. The second line appears to have been struck out; since the second and third lines are similar, it is believed that the second line is an incorrect version of part of the third. For some characters it is unclear which letter is intended (M or W, I or L). Much of the final year project work has focussed on this code.
Previous Studies
In addition to the initial South Australian Police investigation there have been many other studies of the Taman Shud Case. The most notable attempt at cracking the code was made by the Australian Defence Force in 1978; unfortunately the ADF concluded that the encoded message was too short to find a pattern and crack it. This has led our predecessors at the University of Adelaide to pursue a test-and-discount method for the code. Since 2009 there have been final year project teams studying the Taman Shud Case, applying an engineering approach to cracking the code and attempting to discover the identity of the Somerton Man. In addition to trying many cipher systems on the code, the project teams at Adelaide have produced software to encode messages and analyse the code, trawled the internet for phrases, and scanned the plaster bust of the victim in 3D. Despite these investigations the identity of the Somerton Man remains a mystery.
Project Objectives
In 2013 the team has the privilege of working with evidence not available since the original investigation. We have a scanned copy of a first edition of The Rubaiyat of Omar Khayyam as well as the data from a spectral analysis of a strand of the Somerton Man’s hair. The objectives for 2013 are to:
- Review and extend the 2012 analysis of letter frequency in different languages.
- Attempt to decipher the code using the Rubaiyat as a One Time Pad.
- Extend the Cipher GUI to include ciphers particularly used in the 1940s.
- Analyse the mass spectrometry data to draw some conclusions about the Somerton Man’s life in the time before his death.
- Alter the 3D scan from 2012 to better represent the Somerton Man before death.
Review of 2012 Statistical Analysis
In 2012 the honours project group conducted a frequency analysis of the letters of the Roman alphabet. They analysed the relative frequency of the letters themselves, as well as of words beginning with each letter of the Roman alphabet, in over 80 different languages; the same passage of text, translated into each of the languages, was used for this analysis. The 2012 team also separately analysed the English language using the Oxford English Dictionary. They compared the results from these analyses with the code from the Somerton Man case. I have conducted a review of this analysis to identify possible extensions for 2013.

The 2012 team conducted a wide analysis of many different languages. Unfortunately the text used for this analysis was a short translated passage from the Bible, around 250 words long in each language. This was not a large sample size, only around ten times the length of the alphabet itself. In addition, the source of the text meant that there were many shared words between the languages (biblical names etc.). This could have skewed the data, as these words would mostly not have originated in the language being analysed. In conclusion I find little merit in this portion of the analysis undertaken by the 2012 group. The analysis of the first letters of words in the Oxford English Dictionary is also not very useful in my opinion. A dictionary contains many obscure and uncommon words, so it does not represent the English language as commonly used, and it therefore contributes little to the theory that the Somerton Code is an acronym of a sentence or the first letters of words in a list. Regardless of these criticisms, the 2012 group did come to the same conclusion as the groups before them: the language most likely used for an initial-letter encoding of the Somerton Code is English.

As an extension to the 2012 analysis it was proposed that the 2013 team, Peter Varsos and Lucy Griffith, conduct several new analyses. We were to analyse initial-letter frequency in transcribed Russian and Persian (Farsi), and perhaps some other eastern European languages. Transcribed text is text written phonetically in a different language's alphabet, in this case the Roman alphabet. In order to conduct these analyses we needed to find some texts to sample; unfortunately, finding these texts proved too time consuming and this analysis was abandoned. The transcription of both Russian and Farsi also requires an extended Roman alphabet, which would likely have meant a lower correlation to the Somerton Code since that code contains none of those special characters. One option to counteract this skew would have been to count the special characters as their 'parent' characters, for example treating 'ǎ' as 'a'. This could perhaps be the subject of analysis in 2014.

It was also suggested after the 2012 review that we re-analyse the English language against some contemporary texts of the time. We did an initial-letter frequency analysis, using the 2012 computer code, with a page from a 1948 edition of The Advertiser newspaper. This resulted in an average standard deviation lower than most of the languages analysed in 2012. Considering our review of those analyses, we believe that English is still the most likely candidate if the Somerton Code is the first letters of words in a list or sentence. A minimal sketch of this kind of frequency comparison is given below.
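This sketch is illustrative only: the class and method names are ours, and the 2012 computer code may count and compare frequencies differently. It extracts the initial letters of a sample text, computes relative letter frequencies for both the sample and the code, and reports a root-mean-square deviation between the two distributions (lower suggests a closer fit).

```java
import java.util.Locale;

// Illustrative sketch of the initial-letter frequency comparison;
// not the actual 2012/2013 analysis code.
public class FrequencyCompare {

    // Relative frequency of each letter a-z among the initial
    // letters of the words in the given text.
    static double[] initialLetterFrequencies(String text) {
        int[] counts = new int[26];
        int total = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) continue;
            counts[word.charAt(0) - 'a']++;
            total++;
        }
        double[] freq = new double[26];
        for (int i = 0; i < 26; i++) freq[i] = (double) counts[i] / total;
        return freq;
    }

    // Relative frequency of each letter a-z in a block of letters,
    // such as the Somerton Code itself.
    static double[] letterFrequencies(String letters) {
        int[] counts = new int[26];
        int total = 0;
        for (char c : letters.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') { counts[c - 'a']++; total++; }
        }
        double[] freq = new double[26];
        for (int i = 0; i < 26; i++) freq[i] = (double) counts[i] / total;
        return freq;
    }

    // Root-mean-square difference between two distributions;
    // a lower value suggests a closer fit.
    static double deviation(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < 26; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum / 26);
    }

    public static void main(String[] args) {
        String code = "WRGOABABD WTBIMPANETP MLIABOAIAQC ITTMTSAMSTGAB";
        String sample = "The quick brown fox jumps over the lazy dog"; // replace with a real sample text
        System.out.printf("deviation = %.4f%n",
            deviation(initialLetterFrequencies(sample), letterFrequencies(code)));
    }
}
```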
We also checked the Rubaiyat text against the Somerton Code, using computer code, looking for a match of initial letters, but found none. We considered doing the same with a transcribed Farsi version of the Rubaiyat but were unable to locate one within a reasonable time frame. This could be yet another extension for the 2014 team.
The Rubaiyat Matcher - Matcher
Concept
With the correct edition of the Rubaiyat available to us for the first time, and given its supposed connection to the case, it is important that we begin the basic analysis of how it could be involved, as it could trivially lead us somewhere. A software suite that extracts the first letters of every word in any plaintext document (such as a plaintext version of the Rubaiyat) and checks for the letters of the secret code within them will greatly increase the efficiency of testing down the line. Given the previous groups' finding that the code is likely the first letters of words, this software will allow us to check that hypothesis against the Rubaiyat and against any other source we find later on. Secondly, later work on the code will require working word-matching software, so building it now allows us to extend it later. In particular, the one time pad software will generate many decrypts that will need to be searched through, which an expanded version of this will allow. This needs to be taken into account in the design: all of the code for the first-letter analysis will be kept separate, and the rest kept very modular to allow other methods to plug into it later.
Design
From this concept came a fairly simple design process, as the program is very modular. A central 'matcher' object was created, which holds strings containing the key and code files, read in from buffers. From here, we can simply add a process that does whatever we want to these two strings and attach it to a button on the GUI. We could therefore add these modular processes one at a time: each step involved adding one button to the GUI and a method (with helper methods) that returns its result as a string to the matcher object and tells it to print its current state to the GUI.
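As an illustration of this pattern, here is a minimal, hypothetical sketch of the matcher object; the real 2013 class is larger and all names here are ours:

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of the central 'matcher' object: it holds the
// key and code strings and lets the GUI attach one modular process
// per button. Illustrative only, not the actual project code.
public class Matcher {
    private String key;   // full key text, read in from a buffer
    private String code;  // the secret code, read in from a buffer

    public Matcher(String key, String code) {
        this.key = key;
        this.code = code;
    }

    // Each GUI button wraps one process: a function applied to the
    // current key text, whose result becomes the new state.
    public void apply(UnaryOperator<String> process) {
        key = process.apply(key);
        printState();
    }

    private void printState() {
        System.out.println("key:  " + key);
        System.out.println("code: " + code);
    }

    public static void main(String[] args) {
        Matcher m = new Matcher("It is the brown fox", "ITBF");
        // One modular process: strip everything but letters and spaces.
        m.apply(s -> s.replaceAll("[^A-Za-z ]", ""));
    }
}
```

The benefit of this shape is that each new analysis is a single function attached to a button; the GUI itself never needs restructuring.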
A simple GUI was created. As most of the output would be to text documents (especially once this is extended to the OTP decoder) there was no need for panels and such; rather, the GUI is just a control panel for the various methods.
In addition, a letter frequency analysis tool was added to the GUI, as the previous group had been using a standalone one and we felt that integrating it all into one place would speed up future work.
Implementation
The letter-matching part of this software has several functions. It contains several small methods by which a block of plaintext can be converted into a string of the first character of every word. The main part of the software allows another, smaller string to be checked against this one in the hope of finding 'matches': occurrences of one string's contents within the other.
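A minimal sketch of the first-letter conversion and the full-match check; the names are illustrative, not the actual methods:

```java
// Sketch of the first-letter conversion and full-match check.
public class FirstLetters {

    // Reduce a block of plaintext to the first letter of every word.
    static String firstLetters(String text) {
        StringBuilder sb = new StringBuilder();
        for (String word : text.split("[^A-Za-z]+")) {
            if (!word.isEmpty()) sb.append(Character.toUpperCase(word.charAt(0)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String initials = firstLetters("A book of verses underneath the bough");
        System.out.println(initials);                   // ABOVUTB
        System.out.println(initials.contains("OVUT"));  // full-match check: true
    }
}
```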
The first implementation simply checked whether the contents of the smaller string appeared within the larger one; this is obviously only a very shallow way of checking. Therefore, even before looking at possible extensions to the first version, it was necessary to implement a few different types of matching methods in parallel, in the hope that one would give us what we were looking for. A very simple addition was to check for partial matches rather than only full ones, which could be implemented trivially since it can be done exhaustively. There is potential to expand in this area later: even with random data some partial matches will come up for single, double and even triple letters, so working out the expected rates and looking for a threshold where a match becomes statistically significant could yield a very robust way of searching.
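The partial match can be sketched as an exhaustive search for the longest run of consecutive code letters that appears anywhere in the initials string; again the names are illustrative:

```java
// Sketch of the partial-match idea: find the longest run of
// consecutive code letters that appears anywhere in the initials.
public class PartialMatch {

    static String longestPartialMatch(String initials, String code) {
        for (int len = code.length(); len > 0; len--) {
            for (int start = 0; start + len <= code.length(); start++) {
                String piece = code.substring(start, start + len);
                if (initials.contains(piece)) return piece;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // With random data, short runs match by chance, so the length
        // of the match must be weighed against what chance predicts.
        System.out.println(longestPartialMatch("ABOVUTBAMS", "MLIAMS")); // "AMS"
    }
}
```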
Also implemented was a wildcard match. One of the big problems when looking at the code is the possibility that one or more of the letters was copied down wrong (the handwriting wasn't the best!), which undermines any kind of statistical analysis since everything could hinge on one wrong letter. Therefore a very simple method was designed that performs the above matching while ignoring each letter of the code in turn. This could easily be extended later to ignore more letters, or combined with the partial matching method above. A proper combination of the two could be very worthwhile if we deem it worth the time later, but it would be unlikely to yield results for this particular example and so it will be left until then.
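A sketch of the wildcard idea, assuming a regular-expression implementation (the actual method may work differently):

```java
import java.util.regex.Pattern;

// Sketch of the wildcard match: retry the full-match check with each
// code letter in turn replaced by a wildcard, to allow for one
// miscopied letter.
public class WildcardMatch {

    static boolean matchesIgnoringOneLetter(String initials, String code) {
        for (int i = 0; i < code.length(); i++) {
            // '.' matches any single character in place of letter i.
            String pattern = Pattern.quote(code.substring(0, i)) + "."
                           + Pattern.quote(code.substring(i + 1));
            if (Pattern.compile(pattern).matcher(initials).find()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "AXOVU" is not in the initials, but ignoring its second
        // letter it matches "ABOVU".
        System.out.println(matchesIgnoringOneLetter("ABOVUTB", "AXOVU")); // true
    }
}
```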
Finally it is worth mentioning the possibility of extending this specific example to other texts. We were able to obtain a different translation of the Rubaiyat to see if it would give a better result, and should any other possibilities come up later we can try those too.
Testing
Within this fairly limited scope, the testing of this module was fairly trivial. Test cases with upper- and lower-case input were designed, and the outputs could be checked against expected results by hand.
The Rubaiyat Matcher - One Time Pad
Concept
The one time pad software is designed to read a key and a code from two text files, and write to another file a list of all possible decrypts of that code using every possible one time pad combination of letters throughout the key document. It must also give the option to 'shortlist' or search this massive list of decrypts, and sort them in such a way that the decrypts that are more likely to be of interest are at the top.
Requirements
- Read in from a buffer containing two separate text documents, the key and the code.
- Be able to generate a translation key in an array from any given starting point in the key document.
- Iterate through the document creating such a key for every possible output.
- Use said keys to create decrypts, and output these to another text document.
- Allow on the spot searching of these decrypts.
- A method by which a 'rating' can be given to each decrypt, accurately reflecting its likelihood of being English rather than gibberish.
- Creation of 'debug mode', a tool to allow ease of debugging and verifying the code later.
- Sort these decrypts by their rating to allow easy searching for a likely candidate.
- Creation of a GUI that is both easy to use and does not hide functionality.
- Creation of neat and modular code to allow possibility of later additions from other groups.
Design
Because of the complicated nature of this program, we had to be very careful with our design. In essence, since the program needs to work with very large plaintext documents, it must be transient: it simply reads and buffers from the two text documents, decodes, and prints the resulting decrypts to a third text document without storing everything in memory.
By reusing the modular GUI and matcher driver from the previous section, including the already designed buffers and readers, a lot of time was saved. The main problem then was the actual decryption method itself.
A recursive solution was used. A method was designed to read from the key plaintext one letter at a time, creating a transform array that takes each new letter in order. This method was called recursively with a seed allowing it to start one letter further forward each time, until it hit the end of the document. After this step the transform array was simply passed to a decrypt method which applied it in reverse (we are decrypting, not encrypting!) to the message and printed the solution to a new document. This generates a list of (tens of) thousands of decrypts for an average sized document such as the Rubaiyat, which then need to be searched through.
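The following is an iterative sketch of that generator (the real implementation is recursive, and all names here are ours). It builds a transform array from each starting offset, taking each previously unseen letter in order, and applies it to the code; with the shifted-alphabet key from the worked example later in this report, it recovers the plaintext:

```java
import java.util.ArrayList;
import java.util.List;

// Iterative sketch of the one time pad decrypt generator; not the
// actual recursive implementation in the project software.
public class OtpDecryptor {

    // Build a transform array from the key text, starting at 'seed':
    // each previously unseen letter fills the next slot, until all
    // 26 slots are filled (or the text runs out).
    static char[] transformFrom(String keyText, int seed) {
        char[] transform = new char[26];
        boolean[] seen = new boolean[26];
        int filled = 0;
        for (int i = seed; i < keyText.length() && filled < 26; i++) {
            char c = Character.toLowerCase(keyText.charAt(i));
            if (c < 'a' || c > 'z' || seen[c - 'a']) continue;
            seen[c - 'a'] = true;
            transform[filled++] = c;
        }
        return filled == 26 ? transform : null;
    }

    // Decrypt by mapping each cipher letter's alphabet position
    // into the transform array.
    static String decrypt(String code, char[] transform) {
        StringBuilder sb = new StringBuilder();
        for (char c : code.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') sb.append(transform[c - 'a']);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String keyText = "zabcdefghijklmnopqrstuvwxy";
        String code = "UIFCSPXOGPY";
        List<String> decrypts = new ArrayList<>();
        for (int seed = 0; seed < keyText.length(); seed++) {
            char[] t = transformFrom(keyText, seed);
            if (t != null) decrypts.add(decrypt(code, t));
        }
        System.out.println(decrypts.get(0)); // "thebrownfox"
    }
}
```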
Search and shortlist functions were also designed. Again, these are available modularly from buttons on the GUI and simply read from the file that the decrypt method outputs to. The search function steps through the list of decrypts, buffering each line into a string, and searches those strings for whatever word was typed into the box in the GUI.
The shortlist function takes each decrypt and 'ranks' it by its likelihood of being an interesting string. In terms of design, we decided that being incredibly strict with these ranking criteria was not needed, as it became obvious very quickly that the vast majority (usually all) of the decrypts were complete nonsense strings of random letters not even resembling English. These can be filtered out very fast, and a human can easily spot a real string of words among a few hundred candidates (but not tens of thousands). Therefore a points system was devised with three categories: Very Strong, Strong, and Weak matches. A string containing any common word such as "and" or "the" is bumped right up into the Very Strong category for a human to look at, while points are also given for common English bigrams (pairs of letters such as th, en) and common letters. It was tuned so that, of the tens of thousands of decrypts output for the Rubaiyat and the message, a few hundred were Very Strong and around the top third were Strong; a human can very easily look through this shortlist for anything of interest.
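A sketch of the points system, with illustrative word lists, bigram lists and thresholds (the tuned values in the real software differ):

```java
// Sketch of the three-tier shortlist ranking; the word lists,
// bigram lists and thresholds here are illustrative.
public class Shortlist {

    static final String[] COMMON_WORDS   = { "and", "the", "it", "at" };
    static final String[] COMMON_BIGRAMS = { "th", "en", "at", "er" };

    static int score(String decrypt) {
        String s = decrypt.toLowerCase();
        int points = 0;
        for (String w : COMMON_WORDS)
            if (s.contains(w)) points += 50; // pushes it to Very Strong
        for (String b : COMMON_BIGRAMS)
            for (int i = 0; s.indexOf(b, i) >= 0; i = s.indexOf(b, i) + 1)
                points += 5;                 // each bigram occurrence
        for (char c : s.toCharArray())
            if ("aeiou".indexOf(c) >= 0) points += 1; // common letters
        return points;
    }

    static String rank(String decrypt) {
        int p = score(decrypt);
        return p >= 50 ? "Very Strong" : p >= 20 ? "Strong" : "Weak";
    }

    public static void main(String[] args) {
        System.out.println(rank("thebrownfox")); // Very Strong
        System.out.println(rank("qzkvvjxwqpd")); // Weak
    }
}
```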
Implementation Process
- Design a solution for iterating through a large document to get a translation key.
- Design a method to decrypt a code using a given key.
- Implementation of a program using said designs.
- Research which attributes of a string make it more likely to be English.
- Implementation of a search function using the decided-upon approach.
- Testing and verification of the software.
Testing
Because of the complicated nature of this software, the need for very robust tools to check that each stage was functioning was identified very early in the project's lifespan. Because of this, a toggle for a 'debug mode' was supported all the way through, available in the file menu of the GUI. This mode causes each stage of the process to output additional information to the terminal, allowing breakpoint-style testing within the actual GUI. As the expected output of each stage is deterministic, this made sanity checks almost trivial: outputs could be verified by matching examples worked by hand, or simply by inspecting what enters and leaves each stage and whether that is as expected.
Boundary cases of small or empty keys/codes were all handled by exceptions, prompting the user to replace their input rather than exiting, making the program robust and easy to use.
Additionally, through the use of OS-supported line endings and similar calls, the program runs on both Windows and UNIX systems.
The shortlist function also required a lot of testing: if it were not tuned correctly it could easily throw potentially useful decrypts into the Weak match category where they might never be looked at, which would be a disaster. Initially a full suite of tests was designed; however, it quickly became obvious that it is much easier to pick out English sentences from random strings than we first thought. In a document of a few hundred random strings the size of our secret code, a ranking system that simply gave points for common words (and, the, it, at), common bigrams (th, en, at) and common letters (vowels) would often score an English string double or triple any string of random letters. As such, we simply tuned it so that people could go through the Very Strong and Strong sections by hand, and we are fairly confident that while our shortlist method isn't a perfect detector of English, it is very unlikely to miss an English sentence completely.
Results
The implemented One Time Pad solution successfully allows us to generate a list of decrypts for any given key and code. Several tests, including the example shown below, show the software works as intended. Users simply put the text files in the correct places as prompted and press a big 'Get Decrypts' button, and the program opens a page with all possible decrypts in a list.
The shortlist function works insofar as it ranks the decrypts successfully, and in many test cases it was able to pick a string of English words out of a block of gibberish. However, these were test cases designed with knowledge of the heuristics used to search for these words. With a fairly simple method based on common words, bigrams and letters, we still cannot guarantee that every possible sentence would rank higher than any given string of gibberish. For the current example of a small block of text we believe the implementation is sufficient to pick out a secret message, but there is much room for improvement in this method should the need ever arise for later groups.
With the software suite working we were able to begin testing to see if we could decode the message. Several things were tried:
- Using the full plaintext Rubaiyat as a key resulted in some ten thousand decrypts, of which none in the shortlist yielded results.
- Using the first letters of the Rubaiyat generated by our matcher software as a key resulted in a thousand or so possible decrypts, which again gave no positive results.
- Using the phrase 'Tamam Shud' (with an alphabet afterwards to fill the translation) yielded no results. Similarly, using 'TS' with an alphabet after it gave nothing.
We were also able to test some straight shift ciphers, since an OTP with a shifted alphabet as its key is equivalent to a shift cipher (e.g. the OTP key 'bcdefg...xyza' is equivalent to a shift cipher where every letter is moved one to the right). Every possible straight shift can be tested simply by using two alphabets one after the other as the key text, so that every 26-letter window is a shifted alphabet; this also gave no positive results.
One Time Pad Example
Here is a quick example of how the Matcher GUI can be used. The first step is to generate an encoded message. Here we used "the brown fox" and a shifted-alphabet one time pad key of 'zabcdefghijklmnopqrstuvwxy'. This gave us an encrypted message of "UIFCSPXOGPY".
Our input files for The Matcher GUI are the encrypted message and a section of nonsensical text which includes our original encryption key.
The Matcher GUI can then be run to "Get Decrypts". This produces a file of decrypts with every possible encryption key from the text. These keys are sets of 26 consecutive letters each shifted one letter along from the previous.
The Matcher GUI can then "Search Decrypts" for a specific keyword entered in the "Keyword to search decrypts" box, or it can "Create a Shortlist" of decrypts containing common English words and letter combinations. The shortlist generated during our example contains the decrypt of the original message!
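For completeness, here is a small sketch verifying the arithmetic of this example (our own check, not part of the GUI): with the key 'zabcdefghijklmnopqrstuvwxy', encryption maps each plain letter's position in the key back onto the plain alphabet, which here amounts to a +1 shift:

```java
// Quick check of the worked example's arithmetic.
public class ExampleCheck {
    public static void main(String[] args) {
        String key = "zabcdefghijklmnopqrstuvwxy";
        StringBuilder cipher = new StringBuilder();
        for (char c : "the brown fox".toCharArray()) {
            if (c < 'a' || c > 'z') continue; // skip spaces
            // Plain letter's position in the key indexes the alphabet.
            cipher.append(Character.toUpperCase(
                (char) ('a' + key.indexOf(c))));
        }
        System.out.println(cipher); // UIFCSPXOGPY
    }
}
```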
Cipher GUI Extension
Previous Work
The 2011 group completed a Cipher GUI with the aim of allowing fast encrypting and decrypting with several different ciphers within an easy to use GUI. Its design allows fairly easy addition of new ciphers, making repeated tests of new ideas, such as new keys or slightly different codes, very easy. As well as implementing this rather large piece of software they implemented quite a few actual ciphers as part of the package, many of which were from the cipher cross-off list. However, the ciphers chosen were for the most part simple ones (presumably because the software itself took a while to get working, leaving no time for an in-depth look). This leaves a lot to be desired in terms of actually solving our secret code: there has been no in-depth look into which ciphers were frequently used at the time and which the code could likely be.
Concept
With this in mind, our aim going into this section was to do some detailed research into which ciphers were likely to have actually been used during the time period. There were no readily available computers then, and we can guess that it is unlikely that the person in question had access to a mechanical method of encryption, as these were rare outside military applications. This helps narrow it down to ciphers that can be used to quickly encrypt a message by hand. Indeed the messiness of the handwriting supports this hypothesis: it looks like it was done speedily. This gives us a good starting point for finding, testing and then implementing a few new ciphers into the Cipher GUI that match these criteria. A small amount of information about the relevant ciphers, and why we chose them, follows:
Column Cipher: A transposition cipher that is well known to have been used at the end of WW2 and in the Cold War due to its relative strength and its good speed of hand encryption. As a transposition cipher, it is unlikely to be a direct transform of our secret code, since that would mean the same letters in different places, and it is difficult to make any sentence with the letters shown (or, to put it technically, frequency analysis shows the code is unlikely to be a straight transposition and more likely to be a substitution, as previous groups discovered). However, the fact that this cipher was used so commonly and easily means there is a chance it was applied after a substitution cipher for another layer of complexity, and therefore we decided it was worth implementing.
Double Transposition Cipher: Also known as a Double Column Cipher. The same as above but done twice with two keywords. As before this was commonly used, but the other reason we chose to try it is the relevance of double-keyword ciphers to this case: the words Tamam Shud have a high chance of relevance. A sketch of both ciphers is given below.
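The sketch shows one common variant of the column cipher and expresses the double transposition as two applications with two keywords; the keyword-ordering and padding conventions used in the Cipher GUI may differ.

```java
import java.util.Arrays;

// Sketch of a columnar transposition cipher: write the message in
// rows under the keyword, then read the columns out in the
// alphabetical order of the keyword letters. Illustrative only.
public class ColumnCipher {

    static String encrypt(String message, String keyword) {
        String text = message.toLowerCase().replaceAll("[^a-z]", "");
        int cols = keyword.length();
        // Column read order: alphabetical order of the keyword letters
        // (ties broken left to right by the stable sort).
        Integer[] order = new Integer[cols];
        for (int i = 0; i < cols; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> keyword.charAt(a) - keyword.charAt(b));
        StringBuilder out = new StringBuilder();
        for (int col : order)
            for (int i = col; i < text.length(); i += cols)
                out.append(text.charAt(i));
        return out.toString();
    }

    public static void main(String[] args) {
        String single = encrypt("we are discovered flee at once", "tamam");
        // Double transposition: the same cipher applied twice with
        // two keywords, e.g. the (illustrative) pair 'tamam' and 'shud'.
        String dbl = encrypt(single, "shud");
        System.out.println(single);
        System.out.println(dbl);
    }
}
```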
Design
The previous group implemented the existing code, so half of the challenge of this part was working out how they had designed everything.
The design was fairly modular, as a result of the way the software was built to handle many different types of ciphers. As such, we simply needed to design single cipher modules with the inputs and outputs defined by the connector.
This made the design fairly simple, as did the use of Eclipse, a free IDE that we were also using for our project: we were able to simply import their project material. A diagram from the previous implementation is given here as Figure 1, showing that our new modules need only connect to the interface.
Implementation
As mentioned above we implemented several stand-alone cipher modules which work within the Cipher GUI. Currently we have added:
- Column Cipher
- Double Transposition Cipher
Testing
After a quick analysis we came to the same conclusion as the previous group: testing of this software suite could be done fairly modularly. As such we could simply test each of the single cipher modules as we implemented them, by testing examples and edge cases against expected outputs.
Future Work
This project will continue into semester 2. We plan on working towards the following deliverables as time permits:
- Finishing the implementation of several new ciphers in the Cipher GUI.
- A full frequency analysis study using the Declaration of Human Rights and its translations into around 400 languages. We aim to see which language fits the code best; no previous group has done such an expansive study.
- Continuing work on the Matcher GUI. We plan to increase functionality for the purposes of the frequency analysis above, and to increase usability by upgrading the GUI. This includes a video on its operation, to be included with our final report.
- Beginning work on the 3D modelling section. We aim to clean up the model to increase the probability of it being useful for image recognition software in the future.
- Analysing the mass spectrometry data now available to us. We aim to see whether concentrations of chemicals in our mystery man's hair can give a clue to where he may have lived prior to his death.
There are also several other prospective directions, including updating the Web Crawler from 2011, and we will continue to manage our timetable as the year goes on.
Project Management
Project Schedule
Gantt Chart
Budget
As of the end of semester 1, we have had no need of our allocated budget. In semester 2 we plan on doing 3D modelling work with the bust of the mystery man, and so we are reserving the budget as a contingency plan in case free 3D modelling software solutions do not meet our needs in terms of the changes we want to make.
Risk Management
A full risk assessment is available in File:RiskAssessment2013.pdf and File:PlantRiskAssessment2013.pdf.
Conclusion
Semester 1 of the 2013 Cipher Cracking project has led to quite a bit of progress. Each objective for this semester has been met, and the software design extended beyond our original goals. We have met our schedule throughout, and met all requirements while coming in on budget.
One of the really big aims this year, to check the newly discovered copy of the Rubaiyat for possible one time pad keys, has been followed through successfully and come to a solid conclusion (which unfortunately was that there was nothing there). The program created will, however, provide an easy way to check future one time pads against the code should future groups have any need to do so, and may therefore play a significant part in future groups' progress. Several other small hypotheses in frequency analysis have been thoroughly tested, including some results that contradicted a previous group's findings, due to their use of a Bible passage whose letter frequencies were skewed by repeated names and the like. Additionally, a good start has been made on the Cipher GUI work, with ciphers potentially useful to the project being added rather than random ones for the sake of saying the software is operational. Though they have not yet cracked the code, having these in the tool is important: they were commonly used at the time, so there is always a possibility that the code was double encrypted with a currently unknown code and one of these. All of the software and the plaintext one time pad files are available in the 'See also' section below.
See also
- Cipher_Cracking_2013/Rubaiyat_Plaintext
- File:Matcher Program v1.0.rar
- File:PlantRiskAssessment2013.pdf
- File:RiskAssessment2013.pdf
- Cipher Cracking (main page)
- Cipher Cracking 2013
- Cipher Cross-off List
- Timeline of the Taman Shud Case
- List of people connected to the Taman Shud Case
- List of facts on the Taman Shud Case that are often misreported
- List of facts we do know about the Somerton Man
- The Taman Shud Case Coronial Inquest
- Primary source material on the Taman Shud Case
- Secondary source material on the Taman Shud Case
- Semester B Final Report 2013 - Cipher cracking
References and useful resources
- The Taman Shud Case
- Edward Fitzgerald's translation of رباعیات عمر خیام by عمر خیام
- Adelaide Uni Library e-book collection
- Project Gutenberg e-books
- Foreign language e-books
- UN Declaration of Human Rights - different languages
- Statistical debunking of the 'Bible code'
- One time pads
- Analysis of criminal codes and ciphers
- Code breaking in law enforcement: A 400-year history
- Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems
- Copy of The Advertiser from 1948 for Letter Frequency Analysis: English