Structuring the Taman Shud code cracking process
Here are some preliminary guidelines on how to structure the investigation into the code. The idea of the project is not to crack the code as such. The idea is to eliminate what the code is not in a structured and well-documented way. Of course, if you crack the whole thing open that will be the cherry on the cake, but we are marking you on the engineering methodology you put into the cake and not the cherry.
In any engineering problem, what we always do is make simplifying assumptions to make the problem tractable and then add extra complexity later (if needed). The same trick holds for analyzing this code. Break it up into a number of tasks that are based on reasonable simplifying assumptions. Investigate these simplified cases thoroughly, and then the results will suggest where to go next. So long as you do each step methodically, that is what matters. We don't care if you don't crack the code. If you run out of time to add on further complexities, you can suggest what needs to be done in the future in your conclusions. So the conclusion of your final report should present a structured plan on what a future team needs to do.
Contents
- 1 Hypothesis 1: the code is gibberish
- 2 Hypothesis 2: the code is in English, but the letters have been substituted
- 3 Hypothesis 3: the code is in English and the letters are as they are supposed to be
- 4 Hypothesis 4: the code is in a foreign language and the letters are as they are supposed to be
- 5 Hypothesis 5: the code is in a foreign language and the letters have been substituted
- 6 Resources
- 7 See also
Hypothesis 1: the code is gibberish
- Assume the code is in fact a meaningless string of letters. This assumes the Somerton man was normally an English speaker, but was drunk or so badly poisoned with hallucinogen that he was writing a delusional string of letters.
- Think of ways to test this hypothesis.
- Get 10 native English speakers to write a string of 50 random letters before and after a fixed number of beers. They must try to be random using only their mind and not use computers or external devices. Better to choose friends who study courses where they don't teach you what randomness is. Otherwise you may get the odd friend who tries to be too smart and doesn't go along with the game. Arts students will be perfect victims.
- Then think of ways you can statistically compare the Somerton code to these gibberish sequences. Plot letter frequencies of gibberish with error bars. Make counts of letter pair frequencies. Are there letters of the alphabet people consistently missed out and how does this compare to the code? How do the most frequent letters compare?
- Calculate the average information in bits per symbol, H(x). You do this by summing H(x_i) over all the distinct symbols in the code, where x_i is the i-th symbol. So let's say x_1 is the symbol 'R': you count up how many times R appears in the code and divide it by the total number of letters to give the probability P(x_1) of there being an 'R'. Then by definition, H(x_i) = -P(x_i)log_2(P(x_i)). The minus sign keeps each term non-negative, because the log of a probability is never positive. Do this for all the symbols and sum up all the H's. This is what is called the average information. Do this for both the gibberish and the code.
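The average information calculation described above can be sketched in a few lines of Python. This is only a minimal sketch: the function names `letter_counts` and `average_information` are made up for illustration, and it assumes that only the letters A to Z count as symbols (spaces, digits and punctuation are ignored).

```python
import math
from collections import Counter


def letter_counts(text):
    """Count the letters A-Z only, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def average_information(text):
    """Return H(x) = -sum(P(x_i) * log2(P(x_i))) in bits per symbol."""
    counts = letter_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no letters at all, so nothing to measure
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Placeholder strings only: substitute the code and your gibberish samples.
    for sample in ("ABAB", "ABCD", "AAAA"):
        print(sample, average_information(sample))
```

Run the same function over the code and over each gibberish string, so that both are measured in exactly the same way. Bear in mind that with only about 50 letters the estimate will be noisy, which is another reason to collect several gibberish samples and look at the spread.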
Hypothesis 2: the code is in English, but the letters have been substituted
- Make a big long list of coding techniques. Try to eliminate some based on the date they were invented, or because they would require more computing power than was available in 1948.
- It is not reasonable to assume he did the code in his head. A computer of the time could have been used. He could have been hurriedly copying down the code that someone else prepared. Maybe he ripped the code off someone else when they weren't looking.
- Then take your reduced list of coding techniques, and code up an e-book written in English. You should sub-sample chunks of 50 letters 10 times, and generate error bars. Then look at the letter frequencies before and after coding. Compare this with the Somerton code and you should be able to eliminate some.
- Then you can repeat the above using different statistical features such as letter pairs. Also try calculating the probability of a symbol x_i appearing after a symbol x_j. This is called the transition probability. Comparing transition probabilities can give more clues.
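The sub-sampling and transition probability steps above might look something like the following sketch. All the names here are hypothetical, and the Caesar shift is only a stand-in for whichever coding techniques survive on your reduced list. The error bars are taken to be one standard deviation over the sampled chunks, which is an assumption about what you want to plot.

```python
import random
import statistics
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def letters_only(text):
    """Upper-case the text and keep only the letters A-Z."""
    return "".join(ch for ch in text.upper() if ch in ALPHABET)


def caesar(text, shift=3):
    """Stand-in cipher (an assumed example): shift every letter by `shift`."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + shift) % 26]
                   for ch in letters_only(text))


def sampled_frequencies(text, chunk_len=50, n_chunks=10, seed=0):
    """Map each letter to (mean, standard deviation) of its relative
    frequency over `n_chunks` randomly placed chunks of `chunk_len` letters."""
    letters = letters_only(text)
    if len(letters) < chunk_len:
        raise ValueError("text is shorter than one chunk")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    per_chunk = []
    for _ in range(n_chunks):
        start = rng.randrange(len(letters) - chunk_len + 1)
        counts = Counter(letters[start:start + chunk_len])
        per_chunk.append({c: counts[c] / chunk_len for c in ALPHABET})
    result = {}
    for c in ALPHABET:
        values = [freqs[c] for freqs in per_chunk]
        result[c] = (statistics.mean(values), statistics.stdev(values))
    return result


def transition_probabilities(text):
    """Estimate P(next letter | current letter) from adjacent letter pairs."""
    letters = letters_only(text)
    pair_counts = defaultdict(Counter)
    for current, following in zip(letters, letters[1:]):
        pair_counts[current][following] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in pair_counts.items()}
```

For example, `sampled_frequencies(book)` and `sampled_frequencies(caesar(book))` give the "before and after coding" letter frequencies for an e-book loaded into the string `book`. Record the seed you used in your report so that someone else can reproduce the same chunks.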
Hypothesis 3: the code is in English and the letters are as they are supposed to be
- If the letters are as they are supposed to be, you can easily eliminate the idea that it is one big English anagram, as there is only one 'e' in the whole code. So that is highly unlikely. You can demonstrate this by slicing up an English e-book into 50-letter blocks. Then count up how many 50-letter blocks in the whole book have only one 'e'. If it is zero or a small number, then QED. If it is a number bigger than expected, then see Derek later on how to explore the anagram idea more deeply.
- Do letter frequency counts of a whole e-book and compare initial letter versus all the letters. Then test to see if it is likely the Somerton code is statistically similar to initial letters.
- Then we can repeat all the above tests again, but instead of an e-book try: (1) a big e-list of girls' names, (2) a big list of place names, (3) a big list of Australian train station names, (4) a list of seaports, (5) a list of chemical names, (6) a list of scientific terms to do with making nuclear bombs, and (7) any other fun thing you can think of.
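The block-counting and initial-letter tests above can be sketched as follows. The function names are hypothetical, and the sketch makes two simplifying assumptions: the blocks do not overlap, and a word is any unbroken run of the letters A to Z (so "don't" is split into two words).

```python
import re
from collections import Counter


def letters_only(text):
    """Upper-case the text and keep only the letters A-Z."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def blocks_with_e_count(text, block_len=50, target=1):
    """Return (matching, total): how many non-overlapping blocks of
    `block_len` letters contain exactly `target` occurrences of 'E'."""
    letters = letters_only(text)
    blocks = [letters[i:i + block_len]
              for i in range(0, len(letters) - block_len + 1, block_len)]
    matching = sum(1 for block in blocks if block.count("E") == target)
    return matching, len(blocks)


def relative_frequencies(letters):
    """Turn a sequence of letters into a {letter: relative frequency} dict."""
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def initial_and_all_frequencies(text):
    """Return (initial-letter frequencies, all-letter frequencies)."""
    words = re.findall(r"[A-Z]+", text.upper())
    initials = [word[0] for word in words]
    return relative_frequencies(initials), relative_frequencies("".join(words))
```

The same two functions work unchanged on a list of names or place names joined into one string, which makes it easy to repeat the tests on the other word lists.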
Hypothesis 4: the code is in a foreign language and the letters are as they are supposed to be
- Select some foreign language e-books. Do letter frequency counts, pair counts, transition probabilities etc. Compare. Do initial letters and letters from full words. Compare.
- Do the compression test. Append the code to a standard text in 10 foreign languages. (Ask Matt where to get that). Compress each one. Rank according to compression ratio (i.e. plot compression ratio on the y-axis and rank on the x-axis).
- Another trick is to put the code phrase by phrase into an online anagram server. Then count up how many anagrams you get for each language. The one with the most wins. This may be an indicator.
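The compression test above could be sketched with Python's built-in `zlib` module. The choice of compressor is an assumption (any general-purpose compressor would do), the function names are hypothetical, and the ratio here is defined as compressed size divided by uncompressed size, so a smaller number means the combined text compressed better.

```python
import zlib


def compression_ratio(reference_text, code_text):
    """Compressed size / original size for the reference text with the
    code appended (smaller means it compressed better)."""
    data = (reference_text + code_text).encode("utf-8")
    if not data:
        raise ValueError("nothing to compress")
    return len(zlib.compress(data, 9)) / len(data)


def rank_languages(reference_texts, code_text):
    """Return (language, ratio) pairs sorted from smallest ratio to largest.

    `reference_texts` maps a language name to its standard text.
    """
    ratios = {language: compression_ratio(text, code_text)
              for language, text in reference_texts.items()}
    return sorted(ratios.items(), key=lambda item: item[1])
```

The position of each pair in the returned list is its rank, so the list can be plotted directly as ratio against rank. The ratio also depends on how long and how repetitive each reference text is, so it is probably safest to use reference texts of the same length, ideally the same text in translation.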
Hypothesis 5: the code is in a foreign language and the letters have been substituted
- This is the hardest one and should be left to the end to see if something easier turns up. I don't expect you to worry about this case for the project.
- At the end of the project the least you can do is suggest some ideas on how future students might approach this. But save your judgment until you've gained some experience with the other techniques first.
Resources
See also
- Cipher Cracking 2009
- Final report 2009: Who killed the Somerton man?
- Timeline of the Taman Shud Case
- List of people connected to the Taman Shud Case
- List of facts on the Taman Shud Case that are often misreported