Structuring the Taman Shud code cracking process

From Derek
Revision as of 09:13, 7 April 2009 by Mattjb (Talk | contribs)

Jump to: navigation, search

Here are some preliminary guidelines on how to structure the investigation into the code. The idea of the project is not to crack the code as such. The idea is to eliminate what the code is not in a structured and well-documented way. Of course, if you crack the whole thing open that will be the cherry on the cake, but we are marking you on the engineering methodology you put into the cake and not the cherry.

In any engineering problem, what we always do is make simplified assumptions to make the problem tractable and then add extra complexity later (if needed). The same trick hold for analyzing this code. Break it up into a number of tasks that are based on reasonable simplified assumptions. Investigate these simplified cases thoroughly, and then the results will suggest where to go next. So long as you do each step methodically that is what matters. We don't care if you don't crack the code. If you run out of time to add on further complexities, you can suggest what needs to be done in the future in your conclusions. So the conclusion of your final report should present a structured plan on what a future team needs to do.

Management

Carefully study how the tasks have been logically split up below. Fill in the deliberately missing blanks. Figure out the best sequence to do them in. Figure out how to allocate the tasks between you two so that their can be two streams of complementary work going on in parallel.

Hypothesis 1: the code is gibberish

  • Assume the code is in fact a meaningless string of letters. This assumes the Somerton man was normally an English speaker, but was drunk or so badly poisoned with hallucinogens that he was writing a delusional string of letters.
  • Think of ways to test this hypothesis.
  • Get 10 native English speakers to write a string of 50 random letters before and after a fixed number of beers. They must try to be random only using their mind and not use computers or external devices. Better to chose friends who study courses where they don't teach you what randomness is. Otherwise you may get the odd friend who tries to be too smart and not go with the game. Arts students will be perfect victims.
  • Then think of ways you can statistically compare the Somerton code to these gibberish sequences. Plot letter frequencies of gibberish with error bars. Make counts of letter pair frequencies. Are there letters of the alphabet people consistently missed out and how does this compare to the code? How do the most frequent letters compare?
  • Calculate the average information in bits per symbol, H(x). You do this by summing up all H(x_i) over the code, where x_i is the i-th symbol. So let's say x_1 is the symbol 'R', you count up how many times R appears in the code and divide it by the total letters to give the probability P(x_1) of there being an 'R'. Then by definition, H(x_i) = P(x_i)log_2(P(x_i)). Do this for all the symbols and up all the H's. This is what is called the average information. Do this for both the gibberish and the code.

Hypothesis 2: the code is in English, but the letters have been substituted

  • Make a big long list of coding techniques. Try to eliminate some based on the date they were invented. Or ones that would require more computing power than was available in 1948.
  • It is not reasonable to assume he did the code in his head. A computer of the time could have been used. He could have been hurriedly copying down the code that someone else prepared. Maybe he ripped the code off someone else when they weren't looking.
  • Then take your reduced list of coding techniques, and code up an e-book written in English. You should to sub-sample chunks of 50 letters 10 times, and generate error bars. Then look at the letter frequencies before and after coding. Compare this with the Somerton code and you should be able to eliminate some.
  • The you can repeat the above using different statistical features such a letter pairs. Also try calculating the probability of a symbol x_i appearing after a symbol x_j. This is called the transition probability. Comparing transition probabilities can give more clues.

Hypothesis 3: the code is in English and the letters are as they are supposed to be

  • If the letters are as they are supposed to be, you can easily eliminate the idea that it is one big English anagram as there is only one 'e' in the whole code. So that is highly unlikely. You can demonstrate this by slicing up an English e-book into 50 letter blocks. Then count up how many 50-letter blocks in the whole book have only one 'e'. If it is zero or a small number then qed. If it is a number bigger then expected, then see Derek later on how to explore the anagram idea more deeply.
  • Do letter frequency counts of a whole e-book and compare initial letter versus all the letters. Then test to see if it is likely the Somerton code is statistically similar to initial letters.
  • Then we can repeat all the above tests again, but instead of an e-book try: (1) a big e-list of girl's names, (2) a big list of place names, (3) a big list of Australian train station names, (4) a list of seaports, (5) a list of chemical names, (6) a list of scientific terms to do with making nuclear bombs, and (7) any other fun thing you can think of.

Hypothesis 4: the code is in a foreign language and the letters are as they are supposed to be

  • Select some foreign language e-books. Do letter frequency counts, pair counts, transition probabilities etc. Use the frequency counts to get a simple measure of Shannon entropy. Compare. Do initial letters and letters from full words. Compare.
  • Another trick is to put the code phrase by phrase into an online anagram server. Then count up how many anagrams you get for each language. The one with the most wins. This may be an indicator.

Hypothesis 5: the code is in a foreign language and the letters have been substituted

  • This is the hardest one and should be left to the end to see if something easier turns up. I don't expect you to worry about this case for the project.
  • At the end of the project the least you can do is suggest some ideas on how future students might approach this. But save your judgment until you've gained some experience with the other techniques first.

Hypothesis 6: the structure of the code can tell us something

  • This is Derek's favourite one that he wants you to get stuck in as early as possible.
  • It is very powerful because it assumes nothing about the letters, only their sequence structure. It also is language independent. It is also tolerant to the ambiguous letters that can be treated as wildcards.
  • Write a piece of sofware that looks for regular expressions. For example ABAB is a repeat pair. Look for repeat pairs. MLIABO is a sexet with all unique characters. TMT is a palindrome triplet. Look carefully and find other characteristic structural features. Make a list of them.
  • Get your code you search for these features within your own list of girls' names, names of chemicals, names of train stations, nuclear terms etc. Look for these regular expressions in e-books that specifically talk about the history of the Cold War and details of Operation Vernona.
  • If possible, write the code so it can search the whole www for these regular expressions. Try and find a search engine that supports regular expressions first. Otherwise you may have to write software that turns a regular expression into a set of search engine queries.

Hypothesis 7: Vigenère cipher

  • Try the Kasiski test and the Friedman test to deduce possible key lengths.
  • Once the key length is known, you can then write out the code in rows of that length, and then do a frequency analysis on each of the columns---since each column corresponds to a single letter, then a column is the result of a separate substitution cipher. You can then reuse your software for the substitution cipher to try and decrypt.
  • More details here

Hypothesis 8: One time pad

  • In a one-time
  • Possible keys:
    • Quatrains from the Rubaiyat. You should generate all possible variants of the Fitzgerald translation using the list of differences, and try all possible quatrains. You can rule out those that are shorter than the ciphertext, unless you consider concatenation of successive pairs of quatrains.

Hypothesis 9: external and internal information is useful

  • This is the one where you just use lateral thinking and think hard about the internal information in the code and the external information of the case. How do these things help you?
  • Internal information: What does the presence of a 'Q' tell you? It tells you there wasn't a straight substitution with a rotary phone. The absence of a 'U' to go with the 'Q' means it is highly unlikely the code is English letters without some kind of subsitution. It could be the 'Q' is an initial.
  • External information: Could the name 'Jestin' be in the code itself? Or could the word 'enormoz' be in there? This was the word used by Cold War spies to refer to a nuke. Think of half a dozen significant words, from what you know about the case. Write software to apply a sliding window across the code and assume a substitution with 'Jestin' at each point. 'Jestin' has six unique characters so there will only be a few decrypted outputs. Only six letters will be decrypted, and for the rest just print a "*". Then visually we can easily eliminate the outputs that look like rubbish. Then do this for 'enormoz' and for half a dozen fun words that you come up with.
  • Could 'Jestin' or 'Tamam shud' be the cryptographic keyword? Assume some basic coding schemes that use keywords and write software to check if your guessed keywords work or not.

Resources

See also

Back