Final Report 2010
Contents
Due Date
This report is due on the 21st of October, 2010 (Thursday, Week 11).
Executive Summary
michael
Aims and Objectives
Project Background
- ROUGH STILL NEEDS ALOT OF WORK - MICHAEL****
At around 6.30 am on December 1st 1948 a dead body was discovered at Somerton Beach, here in South Australia, resting against a rock opposite a home for crippled children. The post mortem revealed that the man’s organs were too heavily congested for the cause of death to have been natural. 46 years later, in 1994, forensic science was used to determine that the man had died from digitalis which, at the time of the man’s death, was only accessible with a prescription.
The man was found with several possessions including:
- Cigarettes
- Matches
- A metal comb
- Chewing gum
- Railway ticket to Henley Beach
- A bus ticket and,
- A tram ticket
By far the most intriguing of his possessions however is the small piece of paper showing the phrase “Tamam Shud”.
This piece of paper was identified to be from a book of poems called “The Rubiayat” by Omar Khayyam, a famous Persian poet.
After an announcement was made by the police the copy of The Rubiayat to which the piece of paper belonged was produced by a local person who said he had found the book in the back seat of his unlocked car on the 30th of November.
The book contained had two things pencilled into it:
- A phone number that lead police to a female named Jestyn. Jestyn was a nurse that denied all knowledge of the dead man.
- A short code of what appeared to be random or encrypted letters.
This code is yet to be solved and provides the basis for our research project.
Verification of Past Results
michael
Random Letters
michael
Verification of Past Algorithms
Methodology
The Text Matching Algorithm
Function
The text parser is a piece of code written in java that parses through a text or html file and attempts to find specific word and pattern segments within that file. It is divided into several different parts and provides a means in which to parse through a large file directory in a relatively fast manner.
Implementation
First, there’s a main method that takes the user’s input of which mode to use, the directory to look through, and the pattern, word or initialism to search for. Once these inputs have been taken, the parser then calls a method to find all of the files in the inputted directory as well as recursively calling itself when it finds a folder in the given directory.
Once the file list has been determined the code then calls a different parsing method depending on which mode was chosen in the main method for each file in the list. These are FindExact, which searches for an exact word in a file; FindInitialism which searches for the initial letters of a word; and FindPattern which searches for all initialisms that match a given pattern.
It is important to understand the difference between FindInitialism and FindPattern. FindInitialism takes the initialism as an input (ie “abab”) and then parses through a text and searches for it. On the other hand, FindPattern takes a pattern as an input (ie “#@#@” or “##@#”) and parsers through a text to get every initialism that matches that pattern (so for the case of “#@#@”, FindPattern will find “abab”, “acac”, “xzxz”, etc).
The parsing code works by taking in segments of the text file as a string one line at a time. The string is then observed character by character to determine if a match or a pattern occurs. However for the initialism and pattern search, we are only interested in the initial letters of a word, so specific conditional statements are needed to determine if a character meets a certain criteria before observing it for comparison with the initialism or pattern we are searching for. Generally, any whitespace or punctuation such as periods or commas (but not apostrophes) is considered the beginnings and ends of words, so the subsequent characters are the ones we are interested in.
If the parser does find a match, it produces an output into the command prompt as well as a results text file stating the line and character number of the line where it was initially found. As well, depending on the mode, it prints the pattern or the actual initialism.
After it has gone through all of the files in the directory, a summary of the results is printed. For FindExact, it prints the number of times it was found and in how many files. For FindInitialism, as well as the amount of times it was found, it also prints the expected proportion to find the initialism in the entire set of files as well as the proportion that actually occurred. From this we can determine whether a set of files contained an initialism more or less than expected. Finally, FindPattern prints out an array of which initialisms occurred as well as how often.
At the moment, there are some limitations to the code. Firstly, the code can only parse through text, html and other similar files. It cannot handle complex files such as Microsoft Office documents or PDF files and although this limits the amount of results we can obtain, there are many places that offer eBooks and such in text file format so it is not really a problem. Also, due to how the code was written, FindPattern is currently limited to finding initialisms that span only a single line of text. FindInitialism does not suffer from this limitation; however there are some issues in printing the initialism if it spans 3 or more lines. That is, the initialism is properly detected but not printed properly, so it is only really a problem for sentence accuracy and doesn’t really affect the frequency calculations for the results.
The Web Crawler
What it does
The basic function of the web crawling portion of the project is to access text on the internet and pass it directly to the pattern matching algorithm. This allows for a reasonably fast access method to large quantities of raw text that can be processed thoroughly and used for statistical analysis.
How it was implemented
Several different approaches were used to implement the web crawler in order to find method that was both effective and simple to use. After experimenting with open source crawlers available such as Arachnid and Jspider we turned our attention to searching for a simpler solution that could be operated directly from the command prompt. Such a program would allow us to hopefully input a website or list of websites of interest, collect relevant data and then have some control over the pattern matching methods that would be used to produce useful results. After much searching and experimenting I came across an open source crawler called HTTrack. HTTrack was used for the following reasons:
- It is free
- It is simple to use. A GUI version and command line version come with the standard package which allowed for an easy visual experience to become familiar with the program that was easily translated to coded commands.
- It allows full website mirroring. This means that the text from the websites is stored on the computer and can be used both offline and for multiple searches without needing to access and search the internet every time.
- It has a huge amount of customisation options. This allowed for control over such things as search depth (how deep into a website), accessing external websites or just one (avoids jumping to websites that contain irrelevant data), search criteria (only text is downloaded, no images movies or unwanted files that are of no use and waste downloads)
- It abides the Robots Exclusion Protocol (individual access rights that are customised by the owner of each website)
- It has a command prompt option. This allows for a user friendly approach and integration with the pattern matching algorithm.
To keep the whole project user friendly, a batch file was created that follows the following process:
- Takes in a URL or list of URLs that are pre-saved in a text file at a known location on the computer.
- Prompts the user to enter a destination on the computer to store the data retrieved from the website.
- Accesses HTTrack and perform a predetermined search on the provided URL(s).
- Once the website mirroring is complete the program moves to the predetermined location containing the pattern matching code
- Compiles and runs the pattern matching code
Results
Exact Initialism
The following is a table of relative frequencies of the first letters of a word in the English language. This was taken from a wiki page here. We chose random 3 and 4 character long segments of the Somerton Man’s code and using this table, we obtained the expected probability of the segments occurring in regular English text.
Several different types of text were divided into categories and then tested in groups. The categories include The Bible, The Rubiayat, Novels, Science Texts, Long Poems, and works by William Shakespeare. The complete list of texts used can be found in the appendix. The expected probability was then compared to the actual proportion that occurred to determine what text type the mystery code could be from using the following formula:
P(actual) = Total Number of Occurrences / (Total Words in Text – n + 1)
Where n is the size of the segment we are looking for.
Using the results, several tables and graphs were generated to simplify viewing and can be found here.
Our results indicate that most of the code segments occur around the expected frequency for each of the text types, so we can’t really determine if the code belongs to any of the tested text types. However, from inspection of the results, the code is more likely to be divided into 3 word long fragments than 4 word long initialisms. Additionally, we also considered that the code could be backwards. Typically, the results favoured forward segments over the backwards segments, although there were occasions when the backwards segment did appear more often than the forwards segment.
Of particular note was that neither of the 4 or 3 character long code segments appeared much at all in the Rubiayat. This seems quite highly suspect, and there is a possibility that someone intentionally created the mystery code in such a way that this would happen because is it very unlikely that almost none of the code segments would be found in the Rubiayat. More testing is probably needed to prove this by testing more poems, and possibly truncating other poems to the same size as the Rubiayat.
So in conclusion, these exact initialism results indicate that the Somerton man’s code could be a substitution cipher of initialisms found in the Rubiayat, based on the fact that it is incredibly unlikely that none of the tested code segments were found in the Rubiayat. To further test this theory, we began testing for patterns of initialisms to see if there was any correlation between texts and attempt to narrow down what the code has been substituted from.
Pattern Initialism
The Pattern matching algorithm was utilised in a very similar manner to the Initialism algorithm discussed above. Patterns that contain two symbols and appear in the mystery code were highlighted and then used as a basis for a wildcard pattern search on the different types of English texts.
Pattern Selection
As shown in the following table the letters of the mystery code were spread out in order to single out the patterns that actually occur. The different pattern combinations can be seen highlighted in yellow and are broken down further in the next table.
As shown, patterns that occur both horizontally and vertically were used for the testing. It has since been decided that the vertical patterns are unlikely to be of any interest as a simple inspection of the actual mystery code shows very little structure or consistency in a vertical manner compared to the horizontal.
M | R | G | O | A | B | A | B | D | ||||
M | T | B | I | M | P | A | N | E | T | P | ||
M | L | I | A | B | O | A | I | A | Q | C | ||
I | T | T | M | T | S | A | M | S | T | G | A | B |
The highlighted letters are all, in some form, part of a two symbol pattern of length 3 or 4. As shown below, patterns occurring both forwards (left to right) and backwards (right to left) were used for the testing. This assumes that there is a possibility that the code was written backwards.
From these patterns the individual wildcard code is derived, giving the patterns of interest that were used on the different English texts. For example where ABAB occurs in the mystery code, a pattern of @#@# was searched to reveal any combination of letters that match the ‘pattern’ of ABAB.
A full breakdown of the patterns of interest and their associated combination can be seen in the following table.
Interesting 4 Symbol Sequences | Interesting 3 Symbol Sequences |
---|---|
Horizontal | |
ABAB - @#@# | ABA - @#@ |
TTMT - @@#@ | AIA - @#@ |
TMTT - @#@@ | ITT - @## |
TTI - @@# | |
TTM - @@# | |
MTT - @## | |
TMT - @#@ | |
BAB - @#@ | |
Vertical | |
MMMI - @@@# | TLT - @#@ |
IMMM - @### | MMM - @@@ |
AAAA - @@@@ | AAA = @@@ |
TQT - @#@ | |
MMI - @@# | |
IMM - @## | |
Patterns of Interest | |
@#@# | @#@ |
@@#@ | @## |
@#@@ | @@# |
@@@# | @@@ |
@### | |
@@@@ |
Each of the pattern combinations found in the mystery code was then used to search through different types of text so that any differences between the initial letters used in the different text types would be clear. The result of these tests would then hopefully suggest the most likely origin of the mystery code if it is in fact initialism.
The types of texts used were:
- Poems (a variety of long and short poems from a large variety of poets/archives)
- Novels
- Science texts (a selection of textbooks - chemistry, maths and physics)
- Shakespeare (entire collection)
- The Bible (separately tested the King James and Revised Standard Versions)
- Rubiayat (book of poems associated with the murder itself)
Discussion
The results, for each pattern possibility, have been tabulated and compiled in the ****LINK TO RESULTS TABLES HERE**** The first thing that is apparent from the results is the similarities between the different types of text. This suggests that the general occurrence of the order of initial letters in English is independent of the source. That is, it doesn’t really matter where the raw text is from as it is all English and will produce very similar results.
The second interesting point is the highlighted results that relate to the letters taken directly from the mystery code. Having these results occur in the top list of results implies that it is still technically possible that the code is in fact initialism without substitution. However, apart from the patterns of one letter (@@@@ and @@@), there is a very large spread of results with the top result usually only about 5% of the total amount of matches found. The top 26 matches are shown for each search however in the full list some patterns returned results with up to 500 different match variations. This suggests that it is highly likely that if the mystery code is in fact initialism, there is a good chance that it is also using some type of substitution method. That is, each letter is actually representing a different letter.
The very low amounts of matches found that are a direct link from the book of Rubiayat and the mystery code is also important. This book, linked so closely to the murder, would be the most obvious source of the code. It is clear from the results that if the code has been made as an initialism from the Rubiayat then it is definitely also incorporating substitution.
Further Research
Project Management
Michael