Final Report 2010

From Derek
Revision as of 23:31, 18 October 2010 by A1161917 (Talk | contribs)

Jump to: navigation, search

Due Date

This report is due on the 21st of October, 2010 (Thursday, Week 11).

Executive Summary

michael

Aims and Objectives

Project Background

      • ROUGH STILL NEEDS ALOT OF WORK - MICHAEL****


At around 6.30 am on December 1st 1948 a dead body was discovered at Somerton Beach, here in South Australia, resting against a rock opposite a home for crippled children. The post mortem revealed that the man’s organs were too heavily congested for the cause of death to have been natural. 46 years later, in 1994, forensic science was used to determine that the man had died from digitalis which, at the time of the man’s death, was only accessible with a prescription.


The man was found with several possessions including:

  • Cigarettes
  • Matches
  • A metal comb
  • Chewing gum
  • Railway ticket to Henley Beach
  • A bus ticket and,
  • A tram ticket


By far the most intriguing of his possessions however is the small piece of paper showing the phrase “Tamam Shud”. This piece of paper was identified to be from a book of poems called “The Rubiayat” by Omar Khayyam, a famous Persian poet.


After an announcement was made by the police the copy of The Rubiayat to which the piece of paper belonged was produced by a local person who said he had found the book in the back seat of his unlocked car on the 30th of November.


The book contained had two things pencilled into it:

  • A phone number that lead police to a female named Jestyn. Jestyn was a nurse that denied all knowledge of the dead man.
  • A short code of what appeared to be random or encrypted letters.


This code is yet to be solved and provides the basis for our research project.

Verification of Past Results

michael

Random Letters

michael

Verification of Past Algorithms

Methodology

The Text Matching Algorithm

The Web Crawler

What it does

The basic function of the web crawling portion of the project is to access text on the internet and pass it directly to the pattern matching algorithm. This allows for a reasonably fast access method to large quantities of raw text that can be processed thoroughly and used for statistical analysis.

How it was implemented

Several different approaches were used to implement the web crawler in order to find method that was both effective and simple to use. After experimenting with open source crawlers available such as Arachnid and Jspider we turned our attention to searching for a simpler solution that could be operated directly from the command prompt. Such a program would allow us to hopefully input a website or list of websites of interest, collect relevant data and then have some control over the pattern matching methods that would be used to produce useful results. After much searching and experimenting I came across an open source crawler called HTTrack. HTTrack was used for the following reasons:

  • It is free
  • It is simple to use. A GUI version and command line version come with the standard package which allowed for an easy visual experience to become familiar with the program that was easily translated to coded commands.
  • It allows full website mirroring. This means that the text from the websites is stored on the computer and can be used both offline and for multiple searches without needing to access and search the internet every time.
  • It has a huge amount of customisation options. This allowed for control over such things as search depth (how deep into a website), accessing external websites or just one (avoids jumping to websites that contain irrelevant data), search criteria (only text is downloaded, no images movies or unwanted files that are of no use and waste downloads)
  • It abides the Robots Exclusion Protocol (individual access rights that are customised by the owner of each website)
  • It has a command prompt option. This allows for a user friendly approach and integration with the pattern matching algorithm.



To keep the whole project user friendly, a batch file was created that follows the following process:

  1. Takes in a URL or list of URLs that are pre-saved in a text file at a known location on the computer.
  2. Prompts the user to enter a destination on the computer to store the data retrieved from the website.
  3. Accesses HTTrack and perform a predetermined search on the provided URL(s).
  4. Once the website mirroring is complete the program moves to the predetermined location containing the pattern matching code
  5. Compiles and runs the pattern matching code

Results

Direct Initialism

Pattern Initialism

The Pattern matching algorithm was utilised in a very similar manner to the Initialism algorithm discussed above. Patterns that contain two symbols and appear in the mystery code were highlighted and then used as a basis for a wildcard pattern search on the different types of English texts.

Pattern Selection

As shown in the following table the letters of the mystery code were spread out in order to single out the patterns that actually occur. The different pattern combinations can be seen highlighted in yellow and are broken down further in the next table.

As shown, patterns that occur both horizontally and vertically were used for the testing. It has since been decided that the vertical patterns are unlikely to be of any interest as a simple inspection of the actual mystery code shows very little structure or consistency in a vertical manner compared to the horizontal.


Apparent patterns in the Somerton Man Code
M R G O A B A B D
M T B I M P A N E T P
M L I A B O A I A Q C
I T T M T S A M S T G A B

The highlighted letters are all, in some form, part of a two symbol pattern of length 3 or 4. As shown below, patterns occurring both forwards (left to right) and backwards (right to left) were used for the testing. This assumes that there is a possibility that the code was written backwards.

From these patterns the individual wildcard code is derived, giving the patterns of interest that were used on the different English texts. For example where ABAB occurs in the mystery code, a pattern of @#@# was searched to reveal any combination of letters that match the ‘pattern’ of ABAB.

A full breakdown of the patterns of interest and their associated combination can be seen in the following table.


Interesting Pattern Symbol Breakdown
Interesting 4 Symbol Sequences Interesting 3 Symbol Sequences
Horizontal
ABAB - @#@# ABA - @#@
TTMT - @@#@ AIA - @#@
TMTT - @#@@ ITT - @##
TTI - @@#
TTM - @@#
MTT - @##
TMT - @#@
BAB - @#@
Vertical
MMMI - @@@# TLT - @#@
IMMM - @### MMM - @@@
AAAA - @@@@ AAA = @@@
TQT - @#@
MMI - @@#
IMM - @##
Patterns of Interest
@#@# @#@
@@#@ @##
@#@@ @@#
@@@# @@@
@###
@@@@


Each of the pattern combinations found in the mystery code was then used to search through different types of text so that any differences between the initial letters used in the different text types would be clear. The result of these tests would then hopefully suggest the most likely origin of the mystery code if it is in fact initialism.


The types of texts used were:

  • Poems (a variety of long and short poems from a large variety of poets/archives)
  • Novels
  • Science texts (a selection of textbooks - chemistry, maths and physics)
  • Shakespeare (entire collection)
  • The Bible (separately tested the King James and Revised Standard Versions)
  • Rubiayat (book of poems associated with the murder itself)

Discussion

The results, for each pattern possibility, have been tabulated and compiled in the ****LINK TO RESULTS TABLES HERE**** The first thing that is apparent from the results is the similarities between the different types of text. This suggests that the general occurrence of the order of initial letters in English is independent of the source. That is, it doesn’t really matter where the raw text is from as it is all English and will produce very similar results.


The second interesting point is the highlighted results that relate to the letters taken directly from the mystery code. Having these results occur in the top list of results implies that it is still technically possible that the code is in fact initialism without substitution. However, apart from the patterns of one letter (@@@@ and @@@), there is a very large spread of results with the top result usually only about 5% of the total amount of matches found. The top 26 matches are shown for each search however in the full list some patterns returned results with up to 500 different match variations. This suggests that it is highly likely that if the mystery code is in fact initialism, there is a good chance that it is also using some type of substitution method. That is, each letter is actually representing a different letter.


The very low amounts of matches found that are a direct link from the book of Rubiayat and the mystery code is also important. This book, linked so closely to the murder, would be the most obvious source of the code. It is clear from the results that if the code has been made as an initialism from the Rubiayat then it is definitely also incorporating substitution.

Further Research

Project Management

Michael

Work Allocation and Structure

Schedule

Risk Management

Budget

Summary and Conclusions

Appendix

References

See Also