Final Report 2010: Difference between revisions

Revision as of 19:03, 18 October 2010

Due Date

Executive Summary

michael

Aims and Objectives

Project Background

Verification of Past Results

michael

Random Letters

michael

Verification of Past Algorithms

Methodology

The Text Matching Algorithm

The Web Crawler

What it does

The web crawling portion of the project accesses text on the internet and passes it directly to the pattern matching algorithm. This gives reasonably fast access to large quantities of raw text that can be processed thoroughly and used for statistical analysis.
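As an illustration of the "raw text" step, the sketch below strips the markup from a downloaded HTML page so that only the readable text would reach the pattern matcher. It is a minimal example using Python's standard library, not the project's actual code; the names TextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with all runs of whitespace collapsed to one space."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())
```

For example, html_to_text("<p>Hello <b>world</b></p>") gives "Hello world".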

How it was implemented

Several different approaches were tried when implementing the web crawler in order to find a method that was both effective and simple to use. After experimenting with available open source crawlers such as Arachnid and JSpider, we turned our attention to a simpler solution that could be operated directly from the command prompt. Ideally, such a program would let us input a website or list of websites of interest, collect the relevant data, and keep some control over the pattern matching methods used to produce useful results. After much searching and experimenting, an open source crawler called HTTrack was found. HTTrack was chosen for the following reasons:

It is free.
It is simple to use. The standard package includes both a GUI version and a command line version, so the program could be learned visually and the same settings then translated into coded commands.
It allows full website mirroring. The text from the websites is stored on the computer and can be used offline and for multiple searches without accessing the internet every time.
It has a large number of customisation options. These gave control over search depth (how deep into a website the crawler goes), whether external websites are followed or only the starting site is crawled (which avoids jumping to websites that contain irrelevant data), and search criteria (only text is downloaded; images, movies and other unwanted files that would waste downloads are skipped).
It abides by the Robots Exclusion Protocol (access rules set by the owner of each website).
It has a command prompt option. This allows for a user friendly approach and integration with the pattern matching algorithm.
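The customisation options listed above map onto HTTrack command line switches. The sketch below builds one possible command as an argument list. The flag meanings are as given in HTTrack's command line help (-O output path, -rN mirror depth, -a stay on the same address, -s2 always follow robots.txt, -*.ext exclusion filters) and should be checked against the installed version. The function name, the default depth and the list of excluded extensions are assumptions for illustration, not the settings used in the project.

```python
def build_httrack_command(urls, output_dir, depth=3):
    """Return an argv list for a same-site mirror that skips images and movies.

    The depth value and the excluded extensions are illustrative assumptions.
    """
    excluded = ["jpg", "jpeg", "png", "gif", "avi", "mpg", "mov"]
    return [
        "httrack",
        *urls,                 # one or more start URLs
        "-O", output_dir,      # folder where the mirror is stored
        f"-r{depth}",          # how many links deep to follow
        "-a",                  # stay on the same address (no external sites)
        "-s2",                 # always obey robots.txt
        *[f"-*.{ext}" for ext in excluded],  # filters: do not download these
    ]
```

The resulting list can be passed to subprocess.run(command, check=True) once HTTrack is installed and on the PATH.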

To keep the whole project user friendly, a batch file was created that carries out the following process:

Takes in a URL or list of URLs pre-saved in a text file at a known location on the computer.
Prompts the user to enter a destination on the computer to store the data retrieved from the website.
Runs HTTrack and performs a predetermined search on the provided URL(s).
Once the website mirroring is complete, moves to the predetermined location containing the pattern matching code.
Compiles and runs the pattern matching code.
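The batch file itself is not reproduced in this report. As a rough equivalent, the steps above could be sketched in Python as follows. The report does not say which language the pattern matching code is written in, so the javac and java commands and the class name PatternMatcher are purely hypothetical placeholders, and only HTTrack's -O switch is used here.

```python
import subprocess
from pathlib import Path


def read_urls(url_file):
    """Read one URL per line from the text file, ignoring blank lines."""
    lines = Path(url_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def run_pipeline(url_file, destination, matcher_dir, run=subprocess.run):
    """Mirror the listed sites, then compile and run the pattern matcher.

    `run` is injectable so the sequence can be checked without HTTrack installed.
    """
    urls = read_urls(url_file)
    steps = [
        # Mirror the site(s) into the folder chosen by the user.
        (["httrack", *urls, "-O", destination], None),
        # Hypothetical compile and run steps, executed in the matcher folder.
        (["javac", "PatternMatcher.java"], matcher_dir),
        (["java", "PatternMatcher", destination], matcher_dir),
    ]
    for argv, cwd in steps:
        run(argv, cwd=cwd, check=True)  # stop at the first failing step
    return [argv for argv, _ in steps]
```

In interactive use, the destination would come from a prompt, for example run_pipeline("urls.txt", input("Folder for the mirrored site: "), "matcher"), where the file and folder names are again placeholders.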

Results

Direct Initialism

Pattern Initialism

Pattern Selection

The wildcard pattern matching algorithm was designed to handle all two-symbol combinations for initialism patterns of length 3 and 4. To identify the patterns of interest, the mystery code was first organised letter by letter, as shown in the following table.

Apparent patterns in the Somerton Man Code
M	R	G	O	A	B	A	B	D
M	T	B	I	M	P	A	N	E	T	P
M	L	I	A	B	O	A	I	A	Q	C
I	T	T	M	T	S	A	M	S	T	G	A	B

The highlighted letters are all, in some form, part of either a 3 or 4 symbol pattern. Horizontal and vertical patterns were used and analysed both forwards and backwards. By inspection, the code appears to be written horizontally rather than vertically; however, it was unclear what the results would reveal, so both directions were assumed possible.
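The four reading directions can be generated mechanically from the table. The sketch below transcribes the rows of the table above and returns the letter sequences read horizontally and vertically, forwards and backwards. Because the rows have different lengths, it assumes they are left-aligned and that a column simply skips rows that are too short; this alignment is an assumption and not something stated in the report.

```python
# Rows of the code as laid out in the table above.
CODE_ROWS = [
    "MRGOABABD",
    "MTBIMPANETP",
    "MLIABOAIAQC",
    "ITTMTSAMSTGAB",
]


def reading_orders(rows):
    """Return the letter sequences for each of the four reading directions."""
    width = max(len(row) for row in rows)
    # Column i collects the i-th letter of every row that is long enough.
    columns = [
        "".join(row[i] for row in rows if i < len(row))
        for i in range(width)
    ]
    return {
        "horizontal_forward": list(rows),
        "horizontal_backward": [row[::-1] for row in rows],
        "vertical_forward": columns,
        "vertical_backward": [col[::-1] for col in columns],
    }
```

Each sequence returned here can then be searched for 3 and 4 symbol patterns in the same way as a row read left to right.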

A full breakdown of the patterns of interest and their associated combination can be seen in the following table.

**An example table**
First header	Second header
upper left		right side
lower left	lower middle	right side
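One way to read "two-symbol combinations of length 3 and 4" is as letter shapes such as ABA or AABA, where A and B stand for any two different letters. Under that assumption, which is an interpretation rather than a description of the project's own algorithm, the patterns can be enumerated and located in a line of the code as sketched below. All function names are made up for this example.

```python
from itertools import product


def canonical(letters):
    """Relabel letters by order of first appearance, e.g. 'TTMT' -> 'AABA'."""
    mapping = {}
    out = []
    for ch in letters:
        if ch not in mapping:
            mapping[ch] = chr(ord("A") + len(mapping))
        out.append(mapping[ch])
    return "".join(out)


def two_symbol_patterns(length):
    """All shapes of the given length that use exactly two distinct symbols."""
    patterns = []
    for tail in product("AB", repeat=length - 1):
        candidate = "A" + "".join(tail)
        if "B" in candidate:  # drop the single-symbol shape 'AAA...'
            patterns.append(candidate)
    return patterns


def find_pattern(pattern, letters):
    """Start indices of every window in `letters` whose shape equals `pattern`."""
    n = len(pattern)
    return [
        i
        for i in range(len(letters) - n + 1)
        if canonical(letters[i:i + n]) == pattern
    ]
```

With this reading there are 3 shapes of length 3 and 7 of length 4. On the first line of the code, find_pattern("ABAB", "MRGOABABD") returns [4], the position of the letters ABAB.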

Each of the pattern combinations found in the mystery code was then used to search through different types of text so that any differences between the initial letters used in the different text types would be clear. The results of these tests would then hopefully suggest the most likely origin of the mystery code, if it is in fact an initialism.

The types of texts used were:

Poems (a selection of long and short poems from a wide range of poets and archives)
Novels
Science texts (a selection of textbooks - chemistry, maths and physics)
Shakespeare (entire collection)
The Bible (separately tested the King James and Revised Standard Versions)
Rubaiyat (book of poems associated with the murder itself)
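To compare text types of very different sizes, the number of matches needs to be normalised. The sketch below reduces a text to the initial letters of its words and reports how often a given letter shape occurs per 1000 positions. It is an illustrative approach under the assumption that patterns are letter shapes such as AAB (A and B being any two different letters); the function names and the per-1000 normalisation are choices made for this example, not taken from the project.

```python
import re


def initials(text):
    """Upper-case first letter of every word, joined into one string."""
    words = re.findall(r"[A-Za-z][A-Za-z']*", text)
    return "".join(word[0].upper() for word in words)


def shape(letters):
    """Relabel letters by order of first appearance, e.g. 'TTMT' -> 'AABA'."""
    mapping = {}
    return "".join(
        mapping.setdefault(ch, chr(ord("A") + len(mapping))) for ch in letters
    )


def pattern_rate(pattern, text):
    """Matches of `pattern` per 1000 windows of the text's initial letters."""
    seq = initials(text)
    n = len(pattern)
    windows = len(seq) - n + 1
    if windows <= 0:
        return 0.0  # text too short to contain the pattern
    hits = sum(1 for i in range(windows) if shape(seq[i:i + n]) == pattern)
    return 1000.0 * hits / windows
```

Given a dictionary mapping each text type to its full text, the rates can be collected with {name: pattern_rate("ABAB", text) for name, text in corpora.items()} and compared side by side.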

Contents

Due Date

Executive Summary

Aims and Objectives

Project Background

Verification of Past Results

Random Letters

Verification of Past Algorithms

Methodology

The Text Matching Algorithm

The Web Crawler

What it does

How it was implemented

Results

Direct Initialism

Pattern Initialism

Pattern Selection

Discussion

Further Research

Project Management

Summary and Conclusions

Appendix

References

See Also
