Final Report/Thesis 2015: Difference between revisions
| Line 40: | Line 40: | ||
| ===Task 2: N-Gram Search=== | ===Task 2: N-Gram Search=== | ||
| ====Aim==== | ====Aim==== | ||
| The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man Code.  Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section).  Based on  | The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man Code.  Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section).  Based on this, for this task an assumption was made that the code is an initialism.  A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases.  This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section).  Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded.  This was to drastically increase the speed at which the gram combinations on the web could be found.  The search engine was required to output a list of possible gams from the input letter combinations.  An assumption that was to be made in order to complete this task was that the letters in the code, and thus the words in the grams, were order relevant.  Another assumption was that all variants of ambiguous letters in the code were to be included. | ||
| ====Method==== | ====Method==== | ||
Revision as of 20:28, 15 October 2015
Executive Summary
Introduction
Motivation
Previous Studies/Related Work
Aims and Objectives
Significance
Technical Background
P-Value Theorem Explanation
Chi-Squared Test Explanation
Universal Declaration of Human Rights Explanation
Project Gutenberg Explanation
N-Gram Model Explanation
One-Time Pad Explanation
Knowledge Gaps and Technical Challenges
Method - Specific Tasks
Task 1: Statistical Frequency Analysis of Letters
Aim
Method
Results
Evaluation and Justification
Task 2: N-Gram Search
Aim
The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man Code. Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section). Based on this, for this task an assumption was made that the code is an initialism. A search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases. This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section). Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded. This was to drastically increase the speed at which the gram combinations on the web could be found. The search engine was required to output a list of possible gams from the input letter combinations. An assumption that was to be made in order to complete this task was that the letters in the code, and thus the words in the grams, were order relevant. Another assumption was that all variants of ambiguous letters in the code were to be included.
Method
All variants of code considered, included crossed out line in search.
Research upon the available databases was conducted and it was found that the two largest databases that most suited our needs were 'Microsoft Web N-Gram Services' [1] , and Google N-Gram [2]. The maximum number of grams provided by each database was five. Due to this, a maximum of five letter gram groups from the code could be processed at a time.
This was achieved by writing code in python to generate all possible 5-gram initialisms from all code variants and output them into a corresponding text file.  The same was done for 4, 3, 2 and 1-grams and stored in their respective text files.  These were to be used as input files from which the search engine was able to perform searches to query the database.
The use of Microsoft Web N-Gram Services is likely to be chosen due to its larger database [3] [4], better provided documentation and tutorials [5], and lower cost (see Budget Section). The improved web-crawler is to be implemented via a web application using developer tools such as Microsoft Visual Studio or Eclipse [14]. The application can be programmed in native languages including Visual Basic, Visual C#, Visual C++, Visual F# and Jscript [6].
Results
Evaluation and Justification
Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad
Aim
Method
Results
Evaluation and Justification
Task 4: Statistical Frequency of Letters Reanalysis
Aim
Method
Results
Evaluation and Justification
Project Management - Planning and Feasibility
Work Breakdown/Deliverables
The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.
Timeline
The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure X.

The final Gantt Chart after all revisions and updates can be seen in Figure X.

Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis. Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended. Task 3 was also extended so that Jikai was able to complete this task. Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early. The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors. Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.
Task Allocation
The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure X. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the Rubaiyat of Omar Khayyam as a One-time Pad, and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.

Management Strategy
A number of management strategies were adopted for use throughout the project. One of which was frequent face-to-face contact through regular meetings every 2-3 weeks. Another was regular communication between group members via text message and email. Collaboration is another strategy that was useful, if one member required assistance on a particular task, the other was able to step in and help. This was achieved through the use of flexible task allocation. The group was able to make use of collaborative software including Google Drive for working together on project documents, and Git Hub repository for working together on code for software development. The project Wiki page was updated in real time including the weekly progress section to monitor and review work completed by each member every week, as well as plan tasks for the upcoming week. Finally, the use of a Gantt chart was used as a management strategy to incorporate clearly defined task and goals and established a critical path through use of task dependencies.
Budget
Risk Analysis
A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure [34][7]. This can be seen in Figure X. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1. This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis. Another risk that occurred throughout the project was Illness. This was able to be dealt with relatively easily through working from home for a short period of time. The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors. Bugs in code were reduced to the best of our ability through thorough testing and debugging of code. Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood. Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.

Conclusions
Future Work
References
- ↑ COMPLETE THIS REFERENCE: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
- ↑ COMPLETE THIS REFERENCE: http://googleresearch.blogspot.nl/2006/08/all-our-n-gram-are-belong-to-you.html
- ↑ C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
- ↑ T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
- ↑ T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
- ↑ No Author. Visual Studio Languages [online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx
- ↑ No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf
Glossary and Symbols
- ASIO: Australian Security Intelligence Organisation
- ASIS: Australian Secret Intelligence Service
- ASD: Australian Signals Directorate
- P-value theorem: The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis [1].
- Chi-Squared Test:
- Universal Declaration of Human Rights:
- Project Gutenberg:
- N-gram model: The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs [2].
- One-time pad: The one-time pad is a decoder technology which cannot be cracked if the correct key is used [3].
- Initialism: A group of letters formed using the initial letters of a group of words or a phrase [4].
- Ciphertext: The encoded format of a message [5].
- Plaintext: The information of an original message, which is desired to be deciphered from the ciphertext </ref>.
http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.
- Key: What is needed to convert the ciphertext into the plaintext using the one-time pad </ref>.
- ↑ B. David et al., “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers,” Clinical Orthopaedics and Related Research®, Vol.468 (3), pp.885-892 2010. [25] G G. L et al., “What is the Value of a p Value?,” The Annals of Thoracic Surgery, Vol.87(5), pp.1337-1343 2009. [26] No Author.p-value [online]. Available: http://en.wikipedia.org/wiki/P-value#cite_note-nature506-1
- ↑ A. Z Broder et al., “Syntactic clustering of the web”. Computer Networks and ISDN Systems 29 (8), pp.1157–1166. [28] No Author. Video Lectures [online]. Available: https://class.coursera.org/nlp/lecture/17
- ↑ S.M. Bellovin. (2011, July 12). Frank Miller: Inventor of the One-Time Pad [online]. Available: http://www.tandfonline.com.proxy.library.adelaide.edu.au/doi/full/10.1080/01611194.2011.583711#abstract
- ↑ No Author. Initialism [online]. Available: http://dictionary.reference.com/browse/initialism
- ↑ No Author (2011). Topic 1: Cryptography [online]. Available: http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf