Final Report/Thesis 2015

From Derek
Revision as of 20:23, 15 October 2015 by A1628026 (Talk | contribs)

Jump to: navigation, search

Executive Summary

Introduction

Motivation

Previous Studies/Related Work

Aims and Objectives

Significance

Technical Background

P-Value Theorem Explanation

Chi-Squared Test Explanation

Universal Declaration of Human Rights Explanation

Project Gutenberg Explanation

N-Gram Model Explanation

One-Time Pad Explanation

Knowledge Gaps and Technical Challenges

Method - Specific Tasks

Task 1: Statistical Frequency Analysis of Letters

Aim

Method

Results

Evaluation and Justification

Task 2: N-Gram Search

Aim

The aim of Task 2 was to create a search engine to look for regular expressions that could be linked to the Somerton Man Code. Numerous studies from previous groups showed that it is statistically likely for the Somerton Man code to be an English initialism (see Previous Studies/Related Work section). Based on the assumption that the code is an initialism, a search engine was to be developed that used the letters of the code as initial letters of words in commonly used English phrases. This concept was to be explored further using a technique that accesses a larger database in much shorter time than the web crawler developed by groups in previous years (see Previous Studies/Related Work section). Using an n-gram database, rather than crawling the whole web for grams, had the advantage that the crawling had already been done and all grams had been recorded. This was to drastically increase the speed at which the gram combinations on the web could be found.

Method

All variants of code considered, included crossed out line in search.

Research upon the available databases was conducted and it was found that the two largest databases that most suited our needs were 'Microsoft Web N-Gram Services' [1] , and Google N-Gram [2]. The maximum number of grams provided by each database was five. Due to this, a maximum of five letter gram groups from the code could be processed at a time.


This was achieved by writing code in python to generate all possible 5-gram initialisms from all code variants and output them into a corresponding text file. The same was done for 4, 3, 2 and 1-grams and stored in their respective text files. These were to be used as input files from which the search engine was able to perform searches to query the database.

The use of Microsoft Web N-Gram Services is likely to be chosen due to its larger database [3] [4], better provided documentation and tutorials [5], and lower cost (see Budget Section). The improved web-crawler is to be implemented via a web application using developer tools such as Microsoft Visual Studio or Eclipse [14]. The application can be programmed in native languages including Visual Basic, Visual C#, Visual C++, Visual F# and Jscript [6].

Results

Evaluation and Justification

Task 3: Rubaiyat of Omar Khayyam as a One-Time Pad

Aim

Method

Results

Evaluation and Justification

Task 4: Statistical Frequency of Letters Reanalysis

Aim

Method

Results

Evaluation and Justification

Project Management - Planning and Feasibility

Work Breakdown/Deliverables

The workload for this project was broken down into its main tasks. These can be seen in list form in the Final Project Gantt Chart (see Timeline section). The key deliverables are represented as milestones on the Gantt Chart. The dependencies of the tasks and deliverables can be seen in the Gantt Chart as black arrows, these are as follows: The Research Proposal and Progress Report have dependence on the Draft Research Proposal, which has dependence on the Proposal Seminar. Of the specific project tasks, Task 1 was completed first, and Tasks 2, 3 and 4 were completed in parallel. The Final Seminar Presentation, Project Exhibition Poster, Final Performance, Youtube video and Dump of final work are all dependent on the completion of the specific project tasks. The Final Report/Honours Thesis was completed in parallel with the rest of the work from the Research Proposal and Progress Report hand-up, onwards.

Timeline

The timeline for this project was created in the form of a Gantt Chart. The proposed Gantt Chart can be seen in Figure X.

Fig. X: Proposed Project Gantt Chart

The final Gantt Chart after all revisions and updates can be seen in Figure X.

Fig. X: Final Project Gantt Chart

Changes made from the originally proposed Gantt Chart to the final revised Gantt Chart include the renaming of Tasks 2 and 4 to N-Gram Search and Statistical Frequency of Letters Reanalysis. Task 2 was completed earlier than expected, but cleaning up results for presentation and finding meaningful combinations of the results proved to take longer than expected, and so the second part of Task 2 was extended. Task 3 was also extended so that Jikai was able to complete this task. Task 4 was commenced earlier than proposed since the bulk of Task 2 was completed early. The dump of final work and project youtube video were moved to be completed after the due date of the Final Report/Thesis upon discussion with our supervisors. Overall, our initially proposed Gantt Chart estimated our project timeline quite accurately and only minor changes needed to be made.

Task Allocation

The workload for the tasks within this project were allocated based on the strengths and skillset of each member, as well as the estimated time taken and complexity of each task. A table of the project task allocation can be seen in Figure X. The key allocations were that Nicholas Gencarelli undertook the tasks of Project Management, N-Gram Search and the Project Exhibition Poster. Jikai Yang undertook the tasks of the use of the Rubaiyat of Omar Khayyam as a One-time Pad, and the project Youtube video. The allocations did not require changing throughout the project life cycle apart from the decision for both members to perform a statistical reanalysis for Task 4 rather than both analysing the mass spectrometer data from the Somerton Man's hair.

Fig. X: Table of Task Allocation

Management Strategy

A number of management strategies were adopted for use throughout the project. One of which was frequent face-to-face contact through regular meetings every 2-3 weeks. Another was regular communication between group members via text message and email. Collaboration is another strategy that was useful, if one member required assistance on a particular task, the other was able to step in and help. This was achieved through the use of flexible task allocation. The group was able to make use of collaborative software including Google Drive for working together on project documents, and Git Hub repository for working together on code for software development. The project Wiki page was updated in real time including the weekly progress section to monitor and review work completed by each member every week, as well as plan tasks for the upcoming week. Finally, the use of a Gantt chart was used as a management strategy to incorporate clearly defined task and goals and established a critical path through use of task dependencies.

Budget

Risk Analysis

A risk assessment was undertaken for this project to include risk identification, analysis, evaluation and treatment strategies using the Adelaide University risk matrix procedure [34][7]. This can be seen in Figure X. One of the risks that occurred during the project was the inaccurate estimation of time and resources. This occurred since the group and supervisors were unhappy with the results obtained from the initial analysis of letter frequency performed in Task 1. This was rectified by implementing the flexibility of our schedule and by replacing the initially proposed Task 4: Mass Spectrometer Data Analysis, with a new Task 4: Statistical Frequency of Letters Reanalysis. Another risk that occurred throughout the project was Illness. This was able to be dealt with relatively easily through working from home for a short period of time. The minor misunderstanding of project tasks occurred on a few occasions, but these were clarified through scheduling meetings with group members and supervisors. Bugs in code were reduced to the best of our ability through thorough testing and debugging of code. Finally, the inability to decipher the Somerton Man Code was a risk estimated with an almost certain likelihood. Despite being unable to avoid this risk throughout the project, its effects were considered negligible, and the group was still able to complete all work to the best of its ability, and further the research into the decryption of the code for not only future honours groups, but also the wider community through publishing our results on our Wiki.

Fig. 6: Table of Risk Assessment

Conclusions

Future Work

References

  1. COMPLETE THIS REFERENCE: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
  2. COMPLETE THIS REFERENCE: http://googleresearch.blogspot.nl/2006/08/all-our-n-gram-are-belong-to-you.html
  3. C. X. Zhai et al. (2010, July 19-23). Web N-gram Workshop [online]. Available: http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf
  4. T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
  5. T.Brants and A.Franz. (2006). Web 1T 5-gram Version 1 [online]. Available: https://catalog.ldc.upenn.edu/LDC2006T13
  6. No Author. Visual Studio Languages [online]. Available: https://msdn.microsoft.com/en-us/library/vstudio/ee822860%28v=vs.100%29.aspx
  7. No Author. RISK MANAGEMENT HANDBOOK [online]. Available: http://www.adelaide.edu.au/legalandrisk/docs/resources/Risk_Management_Handbook.pdf

Glossary and Symbols

  • ASIO: Australian Security Intelligence Organisation
  • ASIS: Australian Secret Intelligence Service
  • ASD: Australian Signals Directorate
  • P-value theorem: The p-value is the calculated probability that gives researchers a measure of the strength of evidence against the null hypothesis [1].
  • Chi-Squared Test:
  • Universal Declaration of Human Rights:
  • Project Gutenberg:
  • N-gram model: The N-gram model is a sequence of n items from a given sequence of phonemes, syllables, letters, words or base pairs [2].
  • One-time pad: The one-time pad is a decoder technology which cannot be cracked if the correct key is used [3].
  • Initialism: A group of letters formed using the initial letters of a group of words or a phrase [4].
  • Ciphertext: The encoded format of a message [5].
  • Plaintext: The information of an original message, which is desired to be deciphered from the ciphertext </ref>.

http://www.maths.uq.edu.au/~pa/SCIE1000/gma.pdf</ref>.

  • Key: What is needed to convert the ciphertext into the plaintext using the one-time pad </ref>.


Cite error: <ref> tags exist, but no <references/> tag was found