Final Report
 
  
'''Acknowledgements'''
 
 
The project team would like to extend their deepest gratitude to our supervisor, Prof. Derek Abbott, and co-supervisors, Dr. Brian Ng and Maryam Ebrahimpour, for their continual support and guidance throughout the research project. The advice given throughout helped drive the project forward and allowed for basic investigations on a very interesting topic.
 
 
'''Abstract'''
 
 
The Voynich Manuscript is a 15th century document written in an unknown language or cipher, or possibly a hoax. This report presents the ideas and results of an investigation into possible linguistic properties within the Voynich Manuscript, with the intent of determining possible relationships with known languages. This is performed through basic data-mining and statistical methods. The report reviews previous research carried out by other researchers, details and evaluates the final method and the results obtained by the project team, and briefly outlines the project management.
 
 
==Introduction==
 
 
===Background===
 
 
The Voynich Manuscript is a document written in an unknown script that has been carbon dated to the early 15th century [1] and is believed to have been created in Europe [2]. Named after Wilfrid Voynich, who purchased the manuscript in 1912, it has become a well-known mystery within linguistics and cryptology. It is divided into several different sections based on the nature of the drawings [3]. These sections are:
 
 
*Herbal
 
 
*Astronomical
 
 
*Biological
 
 
*Cosmological
 
 
*Pharmaceutical
 
 
*Recipes
 
 
The folio numbers and examples of each section are outlined in appendix section A.2.
 
In general, hypotheses about the Voynich Manuscript fall into three particular categories [4]. These are as follows:
 
 
*Cipher Text: The text is encrypted.
 
 
*Plain Text: The text is in a plain, natural language that is currently unidentified.
 
 
*Hoax: The text has no meaningful information.
 
 
Note that the manuscript may fall into more than one of these hypotheses [4]. For example, the manuscript may employ steganography, concealing the true meaning within otherwise meaningless text.
 
 
===Aim===
 
 
The aim of the research project is to determine possible features and relationships of the Voynich Manuscript through the analysis of basic linguistic features, and to gain knowledge of these features. This knowledge can be used to aid the future investigation of unknown languages and linguistics.
 
 
The project does not aim to fully decode or understand the Voynich Manuscript itself. That outcome would be beyond excellent but is unreasonable to expect from a single-year project carried out by a small team of student engineers with very little initial knowledge of linguistics.
 
 
===Motivation===
 
 
The project attempts to find relationships and patterns within unknown text through the use of basic linguistic properties and analyses. The Voynich Manuscript is a prime candidate for analysis as there is no accepted translation of any part of the document. The relationships found can be used to help narrow future research and to conclude on specific features of the unknown language within the Voynich Manuscript.
 
 
Knowledge produced from the relationships and patterns of languages can be used to further current computational linguistics and encryption/decryption technologies [5].
 
 
While some may question why an unknown text is of any importance to engineering, a more general view of the research project shows that it deals with data acquisition and analysis. This is integral to a wide array of businesses, including engineering, ranging from basic services, such as survey analysis, to more complex automated systems.
 
 
===Significance===
 
 
There are many computational linguistic and encryption/decryption technologies that are in use today. As mentioned in section 1.3, knowledge produced from this research can help advance these technologies in a range of different applications [5]. These include, but are not limited to, information retrieval systems, search engines, machine translators, automatic summarizers, and social networks [5].
 
 
Particular technologies that are widely used today and can benefit from the research include:
 
*Turn-It-In (Authorship/Plagiarism Detection)
 
*Google (Search Engines)
 
*Google Translate (Machine Translation)
 
 
===Technical Background===
 
 
The vast majority of the project relies on a technique known as data mining. Data mining is the process of taking and analysing a large data set in order to uncover particular patterns and correlations within the data, thus creating useful knowledge [6]. In terms of the project, data shall be acquired from the Interlinear Archive, a digital archive of transcriptions of the Voynich Manuscript, and from other sources of digital texts in known languages. Data mined from the Interlinear Archive will be tested and analysed for specific linguistic properties using various basic statistical methods.
 
 
The Interlinear Archive, as mentioned, will be the main source of data in regards to the Voynich Manuscript. It has been compiled as a machine-readable version of the Voynich Manuscript based on transcriptions from various transcribers. Each transcription has been rendered in the European Voynich Alphabet (EVA). An example of the archive in EVA and the corresponding text within the Voynich Manuscript can be seen in appendix section A.3. The EVA itself can be seen in appendix section A.4.
 
 
===Technical Challenges===
 
 
Due to the difficulty of transcribing a hand-written 15th century document, no transcription within the Archive is complete, nor do they all agree with each other. Many tokens within the Voynich Manuscript have been interpreted by different transcribers as a different token, or even as multiple tokens. Spacing between word tokens has also been a key ambiguity, as one transcription may consider a single word token to be multiple word tokens or vice-versa. It is also believed that the manuscript is missing 14 pages [7]. These uncertainties make it difficult to draw firm conclusions from any linguistic analysis.
 
 
The statistical methods relating to linguistics are numerous, leading to many different possible approaches that can be used on the Voynich Manuscript. However, many of the more intricate techniques require some form of knowledge of the language itself, which limits the possible linguistic analysis techniques that can be used. Despite previous research on the Voynich Manuscript, no conclusion has yet been widely accepted [3]. Because of this, the research focuses on the basics of linguistics.
 
 
==Requirements==
 
 
It is not expected that the project fully decodes, or even partially decodes, the Voynich Manuscript. Nonetheless the project must show the following:
 
*A logical approach to investigating the Voynich Manuscript
 
*Critical evaluation of any and all results
 
*Testing on all code
 
*Hypotheses based on results
 
 
==Literature Review==
 
 
Over the years, the Voynich Manuscript has been investigated by numerous scholars and professionals. This has given rise to many possible hypotheses [4] through many different forms of analysis based on its linguistic properties [2]. These properties range from character tokens and word tokens to the syntax and the pages themselves. The currently reviewed literature of interest to the project is summarized below.
 
 
A broad, albeit brief, summary of linguistic analyses completed over previous years is given by Reddy and Knight [2], which also includes some of their own tests. They perform multiple analyses on the letter, the word, the syntax, the pages, and the manuscript itself, while giving reference to other works on the same properties. Their work on the letter and the word is of particular interest to this project. They suggest that vowels may not be represented within the Voynich Manuscript and that Abjad languages have the closest similarities [2]. This is concluded through two-state hidden Markov models and word length distributions respectively. Reddy and Knight also suggest that there are some particular structural similarities within the words when using a minimum description length based algorithm [2].
 
 
Gabriel Landini [3] looks into the statistical characteristics of the manuscript and natural languages. Characterising the text through Zipf’s Law and performing analyses of entropy and character token correlation, Landini suggests that there is some form of linguistic structure behind the Voynich Manuscript [3]. In particular, the work reveals long range correlation, a modal token length, and periodic structures within the text.
 
 
Andreas Schinner [4] takes a different approach in the paper “The Voynich Manuscript: Evidence of the Hoax Hypothesis”. Schinner performs a random walk model and tests token repetition distances through the Levenshtein distance metric. It is concluded that while the results seem to support the hoax hypothesis more so than the others, it cannot rule out any of them [4].
 
 
Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira Jr., and Luciano da F. Costa [5] investigate the statistical properties of unknown texts. They apply various techniques to the Voynich Manuscript, looking at vocabulary size, distinct word frequency, selectivity of words, network characterization, and intermittency of words. Their techniques were aimed at determining useful statistical properties with no prior knowledge of the meaning of the text. Although not aimed specifically at deciphering the Voynich Manuscript, they do conclude that the Voynich Manuscript is compatible with natural languages [5].
 
 
Jorge Stolfi’s website [8] gives multiple views and analyses of the Voynich Manuscript. Stolfi’s work on word length distributions and morphological structure [8] is of particular interest to the project. He displays a remarkable similarity in word length distributions between the Voynich Manuscript and East Asian languages [8]. He also shows evidence of morphological structure, displaying a prefix-midfix-suffix structure [9], and later a crust-mantle-core paradigm [10].
 
 
In regards to research on the Voynich Manuscript carried out at the University of Adelaide, this is the second year that this project has been undertaken by students. Bryce Shi and Peter Roush provide a report on their results [11]. They carry out a multitude of tests on the Voynich Manuscript including:
 
*Zipf’s Law
 
*Word Length Distribution
 
*Word and Picture Association
 
*Word Recurrence Intervals
 
*Entropy
 
*N-Grams
 
*Punctuation
 
*Authorship
 
Shi and Roush give short conclusions to each of these tests but realise that further research is required for any to be considered conclusive [11].
 
 
===Morphology Literature Review===
 
 
Linguistic morphology investigates the meaningful segments within words [17]. By investigating these smaller, meaningful segments it may be possible to break the Voynich Manuscript's words into their most basic parts, simplifying any word analysis.
 
 
Aronoff and Fudeman [17] provide the basics behind morphology and morphological analysis. In particular, they give two basic, complementary approaches: analysis and synthesis. The analytic approach is of interest to this project as it deals with breaking words down into their smallest 'meaningful' sections. They also provide four basic analytic principles used in morphology to aid anyone attempting to undertake morphological analysis. Note that Aronoff and Fudeman also highlight potential issues with morphological structure when comparing between different languages, showing direct comparisons between English and various other languages.
 
 
Durrett and DeNero [18] introduce a supervised approach to predicting the base forms of words, particularly the base forms within morphologically rich languages. Using a data-driven approach, Durrett and DeNero develop software that learns transformation rules from inflection table data. The rules are then applied to a data set to extract the morphological data. While this method can be extended to any language, it unfortunately requires a substantial number of example inflection tables [18], making it unsuitable for use on the Voynich Manuscript. However, the method may prove useful when performing tests on the English language.
 
 
Trost [19] delves into computational morphology, providing the fundamentals behind linguistics, real-world applications, and various forms of computational morphological analysis. The two-level rules for finite-state morphology outlined by Trost are of particular relevance to this project as they show that an affix may change the structure of the word; Trost gives small examples using English plurals. Due to the unknown nature of the Voynich Manuscript, any morphological analysis will need to take the possibility of these small structural changes into account.
 
 
Goldsmith [12] reports on the use of a minimum description length model to analyse the morphological structure of various European languages. He attempts to separate a word into successive morphemes, where possible, corresponding to more traditional linguistic analysis [12]. Goldsmith outlines the minimum description length approach in great detail, along with various other approaches that have been attempted by previous researchers. The results obtained are good but not perfect, noting that a number of issues are still present within his approach and the various other approaches. He concludes that his algorithm is highly likely to differ from that of a human language learner and that the determination of morphology is a complex task.
 
 
==Method Overview==
 
 
The methods used during the project are outlined here. They are split into different phases, where each phase considers a specific linguistic feature or property and attempts to relate it to the Voynich Manuscript while building onto what was learned in the previous phase(s). Many techniques may replicate previous research outlined in section 3. The results within those documents will be used to compare and complement results where possible.
 
 
All phases will be coded and will therefore include testing as all code must be verified for results to be considered accurate. Code will also be well commented and documented within the project wiki.
 
 
Completion of each phase is considered a milestone, as referenced in section 6.1.
 
 
===Phase 1 - Characterization of the Text===
 
 
Characterization of the text involves determining the first-order statistics of the Voynich Manuscript. This first involves pre-processing the Interlinear Archive into a simpler machine-readable format.
 
 
The pre-processed files are then characterized through MATLAB code by finding and determining:
 
*Unique word tokens (Vocabulary Size)
 
*Unique character tokens (Alphabet Size)
 
*Total word tokens
 
*Total character tokens
 
*Frequency of word tokens
 
*Frequency of character tokens
 
*Word token length frequency
 
*Character tokens that only appear at the start, end, or middle of word tokens
 
 
A 'unique' token is considered a token that is different from all other tokens. In terms of character tokens, difference is attributed to the token itself being visually (machine-readable) different from another. In terms of word tokens, difference is attributed to the structure of the word.
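As an illustration, a minimal MATLAB sketch of this characterization is given below. The file name is hypothetical and the text is assumed to be pre-processed with whitespace-separated word tokens.

<syntaxhighlight lang="matlab">
% Minimal sketch of the first-order characterization (file name is hypothetical,
% and the text is assumed to be pre-processed with whitespace-separated words).
txt   = fileread('takahashi_preprocessed.txt');
words = regexp(txt, '\S+', 'match');          % word tokens
chars = txt(~isspace(txt));                   % character tokens

[vocab, ~, wIdx]    = unique(words);          % unique word tokens
[alphabet, ~, cIdx] = unique(chars);          % unique character tokens
wordFreq    = accumarray(wIdx(:), 1);         % frequency of each word token
charFreq    = accumarray(cIdx(:), 1);         % frequency of each character token
wordLenFreq = accumarray(cellfun(@length, words)', 1);  % word-length frequencies

fprintf('Total word tokens: %d\n', numel(words));
fprintf('Unique word tokens (vocabulary size): %d\n', numel(vocab));
fprintf('Total character tokens: %d\n', numel(chars));
fprintf('Unique character tokens (alphabet size): %d\n', numel(alphabet));
</syntaxhighlight>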
 
 
The resulting statistics are then compared with other known languages by using the same code on the various translations of the Universal Declaration of Human Rights. Unfortunately the Universal Declaration of Human Rights is, by comparison, a small document, which will limit results.
 
 
===Phase 2 - English Investigation===
 
 
The English investigation looks into the elementary structure of English text. It specifically examines the representation of the English alphabet and how the alphabetical tokens can be extracted from an English text using statistics. This is done to gain a better understanding of how character tokens are used within text and how data and statistics relating to these character tokens can be used to characterize each token.
 
 
Initially, a corpus of English texts (see Appendix section A.9) shall be passed through the characterization code of phase 1 to determine the first-order statistics of each text. These will be compared to grasp a basic understanding of how each of the tokens can be statistically represented and how these statistics differ between texts. These tokens include alphabetical, numerical, and punctuation tokens.
 
 
The characterization code will then be expanded upon to include character token bigrams to further define the differences between character tokens. Bigrams give the conditional probability, <math>P</math>, of a token <math>T_n</math> given the preceding token <math>T_{n-1}</math>. This is given in the following formula:
 
 
<math>P(T_n \mid T_{n-1}) = \frac{P(T_{n-1}, T_n)}{P(T_{n-1})}</math>
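For example, this conditional probability can be estimated directly from raw counts, as in the minimal sketch below (the file name and the chosen bigram are illustrative only):

<syntaxhighlight lang="matlab">
% Minimal sketch: estimate P(T_n | T_{n-1}) for one character bigram from raw
% counts. The file name and the example bigram 'th' are illustrative only.
txt     = lower(fileread('sample.txt'));
prevTok = 't';
nextTok = 'h';
pairCount = numel(strfind(txt, [prevTok nextTok]));  % occurrences of the bigram
prevCount = sum(txt(1:end-1) == prevTok);            % occurrences of T_(n-1)
condProb  = pairCount / prevCount;                   % estimate of P(T_n | T_(n-1))
</syntaxhighlight>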
 
 
It is expected that the probability of the different tokens along with the first-order statistics, obtained through the phase 1 code, will show definitive differences between alphabetical, numerical, and punctuation tokens.
 
 
Code will be written that takes these statistical findings into account to attempt to extract the English alphabet from any given English text with no prior knowledge of English itself. This will be used to examine the Voynich Manuscript to search for any character token relationships.
 
 
===Phase 3 - Morphology Investigation===
 
 
Morphology deals with the structure of the words, particularly the meaningful segments that make up a word [17]. Specifically, phase 3 will be looking into the possibility of affixes within the Voynich Manuscript.
 
 
As described in section 3, previous research has found the possibility of morphological structure within the Voynich Manuscript [2]. A Minimum Description Length model [12] may be used to attempt to segment word tokens into possible affix models.
 
 
The basis of the code will be examining word tokens within the Interlinear Archive and attempting to find all similar tokens. This will initially determine if a word token appears within another, different word token. Following the Minimum Description Length model, the code will then attempt to find the most compact representation of each word token and any segments that precede or follow it.
 
 
By analysing the word tokens that appear within other word tokens, and their placement within said word tokens, it is expected that a hypothesis for possible prefix, suffix, and stem word tokens can be formed.
 
 
Coding this model into MATLAB will allow for use on the Interlinear Archive. The code will also be used on English texts to provide a qualitative comparison on the effectiveness and limitations of the algorithm.
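A minimal MATLAB sketch of the initial containment check is given below; the file name is hypothetical and no attempt is made at the compact (Minimum Description Length) representation itself.

<syntaxhighlight lang="matlab">
% Minimal sketch: count how often each unique word token appears as a substring
% of another, different unique word token. File name is hypothetical.
txt         = fileread('takahashi_preprocessed.txt');
words       = regexp(txt, '\S+', 'match');
uniqueWords = unique(words);
containCounts = zeros(size(uniqueWords));
for i = 1:numel(uniqueWords)
    w    = uniqueWords{i};
    hits = ~cellfun('isempty', strfind(uniqueWords, w));  % substring matches
    containCounts(i) = sum(hits) - 1;                     % exclude the word itself
end
[~, order] = sort(containCounts, 'descend');              % most embedded tokens first
topTokens  = uniqueWords(order(1:min(10, numel(order))));
</syntaxhighlight>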
 
 
===Phase 4 - Illustration Investigation===
 
 
The illustration investigation looks into the illustrations contained in the Voynich Manuscript. It examines the possible relationships between the text, words, and illustrations.

The different sections of the Voynich Manuscript are based on the drawings and illustrations on their pages. Almost all of the sections contain text accompanied by illustrations, with the exception of the recipes section.
 
 
In Phase 4, the basis of the code will be achieving the following functions:
 
*Find the unique word tokens in each page and section
 
*Determine the location of a given word token
 
*Determine the frequency of a given word token
 
*Compare two given texts, finding the common words and the frequencies of those common words.
 
The resulting statistics from the code can then be used in the investigation. However, it should be noted that the manuscript may have been written by multiple authors and in multiple languages [13]. Sections of the manuscript will need to be investigated separately, particularly those written in different languages, along with the manuscript as a whole.
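A minimal sketch of the word-token search is shown below. It assumes one pre-processed text file per folio in a folder; the folder layout, file names, and example token are illustrative only.

<syntaxhighlight lang="matlab">
% Minimal sketch: locate a given word token by folio and report its frequency.
% Folder layout, file names, and the example token are illustrative only.
target = 'daiin';                                 % example word token (EVA)
files  = dir(fullfile('takahashi', '*.txt'));     % one pre-processed file per folio
totalCount = 0;
for k = 1:numel(files)
    txt   = fileread(fullfile('takahashi', files(k).name));
    words = regexp(txt, '\S+', 'match');
    n = sum(strcmp(words, target));               % exact word-token matches
    if n > 0
        fprintf('%s: %d occurrence(s)\n', files(k).name, n);
    end
    totalCount = totalCount + n;
end
fprintf('Total occurrences of "%s": %d\n', target, totalCount);
</syntaxhighlight>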
 
 
===Phase 5 - Other Ideas===
 
 
This phase will essentially be determining follow-up investigations based on current findings from the other phases. The empirical data found may lead to possible investigations that can be followed up during this phase. It is also quite possible that a phase, particularly phases 3 and 4, may not provide a definitive conclusion or may lead to an impasse. Due to this, phase 5 has been left much more open than the other phases.
 
 
Some other particular investigations that may be completed during this phase include:
 
*Keywords and co-occurrence within the manuscript [14]
 
*Vowel and consonant representation [2]
 
*Word order [2]
 
*Hidden Markov Modelling [11]
 
*15th Century Cipher Analysis [11]
 
It is expected that this phase will eventually be split up into multiple separate phases. At this time it is unknown which follow-up investigations will be completed and, as such, this has been left for discussion at a later date as the previous phases are completed.
 
 
==Implementation and Testing==
 
 
As described earlier, the vast majority of the project uses data mining. This is done through various self-developed software, mainly in C++ and MATLAB, for each phase. To ensure the integrity of any data and/or results, the code is verified through testing. The implementation and testing procedures of the software for each phase are briefly detailed below.
 
 
===Phase 1 - Characterization of the Text===
 
 
====Pre-Processing of the Text====
 
The Interlinear Archive is written in such a way that each page stored within the archive contains the transcriptions of that page by the various transcribers. The transcriptions can be identified through an identifier at the start of each line (as can be seen in Appendix section A.5). To simplify the processing required for the characterization of the text, and for the other phases to be investigated, the pre-processing software re-organises the Interlinear Archive's transcriptions by transcriber.
 
 
The software code also removes, or replaces, any unnecessary formatting. This includes the transcriber identifiers, spacing labels, and any in-line comments. Depending on what character tokens are to be investigated, the software can either keep or remove any extended EVA characters found.
 
 
=====Implementation=====
 
Implementation of the pre-processing software code required an understanding of the layout and formatting used within the Interlinear Archive. This is detailed in the initial pages of the archive but a brief summary can be found in Appendix section A.5.
 
 
Using these details, software was written to read each page of the Interlinear Archive, searching for each specific transcriber through the identifier tags. Upon finding a transcribed line by the specified transcriber the software replicates the page file and the lines within a separate folder designated for the transcriber. This process is repeated until the entirety of the Interlinear Archive has been read.
 
 
A secondary software module is then run to remove, or replace, the unnecessary formatting by reading the pages contained within the now separated and re-organised version of the archive.
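A minimal sketch of this pre-processing is given below. It assumes locus lines of the form <f1r.P.1;H> followed by the transcribed text, with in-line comments enclosed in curly braces; the exact line format, file names, and transcriber code are assumptions rather than a description of the actual software.

<syntaxhighlight lang="matlab">
% Minimal pre-processing sketch. Assumes locus lines such as "<f1r.P.1;H> text",
% with the transcriber code after the semicolon and in-line comments in curly
% braces. File names and the transcriber code are illustrative only.
transcriber = 'H';                                    % e.g. Takahashi
raw   = fileread('f1r.txt');                          % one archive page file
lines = regexp(raw, '\r?\n', 'split');
keep  = {};
for k = 1:numel(lines)
    tag = regexp(lines{k}, '^<[^;>]+;(\w+)>', 'tokens', 'once');
    if ~isempty(tag) && strcmp(tag{1}, transcriber)
        body = regexprep(lines{k}, '^<[^>]*>\s*', '');  % strip the identifier
        body = regexprep(body, '\{[^}]*\}', '');        % strip in-line comments
        keep{end+1} = strtrim(body);                    %#ok<AGROW>
    end
end
fid = fopen('f1r_H.txt', 'w');
fprintf(fid, '%s\n', keep{:});
fclose(fid);
</syntaxhighlight>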
 
 
=====Testing=====
 
Initial testing of the software code was carried out on small English sentences. Each sentence tested a specific feature of the layout and formatting used within the Interlinear Archive, before a final sentence combined all of these features. Once these tests were passed, the pre-processing software was run on specified pages of the Interlinear Archive. These specified pages were also re-formatted by hand and cross-checked with the output.
 
 
This was to ensure that the software itself can identify a transcriber and perform any necessary reformatting within the text.
 
 
====Characterization of the Text====
 
The software code for the characterization of the text is the main objective of Phase 1. It tabulates all relevant first-order statistics of the Voynich Manuscript as detailed in the method overview (section 4). As pre-processing of the Interlinear Archive is completed first, the characterization software was developed to be simple and assumes that pre-processing has occurred.
 
 
=====Implementation=====
 
Initially, a specific transcriber will be chosen based on the completion of the transcription. The pages transcribed by this specific transcriber will then be read by the software code where it will store all the relevant first-order statistics. These are summarised into an output text file.
 
 
=====Testing=====
 
As with the pre-processing software, the testing of the characterization software is initially completed using English. A small paragraph was written and hand-counted for the relevant data. The output of the characterization software was then compared with that which was hand-counted. This is again repeated with a section of the Interlinear Archive.
 
 
This was to ensure that the data received by the characterization software was correct.
 
 
===Phase 2 - English Investigation===
 
 
====Character Token Bigrams====
 
It was found that basic first-order statistics would not allow for the full extraction of the English alphabet from a random text. The character token bigram software aims to give further data to help with the extraction of the English alphabet.
 
 
The software does not perform any formatting and does not distinguish any tokens aside from their 'uniqueness'.
 
 
=====Implementation=====
 
The software initially finds every unique character token within the given text and generates every possible bigram from that set of unique character tokens. It then reads the text counting every occurrence of a bigram and outputs the results to a text file.
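A minimal sketch of this counting step is shown below. The file names are hypothetical, and line breaks are dropped for simplicity, which is an assumption rather than part of the described software.

<syntaxhighlight lang="matlab">
% Minimal sketch: count every character bigram over the set of unique character
% tokens and write the non-zero counts to a text file. File names are
% hypothetical; line breaks are dropped for simplicity.
txt = fileread('english_sample.txt');
txt = txt(txt ~= char(10) & txt ~= char(13));
symbols = unique(txt);                            % unique character tokens
counts  = zeros(numel(symbols));                  % bigram count matrix
for k = 1:numel(txt) - 1
    i = find(symbols == txt(k));
    j = find(symbols == txt(k + 1));
    counts(i, j) = counts(i, j) + 1;
end
fid = fopen('bigram_counts.txt', 'w');
for i = 1:numel(symbols)
    for j = 1:numel(symbols)
        if counts(i, j) > 0
            fprintf(fid, '%c%c %d\n', symbols(i), symbols(j), counts(i, j));
        end
    end
end
fclose(fid);
</syntaxhighlight>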
 
 
=====Testing=====
 
Testing of the character token bigram software was completed using the same English test paragraph used for the characterization software. The process described within the implementation for generating the bigrams was completed by hand and cross-checked with the output of the software.
 
 
This was to ensure that the software correctly generated the bigrams and made no distinctions between tokens. These distinctions included lower-case and upper-case alphabetical tokens, numerical tokens, and punctuation tokens.
 
 
====English Alphabet Extraction====
 
The English alphabet extraction software was the main objective of Phase 2. The software itself is developed to use the currently found statistics to extract the English alphabet from a random English text.
 
 
=====Implementation=====
 
Implementation of the English alphabet extraction software involved analysing the current data and building up from the basics in an attempt to characterize a list of unknown character tokens. This meant that the software would initially look at the first-order statistics and attempt to use those as a means of extracting possible alphabet tokens or non-alphabet tokens. Further data would then be included to help extract tokens that could not be characterized or to refine tokens that had been improperly characterized.
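A minimal sketch of such a first classification pass is given below; the threshold values and file name are purely illustrative and do not reflect the values used in the actual software.

<syntaxhighlight lang="matlab">
% Minimal sketch of a first classification pass using relative character
% frequencies. The threshold values and file name are purely illustrative.
txt   = fileread('treasure_island.txt');
chars = txt(~isspace(txt));
[symbols, ~, idx] = unique(chars);
relFreq = accumarray(idx, 1).' / numel(chars);    % relative frequency per token
highThresh = 0.01;                                % hypothetical thresholds
lowThresh  = 0.0005;
likelyAlphabet    = symbols(relFreq >= highThresh);
likelyNonAlphabet = symbols(relFreq <  lowThresh);
undecided         = symbols(relFreq >= lowThresh & relFreq < highThresh);
% Undecided tokens would then be refined using the bigram statistics.
</syntaxhighlight>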
 
 
A second implementation is currently in development as it was found that the first implementation had a large error deviation when used to analyse different types of English text and when using small sample sizes. This second implementation will be generalising the bigram data further to decrease both the error rate and deviation.
 
 
=====Testing=====
 
Testing of the English alphabet extraction software is completed by inputting texts that have previously been used to extract first-order data and bigram data, as well as texts that have not yet been analysed. This allows the team to check the error rate of the software and determine if the accuracy and precision are sufficient for use on the Voynich Manuscript.
 
 
===Phase 3 -  Morphology Investigation===
 
 
====Basic Morpheme Identification====
 
The basic morpheme identification software reads the text from a chosen text file within a folder and extracts all the combinations, and corresponding frequencies, of character sequences of a specified length that occur at the start of a word (prefix) or end of a word (suffix). This only gives a very basic interpretation of possible affixes and does not attempt to determine what type of affix, if any, the extracted character sequence is.
 
 
=====Implementation=====
 
The current implementation reads a text file from a chosen folder, extracts all the combinations, and corresponding frequencies, of character sequences of a specified length, and outputs the data to a specified text file.
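A minimal sketch of this extraction is shown below; the affix length and file name are illustrative only.

<syntaxhighlight lang="matlab">
% Minimal sketch: extract all prefixes and suffixes of a set character length
% and compute their relative frequencies. Affix length and file name are
% illustrative only.
affixLen = 3;
txt   = fileread('takahashi_preprocessed.txt');
words = regexp(txt, '\S+', 'match');
words = words(cellfun(@length, words) > affixLen);   % ignore words that are too short
prefixes = cellfun(@(w) w(1:affixLen), words, 'UniformOutput', false);
suffixes = cellfun(@(w) w(end-affixLen+1:end), words, 'UniformOutput', false);
[uPre, ~, iPre] = unique(prefixes);
[uSuf, ~, iSuf] = unique(suffixes);
preFreq = accumarray(iPre(:), 1) / numel(words);     % relative prefix frequencies
sufFreq = accumarray(iSuf(:), 1) / numel(words);     % relative suffix frequencies
[~, order]  = sort(sufFreq, 'descend');
topSuffixes = uSuf(order(1:min(10, numel(order))));  % highest-ranked suffixes
</syntaxhighlight>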
 
 
=====Testing=====
 
Testing of the basic morpheme identification software is done by first writing basic English text that includes possible examples that should be identified. The results are manually found and compared with the output of the software for verification.
 
 
Small sections of the Voynich Manuscript are also used. The same manual procedure is used to verify the software output for Voynichese text.
 
 
===Phase 4 -  Illustration Investigation===
 
 
====Search and Analysis Word Token====
 
The software in Phase 4 is a search engine. Its functions are to find the unique word tokens of the Voynich Manuscript in each section and each folio, and to compare the texts between folios. The software also allows the user to search for a given word token and find its location (section, folio, paragraph) and frequency. Again, the database for Phase 4 is Takahashi's transcription, and any word token that is searched for is given in Takahashi's format.
 
=====Implementation=====
 
The first step of Phase 4 is to read through the original Voynich Manuscript and choose a certain folio that contains text with illustrations. All the word tokens occurring in the folio are recorded and then analysed using the MATLAB code from Phase 1 and Phase 4, recording the following features:
 
 
*Number of word tokens in the folio

*Number of unique word tokens in the folio

*Location of each word token in the folio

*Frequency of each of the folio's word tokens in the whole Voynich Manuscript

*Locations of the folio's word tokens that also appear in other folios
 
 
When the statistical computations have finished, a comparison is performed between different folios based on which word tokens occur in each, determining any similar and/or different illustrations in their folios, which are then analysed.
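A minimal sketch of the folio comparison is shown below; the file names are hypothetical and refer to pre-processed single-folio text files.

<syntaxhighlight lang="matlab">
% Minimal sketch: compare the word tokens of two folios and report the common
% words and their frequencies. File names are hypothetical.
wordsA = regexp(fileread('f49v_H.txt'), '\S+', 'match');
wordsB = regexp(fileread('f66r_H.txt'), '\S+', 'match');
common = intersect(wordsA, wordsB);               % shared word tokens
for k = 1:numel(common)
    fprintf('%-12s f49v: %3d   f66r: %3d\n', common{k}, ...
        sum(strcmp(wordsA, common{k})), sum(strcmp(wordsB, common{k})));
end
</syntaxhighlight>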
 
 
The second step is counting the basic features of the Voynich Manuscript by folio and section.
 
 
*Word tokens, the number of word tokens, and the most frequent word token for each folio and section.

*Unique word tokens, the number of unique word tokens, and the most frequent unique word token for each folio and section.
 
 
When the statistical computations have finished, a comparison is performed between the folios, by section, based on which unique word tokens occur in each, determining any similar and/or different illustrations in their folios, which are then analysed.
 
 
=====Testing=====
 
Some of the MATLAB code is the same as in Phase 1 and has already been tested. To test Phase 4's MATLAB code, a certain word token is first chosen from the Phase 1 statistics; the frequency of the word token should be around 10 to 20 so that it is easy to check manually. The word token is then input to the MATLAB code, and the result gives each location of the word token in the Voynich Manuscript along with its frequency. The result is compared with the statistics from Phase 1 to see if they are the same. The location results are also read through in the original Voynich Manuscript to check that the word token is in the right place.
 
 
===Phase 5 - Collocation Investigation===
 
 
====Collocation Extraction and Word Association Metric Calculator====
 
The MATLAB code used in the collocation investigation provides all the found collocations (strict word-pairs) within a given text and calculates the corresponding frequencies and pointwise mutual information metrics.
 
 
=====Implementation=====
 
The extraction section of the code initially tokenizes all of the words within a chosen text file. These words are then paired together in the order they are found to provide all the found collocations. During this process the code determines all the relevant metrics (frequency, relative frequency, PMI) based on the total collocations found and the probability of each single word occurring.
 
 
The results are then output into a text file.
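A minimal sketch of the pair extraction and pointwise mutual information calculation is shown below (the file name is hypothetical):

<syntaxhighlight lang="matlab">
% Minimal sketch: extract adjacent word pairs (strict collocations) and compute
% their relative frequencies and pointwise mutual information. File name is
% hypothetical.
txt   = fileread('takahashi_preprocessed.txt');
words = regexp(txt, '\S+', 'match');
[uWords, ~, wIdx] = unique(words);
pWord = accumarray(wIdx(:), 1) / numel(words);            % single-word probabilities

pairIdx = [wIdx(1:end-1) wIdx(2:end)];                    % adjacent word-pair indices
[uPairs, ~, pIdx] = unique(pairIdx, 'rows');
pairCount = accumarray(pIdx, 1);                          % collocation frequencies
pPair = pairCount / size(pairIdx, 1);                     % relative pair frequencies

pmi = log2(pPair ./ (pWord(uPairs(:, 1)) .* pWord(uPairs(:, 2))));
[~, best] = max(pmi);
fprintf('Highest PMI pair: %s %s (PMI = %.2f)\n', ...
    uWords{uPairs(best, 1)}, uWords{uPairs(best, 2)}, pmi(best));
</syntaxhighlight>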
 
 
=====Testing=====
 
Testing of the collocation and association metric software is done by first writing basic English text which is then processed through the software. The results are manually found and compared with the output of the software for verification.
 
 
==Results==
 
 
===Phase 1===
 
 
The project team began research by pre-processing the Interlinear Archive into separate simple files containing the transcriptions of each unique transcriber. All unnecessary data, such as comments, was removed from each of these transcriptions. In-line formatting was also converted to follow a simpler, machine-readable standard (see Appendix section A.5 for an example).
 
 
To get the most accurate results the team must look into which transcriptions are the most complete. Shi and Roush [11] suggest that the Takahashi transcription is the most complete, based on the total number of lines transcribed. A test on the number of transcribed lines per transcriber was performed again, giving the results in Figure 2 (see appendix section A.6 for a complete list of transcriber codes and their meanings).
 
 
[[File:Num of Lines Transcribed.png|600px|thumb|center|Figure 2: Total Number of Lines Transcribed by Transcriber]]
 
 
This follows the same conclusion of Shi and Roush [11].
 
 
A comparison of the word-length distributions of the top five most complete transcriptions was then carried out. Takahashi’s transcription showed an unusual peculiarity: a single word token of length 35, with the next longest being of length 15. However, this word token was composed mainly of unknown ‘*’ characters and was therefore removed from our data set. This resulted in the word-length distribution plot shown in Figure 3.
 
 
[[File:Relative Word Length Frequencies of Most Completed Transcriptions.png|600px|thumb|center|Figure 3: Relative Word Length Frequencies of Most Completed Transcriptions]]
 
 
This result, again, conforms to the results found by Shi and Roush [11], showing a peak word length of 5 and giving an unusual binomial distribution. This can also be seen in Reddy and Knight [2]. However, Reddy and Knight specifically investigated the word lengths of Language B within the Voynich Manuscript.
 
 
The Universal Declaration of Human Rights was also mined for relative word length distributions. This is, unfortunately, limited to a much smaller number of tokens than the Voynich Manuscript, but gives a good indication as to which languages to investigate further.
 
 
As it is believed that the Voynich originated from Europe [2], European languages were initially compared with the results found above. Using the Takahashi transcription, as it is the most complete, resulted in the following word-length distribution plot in figure 4.
 
 
[[File:Relative Word Length Frequencies of European Languages.png|600px|thumb|center|Figure 4: Relative Word Length Frequencies of European Languages]]
 
 
Many European languages were removed from the plot to make it more readable. Regardless, the resulting conclusion was the same: no tested European language appeared to fit the peak word length and binomial distribution of the Voynich Manuscript. Shi and Roush [11] found similar results, but also showed that the language within the manuscript had a closer resemblance to Hebrew. Reddy and Knight [2] tested Buckwalter Arabic, Pinyin, and ‘de-voweled’ English, resulting in much closer relationships. All gave the appearance of a binomial distribution much like the manuscript, with Buckwalter Arabic being very similar to Voynich Language B. This leads to the hypothesis that the manuscript may be written in the form of an abjad [2].
 
 
Looking specifically at the Takahashi transcription, the following first-order statistics of the full transcription were found (as shown in Table 1).
 
 
{| border="1" class="wikitable"
 
|+
 
! Section
 
! Result - Excluding Extended EVA Characters
 
! Result - Including Extended EVA Characters
 
|-
 
! <div style="text-align: left;">Total Word Tokens</div>
 
| 37919
 
| 37919
 
|-
 
! <div style="text-align: left;">Total Unique Word Tokens</div>
 
| 8151
 
| 8172
 
|-
 
! <div style="text-align: left;">Total Character Tokens</div>
 
| 191825
 
| 191921
 
|-
 
! <div style="text-align: left;">Total Unique Character Tokens</div>
 
| 23
 
| 48
 
|-
 
! <div style="text-align: left;">Longest Word Token</div>
 
| 15
 
| 15
 
|-
 
|}
 
 
''Table 1: First-Order Statistics (Takahashi)''
 
 
 
The Takahashi transcription was also characterized based on the different sections, as outlined in Appendix Section A.2, and is summarized in Table 2 and Table 3 below.
 
 
{| border="1" class="wikitable"
 
|+
 
! Section
 
! Total Word Tokens
 
! Total Unique Word Tokens
 
! Total Character Tokens
 
! Total Unique Character Tokens
 
! Total Longest Word Token
 
|-
 
! <div style="text-align: left;">Herbal</div>
 
| 11475
 
| 3423
 
| 54977
 
| 23
 
| 13
 
|-
 
! <div style="text-align: left;">Astronomical</div>
 
| 3057
 
| 1630
 
| 15777
 
| 20
 
| 14
 
|-
 
! <div style="text-align: left;">Biological</div>
 
| 6915
 
| 1550
 
| 34681
 
| 20
 
| 11
 
|-
 
! <div style="text-align: left;">Cosmological</div>
 
| 1818
 
| 834
 
| 9289
 
| 21
 
| 13
 
|-
 
! <div style="text-align: left;">Pharmaceutical</div>
 
| 3972
 
| 1668
 
| 20168
 
| 21
 
| 15
 
|-
 
! <div style="text-align: left;">Recipes</div>
 
| 10682
 
| 3102
 
| 56933
 
| 21
 
| 14
 
|-
 
|}
 
 
''Table 2: Takahashi First-Order Statistics By Section - Excluding Extended EVA Characters''
 
 
 
{| border="1" class="wikitable"
 
|+
 
! Section
 
! Total Word Tokens
 
! Total Unique Word Tokens
 
! Total Character Tokens
 
! Total Unique Character Tokens
 
! Total Longest Word Token
 
|-
 
! <div style="text-align: left;">Herbal</div>
 
| 11475
 
| 3441
 
| 55040
 
| 44
 
| 13
 
|-
 
! <div style="text-align: left;">Astronomical</div>
 
| 3057
 
| 1630
 
| 15781
 
| 23
 
| 14
 
|-
 
! <div style="text-align: left;">Biological</div>
 
| 6915
 
| 1550
 
| 34684
 
| 22
 
| 11
 
|-
 
! <div style="text-align: left;">Cosmological</div>
 
| 1818
 
| 834
 
| 9290
 
| 22
 
| 13
 
|-
 
! <div style="text-align: left;">Pharmaceutical</div>
 
| 3972
 
| 1668
 
| 20180
 
| 24
 
| 15
 
|-
 
! <div style="text-align: left;">Recipes</div>
 
| 10682
 
| 3102
 
| 56946
 
| 29
 
| 14
 
|-
 
|}
 
 
''Table 3: Takahashi First-Order Statistics By Section - Including Extended EVA Characters''
 
 
 
From Tables 1, 2, and 3 above it can clearly be seen that the vast majority of character tokens used, 99.95%, are those of the basic EVA. It can also be observed that the majority of the extended EVA character tokens are found within the Herbal section of the Voynich Manuscript (from Tables 2 and 3).
 
 
All word and character tokens of each transcription have also been recorded along with the frequency at which each occurs.
 
 
 
 
Looking further at the Takahashi transcription, the following first-order statistics were found.[[File:Longest word token.png|200px|thumb|right|Figure 5: Longest word token]][[File: VM 49v.jpg|thumb|100px|right|Folio 49v]]
 
[[File: VM 66r.jpg|thumb|100px|right|Folio 66r]]
 
[[File: diff language.png|thumb|120px|right|Number of single letter word in languages]]
 
 
The longest word token appears in folio f87r in the pharmaceutical section. The word within the Takahashi transcription is “ypchocpheosaiin”; the original in the Voynich Manuscript is shown in Figure 5 on the right.
 
 
The length of the shortest word tokens in the Voynich Manuscript is 1, which means they are single-character words. Table 4 below shows the statistics of all the single-character word tokens in the Voynich Manuscript.
 
 
[[File: shortest word.png|thumb|300px|center|Table 4: Single letter word]]
 
 
 
As it is believed that the Voynich Manuscript originated in Europe [2], the number of single-letter words in several European languages was compared. The table on the right shows the number of single-letter words in different languages. As can be seen from the table, the total number of single-letter words in the Voynich Manuscript is close to that of Greek and Hebrew and quite different from that of French and English. In Greek and Hebrew, single letters can be used to represent numbers. Two typical folios, 49v and 66r, have more single-letter words than most of the other folios. In both of these folios there is a column of single characters in the left margin of the paragraphs. The pictures of the two folios are shown above. Converting the letters using the Takahashi transcription, they are "f o r y e * k s p o * y e * * p o * y e * d y s k y" on folio 49v, and "y o s sh y d o f * x air d sh y f f y o d r f c r x t o * l r t o x p d" on folio 66r. In English writing convention, the order of paragraphs can be indicated by a series of numbers or letters in the left margin of the paragraphs. The single letters in these folios follow this convention, so the letters may represent numbers. This again suggests that the language within the manuscript has a closer resemblance to Hebrew.
 
 
 
 
The frequency of words is helpful when investigating unknown languages, so the most frequent word tokens in the Voynich Manuscript were found, as shown in Table 5 below. To analyse the data, English literature was also investigated, and that data was recorded in Table 6.
 
 
 
[[File:Most Frequent.png|550px|thumb|center|Table 6: Most frequent word in Voynich and English]]
 
 
 
The literature for the English investigation was chosen based on the Voynich Manuscript’s statistics. The total number of word tokens within the Voynich Manuscript is 37919, while the total word token count for the English literature is 40786. Although the total word counts are similar, the number of word tokens that appear only once is quite different: 5647 for the Voynich Manuscript and only 3420 for the English text.
 
 
[[File:Tokens start end.png|180px|thumb|right|Table 6: Tokens appear at the start and end]]
 
[[File:Most Frequent vs.png|600px|thumb|center|Figure 6: Most frequent word in Voynich Vs. English]]
 
 
From Figure 6 it can clearly be seen that the shapes of the two data sets are similar, and that the gap between them narrows after the 10th most frequent word.
 
 
 
Table 6 on the right shows the statistics of character tokens that appear only at the start or end of word tokens within the Takahashi transcription. It can be seen that the token z appears the fewest times at the start and end. Token o appears most often at the start, with a frequency of 8530, and token y appears most often at the end, with a frequency of 15409. In almost all writing systems in the world, punctuation occurs at word edges, and most punctuation marks can only be found at the end of words. As can be seen from Table 6, there are no character tokens in the Voynich language that appear only at the end of words. In the traditional sense, there is therefore likely no punctuation in the Voynich Manuscript.
 
 
All word and character tokens of each transcription have been recorded along with the frequency at which each occurs. Note that the character tokens are currently limited to the basic EVA characters (see appendix section A.4), but this is currently being expanded to include the extended EVA characters. All of the data and statistics from Phase 1 are basic; they can be used in the remaining phases.
 
 
===Phase 2===
 
 
The results found within phase 2 show that the current implementation of the English Extraction software has a high deviation within the error rate when using different English texts. While it is expected that errors will occur, the error rate found is too high to be acceptable. The results below show the prominent data that was found and how it was used within the English extraction software. However, as specified in section 5.2, a second implementation of the English extraction software is being developed.
 
 
Investigating the frequency statistics of each text within the English corpus (see Appendix A.7) alone, only a small number of tokens can be exclusively characterized as either alphabet or non-alphabet tokens. It also shows that the token data of some texts differ and that the majority of character tokens could not be exclusively characterized. The statistics of each text were combined to give the frequencies found over the entire corpus. Tables 7 and 8 below summarize the boundaries found using the combined case, which were used to give threshold values for the extraction software.
 
 
{| border="1" class="wikitable"
 
|+
 
!

! Token Type
 
! Character Token
 
! Relative Frequency
 
|-
 
! <div style="text-align: left;">1</div>
 
| Non-Alphabet Token
 
| ,
 
| 0.018952262
 
|-
 
! <div style="text-align: left;">2</div>
 
| Numerical Token
 
| 2
 
| 0.005601521
 
|-
 
! <div style="text-align: left;">3</div>
 
| Upper-Case Alphabet Token
 
| A
 
| 0.004233199
 
|-
 
! <div style="text-align: left;">4</div>
 
| Lower-Case Alphabet Token
 
| e
 
| 0.118199205
 
|-
 
|}
 
 
''Table 7: Highest Relative Frequency''
 
 
{| border="1" class="wikitable"
 
|+
 
!

! Token Type
 
! Character Token
 
! Relative Frequency
 
|-
 
! <div style="text-align: left;">1</div>
 
| Non-Alphabet Token
 
| =
 
| 0.000000190
 
|-
 
! <div style="text-align: left;">2</div>
 
| Numerical Token
 
| 0
 
| 0.001239706
 
|-
 
! <div style="text-align: left;">3</div>
 
| Upper-Case Alphabet Token
 
| Q
 
| 0.000040890
 
|-
 
! <div style="text-align: left;">4</div>
 
| Lower-Case Alphabet Token
 
| z
 
| 0.000503139
 
|-
 
|}
 
 
''Table 8: Lowest Relative Frequency''
 
 
The frequency of a character token appearing at the beginning or end of a word token was then analysed to determine if any character tokens can be characterized through this data. A table of this data can be viewed in Appendix section A.8. It was found that using this data alone did not allow for accurate characterization, as specific characters that may only appear at the beginning or end of a word token could be either alphabet or non-alphabet characters.
 
 
To further extend the two basic types of data used above, bigrams of the English corpus were generated and analysed. These led to much more prominent results, showing specific characters rarely appearing next to one type of character and more frequently next to another. The most frequent bigrams of each English text within the corpus are shown below in Figures 7 to 11.
 
 
[[File:MacbethBigrams.jpg|400px|thumb|center|Figure 7: Most Frequent Bigrams in Macbeth]]
 
[[File:RobinHoodBigrams.jpg|400px|thumb|center|Figure 8: Most Frequent Bigrams in The Merry Adventures of Robin Hood]]
 
[[File:HeavensBigrams.jpg|400px|thumb|center|Figure 9: Most Frequent Bigrams in The Story of the Heavens]]
 
[[File:FoxgloveBigrams.jpg|400px|thumb|center|Figure 10: Most Frequent Bigrams in An Account of the Foxglove and some of its Medical Uses]]
 
[[File:NewTestamentBigrams.jpg|400px|thumb|center|Figure 11: Most Frequent Bigrams in The New Testament]]
 
 
As shown, 'th' and 'he' were the most frequent bigrams regardless of text.
 
 
Other results found that numerical tokens generally appear only next to other numerical tokens, with the rare occurrence of non-alphabet tokens. This is shown in Figure 12 below. Note that the high occurrence of a bigram beginning with a numerical token and ending with a colon is attributed to the writing style within the New Testament.
 
 
[[File:NumericalBigram.jpg|600px|thumb|center|Figure 12: Numerical Bigram Occurrences]]
 
 
Some non-alphabet tokens proved to be more difficult to characterize due to their lower frequencies and high probability of appearing beside an alphabet token. A small example is shown below in Figure 13.
 
 
[[File:LowBigram.jpg|600px|thumb|center|Figure 13: Low Bigram Occurrences]]
 
 
As can be seen, the bracket token appeared more often than the upper-case alphabet tokens. This is a specific example, as the upper-case tokens were much rarer within the text than the more common alphabet tokens. However, it does show that care must be taken during the analysis and extraction to ensure these tokens are characterized correctly.
 
 
 
Threshold values were, again, determined based on the relative bigram data. These could then be used, along with the basic frequency data obtained earlier, to begin extraction of the English alphabet.
 
 
The extraction software was used on different English texts. The initial text, not included in the English corpus, was Robert Louis Stevenson's book Treasure Island. Running the software over the text produced the lists shown in Table 9.
 
 
{| border="1" class="wikitable"
 
|+
 
!
 
!Results
 
|-
 
! <div style="text-align: left;">Possible Alphabet Tokens</div>
 
| A B C D E F G H I J K L M N O P Q R S T U V W Y a b c d e f g h i j k l m n o p q r s t u v w y z
 
|-
 
! <div style="text-align: left;">Possible Non-Alphabet Tokens</div>
 
| ! " ' ( ) * , - . 0 1 2 3 4 5 6 7 8 9 : ; ? X _ x
 
|-
 
|}
 
 
''Table 9: Treasure Island Alphabet Extraction Results''
 
 
Clearly, the results from Table 9 above are exceptional. Only two character tokens, 'X' and 'x', were characterized incorrectly, giving an error rate of 2.703%.
 
 
The second text, included within the English corpus, to be used within the English extraction software was Howard Pyle's The Merry Adventures of Robin Hood. This produced the following lists in Table 10.
 
 
{| border="1" class="wikitable"
 
|+
 
!
 
!Results
 
|-
 
! <div style="text-align: left;">Possible Alphabet Tokens</div>
 
| ' ( - A B C D E F G H I K L M N O P Q R S T U W Y a b c d e f g h i j k l m n o p q r s t u v w y
 
|-
 
! <div style="text-align: left;">Possible Non-Alphabet Tokens</div>
 
| ! " ) , . 0 1 2 3 4 5 6 7 8 9 : ; ? J V X Z _ x z
 
|-
 
|}
 
 
''Table 10: The Merry Adventures of Robin Hood Alphabet Extraction Results''
 
 
The results here are much worse than initially received. Nine characters were incorrectly characterized giving an error rate of 12.16%. This large increase in error shows that the algorithm used within the software does not return accurate results.
 
 
It was concluded that the errors propagated from the use of threshold values obtained from a much larger data set than that which was tested. This halted the use of the software on the Voynich Manuscript, as any results obtained would be inconclusive with such a high error rate. This also prompted a second implementation of the English extraction software to be written, which will exclude the use of threshold values and focus more on the general data obtained from the bigrams. The second implementation is currently in development.
 
 
These results do, however, allude to differences in writing style. Despite both texts shown here being fiction novels, the frequencies of character tokens and character bigrams differed enough to produce different results. Unfortunately this is far from definitive, but it does show how differences in token frequency can be used within linguistic stylometry. This could be used in a later phase to investigate the authorship of the Voynich Manuscript.
 
 
===Phase 3: Morphology Investigation===
 
 
Due to the errors found within Phase 2, Phase 3 has currently been delayed. However, Phase 3 was briefly investigated; the results from the current implementation of the morphology software only show the number of occurrences of a unique word token within another unique word token. The highest occurring word tokens are shown in Figure 14 below.
 
 
[[File:InitialMorphology.jpg|500px|thumb|center|Figure 14: Highest Occurring Unique Word Tokens Within Other Unique Word Tokens]]
 
 
====Introduction====
 
Within linguistics, morphology, broadly speaking, deals with the study of the internal structure of words [8]. It can be divided into several different categories, depending on the grammar, with the most basic division being between inflection and word-formation [9]. Inflection modifies an existing word to express grammatical categories such as tense, gender, mood, number, and case, while word-formation creates new words through processes such as derivation and compounding.
 
 
English and many other languages contain many words that have some form of internal structure [10]. These internal structures can have multiple different forms and categories, depending on the language itself, with the most common structural units being suffixes and prefixes [11]. This is also known as concatenative morphology.
 
 
Within this small experiment, the most common affixes in English are found and compared with those found within the Voynich Manuscript. Due to the unknown word structure of the Voynich Manuscript, this experiment defines an affix as a sequence of characters that appears at the word edges. Any particular differences found can then be checked against other languages within the corpus.
 
 
====Literature Review====
 
 
Over the past years, many researchers have examined multiple different techniques for extracting different forms of linguistic morphology from various languages [12]. Both unsupervised and supervised techniques have been used. Hammarström presents a particularly simple unsupervised algorithm for the extraction of salient affixes from an unlabelled corpus of a language [13]. This is of particular interest as the Voynich Manuscript does not have any universally accepted morphological structure [4]. Hammarström’s algorithm assumes that salient affixes have to be frequent and that words are simply variable-length sequences of characters. This is a naïve approach to handling the complex nature of morphology, restricting itself to concatenative morphology whose affixes do not necessarily need to be frequent [13]. His results show that the algorithm includes many affixes that would be considered junk affixes, where a junk affix is defined as a sequence of characters that, once affixed to a word, does not change the word in any meaningful way. He states that his results can only give guiding experimental data, and found that the genre of writing, even in the same language, could give significant differences. More informed segmentation and peeling of affix layers was beyond the scope of the paper.
 
 
Eryiğit and Adalı offer two different approaches by using a large Turkish lexicon [14]. One approach was to initially determine the root words which allows for these to be stripped from other words, leaving the possible affixes. The other approach used the reverse order by initially determining the affixes which could then be stripped from the words leaving only root words. Both approaches used rule-based finite state machines as Turkish is a fully concatenative language that only contains suffixes [14]. This approach would not work with the Voynich Manuscript as there is no known lexicon that can be used with the Voynich. However the paper does give evidence on how rule-based approaches can be utilised to determine morphological structure.
 
 
Minnen, Carroll, and Pearce show a method for analysing the inflectional morphology within English [15]. This does not use any explicit lexicon or word-base but does require knowledge of the English language, as it uses a set of morphological generalisations and a list of exceptions to these. This method is available as software modules which could be used in future experiments to compare with other possible methods for determining inflectional morphological structure.
 
 
Snover and Brent present an unsupervised system for the extraction of stems and suffixes with no prior knowledge of the language [11]. The system is entirely probabilistic and attempts to identify the final stems and suffixes for a given list of words. They state that the results and analysis are conservative, showing only a limited number of possible suffixes but, due to this, the system appears to be more precise than other morphology extraction algorithms. However, this system requires a large corpus from which to determine a list of common words. In particular, when testing English, Snover and Brent use the Hansard corpus, which contains approximately 1.6 billion words. Other tests show that the system has particular issues with languages that use more complex morphology.
 
 
Another paper shows extraction of morphology through an extension of the Morfessor Baseline, a tool for unsupervised morphological segmentation. Kohonen, Virpioja and Lagus state that the number of unique words formed through morphology can be very large in a given corpus [12]. They show that by adding labelled data, which is data annotated with its corresponding morphological category, to unlabelled data, the results of the extraction significantly improve. However, this means that knowledge of the language is required to provide such labelled data. They note that by using labelled data they can bias the system to a particular language or task and that it is difficult to avoid biasing across different languages. The number of morphemes themselves may be higher or lower depending on the language.
 
 
Morphology tests and experiments have also been carried out previously on the Voynich Manuscript. Several hypotheses of the basic word structure have been given [4]; these include:
 
 
*Roots and Suffixes model
 
*Prefix-Stem-Suffix model
 
*Core-Mantle-Crust model
 
 
Reddy and Knight [4] perform a test on the Voynich Manuscript by running Linguistica, an unsupervised morphological segmentation algorithm, to segment the words into possible prefixes, stems and suffixes. They conclude that the results suggest there is some form of morphological structure within the Voynich Manuscript.
 
 
 
 
====Methods====
 
 
The affix extraction method exploits the simple definition given to affixes in this report: an affix is a sequence of characters appearing at a word edge. Text is read into a MATLAB function which finds all character sequences of a set length that occur at the start or end of a word and computes their relative frequencies. Any word whose length is less than or equal to the set affix length is ignored.
 
 
The extracted affixes are then ranked by frequency and grouped by character length. These rankings are plotted so that the affixes of the Voynich Manuscript can be compared with those of the other languages.
 
 
All punctuation and casing within the English text was removed. The extraction for the Voynich Manuscript used the simplified Takahashi transcription with the extended EVA characters removed. As the results of the previous investigations suggested that the Voynich Manuscript contains neither punctuation nor casing, any comparison texts used were stripped of punctuation and casing in the same way.
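
The following is a minimal MATLAB sketch of the affix extraction and ranking described above; the function name, argument layout and return values are illustrative only and are not the project code itself.

<pre>
% Minimal MATLAB sketch of the affix extraction and frequency ranking.
% tokens : cell array of word tokens; len : set affix length;
% side   : 'prefix' or 'suffix'. (Illustrative only.)
function [affixes, relFreq] = extractAffixes(tokens, len, side)
    edges = {};
    for k = 1:numel(tokens)
        w = tokens{k};
        if numel(w) <= len                       % ignore words no longer than the affix length
            continue
        end
        if strcmp(side, 'prefix')
            edges{end+1} = w(1:len);             %#ok<AGROW>
        else
            edges{end+1} = w(end-len+1:end);     %#ok<AGROW>
        end
    end
    [affixes, ~, idx] = unique(edges);           % distinct affixes and index map
    counts  = accumarray(idx(:), 1);             % frequency of each affix
    relFreq = counts / sum(counts);              % relative frequencies
    [relFreq, order] = sort(relFreq, 'descend'); % rank by frequency
    affixes = affixes(order);
end
</pre>

For example, extractAffixes(tokens, 3, 'suffix') would return the length-3 suffixes of a text ranked by relative frequency.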
 
 
====Results====
 
 
 
 
The initial results compare the affixes of character lengths two to five from a section of the English text ''The Merry Adventures of Robin Hood'' with those found in the Voynich Manuscript. The results of the prefix extraction and ranking can be seen in the comparison table below.
 
 
[[File:Prefix Table.jpg|800px|thumb|center|Figure X: Prefix Comparison Table]]
 
 
The results of the suffix extraction and ranking can be seen in the comparison table below.
 
 
[[File:Suffix Table.jpg|800px|thumb|center|Figure X: Suffix Comparison Table]]
 
 
Further testing was completed on the suffixes of character lengths 3 and 4, as these showed the greatest divergence between the Voynich Manuscript and English. In particular, there are significant differences between the top two ranked suffixes; hence the top two ranked frequencies of the various languages within the corpus were found and compared. The results for the character length 3 suffixes are shown in the first pair of figures below and the results for the character length 4 suffixes in the second pair.
 
 
[[File:Suffix Length 3.jpg|800px|thumb|center|Figure X: Top 2 Ranked Suffix Length 3 Plot of Various Languages]]
 
 
[[File:Suffix Length 3 Difference.jpg|800px|thumb|center|Figure X: Difference Ratio of Top 2 Ranked Suffix Length 3 of Various Languages]]
 
 
[[File:Suffix Length 4.jpg|800px|thumb|center|Figure X: Top 2 Ranked Suffix Length 4 Plot of Various Languages]]
 
 
[[File:Suffix Length 4 Difference.jpg|800px|thumb|center|Figure X: Difference Ratio of Top 2 Ranked Suffix Length 4 of Various Languages]]
 
 
====Discussion====
 
 
The initial findings of the prefix and suffix comparisons between English and the Voynich Manuscript do not appear to reveal any definitive relationship. From the prefix data it can be clearly seen that, considering only the top 10 frequency-ranked prefixes, the Voynich has noticeably higher prefix frequencies over the entire range. This relationship does not appear to change significantly as the prefix character length is increased.
 
 
The suffix data provides a more significant difference: English begins with values very similar to those of the Voynich at length 2, has a higher rank 1 suffix at length 3, and much lower values at lengths 4 and 5. The range of values at length 4 for English even shows an almost linear relationship, while the Voynich shows an almost exponential decay. With such a significant difference between the length 3 and 4 suffixes, it was decided to focus on these two lengths across the various languages within the corpus.
 
 
Examining the results of the length 3 comparison across the corpus shows that, in terms of relative frequency, only Greek appears to have a relationship with the Voynich Manuscript, with a similar rank 1 suffix relative frequency but a larger rank 2 relative frequency. French also appears to have a similar rank 1 suffix but a significantly lower rank 2. When comparing the difference between the two ranked suffixes, Latin has a very similar difference ratio to that of the Voynich.
 
 
At length 4, again in terms of relative frequency, Greek appears to have a possible relationship with the Voynich, showing a similar rank 1 but a much higher rank 2. However, when comparing the difference ratio between the two ranked suffixes, French has the closest relationship but is still significantly different.
 
 
It should be noted that morphology is full of ambiguities [11]. These findings only provide baseline experimental data and cannot, on their own, establish any morphological relationship.
 
 
It should also be noted that, as with Hammarström's frequency-based approach [13], the basic method used here returns many junk affixes: frequent character sequences at the word edges that do not change the word in any meaningful way.
 
 
====Conclusion====
 
 
The data here does not give any conclusive findings. It does show that there may be possible morphological relationships between the Voynich and other languages, in particular Greek, Latin and French, when using the naïve definition of an affix as a sequence of characters at the edge of a word.
 
 
It may also suggest that there is some form of morphological structure within the Voynich Manuscript, but these results are unable to definitively conclude this either. Further research is required.
 
 
===Phase 4: Words and Illustrations===
 
Phase 4 examines the possible relationships between the text, words and illustrations. Folio f102r2 was selected as the starting point of the investigation.[[File:frog.png|130px|thumb|right|Figure 15: "frog"]] The reason for choosing folio f102r2 is that there is a “frog” in the top right corner, as shown in Figure 15. As the language of the Voynich Manuscript is unknown, it is helpful to anchor the investigation on recognisable illustrations. The image of the folio is in appendix section A.9.
 
 
As shown in Table 11, the following statistics were found for folio f102r2:
 
 
 
[[File:f102r2data.png|500px|thumb|center|Table 11: "Statistic of folio f102r2"]]
 
 
 
There are 21 word tokens that appear in folio f102r2 only and appear only once. These are unique word tokens; they do not appear in any other folio within the whole Voynich Manuscript. The word tokens are:
 
 
airam, chefoly, dethody, desey, eeey, kockhas, kolor, lsais, olockhy, opolkod, qkeeey, qkeeod, qoochey, qyoeey, skekyd, soeees, soefchocphy, sossy, ydeeal, ykeockhey.[[File:f34r.png|150px|thumb|right|Figure 16: "Illustration in f34r"]]
 
[[File:f87r.png|150px|thumb|right|Figure 17: "illustration in f87r"]]
 
 
There are also 21 word tokens that appear once in folio f102r2 and fewer than 10 times in the other folios. Table 12 below shows the statistics for these word tokens.
 
 
 
[[File:wordpage.png|500px|thumb|center|Table 12: "Word tokens within folio f102r2"]]
 
 
 
The unique words in the folio may be descriptions of the unique illustrations in the folio. Folio f102r2 contains 21 unique word tokens and 12 illustrations, but nothing more can be concluded without further information.
 
 
There are two word tokens that appear once in folio f102r2 and also appear exactly once elsewhere: amg and cheodor. These were selected to be investigated first. They occur in folios f34r and f87r respectively. The images of the three folios are in appendix section A.9.
 
 
The word token amg occurs in folios f102r2 and f34r, and the word token cheodor occurs in f102r2 and f87r. All three folios contain text with illustrations. The illustration in f34r is shown in Figure 16 on the right and the illustration in f87r in Figure 17. They appear to be a form of plant with leaves, roots and corollas. There are plants in f102r2 too; however, the plants in f102r2 do not have corollas.
 
 
The similarities between f102r2 and f34r are:[[File:root.png|150px|thumb|right|Figure 18: "root illustration in f102r2 and f34r"]]
 
 
*Both contain the word token amg
 
*Both contain plant illustrations
 
*The plants in both folios have roots
 
 
Figure 18 on the right shows the similar root sections of the two folios' illustrations; the top one is from folio f102r2 and the bottom one from f34r. From this comparison, the word amg may perhaps be used in the Voynich Manuscript as a description of plant roots. Again, this is only a crude conclusion; further information is needed for the investigation.
 
 
 
 
 
 
 
Figure 19 below shows the number of total words and unique words on each folio.
 
 
[[File: Unique word per page.png|1000px|thumb|center|Figure 19: Total word VS. Unique word on each page]]
 
 
The folio with the most unique words is 58r, which has 367 total words and 100 unique words. Folio 65r has no unique words, and the total word count of folio 67r is only 3. The total number of words in the Voynich Manuscript is 37919, of which 5647 are unique. The average number of total words per folio is 168.52, and the average number of unique words per folio is 25.09. Most folios contain only one illustration, so the unique words on a folio might all relate to that illustration; however, the number of unique words per folio is too high to determine which unique words, if any, are used only to describe the illustration.
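
As an illustration of how per-folio counts of this kind could be produced, the following is a minimal MATLAB sketch. It assumes each folio has already been tokenised into a cell array of word tokens and takes 'unique' to mean a word type that appears on no other folio; the function and variable names are hypothetical, not the project code.

<pre>
% Minimal sketch: total and per-folio-unique word counts.
% folios : cell array where folios{k} is a cell array of the word tokens
%          on folio k ('unique' = word type that occurs on no other folio).
function [totals, uniques] = folioWordCounts(folios)
    nFolios   = numel(folios);
    totals    = zeros(nFolios, 1);
    uniques   = zeros(nFolios, 1);
    folioSets = cell(nFolios, 1);
    for k = 1:nFolios
        totals(k)    = numel(folios{k});
        folioSets{k} = unique(folios{k}(:));                    % word types on folio k
    end
    for k = 1:nFolios
        others     = vertcat(folioSets{[1:k-1, k+1:nFolios]});  % types on every other folio
        uniques(k) = numel(setdiff(folioSets{k}, others));      % types seen nowhere else
    end
end
</pre>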
 
 
 
The figure below shows the number of total words and unique words in each section.
 
 
[[File: Unique word per section.png|650px|thumb|center|Figure 19: Total word VS. Unique word of each section]]
 
 
The number of unique words of each section is also too high to analyse.
 
 
Figure 20 below shows the average word length on each folio. It can be seen from the figure that the average word length on each folio is around 5, which is reasonable.
 
 
[[File: Word length.png|1000px|thumb|center|Figure 20: Word length of each page]]
 
 
 
Table 13 below shows the 10 most frequent words in each section.
 
 
[[File: Most frequent words by section.png|1200px|thumb|center|Table 13: Most frequent words by section]]
 
 
 
Table 14 below shows the 10 most frequent unique words in each section.
 
 
[[File: Most frequent unique words by section.png|1200px|thumb|center|Table 14: Most frequent unique words by section]]
 
 
====Methods of analysis====
 
 
Two basic methods have been used throughout the investigation.
 
 
Method 1:
 
*Find out the unique words in this section[[File: Chopin.jpg|175px|thumb|right|Figure 21: "choiin"]]
 
*Select one unique word to analyse
 
*Analyse the folios which contain this word
 
*Find out the similar illustrations between these folios
 
*If the illustrations are similar, the word may be used to describe the similarity.
 
 
The most frequent unique word of the Herbal section, "choiin", was selected for analysis first. The figure on the right shows the original "choiin" in the Voynich Manuscript. The following folios all contain the word "choiin".
 
 
{| align=center border=1
 
|-
 
|<gallery>
 
File:VM 10v.jpg|10v
 
File:VM 15v.jpg|15v
 
File:VM 20r.jpg|20r
 
File:VM 20v.jpg|20v
 
File:VM 28v.jpg|28v
 
File:VM 35v.jpg|35v
 
File:VM 37v.jpg|37v
 
File:VM 38r.jpg|38r
 
File:VM 38v.jpg|38v
 
File:VM 45v.jpg|45v
 
File:VM 49r.jpg|49r
 
</gallery>
 
|}
 
 
 
It can be seen from the folios above that, although they all contain the word "choiin", no clear similarity can be found between them.
 
 
Further information about the folios that contain a given word can be found in the [[Search engine]]. In the search engine, all words in the Voynich Manuscript with frequencies between 5 and 15 are listed in a table by initial letter.
 
 
Method 2:
 
*Find out similar illustrations first.
 
*Analyse the texts in the folios which have similar illustrations.
 
*Find the common words that appear in all of the folios.
 
*If the common words are unique to these folios, the words may be used to describe the similar illustrations (a minimal sketch of this common-word step is given below).
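
The following is a minimal MATLAB sketch of the common-word step only; it assumes the word tokens of each candidate folio are already available as cell arrays, and the names used are illustrative.

<pre>
% Minimal sketch of the common-word step: word types shared by every folio
% in a group with similar illustrations (e.g. 17v, 96v and 99r).
% folios : cell array of token lists, one per folio (illustrative layout).
function common = commonWords(folios)
    common = unique(folios{1}(:));                 % word types of the first folio
    for k = 2:numel(folios)
        common = intersect(common, folios{k}(:));  % keep only shared word types
    end
end
</pre>

Whether the resulting common words are unique to those folios can then be checked against the rest of the manuscript, as described in the method.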
 
 
[[File: Common word.png|175px|thumb|right|Table 21: Common words in 17v,96v and 99r]]
 
{| align=center border=1
 
|-
 
|<gallery>
 
File:VM 17v.jpg|17v
 
File:99rr.jpg|99rr
 
File:VM 96v.jpg|96v
 
</gallery>
 
|}
 
 
 
The pictures above are from folios 17v, 99r and 96v. As shown, the illustrations in the three folios are quite similar. Folio 17v is from the Herbal section, while folios 96v and 99r are from the Pharmaceutical section. The table on the right shows the common words of the three folios. It can be seen from the table that the frequencies of these common words are very high and that they are not unique to these folios. The common words appear in most folios of the manuscript, which makes it difficult to identify any relationship between the words and the similar illustrations.
 
 
====Conclusion====
 
 
The results and data here do not give any conclusive findings. Finding unique words and similar illustrations did not lead to any meaningful conclusion about the relationship between illustrations and words. Reviewing the whole Voynich Manuscript, there are not many clearly identical or similar illustrations.
 
 
===Phase 5: Collocation Investigation===
 
 
**** Need to re-do citations ****
 
 
====Introduction====
 
 
Collocations have no universally accepted formal definition [16] but deal with the words within a language that co-occur more often than would be expected by chance [17]. Natural languages are full of collocations [17], and collocation statistics can vary significantly depending on the metric, such as length or pattern, used to define a collocation [18].
 
 
In this experiment, the definition used for a collocation is that of two words occurring directly next to each other. By extracting and comparing all possible collocations within the Voynich Manuscript and the corpus, a relationship based on word association may be found, or evidence supporting the possibility of a hoax.
 
 
 
====Literature Review====
 
 
Similarly to the results found in the English investigation, collocation statistics are domain and language dependent [18]. Texts within a corpus should therefore be of the same domain so that results can be compared between languages. Even then the statistics will not necessarily be the same, as the recurrence properties of words differ between languages [17]. This makes collocations difficult to translate across languages, but word association metrics may show whether texts in a similar domain have any relationship across languages.
 
 
There are multiple different types of collocations, ranging from basic phrases to strict word-pair collocations [17] such as the collocation defined within this investigation. The word association metrics can also vary significantly, with a range of different statistical methods available to assign a measure [18]. These include, but are not limited to:
 
 
*T-Score
 
*Pearson’s Chi-Square Test
 
*Log-Likelihood Ratio
 
*Pointwise Mutual Information
 
 
Thanopoulos, Fakotakis and Kokkinakis compare these various word association metrics, defining their collocations as strict word-pairs [18]. Their results show that the values of the metrics can vary significantly and that, depending on the choice of association metric, the same collocations will be ranked in different orders. However, despite these differences, the resulting curves from the metrics are generally quite similar.
 
 
Wermter and Hahn also investigate different word association metrics while making comparisons to a simple frequency-based metric [19]. While it is generally assumed that a statistical association measure will produce more viable results [19], Wermter and Hahn argue that such a measure may not necessarily produce better results than simple frequency association if no additional linguistic knowledge is included. Like Thanopoulos, Fakotakis and Kokkinakis, Wermter and Hahn also show that different metrics can return similar output, provided the metric ranks the most likely collocations at the higher ranks while non-collocations are ranked last.
 
 
Pearce states that, with no widely accepted definition of the exact nature of linguistic collocations, there is a lack of any consistent evaluation methodology [20]. Many proposed computer-based collocation definitions are based around the use of N-gram statistics. An issue with this is that a dependency in a collocation may span many words; Pearce gives an example from French where a collocation may span up to 30 words. He shows different methods of assigning a word association metric and states that pointwise mutual information has so far been widely used as a basis. He also states that, despite the lack of a universally accepted definition of a collocation, comparative evaluation is still useful.
 
 
Reddy and Knight [4] show summarized information on the word correlations of the Voynich. In particular they show that the word association of word-pairs within the Voynich at varying distances do not show any significant long-distance correlations and suggest that this may arise from scrambling of the text, generation from a unigram model, or the interleaving of words.
 
 
Shi and Roush [21] of the previous final year project group also carried out a collocation investigation using word-pairs and again found that the Voynich displayed a weak word association measure when compared to other languages. They suggest this could indicate that the manuscript is a hoax or some type of code, further stating that ciphers are designed to have weak word order.
 
 
 
 
====Methods====
 
 
The extraction of the collocations utilised a simple MATLAB function that read a text file and determined every collocation within it, along with the corresponding statistics. Collocations were extracted by first tokenising all the words within the text file and then pairing each adjacent word token in a separate cell array. The frequency of each collocation was tracked as the word tokens were paired.
 
 
To determine the strength of word association, two different metrics were used, and each collocation was ranked by each metric. Initially, each collocation was ranked by its relative frequency, with the most frequent collocations ranked higher than the less frequent collocations. This can be interpreted as the probability of the word-pair occurring within the text (a sketch of this step is given below).
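
The following MATLAB sketch illustrates the word-pair extraction and frequency ranking described above; the input file name and variable names are assumptions, not the project code.

<pre>
% Minimal MATLAB sketch of the word-pair collocation extraction and the
% frequency ranking; assumes punctuation and casing have already been
% removed in pre-processing. The input file name is hypothetical.
raw    = fileread('corpus_text.txt');
tokens = strsplit(strtrim(lower(raw)));                  % word tokens
pairs  = strcat(tokens(1:end-1), {' '}, tokens(2:end));  % adjacent word pairs
[collocs, ~, idx] = unique(pairs);                       % distinct collocations
counts  = accumarray(idx(:), 1);                         % collocation frequencies
relFreq = counts / sum(counts);                          % P(x,y) of each pair
[rankedFreq, order] = sort(relFreq, 'descend');          % frequency ranking
rankedCollocs = collocs(order);
</pre>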
 
 
The second word association metric used was pointwise mutual information (PMI), a widely accepted method of quantifying the strength of word association [18][19][21]. This metric incorporates the probabilities of each word occurring within the text as well as the probability of the two words occurring together. It is defined mathematically as:
 
 
PMI(x,y) = log( P(x,y) / (P(x)P(y)) )
 
 
where P(x) and P(y) are the respective probabilities of each word occurring within the text, and P(x,y) is the probability of the two words occurring together as a pair.
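
Continuing from the variables of the previous sketch, the PMI ranking could be computed as below. For simplicity only single-occurrence collocations are dropped here; the project also ignored single-occurrence words, as noted in the discussion below.

<pre>
% Minimal sketch of the PMI ranking, continuing from the previous sketch
% (tokens, collocs, counts, relFreq). Pairs seen only once are dropped here.
[words, ~, widx] = unique(tokens);               % word types and index map
wordP = accumarray(widx(:), 1) / numel(tokens);  % P(x) for each word type
pmi   = zeros(numel(collocs), 1);
for k = 1:numel(collocs)
    pair   = strsplit(collocs{k});               % the two words of the pair
    px     = wordP(strcmp(words, pair{1}));
    py     = wordP(strcmp(words, pair{2}));
    pmi(k) = log(relFreq(k) / (px * py));        % PMI(x,y) = log(P(x,y)/(P(x)P(y)))
end
keep = counts > 1;                               % ignore single-occurrence pairs
[rankedPMI, order] = sort(pmi(keep), 'descend');
keptCollocs      = collocs(keep);
rankedPMICollocs = keptCollocs(order);
</pre>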
 
 
Both methods allowed for a plot to be generated of the corpus and the Voynich Manuscript such that comparisons could be made between the different languages.
 
 
====Results====
 
 
The following figures show the results obtained from the initial collocation ranking using basic frequencies, together with the results for scrambled versions of the corresponding texts.
 
 
[[File:Frequency Ranked Collocations.jpg|800px|thumb|center|Figure X: Frequency Ranked Collocations]]
 
 
[[File:Scrambled Collocations.jpg|800px|thumb|center|Figure X: Scrambled Frequency Ranked Collocations]]
 
 
The final figure below shows the results obtained when using the PMI metric to rank each of the collocations.
 
 
[[File:PMI Ranked Collocations.jpg|800px|thumb|center|Figure X: PMI Ranked Collocations]]
 
 
====Discussion====
 
 
From the frequency ranking graph above, it can be clearly seen that the Voynich has a very low measure of word association when compared to most other languages. However, at the highest ranks the Voynich shows a higher measure of word association than Hebrew before the two become very similar as the ranks increase. Note that while this measure does show some interesting results, and simple frequency can be considered a reasonably viable metric [19], it covers only a very limited number of rankings and therefore represents only a small portion of the entire data.
 
 
The second frequency graph shows the particular languages, along with the Voynich Manuscript, that displayed a possible relationship, together with the corresponding scrambled versions of the same texts. As shown in previous research, weak word association may be the result of a hoax built from the random placement of words (gibberish) [21] or of the scrambling of a text [4]. If the Voynich Manuscript is either of these, scrambling the text should not have any significant effect on the ranking curve. The scrambling of the text does show a minor drop in the curve, albeit a generally smaller difference than that shown by the other languages.
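
As a simple illustration of the scrambling control, the word order of a text can be randomly permuted before the collocations are re-extracted exactly as in the extraction sketch above; the seed and variable names below are arbitrary.

<pre>
% Minimal sketch of the scrambling control: permute the word order, then
% re-extract the collocations exactly as above.
rng(1);                                                   % arbitrary fixed seed
scrambledTokens = tokens(randperm(numel(tokens)));
scrambledPairs  = strcat(scrambledTokens(1:end-1), {' '}, scrambledTokens(2:end));
</pre>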
 
 
A particular issue with using frequency as a ranking method is the presence of function words. Examining the collocations of English and the other languages shows that many collocations involve function words. These words, such as ‘the’ or ‘and’, appear very frequently, biasing the results. Pointwise mutual information was chosen as a second word association metric in an attempt to relieve this bias and give more viable results [19]. The drawback is that PMI is instead biased towards less frequent words [18][21]; hence any words or collocations that appeared only once were ignored.
 
 
The PMI ranked graph above again shows that the Voynich Manuscript has a very weak measure of word association, far below that of any of the other languages. By accounting for the function word bias of the frequency ranking, the graph also shows that Hebrew has a high measure of word association, unlike in the frequency ranking graph. Note that, due to the much flatter curve and the representation of the words within the text files, there may be errors in the rankings of Hebrew and Arabic.
 
 
As the actual contents of the Voynich Manuscript are unknown, the results shown may be due to differing text domains [17][18]. As the corpus is compiled from various translations of the Old Testament, the results for the other languages are biased towards biblical texts. This may allow for better comparisons between the known languages, but the collocation statistics may differ significantly from those of texts in other domains of the same languages.
 
 
====Conclusion====
 
 
Based on these results we can conclude that the Voynich Manuscript generally has a weaker measure of word association than that of the other tested languages. If only comparing using a simple frequency metric, the Voynich does however show a possible relationship to Hebrew or scrambled Arabic.
 
 
From a more general perspective, the weak measure of word association may also be related to a hoax or to a type of code that hides the word order [4][21]. It may also be due to lax spelling in a less standardised written language. English itself went through many linguistic changes from the 11th century to the 15th century, to the point where Old English may have been almost incomprehensible [22]. With the Voynich carbon dated to the 15th century, it is possible that a non-standardised language, or part thereof, was used throughout the writing of the Voynich.
 
 
===Search engine Phase===
 
An extra phase, the [[search engine]], has been developed.
 
 
The search engine contains two main tables and some basic statistics. One table contains all the folio numbers, and the other contains all the words appearing in the Voynich Manuscript with frequencies between 5 and 15.
 
 
The picture and transcription of a folio can be viewed by clicking the folio's number. The transcriptions are based on the Takahashi transcription. There are 495 words listed in the word table by initial letter. The folios that contain a given word can be viewed by clicking that word.
 
 
==Project Management==
 
 
===Project Status and Timeline===
 
 
As shown in section 4, the project has been split into 5 distinct phases, with the expectation that the 5th phase will be split into several smaller phases. Each of these is to be worked on and completed within a given time to keep the project on schedule.
 
 
Due to issues found within Phase 2, as detailed in section 6.2, the project is behind the initial schedule. Developed software did not function to an acceptable standard. The schedule has since been revisited and modified, expanding the time allocated to Phase 2 and delaying the following phases slightly. Phase 3 was briefly started and Phase 4 is now well in development. As this was an expected risk, the delay has been accounted for and the proper rescheduling has been completed. The current project schedule has been graphically organized and displayed on a Gantt chart viewable in the Appendix section A.1.
 
 
===Deliverables===
 
 
The deliverables of the project are summarized below in Table 13, detailing the deliverable and the respective deadline. The deliverable work schedule can also be viewed within the Gantt chart of the appendix section A.1.
 
 
{| border="1" class="wikitable"
 
|+
 
! Deliverable
 
! Deadline
 
|-
 
! <div style="text-align: left;">Proposal Seminar</div>
 
| 31st of March, 2015
 
|-
 
! <div style="text-align: left;">Research Proposal Draft</div>
 
| 17th of April, 2015
 
|-
 
! <div style="text-align: left;">Research Proposal and Progress Report</div>
 
| 5th of June, 2015
 
|-
 
! <div style="text-align: left;">Final Seminar</div>
 
| Week 10, Semester 2
 
|-
 
! <div style="text-align: left;">Thesis</div>
 
| Week 11, Semester 2
 
|-
 
! <div style="text-align: left;">Expo Poster</div>
 
| Week 11, Semester 2
 
|-
 
! <div style="text-align: left;">Expo Presentation</div>
 
| Week 12, Semester 2
 
|-
 
! <div style="text-align: left;">YouTube Video</div>
 
| Week 12, Semester 2
 
|-
 
! <div style="text-align: left;">USB Flash Drive of all Code and Work</div>
 
| Week 12, Semester 2
 
|-
 
|}
 
 
''Table 13: Deliverables''
 
 
===Task Allocation===
 
 
Tasks have been allocated to the project team members through the phases in section 4. Collaboration between members will occur during phases 1, 2, and 5. However, it is expected that there will be a considerable amount of collaboration throughout all phases. The current allocations are summarized in table 14 below.
 
 
{| border="1" class="wikitable"
 
|+
 
! Task
 
! Phase
 
! Allocation
 
|-
 
! <div style="text-align: left;">Pre-Processing of Interlinear Archive</div>
 
| 1
 
| Andrew McInnes
 
|-
 
! <div style="text-align: left;">Writing and Testing Voynich First-Order Statistics Code</div>
 
| 1
 
| Andrew McInnes
 
|-
 
! <div style="text-align: left;">Writing and Testing Character Token Code</div>
 
| 1
 
| Lifei Wang
 
|-
 
! <div style="text-align: left;">Expanding First-Order Statistics Code</div>
 
| 2
 
| Andrew McInnes
 
|-
 
! <div style="text-align: left;">Expanding Character Token Code</div>
 
| 2
 
| Lifei Wang
 
|-
 
! <div style="text-align: left;">Writing and Testing English Alphabet Extraction Code</div>
 
| 2
 
| Andrew McInnes
 
|-
 
! <div style="text-align: left;">Writing and Testing Illustration Code</div>
 
| 4
 
| Lifei Wang
 
|-
 
! <div style="text-align: left;">Writing and Testing Stylometry Code</div>
 
| 4
 
| Lifei Wang
 
|-
 
! <div style="text-align: left;">Discussing and Determining Follow-Up Investigations</div>
 
| 5
 
| Andrew McInnes, Lifei Wang
 
|-
 
|}
 
 
''Table 14: Task Allocation''
 
 
===Management Strategy===
 
 
The project team will be managed through a minimum of one weekly internal meeting between members outside of scheduled workshop and project time, and a minimum of one fortnightly meeting with supervisors. Each meeting will involve:
 
*Current phase progress
 
*Issue(s) encountered
 
*Display of any relevant result(s) or research finding(s)
 
Feedback can then be gathered through both team members and supervisors.
 
 
All working copies of code and documents shall also be kept on a group Google Drive. These will be updated as necessary and are available to all team members.
 
 
===Budget===
 
 
The project team has been assigned a budget of $500. However the project is heavily computer-based where all currently required programs are freely available on the University systems. Therefore it is expected that none of the budget will need to be used.
 
 
It is possible that works may be found that are unavailable within the University. Should it be found that these works would further the research then the budget may be used on these works. This will be discussed with supervisors.
 
 
===Risk Analysis===
 
 
Multiple risks have been identified by the project team and ranked according to the likelihood of occurrence and the consequence of an occurrence. The likelihood and consequence were each given a numerical ranking, denoted by the brackets ‘[ ]’. The main risks are summarised in Table 15 below.
 
 
{| border="1" class="wikitable"
 
|+
 
! No.
 
! Risk
 
! Likelihood
 
! Consequence
 
! Risk Level
 
|-
 
! <div style="text-align: left;">1</div>
 
| Underestimation and/or mismanagement of time and resources
 
| High [8]
 
| High [7]
 
| 56
 
|-
 
! <div style="text-align: left;">2</div>
 
| Health related issues from long periods at computers
 
| High [7]
 
| Moderate [6]
 
| 42
 
|-
 
! <div style="text-align: left;">3</div>
 
| Software not functioning to an acceptable standard
 
| Moderate [5]
 
| Moderate [6]
 
| 30
 
|-
 
! <div style="text-align: left;">4</div>
 
| Team member illness or injury
 
| High [7]
 
| Moderate [4]
 
| 28
 
|-
 
! <div style="text-align: left;">5</div>
 
| Issues with communication between team and/or supervisors
 
| Low [3]
 
| High [7]
 
| 21
 
|-
 
! <div style="text-align: left;">6</div>
 
| Loss of software code
 
| Low [2]
 
| Very High [10]
 
| 20
 
|-
 
|}
 
 
''Table 15: Risk Analysis (Summary)''
 
 
The risk level was calculated by multiplying the likelihood rank and the consequence rank. This risk level corresponds to the overall risk that is posed to the project.
 
 
Short descriptions along with mitigation and continuity plans for each risk are detailed below.
 
 
====Underestimation and/or mismanagement of time and resources====
 
 
As the team members undertaking the project have no experience with such a large, software focused project, the likelihood of underestimation or mismanagement of time and/or resources is high.
 
 
Mitigation of this risk shall be through continual meeting within the project team and with supervisors. A minimum of a weekly internal meeting within the team and a fortnightly progress meeting with supervisors shall occur. Phase 5 of the project has also been left deliberately long for this purpose.
 
 
Should the risk still occur, the project schedule shall be discussed and reworked to allow for successful completion within the project time frame.
 
 
====Health related issues from long periods at computers====
 
 
Due to the project being mostly computer-based, team members will be in front of computers for large amounts of time.
 
 
To mitigate any possible issues due to long periods in front of computers, team members will take periodic breaks from the computer during their work.
 
 
====Software not functioning to an acceptable standard====
 
 
The project itself is heavily investigative and software that is written relies on analysing data obtained from this investigation. It is possible that any software written may not function to an acceptable standard, such as having a high error rate.
 
 
Mitigation shall be handled through extensive testing. Software found to be functioning below an acceptable standard may need to be re-written. Note that despite software essentially failing, this is also a valid result as it may show that the procedure used cannot reliably conclude on relevant linguistic properties or relationships.
 
 
====Team member illness or injury====
 
 
The project shall be occurring over two semesters. There is a high likelihood that one, if not both, team members may fall ill or become injured within that time.
 
 
Should any health related issues arise, the member shall inform the rest of the project team and supervisors. Depending on the illness or injury, the other member may take over work from the ill or injured member. As the majority of work is computer-based it is expected that team members will be able to work to some extent during most illnesses or injuries.
 
 
====Issues with communication between team and/or supervisors====
 
 
Team members and supervisors are very busy throughout the project lifetime. As the main form of communication shall be through emails it is possible, although not likely, that miscommunication of information may occur.
 
 
Communication issues shall be mitigated through thoughtful, concise messages. Emails should not contain any ambiguities where possible and any questions that may be raised should be asked immediately.
 
 
====Loss of software code====
 
 
As noted, the majority of the project is software based. It is possible that, through error or malicious intent, software code may be lost. While unlikely, the consequences of this occurring are severe.
 
 
All code will therefore be kept in multiple backups. This includes the use of external flash drives, the University system, and Google Drive.
 
 
==Conclusions==
 
 
The results found thus far cannot provide any significant, comprehensive conclusions. However they do lead to some small, basic hypotheses about the Voynich Manuscript.
 
 
As the number of unique character tokens used within the Voynich Manuscript is small compared to that found within a modern language such as English, it can be hypothesised that the alphabet lacks a distinction between cases and possibly lacks punctuation. This was largely the case in other scripts such as classical Greek and classical Latin. It may also indicate that numerals were represented by the alphabetical tokens.
 
 
Again, with the results and amount of data currently collected, the hypotheses above cannot be confirmed. Further research and investigation is required.
 
 
==Citations and References==
 
 
[1] D. Stolte, “Experts determine age of book 'nobody can read',” 10 February 2011. [Online]. Available: http://phys.org/news/2011-02-experts-age.html. [Accessed 12 March 2015].
 
 
[2] S. Reddy and K. Knight, “What We Know About The Voynich Manuscript,” LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 78-86, 2011.
 
 
[3] G. Landini, “Evidence Of Linguistic Structure In The Voynich Manuscript Using Spectral Analysis,” Cryptologia, pp. 275-295, 2001.
 
 
[4] A. Schinner, “The Voynich Manuscript: Evidence of the Hoax Hypothesis,” Cryptologia, pp. 95-107, 2007.
 
 
[5] D. R. Amancio, E. G. Altmann, D. Rybski, O. N. Oliveira Jr. and L. d. F. Costa, “Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript,” PLoS ONE 8(7), vol. 8, no. 7, pp. 1-10, 2013.
 
 
[6] S. Chakrabarti, M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G. Piatetsky-Shapiro and W. Wang, “Data Mining Curriculum: A Proposal (Version 1.0),” 12 April 2015. [Online]. Available: http://www.kdd.org/curriculum/index.html.
 
 
[7] R. Zandbergen, “Description of the Manuscript,” 25 March 2015. [Online]. Available: http://voynich.nu/descr.html.
 
 
[8] J. Stolfi, “Voynich Manuscript stuff,” 23 May 2005. [Online]. Available: http://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/Welcome.html.
 
 
[9] J. Stolfi, “A prefix-midfix-suffix decomposition of Voynichese words,” 10 12 1997. [Online]. Available: http://www.ic.unicamp.br/~stolfi/voynich/97-11-12-pms/.
 
 
[10] J. Stolfi, “A Grammar for Voynichese Words,” 14 June 2000. [Online]. Available: http://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/00-06-07-word-grammar/.
 
 
[11] B. Shi and P. Roush, “Semester B Final Report 2014 - Cracking the Voynich code,” University of Adelaide, Adelaide, 2014.
 
 
[12] J. Goldsmith, “Unsupervised Learning of the Morphology of a Natural Language,” Computational Linguistics, pp. 153-198, 2001.
 
 
[13] P. Currier, “New Research on the Voynich Manuscript: Proceedings of a Seminar,” 30 November 1976. [Online]. Available: http://www.voynich.nu/extra/curr_main.html.
 
 
[14] M. A. Montemurro and D. H. Zanette, “Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis,” PLoS ONE, vol. 8, no. 6, pp. 1-9, 2013.
 
 
[15] “The Voynich Manuscript,” 22 March 2015. [Online]. Available: https://archive.org/details/TheVoynichManuscript.
 
 
[16] R. Zandbergen, “Analysis of the text,” 13 April 2015. [Online]. Available: http://www.voynich.nu/analysis.html.
 
 
[17] M. Aronoff and K. Fudeman, “What is Morphology,” vol. 8, John Wiley & Sons, pp. 1-25, 2011.
 
 
[18] G. Durrett and J. DeNero, “Supervised Learning of Complete Morphological Paradigms,” HLT-NAACL, pp. 1185-1195, 2013.
 
 
[19] H. Trost, “Computational Morphology,” [Online]. Available: http://ccl.pku.edu.cn/doubtfire/nlp/Lexical_Analysis/Word_Lemmatization/Introduction/Computational%20Morphology.htm. [Accessed 23 May 2015].
 
 
==A. Appendix==
 
 
===A.1 Project Gantt Chart===
 
 
[[File:Project 31 Gantt Chart.png]]
 
 
===A.2 The Voynich Manuscript===
 
 
The following images are of the Voynich Manuscript. These images have been reproduced from the Internet Archive [15]. Note that ‘v’ denotes verso, and ‘r’ denotes recto.
 
 
[[File:Herbal Section.jpg]]
 
 
The herbal section, folios 1r – 66v.
 
 
[[File:Astronomical Section.jpg]]
 
 
The astronomical section, folios 67r – 73v.
 
 
[[File:Biological Section.jpg]]
 
 
The biological section, folios 75r - 84v.
 
 
[[File:Cosmological Section.jpg]]
 
 
The cosmological section, folios 85r – 86v
 
 
[[File:Pharmaceutical Section.jpg]]
 
 
The pharmaceutical section, folios 87r – 102v.
 
 
[[File:Recipes Section.jpg]]
 
 
The recipes section, folios 103r – 116v.
 
 
===A.3 The Interlinear Archive Example===
 
 
The following images are an example of the text found within the Voynich Manuscript and its corresponding translation into the machine-readable EVA. These images have been reproduced from René Zandbergen’s website [16].
 
 
[[File:Voynich Text.jpg]]
 
 
===A.4 The European Voynich Alphabet===
 
 
The EVA as shown on René Zandbergen’s website [16].
 
 
[[File:Basic EVA.jpg]]
 
 
[[File:Extended EVA.jpg]]
 
 
===A.5 Interlinear Archive Format and Pre-Processing Example===
 
 
The Interlinear Archive is formatted in a specific way to allow for the data to be parsed into varying software. The full descriptions of the formatting can be found within the 'F0' pages of the interlinear archive. The main formatting points are summarized in Table A5 below:
 
 
{| border="1" class="wikitable"
 
|+
 
! Format Character
 
! Definition
 
|-
 
! <div style="text-align: left;">.</div>
 
| Definite word break
 
|-
 
! <div style="text-align: left;">,</div>
 
| Dubious word break
 
|-
 
! <div style="text-align: left;">-</div>
 
| Line break within a paragraph
 
|-
 
! <div style="text-align: left;">=</div>
 
| Paragraph break
 
|-
 
! <div style="text-align: left;">!</div>
 
| Filler character - denotes a break that was either skipped or lost (used for aligning different transcriptions)
 
|-
 
! <div style="text-align: left;">%</div>
 
| Filler character - long stretch (used for aligning different transcriptions)
 
|-
 
! <div style="text-align: left;">#</div>
 
| Line comment
 
|-
 
|}
 
 
''Table A5: Basic Interlinear Archive Formatting Characters''
 
 
Other formatting within the Interlinear Archive included:
 
 
*<f17v.P.1;H> - Identifier/Line Locator Tag <'Page Number'.'Unit within Page'.'Line Number';'Transcriber Code'>
 
*{comment} - Inline comment, usually to identify a specific type of illustration within the manuscript
 
*{&123} - Extended EVA character
 
 
The following gives an example of a page from the Interlinear Archive and the pre-processing that is completed during the initial stages of phase 1.
 
 
'''Unprocessed Interlinear Archive Example'''
 
 
 
<f17v.P.1;H>      pchodol.chor.fchy.opydaiin.odaldy-{plant}
 
 
<f17v.P.1;C>      pchodol.chor.pchy.opydaiin.odaldy-{plant}
 
 
<f17v.P.1;F>      pchodol.chor.fchy.opydaiin.odaldy-{plant}
 
 
 
<f17v.P.2;H>      ycheey.keeor.ctho!dal.okol.odaiin.okal-{plant}
 
 
<f17v.P.2;C>      ycheey.kshor.ctho!dal.okol.odaiin.okal-{plant}
 
 
<f17v.P.2;F>      ycheey.keeor.ctho.dal.okol.odaiin.okal-{plant}
 
 
 
<f17v.P.3;H>      oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
 
 
<f17v.P.3;C>      oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
 
 
<f17v.P.3;F>      oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
 
 
 
<f17v.P.4;H>      kchor.fchol.cphol.olcheol.okeeey-{plant}
 
 
<f17v.P.4;C>      kchor.fchol.cphol.olcheol.okee!y-{plant}
 
 
<f17v.P.4;F>      kchor.fchol.cphol.olcheol.okeeey-{plant}
 
 
 
'''Processed File for H'''
 
 
pchodol chor fchy opydaiin odaldy 
 
 
ycheey keeor cthodal okol odaiin okal
 
 
 
oldaim odaiin okal oldaiin chockhol olol
 
 
kchor fchol cphol olcheol okeeey
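
A minimal MATLAB sketch of this pre-processing, keeping only the Takahashi (H) lines, is given below. The input file name and the exact clean-up rules of the project code are assumptions based on the formatting characters in Table A5.

<pre>
% Minimal MATLAB sketch of the pre-processing illustrated above, keeping
% only the Takahashi (H) lines. The input file name and the exact clean-up
% rules of the project code are assumptions based on Table A5.
lines = strsplit(fileread('f17v.txt'), '\n');
out   = {};
for k = 1:numel(lines)
    line = strtrim(lines{k});
    if isempty(line) || line(1) == '#'                  % skip blanks and line comments
        continue
    end
    if isempty(regexp(line, '^<f\w+\.\w+\.\d+;H>', 'once'))
        continue                                        % keep transcriber H only
    end
    line = regexprep(line, '^<[^>]*>\s*', '');          % strip the locator tag
    line = regexprep(line, '\{[^}]*\}', '');            % strip {comments} and {&nnn} extended EVA
    line = regexprep(line, '[!%]', '');                 % drop filler characters
    line = regexprep(line, '[-=.,]', ' ');              % line, paragraph and word breaks -> space
    out{end+1} = strtrim(regexprep(line, '\s+', ' '));  %#ok<AGROW>
end
</pre>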
 
 
===A.6 Transcriber Codes===
 
 
The following is a list of the transcriber codes and their respective transcribers.
 
 
{| border="1" class="wikitable"
 
|+
 
! Transcriber Code
 
! Transcriber
 
|-
 
! <div style="text-align: left;">C</div>
 
| Currier
 
|-
 
! <div style="text-align: left;">F</div>
 
| Friedman (First Study Group)
 
|-
 
! <div style="text-align: left;">T</div>
 
| John Tiltman
 
|-
 
! <div style="text-align: left;">L</div>
 
| Don Latham
 
|-
 
! <div style="text-align: left;">R</div>
 
| Mike Roe
 
|-
 
! <div style="text-align: left;">K</div>
 
| Karl Kluge
 
|-
 
! <div style="text-align: left;">J</div>
 
| Jim Reed
 
|-
 
! <div style="text-align: left;">D</div>
 
| Currier Alternative
 
|-
 
! <div style="text-align: left;">G</div>
 
| Friedman Alternative
 
|-
 
! <div style="text-align: left;">I</div>
 
| Jim Reed Alternative
 
|-
 
! <div style="text-align: left;">Q</div>
 
| Karl Kluge Alternative
 
|-
 
! <div style="text-align: left;">M</div>
 
| Don Latham Alternative
 
|-
 
! <div style="text-align: left;">H</div>
 
| Takeshi Takahashi
 
|-
 
! <div style="text-align: left;">N</div>
 
| Gabriel Landini
 
|-
 
! <div style="text-align: left;">U</div>
 
| Jorge Stolfi
 
|-
 
! <div style="text-align: left;">V</div>
 
| John Grove
 
|-
 
! <div style="text-align: left;">P</div>
 
| Father Th. Petersen
 
|-
 
! <div style="text-align: left;">X</div>
 
| Dennis V. Mardle
 
|-
 
! <div style="text-align: left;">Z</div>
 
| Rene Zandbergen
 
|-
 
|}
 
 
''Table A6: Transcriber Codes''
 
 
=== A.7 English Character Token Frequency Data===
 
 
[[File:FrequencyData.jpg|1000px|thumb|center|Figure A8: English Corpus Token Frequency]]
 
 
=== A.8 English Character Token Appear At Start and End Data===
 
 
[[File:Start_end_data.jpg|1000px|thumb|center|Character tokens appearing at the start and end of a word]]
 
 
=== A.9 English Text Corpus ===
 
 
Table A9 below shows the texts used within the English corpus.
 
 
{| border="1" class="wikitable"
 
|+
 
! Text Name
 
! Author
 
! Obtained From
 
|-
 
! <div style="text-align: left;">The Merry Adventures of Robin Hood</div>
 
| Howard Pyle
 
| https://www.gutenberg.org/
 
|-
 
! <div style="text-align: left;">Macbeth</div>
 
| William Shakespeare
 
| https://www.gutenberg.org/
 
|-
 
! <div style="text-align: left;">The New Testament</div>
 
| Various
 
| https://www.gutenberg.org/
 
|-
 
! <div style="text-align: left;">An Account of the Foxglove and its Medical Uses</div>
 
| William Withering
 
| https://www.gutenberg.org/
 
|-
 
! <div style="text-align: left;">The Story of the Heavens</div>
 
| Robert S. Ball
 
| https://www.gutenberg.org/
 
|-
 
|}
 
 
''Table A9: English Corpus''
 
 
==See also==
 
* [[Cracking the Voynich code|Voynich Project, MAIN Page]]
 
* [[Cracking the Voynich Code 2015 weekly progress|Voynich 2015 (Andrew and Lifei) Weekly Progress]]
 
* [https://www.eleceng.adelaide.edu.au/students/wiki/projects/index.php/Projects:2014S1-44_Cracking_the_Voynich_Manuscript_Code Elec Eng 2014 Project Wiki]
 
* [https://drive.google.com/a/student.adelaide.edu.au/folderview?id=0B3xk_r8iaE_IYURhTEhLd1dyeVk&usp=sharing Voynich Manuscript Project 2015 Google Drive]
 
 
==Back==
 
*[https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Cracking_the_Voynich_code_2015 Back to Cracking the Voynich Code 2015]
 
*[https://myuni.adelaide.edu.au/webapps/portal/frameset.jsp Back to MyUni]
 
*[http://www.eleceng.adelaide.edu.au/personal/dabbott Back to Derek Abbott's homepage]
 
*[http://www.eleceng.adelaide.edu.au Back to EEE Department page]
 
*[http://www.adelaide.edu.au Back to the University of Adelaide homepage]
 
