Research Project Proposal and Progress Report
Abstract
The Voynich Manuscript is a document written in an unknown language or cipher, or it may be a hoax. This research proposal presents an approach to identifying possible linguistic properties within the Voynich Manuscript, with the intent of determining possible relationships with other known languages. This is to be performed through data mining and statistical methods. The document reviews previous research carried out by other researchers, presents the proposed method, and details and evaluates the results obtained by the project team so far. The project status and management are also outlined.
Contents
- 1 Introduction
- 2 Requirements
- 3 Literature Review
- 4 Proposed Method
- 5 Implementation and Testing
- 6 Results
- 7 Project Management
- 7.1 Project Status and Timeline
- 7.2 Deliverables
- 7.3 Task Allocation
- 7.4 Management Strategy
- 7.5 Budget
- 7.6 Risk Analysis
- 7.6.1 Underestimation and/or mismanagement of time and resources
- 7.6.2 Health related issues from long periods at computers
- 7.6.3 Software not functioning to an acceptable standard
- 7.6.4 Team member illness or injury
- 7.6.5 Issues with communication between team and/or supervisors
- 7.6.6 Loss of software code
- 8 Preliminary Conclusions
- 9 Citations and References
- 10 A. Appendix
- 10.1 A.1 Project Gantt Chart
- 10.2 A.2 The Voynich Manuscript
- 10.3 A.3 The Interlinear Archive Example
- 10.4 A.4 The European Voynich Alphabet
- 10.5 A.5 Interlinear Archive Format and Pre-Processing Example
- 10.6 A.6 Transcriber Codes
- 10.7 A.7 English Character Token Frequency Data
- 10.8 A.8 English Character Token Appear At Start and End Data
- 10.9 A.9 English Text Corpus
- 11 See also
Introduction
Background
The Voynich Manuscript is a document written in an unknown script that has been carbon dated to the early 15th century [1] and is believed to have been created within Europe [2]. Named after Wilfrid Voynich, who purchased the folio in 1912, the manuscript has become a well-known mystery within linguistics and cryptology. It is divided into several sections based on the nature of the drawings [3]. These sections are:
- Herbal
- Astronomical
- Biological
- Cosmological
- Pharmaceutical
- Recipes
The folio numbers and examples of each section are outlined in appendix section A.2. In general, research on the Voynich Manuscript has fallen into three particular hypotheses [4]. These are as follows:
- Cipher Text: The text is encrypted.
- Plain Text: The text is in a plain, natural language that is currently unidentified.
- Hoax: The text has no meaningful information.
Note that the manuscript may fall into more than one of these hypotheses [4]. It may be, for example, that the manuscript is written using steganography, concealing the true meaning within otherwise meaningless text.
Aim
The aim of the project is to determine possible features and relationships of the Voynich Manuscript using statistical methods that can aid in the investigation of unknown languages and linguistics. It is not to fully decode or understand the Voynich Manuscript itself. Such an outcome would be beyond excellent but is unreasonable to expect of a single-year project.
Motivation
The research project shall attempt to find relationships and patterns within unknown text through the use of known statistical methods for languages and linguistics. The Voynich Manuscript is a prime candidate for this as there are no accepted translations of any part of the document. The relationships found can be used to verify the statistical methods and to draw conclusions on specific features of the unknown language(s) within the Voynich Manuscript.
Knowledge produced from the relationships and patterns of languages and linguistics can be used to further the current linguistic computation and encryption/decryption technologies of today [5].
Significance
There are many computational linguistic and encryption/decryption technologies that are in use today. As mentioned in section 1.3, knowledge produced from this research can help advance these technologies in a range of different applications [5]. These include, but are not limited to, information retrieval systems, search engines, machine translators, automatic summarizers, and social networks [5].
Particular technologies, that are widely used today, that can benefit from the research, include:
- Turnitin (Authorship/Plagiarism Detection)
- Google (Search Engines)
- Google Translate (Machine Translation)
Technical Background
The vast majority of the project relies on a technique known as data mining. Data mining is the process of taking and analysing a large data set in order to uncover particular patterns and correlations within said data thus creating useful knowledge [6]. In terms of the project, data shall be acquired from the Interlinear Archive, a digital archive of transcriptions from the Voynich Manuscript, and other sources of digital texts in known languages. Data mined from the Interlinear Archive will be tested and analysed for specific linguistic properties using varying statistical methods.
The Interlinear Archive, as mentioned, will be the main source of data in regards to the Voynich Manuscript. It has been compiled to be a machine readable version of the Voynich Manuscript based on transcriptions from various transcribers. Each transcription has been translated into the European Voynich Alphabet (EVA). An example of the archive in EVA and the corresponding text within the Voynich Manuscript can be seen within the appendix section A.3. The EVA itself can be seen within appendix section A.4.
Technical Challenges
Due to the difficulty of transcribing a hand-written 15th century document, no transcription within the Archive is complete, nor do the transcriptions all agree with each other. Many tokens within the Voynich Manuscript have been read differently by different transcribers, sometimes even as multiple tokens. Spacing between word tokens is also a key ambiguity, as one transcription may consider a single word token to be multiple word tokens or vice-versa. It is also believed that the manuscript is missing 14 pages [7]. These uncertainties will make it difficult to draw firm conclusions from any linguistic analyses.
The statistical methods relating to linguistics are numerous, leading to many different possible approaches that can be used upon the Voynich Manuscript. However, many of the more intricate techniques require some form of knowledge of the language itself. This limits the possible linguistic analysis techniques that can be used. Despite previous research on the Voynich Manuscript, no conclusion has been widely accepted [3]. Due to this, the research will be focused on the basics of linguistics.
Knowledge Gaps
The project requires a large amount of software code using various statistical techniques. No project team members are particularly knowledgeable in these areas. As such, all members within the project team shall be developing skills in software programming and knowledge within these statistical techniques as the project develops.
From a broader view, knowledge from statistical methods used on the Voynich Manuscript is plentiful but, so far, none have shown any conclusive, widely-accepted understanding of the text [3]. Throughout the project life, the team hopes to show possible relationships within the Voynich Manuscript through the investigation of different linguistic properties.
Requirements
It is not expected that the project fully decodes, or even partially decodes, the Voynich Manuscript. Nonetheless the project must show the following:
- A logical approach to investigating the Voynich Manuscript
- Critical evaluation of any and all results
- Testing on all code
- Hypotheses based on results
Literature Review
Over the years, the Voynich Manuscript has been investigated by numerous scholars and professionals. This has given rise to many possible hypotheses [4] through many different forms of analysis based on its linguistic properties [2]. These properties range from the character tokens to word tokens, to the syntax and pages themselves. The currently reviewed literature, which is of interest to the project, is summarized below.
A broad, albeit brief, summary of linguistic analyses completed over previous years is given by Reddy and Knight [2], who also include some of their own tests. They perform multiple analyses on the letter, the word, syntax, pages, and the manuscript itself while giving reference to other works on the same properties. Their work on the letter and the word is of particular interest to this project. They suggest that vowels may not be represented within the Voynich Manuscript and that Abjad languages have the closest similarities [2]. These conclusions were reached through two-state hidden Markov models and word length distributions respectively. Reddy and Knight also suggest that there are particular structural similarities within the words when using a minimum description length based algorithm [2].
Gabriel Landini [3] looks into the statistical characteristics of the manuscript and natural languages. Characterising the text through Zipf's Law and performing analysis on entropy and character token correlation, Landini suggests that there is some form of linguistic structure behind the Voynich Manuscript [3]. In particular, the work reveals long range correlations, a modal token length, and periodic structures within the text.
Andreas Schinner [4] takes a different approach in the paper “The Voynich Manuscript: Evidence of the Hoax Hypothesis”. Schinner performs a random walk model and tests token repetition distances through the Levenshtein distance metric. It is concluded that while the results seem to support the hoax hypothesis more so than the others, it cannot rule out any of them [4].
Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira Jr., and Luciano da F. Costa [5] investigate the statistical properties of unknown texts. They apply various techniques to the Voynich Manuscript looking at vocabulary size, distinct word frequency, selectivity of words, network characterization, and intermittency of words. Their techniques were aimed at determining useful statistic properties with no prior knowledge of the meaning of the text. Although not aimed specifically at deciphering the Voynich Manuscript, they do conclude that the Voynich Manuscript is compatible with natural languages [5].
Jorge Stolfi's website [8] gives multiple views and analyses of the Voynich Manuscript. Stolfi's work on word length distributions and morphological structure [8] is of particular interest to the project. He displays a remarkable similarity in word length distributions between the Voynich Manuscript and Eastern Asian languages [8]. He also shows evidence of morphological structure, displaying a prefix-midfix-suffix structure [9], and later a crust-mantle-core paradigm [10].
In regards to research on the Voynich Manuscript carried out at the University of Adelaide, this is the second year that the project has been undertaken by students. Bryce Shi and Peter Roush provide a report on their results [11]. They carried out a multitude of tests on the Voynich Manuscript including:
- Zipf’s Law
- Word Length Distribution
- Word and Picture Association
- Word Recurrence Intervals
- Entropy
- N-Grams
- Punctuation
- Authorship
Shi and Roush give short conclusions to each of these tests but realise that further research is required for any to be considered conclusive [11].
Morphology Literature Review
Linguistic morphology investigates the meaningful segments within words [17]. By investigating these smaller, meaningful segments it may be possible to break the Voynich Manuscript's words into their most basic parts, simplifying any word analysis.
Aronoff and Fudeman [17] provide the basics behind morphology and morphological analysis. In particular, they give two basic, complementary approaches: analysis and synthesis. The analytic approach is of interest to this project as it deals with breaking words down into their smallest 'meaningful' sections. They also provide four basic analytic principles used in morphology to aid anyone attempting to undertake morphological analysis. Note that Aronoff and Fudeman also highlight potential issues with morphological structure when comparing different languages, showing direct comparisons between English and various other languages.
Durrett and DeNero [18] introduce a supervised approach to predicting the base forms of words, particularly within morphologically rich languages. Using a data-driven approach, Durrett and DeNero develop software that learns transformation rules from inflection table data. The rules are then applied to a data set to extract the morphological data. While this method can be extended to any language, it unfortunately requires a substantial number of example inflection tables [18], making it unsuitable for use on the Voynich Manuscript. However, the method may prove useful when performing tests on the English language.
Trost [19] delves into computational morphology, providing the fundamentals behind linguistics, real-world applications, and various forms of computational morphology analysis. The two-level rules for finite-state morphology outlined by Trost are of concern to this project as they show that an affix may change the structure of the word. Trost gives small examples of English plurals. Due to the unknown nature of the Voynich Manuscript, any morphological analysis will need to take the possibility of these small structural changes into account.
Goldsmith [12] reports on the use of a minimum description length model to analyse the morphological structure of various European languages. He attempts to separate a word into successive morphemes, where possible, corresponding to more traditional linguistic analysis [12]. Goldsmith outlines the minimum description length approach in great detail, along with various other approaches that have been attempted by previous researchers. The results obtained are good but not perfect, and he notes that a number of issues remain within his approach and the various others. He concludes that his algorithm is most likely different from that of a human language learner and that the determination of morphology is a complex task.
Proposed Method
The proposed approach to the project has been broken down into multiple phases and is briefly shown in Figure 1.
Each phase will consider a specific feature of the Voynich Manuscript and linguistics while building onto what was learned in the previous phase(s). Many techniques may replicate previous research outlined in section 3. The results within those documents will be used to compare and complement results obtained throughout the life of the project.
All phases will be coded and will therefore include testing as all code must be verified for results to be considered accurate. Code will also be well commented and documented within the project wiki.
Code shall be written in C++ and MATLAB languages as the project members have experience using these programming languages. MATLAB, in particular, is chosen as it provides a simple, easy to use mathematical toolbox that is readily available on the University systems. Other programming languages may be used if it is found to be more suitable.
Completion of each phase is considered a milestone, as referenced in section 7.1.
Phase 1 - Characterization of the Text
Characterization of the text involves determining the first-order statistics of the Voynich Manuscript. This first involves pre-processing the Interlinear Archive into a simpler machine-readable format.
The pre-processed files are then characterized through MATLAB code by finding and determining:
- Unique word tokens
- Unique character tokens
- Frequency of word tokens
- Frequency of character tokens
- Word token length frequency
- Character tokens that only appear at the start, end, or middle of word tokens
A 'unique' token is considered a token that is different from all other tokens.
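As an illustration of how these statistics can be tabulated, the following is a minimal MATLAB sketch, assuming the input is a pre-processed file of space-separated EVA word tokens. The file name and variable names are illustrative only and are not part of the project code.

```matlab
% Minimal sketch of the Phase 1 characterization statistics, assuming the
% input is a pre-processed file of space-separated EVA word tokens.
% The file name and variable names are illustrative only.
text  = fileread('takahashi_preprocessed.txt');
words = strsplit(strtrim(text));                 % all word tokens
chars = [words{:}];                              % all character tokens

[uniqueWords, ~, wIdx] = unique(words);          % unique word tokens
wordFreq = accumarray(wIdx(:), 1);               % frequency of each word token

[uniqueChars, ~, cIdx] = unique(chars);          % unique character tokens
charFreq = accumarray(cIdx(:), 1);               % frequency of each character token

wordLengths = cellfun(@length, words);           % length of each word token
lengthFreq  = accumarray(wordLengths(:), 1);     % word token length frequency

% Character tokens that only ever appear at the start or end of word tokens
firstChars  = cellfun(@(w) w(1),   words);
lastChars   = cellfun(@(w) w(end), words);
mids        = cellfun(@(w) w(2:end-1), words, 'UniformOutput', false);
middleChars = [mids{:}];
startOnly   = setdiff(unique(firstChars), unique([middleChars, lastChars]));
endOnly     = setdiff(unique(lastChars),  unique([middleChars, firstChars]));
```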
Resulting statistics can then be compared with other known languages by running the same code on the various translations of the Universal Declaration of Human Rights. Unfortunately the Universal Declaration of Human Rights is, by comparison, a small document, which will limit results. However, it will give a basis for follow-up research into the languages that have a possible relationship to the Voynich Manuscript based on the first-order statistics.
Further research can be carried out using any languages that appear to have a relationship to the manuscript through the compilation of a much larger corpus.
Phase 2 - English Investigation
The English investigation looks into the elementary structure of English text. It specifically examines the representation of the English alphabet and how the alphabetical tokens can be extracted from an English text using statistics. This is done to gain a better understanding of how character tokens are used within text and how data and statistics relating to these character tokens can be used to characterize each token.
Initially, a corpus of English texts (see Appendix A.9) shall be passed through the characterization code of Phase 1 to determine the first-order statistics of each text. These will be compared to gain a basic understanding of how each of the tokens can be statistically represented and how these statistics differ between texts. These tokens include alphabetical, numerical, and punctuation tokens.
The characterization code will then be expanded to include character token bigrams to further define the differences between character tokens. Bigrams give the conditional probability, P, of a token, T_n, given the preceding token, T_{n-1}. This is given in the following formula:
$$P(T_n \mid T_{n-1}) = \frac{P(T_{n-1}, T_n)}{P(T_{n-1})}$$
It is expected that the probability of the different tokens along with the first-order statistics, obtained through the phase 1 code, will show definitive differences between alphabetical, numerical, and punctuation tokens.
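As a worked illustration with purely hypothetical counts: if the bigram 'th' occurs 1,200 times in a text and the token 't' begins 4,000 bigrams in total, the shared normalising total cancels and the estimate is

$$P(\mathrm{h} \mid \mathrm{t}) = \frac{P(\mathrm{t}, \mathrm{h})}{P(\mathrm{t})} = \frac{1200}{4000} = 0.3$$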
Code will be written that takes these statistical findings into account to attempt to extract the English alphabet from any given English text with no prior knowledge of English itself. This will be used to examine the Voynich Manuscript to search for any character token relationships.
Phase 3 - Morphology Investigation
Morphology deals with the structure of the words, particularly the meaningful segments that make up a word [17]. Specifically, phase 3 will be looking into the possibility of affixes within the Voynich Manuscript.
As described in section 3, previous research has found the possibility of morphological structure within the Voynich Manuscript [2]. A Minimum Description Length model [12] may be used to attempt to segment word tokens into possible affix models.
The basis of the code will be examining word tokens within the Interlinear Archive and attempting to find all similar tokens. This will initially determine if a word token appears within another, different word token. Following the Minimum Description Length model, the code will then attempt to find the most compact representation of the word token and any pre or post word tokens.
By analysing the word tokens that appear within other word tokens, and their placement within said word tokens, it is expected that hypotheses for possible prefix, suffix, and stem word tokens will be formed.
Coding this model into MATLAB will allow for use on the Interlinear Archive. The code will also be used on English texts to provide a qualitative comparison on the effectiveness and limitations of the algorithm.
Phase 4 - Illustration Investigation
The illustration investigation looks into the illustrations contained in the Voynich Manuscript. It will examine the possible relation between text, words, and illustrations. The different sections of the Voynich Manuscript are based on the drawings and illustrations on their pages. Almost all sections consist of text with illustrations, except the recipes section.
In Phase 4, the basis of the code will be achieving the following functions:
- Finding unique word tokens in each page and section
- Determining the location of a given word token
- Determining the frequency of a given word token
The resulting statistics from the code can then be used in the investigation. However, it should be noted that the manuscript may have been written by multiple authors and in multiple languages [13]. Sections of the manuscript will need to be investigated separately, particularly those written in different languages, along with the manuscript as a whole.
Phase 5 - Other Ideas
This phase will essentially be determining follow-up investigations based on current findings from the other phases. The empirical data found may lead to possible investigations that can be followed up during this phase. It is also quite possible that a phase, particularly phases 3 and 4, may not provide a definitive conclusion or may lead to an impasse. Due to this, phase 5 has been left much more open than the other phases.
Some other particular investigations that may be completed during this phase include:
- Keywords and co-occurrence within the manuscript [14]
- Vowel and consonant representation [2]
- Word order [2]
- Hidden Markov Modelling [11]
- 15th Century Cipher Analysis [11]
It is expected that this phase will eventually be split up into multiple separate phases. At this time it is unknown as to which follow-up investigations will be completed and, as such, has been left for discussion at a later date as previous phases become completed.
Implementation and Testing
As described earlier, the vast majority of the project uses data mining. This is done through various self-developed software, mainly in the C++ and MATLAB languages, for each phase. To ensure the integrity of any data and/or results the code is verified through testing. The implementation and testing procedures of the software for each phase are briefly detailed below.
Phase 1 - Characterization of the Text
Pre-Processing of the Text
The Interlinear Archive is written in such a way that each page stored within the archive contains the transcriptions of that page by the various transcribers. The transcriptions can be identified through an identifier at the start of each line (as can be seen in Appendix section A.5). To simplify the processing required for the characterization of the text, and for the other phases to be investigated, the pre-processing software re-organises the Interlinear Archive's transcriptions by transcriber.
The software code also removes, or replaces, any unnecessary formatting. This includes the transcriber identifiers, spacing labels, and any in-line comments. Depending on what character tokens are to be investigated, the software can either keep or remove any extended EVA characters found.
Implementation
Implementation of the pre-processing software code required an understanding of the layout and formatting used within the Interlinear Archive. This is detailed in the initial pages of the archive but a brief summary can be found in Appendix section A.5.
Using these details, software was written to read each page of the Interlinear Archive, searching for each specific transcriber through the identifier tags. Upon finding a line transcribed by the specified transcriber, the software replicates the page file and the lines within a separate folder designated for that transcriber. This process is repeated until the entirety of the Interlinear Archive has been read.
A secondary software module is then run to remove, or replace, the unnecessary formatting by reading the pages contained within the now separated and re-organised version of the archive.
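The following MATLAB sketch illustrates this pre-processing for a single page, assuming only the basic formatting characters listed in Appendix section A.5. The file names and the transcriber code 'H' are illustrative; extended EVA characters and rarer formatting cases are not handled here.

```matlab
% Sketch of the pre-processing of a single Interlinear Archive page,
% keeping only the lines of one transcriber and stripping the basic
% formatting characters listed in Appendix A.5. The file names and the
% transcriber code 'H' are illustrative.
transcriber = 'H';
lines = strsplit(fileread('f17v.txt'), '\n');

out = {};
for k = 1:numel(lines)
    line = strtrim(lines{k});
    if isempty(line) || line(1) == '#'            % skip blanks and comments
        continue;
    end
    % Keep only lines with the chosen transcriber code, e.g. <f17v.P.1;H>
    tag = regexp(line, '^<[^>]*;(\w)>', 'tokens', 'once');
    if isempty(tag) || ~strcmp(tag{1}, transcriber)
        continue;
    end
    line = regexprep(line, '^<[^>]*>', '');       % remove the locator tag
    line = regexprep(line, '\{[^}]*\}', '');      % remove in-line comments
    line = regexprep(line, '[.,]', ' ');          % word breaks become spaces
    line = regexprep(line, '[-=!%]', '');         % drop line/paragraph/filler marks
    out{end+1} = strtrim(line);                   %#ok<AGROW>
end

fid = fopen('f17v_H.txt', 'w');                   % processed file for transcriber H
fprintf(fid, '%s\n', out{:});
fclose(fid);
```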
Testing
Initial testing of the software was carried out on small English sentences. Each sentence tested a specific feature of the layout and formatting used within the Interlinear Archive, with a final sentence combining all of these features. Once these tests were passed, the pre-processing software was run on specified pages of the Interlinear Archive. These specified pages were also re-formatted by hand and cross-checked with the output.
This was to ensure that the software itself can identify a transcriber and perform any necessary reformatting within the text.
Characterization of the Text
The software code for the characterization of the text is the main objective of Phase 1. It tabulates all relevant first-order statistics of the Voynich Manuscript as detailed in the Proposed Method. As pre-processing of the Interlinear Archive is completed first, the characterization software was developed to be simple and assumes that pre-processing has occurred.
Implementation
Initially, a specific transcriber is chosen based on the completeness of their transcription. The pages transcribed by this transcriber are then read by the software, which gathers all the relevant first-order statistics. These are summarised into an output text file.
Testing
As with the pre-processing software, testing of the characterization software was initially completed using English. A small paragraph was written and hand-counted for the relevant data. The output of the characterization software was then compared with the hand-counted data. This was then repeated with a section of the Interlinear Archive.
This was to ensure that the data received by the characterization software was correct.
Phase 2 - English Investigation
Character Token Bigrams
It was found that basic first-order statistics alone would not allow for the full extraction of the English alphabet from an arbitrary text. The character token bigram software aims to provide further data to help with this extraction.
The software does not perform any formatting and does not distinguish tokens by anything other than their 'uniqueness'.
Implementation
The software initially finds every unique character token within the given text and generates every possible bigram from that set of unique character tokens. It then reads the text counting every occurrence of a bigram and outputs the results to a text file.
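A minimal MATLAB sketch of this bigram counting is given below. The input file name is illustrative; as described above, tokens are distinguished only by their uniqueness, so whitespace and punctuation are counted like any other character token.

```matlab
% Minimal sketch of the character token bigram count.
text   = fileread('english_sample.txt');        % illustrative file name
tokens = unique(text);                          % every unique character token
n      = numel(tokens);
counts = zeros(n, n);                           % counts(i,j): token i followed by token j

for k = 1:numel(text) - 1
    i = find(tokens == text(k));
    j = find(tokens == text(k + 1));
    counts(i, j) = counts(i, j) + 1;
end

% Conditional probability estimate P(T_n | T_{n-1}) from the Phase 2 formula
condProb = counts ./ max(sum(counts, 2), 1);    % row-normalised counts

% Save the raw bigram counts for later analysis
dlmwrite('bigram_counts.txt', counts, 'delimiter', '\t');
```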
Testing
Testing of the character token bigram software was completed using the same English test paragraph used for the characterization software. The process described within the implementation for generating the bigrams was completed by hand and cross-checked with the output of the software.
This was to ensure that the software correctly generated the bigrams and made no distinctions between tokens. These distinctions included lower-case and upper-case alphabetical tokens, numerical tokens, and punctuation tokens.
English Alphabet Extraction
The English alphabet extraction software was the main objective of Phase 2. The software itself is developed to use the currently found statistics to extract the English alphabet from a random English text.
Implementation
Implementation of the English alphabet extraction software involved analysing the current data and building up from the basics in an attempt to characterize a list of unknown character tokens. This meant that the software would initially look at the first-order statistics and attempt to use those as a means of extracting possible alphabet tokens or non-alphabet tokens. Further data would then be included to help extract tokens that could not be characterized or to refine tokens that had been improperly characterized.
A second implementation is currently in development, as it was found that the first implementation had a large error deviation when used to analyse different types of English text and when using small sample sizes. This second implementation will generalise the bigram data further to decrease both the error rate and its deviation.
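To make the threshold idea concrete, the following is a highly simplified MATLAB sketch of a first-pass classification. The thresholds, the decision rule, and the file name are placeholders only; the actual software combines several thresholds derived from the first-order and bigram data of the English corpus.

```matlab
% Highly simplified sketch of a threshold-based classification pass.
% Thresholds, decision rule, and file name are placeholders only.
text   = fileread('english_sample.txt');
text   = text(~isspace(text));                       % ignore whitespace tokens
tokens = unique(text);
relFreq = arrayfun(@(t) sum(text == t), tokens) / numel(text);

% Fraction of each token's occurrences adjacent to a 'common' token
common   = tokens(relFreq > 0.02);                   % placeholder threshold
adjScore = zeros(size(tokens));
for k = 1:numel(tokens)
    pos = find(text == tokens(k));
    pos = pos(pos > 1 & pos < numel(text));          % ignore text boundaries
    if isempty(pos), continue; end
    neighbours  = [text(pos - 1), text(pos + 1)];
    adjScore(k) = mean(ismember(neighbours, common));
end

% Placeholder decision rule: reasonably frequent tokens that mostly sit
% beside common tokens are treated as probable alphabet tokens.
isAlphabet = relFreq > 0.001 & adjScore > 0.5;
possibleAlphabet    = tokens(isAlphabet)
possibleNonAlphabet = tokens(~isAlphabet)
```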
Testing
Testing of the English alphabet extraction software is completed by inputting both texts that had previously been used to extract the first-order and bigram data and texts that have not yet been analysed. This allows the team to check the error rate of the software and determine whether the accuracy and precision are sufficient for use on the Voynich Manuscript.
Phase 3 - Morphology Investigation
The software developed for Phase 3 is not yet complete, but a brief overview is included below.
Basic Morpheme Identification
The basic morpheme identification software is currently under development. The software itself attempts to determine possible morphemes based on the unique word tokens found within a given text file.
Implementation
The current implementation takes all the unique word tokens found within a text file and checks if any of these word tokens appear within another of the unique word tokens. These are then listed and output to a text file.
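A minimal MATLAB sketch of this check is shown below, assuming a pre-processed input file of space-separated word tokens; the file names are illustrative only.

```matlab
% Minimal sketch of the basic morpheme identification step: list which
% unique word tokens occur inside other unique word tokens.
text  = fileread('takahashi_preprocessed.txt');  % illustrative file name
words = unique(strsplit(strtrim(text)));         % unique word tokens

fid = fopen('morpheme_candidates.txt', 'w');
for k = 1:numel(words)
    w = words{k};
    % Other unique word tokens that contain w as a substring
    inOthers = ~strcmp(words, w) & ~cellfun(@isempty, strfind(words, w));
    hosts    = words(inOthers);
    if ~isempty(hosts)
        fprintf(fid, '%s (%d): %s\n', w, numel(hosts), strjoin(hosts, ' '));
    end
end
fclose(fid);
```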
Testing
Testing of the basic morpheme identification software is done by first writing basic English text that includes possible examples that should be identified. The results are manually found and compared with the output of the software for verification.
Small sections of the Voynich Manuscript are also used. The same manual procedure is used to verify the software output for Voynichese text.
Phase 4 - Illustration Investigation
Search and Analysis Word Token
The software in Phase 4 is a search engine. Its function is to search for the unique word tokens of the Voynich Manuscript within each section and each folio. The software also allows the user to search for a given word token and find its location (section, folio, paragraph) and frequency. Again, the database for Phase 4 is Takahashi's transcription, and any searched word token is given in Takahashi's format.
Implementation
The first step of Phase 4 is to read through the original Voynich Manuscript and choose a particular folio that contains text with illustrations. All the word tokens occurring in the folio are recorded and then analysed using the MATLAB code from Phases 1 and 4, recording the following features:
- Number of word tokens in the folio
- Number of unique word tokens in the folio
- Location of each word token in the folio
- Frequency of each of the folio's word tokens across the whole Voynich Manuscript
- Locations where the folio's word tokens also appear in other folios
Once the statistical computations are finished, folios that contain the same word tokens are compared to find the similar and different illustrations in each folio, which are then analysed.
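The word token search described above can be sketched in MATLAB as follows, assuming the Takahashi transcription has been pre-processed into one text file per folio in a folder named 'H'; the directory layout and the query token are illustrative only.

```matlab
% Sketch of the Phase 4 word token search over pre-processed folio files.
query = 'amg';                                    % illustrative query token
files = dir(fullfile('H', '*.txt'));              % one pre-processed file per folio
total = 0;

for k = 1:numel(files)
    folio = files(k).name;
    lines = strsplit(fileread(fullfile('H', folio)), '\n');
    for m = 1:numel(lines)
        words = strsplit(strtrim(lines{m}));
        hits  = sum(strcmp(words, query));
        if hits > 0                                % report the location and count
            fprintf('%s, line %d: %d occurrence(s)\n', folio, m, hits);
            total = total + hits;
        end
    end
end
fprintf('Total occurrences of "%s": %d\n', query, total);
```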
Testing
Some of the MATLAB code is the same as the Phase 1 code, which has already been tested. To test Phase 4's MATLAB code, a particular word token is first chosen from the Phase 1 statistics; its frequency should be around 10 to 20 so that it is easy to check manually. The word token is then input to the MATLAB code, which returns each location of the word token in the Voynich Manuscript together with its frequency. The result is compared with the Phase 1 statistics to see whether they agree. The returned locations are also read in the original Voynich Manuscript to confirm that the word token is in the right place.
Results
Phase 1
The project team began research by pre-processing the Interlinear Archive into separate, simple files containing the transcriptions of each unique transcriber. All unnecessary data, such as comments, were removed from each of these transcriptions. In-line formatting was also converted to follow a simpler, machine-readable standard (see Appendix section A.5 for an example).
To get the most accurate results the team must look into which transcriptions are the most complete. Shi and Roush (2014) suggested that the Takahashi transcription was the most complete transcription by checking the total number of lines transcribed [11]. A test on the number of transcribed lines per transcriber was performed again, giving the results in Figure 2 (see Appendix section A.6 for a complete list of transcriber codes).
This follows the same conclusion of Shi and Roush (2014).
A comparison of the word-length distributions of the five most complete transcriptions was then carried out. Takahashi's transcription showed an unusual peculiarity: a single word token of length 35, with the next longest being of length 15. However, this word token was composed mainly of unknown '*' characters and was therefore removed from the data set. This resulted in the word-length distribution plot in Figure 3.
This result again conforms to the results found by Shi and Roush (2014), showing a peak word length of 5 and an unusual, binomial-like distribution. This can also be seen in Reddy and Knight [2]. However, Reddy and Knight specifically investigated the word lengths of language B within the Voynich Manuscript.
The Universal Declaration of Human Rights was also mined for relative word length distributions. This is, unfortunately, limited to a much smaller number of tokens than the Voynich Manuscript, but it shall give a good indication as to which languages to investigate further.
As it is believed that the Voynich originated from Europe [2], European languages were initially compared with the results found above. Using the Takahashi transcription, as it is the most complete, resulted in the following word-length distribution plot in figure 4.
Many European languages were removed from the plot to make it more readable. Regardless, the resulting conclusion was the same: no tested European language appeared to fit the peak word length and binomial-like distribution of the Voynich Manuscript. Shi and Roush [11] found similar results, but also showed that the language within the manuscript had a closer resemblance to Hebrew. Reddy and Knight [2] tested Buckwalter Arabic, Pinyin, and 'de-voweled' English, resulting in much closer relationships. All gave the appearance of a binomial-like distribution much like the manuscript, with Buckwalter Arabic being very similar to Voynich Language B. This leads to the hypothesis that the manuscript may be in the form of an Abjad [2].
Looking specifically at the Takahashi transcription, the following first-order statistics of the full transcription were found (as shown in Table 1).
Statistic | Result - Excluding Extended EVA Characters | Result - Including Extended EVA Characters
---|---|---
Total Word Tokens | 37919 | 37919
Total Unique Word Tokens | 8151 | 8172
Total Character Tokens | 191825 | 191921
Total Unique Character Tokens | 23 | 48
Longest Word Token | 15 | 15

Table 1: First-Order Statistics (Takahashi)
The Takahashi transcription was also characterized based on the different sections, as outlined in Appendix Section A.2, and is summarized in Table 2 and Table 3 below.
Section | Total Word Tokens | Total Unique Word Tokens | Total Character Tokens | Total Unique Character Tokens | Total Longest Word Token
---|---|---|---|---|---
Herbal | 11475 | 3423 | 54977 | 23 | 13
Astronomical | 3057 | 1630 | 15777 | 20 | 14
Biological | 6915 | 1550 | 34681 | 20 | 11
Cosmological | 1818 | 834 | 9289 | 21 | 13
Pharmaceutical | 3972 | 1668 | 20168 | 21 | 15
Recipes | 10682 | 3102 | 56933 | 21 | 14

Table 2: Takahashi First-Order Statistics By Section - Excluding Extended EVA Characters
Section | Total Word Tokens | Total Unique Word Tokens | Total Character Tokens | Total Unique Character Tokens | Total Longest Word Token
---|---|---|---|---|---
Herbal | 11475 | 3441 | 55040 | 44 | 13
Astronomical | 3057 | 1630 | 15781 | 23 | 14
Biological | 6915 | 1550 | 34684 | 22 | 11
Cosmological | 1818 | 834 | 9290 | 22 | 13
Pharmaceutical | 3972 | 1668 | 20180 | 24 | 15
Recipes | 10682 | 3102 | 56946 | 29 | 14

Table 3: Takahashi First-Order Statistics By Section - Including Extended EVA Characters
From Tables 1, 2, and 3 above it can clearly be seen that the vast majority of the character tokens used, 99.95%, are those of the basic EVA. It can also be observed that the majority of the extended EVA character tokens are found within the Herbal section of the Voynich Manuscript (from Tables 2 and 3).
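This proportion follows directly from the character token totals in Table 1:

$$\frac{191825}{191921} \approx 0.9995 = 99.95\%$$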
All word and character tokens of each transcription have also been recorded along with the frequency with which each occurs.
Looking further at the Takahashi transcription, the longest word token appears in folio f87r in the Pharmaceutical section. The word in the Takahashi transcription is "ypchocpheosaiin"; the original in the Voynich Manuscript is shown in Figure 5 on the right.
The shortest word tokens in the Voynich Manuscript have a length of 1, which means they are single-letter words. Table 4 below shows the statistics of all the single-letter word tokens in the Voynich Manuscript.
Word length | Word | Frequency | Rel. Frequency |
---|---|---|---|
1 | * | 31 | 0.0818% |
1 | a | 3 | 0.0079% |
1 | c | 7 | 0.0185% |
1 | d | 50 | 0.1300% |
1 | e | 3 | 0.0079% |
1 | f | 8 | 0.0211% |
1 | g | 11 | 0.0290% |
1 | k | 13 | 0.0343% |
1 | l | 58 | 0.1500% |
1 | m | 11 | 0.0290% |
1 | n | 4 | 0.0105% |
1 | o | 81 | 0.2100% |
1 | p | 5 | 0.0132% |
1 | q | 1 | 0.0026% |
1 | r | 98 | 0.2600% |
1 | s | 243 | 0.6400% |
1 | t | 8 | 0.0211% |
1 | v | 7 | 0.0185% |
1 | x | 9 | 0.0237% |
1 | y | 151 | 0.4000% |
Table 4: Single letter words (shortest words) in the Voynich Manuscript
As it is believed that the Voynich Manuscript originated from Europe [2], European languages with many single-letter words are of particular interest. In Greek and Hebrew, for example, the letters of the alphabet are also used to represent numbers. This again suggests that the language within the manuscript may have a closer resemblance to Hebrew.
Word frequency is helpful when investigating unknown languages. The most frequent word tokens in the Voynich Manuscript were found as shown in Table 5 below. To help analyse the data, an English text was also investigated and its data recorded in Table 5.
From Figure 6 it can be clearly seen that the shapes of the two data sets are similar, and that the gap between them narrows after the 10th most frequent word.
The table on the right shows the statistics of tokens appearing at the start and end of word tokens in the Takahashi transcription. The token z appears the fewest times at both the start and the end. Token o appears most often at the start, with a frequency of 8530, and token y appears most often at the end, with a frequency of 15409.
Note that the recorded character tokens are currently limited to the basic EVA characters (see Appendix section A.4), but this is being expanded to include the extended EVA characters. The data and statistics from Phase 1 are elementary and will be used in the remaining phases.
Phase 2
The results found within Phase 2 show that the current implementation of the English extraction software has a high deviation in its error rate when used on different English texts. While it is expected that errors will occur, the error rate found is too high to be acceptable. The results below show the prominent data that was found and how it was used within the English extraction software. However, as specified in section 5.2, a second implementation of the English extraction software is being developed.
Investigating the frequency statistics alone of each text within the English corpus (see Appendix A.9), only a small number of tokens can be characterized exclusively as either alphabet or non-alphabet tokens. The data of some texts also differed for the tokens that could not be exclusively characterized. The statistics were combined to give the average over the corpus. The tables below summarize the boundaries found using the combined case, which were used to give threshold values for the extraction software.
 | Token Type | Character Token | Relative Frequency
---|---|---|---
1 | Non-Alphabet Token | , | 0.018952262
2 | Numerical Token | 2 | 0.005601521
3 | Upper-Case Alphabet Token | A | 0.004233199
4 | Lower-Case Alphabet Token | e | 0.118199205

Table 'X': Highest Relative Frequency
 | Token Type | Character Token | Relative Frequency
---|---|---|---
1 | Non-Alphabet Token | = | 0.000000190
2 | Numerical Token | 0 | 0.001239706
3 | Upper-Case Alphabet Token | Q | 0.000040890
4 | Lower-Case Alphabet Token | z | 0.000503139

Table 'X': Lowest Relative Frequency
First and Last Character Token Data
The frequency of a character token appearing at the beginning or end of a word token was then analysed to determine whether any character tokens can be characterized through this data (see Appendix A.8). It was found that this data alone did not allow for accurate characterization, as specific characters that may only appear at the beginning or end of a word token could be either alphabet or non-alphabet characters. (((ADD EXAMPLE OR REFERENCE TO EXAMPLE IN APPENDIX)))
To further extend the two basic types of data used above, bigrams of the English corpus were generated and analysed. These led to much more prominent results, showing specific characters rarely appearing next to one type of character and more frequently next to another. The most frequent bigrams of each English text within the corpus are shown below in Figures ((("FIGURE REFERENCES"))).
As shown, 'th' and 'he' were the most frequent bigrams regardless of text. Other results found numerical tokens generally only appear next to other numerical tokens with the rare occurrence of non-alphabet tokens. Some non-alphabet tokens, particularly the apostrophe and comma, proved to be more difficult due to their higher frequencies and high probability of appearing beside an alphabet token. Other non-alphabet tokens, such as brackets and quotation marks, also added to the complexity within various texts as they could appear beside both alphabet and non-alphabet tokens, albeit with a much lower probability.
Threshold values were again determined based on the bigram data. These could then be used, along with the basic frequency data obtained earlier, to begin extraction of the English alphabet.
The extraction software was used on different English texts. The initial text, not included in the English corpus, was Robert Louis Stevenson's book Treasure Island. Running the software over the text produced the lists in Table 'X'.
Results | Tokens
---|---
Possible Alphabet Tokens | A B C D E F G H I J K L M N O P Q R S T U V W Y a b c d e f g h i j k l m n o p q r s t u v w y z
Possible Non-Alphabet Tokens | ! " ' ( ) * , - . 0 1 2 3 4 5 6 7 8 9 : ; ? X _ x

Table 'X': Treasure Island Alphabet Extraction Results
Clearly the results from Table 'X' above are exceptional. Only two character tokens, 'X' and 'x', were characterized incorrectly, giving an error rate of 2.703%.
The second text, included within the English corpus, to be used with the English extraction software was Howard Pyle's The Merry Adventures of Robin Hood. This produced the lists in Table 'Y'.
Results | Tokens
---|---
Possible Alphabet Tokens | ' ( - A B C D E F G H I K L M N O P Q R S T U W Y a b c d e f g h i j k l m n o p q r s t u v w y
Possible Non-Alphabet Tokens | ! " ) , . 0 1 2 3 4 5 6 7 8 9 : ; ? J V X Z _ x z

Table 'Y': The Merry Adventures of Robin Hood Alphabet Extraction Results
The results here are much worse than those obtained initially. Nine characters were incorrectly characterized, giving an error rate of 12.16%. This large increase in error shows that the algorithm used within the software does not return reliably accurate results.
It was concluded that the errors propagated from the use of threshold values obtained from a much larger data set than the one being tested. This halted the use of the software on the Voynich Manuscript, as any results obtained would be inconclusive with such a high error rate. It also prompted a second implementation of the English extraction software, which will exclude the use of threshold values and focus more on the general data obtained from the bigrams. The second implementation is currently in development.
These results do, however, allude to differences in writing style. Despite both texts shown here being fiction novels, the frequencies of character tokens and character bigrams differed enough to produce different results. Unfortunately this is not definitive, but it does show how differences in token frequency can be used within linguistic stylometry, which could be applied in a later phase to investigate the authorship of the Voynich Manuscript.
Phase 3
Phase 3 has briefly been investigated but the results, from the current implementation of the Morphology software, only show the number of occurrences of a unique word token within another unique word token. The highest occurring word tokens are shown in Figure 'X' below.
This does not take into account any single character tokens that may also occur as a word token, nor the placement of the word token within another word token. As such, further data and analysis is required before any conclusion can be made.
Phase 4
Phase 4 examines the possible relations between text, words, and illustrations. A particular folio, f102r2, was first selected as a starting point for the investigation. The reason for choosing folio f102r2 is that there is a 'frog' in the top right corner, as shown in Figure 6. As the language in the Voynich Manuscript is unknown, it is helpful to begin the investigation from something recognisable. The image of the folio is in Appendix section A.9. As shown in Table 6, the following statistics were found in folio f102r2:
There are 21 word tokens that appear only in folio f102r2 and appear only once; they are unique word tokens that do not occur in any other folio of the Voynich Manuscript. These word tokens are:
airam, chefoly, dethody, desey, eeey, kockhas, kolor, lsais, olockhy, opolkod, qkeeey, qkeeod, qoochey, qyoeey, skekyd, soeees, soefchocphy, sossy, ydeeal, ykeockhey.
There are also 21 word tokens that appear once in folio f102r2 and appear fewer than 10 times in other folios. Table 7 below shows the statistics of these word tokens.
The unique words in a folio may be descriptions of unique illustrations in that folio. In folio f102r2 there are 21 unique word tokens and 12 illustrations, but nothing more can be concluded without further information.
Two word tokens that appear once in folio f102r2 and also appear once in other folios were first selected for investigation: amg and cheodor. They occur in folios f34r and f87r respectively. The images of the three folios are in Appendix A.9. Word token amg occurs in folios f102r2 and f34r, and word token cheodor occurs in f102r2 and f87r. All three folios contain text with drawings. The drawing in f34r is shown in Figure 7 on the right and the drawing in f87r is in Figure 8. They look like plants with leaves, roots, and corollas. There are also plants in f102r2; however, the plants in f102r2 do not have corollas.
The similarities between f102r2 and f34r are:
- Both contain the word token amg
- Both contain plant illustrations
- The plants in both folios have roots
Figure 9 on the right shows the similar root parts of the two folios' illustrations: the upper one is from folio f102r2 and the lower one from f34r. From this comparison, the word amg in the Voynich Manuscript may perhaps be used to describe plant roots as shown in Figure 9. Again, this is only a crude conclusion and more information is needed for the investigation.
Project Management
Project Status and Timeline
As shown in section 4, the project has currently been split into 5 distinct phases, with the expectation that the 5th phase will be split into smaller phases. Each of these is to be worked on and completed within a given time to keep the project on schedule.
Due to issues found within Phase 2, as detailed in section 6.2, the project is behind the initial schedule. Developed software did not function to an acceptable standard. The schedule has since been revisited and modified, expanding the time allocated to Phase 2. The current project schedule has been graphically organized and displayed on a Gantt chart viewable in the Appendix section A.1.
Deliverables
The deliverables of the project are summarized below in Table 2, detailing each deliverable and its respective deadline. The deliverable work schedule can also be viewed within the Gantt chart of Appendix section A.1.
Deliverable | Deadline
---|---
Proposal Seminar | 31st of March, 2015
Research Proposal Draft | 17th of April, 2015
Research Proposal and Progress Report | 5th of June, 2015
Final Seminar | Week 10, Semester 2
Thesis | Week 11, Semester 2
Expo Poster | Week 11, Semester 2
Expo Presentation | Week 12, Semester 2
YouTube Video | Week 12, Semester 2
USB Flash Drive of all Code and Work | Week 12, Semester 2

Table 2: Deliverables
Task Allocation
Tasks have been allocated to the project team members through the phases in section 4. Collaboration between members will occur during phases 1, 2, and 5. However, it is expected that there will be a considerable amount of collaboration throughout all phases. The current allocations are summarized in table 3 below.
Task | Phase | Allocation
---|---|---
Pre-Processing of Interlinear Archive | 1 | Andrew McInnes
Writing and Testing Voynich First-Order Statistics Code | 1 | Andrew McInnes
Writing and Testing Character Token Code | 1 | Lifei Wang
Expanding First-Order Statistics Code | 2 | Andrew McInnes
Expanding Character Token Code | 2 | Lifei Wang
Writing and Testing English Alphabet Extraction Code | 2 | Andrew McInnes
Writing and Testing Illustration Code | 3 | Andrew McInnes
Writing and Testing Stylometry Code | 4 | Lifei Wang
Discussing and Determining Follow-Up Investigations | 5 | Andrew McInnes, Lifei Wang

Table 3: Task Allocation
Management Strategy
The project team will be managed through a minimum of one weekly internal meeting between members, outside of scheduled workshop and project time, and a minimum of one fortnightly meeting with supervisors. Each meeting will involve:
- Current phase progress
- Issue(s) encountered
- Display of any relevant result(s) or research finding(s)
Feedback can then be gathered through both team members and supervisors.
All working copies of code and documents shall also be kept on a group Google Drive. These will be updated as necessary and are available to all team members.
Budget
The project team has been assigned a budget of $500. However the project is heavily computer-based where all currently required programs are freely available on the University systems. Therefore it is expected that none of the budget will need to be used.
It is possible that works may be found that are unavailable within the University. Should it be found that these works would further the research then the budget may be used on these works. This will be discussed with supervisors.
Risk Analysis
Multiple risks have been identified by the project team and ranked according to the likelihood of occurrence and the consequence of an occurrence. The likelihood and consequence were each given a numerical ranking, denoted by the brackets '[ ]'. The main risks are summarised in Table 4 below.
No. | Risk | Likelihood | Consequence | Risk Level
---|---|---|---|---
1 | Underestimation and/or mismanagement of time and resources | High [8] | High [7] | 56
2 | Health related issues from long periods at computers | High [7] | Moderate [6] | 42
3 | Software not functioning to an acceptable standard | Moderate [5] | Moderate [6] | 30
4 | Team member illness or injury | High [7] | Moderate [4] | 28
5 | Issues with communication between team and/or supervisors | Low [3] | High [7] | 21
6 | Loss of software code | Low [2] | Very High [10] | 20

Table 4: Risk Analysis (Summary)
The risk level was calculated by multiplying the likelihood rank and the consequence rank. This risk level corresponds to the overall risk that is posed to the project.
Short descriptions along with mitigation and continuity plans for each risk are detailed below.
Underestimation and/or mismanagement of time and resources
As the team members undertaking the project have no experience with such a large, software focused project, the likelihood of underestimation or mismanagement of time and/or resources is high.
Mitigation of this risk shall be through continual meeting within the project team and with supervisors. A minimum of a weekly internal meeting within the team and a fortnightly progress meeting with supervisors shall occur. Phase 5 of the project has also been left deliberately long for this purpose.
Should the risk still occur, the project schedule shall be discussed and reworked to allow for successful completion within the project time frame.
Health related issues from long periods at computers
Due to the project being mostly computer-based, team members will be in front of computers for large quantities of time.
To mitigate any possible issues due to long periods in front of computers, team members will take periodic breaks from the computer during their work.
Software not functioning to an acceptable standard
The project itself is heavily investigative and software that is written relies on analysing data obtained from this investigation. It is possible that any software written may not function to an acceptable standard, such as having a high error rate.
Mitigation shall be handled through extensive testing. Software found to be functioning below an acceptable standard may need to be re-written. Note that despite software essentially failing, this is also a valid result as it may show that the procedure used cannot reliably conclude on relevant linguistic properties or relationships.
Team member illness or injury
The project shall be occurring over two semesters. There is a high likelihood that one, if not both, team members may fall ill or become injured within that time.
Should any health related issues arise, the member shall inform the rest of the project team and supervisors. Depending on the illness or injury, the other member may take over work from the ill or injured member. As the majority of work is computer-based it is expected that team members will be able to work to some extent during most illnesses or injuries.
Issues with communication between team and/or supervisors
Team members and supervisors are very busy throughout the project lifetime. As the main form of communication shall be through emails it is possible, although not likely, that miscommunication of information may occur.
Communication issues shall be mitigated through thoughtful, concise messages. Emails should not contain any ambiguities where possible and any questions that may be raised should be asked immediately.
Loss of software code
As explained multiple times, the majority of the project is software based. It is possible that through some errors or malicious intent that software code(s) may be lost. While unlikely, the consequences of this occurring are severe.
All code will therefore be kept in multiple backups. This includes the use of external flash drives, the University system, and Google Drive.
Preliminary Conclusions
The results found thus far cannot provide any significant, comprehensive conclusions. However, they do lead to some small, basic hypotheses about the Voynich Manuscript.
As the number of unique character tokens used within the Voynich Manuscript is small compared to that found within a modern language such as English, it can be hypothesized that the alphabet itself lacks a distinction between letter cases and possibly lacks punctuation. This was also the case in languages such as classical Greek and classical Latin. It may also suggest that numerals were represented by the alphabetical tokens.
Again, with the results and amount of data currently collected, the hypotheses above cannot be confirmed. Further research and investigation is required.
Citations and References
[1] D. Stolte, “Experts determine age of book 'nobody can read',” 10 February 2011. [Online]. Available: http://phys.org/news/2011-02-experts-age.html. [Accessed 12 March 2015].
[2] S. Reddy and K. Knight, “What We Know About The Voynich Manuscript,” LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 78-86, 2011.
[3] G. Landini, “Evidence Of Linguistic Structure In The Voynich Manuscript Using Spectral Analysis,” Cryptologia, pp. 275-295, 2001.
[4] A. Schinner, “The Voynich Manuscript: Evidence of the Hoax Hypothesis,” Cryptologia, pp. 95-107, 2007.
[5] D. R. Amancio, E. G. Altmann, D. Rybski, O. N. Oliveira Jr. and L. d. F. Costa, “Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript,” PLoS ONE 8(7), vol. 8, no. 7, pp. 1-10, 2013.
[6] S. Chakrabarti, M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G. Piatetsky-Shapiro and W. Wang, “Data Mining Curriculum: A Proposal (Version 1.0),” 12 April 2015. [Online]. Available: http://www.kdd.org/curriculum/index.html.
[7] R. Zandbergen, “Description of the Manuscript,” 25 March 2015. [Online]. Available: http://voynich.nu/descr.html.
[8] J. Stolfi, “Voynich Manuscript stuff,” 23 May 2005. [Online]. Available: http://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/Welcome.html.
[9] J. Stolfi, “A prefix-midfix-suffix decomposition of Voynichese words,” 10 12 1997. [Online]. Available: http://www.ic.unicamp.br/~stolfi/voynich/97-11-12-pms/.
[10] J. Stolfi, “A Grammar for Voynichese Words,” 14 June 2000. [Online]. Available: http://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/00-06-07-word-grammar/.
[11] B. Shi and P. Roush, “Semester B Final Report 2014 - Cracking the Voynich code,” University of Adelaide, Adelaide, 2014.
[12] J. Goldsmith, “Unsupervised Learning of the Morphology of a Natural Language,” Computational Linguistics, pp. 153-198, 2001.
[13] P. Currier, “New Research on the Voynich Manuscript: Proceedings of a Seminar,” 30 November 1976. [Online]. Available: http://www.voynich.nu/extra/curr_main.html.
[14] M. A. Montemurro and D. H. Zanette, “Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis,” PLoS ONE, vol. 8, no. 6, pp. 1-9, 2013.
[15] “The Voynich Manuscript,” 22 March 2015. [Online]. Available: https://archive.org/details/TheVoynichManuscript.
[16] R. Zandbergen, “Analysis of the text,” 13 April 2015. [Online]. Available: http://www.voynich.nu/analysis.html.
[17] M. Aronoff, and K. Fudeman, “What is morphology,” Vol. 8., John Wiley & Sons, pp. 1-25, 2011.
[18] G. Durrett and J. DeNero, “Supervised Learning of Complete Morphological Paradigms,” HLT-NAACL, pp. 1185-1195. 2013.
[19] H. Trost, “Computational Morphology,” [Online]. Available: http://ccl.pku.edu.cn/doubtfire/nlp/Lexical_Analysis/Word_Lemmatization/Introduction/Computational%20Morphology.htm. [Accessed 23 May 2015].
A. Appendix
A.1 Project Gantt Chart
A.2 The Voynich Manuscript
The following images are of the Voynich Manuscript. These images have been reproduced from the Internet Archive [15]. Note that ‘v’ denotes verso, and ‘r’ denotes recto.
The herbal section, folios 1r – 66v.
The astronomical section, folios 67r – 73v.
The biological section, folios 75r - 84v.
The cosmological section, folios 85r – 86v
The pharmaceutical section, folios 87r – 102v.
The recipes section, folios 103r – 116v.
A.3 The Interlinear Archive Example
The following images are an example of the text found within the Voynich Manuscript and its corresponding translation into the machine-readable EVA. These images have been reproduced from René Zandbergen’s website [16].
A.4 The European Voynich Alphabet
The EVA as shown on René Zandbergen’s website [16].
A.5 Interlinear Archive Format and Pre-Processing Example
The Interlinear Archive is formatted in a specific way to allow for the data to be parsed into varying software. The full descriptions of the formatting can be found within the 'F0' pages of the interlinear archive. The main formatting points are summarized in Table 'Z' below:
Format Character | Definition
---|---
. | Definite word break
, | Dubious word break
- | Line break within a paragraph
= | Paragraph break
! | Filler character - denotes a break that was either skipped or lost (used for aligning different transcriptions)
% | Filler character - long stretch (used for aligning different transcriptions)
# | Line comment

Table 'Z': Basic Interlinear Archive Formatting Characters
Other formatting within the Interlinear Archive included:
- <f17v.P.1;H> - Identifier/Line Locator Tag <'Page Number'.'Unit within Page'.'Line Number';'Transcriber Code'>
- {comment} - Inline comment, usually to identify a specific type of illustration within the manuscript
- {&123} - Extended EVA character
The following gives an example of a page from the Interlinear Archive and the pre-processing that is completed during the initial stages of phase 1.
Unprocessed Interlinear Archive Example
<f17v.P.1;H> pchodol.chor.fchy.opydaiin.odaldy-{plant}
<f17v.P.1;C> pchodol.chor.pchy.opydaiin.odaldy-{plant}
<f17v.P.1;F> pchodol.chor.fchy.opydaiin.odaldy-{plant}
<f17v.P.2;H> ycheey.keeor.ctho!dal.okol.odaiin.okal-{plant}
<f17v.P.2;C> ycheey.kshor.ctho!dal.okol.odaiin.okal-{plant}
<f17v.P.2;F> ycheey.keeor.ctho.dal.okol.odaiin.okal-{plant}
<f17v.P.3;H> oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
<f17v.P.3;C> oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
<f17v.P.3;F> oldaim.odaiin.okal.oldaiin.chockhol.olol-{plant}
<f17v.P.4;H> kchor.fchol.cphol.olcheol.okeeey-{plant}
<f17v.P.4;C> kchor.fchol.cphol.olcheol.okee!y-{plant}
<f17v.P.4;F> kchor.fchol.cphol.olcheol.okeeey-{plant}
Processed File for H
pchodol chor fchy opydaiin odaldy
ycheey keeor cthodal okol odaiin okal
oldaim odaiin okal oldaiin chockhol olol
kchor fchol cphol olcheol okeeey
A.6 Transcriber Codes
The following is a list of the transcriber codes and their respective transcribers.
Transcriber Code | Transcriber
---|---
C | Currier
F | Friedman (First Study Group)
T | John Tiltman
L | Don Latham
R | Mike Roe
K | Karl Kluge
J | Jim Reed
D | Currier Alternative
G | Friedman Alternative
I | Jim Reed Alternative
Q | Karl Kluge Alternative
M | Don Latham Alternative
H | Takeshi Takahashi
N | Gabriel Landini
U | Jorge Stolfi
V | John Grove
P | Father Th. Petersen
X | Dennis V. Mardle
Z | Rene Zandbergen
A.7 English Character Token Frequency Data
A.8 English Character Token Appear At Start and End Data
A.9 English Text Corpus
Table 'X' below shows the texts used within the English corpus.
Text Name | Author | Obtained From
---|---|---
The Merry Adventures of Robin Hood | Howard Pyle | https://www.gutenberg.org/
Macbeth | William Shakespeare | https://www.gutenberg.org/
The New Testament | | https://www.gutenberg.org/
An Account of the Foxglove and its Medical Uses | William Withering | https://www.gutenberg.org/
The Story of the Heavens | Robert S. Ball | https://www.gutenberg.org/

Table 'X': English Corpus
See also
- Voynich Project, MAIN Page
- Voynich 2015 (Andrew and Lifei) Weekly Progress
- Elec Eng 2014 Project Wiki
- Voynich Manuscript Project 2015 Google Drive