
=== Levenshtein Distance Test ===
This section compares two initialism strings directly by calculating the Levenshtein Distance between them. The materials compared, the string length and the number of comparisons vary with the purpose of each test.

==== Data Processing ====
All materials for this test were converted into initialism format (the first letter of each word), with all punctuation and spaces removed. For example:
Original Text: 		“Snow is falling, all around me, children playing, having fun…”
Initialism Format:     “SIFAAMCPHF…”
The initialisms were then divided into segments of a given length. Each segment is referred to as a ‘unit’ in the following sections.
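The data processing step can be sketched in Python as follows. This is a minimal illustration rather than the script used for the tests: the function names <code>to_initialisms</code> and <code>split_into_units</code> are hypothetical, and two choices are assumptions not stated above, namely treating each whitespace-separated token as one word and dropping a final segment that is shorter than the unit length.

```python
def to_initialisms(text: str) -> str:
    """Return the upper-cased first letter of every whitespace-separated word."""
    initials = []
    for token in text.split():
        # Skip leading punctuation such as quotation marks, keep the first letter.
        for ch in token:
            if ch.isalpha():
                initials.append(ch.upper())
                break
    return "".join(initials)


def split_into_units(initialisms: str, unit_length: int) -> list:
    """Cut the string into non-overlapping units of exactly unit_length letters.

    Assumption: a trailing segment shorter than unit_length is discarded.
    """
    usable = len(initialisms) - len(initialisms) % unit_length
    return [initialisms[i:i + unit_length] for i in range(0, usable, unit_length)]


sample = "Snow is falling, all around me, children playing, having fun"
print(to_initialisms(sample))
print(split_into_units(to_initialisms(sample), 5))
```

Because the units are cut at consecutive offsets, no two units from the same text overlap.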

==== Preparation Test ====
Before running the main tests, it is necessary to check that the test method is effective. The purpose of this section is therefore to show that the Levenshtein Distance algorithm is able to reveal the differences between languages.
Units of a fixed length were extracted from the English version and eight other language versions of War and Peace and compared. To make the test comprehensive, there are six groups of tests with unit lengths ranging from 50 letters/unit to 800 letters/unit, and each length group contains 100 comparisons (10 units of one language compared one by one with 10 units of another language, using a double ‘for’ loop). Each unit was selected from the texts without any overlap. The results of these tests are normalized to a unit length of 50.
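The comparison procedure described above can be sketched as follows. The <code>levenshtein</code> function is the standard dynamic-programming edit distance (insertions, deletions and substitutions each cost 1). The function name <code>compare_units</code> and, in particular, the normalization are assumptions: the report only states that results are normalized to a unit length of 50, so a simple linear rescaling is used here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def compare_units(units_a, units_b, reference_length=50):
    """Compare every unit in units_a with every unit in units_b (double for loop).

    Assumption: distances are rescaled linearly to reference_length.
    """
    results = []
    for a in units_a:
        for b in units_b:
            scale = reference_length / max(len(a), len(b))
            results.append(levenshtein(a, b) * scale)
    return results
```

With 10 units on each side, <code>compare_units</code> returns the 100 values that make up one box in the plots.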

The results of the tests are presented as box plots in the figure below (language names are abbreviated: En stands for English, It for Italian, Fr for French, Ge for German, Pq for Portuguese, La for Latin, Sp for Spanish, Tr for Turkish and Po for Polish). Each box represents the set of Levenshtein Distances obtained between two texts. For example, the leftmost box represents the results of 10 English units versus another 10 English units; the same rule applies to the other boxes.

[[File:Group1.jpg|thumb|600px|center|Levenshtein Test Result]]

By analyzing the five box plots in Figure Group 1 above, the following conclusions can be drawn:
1. In all five plots, the leftmost box is significantly lower than the others, which means that the Levenshtein Distances between English strings are much lower than those between English strings and strings in other languages. Hence, the Levenshtein Distance is able to reveal the difference between languages.

2. As the unit length grows, the boxes become lower and more compact (the median and the standard deviation shrink as the unit length goes up). Some of them even converge to a specific value. In addition, the difference between the last two plots (unit length = 400 and unit length = 800) is trivial. This means that the Levenshtein Distances tend to stabilize as the unit length increases.

3. As the unit size increases, the trends of the nine language groups are highly similar: the median Levenshtein Distance decreases. This implies that the behavior of the Levenshtein Distance algorithm is independent of the language.
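The median and standard deviation referred to in these conclusions can be computed per unit length with Python's standard <code>statistics</code> module. The sketch below is only illustrative: the function name <code>summarize</code> is hypothetical, and the numbers in the example are arbitrary placeholders, not results from the tests.

```python
from statistics import median, stdev


def summarize(distances_by_length):
    """Map each unit length to (median, sample standard deviation) of its distances."""
    return {
        length: (median(values), stdev(values))
        for length, values in sorted(distances_by_length.items())
    }


# Arbitrary placeholder values, used only to show how the function is called.
example = {50: [40.0, 42.0, 44.0], 100: [39.0, 40.0, 41.0]}
for length, (med, sd) in summarize(example).items():
    print(length, med, sd)
```

A falling median and standard deviation across increasing unit lengths would correspond to the lower, more compact boxes described in the second conclusion.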

==== Main Test ====

The previous section showed that the Levenshtein Distance can reveal the differences between languages. Building on this, the test in this section looks for differences between the mysterious code and texts in various languages.
In the experimental group (line plots in the figures), two versions of the mysterious code were compared with 100 units of different languages extracted from War and Peace. As the code is only 50 letters long, there is no need to carry out this test at different unit lengths.
In the comparison group (box plots in the figures), units extracted from the Universal Declaration of Human Rights (UDHR) were compared with the same materials used in the experimental group. Like the preparation test, this test was divided into 6 length groups. As before, 100 comparisons were made using the 10 vs 10 double ‘for’ loop structure.
Latin was not tested because a Latin version of War and Peace was not found, and the lack of material for the control group test would reduce confidence in the result.
The results are presented in the following figures. The horizontal lines indicate the mean of the experimental group’s results and the boxes indicate the results of the comparison group. For example, in the first plot, the six boxes represent the results of UDHR units in English versus War and Peace units in English in the six length groups, and the two lines indicate the results of the two versions of the code versus War and Peace units in English. The “Codes vs English” results (red and green lines) appear in every plot as a reference.
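Reading a mean line against a box can also be expressed numerically. The sketch below takes two lists of already computed distances and reports whether the experimental mean lies above, below or inside the interquartile range of the comparison group. The function name <code>line_versus_box</code> is hypothetical, and the quartile convention (<code>method="inclusive"</code>) is an assumption, since the convention used to draw the box plots is not stated.

```python
from statistics import mean, quantiles


def line_versus_box(experimental, comparison):
    """Locate the experimental mean relative to the comparison box (Q1 to Q3)."""
    line = mean(experimental)
    # Assumed quartile convention; other conventions shift Q1 and Q3 slightly.
    q1, _, q3 = quantiles(comparison, n=4, method="inclusive")
    if line > q3:
        position = "above"
    elif line < q1:
        position = "below"
    else:
        position = "inside"
    return line, (q1, q3), position
```

Here <code>experimental</code> would hold the distances of one version of the code against the 100 units of one language, and <code>comparison</code> the 100 distances of one UDHR length group against the same language.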
[[File:Group2.jpg|thumb|600px|center|Levenshtein Main Test Result]]
The following observations can be made from Figure Group 2 above:

1. Within each plot: both the median and the standard deviation fall as the unit length increases. This agrees reasonably well with the trend seen in the preparation test (the 2nd conclusion of section 4.1.2).
2. From the second plot to the last plot: there are distinct gaps between the Levenshtein Distances of the codes versus English text (red and green lines) and the Levenshtein Distances of the codes versus other languages (blue and purple lines).
3. For all languages tested except English and Turkish, the results of the experimental group are consistently higher than the results of the comparison group. In addition, the experimental group’s result for Turkish is significantly higher than the one for English.
4. In the English and Turkish plots: both groups of lines lie around the median of the results for the 50-letter length group.
Conclusion:

From the observations above, the following corresponding inferences can be drawn:
1. The similarity of the two trends shows that the Levenshtein Distance algorithm is not sensitive to the choice of testing material. On the other hand, the relatively higher standard deviation in the 50-letter group implies greater randomness in tests with short units. Note that the results of the experimental group all come from 50-letter units, which reduces the credibility of the whole test.
2. Within the experimental group, the mysterious code has a relatively smaller Levenshtein Distance to English text than to text in other languages.
3. The differences between the results of the experimental and comparison groups in six languages (German, Italian, French, Portuguese, Spanish and Polish) imply that the code is unlikely to be in one of these languages.
4. In the English and Turkish results, the overlap between the experimental and comparison groups implies that the Levenshtein Distances for ‘Code vs War and Peace’ are consistent with those for ‘UDHR vs War and Peace’.

==== Conclusion of the Levenshtein Distance Test ====
According to the previous analysis, the code is more likely to be an initialism of English or Turkish text than of text in the other languages tested. In addition, given the higher Levenshtein Distance in the “code vs Turkish” test, an English initialism is more likely than a Turkish one.
On the other hand, the credibility of the test is weakened by the randomness present in the tests with short units. Hence the conclusion that the code is more likely to be English should be treated with caution.