Final Report/Thesis 2016
=== 2-grams Simhash Test ===
Although the previous test produced some useful conclusions, its credibility was still too low to rely on. A second test was therefore designed and implemented, using 2-grams and a different algorithm (the Simhash algorithm). It is expected to support the conclusion drawn from the previous Levenshtein Distance Test while offering better credibility.

==== Data Processing ====
Unlike in the previous test, the texts for this test were rearranged into a 2-grams format built from the initialisms format used there. For example:

Original string: “The visions dancing in my mind the early dawn the shades of time …”

Initialisms format: “TVDIMMTEDTSOT ...”

2-grams format: “TV VD DI IM MM MT TE ED DT TS SO OT T…”

==== Preparation Test ====
As before, a preparation test was designed to check the performance of the algorithm. Unlike with the Levenshtein algorithm, grouping the tests by unit length is meaningless here because of the nature of the Simhash algorithm; for details, please refer to section 3.2.2.

The preparation test was based on the UDHR. The first test compared 50 units extracted from the English version of the UDHR with the whole text of the UDHR (also in 2-grams format), in English and in the other languages. Each unit was first turned into its corresponding Simhash string, and then the Hamming Distance between the two Simhash strings was calculated. The second test used the same method and layout to compare 50 units with the whole UDHR text in the same language. In short, test 1 is a cross-language test while test 2 is a same-language test.
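The processing chain described above (initialisms, then 2-grams, then a Simhash fingerprint, then a Hamming Distance) can be sketched as follows. This is a minimal illustration, not the implementation used in the tests: the function names, the 64-bit fingerprint width, the use of a truncated MD5 digest as the per-token hash, and the integer (rather than string) fingerprint are all assumptions. The algorithm actually used is described in section 3.2.2.

```python
import hashlib


def initialisms(text):
    """Keep the first letter of each word, upper-cased ('The visions' -> 'TV')."""
    return "".join(w[0].upper() for w in text.split() if w[0].isalpha())


def two_grams(letters):
    """Split a letter string into overlapping 2-grams ('TVD' -> ['TV', 'VD'])."""
    return [letters[i:i + 2] for i in range(len(letters) - 1)]


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint of the tokens as an integer.

    Assumption: each token is hashed with MD5 and truncated to `bits` bits;
    the hash function used in the original tests is not stated here.
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


sample = "The visions dancing in my mind the early dawn the shades of time"
unit = two_grams(initialisms(sample))
print(" ".join(unit))
print(hamming_distance(simhash(unit), simhash(unit)))
```

Comparing a unit with a whole text then amounts to computing `hamming_distance(simhash(unit_grams), simhash(text_grams))`; a smaller distance indicates more similar 2-gram content.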
Results are presented in the two box plots below. (The x-axis categories are not in a consistent order; please compare boxes by the column name under each box.)

[[File:Group3.jpg|thumb|600px|center|Simhash Preparation Test Result]]

Within the cross-language group (Figure 20), the “English versus English” data set (the more transparent box at the bottom right of the first figure) is considerably lower than the other cross-language data sets (the colored boxes in the first figure).

Across the two figures (Figures 20 and 21), the cross-language group has generally higher results than the same-language group. (The color schemes of the two figures differ; please refer to the column names when comparing.) The median of every box in the same-language group is clearly lower than that of the corresponding box in the cross-language group. Every third quartile (Q3) in the same-language group is lower than the first quartile (Q1) of the corresponding language in the cross-language group.

In addition, the data sets in the figures above are less biased than those in the Levenshtein Test, and the distribution of data in each box plot is quite compact. The performance of the Simhash algorithm is also not sensitive to which language is being tested. These facts give extra credibility to the Simhash algorithm. Based on these observations, it is reasonable to conclude that the Simhash algorithm distinguishes well between different languages.

==== Main Test ====
Here, the two versions of the mysterious code are tested against the UDHR in different languages. Each version of the code was compared with every language version of the UDHR. Results are presented in the first bar chart, where pink represents the first version of the code and blue the second. (Each bar consists of two sub-bars that do not overlap; the length of each sub-bar represents one result.)
As there is little difference between the pink and blue bars, the average value in each language group is taken. For comparison, these average values are combined with the same-language test results and presented in the second figure below. (The bars of the left figure become horizontal lines in the right figure, each representing the mean Hamming Distance between the code and a specific language.)

[[File:Group4.jpg|thumb|600px|center|Simhash Test Result]]

Because comparing the two versions of the code with the UDHR texts alone may not be sufficient to draw a convincing conclusion, the two versions were compared again, using the same Simhash method, with a 12,000-letter extract of War and Peace in each language. As the two versions have extremely similar results against any given language, their results are added together within each language group. The result of this test is shown below:

[[File:Last.jpg|thumb|600px|center|Simhash Test Result]]

Figure 22 clearly shows that the Hamming Distance between English and the code is the smallest; Figure 24 shows the same characteristic. In Figure 23, two horizontal lines lie inside their corresponding boxes: English and Portuguese. Compared with the Portuguese line, however, the red line representing the code versus English result lies relatively closer to the median of its box (the English versus English result set).

==== Conclusion of the Simhash Test ====
Based on these facts, the conclusion is that the mysterious code is more likely to be English than any of the other eight languages tested.
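The reasoning applied to Figure 23 (does the code's mean Hamming Distance fall inside the same-language box, and how far is it from that box's median?) can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the quartile method (`inclusive`) may differ from the one used to draw the box plots, and no figures from the actual tests are reproduced.

```python
import statistics


def box_stats(values):
    """Return (Q1, median, Q3) of a result set.

    Assumption: quartiles use the 'inclusive' method; the plotting tool
    behind the original box plots may compute them differently.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


def compare_to_box(code_distance, same_language_distances):
    """Locate the code's mean distance relative to a same-language box."""
    q1, median, q3 = box_stats(same_language_distances)
    return {
        "inside_box": q1 <= code_distance <= q3,
        "offset_from_median": abs(code_distance - median),
    }


def most_likely_language(comparisons):
    """Prefer languages whose box contains the code's line, then pick the
    one whose line is closest to the median of its box."""
    inside = {lang: c for lang, c in comparisons.items() if c["inside_box"]}
    candidates = inside or comparisons
    return min(candidates, key=lambda lang: candidates[lang]["offset_from_median"])
```

Here `comparisons` would map each language name to the output of `compare_to_box` for that language. When no box contains its line, the sketch falls back to the smallest offset over all languages, which is a design choice of this example rather than a rule stated in the report.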