Editing
Final Report 2011
(section)
Jump to navigation
Jump to search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==Pattern Matcher== ===Concept=== The pattern matching software was the first module of the internet search application defined in the 2011 project objectives. The purpose of the pattern matching component was to analyse the contents of web pages for given input patterns. It was important that the pattern matching software interfaced with the web crawler module, accepting web page content and returning results to the centralised user interface, for later system integration. ===Previous Work=== The project group in 2010 put extensive work into creating pattern matching software able to analyse web pages. The three main pattern matching functions of the software are: # FindExact – matches exact text patterns # FindPattern – match patterns, such as @#@# matching ERER # FindInitialism – find exact matches considering only initial letters<ref name=FinalReport2010>''Final Report 2010'', Ramirez, Kevin and Lewis-Vassallo, Michael, 2010, https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref>. While the 2010 implementation does function, this year’s project aimed to completely redesign and expand some of these functions. Flexibility and modularity were considered to be lacking in the 2010 implementation which could use only pre-defined patterns. In 2011 the software would enable patterns to be dynmaically specified which would enhance usability. ===Technical Challenges=== Key technical challenges identified for the pattern matching module included: # Handling special cases #* Upper and lower alphabet cases and recognising when a pattern existed that used a combination. #* Recognising patterns separated by punctuation and whitespace (spanning multiple words) #* Acknowledging patterns spanning multiple lines # The Wildcard character. # Providing interactivity to the pattern detector Pattern matching programs should be flexible in that they can detect a pattern no matter what special case applies. In meeting these requirements, the algorithms should take into account that text could be in different capitalisations, separated by punctuation, split across lines or separated by other types of whitespace. The implementation of a wildcard character into pattern matching algorithms provided another technical challenge. The need for wildcard characters was driven by the existence of ambiguities in letters in the Somerton Man code (see [[Final Report 2011#Cipher Analysis| Cipher Analysis Technical Challenges]]). Finally, the desire to design an interactive pattern detecting algorithm that was not limited to pre-set cases as in the 2010 version created intricate technical challenges. The software in 2010 was limited by only detecting patterns specified by symbols ‘@’ and ‘#’ and only of lengths 3 and 4; this was done through an exhaustive list. The challenge of 2011 was to allow for any characters to be used for the pattern specification with any pattern size. ===Design=== The overall design of the pattern matching module is shown in the diagram below. <center>[[Image:Pattern Matcher Process.png | Pattern Matcher process]]</center> <center>'''Figure 10 - Pattern matching software operational process'''</center> The bolded process represents the application of a search algorithm which is the key section of the pattern matcher. The following search algorithms have been designed for 2011: * Exact ** Searches for exact sequence of characters specified (ignoring capitalisation). For example, a search for “ERER” would match “Gatherer”. * Pattern ** Searches for a pattern as specified by characters entered. For example, a search for “HDHD” would match both “Gatherer” and “gbhdhdk”. The reasoning for this pattern algorithm is it is able to detect simple mono-alphabetic substitution ciphers; a likely candidate for the Somerton Code. * Initials ** Searches only initial letters. Can be implemented in conjunction with both Exact and Pattern searches (i.e. ExactInitial and PatternInitial). For example, a search for “HDHD” in PatternInitial would match “Almond bread and butter”. This was implemented because previous studies have suggested the code is likely to be an initialism. * Regular Expression ** Searches for any user defined Regular Expression (or Regex). A [http://en.wikipedia.org/wiki/Regular_expression Regular Expression] is a concise and flexible means for matching string patterns<ref>Regular Expression, Wikipedia Foundation Inc, http://en.wikipedia.org/wiki/Regular_expression</ref>. Regexes are very powerful; alone it can implement all of the above search functions. For example, a search for “([Ee])([Rr])\2\1” would be identical to the Exact search for “ERER”. The disadvantage being the technical knowledge required to use regexes. The Regular Expression searching function takes advantage of inbuilt Java classes. The ability to have unknown or wildcard for characters represented as “*” –asterisk- was also designed in the non-regex search functions. Java was chosen as the programming language to maintain compatibility with the other modules. ===Implementation=== To improve regularity for ease of modification and integration into the web crawler, the search algorithms were implemented to use a Java-defined interface, ''SearchManager'', which specified a search function, to-String function and results publication function. All searching programs use this interface. The implementation of the Exact search function uses a simple stepping process, checking each character to see if it is valid in building the exact pattern (or a wildcard). The search function for Pattern is much more complex, beginning by building the pattern from the input parameter before stepping through and attempting to match the identified pattern. The search function for Initial considers only letters not following other letters to mimic the behaviour of an initial letter. The implementation allows it to be easily integrated with Exact and Pattern. There was a two-fold benefit to the use of Java as the development language. It provided compatibility with the Java based web crawler, as will be discussed in the following section and it enabled the use of the advanced text handling capabilities inherent in the language, which included the inbuilt library for regular expressions which was used directly in that search functionality. ===Testing=== The testing process for the pattern matching module followed a majorly white-box approach. Each search algorithm was subjected to scenario analyses using both web-page content and self-generated test files design to test specific scenarios. Boundary cases such as patterns spread over separate lines were also thoroughly tested with internal workings verified to be functioning as expected. An example of a simple test run on a test file investigating the Exact search algorithm as well as the wildcard function is shown below. <center>[[Image:ExactMatch.PNG| Testing output]]</center> <center>'''Figure 11 - Pattern matching software simple search test'''</center>
Summary:
Please note that all contributions to Derek may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Derek:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Navigation menu
Personal tools
Not logged in
Talk
Contributions
Create account
Log in
Namespaces
Page
Discussion
English
Views
Read
Edit
View history
More
Search
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Tools
What links here
Related changes
Special pages
Page information