Final Report 2011
==System Integration==

===Concept===
The purpose of the system integration was to interface the pattern matcher and web crawler modules, creating a functioning autonomous internet search application. As set out in the Extended Objectives, a goal was to provide access to this application through an intuitive GUI, so the system integration also covered the design and implementation of that GUI.

===Technical Challenges===
System integration introduced the following challenges:
# Multithreaded control of the web crawler
# Generics for the pattern matcher
# Intuitive GUI design
# Internet access through proxies

Controlling the crawler through a GUI meant the GUI's Java thread needed control of the web crawling threads, which posed some multithreading challenges. Using generic objects across the different pattern matching algorithms was a further challenge; as noted in the [[Final Report 2011#Pattern Matcher|Pattern Matcher]] section, this was solved by introducing the ''SearchManager'' Java interface. Other challenges came from designing a GUI that provides all the input areas a user needs to run a web search, and from internet proxy issues encountered when accessing the web from some locations.

===Design===
The design of the integrated system is explained by the diagram below.

<center>[[File:System integration.png|Integrated system process]]</center>
<center>'''Figure 14 - Integrated web crawler and pattern matcher system operational process'''</center>

The GUI for the system was designed primarily by Steven using the [http://netbeans.org/ NetBeans] IDE GUI development toolkit. The user interface had to allow the user to specify: the seed website(s), the pattern and pattern type being searched for, and a result logging location. The resulting design, shown below, uses three separate tabs for simple viewing and operation of the software's functionality.
The first tab is designated the Seed tab, the second the Pattern tab, and the final tab specifies logging locations. A substantial results area (the bottom half of the GUI) is shared between the tabs. As required, the Pattern tab also allows the type of pattern to be selected via the checkboxes shown (Regex, Initial, Exact). The third tab is not shown here.

<center>[[File:Crawler gui.png|GUI developed for web crawler]]</center>
<center>'''Figure 15 - Web crawler GUI seed and pattern display'''</center>

===Implementation===
The integration of the pattern matching module into the web crawler module was achieved using Java generics with the ''SearchManager'' interface defined in the [[Final Report 2011#Pattern Matcher|Pattern Matcher]] section. The OTS crawler4j provides a function to retrieve web page content, and this was used to pass page content to the pattern matcher.

Integrating the system into the GUI was somewhat more involved: as discussed in the Technical Challenges, a multithreading solution was required to enable control of the web crawler through the GUI.

The testing phase of the Web Crawler module revealed additional functionality provided by the OTS crawler, and during implementation some of these features were integrated into the GUI to give the user a greater degree of control. These features were “Limit” on the Seed tab, which restricts crawling to web page URLs starting with the specified prefix; “No. Pages” on the Pattern (Scope) tab, which stops the crawler after the specified number of page downloads; and “Depth” on the Pattern tab, which defines how deep the crawler follows a chain of links before abandoning that line of inquiry. These features can be seen in the Design images above.

===Testing===
Because the pattern matcher and web crawler had already been tested as modules, once the integration itself was validated, testing focussed on the GUI.
The testing process used was primarily grey-box, with websites and search patterns being specifically selected. A restriction of the OTS crawler meant it was not possible to test crawls on locally stored pages, so live websites had to be used. Testing did reveal some issues with controlling the web crawler through the GUI, leading to the multithreading solution discussed above. Cross-platform compatibility was also successfully verified. Testing on University machines highlighted a proxy issue; while not catastrophic, it adds complexity for the user, who must specify proxy settings manually. Overall, the testing process showed that the system met its requirements and worked as expected.
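To illustrate the ''SearchManager'' approach mentioned in the Technical Challenges and Implementation sections, the following is a minimal sketch of how such an interface might look. The method names and the ''ExactSearchManager'' implementation are assumptions for illustration, not the project's actual code; the point is that the crawler can hand page content to any matcher behind one interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the SearchManager interface; names are
// illustrative assumptions, not the project's real signatures.
interface SearchManager {
    /** Returns every match of the configured pattern in the page text. */
    List<String> search(String pageContent);
}

/** One possible implementation: exact (literal substring) matching. */
class ExactSearchManager implements SearchManager {
    private final String pattern;

    ExactSearchManager(String pattern) {
        this.pattern = pattern;
    }

    @Override
    public List<String> search(String pageContent) {
        List<String> matches = new ArrayList<>();
        int from = 0;
        while ((from = pageContent.indexOf(pattern, from)) >= 0) {
            matches.add(pattern);
            from += pattern.length();  // continue after this occurrence
        }
        return matches;
    }
}

public class SearchManagerDemo {
    public static void main(String[] args) {
        SearchManager manager = new ExactSearchManager("crawler");
        List<String> hits = manager.search("crawler4j is a web crawler");
        System.out.println(hits.size());  // prints 2
    }
}
```

A Regex or Initial matcher would simply be another class implementing the same interface, which is what makes the crawler-side code generic.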
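The multithreading solution described in the Implementation section — running the crawl on a background thread so the GUI thread can start and stop it — can be sketched as below. This is a generic worker-thread pattern under assumed names (''CrawlWorker'', a page-limit parameter standing in for “No. Pages”), not the report's actual implementation, and the loop body is a stand-in for fetching and pattern-matching one page.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlWorker {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final AtomicInteger pagesCrawled = new AtomicInteger(0);
    private Thread worker;

    /** Called from the GUI thread when the user starts a crawl. */
    public void start(int pageLimit) {
        worker = new Thread(() -> {
            while (!stopRequested.get() && pagesCrawled.get() < pageLimit) {
                // stand-in for fetching one page and running the matcher
                pagesCrawled.incrementAndGet();
            }
        });
        worker.start();
    }

    /** Called from the GUI thread when the user presses stop. */
    public void stop() throws InterruptedException {
        stopRequested.set(true);
        worker.join();  // wait for the crawl thread to wind down
    }

    public int pagesCrawled() {
        return pagesCrawled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlWorker w = new CrawlWorker();
        w.start(1000);  // "No. Pages"-style limit
        w.stop();       // GUI thread remains free to request this at any time
        System.out.println(w.pagesCrawled() <= 1000);
    }
}
```

The key point is that the GUI thread never runs the crawl itself; it only flips a shared flag and joins, which keeps the interface responsive while the crawl is in progress.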