Progress Report 2010
===Web Crawler Plans===
With the web crawler design based on the Arachnid crawler, there are currently two major issues to be resolved:
*Character encoding differences with HTML URLs
*Traversal method: breadth first vs depth first

The image below shows an excerpt from the command prompt illustrating the current character encoding issue. The text under the numeral “1” is from a successful URL discovery, where the blue underline shows the URL ‘type’ and the red underline shows the ‘title’ found for the URL. In the second section, under the numeral “2”, the type is shown as UTF-8 and the title displayed is null, so no valid information is retrieved from the URL found at that point.

[[Image:charencoding.jpg|400px|centre]]

It is important to overcome this problem because UTF-8 is the international “standard” for encoding of URLs and appears to be the most common type found. The plan is to first try converting URLs to UTF-16 before handling them with Arachnid. The reason for doing this is that Arachnid currently handles UTF-16 quite well, and it may be simpler to convert each URL that is found to the encoding that is working than to alter the crawler to deal with a whole new type of character encoding. To do this, the encode() method of the Java URLEncoder class will be experimented with.

The other issue, traversal order, is less critical but will be important in terms of data collection and storage. The websites of interest are information sites containing large amounts of text, which will most likely be arranged on the site in a structured, hierarchical manner. This means that it is quite important to be able to move down one avenue of information on a site and store similar items of interest together, so that they can be easily collected and analysed later.
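For the character encoding experiment described above, the following is a minimal sketch of how the standard java.net.URLEncoder and java.net.URLDecoder classes behave when given different charset names. It is not taken from Arachnid; the class name EncodingSketch, its helper methods and the sample string are hypothetical. Note that URLEncoder is designed for form-style (application/x-www-form-urlencoded) encoding of individual components rather than of whole URLs, so whether it actually resolves the null title problem remains to be tested.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for experimenting with encodings (not part of Arachnid).
class EncodingSketch {

    // Percent-encode one URL component using the named charset.
    static String encodeComponent(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Reverse of encodeComponent, using the same charset name.
    static String decodeComponent(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Turn raw UTF-8 bytes (for example a fetched page title) into a Java String.
    // Java strings are UTF-16 internally, so this is already a UTF-8 to UTF-16 conversion.
    static String decodeUtf8Bytes(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 menu"; // hypothetical title containing a non-ASCII character
        System.out.println(encodeComponent(sample, "UTF-8"));
        System.out.println(encodeComponent(sample, "UTF-16"));
        System.out.println(decodeComponent(encodeComponent(sample, "UTF-8"), "UTF-8"));
    }
}
```

Comparing the UTF-8 and UTF-16 output for the same input should make it easier to judge which form, if either, Arachnid accepts.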
Breadth first traversal is the simplest approach to working through a website; however, it will result in the data found being stored in an unorganised manner with respect to the aim of the project. Depth first traversal is slightly harder to implement but, if it can be achieved, will result in all results of a similar nature being stored in an ordered fashion. The following figures show the difference between breadth first and depth first traversal and the resulting order of stored data. Each node can be thought of as a different web page on the same website, where the numbers give the order in which each page would be searched.

[[Image:breadthfirst.jpg|800px|centre]]

It is clear from the diagrams that depth first traversal searches data from the same category in a hierarchy consecutively, so that the patterns and details found are also stored in a usable order. Depth first traversal is therefore the method most useful to the project and will be implemented as far as possible.

In order to minimise any large delays in the project due to issues with Arachnid, more time will be spent with Wget and on using it as an effective web crawler. As Wget is written in C, there may be some issues with translating the pattern searching code in order to use data extracted by Wget. This means that researching Wget and experimenting with it as soon as possible is important, in order to leave as much time as possible for combining the web crawler and the pattern matching algorithm.
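The difference between the two traversal orders discussed above can be sketched in a few lines of Java. This is an illustration only, not Arachnid or Wget code: the class TraversalSketch, the method names and the small example site are all hypothetical, and the site is modelled as an in-memory map from each page to the pages it links to rather than being fetched over the network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of visiting order; pages are plain strings.
class TraversalSketch {

    // Breadth first: visit every page at one level before moving a level deeper.
    static List<String> breadthFirst(Map<String, List<String>> links, String start) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String page = queue.removeFirst();
            order.add(page);
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (seen.add(next)) { // add() returns false if the page was already queued
                    queue.addLast(next);
                }
            }
        }
        return order;
    }

    // Depth first: follow one branch to its end before starting the next branch.
    static List<String> depthFirst(Map<String, List<String>> links, String start) {
        List<String> order = new ArrayList<>();
        visit(links, start, new HashSet<String>(), order);
        return order;
    }

    private static void visit(Map<String, List<String>> links, String page,
                              Set<String> seen, List<String> order) {
        if (!seen.add(page)) {
            return; // already visited, which also guards against link cycles
        }
        order.add(page);
        for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
            visit(links, next, seen, order);
        }
    }

    // A made-up site with two categories under a home page.
    static Map<String, List<String>> exampleSite() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("home", Arrays.asList("news", "science"));
        links.put("news", Arrays.asList("news/a", "news/b"));
        links.put("science", Arrays.asList("science/a"));
        return links;
    }

    public static void main(String[] args) {
        System.out.println(breadthFirst(exampleSite(), "home"));
        System.out.println(depthFirst(exampleSite(), "home"));
    }
}
```

On this example site, breadth first visits both category pages before any of their sub-pages, whereas depth first finishes everything under "news" before moving on to "science", which is the grouping the project wants for stored results.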