Progress Report 2010
===Web Crawling===
An open-source, Java-based web crawling framework called Arachnid<ref>http://arachnid.sourceforge.net/</ref> has so far been the main focus of the work on producing the web crawler. This crawler is proving quite useful for the project due to the following traits:
*Java based
*A basic example is provided, allowing for a gentler learning curve than becoming familiar with a crawler entirely from scratch
*Specific case-handling methods are provided; methods for some URL cases are supplied (without handling code, but they give a starting point for the solution)
*Highly modifiable

After modifying the supplied crawler and making use of the methods demonstrated in the supplied example, the crawler is capable of the following (a simple sketch of these URL checks is given at the end of this section):
*Compiling successfully (a huge bonus, as it allows actual tests to take place)
*Accepting a base URL from the console
*Testing for the correct protocol (only HTTP websites are of interest for now)
*Testing for a null URL
*Testing for non-HTML URLs
*Testing for URLs that link to external websites
*Running the handleLink() method
*Iterating through the website and producing a list of the URLs found and their associated names, along with an incrementing counter

The screenshot below shows the command-line output printed at the beginning of a crawl of www.adelaide.edu.au, the university website. As can be seen, the results display each found URL with a number and a title. The "title = null" lines are explained later in this report.

[[Image:arachnidscreenprint.jpg|600px|centre]]

While Arachnid is proving quite useful, some issues have also been encountered (see the "Approach for Remainder of the Project" section). Because of this, alternative crawling methods have been researched in parallel with Arachnid. The main alternative being investigated for retrieving data from websites is Wget<ref>http://www.gnu.org/software/wget/</ref>. This program would be highly useful for the project for the following reasons:
*Freely available
*Supports downloading via HTTP
*Able to act as a web crawler, downloading recursively until reaching a set limit or the end of the website

The third point is extremely valuable to the goal of this project. Wget is able to mirror websites for offline viewing, which will also allow for easier extraction and searching of data.
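To make the capability list above more concrete, the following is a minimal, hypothetical Java sketch of the kinds of URL checks described (protocol, non-HTML and external-link tests). It does not use Arachnid's actual classes or method names; the class, helper methods and sample URL list are invented purely for illustration.

<syntaxhighlight lang="java">
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Hypothetical sketch of the URL checks described in this section.
 * This is NOT Arachnid's API; names here are illustrative only.
 */
public class UrlFilterSketch {

    /** True if the URL uses the HTTP protocol (other protocols are skipped for now). */
    static boolean isHttp(URL url) {
        return url != null && "http".equalsIgnoreCase(url.getProtocol());
    }

    /** True if the URL appears to point at an HTML page rather than, say, a PDF or image. */
    static boolean looksLikeHtml(URL url) {
        String path = url.getPath().toLowerCase();
        return path.isEmpty() || path.endsWith("/") || path.endsWith(".html") || path.endsWith(".htm");
    }

    /** True if the URL links to a host outside the site being crawled. */
    static boolean isExternal(URL url, URL base) {
        return !url.getHost().equalsIgnoreCase(base.getHost());
    }

    public static void main(String[] args) throws MalformedURLException {
        // In the real crawler the base URL is read from the console.
        URL base = new URL("http://www.adelaide.edu.au/");

        // Invented sample links standing in for URLs found while crawling.
        String[] found = {
            "http://www.adelaide.edu.au/courses/index.html",
            "https://www.adelaide.edu.au/secure/",     // wrong protocol, skipped
            "http://www.adelaide.edu.au/brochure.pdf", // non-HTML, skipped
            "http://www.google.com/"                   // external site, skipped
        };

        int counter = 0;
        for (String link : found) {
            URL url = new URL(link);
            if (!isHttp(url) || !looksLikeHtml(url) || isExternal(url, base)) {
                continue; // corresponds to the protocol/non-HTML/external tests above
            }
            counter++;
            // The real crawler also prints the page title next to each URL.
            System.out.println(counter + ": " + url);
        }
    }
}
</syntaxhighlight>

In the project's crawler, equivalent checks are carried out through the case-handling methods that Arachnid supplies (such as handleLink()), with each accepted URL printed alongside its counter and page title as shown in the screenshot above.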