====Implementation====
Several different approaches were used to implement the web crawler in order to find a method that was both effective and simple to use. After experimenting with available open source crawlers such as Arachnid<ref>http://arachnid.sourceforge.net/ The Arachnid Web Crawler</ref> and JSpider<ref>http://j-spider.sourceforge.net/ The JSpider Web Crawler</ref>, we turned our attention to finding a simpler solution that could be operated directly from the command prompt. Such a program would allow us to input a website or list of websites of interest, collect the relevant data, and then retain some control over the pattern matching methods used to produce useful results. After much searching and experimenting I came across an open source crawler called HTTrack<ref>http://www.httrack.com/</ref>. HTTrack was used for the following reasons:
*It is free.
*It is simple to use. A GUI version and a command line version come with the standard package, which made it easy to become familiar with the program visually and then translate that experience into coded commands.
*It allows full website mirroring. The text from the websites is stored on the computer and can be used both offline and for multiple searches without needing to access and search the internet every time.
*It has a huge number of customisation options. This allowed for control over such things as search depth (how deep into a website the crawl goes), whether external websites are followed or just the one provided (avoiding jumps to websites that contain irrelevant data), and search criteria (only text is downloaded; no images, movies or other unwanted files that are of no use and waste downloads).
*It abides by the Robots Exclusion Protocol (individual access rights customised by the owner of each website).
*It has a command prompt option. This allows for a user friendly approach and integration with the pattern matching algorithm.
To keep the whole project user friendly, a batch file was created that performs the following steps (an illustrative sketch of such a batch file is given below):
#Takes in a URL or list of URLs that are pre-saved in a text file at a known location on the computer.
#Prompts the user to enter a destination on the computer to store the data retrieved from the website.
#Accesses HTTrack and performs a predetermined search on the provided URL(s).
#Once the website mirroring is complete, moves to the predetermined location containing the pattern matching code.
#Compiles and runs the pattern matching code.
For a simple video demonstration of the web crawler working in conjunction with the pattern matching software please see [http://www.youtube.com/watch?v=Pk6AaKM8wd4 Web Crawler Demo]. The demonstration video shows the use of the file input batch file. There is also a version that takes in URLs directly from the user. All software is available from the appendix.
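The batch file itself is not reproduced in this section, so the following is only a minimal sketch of how the five steps above could be scripted, assuming HTTrack is installed and on the PATH and that the pattern matching code is a Java program. The file locations, the option values (crawl depth, content filters) and the class name PatternMatcher are illustrative placeholders, not the values used in the project.
<pre>
@echo off
REM Illustrative sketch only -- paths, option values and the PatternMatcher
REM class name are assumptions, not taken from the project itself.

REM 1. URL(s) pre-saved in a text file at a known location on the computer
set URL_LIST=C:\crawler\urls.txt

REM 2. Prompt the user for a destination folder for the mirrored site(s)
set /p DEST=Enter destination folder for the mirrored websites: 

REM 3. Run a predetermined HTTrack mirror on the listed URL(s):
REM      --list  read the URLs from the text file (one URL per line)
REM      -O      output (mirror) directory
REM      -r3     limit the crawl depth to three levels
REM      -p1     store HTML/text only (no images, movies or other binaries)
REM      -s2     always follow robots.txt (Robots Exclusion Protocol)
httrack --list "%URL_LIST%" -O "%DEST%" -r3 -p1 -s2

REM 4 and 5. Move to the pattern matching code, then compile and run it
REM on the freshly mirrored text (a Java program is assumed here).
cd /d C:\crawler\patternmatching
javac PatternMatcher.java
java PatternMatcher "%DEST%"
</pre>
In practice the HTTrack options would be tuned to match the predetermined search described above, in particular the crawl depth, the handling of external links and the file type filters.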