Editing
Progress Report 2010
(section)
Jump to navigation
Jump to search
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
===Text Pattern Algorithms=== The most recent revision of the pattern finding algorithm and testing text file can be found [[Media:FindMatch.rar|here]]. The text pattern algorithm was written from scratch using java. The main purpose of the algorithm is to search through a text file and return an output saying if a word or initialism is found. This was done because we weren’t aware of what the web crawler’s were capable of using the initial functions (that is, using the web crawler without modifying any of the code). To start, a flow chart of the basic function of the algorithm was made to determine how to begin coding in the most efficient way possible. This is shown below: [[Image:FindMatchFlowChart.jpg|500px|centre]] To explain the function as simply as possible, the basic algorithm was designed to: *Accept a user entered input from the command prompt and place it into a string *Open a text file and place a line from that file into a string *Compare the file string with the user’s string character by character, printing an output if the entire user entered string was found *Repeat comparison process for every line until the end of the file is reached. The algorithm utilises the scanner class from the util package as well as the File and FileReader classes from the io package. They are important because these classes allow the algorithm to open the needed text file and scan through it. They were easily found on the java API<ref>http://java.sun.com/j2se/1.5.0/docs/api/</ref> and are fairly simple to use. For this algorithm, the output would report the line and character number of the first letter of the matched word found in the text file. To simplify the searching process, both the text file string and the user string were converted to lower case letters. It was felt that this helped improve the functionality of the algorithm because it involved fewer cases where characters would be the same letter but not the same. Since it doesn’t really matter if the characters are upper case or lower case for the results, there will really be no negative effect on the final outcome of the project. To implement searching for initialisms, the area circled in red was modified to only change the swCount variable if the previous character in the file string was a white space. The results of the output would report the line and character number like the previous method, as well as the sentence that the initialism represents. This was done because the aim of the project is to find what the initialisms of the Somerton code could possibly represent. This seemed to work, although several errors in the output were observed when some special cases were done. For example in the file string, if there were two white spaces in a row, the output would be incorrect (the result would print out the wrong character position and the wrong set of words that matched that initialism). Additionally, problems occurred when the input string had spaces in it as well (the method incorrectly assumed that the space represented an initialism as well). Several attempts were made to debug and fix these problems. The final solution was to create another method (to be used by the initialism finding algorithm) to remove the white spaces in the input string. Also, one more method was created to print the initialism sentence properly for every case that was thought of. For testing purposes, a main method was created which utilises the scanner class again. For this, the method accepts a user entered input to determine which algorithm will be used for the current search (FindExact or FindInitialism) and then accepts another user inputted string that represents the word or initialism that will be searched for. It is currently unknown whether this will be used as the interface for the final web crawler, although it is likely that the final interface will have a similar function. Some of the simulations and results from testing can be found below. [[Image:javaresults2.jpg|560px|centre]] The image above demonstrates the FindExact method. From observing the word being searched for, and the testing text file used, it is possible to see that the algorithm works in the assumed manner. As stated earlier, the code considers upper and lower case characters the same – this is illustrated above with the searches for “TEST” and “test” returning the same thing. The method correctly determines the line and character number of the beginning of the word. Furthermore, the method determines the word “ hello” as “hello” (without the spaces at the start) and can correctly find its location in the text file. Although the code ignores spaces in the search string, it can correctly determine the words “ t e s t” as “t e s t” with spaces in between, but not before the first non-whitespace character. Since the FindExact method can search for multiple words, it was felt that this specification was a requirement in the code. To do this, an algorithm was used to delete all of the white spaces in the user input string until a non-whitespace character was met. Additionally, the main method returns an input error when a letter other than ‘i’ or ‘e’ is used. [[Image:javaresults1.jpg|560px|centre]] This image shows that the FindInitialism method returns the same result for every different search of the initialism test (with spaces in different places in the string). Incredible amounts of debugging and modification was done in order for this to work correctly. Extensive use of the isWhitespace() method helped to achieve a working algorithm. This method was very important because it was used to determine how many words of the initialism were currently printed as well as determining if a character was the beginning of a word or not. The FindInitialism method is the method that will be developed for the final web crawler. It has a function specific to the final outcome for the project: to generate a list of potential sentences for an inputted initialism by reading through the source code of a web site.
Summary:
Please note that all contributions to Derek may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Derek:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Navigation menu
Personal tools
Not logged in
Talk
Contributions
Create account
Log in
Namespaces
Page
Discussion
English
Views
Read
Edit
View history
More
Search
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Tools
What links here
Related changes
Special pages
Page information