Difference between revisions of "Authorship detection: 2010 group"

Revision as of 13:00, 7 March 2011

Supervisors

Collaborators

Students

Weekly progress and questions

Semester 2, Week 1

Jie Dong

Progress and Status this week:

  1. First meeting with Derek, Brian and Maryam, and with the other group members, Leng and Tien-en.
  2. Derek, Brian and Maryam introduced us to the basic idea of this data mining project
  3. The idea of authorship detection was introduced
  4. Several applications to which data mining techniques can be applied were mentioned
  5. Research by students from past years was mentioned, and Maryam sent us several past research reports together with the code
  6. Research on the project, especially on SVM and some algorithms

Plan and Goals for new week:

  1. Prepare for the proposal seminar.
  2. Read research reports from past years' students.
  3. Understand project handbook.

Leng Tan

Progress and Status This Week

  1. the 1st meeting for the final year project was held with the supervisor, Prof Derek Abbott, co-supervisor, Dr Brian Ng, and Mrs Maryam, along with the team members.
  2. the initial project scope was introduced and the general aim of the project was discussed.
  3. the basic idea behind the techniques of authorship detection was shown as well.
  4. several ideas for future applications of this project were highlighted.
  5. some hints on getting started were given, namely to read Talis's final year report, which will be provided by Mrs Maryam.
  6. we were reminded of the first milestone of the project, the proposal seminar.

Plan and Goals for Next Week

  1. fully read and understand Talis's report.
  2. have a brief look at the code that will be supplied by Mrs Maryam.
  3. do some research on the background of some controversial cases such as the works of William Shakespeare, the Federalist Papers and the Letter to the Hebrews.
  4. read through the project handbook of 2010 to have a rough idea of all the milestones of the project focusing on the project seminar.

Tien-en Phua

Progress and Status this week:

  1. Met up with project supervisor, Prof Derek Abbott, co-supervisor, Dr Brian Ng, and Mrs Maryam
  2. Derek discussed the concept behind authorship detection
  3. Derek explained how multi-dimensional graphs can link a disputed text to a known author.
  4. Discussed possible future applications. Brian suggested code plagiarism and possibly music.
  5. Maryam provided other students' projects, and I started to go through the report by Talis.
  6. Went through the FYP Project handbook

Plan and Goals for new week:

  1. Identify the methods Talis used in his report
  2. Research on various methods
  3. Read up on past works regarding authorship detection
  4. Research on controversy

Semester 2, Week 2

Jie Dong

Progress and Status this week:

  1. Three methods are chosen for this project: word frequency, word recurrence interval, and trigram Markov model
  2. Reading material on SVM (SVM tutorial)
  3. Play with SVM software in MATLAB
  4. Prepare slides for proposal seminar presentation on project aim, background, and part of project process

Plan and Goals for new week:

  1. Combine slides with other group members and make some modifications
  2. Send slides draft to supervisor for feedback
  3. Make further modifications
  4. Presentation on Thursday

Leng Tan

Progress and Status This Week

  1. identified the 3 methods that were mentioned by Talis.
  2. gained some brief knowledge of the controversial issues.
  3. have a rough idea of the upcoming proposal seminar.

Plan and Goals for Next Week

  1. research on SVM.
  2. research on the background history of the project
  3. research on the different techniques used in the past
  4. prepare project proposal

Tien-en Phua

Progress and Status this week:

  1. Identified the three methods that Talis applied in his project, namely Word Frequency, Word Recurrence Interval and Trigram Markov
  2. Gained a brief understanding of how the three methods work
  3. Identified past work done by other researchers.
  4. Identified three main controversies, namely the Federalist Papers, Shakespeare's plays and the Letter to the Hebrews

Plan and Goals for new week:

  1. Prepare for Project Proposal
  2. Develop Gantt chart, project budget and risk analysis
  3. Identify major milestones in project
  4. Write up on controversy
  5. Further research on three methods

Semester 2, Week 3

Jie Dong

Progress and Status this week:

  1. In Monday's meeting we were introduced to Matthew and François-Pierre Huchet, who are also participating in this project.
  2. Came up with the first complete draft of the proposal presentation slides. Discussed the role of each person.
  3. Sent slides to Brian and Matthew for feedback
  4. Modified our slides
  5. Presentation on Thursday

Plan and Goals for new week:

  1. Do more research on the three methods and SVM
  2. Prepare for stage 1 design document

Leng Tan

Progress and Status This Week

  1. rough draft slides on the past research have been done for the proposal seminar.
  2. a comparison list of the different techniques is done.
  3. started research on SVM, which is to be added to the slides alongside the different techniques
  4. had a meeting with supervisors, and was introduced to Dr Matthew.
  5. focus 100% on the proposal seminar.

Plan and Goals for Next Week

  1. have a more detailed review on the 3 methods.
  2. read the criteria for the stage 1 design document.

Tien-en Phua

Progress and Status this week:

  1. Prepare for project proposal
  2. Developed Gantt chart, project budget and risk analysis
  3. Developed slides for milestones and controversy
  4. Research on SVM (Support Vector Machine)
  5. Gained a better understanding of Word Frequency, WRI and Trigram Markov

Plan and Goals for new week:

  1. Proceed to develop Stage 1 Design Document
  2. Understand SVM
  3. Develop Work Breakdown Structure
  4. Delegate tasks to individual members
  5. Read up on the other 4 reports

Semester 2, Week 4

Jie Dong

Progress and Status This Week

  1. In this project, we plan to have each person working on one method -- I am working on Trigram Markov model
  2. Read past reports for trigram Markov information
  3. Make stage 1 design document template
  4. Write project aim, background, and project approach in design document

Plan and Goals for Next Week

  1. Modify the design document draft
  2. Send to supervisors for feedback
  3. More modification
  4. Prepare a tutorial on SVM for other group members

Leng Tan

Progress and Status This Week

  1. research on the 3 methods has been completed.
  2. fully read and understood the criteria for stage 1 design document.
  3. had a brief meeting with group members to delegate the tasks in preparing the stage 1 design document.


Plan and Goals for Next Week

  1. do a rough draft of the tasks that are allocated.
  2. do a layout design for the document.

Tien-en Phua

Progress and Status this week:

  1. Develop Work Breakdown Structure
  2. Identified tasks required for Stage 1 Design Document
  3. Broke down tasks and assigned them to each member
  4. In the process of development of Stage 1 Design Document
  5. Further research on SVM and Word Frequency

Plan and Goals for new week:

  1. Complete write up on Word Frequency and SVM
  2. Complete Stage 1 Design Document
  3. Coding and further research on Word Frequency
  4. Read up on the other 4 reports

Semester 2, Week 5

Jie Dong

Progress and Status this week:

  1. Done abstract, project aim, background and significance
  2. Done description of data extraction part for Trigram Markov model in design document
  3. Feedback from supervisors on design document
  4. Final modification on design document
  5. Format the design document on wiki

Plan and Goals for Next Week:

  1. Design on Trigram Markov model
  2. learn to use SVM
  3. a bit of coding on trigram Markov model

Leng Tan

Progress and Status this week:

  1. Done Literature Review of design document
  2. Done description of data extraction part for WRI in design document
  3. Done project approach and milestone for design document
  4. added modified WBS in appendix
  5. done initial check and compilation of Design document

Plan and Goals for Next Week:

  1. start a rough design for WRI data extraction in Java
  2. read SVM

Tien-en Phua

Progress and Status this week:

  1. Completed design document
    • Project Requirements
    • Description of data extraction of Function Word Frequency analysis
    • Project Budget
    • Background and Significance of Hebrews
    • Edited Gantt Chart, WBS to synchronise
    • Edited and grammar check etc
  2. Basic layout of software design for data extraction algorithm
  3. Wiki page

Plan and Goals for Next Week:

  1. Commence programming of algorithm using Java
  2. Read up on SVM

Semester 2, Week 6

Jie Dong

Progress and Status this week:

  1. Research on Trigram Markov model
  2. Two models are proposed:
    • Simple trigram Markov model: only considers the effect of trigrams in the text
    • Potential problem with the first model: sparse data. New trigrams that appear in the test text lead to poor cross entropy
    • Second model: hidden Markov model on trigrams. Not only trigram counts but also unigram and bigram effects are taken into consideration. The transition probability is composed of all three probabilities.
  3. The existence of punctuation and uppercase letters should be considered for texts written in English.
  4. Programming on text file input and exception handling in Java
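
One possible reading of the second model is a linear interpolation of unigram, bigram and trigram estimates, so that an unseen trigram does not get a zero probability. The sketch below is only illustrative: the class name, the three weights and the choice of linear interpolation are assumptions, not the model actually adopted in the project.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: trigram probability smoothed with bigram and unigram estimates. */
class InterpolatedTrigramModel {
    // Hypothetical weights (assumption); they must sum to 1.
    private static final double W_UNI = 0.1;
    private static final double W_BI = 0.3;
    private static final double W_TRI = 0.6;

    private final Map<String, Integer> uni = new HashMap<>();
    private final Map<String, Integer> bi = new HashMap<>();
    private final Map<String, Integer> tri = new HashMap<>();
    private int total = 0;

    /** Counts all unigrams, bigrams and trigrams in the given word sequence. */
    InterpolatedTrigramModel(String[] words) {
        for (int i = 0; i < words.length; i++) {
            uni.merge(words[i], 1, Integer::sum);
            total++;
            if (i >= 1) {
                bi.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
            }
            if (i >= 2) {
                tri.merge(words[i - 2] + " " + words[i - 1] + " " + words[i], 1, Integer::sum);
            }
        }
    }

    /** Estimate of P(w3 | w1, w2) as a weighted sum of the three relative frequencies. */
    double probability(String w1, String w2, String w3) {
        double pUni = total == 0 ? 0.0 : uni.getOrDefault(w3, 0) / (double) total;
        int countW2 = uni.getOrDefault(w2, 0);
        double pBi = countW2 == 0 ? 0.0 : bi.getOrDefault(w2 + " " + w3, 0) / (double) countW2;
        int countW1W2 = bi.getOrDefault(w1 + " " + w2, 0);
        double pTri = countW1W2 == 0 ? 0.0
                : tri.getOrDefault(w1 + " " + w2 + " " + w3, 0) / (double) countW1W2;
        return W_UNI * pUni + W_BI * pBi + W_TRI * pTri;
    }
}
```

With this form, a trigram never seen in training still receives a small probability as long as its last word was seen, which is the sparse-data problem noted for the first model.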

Plan and Goals for new week:

  1. Discuss the models with supervisor
  2. SVM problem
  3. Programming on first model

Leng Tan

Progress and Status this week:

  1. Done a design for the WRI code after discussion with group members.
  2. written about 50% of the code for data extraction using WRI.
  3. read a bit on SVM but still don't understand it.

Plan and Goals for new week:

  1. finish the coding for WRI.
  2. try to get help for SVM.

Tien-en Phua

Progress and Status this week:

  1. Finished the algorithm design (pseudocode) for function word frequency, to be coded in Java.
  2. Started implementing the algorithm code.
  3. Code is about half done.
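
As a rough illustration of function word frequency extraction, the sketch below counts how often each word from a fixed list occurs, relative to the length of the text. The class name and the five-word list are hypothetical; the list actually used in the project is not given here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch: relative frequency of a fixed list of function words. */
class FunctionWordFrequency {
    // Hypothetical, deliberately short list (assumption).
    static final String[] FUNCTION_WORDS = {"the", "of", "and", "to", "in"};

    /** Returns count / total words for each function word, in list order. */
    static Map<String, Double> frequencies(String[] words) {
        Map<String, Double> freq = new LinkedHashMap<>();
        for (String f : FUNCTION_WORDS) {
            freq.put(f, 0.0);
        }
        if (words.length == 0) {
            return freq;
        }
        for (String w : words) {
            if (freq.containsKey(w)) {
                freq.put(w, freq.get(w) + 1.0);
            }
        }
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / words.length);
        }
        return freq;
    }
}
```

The input is assumed to be already lowercased and stripped of punctuation.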

Plan and Goals for new week:

  1. Finish coding.
  2. Discuss about SVM problems.

Semester 2, Week 7

Jie Dong

Progress and Status this week:

  1. Read the chapter on hidden Markov chains in "Statistical Language Learning"
  2. Came up with my own test text to verify my code is working properly
  3. Met with Brian to discuss my current work; the current approach does not work efficiently

Plan and Goals for new week:

  1. The previous algorithm only considers the effect of word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of trigrams appear only once, and only about 10% of the information is worth using in classification. When common trigrams are extracted from several test texts, few of them are left. Hence an enhanced model, in which unigrams and bigrams are also taken into consideration, will be tested in the following week.
  2. SVM will also be used to test the result in the coming week. Investigate how to use the SVM functions in MATLAB, svmtrain and svmclassify (Bioinformatics Toolbox)
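
The share of trigrams that appear only once can be measured with a simple count. The following is a minimal sketch, assuming the text is already split into words; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: count word trigrams and measure how many occur only once. */
class TrigramStats {
    /** Maps each trigram (three words joined by spaces) to its number of occurrences. */
    static Map<String, Integer> countTrigrams(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 < words.length; i++) {
            String gram = words[i] + " " + words[i + 1] + " " + words[i + 2];
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of distinct trigrams whose count is exactly one. */
    static double singletonFraction(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        int singles = 0;
        for (int c : counts.values()) {
            if (c == 1) {
                singles++;
            }
        }
        return singles / (double) counts.size();
    }
}
```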

Leng Tan

Progress and Status this week:

  1. Finished the Java coding for the WRI technique in the data extraction algorithm.
  2. Tested and verified that the code is working properly using a small test file. (text file with only a few sentences)
  3. Had a meeting with Brian to discuss the SVM input and output.
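
For reference, a word recurrence interval can be taken as the distance, in words, between successive occurrences of the same word, and the spread of those distances then becomes a feature. The sketch below is an assumption about the details (position difference as the interval, population standard deviation), not the project's actual WRI code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: recurrence intervals of one word and their standard deviation. */
class WordRecurrenceInterval {
    /** Distances (in word positions) between successive occurrences of the target word. */
    static List<Integer> intervals(String[] words, String target) {
        List<Integer> gaps = new ArrayList<>();
        int last = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(target)) {
                if (last >= 0) {
                    gaps.add(i - last);
                }
                last = i;
            }
        }
        return gaps;
    }

    /** Population standard deviation; returns 0 for an empty list (assumption). */
    static double standardDeviation(List<Integer> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        double mean = 0.0;
        for (int v : values) {
            mean += v;
        }
        mean /= values.size();
        double sumSquares = 0.0;
        for (int v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / values.size());
    }
}
```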

Plan and Goals for new week:

  1. Figure out SVM.
  2. Test and try out SVM in MATLAB using small test files.

Tien-en Phua

Progress and Status this week:

  1. Completed coding for data extraction algorithm (DEA)
  2. Discuss implementation of output of data from DEA to SVM
  3. Analyse how other researchers analyse their data

Plan and Goals for new week:

  1. Modification and refining of DEA code
  2. Continue analysis of how other researchers used this DEA for authorship attribution
  3. Try applying data to SVM

Semester 2, Week 8

Jie Dong

Progress and Status this week:

  1. Peer review assessment on the design document on "Audio assisted vision system"
  2. Investigation on SVM in MATLAB
  3. Working on modified trigram model

Plan and Goals for new week:

  1. Test my result of java program with SVM

Leng Tan

Progress and Status this week:

  1. Received a stage 1 design document on "Audio Assisted Vision System for Visually Impaired People".
  2. The document was read in full and notes were taken on presentation and various other aspects.
  3. The document was reviewed and a formal peer review report was produced.
  4. Investigation of SVM in MATLAB was halted for a moment due to the peer review report.

Plan and Goals for new week:

  1. Figure out SVM.
  2. Test and try out SVM in MATLAB using small test files.

Tien-en Phua

Progress and Status this week:

  1. Completed the coding of the Data Extraction Algorithm. It can load a file, remove punctuation and create a new output file for Support Vector Machine input
  2. Reviewed the peer document and did some research on the principles of echolocation performed by bats to understand the document
  3. Completed Peer Review on Audio Assisted Vision System For Visually Impaired People

Plan and Goals for new week:

  1. Apply the data generated by the data extraction algorithm to the support vector machine
  2. Determine progress of project and review schedule.

Semester 2, Week 9

Jie Dong

Progress and Status this week:

  1. The hidden Markov model is implemented in Java, and the program produces a table of probability information for some common trigrams from the input texts. The current problem is that, because I am feeding every word that appears in the texts into the program, there are few common trigrams among a given number of input texts. For example, with a total of 20 input texts from two authors, they have just one trigram in common. I also set the program to allow only some of the texts to share a trigram, putting zero probabilities for that trigram in the others, but the result is still not efficient.
  2. Reading through Talis's trigram description and code, I found that he simplified the method and extracted the key specification by deleting the non-key words. Testing his idea in Java code, I found it does extract a lot more information than mine. However, it raises the question of whether it would reduce the accuracy of classification, since it changes the original text into another. This simplification needs to be justified.
  3. The result produced by the extraction algorithm was fed into the MATLAB SVM functions (svmtrain and svmclassify), which showed that my extraction algorithm is not working properly: sometimes the predicted author for the chosen texts is correct and sometimes not. As for the SVM itself, it only supports classification into two groups, and multi-group classification produces an error. In addition, it can only plot the SVM structure for two-dimensional data. Hence more capable SVM toolboxes should be studied.
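
The "common trigram" selection described in item 1 can be sketched as keeping every trigram that occurs in at least a chosen number of texts. The names below, and the idea of a single `minTexts` threshold, are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch: trigrams shared by at least minTexts of the input texts. */
class CommonTrigrams {
    /** Set of distinct word trigrams in one text. */
    static Set<String> trigramsOf(String[] words) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 < words.length; i++) {
            grams.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        return grams;
    }

    /** Trigrams that occur in at least minTexts texts, in alphabetical order. */
    static Set<String> sharedBy(List<String[]> texts, int minTexts) {
        Map<String, Integer> textCount = new HashMap<>();
        for (String[] text : texts) {
            for (String gram : trigramsOf(text)) {
                textCount.merge(gram, 1, Integer::sum);
            }
        }
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, Integer> e : textCount.entrySet()) {
            if (e.getValue() >= minTexts) {
                shared.add(e.getKey());
            }
        }
        return shared;
    }
}
```

Setting `minTexts` to the number of texts gives the strict intersection; lowering it corresponds to allowing only part of the texts to share a trigram.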

Plan and Goals for next week:

  1. GUI design
  2. Test efficiency using different groups of input texts
  3. Try another SVM toolbox from: http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html

Leng Tan

Progress and Status this week:

  1. A basic SVM code which receives a text file input is produced.
  2. The SVM code will need 2 training data groups and a number of test data groups.
  3. The standardised format for the input to SVM was decided by team members.
  4. The input format will be an M×N matrix where the first column is the author and the subsequent columns are the data (in my case, standard deviations).
  5. Initial data uses 20 standard deviation columns.
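
A matrix in that layout could be written out as below. This is a sketch only: encoding the author as an integer label and separating columns with a single space are assumptions, since the agreed file format is not spelled out here.

```java
/** Illustrative sketch: one text per row, author label first, then the feature values. */
class SvmInputFormatter {
    /** Formats a single row: author label followed by space-separated features. */
    static String formatRow(int authorLabel, double[] features) {
        StringBuilder row = new StringBuilder();
        row.append(authorLabel);
        for (double f : features) {
            row.append(' ').append(f);
        }
        return row.toString();
    }

    /** Formats all rows, one line per text. */
    static String formatMatrix(int[] authorLabels, double[][] features) {
        if (authorLabels.length != features.length) {
            throw new IllegalArgumentException("one label is needed per feature row");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < authorLabels.length; i++) {
            out.append(formatRow(authorLabels[i], features[i])).append('\n');
        }
        return out.toString();
    }
}
```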

Plan and Goals for next week:

  1. The SVM predicts the author wrongly and this needs to be resolved.
  2. Might be due to insufficient train data.
  3. Further testing is required.
  4. Might consider implementing GUI.
  5. Need to have a meeting with supervisors on progress and GUI implementation (can the Java and MATLAB GUIs be combined?)

Tien-en Phua

Progress and Status this week:

  1. Researched statistical software for obtaining the covariance of data: StatGraphics
  2. Downloaded and installed the chosen software and attempted to operate the program
  3. Research on a book discussing the possible author of Hebrews Nacsbt: Lukan Authorship Of Hebrews

Plan and Goals for next week:

  1. Obtain the covariance of the data
  2. Check to see if the data extraction algorithm produces results similar to Talis's
  3. Produce code to "chop" all text files to a specific length for analysis
  4. Input data to SVM and observe the outcome
  5. Combine functions for analysis
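
The "chop" step in item 3 could look like the sketch below, assuming length is measured in words and that shorter texts are left unchanged; both points are assumptions, as is the class name.

```java
import java.util.Arrays;

/** Illustrative sketch: truncate a text to at most maxWords words. */
class TextChopper {
    /** Returns the first maxWords words, or all words if the text is shorter. */
    static String[] chop(String[] words, int maxWords) {
        if (maxWords < 0) {
            throw new IllegalArgumentException("maxWords must not be negative");
        }
        return Arrays.copyOf(words, Math.min(words.length, maxWords));
    }
}
```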

Semester 2, Week 10

Jie Dong

Progress and Status this week:

  1. Original Java program is rebuilt as a standard Eclipse project
  2. Delete Transition class, no longer used
  3. Change three classes (State, Gram, Record) to inner classes
  4. Reduce the original three main methods in separate classes to a single one in the Driver class
  5. Move methods for user inputs to Driver class, including parameters and paths
  6. Add three header lines to Java program output: number of texts, number of disputed texts, number of trigrams used

Plan and Goals for next week:

  1. Standardise three algorithms into one project folder
  2. Use same training data, unknown data to test three extraction algorithms
  3. Compare their accuracies in different situations (number of key words, number of texts, etc.)

Leng Tan

Progress and Status this week:

  1. had a meeting with the supervisors and reported on the progress of the project.
  2. SVM code remains the same for the time being.
  3. A table of results should be produced to compare the data extraction algorithms.
  4. the main idea of the progress report is discussed.

Plan and Goals for next week:

  1. A standardised template to combine all 3 data extraction algorithms was discussed.
  2. WRI code needs to be slightly modified.
  3. need to plan the initial design for the GUI.

Tien-en Phua

Progress and Status this week:

  1. Modify code to accept multiple inputs
  2. Extract the Federalist Papers for testing on support vector machine using function word analysis
  3. Meeting with supervisors on Wednesday for progress updates and guidance on next step
  4. Commencement of progress report

Plan and Goals for next week:

  1. Produce a table of result displaying the accuracy of the algorithm with SVM Kernel function
  2. Complete progress report, project background, project specification, progress thus far and project management
  3. Combine the three algorithms together into a single driver file
  4. Discuss and design possible implementation of a GUI

Semester 2, Week 11

Jie Dong

Progress and Status this week:

  1. Update progress report
  2. JAVA program modification:
    • Sort list of files read in according to their name order
    • Replace manual parameter setup with automatic reading of data, forming the training set and testing set according to the three header lines
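
The two modifications might be sketched as follows. The header layout assumed here (one integer per line, in the order number of texts, number of disputed texts, number of trigrams used) is a guess based on the Week 10 description, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch: order input files by name and read the three header lines. */
class ExtractionOutputReader {
    /** Returns a copy of the file names sorted in their natural (lexicographic) order. */
    static List<String> sortedByName(List<String> fileNames) {
        List<String> copy = new ArrayList<>(fileNames);
        Collections.sort(copy);
        return copy;
    }

    /** Parses the first three lines as integers (assumed: one value per line). */
    static int[] parseHeader(List<String> lines) {
        if (lines.size() < 3) {
            throw new IllegalArgumentException("expected three header lines");
        }
        int[] header = new int[3];
        for (int i = 0; i < 3; i++) {
            header[i] = Integer.parseInt(lines.get(i).trim());
        }
        return header;
    }
}
```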

Plan and Goals for next week:

  1. Write a standard document to combine our java extraction program together
  2. Complete Progress report

Leng Tan

Progress and Status this week:

  1. Do progress report.

Plan and Goals for next week:

  1. catch up on assignments and prepare for exams.

Tien-en Phua

Progress and Status this week:

  1. Update of progress report

Plan and Goals for next week:

  1. Complete 4 upcoming assignments
  2. Prepare for power system quiz

Semester 2, Week 12

Jie Dong

Progress and Status this week:

  1. Rough draft of the standard for the Java extraction program:
    • Extraction algorithms
      • Word frequency
      • Word Recurrence Interval
      • Trigram Markov
    • Overall structure
      • Choose files (training texts and testing text)
        • Outputs: two ArrayList<String> of file names: trainFile and testFile
      • Read in texts
        • Each text (with filename “XX_YY.txt”) is imported as a single string. All punctuation except ‘ is removed and all letters are changed to lowercase. It is then converted into a data structure called a “Text” object, which consists of the following fields:
          • Author (XX)
          • Title (YY)
          • A string of words (text content, no punctuation except ‘, all letters lowercase)
          • An array of words (text content, no punctuation except ‘, all letters lowercase)
          • A sorted array (in descending order of occurrence frequency) of “Word” objects, each containing a word string that appears in the text and its occurrence frequency
        • Outputs: two ArrayList<Text> of Text objects: trainText and testText
      • Extraction
        • Extracting useful data based on three different algorithms
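
The "Text" object in the draft above might look roughly like this. It is a sketch under several assumptions: field and method names are invented, only the ASCII apostrophe is kept, removed punctuation is replaced by a space so that neighbouring words do not fuse, ties in frequency are broken alphabetically, and a list stands in for the sorted array.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch of the "Text" object: author, title, cleaned content, word counts. */
class Text {
    /** A word together with how often it occurs in the text. */
    static class Word {
        final String word;
        final int count;

        Word(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    final String author;
    final String title;
    final String content;
    final String[] words;
    final List<Word> wordsByFrequency;

    /** Builds a Text from a file name of the form XX_YY.txt and its raw content. */
    Text(String fileName, String rawContent) {
        String base = fileName.endsWith(".txt")
                ? fileName.substring(0, fileName.length() - 4) : fileName;
        int sep = base.indexOf('_');
        if (sep < 0) {
            throw new IllegalArgumentException("expected XX_YY.txt but got " + fileName);
        }
        this.author = base.substring(0, sep);
        this.title = base.substring(sep + 1);
        this.content = normalise(rawContent);
        this.words = content.isEmpty() ? new String[0] : content.split(" ");

        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        List<Word> sorted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sorted.add(new Word(e.getKey(), e.getValue()));
        }
        // Descending by count; alphabetical order breaks ties (assumption).
        Collections.sort(sorted, (a, b) -> a.count != b.count
                ? Integer.compare(b.count, a.count) : a.word.compareTo(b.word));
        this.wordsByFrequency = sorted;
    }

    /** Lowercases, keeps letters, digits and ', and collapses whitespace. */
    static String normalise(String raw) {
        return raw.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9'\\s]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }
}
```

For example, `new Text("Austen_Emma.txt", "Don't stop. The end, the END!")` yields author `Austen`, title `Emma` and content `don't stop the end the end`.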


Plan and Goals for next week:

Leng Tan

Progress and Status this week:

  1. assignments due this week are completed.

Tien-en Phua

Progress and Status this week:

  1. Completed all assignments due this week

Plan and Goals for next week:

  1. Need to prepare for exams. SWOT week next week.
  • Project will "pause" till after exam period, 20 Nov 2010, thereafter the team will be working individually back in their home country and update each other via email

Semester 1, Week 1

Jie Dong

Progress and Status this week:

  1. Had a small discussion with the team members and work on SVM.
  2. Modify SVM program to support multi-group classification function
  3. Test the accuracy of the whole classifying program with English texts
  4. Generate accuracy table with respect to three different variables: tolerance, number of key words and kernel function (linear, quadratic, rbf, polynomial)
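
Each cell of such a table is a classification accuracy, which can be computed as the share of test texts whose predicted author matches the known author. A minimal sketch, assuming authors are encoded as integer labels (the class name is hypothetical):

```java
/** Illustrative sketch: fraction of texts whose predicted author label is correct. */
class ClassificationAccuracy {
    /** Returns correct / total; both arrays must have the same, non-zero length. */
    static double of(int[] predicted, int[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return correct / (double) predicted.length;
    }
}
```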

Plan and Goals for new week:

  1. Discuss with supervisor about the performance of current program and suggest ways to increase accuracy
  2. Apply interface developed by Joel

Leng Tan

Progress and Status this week:

  1. Brief discussion with team members on the project.
  2. the English texts are used to test the accuracy of the program.
  3. Try different kernel functions of the SVM while testing the accuracy.

Plan and Goals for Next Week

  1. Organize a meeting with the supervisors for updates.
  2. discuss a constant text length with Joel.
  3. try to combine the code.

Tien-en Phua

Progress and Status this week:

  1. Conducted a brief meeting with team members to further evaluate the SVM.

Plan and Goals for new week:

  1. Have a meeting with supervisors showing the results.
