Authorship detection: 2010 group
Supervisors
Collaborators
- François-Pierre Huchet, ITII Pays de la Loire, Nantes, France.
- J. José Alviar, University of Navarra, Spain
Students
Weekly progress and questions
Semester 2, Week 1
Jie Dong
Progress and Status this week:
- First meeting with Derek, Brian and Maryam, and with the other group members, Leng and Tien-en.
- Derek, Brian and Maryam introduced the basic idea of this data mining project to us
- The idea of authorship detection was introduced
- Several applications to which data mining techniques can be applied were mentioned
- Research by past years' students was mentioned, and Maryam sent us several past research reports together with the code
- Research on the project, especially on SVM and some algorithms
Plan and Goals for new week:
- Prepare for the proposal seminar.
- Read research reports from past years' students.
- Understand project handbook.
Leng Tan
Progress and Status This Week
- the 1st meeting for the final year project was held with the supervisor, Prof Derek Abbott, the co-supervisor, Dr Brian Ng, and Mrs Maryam, along with the team members.
- the initial project scope was introduced and the general aim of the project was discussed.
- the basic idea behind the techniques of authorship detection was shown as well.
- several ideas for future applications of this project were highlighted.
- some hints on getting started were given, namely to read Talis's final year report, which will be provided by Mrs Maryam.
- we were reminded of the first milestone of the project, the proposal seminar.
Plan and Goals for Next Week
- fully read and understand Talis's report.
- have a brief look at the code that will be supplied by Mrs Maryam.
- do some research on the background of some controversial cases such as the works of William Shakespeare, the Federalist Papers and the Letter to the Hebrews.
- read through the 2010 project handbook to get a rough idea of all the milestones of the project, focusing on the project seminar.
Tien-en Phua
Progress and Status this week:
- Met up with project supervisor, Prof Derek Abbott, co-supervisor, Dr Brian Ng, and Mrs Maryam
- Derek discussed the concept behind authorship detection
- Derek explained how multi-dimensional graphs can link a disputed text to a known author.
- Discussed possible future applications. Brian suggested code plagiarism and possibly music.
- Was provided by Maryam with other students' projects and started going through the report by Talis.
- Went through the FYP Project handbook
Plan and Goals for new week:
- Identify the methods Talis used in his report
- Research on various methods
- Read up on past works regarding authorship detection
- Research on the controversies
Semester 2, Week 2
Jie Dong
Progress and Status this week:
- Three methods are chosen for this project: word frequency, word recurrence interval, and trigram Markov model
- Reading material on SVM (SVM tutorial)
- Experiment with SVM software in MATLAB
- Prepare slides for proposal seminar presentation on project aim, background, and part of project process
Plan and Goals for new week:
- Combine slides with those of the other group members and make some modifications
- Send slides draft to supervisor for feedback
- Make further modifications
- Presentation on Thursday
Leng Tan
Progress and Status This Week
- identified the 3 methods that were mentioned by Talis.
- gained a brief knowledge of the controversial cases.
- have a brief idea of the upcoming proposal seminar.
Plan and Goals for Next Week
- research on SVM.
- research on the background history of the project
- research on the different techniques used in the past
- prepare project proposal
Tien-en Phua
Progress and Status this week:
- Identified the three methods that Talis applied in his project, namely Word Frequency, Word Recurrence and Trigram Markov
- Briefly understood how the three methods work
- Identified the past work done by other researchers.
- Identified three main controversies, namely the Federalist Papers, Shakespeare's plays and the Letter to the Hebrews
Plan and Goals for new week:
- Prepare for Project Proposal
- Develop Gantt chart, project budget and risk analysis
- Identify major milestones in project
- Write up on the controversies
- Further research on three methods
Semester 2, Week 3
Jie Dong
Progress and Status this week:
- We were introduced to Matthew and François-Pierre Huchet, who are also participating in this project, at Monday's meeting.
- Came up with the first complete draft of the proposal presentation slides. Discussed the role of each person.
- Sent slides to Brian and Matthew for feedback
- Modified our slides
- Presentation on Thursday
Plan and Goals for new week:
- Do more research on the three methods and SVM
- Prepare for stage 1 design document
Leng Tan
Progress and Status This Week
- rough draft slides on past research have been done for the proposal seminar.
- a comparison list of the different techniques is done.
- started research on SVM, which is to be added to the slides alongside the different techniques
- had a meeting with supervisors, and was introduced to Dr Matthew.
- focused 100% on the proposal seminar.
Plan and Goals for Next Week
- have a more detailed review of the 3 methods.
- read the criteria for the stage 1 design document.
Tien-en Phua
Progress and Status this week:
- Prepared for project proposal
- Developed Gantt chart, project budget and risk analysis
- Developed slides for milestones and controversies
- Research on SVM (Support Vector Machine)
- Gained a better understanding of Word Frequency, WRI and Trigram Markov
Plan and Goals for new week:
- Proceed to develop Stage 1 Design Document
- Understand SVM
- Develop Work Breakdown Structure
- Delegate tasks to individual members
- Read up on the other 4 reports
Semester 2, Week 4
Jie Dong
Progress and Status This Week
- In this project, we plan to have each person working on one method. I am working on the trigram Markov model
- Read past reports for information on the trigram Markov model
- Make stage 1 design document template
- Write project aim, background, and project approach in design document
Plan and Goals for Next Week
- Modify the design document draft
- Send to supervisors for feedback
- More modification
- Prepare a tutorial on SVM for other group members
Leng Tan
Progress and Status This Week
- research on the 3 methods has been completed.
- fully read and understood the criteria for stage 1 design document.
- had a brief meeting with group members to delegate the tasks in preparing the stage 1 design document.
Plan and Goals for Next Week
- do a rough draft of the tasks that are allocated.
- do a layout design for the document.
Tien-en Phua
Progress and Status this week:
- Developed Work Breakdown Structure
- Identified tasks required for Stage 1 Design Document
- Broke down tasks and assigned them to each member
- In the process of development of Stage 1 Design Document
- Further research on SVM and Word Frequency
Plan and Goals for new week:
- Complete write up on Word Frequency and SVM
- Complete Stage 1 Design Document
- Coding and further research on Word Frequency
- Read up on the other 4 reports
Semester 2, Week 5
Jie Dong
Progress and Status this week:
- Done abstract, project aim, background and significance
- Done description of data extraction part for Trigram Markov model in design document
- Feedback from supervisors on design document
- Final modification on design document
- Format the design document on wiki
Plan and Goals for Next Week:
- Design of the trigram Markov model
- Learn to use SVM
- A bit of coding on the trigram Markov model
Leng Tan
Progress and Status this week:
- Done Literature Review of design document
- Done description of data extraction part for WRI in design document
- Done project approach and milestone for design document
- added modified WBS in appendix
- done initial check and compilation of Design document
Plan and Goals for Next Week:
- start a rough design for WRI data extraction in Java
- read up on SVM
Tien-en Phua
Progress and Status this week:
- Completed design document:
- Project Requirements
- Description of data extraction of Function Word Frequency analysis
- Project Budget
- Background and Significance of Hebrews
- Edited Gantt Chart and WBS to synchronise them
- Editing, grammar check, etc.
 
- Basic layout of software design for data extraction algorithm
- Wiki page
Plan and Goals for Next Week:
- Commence programming of algorithm using Java
- Read up on SVM
Semester 2, Week 6
Jie Dong
Progress and Status this week:
- Research on Trigram Markov model
- Two models are proposed:
- Simple trigram Markov model: considers only the effect of trigrams in the text
- Potential problem with the first model: sparse data. New trigrams that appear in the test text lead to poor cross entropy
- Second model: hidden Markov model on trigrams. Not only trigram counts but also unigram and bigram effects are taken into consideration. The transition probability is composed of all three probabilities.
 
- The existence of punctuation and uppercase letters should be considered for text written in English.
- Programming on text file input and exception handling in Java
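The second model described above, where the transition probability combines trigram, bigram and unigram estimates, can be sketched as a linear interpolation. This is only a minimal sketch under stated assumptions: the class name, the method names and the weights 0.1, 0.3 and 0.6 are hypothetical and do not come from the project code.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a trigram model smoothed with bigram and unigram estimates. */
class InterpolatedTrigramModel {
    // Counts for unigrams, bigrams and trigrams, keyed by space-joined words.
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalWords = 0;
    // Hypothetical interpolation weights (assumption); they sum to 1.
    private final double l1 = 0.1, l2 = 0.3, l3 = 0.6;

    /** Counts every unigram, bigram and trigram in a tokenised text. */
    void train(String[] words) {
        for (int i = 0; i < words.length; i++) {
            totalWords++;
            counts.merge(words[i], 1, Integer::sum);
            if (i >= 1) {
                counts.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
            }
            if (i >= 2) {
                counts.merge(words[i - 2] + " " + words[i - 1] + " " + words[i], 1, Integer::sum);
            }
        }
    }

    private int count(String key) {
        return counts.getOrDefault(key, 0);
    }

    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    /** Weighted mix of the unigram, bigram and trigram estimates of w3. */
    double probability(String w1, String w2, String w3) {
        double uni = ratio(count(w3), totalWords);
        double bi = ratio(count(w2 + " " + w3), count(w2));
        double tri = ratio(count(w1 + " " + w2 + " " + w3), count(w1 + " " + w2));
        return l1 * uni + l2 * bi + l3 * tri;
    }

    public static void main(String[] args) {
        InterpolatedTrigramModel model = new InterpolatedTrigramModel();
        model.train("the cat sat on the mat".split(" "));
        System.out.println(model.probability("the", "cat", "sat"));
        System.out.println(model.probability("mat", "the", "cat"));
    }
}
```

Because the bigram and unigram terms are rarely zero, a trigram that never appeared in the training text still gets a non-zero probability, which is the sparse-data problem noted for the first model.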
Plan and Goals for new week:
- Discuss the models with supervisor
- SVM problem
- Programming on first model
Leng Tan
Progress and Status this week:
- Done a design for the WRI code after discussion with group members.
- written about 50% of the code for data extraction using WRI.
- read a bit on SVM but still don't understand it.
Plan and Goals for new week:
- finish the coding for WRI.
- try to get help for SVM.
Tien-en Phua
Progress and Status this week:
- Finished designing the algorithm (pseudo-code) for function word frequency in Java.
- Started implementing the algorithm code.
- Code is halfway done.
Plan and Goals for new week:
- Finish coding.
- Discuss about SVM problems.
Semester 2, Week 7
Jie Dong
Progress and Status this week:
- Reading the chapter on hidden Markov chains in "Statistical Language Learning"
- Came up with my own test text to verify that my code is working properly
- Meeting with Brian to discuss my current work; the current approach does not work efficiently
Plan and Goals for new week:
- The previous algorithm only considers the effect of word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of trigrams appear only once, and only about 10% of the information is worth using in classification. When the common trigrams are extracted from several test texts, few of them are left. Hence an enhanced model, in which unigrams and bigrams are also taken into consideration, will be tested in the following week.
- SVM will also be used to test the result in the coming week. Investigating how to use the SVM functions in MATLAB, svmtrain and svmclassify (Bioinformatics Toolbox)
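The sparsity observation above (most trigrams appearing only once) can be measured with a short counting routine. This is a minimal sketch, assuming whitespace-tokenised words; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: measure how many distinct word trigrams occur exactly once. */
class TrigramSparsity {
    /** Counts each word trigram in a tokenised text. */
    static Map<String, Integer> countTrigrams(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 2; i < words.length; i++) {
            String trigram = words[i - 2] + " " + words[i - 1] + " " + words[i];
            counts.merge(trigram, 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of distinct trigrams that appear only once (0 if there are none). */
    static double singletonShare(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        long singles = counts.values().stream().filter(c -> c == 1).count();
        return (double) singles / counts.size();
    }

    public static void main(String[] args) {
        String[] words = "a b c a b c d".split(" ");
        System.out.println(singletonShare(countTrigrams(words)));
    }
}
```

A high share on real texts would be consistent with the note that few trigrams are common to several texts.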
Leng Tan
Progress and Status this week:
- Finished the Java coding for the WRI technique in the data extraction algorithm.
- Tested and verified that the code works properly using a small test file (a text file with only a few sentences).
- Had a meeting with Brian to discuss the SVM input and output.
Plan and Goals for new week:
- Figure out SVM.
- Test and try out SVM in MATLAB using small test files.
Tien-en Phua
Progress and Status this week:
- Completed coding for data extraction algorithm (DEA)
- Discussed how to pass the output data from the DEA to the SVM
- Analysed how other researchers analyse their data
Plan and Goals for new week:
- Modification and refining of DEA code
- Continue analysis of how other researchers used this DEA for authorship attribution
- Try applying data to SVM
Semester 2, Week 8
Jie Dong
Progress and Status this week:
- Peer review assessment on the design document on "Audio assisted vision system"
- Investigation on SVM in MATLAB
- Working on modified trigram model
Plan and Goals for new week:
- Test the results of my Java program with SVM
Leng Tan
Progress and Status this week:
- Received a stage 1 design document on "Audio Assisted Vision System for Visually Impaired People".
- The document was read in full and notes were taken on presentation and various other aspects.
- The document was reviewed and a formal peer review report was produced.
- Investigation of SVM in MATLAB was halted for a while due to the peer review report.
Plan and Goals for new week:
- Figure out SVM.
- Test and try out SVM in MATLAB using small test files.
Tien-en Phua
Progress and Status this week:
- Completed the coding of the Data Extraction Algorithm. It is able to load a file, remove punctuation and create a new output file as Support Vector Machine input
- Reviewed the peer document and did some research on the principles of echolocation performed by bats to understand the document
- Completed peer review of "Audio Assisted Vision System for Visually Impaired People"
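The data extraction steps listed above (remove punctuation, then produce numeric features for the SVM) can be sketched for function word frequency as follows. This is a minimal sketch, not the project's code: the five-word function word list and all names are hypothetical, file loading is left out, and every non-letter character is simply replaced by a space.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of function word frequency extraction from a single text. */
class FunctionWordExtractor {
    // Hypothetical function word list (assumption); a real list would be longer.
    static final List<String> FUNCTION_WORDS = Arrays.asList("the", "of", "and", "to", "in");

    /** Lower-cases the text, replaces non-letters with spaces and splits into words. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z\\s]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    /** Relative frequency of each function word, in the order of FUNCTION_WORDS. */
    static double[] frequencies(String text) {
        String[] words = tokenize(text);
        double[] freq = new double[FUNCTION_WORDS.size()];
        if (words.length == 0) {
            return freq;
        }
        for (String word : words) {
            int index = FUNCTION_WORDS.indexOf(word);
            if (index >= 0) {
                freq[index]++;
            }
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= words.length;
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(frequencies("The cat, and the dog.")));
    }
}
```

Dividing by the total word count keeps texts of different lengths comparable, which is a design choice of this sketch rather than something stated in the log.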
Plan and Goals for new week:
- Apply the data generated by the data extraction algorithm to the support vector machine
- Determine progress of project and review schedule.
Semester 2, Week 9
Jie Dong
Progress and Status this week:
- A hidden Markov model is implemented in Java, and the program produces a table of probabilities for some common trigrams from the input texts. The current problem is that, because I am feeding every word that appears in the texts into the program, there are few common trigrams among a given number of input texts. For example, with a total of 20 input texts from two authors, the authors had just one trigram in common. I also set the program to allow only part of the texts to share a trigram and to assign zero probability to that trigram for the others, but the result is still not efficient.
- Reading through Talis's trigram description and code, I found that he simplified the method and extracted the key features by deleting the non-key words. Testing his idea in Java code, I found that it does extract a lot more information than mine. However, it also raises the question of whether it reduces classification accuracy, since it changes the original text into another. This simplification needs to be justified.
- The output of the extraction algorithm was fed into the MATLAB SVM functions (svmtrain and svmclassify), which shows that my extraction algorithm is not working properly: sometimes the predicted authors for the chosen texts are correct and sometimes they are not. As for the SVM functions themselves, they only support classification into two groups, and multi-group classification produces an error. In addition, they can only plot the SVM structure for two-dimensional data. Hence, more capable SVM toolboxes should be studied.
Plan and Goals for next week:
- GUI design
- Test efficiency using different groups of input texts
- Try another SVM toolbox from: http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html
Leng Tan
Progress and Status this week:
- A basic SVM code that receives a text file as input has been produced.
- The SVM code needs 2 training data groups and a number of test data groups.
- The standard format for the SVM input was decided by the team members.
- The input will be an M×N matrix in which the first column is the author and the subsequent columns are the data (in my case, standard deviations).
- Initial data uses 20 standard deviation columns.
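One row of the input matrix described above can be sketched as follows. This is a sketch under a loud assumption: the log only says the WRI columns hold standard deviations, so the code assumes each column is the standard deviation of the gaps between successive occurrences of one keyword. The class name, method names and numeric author label are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: build one SVM input row, author label first, then one value per keyword. */
class WriFeatureRow {
    /** Population standard deviation of the gaps between occurrences of a keyword. */
    static double intervalStdDev(String[] words, String keyword) {
        List<Integer> gaps = new ArrayList<>();
        int last = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(keyword)) {
                if (last >= 0) {
                    gaps.add(i - last);
                }
                last = i;
            }
        }
        if (gaps.isEmpty()) {
            return 0.0; // assumption: fewer than two occurrences gives 0
        }
        double mean = 0.0;
        for (int gap : gaps) {
            mean += gap;
        }
        mean /= gaps.size();
        double variance = 0.0;
        for (int gap : gaps) {
            variance += (gap - mean) * (gap - mean);
        }
        return Math.sqrt(variance / gaps.size());
    }

    /** Row layout: column 0 is the author label, columns 1..N are the features. */
    static double[] row(int authorLabel, String[] words, String[] keywords) {
        double[] row = new double[keywords.length + 1];
        row[0] = authorLabel;
        for (int k = 0; k < keywords.length; k++) {
            row[k + 1] = intervalStdDev(words, keywords[k]);
        }
        return row;
    }

    public static void main(String[] args) {
        String[] words = "a x a x x x a".split(" ");
        System.out.println(Arrays.toString(row(2, words, new String[] {"a", "z"})));
    }
}
```

Stacking one such row per text gives the M×N matrix, with M texts and N equal to the number of keywords plus one.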
Plan and Goals for next week:
- The SVM does predict the author wrongly and this needs to be resolved.
- Might be due to insufficient training data.
- Further testing is required.
- Might consider implementing GUI.
- Need to have a meeting with supervisors on progress and GUI implementation (can the Java and MATLAB GUIs be combined?)
Tien-en Phua
Progress and Status this week:
- Researched statistical software for obtaining the covariance of data: StatGraphics
- Downloaded and installed the chosen software and attempted to operate the program
- Researched a book discussing the possible author of Hebrews: "Nacsbt: Lukan Authorship Of Hebrews"
Plan and Goals for next week:
- Obtain the covariance of the data
- Check to see if the data extraction algorithm produces results similar to Talis's
- Produce code to "chop" all text files to a specific length for analysis
- Input data to SVM and observe the outcome
- Combine functions for analysis
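The planned "chop" step can be sketched as splitting a tokenised text into chunks of equal length. This is a minimal sketch: the names are hypothetical, and dropping the incomplete chunk at the end is an assumption made so that every chunk has exactly the same number of words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: chop a tokenised text into equal-length chunks of words. */
class TextChopper {
    /** Returns consecutive chunks of chunkSize words; a shorter remainder is dropped. */
    static List<String[]> chop(String[] words, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start + chunkSize <= words.length; start += chunkSize) {
            chunks.add(Arrays.copyOfRange(words, start, start + chunkSize));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] words = "one two three four five six seven".split(" ");
        for (String[] chunk : chop(words, 3)) {
            System.out.println(String.join(" ", chunk));
        }
    }
}
```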
Semester 2, Week 10
Jie Dong
Progress and Status this week:
- Original Java program is rebuilt as a standard Eclipse project
- Delete Transition class, no longer used
- Change three classes (State, Gram, Record) to inner classes
- Reduce the original three main methods in separate classes to a single one in the Driver class
- Move methods for user input to the Driver class, including parameters and paths
- Add three header lines to the Java program output: number of texts, number of disputed texts, number of trigrams used
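The three header lines can be sketched as a small value class that writes and reads them. This is a sketch only: the log does not give the exact file layout, so one integer per line is an assumption, and the class and field names are hypothetical.

```java
/** Sketch: the three header lines placed at the top of the extraction output. */
class ExtractionHeader {
    final int texts;     // number of texts
    final int disputed;  // number of disputed texts
    final int trigrams;  // number of trigrams used

    ExtractionHeader(int texts, int disputed, int trigrams) {
        this.texts = texts;
        this.disputed = disputed;
        this.trigrams = trigrams;
    }

    /** Writes the header as three lines, one integer per line (assumed layout). */
    String toLines() {
        return texts + "\n" + disputed + "\n" + trigrams + "\n";
    }

    /** Reads the header back from the first three lines of the output. */
    static ExtractionHeader parse(String output) {
        String[] lines = output.split("\n", 4);
        if (lines.length < 3) {
            throw new IllegalArgumentException("expected three header lines");
        }
        return new ExtractionHeader(
                Integer.parseInt(lines[0].trim()),
                Integer.parseInt(lines[1].trim()),
                Integer.parseInt(lines[2].trim()));
    }

    public static void main(String[] args) {
        String output = new ExtractionHeader(20, 2, 30).toLines() + "0.1 0.2 0.3\n";
        ExtractionHeader header = parse(output);
        System.out.println(header.texts + " " + header.disputed + " " + header.trigrams);
    }
}
```

A reader that knows these three numbers can split the rows that follow into training and test sets without manual parameters.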
Plan and Goals for next week:
- Standardise three algorithms into one project folder
- Use same training data, unknown data to test three extraction algorithms
- Compare their accuracies in different situations (number of key words, number of texts, etc.)
Leng Tan
Progress and Status this week:
- had a meeting with the supervisors and reported on the progress of the project.
- SVM code remains the same for the time being.
- A table of results should be produced to compare the data extraction algorithms.
- the main idea of the progress report was discussed.
Plan and Goals for next week:
- A standardised template to combine all 3 data extraction algorithms was discussed.
- WRI code needs to be slightly modified.
- need to plan the initial design for the GUI.
Tien-en Phua
Progress and Status this week:
- Modify code to accept multiple inputs
- Extracted the Federalist Papers for testing on the support vector machine using function word analysis
- Meeting with supervisors on Wednesday for progress updates and guidance on next step
- Commencement of progress report
Plan and Goals for next week:
- Produce a table of results displaying the accuracy of the algorithm with each SVM kernel function
- Complete progress report: project background, project specification, progress thus far and project management
- Combine the three algorithms into a single driver file
- Discuss and design possible implementation of a GUI
Semester 2, Week 11
Jie Dong
Progress and Status this week:
- Update progress report
- Java program modification:
- Sort the list of files read in by name
- Replace manual parameter setup with automatic reading of data, forming the training set and testing set according to the three header lines
 
Plan and Goals for next week:
- Write a standard document for combining our Java extraction programs
- Complete Progress report
Leng Tan
Progress and Status this week:
- Do progress report.
Plan and Goals for next week:
- catch up on assignments and prepare for exams.
Tien-en Phua
Progress and Status this week:
- Update of progress report
Plan and Goals for next week:
- Complete 4 upcoming assignments
- Prepare for power system quiz
Semester 2, Week 12
Jie Dong
Progress and Status this week:
- Rough draft of the standard for the Java extraction programs, sent to the other group members
- Modify progress report and upload to Wiki
Plan and Goals for next week:
- Stop project for a period of time to prepare for exams
Leng Tan
Progress and Status this week:
- assignments due this week are completed.
Plan and Goals for next week:
- Stop project as exams are coming.
Tien-en Phua
Progress and Status this week:
- Completed all assignments due this week
Plan and Goals for next week:
- Need to prepare for exams. SWOT week next week.
- Project will "pause" until after the exam period, 20 Nov 2010. Thereafter the team will be working individually back in their home countries and updating each other via email
Semester 1, Week 1
Jie Dong
Progress and Status this week:
- Had a small discussion with the team members and worked on SVM.
- Modified SVM program to support multi-group classification
- Tested the accuracy of the whole classifying program with English texts
- Generated accuracy table with respect to three variables: tolerance, number of key words and kernel function (linear, quadratic, rbf, polynomial)
Plan and Goals for new week:
- Discuss with supervisor about the performance of current program and suggest ways to increase accuracy
- Apply interface developed by Joel
Leng Tan
Progress and Status this week:
- Brief discussion with team members on the project.
- the English texts were used to test the accuracy of the program.
- Tried different SVM kernel functions while testing the accuracy.
Plan and Goals for Next Week
- Organize a meeting with the supervisors for updates.
- discuss a constant text length with Joel.
- try to combine the code.
Tien-en Phua
Progress and Status this week:
- Conducted a brief meeting with team members to further evaluate SVM.
- Modified program to use arrays and ArrayList instead of function word objects, improving resource management and run time
- Modified program to take a large amount of data as input instead of a single file
- Modified program to create a new folder to store all temporary (or modified) data, reducing the clutter in the parent folder
- Tested program using the Federalist Papers
Plan and Goals for new week:
- Have a meeting with supervisors showing the results.
- Further testing
Semester 1, Week 2
Jie Dong
Progress this week
- Met up with supervisors
- Applied the trigram model algorithm to the 170 English texts and tested the accuracy of the SVM for the trigram Markov model
- Numbers of key words used in the tests: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
- Four different kernel functions were used: linear, quadratic, rbf, polynomial. The linear kernel function performed best of the four. However, the accuracy is still very low, about 50%.
Goals for next week
- The effect of punctuation in the text should be taken into consideration, such as "-" and "'"
- Modify trigram software
- Further testing
Leng Tan
Progress this week
- Met up with supervisor
- applied the algorithm to the 170 English texts and tested the accuracy of the SVM for WRI
- applied different kernel functions and observed the different results
- developed a word count program for text
Goals for next week
- Modify the delete punctuation method in the interface (see minutes report number 10 for specs)
- implement interface
- change the number of keywords (currently 20; try 5, 10, 15, 20, 25 and observe the difference)
- start using the New Testament as test data
Tien-en Phua
Progress this week
- Had meeting with supervisors
- Applied algorithm to the 170 English texts
- Applied algorithm to the 85 Federalist Papers
- Monitor project progress and re-evaluate the project milestone and timeline
- Develop software for chopping text
- Develop software to count total words of text and also the number of occurrence of each word for better text analysis
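The word counting tool mentioned above can be sketched as follows. This is a minimal sketch with hypothetical names. It assumes that "-" and "'" inside a word are kept, so that words such as "well-known" count as one word; a dash standing alone between spaces would be counted as a token in this simple version.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: count total words and the occurrences of each word in a text. */
class WordCounter {
    /** Occurrences of each lower-cased word; hyphens and apostrophes are kept. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("[^a-z'-]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Total number of words, i.e. the sum of all occurrence counts. */
    static int total(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("The dog and the well-known cat.");
        System.out.println(counts);
        System.out.println(total(counts));
    }
}
```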
Goals for next week
- Identify the reason for incorrect classifications
- Further testing to ensure correct operation
- Study the Greek alphabet
Semester 1, Week 3
Jie Dong
Progress and Status this week:
- Considered the meaning of punctuation appearing in the English texts, especially "-" and "'".
- Content not written by the author, such as chapter numbers and titles, should be removed before extraction.
- Tested the effect of the above modifications
Goals Next Week
- Prepare test data using Federalist Paper
- Prepare test data using Greek text
Leng Tan
Progress this week
- Developed a program to count the total number of words that contain "-" and "'"
- Implemented interface made by Joel
- Modified the WRI method and changed the threshold of the number of keywords.
Goals for next week
- Try to improve the accuracy.
Tien-en Phua
Progress This Week
- Analysed the results for the Federalist and 170 English texts
- Continued developing auxiliary software (i.e. CountWord program, Punctuation program)
- Research on ways to balance the training data to SVM
Goals Next Week
- Continue testing on the Federalist and 170 English texts
- Aim to achieve 70% accuracy
- Standardize the training data to SVM
Semester 1, Week 4
Jie Dong
Progress and Status this week:
Goals Next Week
Leng Tan
Progress This Week
- Had a meeting with supervisors.
- English text achieves only around 25-30% accuracy.
- Study the New Testament.
Goals Next Week
- Try to find the Greek file for the New Testament.
- try using the Federalist text.
Tien-en Phua
Progress This Week
- Develop a method of normalizing text.
- Ran test on the 170 English texts. Obtained 100% accuracy
- Ran test on the Federalist text. Obtained 91% accuracy
Goals Next Week
- Obtain a full set of Greek text
- Chop Greek text accordingly
- Require further testing and analysis
- Apply Greek text accordingly
Semester 1, Week 5
Jie Dong
Progress and Status this week:
- Algorithm update:
- The new version of the trigram extraction algorithm inserts a "#" before each sentence and a "$" after it. For example, the string "Today is a good day. I want to go to picnic." becomes, after the TextEditor class, "# Today is a good day $ # I want to go to picnic $"
- The motivation for this modification is that in an English text each sentence is relatively independent of the others. In the example above, "......a good day. I want ......", there is no need to calculate the probability of "I" appearing after the bigram "good day". Instead, an author's writing habits are better characterised by the probability of "I" appearing at the start of a sentence, i.e. after the bigram "$ #". Likewise, the probability of a word appearing at the end of a sentence is important to know as well, that is "day $ #". In addition, this method reveals how often a specific word is used in one sentence
- The delimiter "." is used to determine the beginning and end of a sentence. In future, with further study of English text characteristics, more delimiters might be added
 
- Generated classification results based on the Federalist text.
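The sentence marking step in the algorithm update above can be sketched in a few lines. This is a minimal sketch, not the project's TextEditor class: the class and method names are hypothetical, and only "." is treated as a sentence delimiter, as in the description.

```java
/** Sketch: wrap each sentence in "#" (start) and "$" (end) markers. */
class SentenceMarker {
    /** Splits on "." and surrounds every non-empty sentence with "# ... $". */
    static String mark(String text) {
        StringBuilder out = new StringBuilder();
        for (String sentence : text.split("\\.")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces, e.g. between consecutive full stops
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append("# ").append(trimmed).append(" $");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("Today is a good day. I want to go to picnic."));
        // prints: # Today is a good day $ # I want to go to picnic $
    }
}
```

With the markers in place, the trigram "day $ #" captures a sentence ending in "day", and the bigram "$ #" followed by a word captures that word starting a sentence.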
Goals Next Week
- Perform more tests on different disputed texts
- Try another key word selection algorithm, based on frequency of occurrence
Leng Tan
Progress This Week
- Tried using the Federalist text. The best results give accuracy of up to 80% when threshold = 10 and data dimension = 25. This might be due to the short text length of the Federalist text.
- It is noted that WRI works better without normalization.
- Found a Greek file for the New Testament, but not sure if it is the right one.
Goals Next Week
- Do the Federalist text again with different disputed texts.
- Redo the English text with normalization.
Tien-en Phua
Semester 1, Week 6
Jie Dong
Leng Tan
Tien-en Phua
See also
- Authorship detection: 2010 group
- Authorship detection: Who wrote the Letter to the Hebrews?
- Minutes of Meeting 2010: Who wrote the Letter to the Hebrews?
- Critical design review 2010: Who wrote the Letter to the Hebrews?
- Progress Report 2010: Who wrote the Letter to the Hebrews?
- Final report 2010: Who wrote the Letter to the Hebrews?