Difference between revisions of "Authorship detection: 2010 group"

From Derek
(See also)
(See also)
 
(86 intermediate revisions by 3 users not shown)
Line 293: Line 293:
 
# The previous algorithm considers only word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of trigrams appear only once, and only about 10% of the information is worth using in classification. After extracting the trigrams common to several test texts, few are left. Hence an enhanced model, which also takes unigrams and bigrams into consideration, will be tested in the following week.
 
# The previous algorithm considers only word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of trigrams appear only once, and only about 10% of the information is worth using in classification. After extracting the trigrams common to several test texts, few are left. Hence an enhanced model, which also takes unigrams and bigrams into consideration, will be tested in the following week.
 
# SVM will also be used to test the results in the coming week. Investigating how to use the MATLAB SVM functions svmtrain and svmclassify (Bioinformatics Toolbox)
 
# SVM will also be used to test the results in the coming week. Investigating how to use the MATLAB SVM functions svmtrain and svmclassify (Bioinformatics Toolbox)
 
+
# Peer review assessment
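
As a rough illustration of the trigram observation above, the share of distinct word trigrams that occur only once can be measured as follows. This is a minimal Python sketch, not the group's Java program; the function names and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def word_trigrams(text):
    """Return every run of three consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def singleton_fraction(text):
    """Fraction of distinct trigrams that occur exactly once."""
    counts = Counter(word_trigrams(text))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical sample text, chosen only to show repeated and unique trigrams.
sample = "the cat sat on the mat and the cat sat on the rug"
print(singleton_fraction(sample))
```

A high singleton fraction on real texts would match the note above that most trigrams carry little information for classification.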
 
====Leng Tan====
 
====Leng Tan====
  
Line 340: Line 340:
 
====Tien-en Phua====
 
====Tien-en Phua====
 
'''Progress and Status this week:'''
 
'''Progress and Status this week:'''
# Complete the Data Extraction Algorithm
+
# Complete the coding of the Data Extraction Algorithm. It can load a file, remove punctuation, and create a new output file as Support Vector Machine input
# Completed Peer Review
+
# Reviewed the peer document and researched the principles of echolocation in bats in order to understand it
# Review Peer Document
+
# Completed Peer Review on Audio Assisted Vision System For Visually Impair People
  
 
'''Plan and Goals for new week:'''
 
'''Plan and Goals for new week:'''
# Apply data to SVM
+
# Apply the data generated by the data extraction algorithm to the Support Vector Machine
# Determine progress of project
+
# Determine progress of project and review schedule.
  
 
===Semester 2, Week 9===
 
===Semester 2, Week 9===
Line 377: Line 377:
 
====Tien-en Phua====
 
====Tien-en Phua====
 
'''Progress and Status this week:'''
 
'''Progress and Status this week:'''
# Research for statistical software for obtaining the covariance of data
+
# Research for statistical software for obtaining the covariance of data [http://www.statgraphics.com/ StatGraphics]
 
# Downloaded and installed the chosen software and attempted to operate the program
 
# Downloaded and installed the chosen software and attempted to operate the program
# Research on a book discussing the possible author of Hebrews
+
# Research on a book discussing the possible author of Hebrews [http://orders.koorong.com/search/product/view.jhtml?code=9780805447149 Nacsbt: Lukan Authorship Of Hebrews]
 
'''Plan and Goals for next week:'''
 
'''Plan and Goals for next week:'''
 
# Obtain the covariance of the data
 
# Obtain the covariance of the data
Line 389: Line 389:
 
===Semester 2, Week 10===
 
===Semester 2, Week 10===
 
====Jie Dong====
 
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Original Java program is rebuilt as a standard Eclipse project
 +
# Delete Transition class, no longer used
 +
# Change three classes (State, Gram, Record) to inner classes correspondingly
 +
# Reduce the original three main methods in separate classes to a single one in the Driver class
 +
# Move methods for User inputs to Driver class, including parameters and paths
 +
# Add three header lines to the Java program output: number of texts, number of disputed texts, number of trigrams used
 +
'''Plan and Goals for next week:'''
 +
# Standardise three algorithms into one project folder
 +
# Use same training data, unknown data to test three extraction algorithms
 +
# Compare their accuracies in different situations (number of key words, number of texts, etc.)
 +
 
====Leng Tan====
 
====Leng Tan====
 
'''Progress and Status this week:'''
 
'''Progress and Status this week:'''
Line 402: Line 414:
  
 
====Tien-en Phua====
 
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Modify code to accept multiple inputs
 +
# Extract the Federalist Papers for testing on the support vector machine using function word analysis
 +
# Meeting with supervisors on Wednesday for progress updates and guidance on next step
 +
# Commencement of progress report
 +
 +
'''Plan and Goals for next week:'''
 +
# Produce a table of results showing the accuracy of the algorithm for each SVM kernel function
 +
# Complete progress report, project background, project specification, progress thus far and project management
 +
# Combine the three algorithms into a single driver file
 +
# Discuss and design possible implementation of a GUI
  
 
===Semester 2, Week 11===
 
===Semester 2, Week 11===
 
====Jie Dong====
 
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Update progress report
 +
# JAVA program modification:
 +
#* Sort list of files read in according to their name order
 +
#* Replace manual parameter setup with automatic reading of the data, forming the training set and testing set according to the three header lines
 +
 +
'''Plan and Goals for next week:'''
 +
# Write a standard document to combine our java extraction program together
 +
# Complete Progress report
 +
 
====Leng Tan====
 
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Do progress report.
 +
 +
'''Plan and Goals for next week:'''
 +
# catch up on assignments and prepare for exams.
 +
 
====Tien-en Phua====
 
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Update of progress report
 +
 +
'''Plan and Goals for next week:'''
 +
# Complete 4 upcoming assignments
 +
# Prepare for power system quiz
  
 
===Semester 2, Week 12===
 
===Semester 2, Week 12===
 
====Jie Dong====
 
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Rough draft for java extraction program standard and send other group member the standard
 +
# Modify progress report and upload to Wiki
 +
 +
'''Plan and Goals for next week:'''
 +
# Stop project for a period of time to prepare for exams
 +
 
====Leng Tan====
 
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Assignments due this week are completed.
 +
 +
'''Plan and Goals for next week:'''
 +
# Stop project as exams are coming.
 +
 
====Tien-en Phua====
 
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Completed all assignments due this week
 +
 +
'''Plan and Goals for next week:'''
 +
# Need to prepare for exams. SWOT week next week.
 +
* Project will "pause" till after exam period, 20 Nov 2010, thereafter the team will be working individually back in their home country and update each other via email
 +
 +
===Semester 1, Week 1===
 +
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Had a small discussion with the team members and work on SVM.
 +
# Modify SVM program to support multi-group classification function
 +
# Test the accuracy of the whole classifying program with English texts
 +
# Generate accuracy table with respect to three different variables: tolerance, number of key words and kernel function (linear, quadratic, RBF, polynomial)
 +
'''Plan and Goals for new week:'''
 +
# Discuss with supervisor about the performance of current program and suggest ways to increase accuracy
 +
# Apply interface developed by Joel
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Brief discussion with team members on the project.
 +
# The English texts were used to test the accuracy of the program.
 +
# Tried different SVM kernel functions while testing the accuracy.
 +
 +
''' Plan and Goals for Next Week '''
 +
# Organize a meeting with the supervisors for updates.
 +
# Discuss a constant text length with Joel.
 +
# Try to combine the code.
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Conduct a brief meeting with team members to further evaluate on SVM.
 +
# Modified program to use arrays and ArrayList instead of function word objects, improving resource management and run time
 +
# Modified program to take in large amount of data as input instead of a single file
 +
# Modified program to create a new folder to store all temporary (or modified) data. Reduce the clutter in the parent folder
 +
# Test program using the federalist papers
 +
 +
'''Plan and Goals for new week:'''
 +
# Have a meeting with supervisors showing the results.
 +
# Further testing
 +
 +
===Semester 1, Week 2===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Met up with supervisors
 +
# Applied the trigram model algorithm to the 170 English texts and tested the accuracy of the SVM for the trigram Markov model
 +
# Number of key words used in the test are 5,10,15,20,25,30,35,40,45,50
 +
# Four different kernel functions were used: linear, quadratic, RBF and polynomial. The linear kernel function performed best of the four. However, the accuracy is still very low, at about 50%.
 +
'''Plan and Goals for new week:'''
 +
# The effect of punctuations in the text should be taken into consideration, such as "-" and "'"
 +
# Modified Trigram software
 +
# Further testing
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Met up with supervisor
 +
# Applied the algorithm to the 170 English texts and tested the accuracy of the SVM for WRI
 +
# Applied different kernel functions and observed the different results
 +
# Developed a word count program for text
 +
'''Plan and Goals for new week:'''
 +
# Modify the delete-punctuation method in the interface (see minutes report number 10 for specs)
 +
# Implement interface
 +
# Change the number of keywords (currently 20; try 5, 10, 15, 20, 25 and observe the difference)
 +
# Start using the New Testament as test data
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Had meeting with supervisors
 +
# Applied algorithm to the 170 English texts
 +
# Applied algorithm to the 85 Federalist Papers
 +
# Monitor project progress and re-evaluate the project milestone and timeline
 +
# Develop software for chopping text
 +
# Develop software to count total words of text and also the number of occurrence of each word for better text analysis
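
The word-counting step described above can be sketched as follows. This is a minimal Python illustration, not the group's own software; the function name `count_words` and the tokenising pattern (letters, with internal hyphens or apostrophes kept inside a word) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, optionally joined by "-" or "'".
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def count_words(text):
    """Return (total number of words, Counter of occurrences per word)."""
    words = WORD_PATTERN.findall(text.lower())
    return len(words), Counter(words)


total, occurrences = count_words("The dog saw the cat. The cat ran.")
print(total)
print(occurrences.most_common(2))
```

Returning both the total and the per-word counts lets later steps turn raw counts into relative frequencies without rereading the text.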
 +
 +
'''Plan and Goals for new week:'''
 +
# Identify the reason for incorrect classifications
 +
# Further testing to ensure the correct operation
 +
# Study the Greek alphabet
 +
 +
===Semester 1, Week 3===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Get rid of the concept of tolerance.
 +
# Consider the meaning of punctuation appearing in the English texts, especially "-" and "'".
 +
# Content not written by the author, such as chapter numbers and titles, should be removed before extraction.
 +
# Test the effect of the above modifications
 +
 +
 +
'''Goals Next Week'''
 +
# Prepare test data using Federalist Paper
 +
# Prepare test data using Greek text
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Developed a program to count the total number of words that contained "-" and "'"
 +
# Implemented interface made by Joel
 +
# Modified the WRI method and changed the threshold of the number of keywords.
 +
'''Plan and Goals for new week:'''
 +
# Try to improve the accuracy.
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Analyse the results for the Federalist and 170 English Text
 +
# Continue developing auxiliary software (ie CountWord program, Punctuation program)
 +
# Research on ways to balance the training data to SVM
 +
'''Plan and Goals for new week:'''
 +
# Continue testing on Federalist and 170 English Text
 +
# Aim to achieve a 70% accuracy
 +
# Standardize the training data to SVM
 +
 +
===Semester 1, Week 4===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Run program modified last week on English data set
 +
# Varying threshold and size of training data
 +
# Achieve a classification accuracy of around 80%
 +
# Helped a group member prepare the Federalist data set
 +
 +
'''Plan and Goals for new week:'''
 +
# Study the cause of unsatisfactory classification accuracy and try to improve it
 +
# Perform similar tests on Federalist Paper
 +
# Discuss results with other group member, and see their algorithm performance
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Had a meeting with supervisors.
 +
# English text achieves only around 25-30% accuracy.
 +
# Study the New Testament.
 +
'''Plan and Goals for new week:'''
 +
# Try to find the Greek file for the New Testament.
 +
# Try using the Federalist Text.
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Develop a method of normalizing text.
 +
# Run test on 170 English Text. Obtained a 100% accuracy
 +
# Run test on Federalist Text. Obtained a 91% accuracy
 +
 +
'''Plan and Goals for new week:'''
 +
# Obtain a full set of Greek text
 +
# Chop Greek text accordingly
 +
# Require further testing and analysis
 +
# Apply Greek text accordingly
 +
 +
===Semester 1, Week 5===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Algorithm update:
 +
#* The new version of the trigram extraction algorithm inserts a "#" before a sentence and a "$" after a sentence. For example, given the string "Today is a good day. I want to go to picnic.", the TextEditor class produces "# Today is a good day $ # I want to go to picnic $"
 +
#* The motivation for this modification is that in an English text each sentence is relatively independent of the others. In the example above, "......a good day. I want ......", it is not necessary to calculate the probability of "I" appearing after the bigram "good day". Instead, an author's writing habits are better characterised by the probability of "I" appearing at the start of a sentence, i.e. after the bigram "$ #". Likewise, the probability of a word appearing at the end of a sentence, as in "day $ #", is important to know. In addition, this method shows how often a specific word is used in one sentence
 +
#* To determine the beginning and end of a sentence, the delimiter "." is used. In the future, with further study of English text characteristics, more delimiters might be added
 +
# Generate classification results based on Federalist Text.
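
The sentence-boundary marking described in this entry can be sketched as follows. This is a minimal Python illustration, not the Java TextEditor class itself; the function names are assumptions, "." is the only delimiter (as stated above), and "$" is taken as the marker that closes every sentence.

```python
def mark_sentences(text):
    """Split on "." and wrap each non-empty sentence in "#" ... "$" tokens."""
    tokens = []
    for sentence in text.split("."):
        words = sentence.split()
        if words:
            tokens.extend(["#"] + words + ["$"])
    return tokens


def trigrams(tokens):
    """Return every run of three consecutive tokens."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


example = "Today is a good day. I want to go to picnic."
marked = mark_sentences(example)
print(" ".join(marked))
print(("$", "#", "I") in trigrams(marked))
```

With the markers in place, the trigram ("good", "day", "I") never occurs; the sentence boundary contributes ("day", "$", "#") and ("$", "#", "I") instead, which is the effect the entry argues for.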
 +
 +
 +
'''Plan and Goals for new week:'''
 +
# Perform more tests on different disputed texts
 +
# Try another key words selection algorithm: based on occurring frequency
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
Tried using Federalist Text.
 +
Best results give accuracy up to 70% when threshold = 10 and data dimension = 25. This might be due to the short text length of the Federalist Text.
 +
It is noted that WRI works better without normalization.
 +
Found a Greek file for the New Testament but not sure if it is the right one.
 +
 +
'''Plan and Goals for new week:'''
 +
Do Federalist Text again with different disputed text.
 +
Redo the English text with normalization.
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Analysis of the Federalist result, as it is most similar in style to the New Testament text
 +
# Namely, most of the Federalist Papers were written by Hamilton, and likewise the New Testament is mostly written by Paul, with a few books written by other authors such as Luke, John and Peter
 +
# Comparison of results with other feature extraction algorithm
 +
# After comparison of Function Word Analysis (FWA) and frequency occurrence of function words, the FWA proves to be a better algorithm as it produces more accurate results than frequency occurrence.
 +
# Using FWA reduces the need to chop text, so less data has to be "chunked" out.
 +
'''Plan and Goals for new week:'''
 +
# According to Gantt Chart, the implementation of controversies should take place next week.
 +
# Implement both FWA and frequency occurrence to the KJV text
 +
# Frequency occurrence should produce consistent results to Talis.
 +
 +
===Semester 1, Week 6===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# With the modification last week, I re-ran the test on English data set
 +
# The classification accuracy increased to 85% - 90%. The highest was achieved when threshold = 30
 +
# A clear trend can be observed: accuracy increases with the size of the training data, and as the threshold increases the accuracy first rises and then drops
 +
# Perform tests on Federalist paper, but the accuracy is very low, at about 35% average
 +
# Discuss with supervisor and group member with the result on Federalist paper
 +
# Since function word analysis performs well, it was suggested to combine part of it with the trigram model to enhance the algorithm
 +
'''Plan and Goals for new week:'''
 +
# Implement Trigram Markov model to select trigrams with "Golden Key words"
 +
# Start to prepare for final seminar
 +
# Achieve test results on King James Version
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# It was verified that the English text actually does not have any problem.
 +
# The results were not favourable. The prediction was very inconsistent, achieving a low accuracy rate of 53% most of the time.
 +
# It was suggested that the testing could be biased towards Madison, as only Madison texts were taken as testing data.
 +
# Compared with the earlier results on the English text, which used 100 training texts and 70 disputed texts, the accuracy and consistency were much lower.
 +
# WRI might not be suitable for authorship detection.
 +
# It was suggested to combine the Function Word Frequency developed by Joel to enhance the algorithm.
 +
 +
'''Plan and Goals for new week:'''
 +
# Examine the WRI algorithm with further testing.
 +
# Implement enhanced version of the WRI by combining the algorithm with function word frequency.
 +
# Prepare the powerpoint slides for the seminar.
 +
# Start to make the initial stage of the video.
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Analysis results of FWA on KJV
 +
# Analysis results of frequency occurrence on KJV
 +
# Frequency occurrence produces results consistent with Talis, listing Paul, Barnabas, Luke and Matthew as the possible authors
 +
# FWA produces different results; the reason will be discussed.
 +
# Discuss seminar structure with team
 +
# Delegate tasks to team members for seminar
 +
# Produce a uniform set of data for testing and results presentation
 +
 +
'''Plan and Goals for new week:'''
 +
# Consolidate results from English text, Federalist text and King James Version
 +
# Research on future improvement for FWA
 +
# Conduct a detailed literature review on the background of the New Testament
 +
 +
===Semester 1, Week 7===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Implemented Trigram Markov model to select trigrams with chosen function words
 +
# Discovered that trigrams containing the chosen function words usually occur more than once. Hence, a selection threshold is applied to them, similar to the function word selection method
 +
# Made draft of powerpoint slides on SVM and trigram part
 +
# Run classification test on King James Version of New Testament
 +
# Finalise performance results for English text, Federalist and KJV
 +
'''Plan and Goals for new week:'''
 +
# Combine slides made by all group members and modify them; slides should be finalised by next week
 +
# Practice makes perfect!!! :)
 +
# Discuss results of our own extraction algorithms among group members, make suggestion on potential modification
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Enhanced version of the WRI combined with function word frequency is done.
 +
# Get the results and arrive to a conclusion.
 +
# Prepare the powerpoint slides for the final seminar.
 +
# Started recording some video footage for the final year project video.
 +
 +
'''Plan and Goals for new week:'''
 +
# Examine the WRI algorithm with further testing.
 +
# Implement enhanced version of the WRI by combining the algorithm with function word frequency.
 +
# Complete the powerpoint slides for the seminar.
 +
# Start to make the initial stage of the video.
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Consolidate results from English text, Federalist text and King James Version
 +
# Research on future improvement for FWA
 +
## Calculate mean of a function word in a group of text by an author
 +
## Calculate the standard deviation of a function word in a group of text by an author
 +
## Consider calculating the probability of the occurrence of a function word by inputting the above parameters to SVM
 +
# Possible authors of Hebrews namely Apollos, Clement, Paul, Barnabas, Luke and Peter
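
The mean and standard deviation proposed above for a function word across one author's texts can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names are hypothetical, frequency is taken as the word's share of all words in a text, and the sample standard deviation is used. It does not show how the values would be passed to the SVM.

```python
from statistics import mean, stdev


def relative_frequency(text, word):
    """Share of the words in the text that are the given function word."""
    words = text.lower().split()
    return words.count(word) / len(words) if words else 0.0


def author_profile(texts, word):
    """Mean and sample standard deviation of the word's frequency.

    statistics.stdev needs at least two texts per author.
    """
    freqs = [relative_frequency(t, word) for t in texts]
    return mean(freqs), stdev(freqs)


# Hypothetical texts standing in for two works by the same author.
texts = ["the end of the", "the cat sat down"]
print(author_profile(texts, "the"))
```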
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete presentation slides
 +
# Practice presentation at least twice before seminar
 +
# Assist team members in analyzing their results
 +
 +
===Semester 1, Week 8===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Practice more times for the Final Year Seminar.
 +
# Final seminar on Thursday
 +
'''Plan and Goals for new week:'''
 +
# Run tests on Gospel of Luke and Acts of Apostles in KJV which were prepared by Joel.
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Final Year Project Seminar
 +
'''Plan and Goals for new week:'''
 +
# Need to further discuss on pre-processing of the texts before implementing feature extraction algorithm.
 +
# Run tests on Gospel of Luke and Acts of Apostles in Koine Greek which were prepared by Joel.
 +
# Discuss with Joel on automated function word.
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Project final seminar
 +
'''Plan and Goals for new week:'''
 +
# Discuss on the techniques for pre-processing of Koine Greek
 +
# Run test on Gospel of Luke and Acts of Apostles in Koine Greek
 +
# Obtain Koine Greek on possible authors of the letter to Hebrews
 +
 +
===Semester 1, Week 9===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Meeting with supervisor
 +
# Run test on English Version of Gospel of Luke
 +
# Run test on English Version of Acts of Apostles
 +
'''Plan and Goals for new week:'''
 +
# Perform the same tests on Koine Greek version of New Testament
 +
# Planning for final report
 +
# Think of ideas on the video
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Meeting with supervisor to discuss the final report
 +
# Run test on Gospel of Luke
 +
# Run test on Acts of Apostles
 +
'''Plan and Goals for new week:'''
 +
# Start planning for the final report
 +
# Discuss the video with the team
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Meeting with supervisor
 +
# Run test on Koine Greek + KJV to determine the author of the Gospel of Luke
 +
# Run test on Koine Greek + KJV to determine the author of the Acts of the Apostles
 +
# Analyse results and discuss with team
 +
'''Plan and Goals for new week:'''
 +
# Commence final report writing and discussion
 +
# Obtain set of text for Barnabas and Clement
 +
 +
===Semester 1, Week 10===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Discuss with the team on the structure of the final report
 +
# Validate trigram Markov model using Koine Greek version of New Testament: Luke and Acts
 +
# Predict potential authors for the Letter to the Hebrews
 +
# Write up section for Support Vector Machine
 +
# Start to write on Trigram Markov model
 +
'''Plan and Goals for new week:'''
 +
# Find people to do a brief proofreading of what I write
 +
# Complete the report
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Discuss with team on the video.
 +
# Discuss on the overall style of the report.
 +
# Completed a template to use for the final report.
 +
# Write on past research.
 +
# Write on project management.
 +
# Write on WRI.
 +
'''Plan and Goals for new week:'''
 +
# Proof read the report.
 +
# Complete the report.
 +
# Prepare for final exhibition poster
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Discussion with the team on the structure for the final report
 +
# Write up background of the letter of Hebrews
 +
# Write up background of the Bible
 +
# Write up on project aim, approach and report structure
 +
# Research on a standard set of corpus for team to work on
 +
# Set of texts of the Epistle of Barnabas and the First Epistle of Clement to the Corinthians obtained in Koine Greek.
 +
# Process the Koine Greek text to Beta Code
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete final report
 +
# Commence planning for exhibition
 +
 +
===Semester 1, Week 11===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Apply the common set of data to Trigram Markov model
 +
# Complete testing results for Trigram Markov Model to write up results for final report
 +
# Write section for Trigram Markov model and edit SVM part
 +
# Prepare appendix section
 +
# Discuss with Clement about layout of poster
 +
# Made a draft for our poster
 +
## Background, color theme, layout and detail section content
 +
## Draft for Introduction and Controversy
 +
## Flow diagram of our project approach
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete our poster with other members
 +
# Prepare for exhibition
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Complete results for WRI.
 +
# Touch up on the final report.
 +
# Preliminary discussion of the poster.
 +
'''Plan and Goals for new week:'''
 +
# Prepare the poster.
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Prepare a common set of data for team to write up results for final report
 +
* English Text: 156 texts, 26 per author (22 training, 4 disputed)
 +
* The Federalist Papers: 82 texts (65 training, 17 disputed)
 +
* King James Version
 +
* Koine Greek using Barnabas, Clement, John, Luke, Mark, Matthew, Paul, Peter
 +
# Write up on results for the English text and discussion
 +
# Write up on results for the Federalist Papers text and discussion
 +
# Write up on results for the King James Version and discussion
 +
# Write up on results for the Koine Greek and discussion
 +
# Write up abstract
 +
# Prepare appendix
 +
 +
'''Plan and Goals for new week:'''
 +
# Prepare for project exhibition
 +
 +
===Semester 1, Week 12===
 +
====Jie Dong====
 +
'''Progress and Status this week:'''
 +
# Complete final exhibition poster
 +
# Making flyers for final year exhibition
 +
# Start video editing for introduction and SVM process
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete the video
 +
 +
====Leng Tan====
 +
'''Progress and Status this week:'''
 +
# Complete the poster with the team
 +
# Make the flyers
 +
# Upload the final report to wiki format
 +
# Start the video editing for results and future application
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete the video
 +
 +
====Tien-en Phua====
 +
'''Progress and Status this week:'''
 +
# Start the video editing for SVM and three algorithms
 +
# Create new account for YouTube
 +
 +
'''Plan and Goals for new week:'''
 +
# Complete video and upload to YouTube
  
 
==See also==
 
==See also==
 
*[[Authorship detection: Who wrote the Letter to the Hebrews?]]
 
*[[Authorship detection: Who wrote the Letter to the Hebrews?]]
*[[Authorship detection: 2010 group]]
+
*[[Minutes of Meeting 2010: Who wrote the Letter to the Hebrews?]]
*[[Final report 2010: Who wrote the Letter to the Hebrews?]]
+
 
*[[Critical design review 2010: Who wrote the Letter to the Hebrews?]]
 
*[[Critical design review 2010: Who wrote the Letter to the Hebrews?]]
 
*[[Progress Report 2010: Who wrote the Letter to the Hebrews?]]
 
*[[Progress Report 2010: Who wrote the Letter to the Hebrews?]]
 +
*[[Final report 2010: Who wrote the Letter to the Hebrews?]]
 +
*[[Youtube Video Presentation 2010: Who wrote the Letter to the Hebrews?]]
  
 
==Back==
 
==Back==

Latest revision as of 22:07, 12 March 2012

Contents

Supervisors

Collaborators

Students

Weekly progress and questions

Semester 2, Week 1

Jie Dong

Progress and Status this week:

  1. First meeting with Derek, Brian and Maryam and the other group members, Leng and Tien-en.
  2. Derek, Brian and Maryam introduced us to the basic idea of this data mining project
  3. The idea of authorship detection was introduced
  4. Several applications to which data mining techniques can be applied were mentioned
  5. Research by past years' students was mentioned, and Maryam sent us several past research reports together with the code
  6. Research on the project, especially on SVM and some algorithms

Plan and Goals for new week:

  1. Prepare for the proposal seminar.
  2. Read research report from past years students.
  3. Understand project handbook.

Leng Tan

Progress and Status This Week

  1. The first meeting for the final year project was held with the supervisor, Prof Derek Abbott, the co-supervisor, Dr Brian Ng, and Mrs Maryam, along with the team members.
  2. The initial project scope was introduced and the general aim of the project was discussed.
  3. The basic idea of the techniques of authorship detection was shown as well.
  4. Several ideas for future applications of this project were highlighted.
  5. Some hints on getting started were given, namely to read Talis's final year report, which will be provided by Mrs Maryam.
  6. We were reminded of the first milestone of the project, the proposal seminar.

Plan and Goals for Next Week

  1. Fully read and understand Talis's report.
  2. Have a brief look at the code that will be supplied by Mrs Maryam.
  3. Do some research on the background of some controversial issues such as the works of William Shakespeare, the Federalist Papers and the Letter to the Hebrews.
  4. Read through the 2010 project handbook to get a rough idea of all the project milestones, focusing on the project seminar.

Tien-en Phua

Progress and Status this week:

  1. Met up with project supervisor, Prof Derek Abbott, co-supervisor, Dr Brian Ng, and Mrs Maryam
  2. Derek discussed the concept behind authorship detection
  3. Derek explained how multi-dimensional graphs link a disputed text to a known author.
  4. Discussed possible future applications. Brian suggested code plagiarism and possibly music.
  5. Was provided by Maryam with other students' projects and started to go through the report by Talis.
  6. Went through the FYP Project handbook

Plan and Goals for new week:

  1. Identify the methods Talis used in his report
  2. Research on various methods
  3. Read up on past works regarding authorship detection
  4. Research on controversy

Semester 2, Week 2

Jie Dong

Progress and Status this week:

  1. Three methods are chosen for this project: word frequency, word recurrence interval, and trigram Markov model
  2. Reading material on SVM (SVM tutorial)
  3. Play with SVM software on Matlab
  4. Prepare slides for proposal seminar presentation on project aim, background, and part of project process
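
Of the three methods named above, the word recurrence interval is the least self-explanatory. One common reading, assumed here rather than taken from the group's implementation, is the distance in words between successive occurrences of the same word. A minimal Python sketch with a hypothetical function name:

```python
def recurrence_intervals(text, word):
    """Distances, in word positions, between successive occurrences of word."""
    words = text.lower().split()
    positions = [i for i, w in enumerate(words) if w == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


sample = "the cat and the dog and the bird"
print(recurrence_intervals(sample, "the"))
print(recurrence_intervals(sample, "cat"))
```

A word that occurs fewer than twice yields an empty list, so short texts may give too few intervals to characterise an author.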

Plan and Goals for new week:

  1. Combine slides with other group member and do some modification
  2. Send slides draft to supervisor for feedback
  3. Do more modification
  4. Presentation on Thursday

Leng Tan

Progress and Status This Week

  1. Identified the 3 methods mentioned by Talis.
  2. Gained brief knowledge of the controversial issue.
  3. Have a brief idea of the upcoming proposal seminar.

Plan and Goals for Next Week

  1. Research on SVM.
  2. Research on the background history of the project
  3. Research on the different techniques used in the past
  4. Prepare project proposal

Tien-en Phua

Progress and Status this week:

  1. Identified the three methods that Talis applied in his project, namely Word Frequency, Word Recurrence and Trigram Markov
  2. Gained a brief understanding of how the three methods work
  3. Identified the past work done by other researchers.
  4. Identified three main controversies, namely the Federalist Papers, Shakespeare's plays and the Letter to the Hebrews

Plan and Goals for new week:

  1. Prepare for Project Proposal
  2. Develop Gantt chart, project budget and risk analysis
  3. Identify major milestones in project
  4. Write up on the controversies
  5. Further research on three methods

Semester 2, Week 3

Jie Dong

Progress and Status this week:

  1. In Monday's meeting we were introduced to Matthew and François-Pierre Huchet, who are also participating in this project.
  2. Produced the first complete draft of the proposal presentation slides. Discussed the role of each person.
  3. Sent slides to Brian and Matthew for feedback
  4. Modified our slides
  5. Presentation on Thursday

Plan and Goals for new week:

  1. Do more research on the three methods and SVM
  2. Prepare for stage 1 design document

Leng Tan

Progress and Status This Week

  1. rough draft slides on past research have been done for the proposal seminar.
  2. a comparison list of the different techniques is done.
  3. started research on SVM, to be added to the slides alongside the different techniques
  4. had a meeting with supervisors, and was introduced to Dr Matthew.
  5. focus 100% on the proposal seminar.

Plan and Goals for Next Week

  1. have a more detailed review on the 3 methods.
  2. read the criteria for the stage 1 design document.

Tien-en Phua

Progress and Status this week:

  1. Prepare for project proposal
  2. Developed gantt chart, project budget and risk analysis
  3. Developed slides for milestones and controversies
  4. Research on SVM (Support Vector Machine)
  5. Gain a better understanding of Word Frequency, WRI and Trigram Markov

Plan and Goals for new week:

  1. Proceed to develop Stage 1 Design Document
  2. Understand SVM
  3. Develop Work Breakdown Structure
  4. Delegate tasks to individual members
  5. Read up on the other 4 reports

Semester 2, Week 4

Jie Dong

Progress and Status This Week

  1. In this project, we plan to have each person working on one method -- I am working on Trigram Markov model
  2. Read past reports for trigram Markov information
  3. Make stage 1 design document template
  4. Write project aim, background, and project approach in design document

Plan and Goals for Next Week

  1. Modify the design document draft
  2. Send to supervisors for feedback
  3. More modification
  4. Prepare a tutorial on SVM for other group members

Leng Tan

Progress and Status This Week

  1. research on the 3 methods has been completed.
  2. fully read and understood the criteria for stage 1 design document.
  3. had a brief meeting with group members to delegate the tasks in preparing the stage 1 design document.


Plan and Goals for Next Week

  1. do a rough draft of the allocated tasks.
  2. do a layout design for the document.

Tien-en Phua

Progress and Status this week:

  1. Develop Work Breakdown Structure
  2. Identified tasks required for Stage 1 Design Document
  3. Broke down tasks and assigned them to each member
  4. In the process of development of Stage 1 Design Document
  5. Further research on SVM and Word Frequency

Plan and Goals for new week:

  1. Complete write up on Word Frequency and SVM
  2. Complete Stage 1 Design Document
  3. Coding and further research on Word Frequency
  4. Read up on the other 4 reports

Semester 2, Week 5

Jie Dong

Progress and Status this week:

  1. Done abstract, project aim, background and significance
  2. Done description of data extraction part for Trigram Markov model in design document
  3. Feedback from supervisors on design document
  4. Final modification on design document
  5. Format the design document on wiki

Plan and Goals for Next Week:

  1. Design on Trigram Markov model
  2. learn to use SVM
  3. a bit of coding on trigram Markov model

Leng Tan

Progress and Status this week:

  1. Done Literature Review of design document
  2. Done description of data extraction part for WRI in design document
  3. Done project approach and milestone for design document
  4. added modified WBS in appendix
  5. done initial check and compilation of Design document

Plan and Goals for Next Week:

  1. start a rough design of the WRI data extraction in Java
  2. read SVM

Tien-en Phua

Progress and Status this week:

  1. Completed design document
    • Project Requirements
    • Description of data extraction of Function Word Frequency analysis
    • Project Budget
    • Background and Significance of Hebrews
    • Edited Gantt Chart, WBS to synchronise
    • Edited and grammar check etc
  2. Basic layout of software design for data extraction algorithm
  3. Wiki page

Plan and Goals for Next Week:

  1. Commence programming of algorithm using Java
  2. Read up on SVM

Semester 2, Week 6

Jie Dong

Progress and Status this week:

  1. Research on Trigram Markov model
  2. Two models are proposed:
    • Simple Trigram Markov model: only considers the effect of trigrams in the text
    • Potential problem with the first model: sparse data; new trigrams appearing in the test text lead to poor cross entropy
    • Second model: Hidden Markov model on trigrams: unigram and bigram effects are taken into consideration as well as trigram counts. The transition probability is composed of all three probabilities.
  3. The existence of punctuation and uppercase letters should be considered for text written in English.
  4. Programming on text file input and exception handling in Java
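
A minimal sketch of how unigram, bigram and trigram estimates could be combined into one transition probability, as the second model describes. It uses linear interpolation, which is one common choice; the log does not state how the three probabilities were actually combined, and the class name, method names and weights below are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative interpolated trigram model (hypothetical names and weights). */
final class InterpolatedTrigram {
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalWords = 0;

    // Assumed interpolation weights; they must sum to 1.
    private static final double W_UNI = 0.1;
    private static final double W_BI = 0.3;
    private static final double W_TRI = 0.6;

    /** Counts every unigram, bigram and trigram in the word sequence. */
    void train(String[] words) {
        for (int i = 0; i < words.length; i++) {
            totalWords++;
            counts.merge(words[i], 1, Integer::sum);
            if (i >= 1) {
                counts.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
            }
            if (i >= 2) {
                counts.merge(words[i - 2] + " " + words[i - 1] + " " + words[i], 1, Integer::sum);
            }
        }
    }

    private int count(String key) {
        return counts.getOrDefault(key, 0);
    }

    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    /** Estimate of P(w3 | w1 w2) as a weighted sum of the three relative frequencies. */
    double probability(String w1, String w2, String w3) {
        double uni = ratio(count(w3), totalWords);
        double bi = ratio(count(w2 + " " + w3), count(w2));
        double tri = ratio(count(w1 + " " + w2 + " " + w3), count(w1 + " " + w2));
        return W_UNI * uni + W_BI * bi + W_TRI * tri;
    }
}
```

With this kind of mixing, a trigram that never occurred in training still receives a non-zero probability as long as its last word was seen, which is the sparse-data problem noted for the first model.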

Plan and Goals for new week:

  1. Discuss the models with supervisor
  2. SVM problem
  3. Programming on first model

Leng Tan

Progress and Status this week:

  1. Done a design for the WRI code after discussion with group members.
  2. written about 50% of the code for data extraction using WRI.
  3. read a bit on SVM but still don't understand it.

Plan and Goals for new week:

  1. finish the coding for WRI.
  2. try to get help for SVM.

Tien-en Phua

Progress and Status this week:

  1. Finished the algorithm design (pseudo-code) in Java for function word frequency.
  2. Started implementing the algorithm code.
  3. Code is about half done.

Plan and Goals for new week:

  1. Finish coding.
  2. Discuss about SVM problems.

Semester 2, Week 7

Jie Dong

Progress and Status this week:

  1. Read the chapter on Hidden Markov Chains in "Statistical Language Learning"
  2. Came up with my own test text to verify my code is working properly
  3. Met with Brian to discuss my current work; the current approach does not work efficiently

Plan and Goals for new week:

  1. The previous algorithm only considers the effect of word trigrams. The result for a test paragraph contains a lot of useless information: about 70% of trigrams appear only once, and only about 10% of the information is worth using in classification. When common trigrams are extracted from several test texts, few of them are left. Hence an enhanced model, in which unigrams and bigrams are also taken into consideration, will be tested in the following week.
  2. SVM will also be used to test the result in the coming week. Investigate how to use the MATLAB SVM functions svmtrain and svmclassify (Bioinformatics Toolbox)
  3. Peer review assessment

Leng Tan

Progress and Status this week:

  1. Finish the Java coding for WRI technique in data extraction algorithm.
  2. Tested and verified that the code is working properly using a small test file. (text file with only few sentences)
  3. Had a meeting with Brian to discuss the SVM input and output.

Plan and Goals for new week:

  1. Figure out SVM.
  2. Test and try out SVM on matlab using small test files.

Tien-en Phua

Progress and Status this week:

  1. Completed coding for data extraction algorithm (DEA)
  2. Discuss implementation of output of data from DEA to SVM
  3. Analyse how other researchers analyse their data

Plan and Goals for new week:

  1. Modification and refining of DEA code
  2. Continue analysis of how other researchers used this DEA for authorship attribution
  3. Try applying data to SVM

Semester 2, Week 8

Jie Dong

Progress and Status this week:

  1. Peer review assessment on the design document on "Audio assisted vision system"
  2. Investigation on SVM in MATLAB
  3. Working on modified trigram model

Plan and Goals for new week:

  1. Test my result of java program with SVM

Leng Tan

Progress and Status this week:

  1. Received a stage 1 design document on "Audio Assisted Vision System for Visually Impaired People".
  2. The document was read in full and notes were taken on presentation and various other aspects.
  3. The document was reviewed and a formal peer review report was produced.
  4. Investigation on Matlab for SVM was halted for a moment due to the peer review report.

Plan and Goals for new week:

  1. Figure out SVM.
  2. Test and try out SVM on matlab using small test files.

Tien-en Phua

Progress and Status this week:

  1. Completed the coding of the Data Extraction Algorithm. It can load a file, remove punctuation and create a new output file as Support Vector Machine input
  2. Reviewed the peer document and did some research on the principles of echolocation in bats to understand the document
  3. Completed Peer Review on Audio Assisted Vision System for Visually Impaired People
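
As a rough illustration of the data extraction step in item 1 (strip punctuation, then turn the text into numbers for the SVM), the sketch below computes relative function word frequencies. The class name, the method names and the five-word list are assumptions; the project's actual function word list and output file format are not given in this log.

```java
import java.util.Locale;

/** Illustrative function word frequency extraction (hypothetical names and word list). */
final class FunctionWordExtractor {
    // Assumed list for illustration only; the project's real list is not stated here.
    static final String[] FUNCTION_WORDS = {"the", "of", "and", "to", "in"};

    /** Lower-cases the text and replaces anything that is not a letter, digit or whitespace with a space. */
    static String removePunctuation(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                   .trim();
    }

    /** Returns, for each function word, its count divided by the total number of words in the text. */
    static double[] frequencies(String text) {
        String cleaned = removePunctuation(text);
        String[] words = cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
        double[] freq = new double[FUNCTION_WORDS.length];
        for (String word : words) {
            for (int i = 0; i < FUNCTION_WORDS.length; i++) {
                if (FUNCTION_WORDS[i].equals(word)) {
                    freq[i]++;
                }
            }
        }
        if (words.length > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= words.length;
            }
        }
        return freq;
    }
}
```

Note that replacing punctuation with a space splits words such as "don't" or "well-known" into two tokens; whether that is desirable is a separate design decision.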

Plan and Goals for new week:

  1. Apply the data generated by the data extraction algorithm to the support vector machine
  2. Determine progress of project and review schedule.

Semester 2, Week 9

Jie Dong

Progress and Status this week:

  1. Hidden Markov model is implemented using Java, and the program produces a table of probability information for common trigrams from the input texts. The current problem is that, because I am feeding every word in the texts into the program, there are few common trigrams across a given number of input texts. For example, with a total of 20 input texts from two authors, they have just one trigram in common. I also set the program to allow only part of the texts to share a trigram, with the others assigned zero probability for it, but the result is still not effective.
  2. Read through Talis's trigram description and code. I found that he simplified the method and extracted the key features by deleting the non-key words. Testing his idea in Java, I found that it extracts a lot more information than mine. However, it raises the question of whether it reduces classification accuracy, since it changes the original text into another. This simplification needs to be validated.
  3. The result produced by the extraction algorithm was fed into the MATLAB SVM functions (svmtrain and svmclassify); it shows that my extraction algorithm is not working properly. The predicted author for the chosen texts is sometimes correct and sometimes not. As for the SVM itself, these functions only support classification into two groups, and multi-group classification produces an error. In addition, they can only plot the SVM structure for two-dimensional data. Hence, more capable SVM toolboxes should be studied.

Plan and Goals for next week:

  1. GUI design
  2. Test efficiency using different groups of input texts
  3. Try another SVM toolbox from: http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html

Leng Tan

Progress and Status this week:

  1. A basic SVM code which receives a text file input is produced.
  2. The SVM code needs 2 training data groups and a number of test data groups.
  3. The standardised input format for the SVM was decided by team members.
  4. The input format is an MxN matrix in which the first column is the author and the subsequent columns are the data (in my case, standard deviations).
  5. Initial data uses 20 standard deviation columns.
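
To make the row format concrete, here is a hedged sketch that builds one row of such a matrix for WRI: the author label followed by one standard deviation per keyword. It assumes that a recurrence interval is the distance in word positions between successive occurrences of a keyword and that the population standard deviation is used; neither detail is confirmed by this log, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative word recurrence interval (WRI) features; names and definitions are assumptions. */
final class WriFeatures {

    /** Population standard deviation of the gaps between successive occurrences of keyword (0 if fewer than two gaps). */
    static double intervalStdDev(String[] words, String keyword) {
        List<Integer> intervals = new ArrayList<>();
        int last = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(keyword)) {
                if (last >= 0) {
                    intervals.add(i - last);
                }
                last = i;
            }
        }
        if (intervals.size() < 2) {
            return 0.0;
        }
        double mean = 0.0;
        for (int d : intervals) {
            mean += d;
        }
        mean /= intervals.size();
        double variance = 0.0;
        for (int d : intervals) {
            variance += (d - mean) * (d - mean);
        }
        variance /= intervals.size();
        return Math.sqrt(variance);
    }

    /** One matrix row: column 0 is the author label, columns 1..N are the per-keyword standard deviations. */
    static double[] row(int authorLabel, String[] words, String[] keywords) {
        double[] row = new double[keywords.length + 1];
        row[0] = authorLabel;
        for (int k = 0; k < keywords.length; k++) {
            row[k + 1] = intervalStdDev(words, keywords[k]);
        }
        return row;
    }
}
```

Stacking one such row per text gives the MxN matrix, with 20 keywords producing the 20 standard deviation columns mentioned above.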

Plan and Goals for next week:

  1. The SVM predicts the author wrongly and this needs to be resolved.
  2. Might be due to insufficient training data.
  3. Further testing is required.
  4. Might consider implementing GUI.
  5. Need to have a meeting with supervisors on progress and GUI implementation (can the Java and MATLAB GUIs be combined?)

Tien-en Phua

Progress and Status this week:

  1. Researched statistical software for obtaining the covariance of data: StatGraphics
  2. Downloaded and installed the chosen software and attempted to operate the program
  3. Researched a book discussing the possible author of Hebrews, "NACSBT: Lukan Authorship of Hebrews"

Plan and Goals for next week:

  1. Obtain the covariance of the data
  2. Check to see if the data extraction algorithm produces similar results to Talis's
  3. Produce code to "chop" all text files to a specific length for analysis
  4. Input data to SVM and observe the outcome
  5. Combine functions for analysis
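
The "chop" step in item 3 could look roughly like the sketch below, which cuts a word sequence into equal-length chunks. The class and method names are hypothetical, and dropping the incomplete final chunk is an assumption made so that every sample has the same length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative fixed-length text chopping (hypothetical names). */
final class TextChopper {

    /** Splits words into consecutive chunks of exactly chunkLength words; a shorter trailing remainder is discarded. */
    static List<String[]> chop(String[] words, int chunkLength) {
        if (chunkLength <= 0) {
            throw new IllegalArgumentException("chunkLength must be positive");
        }
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start + chunkLength <= words.length; start += chunkLength) {
            chunks.add(Arrays.copyOfRange(words, start, start + chunkLength));
        }
        return chunks;
    }
}
```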

Semester 2, Week 10

Jie Dong

Progress and Status this week:

  1. Original Java program is rebuilt as a standard Eclipse project
  2. Delete Transition class, no longer used
  3. Change three classes (State, Gram, Record) to inner classes correspondingly
  4. Reduce the original three main methods in separate classes to a single one in the Driver class
  5. Move methods for user inputs to Driver class, including parameters and paths
  6. Add three header lines to Java program output: number of texts, number of disputed texts, number of trigrams used

Plan and Goals for next week:

  1. Standardise three algorithms into one project folder
  2. Use same training data, unknown data to test three extraction algorithms
  3. Compare their accuracies in different situations (number of key words, number of texts, etc.)

Leng Tan

Progress and Status this week:

  1. had a meeting with the supervisors and reported on the progress of the project.
  2. SVM code remains the same for the time being.
  3. A table of results should be produced to compare the data extraction algorithms.
  4. the main idea of the progress report is discussed.

Plan and Goals for next week:

  1. A standardised template to combine all 3 data extraction algorithms was discussed.
  2. WRI code needs to be slightly modified.
  3. need to plan the initial design for the GUI.

Tien-en Phua

Progress and Status this week:

  1. Modify code to accept multiple inputs
  2. Extract the Federalist Papers for testing on the support vector machine using function word analysis
  3. Meeting with supervisors on Wednesday for progress updates and guidance on next step
  4. Commencement of progress report

Plan and Goals for next week:

  1. Produce a table of results displaying the accuracy of the algorithm with each SVM kernel function
  2. Complete progress report, project background, project specification, progress thus far and project management
  3. Combine the three algorithms into a single driver file
  4. Discuss and design possible implementation of a GUI

Semester 2, Week 11

Jie Dong

Progress and Status this week:

  1. Update progress report
  2. JAVA program modification:
    • Sort list of files read in according to their name order
    • Replace manual parameter setup with automatic reading of data, forming the training set and testing set according to the three header lines

Plan and Goals for next week:

  1. Write a standard document to combine our java extraction program together
  2. Complete Progress report

Leng Tan

Progress and Status this week:

  1. Do progress report.

Plan and Goals for next week:

  1. catch up on assignments and prepare for exams.

Tien-en Phua

Progress and Status this week:

  1. Update of progress report

Plan and Goals for next week:

  1. Complete 4 upcoming assignments
  2. Prepare for power system quiz

Semester 2, Week 12

Jie Dong

Progress and Status this week:

  1. Drafted a standard for the Java extraction programs and sent it to the other group members
  2. Modify progress report and upload to Wiki

Plan and Goals for next week:

  1. Stop project for a period of time to prepare for exams

Leng Tan

Progress and Status this week:

  1. assignments due this week are completed.

Plan and Goals for next week:

  1. Stop project as exams are coming.

Tien-en Phua

Progress and Status this week:

  1. Completed all assignments due this week

Plan and Goals for next week:

  1. Need to prepare for exams. SWOT week next week.
  • Project will "pause" till after exam period, 20 Nov 2010, thereafter the team will be working individually back in their home country and update each other via email

Semester 1, Week 1

Jie Dong

Progress and Status this week:

  1. Had a small discussion with the team members and work on SVM.
  2. Modify SVM program to support multi-group classification function
  3. Test the accuracy of the whole classifying program with English texts
  4. Generate accuracy table with respect to three different variables: tolerance, number of key words and kernel function (linear, quadratic, rbf, polynomial)

Plan and Goals for new week:

  1. Discuss with supervisor about the performance of current program and suggest ways to increase accuracy
  2. Apply interface developed by Joel

Leng Tan

Progress and Status this week:

  1. Brief discussion with team members on the project.
  2. the English texts are used to test the accuracy of the program.
  3. Tried different SVM kernel functions while testing the accuracy.

Plan and Goals for Next Week

  1. Organize a meeting with the supervisors for updates.
  2. discuss a constant text length with Joel.
  3. try to combine the code.

Tien-en Phua

Progress and Status this week:

  1. Conducted a brief meeting with team members to further evaluate the SVM.
  2. Modified the program to use arrays and ArrayList instead of function word objects, improving resource usage and run time
  3. Modified program to take in large amount of data as input instead of a single file
  4. Modified program to create a new folder to store all temporary (or modified) data. Reduce the clutter in the parent folder
  5. Test program using the federalist papers

Plan and Goals for new week:

  1. Have a meeting with supervisors showing the results.
  2. Further testing

Semester 1, Week 2

Jie Dong

Progress and Status this week:

  1. Met up with supervisors
  2. Applied the trigram model algorithm to the 170 English texts and tested the accuracy of the SVM for the trigram Markov model
  3. Number of key words used in the tests: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
  4. Four different kernel functions were used: linear, quadratic, rbf and polynomial. The linear kernel function performs best of the four; however, the accuracy is still very low, at about 50%.

Plan and Goals for new week:

  1. The effect of punctuation in the text should be taken into consideration, such as "-" and "'"
  2. Modify trigram software
  3. Further testing

Leng Tan

Progress and Status this week:

  1. Met up with supervisor
  2. applied the algorithm to the 170 English texts and tested the accuracy of the SVM for WRI
  3. applied different kernel functions and observed the different results
  4. developed a word count program for text

Plan and Goals for new week:

  1. Modify the delete punctuation method in the interface (see minutes report number 10 for specs)
  2. implement interface
  3. change the number of keywords (currently 20; try 5, 10, 15, 20, 25 and observe the difference)
  4. start using the New Testament as test data

Tien-en Phua

Progress and Status this week:

  1. Had meeting with supervisors
  2. Applied algorithm to the 170 English texts
  3. Applied algorithm to the 85 Federalist papers
  4. Monitor project progress and re-evaluate the project milestone and timeline
  5. Develop software for chopping text
  6. Develop software to count total words of text and also the number of occurrence of each word for better text analysis
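
A small sketch of the kind of word counting utility described in item 6: the number of occurrences of each word plus the total word count. The names are hypothetical, and splitting on whitespace after lower-casing is an assumption about how words are delimited.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative word counter (hypothetical names). */
final class WordCounter {

    /** Maps each lower-cased, whitespace-delimited word to its number of occurrences, in alphabetical order. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Total number of words, i.e. the sum of all per-word counts. */
    static int totalWords(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }
}
```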

Plan and Goals for new week:

  1. Identify the reason for incorrect classifications
  2. Further testing to ensure the correct operation
  3. Study the Greek alphabet

Semester 1, Week 3

Jie Dong

Progress and Status this week:

  1. Get rid of the concept of tolerance.
  2. Considered the meaning of punctuation appearing in the English texts, especially "-" and "'".
  3. Content not written by the author, such as chapter numbers and titles, should be removed before extraction.
  4. Tested the effect of the above modifications


Goals Next Week

  1. Prepare test data using Federalist Paper
  2. Prepare test data using Greek text

Leng Tan

Progress and Status this week:

  1. Developed a program to count the total number of words that contained "-" and "'"
  2. Implemented interface made by Joel
  3. Modified the WRI method and change the threshold of the number of keywords.

Plan and Goals for new week:

  1. Try to improve the accuracy.

Tien-en Phua

Progress and Status this week:

  1. Analyse the results for the Federalist and 170 English Text
  2. Continue developing auxiliary software (ie CountWord program, Punctuation program)
  3. Research on ways to balance the training data to SVM

Plan and Goals for new week:

  1. Continue testing on Federalist and 170 English Text
  2. Aim to achieve 70% accuracy
  3. Standardize the training data to SVM

Semester 1, Week 4

Jie Dong

Progress and Status this week:

  1. Run program modified last week on English data set
  2. Varying threshold and size of training data
  3. Achieve a classification accuracy of around 80%
  4. Helped group members prepare the Federalist data set

Plan and Goals for new week:

  1. Study the cause of unsatisfactory classification accuracy and try to improve it
  2. Perform similar tests on Federalist Paper
  3. Discuss results with other group member, and see their algorithm performance

Leng Tan

Progress and Status this week:

  1. Had a meeting with supervisors.
  2. English text achieves only around 25-30% accuracy.
  3. Study the New Testament.

Plan and Goals for new week:

  1. Try to find the Greek file for the New Testament.
  2. try using the Federalist text.

Tien-en Phua

Progress and Status this week:

  1. Develop a method of normalizing text.
  2. Run test on 170 English Text. Obtained a 100% accuracy
  3. Run test on Federalist Text. Obtained a 91% accuracy

Plan and Goals for new week:

  1. Obtain a full set of Greek text
  2. Chop Greek text accordingly
  3. Require further testing and analysis
  4. Apply Greek text accordingly

Semester 1, Week 5

Jie Dong

Progress and Status this week:

  1. Algorithm update:
    • The new version of the trigram extraction algorithm inserts a "#" before a sentence and a "$" after a sentence. For example, take the string "Today is a good day. I want to go to picnic." After the TextEditor class processes it, it becomes "# Today is a good day $ # I want to go to picnic $"
    • The motivation for this modification is that, in an English text, sentences are relatively independent of each other. In the example above, "......a good day. I want ......", it is not necessary to calculate the probability of "I" appearing after the bigram "good day". Instead, it is more significant for characterising an author's writing habits to know the probability of "I" appearing at the start of a sentence, i.e. after the bigram "$ #". Likewise, the probability of a word appearing at the end of a sentence is important to know, for example "day $ #". In addition, this method lets us discover how often a specific word is used in one sentence
    • To determine the beginning and end of a sentence, the delimiter "." is used. In the future, with further study of the characteristics of English text, more delimiters might be added
  2. Generate classification results based on Federalist Text.
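
The sentence-boundary marking described in the algorithm update above can be sketched as follows. This is not the project's TextEditor class; the class and method names here are hypothetical, and only "." is treated as a delimiter, as stated in the update.

```java
/** Illustrative sentence boundary marking with "#" and "$" (hypothetical names). */
final class SentenceMarker {

    /** Wraps every "."-delimited sentence as "# sentence $" and joins the results with single spaces. */
    static String mark(String text) {
        StringBuilder out = new StringBuilder();
        for (String sentence : text.split("\\.")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces, e.g. from consecutive full stops
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append("# ").append(trimmed).append(" $");
        }
        return out.toString();
    }
}
```

Calling `SentenceMarker.mark("Today is a good day. I want to go to picnic.")` returns `# Today is a good day $ # I want to go to picnic $`, so the trigram "day $ #" and the bigram "$ #" become ordinary tokens for the trigram counter.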


Plan and Goals for new week:

  1. Perform more tests on different disputed texts
  2. Try another key words selection algorithm: based on occurring frequency

Leng Tan

Progress and Status this week:

  1. Tried using the Federalist text. The best results give an accuracy of up to 70% when threshold = 10 and data dimension = 25. This might be due to the short text length of the Federalist texts.
  2. It is noted that WRI works better without normalization.
  3. Found a Greek file for the New Testament, but not sure if it is the right one.

Plan and Goals for new week:

  1. Run the Federalist text again with different disputed texts.
  2. Redo the English text with normalization.

Tien-en Phua

Progress and Status this week:

  1. Analysed the Federalist results, as this data set is most similar in style to the New Testament text
  2. Namely, most of the Federalist papers are written by Hamilton, and likewise most of the New Testament is written by Paul, with a few books written by other authors such as Luke, John and Peter
  3. Comparison of results with other feature extraction algorithms
  4. After comparing Function Word Analysis (FWA) with frequency occurrence of function words, FWA proves to be the better algorithm, as it produces more accurate results.
  5. Using FWA reduces the need to chop text, so less data has to be "chunked" out.

Plan and Goals for new week:

  1. According to Gantt Chart, the implementation of controversies should take place next week.
  2. Apply both FWA and frequency occurrence to the KJV text
  3. Frequency occurrence should produce results consistent with Talis's.

Semester 1, Week 6

Jie Dong

Progress and Status this week:

  1. With the modification last week, I re-ran the test on English data set
  2. The classification accuracy increased to 85% - 90%. The highest was achieved when threshold = 30
  3. A clear trend can be observed: accuracy increases with the size of the training data; as the threshold increases, accuracy first rises and then drops
  4. Performed tests on the Federalist papers, but the accuracy is very low, about 35% on average
  5. Discussed the Federalist results with supervisor and group members
  6. Since function word analysis achieves good performance, it was suggested that part of it be combined with the trigram model to enhance the algorithm

Plan and Goals for new week:

  1. Implement Trigram Markov model to select trigrams with "Golden Key words"
  2. Start to prepare for the final seminar
  3. Achieve test results on King James Version

Leng Tan

Progress and Status this week:

  1. It was verified that the English text actually does not have any problem.
  2. Results were not favourable. The prediction was very inconsistent, achieving a low accuracy of 53% most of the time.
  3. It was suggested that the testing could be biased towards Madison, as only Madison texts were taken as testing data.
  4. Compared with the earlier results using the English text, which involved 100 training texts and 70 disputed texts, the accuracy and consistency were much lower.
  5. WRI might not be suitable for authorship detection.
  6. It was suggested to combine the Function Word Frequency developed by Joel to enhance the algorithm.

Plan and Goals for new week:

  1. Examine the WRI algorithm with further testing.
  2. Implement enhanced version of the WRI by combining the algorithm with function word frequency.
  3. Prepare the powerpoint slides for the seminar.
  4. Start to make the initial stage of the video.

Tien-en Phua

Progress and Status this week:

  1. Analysed results of FWA on KJV
  2. Analysed results of frequency occurrence on KJV
  3. Frequency occurrence produces results consistent with Talis's, listing Paul, Barnabas, Luke and Matthew as the possible authors
  4. FWA produces different results; the reasons why will be discussed.
  5. Discuss seminar structure with team
  6. Delegate tasks to team members for seminar
  7. Produce a uniform set of data for testing and results presentation

Plan and Goals for new week:

  1. Consolidate results from English text, Federalist text and King James Version
  2. Research on future improvement for FWA
  3. Conduct a detailed literature review on the background of the New Testament

Semester 1, Week 7

Jie Dong

Progress and Status this week:

  1. Implemented Trigram Markov model to select trigrams with chosen function words
  2. Discovered that trigrams containing the chosen function words usually occur more than once. Hence, a selection threshold is applied to trigrams in a similar way to the function word selection method
  3. Made draft of powerpoint slides on SVM and trigram part
  4. Run classification test on King James Version of New Testament
  5. Finalise performance results for English text, Federalist and KJV

Plan and Goals for new week:

  1. Combine slides made by all group members and modify them; slides should be finalised by next week
  2. Practice makes perfect!!! :)
  3. Discuss results of our own extraction algorithms among group members, make suggestion on potential modification

Leng Tan

Progress and Status this week:

  1. Enhanced version of the WRI combined with function word frequency is done.
  2. Obtained the results and arrived at a conclusion.
  3. Prepare the powerpoint slides for the final seminar.
  4. started recording some video footage for the final year project video.

Plan and Goals for new week:

  1. Examine the WRI algorithm with further testing.
  2. Implement enhanced version of the WRI by combining the algorithm with function word frequency.
  3. Complete the powerpoint slides for the seminar.
  4. Start to make the initial stage of the video.

Tien-en Phua

Progress and Status this week:

  1. Consolidate results from English text, Federalist text and King James Version
  2. Research on future improvement for FWA
    1. Calculate mean of a function word in a group of text by an author
    2. Calculate the standard deviation of a function word in a group of text by an author
    3. Consider calculating the probability of the occurrence of a function word by inputting the above parameters to SVM
  3. Possible authors of Hebrews: Apollos, Clement, Paul, Barnabas, Luke and Peter
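
For the proposed FWA improvement in item 2, the per-author mean and standard deviation of each function word could be computed along these lines. This is only a sketch: the names are hypothetical, the input is assumed to be one row of function word frequencies per text by the same author, and the population standard deviation is an assumed choice.

```java
/** Illustrative per-author function word statistics (hypothetical names). */
final class AuthorProfile {

    /**
     * freqs[t][w] is the frequency of function word w in text t by one author.
     * Returns stats[w] = {mean, population standard deviation} for each function word.
     */
    static double[][] meanAndStdDev(double[][] freqs) {
        if (freqs.length == 0) {
            throw new IllegalArgumentException("at least one text is required");
        }
        int texts = freqs.length;
        int words = freqs[0].length;
        double[][] stats = new double[words][2];
        for (int w = 0; w < words; w++) {
            double sum = 0.0;
            for (int t = 0; t < texts; t++) {
                sum += freqs[t][w];
            }
            double mean = sum / texts;
            double squares = 0.0;
            for (int t = 0; t < texts; t++) {
                double diff = freqs[t][w] - mean;
                squares += diff * diff;
            }
            stats[w][0] = mean;
            stats[w][1] = Math.sqrt(squares / texts);
        }
        return stats;
    }
}
```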

Plan and Goals for new week:

  1. Complete presentation slides
  2. Practice presentation at least twice before seminar
  3. Assist team members in analyzing their results

Semester 1, Week 8

Jie Dong

Progress and Status this week:

  1. Practice more times for the Final Year Seminar.
  2. Final seminar on Thursday

Plan and Goals for new week:

  1. Run tests on Gospel of Luke and Acts of Apostles in KJV which were prepared by Joel.

Leng Tan

Progress and Status this week:

  1. Final Year Project Seminar

Plan and Goals for new week:

  1. Need to further discuss on pre-processing of the texts before implementing feature extraction algorithm.
  2. Run tests on Gospel of Luke and Acts of Apostles in Koine Greek which were prepared by Joel.
  3. Discuss with Joel on automated function word.

Tien-en Phua

Progress and Status this week:

  1. Project final seminar

Plan and Goals for new week:

  1. Discuss on the techniques for pre-processing of Koine Greek
  2. Run test on Gospel of Luke and Acts of Apostles in Koine Greek
  3. Obtain Koine Greek on possible authors of the letter to Hebrews

Semester 1, Week 9

Jie Dong

Progress and Status this week:

  1. Meeting with supervisor
  2. Run test on English Version of Gospel of Luke
  3. Run test on English Version of Acts of Apostles

Plan and Goals for new week:

  1. Perform the same tests on Koine Greek version of New Testament
  2. Planning for final report
  3. Think of ideas on the video

Leng Tan

Progress and Status this week:

  1. Meeting with supervisor to discuss the final report
  2. Run test on Gospel of Luke
  3. Run test on Acts of Apostles

Plan and Goals for new week:

  1. Start planning for the final report
  2. discuss with team on the video

Tien-en Phua

Progress and Status this week:

  1. Meeting with supervisor
  2. Run test on Koine Greek + KJV to determine the author of the Gospel of Luke
  3. Run test on Koine Greek + KJV to determine the author of the Acts of the Apostles
  4. Analyse results and discuss with team

Plan and Goals for new week:

  1. Commence final report writing and discussion
  2. Obtain a set of texts for Barnabas and Clement

Semester 1, Week 10

Jie Dong

Progress and Status this week:

  1. Discuss the structure of the final report with the team
  2. Validate the trigram Markov model using the Koine Greek version of the New Testament (Luke and Acts)
  3. Predict potential authors for the Letter to the Hebrews
  4. Write up the section on the Support Vector Machine
  5. Start writing the section on the trigram Markov model

Plan and Goals for new week:

  1. Find people to briefly proofread what I have written
  2. Complete the report

Leng Tan

Progress and Status this week:

  1. Discuss the video with the team.
  2. Discuss the overall style of the report.
  3. Complete a template to use for the final report.
  4. Write the section on past research.
  5. Write the section on project management.
  6. Write the section on WRI.

Plan and Goals for new week:

  1. Proofread the report.
  2. Complete the report.
  3. Prepare the final exhibition poster.

Tien-en Phua

Progress and Status this week:

  1. Discuss the structure of the final report with the team
  2. Write up the background of the Letter to the Hebrews
  3. Write up the background of the Bible
  4. Write up the project aim, approach and report structure
  5. Research a standard corpus for the team to work on
  6. Obtain the texts of the Epistle of Barnabas and the First Epistle of Clement to the Corinthians in Koine Greek
  7. Convert the Koine Greek text to Beta Code
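
The Beta Code conversion mentioned in the last item can be illustrated with a minimal Python sketch. This is not the team's actual script, which the log does not show. The function name `to_beta_code` and the table name `GREEK_TO_BETA` are hypothetical, and the sketch makes a simplifying assumption: accents, breathings and iota subscripts are dropped, whereas full Beta Code encodes them with extra symbols.

```python
import unicodedata

# Base-letter table for Beta Code (uppercase Latin form).
# Final sigma is mapped to S as well.
GREEK_TO_BETA = {
    "α": "A", "β": "B", "γ": "G", "δ": "D", "ε": "E", "ζ": "Z",
    "η": "H", "θ": "Q", "ι": "I", "κ": "K", "λ": "L", "μ": "M",
    "ν": "N", "ξ": "C", "ο": "O", "π": "P", "ρ": "R", "σ": "S",
    "ς": "S", "τ": "T", "υ": "U", "φ": "F", "χ": "X", "ψ": "Y",
    "ω": "W",
}


def to_beta_code(text):
    """Convert Greek text to bare Beta Code letters (hypothetical helper)."""
    # NFD decomposition splits accents and breathings off as combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Simplifying assumption: diacritics are discarded, not encoded.
            continue
        # Characters outside the table (spaces, punctuation) pass through.
        out.append(GREEK_TO_BETA.get(ch, ch))
    return "".join(out)


print(to_beta_code("ἐν ἀρχῇ ἦν ὁ λόγος"))
```

Dropping the diacritics leaves only the 24 base letters, which may be sufficient when the downstream features are word n-grams or function-word counts; whether the team kept or removed diacritics is not stated in the log.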

Plan and Goals for new week:

  1. Complete final report
  2. Commence planning for the exhibition

Semester 1, Week 11

Jie Dong

Progress and Status this week:

  1. Apply the common data set to the trigram Markov model
  2. Complete testing of the trigram Markov model to write up results for the final report
  3. Write the section on the trigram Markov model and edit the SVM part
  4. Prepare the appendix section
  5. Discuss the layout of the poster with Clement
  6. Make a draft of our poster
    1. Background, color theme, layout and detailed section content
    2. Draft for Introduction and Controversy
    3. Flow diagram of our project approach

Plan and Goals for new week:

  1. Complete our poster with other members
  2. Prepare for exhibition

Leng Tan

Progress and Status this week:

  1. Complete results for WRI.
  2. Touch up the final report.
  3. Preliminary discussion of the poster.

Plan and Goals for new week:

  1. Prepare the poster.

Tien-en Phua

Progress and Status this week:

  1. Prepare a common data set for the team to write up results for the final report
    • English texts: 156 texts, 26 per author (22 training, 4 disputed)
    • The Federalist Papers: 82 texts (65 training, 17 disputed)
    • King James Version
    • Koine Greek, using Barnabas, Clement, John, Luke, Mark, Matthew, Paul, Peter
  2. Write up results and discussion for the English texts
  3. Write up results and discussion for the Federalist Papers
  4. Write up results and discussion for the King James Version
  5. Write up results and discussion for the Koine Greek texts
  6. Write up the abstract
  7. Prepare the appendix
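
As a quick sanity check, the split figures listed for the common data set can be verified with a few lines of Python. This sketch assumes that "22 training, 4 disputed" is a per-author split of the English texts; the variable names are illustrative only, and the number of authors is derived from the listed totals rather than stated in the log.

```python
# Figures as listed in the log; the per-author reading is an assumption.
english_total = 156
english_per_author = 26
english_training_per_author = 22
english_disputed_per_author = 4

federalist_total = 82
federalist_training = 65
federalist_disputed = 17

# Number of authors implied by the English corpus totals.
english_authors = english_total // english_per_author

# Corpus-wide training and disputed counts for the English texts.
english_training = english_authors * english_training_per_author
english_disputed = english_authors * english_disputed_per_author

# Both splits should add up to the stated totals.
assert english_training_per_author + english_disputed_per_author == english_per_author
assert english_training + english_disputed == english_total
assert federalist_training + federalist_disputed == federalist_total

print(english_authors, english_training, english_disputed)
```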

Plan and Goals for new week:

  1. Prepare for project exhibition

Semester 1, Week 12

Jie Dong

Progress and Status this week:

  1. Complete final exhibition poster
  2. Make flyers for the final year exhibition
  3. Start video editing for introduction and SVM process

Plan and Goals for new week:

  1. Complete the video

Leng Tan

Progress and Status this week:

  1. Complete the poster with the team
  2. Make the flyers
  3. Upload the final report in wiki format
  4. Start the video editing for results and future application

Plan and Goals for new week:

  1. Complete the video

Tien-en Phua

Progress and Status this week:

  1. Start the video editing for SVM and the three algorithms
  2. Create a new YouTube account

Plan and Goals for new week:

  1. Complete the video and upload it to YouTube

See also

Back