Authorship detection: Who wrote the Letter to the Hebrews?
Supervisors
Collaborators
- François-Pierre Huchet, ITII Pays de la Loire, Nantes, France.
- Talis Putnins, BICEPS, Latvia.
- J. José Alviar, University of Navarra, Spain
Honours student group
- 2010: Jie Dong, Leng Tan, and Tien-en Phua, see Authorship detection: 2010 group
- 2011: Yan Xie, Kai He, and Zhaokun Wang, see Authorship detection: 2011 group
Project guidelines
Project description
The project aims to resolve the controversy “Who wrote the Letter to the Hebrews?” The team intends to further enhance three extraction algorithms, Function Word Analysis, Word Recurrence Interval (WRI), and Trigram Markov, which have been shown to produce relatively satisfactory authorship detection results compared with data compression, and to compare their effectiveness with that of existing algorithms. The team plans to use a Support Vector Machine (SVM) to develop a classification model that can accurately assign a disputed text to its author using a database of undisputed texts. With this model, the team would be able to present a well-supported hypothesis on the controversy “Who wrote the Letter to the Hebrews?” In addition, if time permits, the team would aim to verify the authorship of other controversial texts such as The Federalist Papers and the works of Shakespeare. Furthermore, the team would like to extend the algorithms to applications such as source code plagiarism detection and future search engines.
Approach and methodology
The New Testament of the Bible contains a number of texts of unknown or unclear authorship. The most notorious of these is the Letter to the Hebrews. It is traditionally ascribed to Paul, but it is unclear whether this is indeed the case.
This project will involve trying to answer the question of who wrote the letter, using a number of different measures of the relationships between texts. The first of these is the word recurrence interval (WRI), which looks at the interval at which key words in a text recur. For example, the interval for the word “the” in the phrase “The cat sat on the mat” is three, as there are three words between the two occurrences of “the.”
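The interval counting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used by previous project groups: the function names `tokenize` and `recurrence_intervals` are made up for this example, and the tokenizer (lowercased runs of letters and apostrophes) is an assumption that suits English text only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def recurrence_intervals(text):
    """Map each repeated word to the gaps between its consecutive occurrences.

    A gap counts the words strictly between two occurrences, which is the
    convention used in the "The cat sat on the mat" example.
    """
    last_seen = {}
    intervals = defaultdict(list)
    for position, word in enumerate(tokenize(text)):
        if word in last_seen:
            intervals[word].append(position - last_seen[word] - 1)
        last_seen[word] = position
    return dict(intervals)


print(recurrence_intervals("The cat sat on the mat"))  # {'the': [3]}
```

Words that occur only once have no interval and are left out of the result. For a real text you would typically keep the intervals of a chosen set of key words and compare their distributions between authors.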
The other methods you should try include various statistical tests, Markov model approaches, and universal file compression techniques. By the end of the project, you should be familiar with all these methods, and hopefully have something to say about the authorship of the Letter to the Hebrews!
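To give a feel for the compression-based approach, here is a minimal sketch using Python's standard `zlib` module and the normalized compression distance (NCD). Please note the assumptions: NCD is a generic compression-based similarity measure chosen here for illustration and is not the Gutman method listed below, and the names `compressed_size`, `ncd`, and `closest_author` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(text_a: str, text_b: str) -> float:
    """Normalized compression distance between two texts.

    Smaller values mean the texts share more structure that the compressor
    can exploit when they are concatenated.
    """
    a = text_a.encode("utf-8")
    b = text_b.encode("utf-8")
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def closest_author(disputed: str, known_texts: dict) -> str:
    """Return the author whose known text has the smallest NCD to the disputed text."""
    return min(known_texts, key=lambda author: ncd(disputed, known_texts[author]))
```

In use, `known_texts` would map each candidate author to a sample of undisputed writing. Bear in mind that zlib only looks back over a 32 KB window, so long texts would need chunking or a different compressor.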
Several methods that are used in authorship detection are:
- Scaled standard deviation of WRI graphical method
- Mid-point of WRI curve method
- Periodic truncation of WRI plot method
- Kolmogorov-Smirnov test
- Trigram Markov model method
- Gutman (LZ preprocessing) method
- Frequency of occurrence of function words
- Function word analysis
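Two of the methods listed above are simple enough to sketch directly: the frequency of function words and the two-sample Kolmogorov-Smirnov statistic. The sketch below makes several assumptions. The function-word list is a tiny English one invented for illustration (your texts may call for a different language and a much longer list), the names `function_word_profile` and `ks_statistic` are hypothetical, and only the KS statistic is computed, not its significance level.

```python
import re
from bisect import bisect_right

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "a", "that", "for")


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word among all tokens in the text."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: tokens.count(word) / total for word in vocabulary}


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest = 0.0
    # The empirical CDFs only change at observed values, so check those points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest
```

One way to combine the pieces is to feed the recurrence intervals of a key word from two different texts into `ks_statistic`: a value near 0 suggests similar interval distributions, and a value near 1 suggests very different ones.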
Possible extension
If you get through all the work on Hebrews early, you might like to try your hand at the authorship of the plays of Shakespeare. More information on the controversy can be found at http://shakespeare-oxford.com/
Expectations
- If you discover who wrote the Letter to the Hebrews you will be instantly famous. That would be nice. However, we don't expect you to do that in a one-year project! The key expectation is that you carefully try out the different methods and draw conclusions as to which potential authors have greater statistical "weight" than others.
- We expect you to critically look at the conclusions of the previous project groups and highlight to what extent your conclusions agree and where you disagree.
- We expect all the written work to be placed on this wiki. No paper reports are to be handed up. Just hand up a CD with your complete project directory at the end. One CD for each group member.
- It is expected that you update the wiki each week, every Friday evening, to briefly state what you did that week and what the goals are for the following week.
- It is important to regularly see your main supervisors. Don't let more than 2 weeks go by without them seeing your face briefly.
- You should hold at least one formal progress meeting with your supervisors per month. It does not strictly have to be exactly a month, but roughly each month you should be in a position to show some progress and have some problems and difficulties to discuss.
- The onus is on you to drive the meetings, make the appointments and set them up.
- You are expected to make a YouTube presentation of your whole project.
Relationship to possible career path
Whilst the project is fascinating, as you'll learn about a specific historical problem (and we do want you to have a lot of fun with it), it also has a hard-core, serious engineering side. It will familiarize you with techniques in information theory, probability, statistics, encryption, decryption, and data mining. It will also improve your software skills. The project will also involve writing software code that performs classification of data. This will force you to learn about a number of software tools and techniques. The software you write will have applications in authorship detection, plagiarism detection, next-generation web search engines, and data mining, and may also make you rich/famous. The types of jobs out there where these skills are useful are in computer security, comms, or in digital forensics. The types of industries that will need you are: the software industry, e-finance industry, e-security, IT industry, Google, telecoms industry, ASIO, ASIS, defence industry (e.g. DSD), etc. So go ahead and have fun with this, but keep your eye on the bigger engineering picture and try to build up an appreciation of why these techniques are useful to our industry. Now go find who wrote Hebrews...this message will self-destruct in five seconds. :-)
See also
- Authorship detection: 2010 group
- Minutes of Meeting 2010: Who wrote the Letter to the Hebrews?
- Critical design review 2010: Who wrote the Letter to the Hebrews?
- Progress Report 2010: Who wrote the Letter to the Hebrews?
- Final report 2010: Who wrote the Letter to the Hebrews?
- Youtube Video Presentation 2010: Who wrote the Letter to the Hebrews?
References and useful resources
- M.J. Berryman, Investigation of Differential Entropic Measures for Data Mining of Both Text and DNA, 2002.
- K.S. Hui, Data Mining, 2003.
- M.J. Berryman, A. Allison, and D. Abbott, “Statistical techniques for text classification based on word recurrence intervals”, Fluctuation and Noise Letters, 3(1) pp. L1-L10, 2003.
- M.J. Berryman, A. Allison, and D. Abbott, “Signal processing and statistical methods in analysis of text and DNA,” Proc. SPIE: Biomedical Applications of Micro and Nanoengineering (Ed. Dan V. Nicolau, Swinburne Univ. of Technology, Australia), 4937, Melbourne, Dec 16-18, 2002.
- M. Gutman, “Asymptotically optimal classification for multiple tests with empirically observed statistics”, IEEE Transactions on Information Theory, 35(2) pp. 401-408, 1989.
External links
If you find any useful external links, list them here: