Markov models
In a first order Markov model, you compute the letter transition probabilities, <math>P(b|c)</math>, etc. from the frequencies, so for example for every pair of letters in abbcab the counts are

* ab: 2
* bb: 1
* bc: 1
* ca: 1

Then compute the probabilities:

* <math>P(a|b) = 2 / (2+1+1+1)</math>
* <math>P(b|b) = 1 / (2+1+1+1)</math>
* <math>P(b|c) = 1 / (2+1+1+1)</math>
* <math>P(c|a) = 1 / (2+1+1+1)</math>
* <math>P(\textrm{any other combinations}) = 0</math>
* <math>P(a) = 2/(2+3+1)</math>
* <math>P(b) = 3/(2+3+1)</math>
* <math>P(c) = 1/(2+3+1)</math>
* <math>P(d) \ldots P(z) = 0</math>
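For example, a minimal Java sketch of gathering these counts and probabilities from the sample text abbcab might look like the following (the class and variable names are only illustrative, not something the article specifies):

<source lang="java">
import java.util.HashMap;
import java.util.Map;

// A sketch of gathering the counts above from the sample text; names are
// illustrative assumptions, not part of the original article.
public class MarkovCounts {

    public static void main(String[] args) {
        String sample = "abbcab";

        Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        Map<Character, Integer> letterCounts = new HashMap<Character, Integer>();

        // Count every adjacent pair of letters: ab, bb, bc, ca, ab.
        for (int i = 0; i + 1 < sample.length(); i++) {
            String pair = sample.substring(i, i + 2);
            pairCounts.put(pair, pairCounts.getOrDefault(pair, 0) + 1);
        }

        // Count every single letter: a, b, b, c, a, b.
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            letterCounts.put(c, letterCounts.getOrDefault(c, 0) + 1);
        }

        int totalPairs = sample.length() - 1;  // 2+1+1+1 = 5
        int totalLetters = sample.length();    // 2+3+1 = 6

        // Transition probabilities in the article's notation, e.g. P(b|c) = 1/5.
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            String pair = e.getKey();
            System.out.println("P(" + pair.charAt(0) + "|" + pair.charAt(1) + ") = "
                    + (double) e.getValue() / totalPairs);
        }

        // Single letter probabilities, e.g. P(a) = 2/6.
        for (Map.Entry<Character, Integer> e : letterCounts.entrySet()) {
            System.out.println("P(" + e.getKey() + ") = "
                    + (double) e.getValue() / totalLetters);
        }
    }
}
</source>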
Then for some sequence of letters <math>x_1,\ldots ,x_n</math> (the code), the likelihood that that code came from the language (or initial letters of the language) that our sequence abbcab came from is a product of probabilities, assuming independence (since you are treating it as a first order Markov model):

<math>P(x_1)\,P(x_2|x_1)\,P(x_3|x_2)\cdots P(x_n|x_{n-1})</math>

Note that you should first take logs (using Java doubles), then compute the sum of the log(probabilities), then raise back as a power of the base of the logarithm, otherwise you'll get too many numerical errors. If a probability is zero, then treat it as -9999 (in the log-transformed version). The first term is why you need the single letter frequencies as well.
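As a rough illustration, the scoring step might look like the following Java sketch, which works entirely in log space and assumes the probabilities have been stored in maps keyed as in the previous sketch (the class, method, and parameter names are assumptions, not anything given here):

<source lang="java">
import java.util.Map;

// A sketch of the scoring step; names are illustrative assumptions.
public class MarkovScore {

    // Zero probabilities are treated as a very large negative log value,
    // as suggested above.
    private static final double LOG_ZERO = -9999.0;

    public static double logLikelihood(String code,
                                       Map<Character, Double> letterProbs,
                                       Map<String, Double> pairProbs) {
        // First term: the single letter probability of the first symbol,
        // which is why the single letter frequencies are needed.
        Double first = letterProbs.get(code.charAt(0));
        double logSum = (first == null || first == 0.0) ? LOG_ZERO : Math.log(first);

        // One transition probability per adjacent pair of letters in the code.
        for (int i = 0; i + 1 < code.length(); i++) {
            Double p = pairProbs.get(code.substring(i, i + 2));
            logSum += (p == null || p == 0.0) ? LOG_ZERO : Math.log(p);
        }

        // Math.exp(logSum) recovers the likelihood itself (raising back as a
        // power of the base of the logarithm), as described above.
        return logSum;
    }
}
</source>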
Doing this for the case where you have fifty samples of a text gives you a confidence level in your final p-value, but it makes the estimates of the probabilities wrong. 50 may be too many for short texts.
You can extend this to higher order Markov models, e.g. for a 2nd order one you estimate <math>P(b|bc)</math> from counts of bcb, etc.

== See also ==
[http://en.wikipedia.org/wiki/Markov_chain Markov chain (Wikipedia)]