Markov models: Difference between revisions

Latest revision as of 14:33, 10 June 2009

In a first order Markov model, you estimate the letter transition probabilities, $P (a | b)$, etc., from observed frequencies. For example, counting every pair of adjacent letters in abbcab, and then every single letter, gives the counts

$C (a b) = 2$
$C (b b) = 1$
$C (b c) = 1$
$C (c a) = 1$

$C (a) = 2$
$C (b) = 3$
$C (c) = 1$
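
The counting step above can be sketched in Java as follows. This is a minimal illustration, not the page author's code: the class name `PairCounter` and the method names `pairCounts` and `letterCounts` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative only.
class PairCounter {
    // Count each pair of adjacent letters, e.g. "abbcab" -> ab, bb, bc, ca, ab.
    static Map<String, Integer> pairCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Count each single letter.
    static Map<String, Integer> letterCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            counts.merge(text.substring(i, i + 1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(pairCounts("abbcab"));   // {ab=2, bb=1, bc=1, ca=1}
        System.out.println(letterCounts("abbcab")); // {a=2, b=3, c=1}
    }
}
```

A `TreeMap` is used only so that the printed counts come out in alphabetical order; any `Map` would do.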

Then compute the probabilities:

$P (b | a) = 2 / (2 + 1 + 1 + 1)$
$P (b | b) = 1 / (2 + 1 + 1 + 1)$
$P (c | b) = 1 / (2 + 1 + 1 + 1)$
$P (a | c) = 1 / (2 + 1 + 1 + 1)$
$P (\text{any other combination}) = 0$

$P (a) = 2 / (2 + 3 + 1)$
$P (b) = 3 / (2 + 3 + 1)$
$P (c) = 1 / (2 + 3 + 1)$
$P (d) = \dots = P (z) = 0$
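
As a rough sketch, the estimates above can be computed by dividing each count by the total of all counts of the same length, which is the normalisation used in the worked example (5 pairs, 6 single letters). The class `FrequencyEstimator` and its method names are hypothetical. Dividing each pair count by the number of times its first letter is followed by anything would be a different normalisation, and it is not what this sketch does.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative only.
class FrequencyEstimator {
    // Count every run of n adjacent letters in the text.
    static Map<String, Integer> counts(String text, int n) {
        Map<String, Integer> result = new TreeMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return result;
    }

    // Divide each count by the sum of all counts, as in the worked example.
    static Map<String, Double> frequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(frequencies(counts("abbcab", 2))); // pair estimates
        System.out.println(frequencies(counts("abbcab", 1))); // single letter estimates
    }
}
```

Letters or pairs that never occur are simply absent from the map, which corresponds to the zero entries above.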

Then, for some sequence of letters $x_{1}, \dots, x_{n}$ (the code), the likelihood that the code came from the same language (or initial letters of a language) as our sequence abbcab is $P (x_{1}) P (x_{2} | x_{1}) \dots P (x_{n} | x_{n - 1})$, a product of probabilities, assuming independence (since you are treating it as a first order Markov model). Note that you should first take logs (using Java doubles), then compute the sum of the log(probabilities), then raise the base of the logarithm to the power of that sum to get back to a probability; otherwise you'll get too many numerical errors. If a probability is zero, then treat its log as -9999 (in the log transformed version). The first term is why you need the single letter frequencies as well.
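
The log-sum procedure described above might look like the following in Java. This is a sketch under stated assumptions: the class `LogScorer`, its methods and the constant name `LOG_ZERO` are hypothetical, natural logarithms are used, and the estimates are normalised by the total count as in the worked example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative only.
class LogScorer {
    // Stand-in for log(0), the value suggested in the text.
    static final double LOG_ZERO = -9999.0;

    // Relative frequency of every run of n adjacent letters in the text.
    static Map<String, Double> frequencies(String text, int n) {
        Map<String, Double> freq = new HashMap<>();
        int total = text.length() - n + 1;
        for (int i = 0; i + n <= text.length(); i++) {
            freq.merge(text.substring(i, i + n), 1.0 / total, Double::sum);
        }
        return freq;
    }

    // Natural log of p, or LOG_ZERO when p is missing or zero.
    static double safeLog(Double p) {
        return (p == null || p == 0.0) ? LOG_ZERO : Math.log(p);
    }

    // log P(x1) + log P(x2|x1) + ... + log P(xn|x(n-1)).
    static double logScore(String code, Map<String, Double> single, Map<String, Double> pair) {
        if (code.isEmpty()) {
            return 0.0;
        }
        double sum = safeLog(single.get(code.substring(0, 1)));
        for (int i = 1; i < code.length(); i++) {
            sum += safeLog(pair.get(code.substring(i - 1, i + 1)));
        }
        return sum;
    }

    public static void main(String[] args) {
        String training = "abbcab";
        Map<String, Double> single = frequencies(training, 1);
        Map<String, Double> pair = frequencies(training, 2);
        double logP = logScore("abbc", single, pair);
        // Math.exp undoes the natural log to recover the product of probabilities.
        System.out.println(logP + " -> " + Math.exp(logP));
    }
}
```

With a floor of -9999, any code containing an unseen letter or pair gets a sum so negative that `Math.exp` returns 0.0, which matches the intended zero probability. Comparing the log sums directly avoids that final step when only a ranking is needed.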

Doing this for the case where you split a text into fifty samples gives you a confidence level in your final p-value, but it makes the estimates of the probabilities less reliable. Fifty may be too many samples for short texts.

You can extend this to higher order Markov models; e.g., for a 2nd order one you estimate $P (b | b c)$ from counts of bcb, etc.
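
For illustration, the counting for a higher order model can be sketched by counting runs of (order + 1) letters, so that each key is a context followed by the next letter. The class `HigherOrderCounts`, its method names and the sample string "abcbcb" are all hypothetical, and the estimate again assumes the same total-count normalisation as the first order example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative only.
class HigherOrderCounts {
    // Count every run of (order + 1) adjacent letters: a context plus the letter after it.
    static Map<String, Integer> contextCounts(String text, int order) {
        int len = order + 1;
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + len <= text.length(); i++) {
            counts.merge(text.substring(i, i + len), 1, Integer::sum);
        }
        return counts;
    }

    // Estimate for "next follows context", normalised by the total count
    // (an assumption carried over from the first order example).
    static double estimate(Map<String, Integer> counts, String context, char next) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        return counts.getOrDefault(context + next, 0) / (double) total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = contextCounts("abcbcb", 2);
        System.out.println(counts);                      // {abc=1, bcb=2, cbc=1}
        System.out.println(estimate(counts, "bc", 'b')); // 0.5
    }
}
```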

@@ Line 34: / Line 34: @@
 compute the sum of the log(probabilities), then raise back as a
 power of the base of the logarithm, otherwise you'll get too many
-numerical errors. You also need to define <math>\log 0 = 0</math>. (i.e. in the
+numerical errors. If the probability is zero, then treat it as -9999 (in the log transformed version).
-coding logic, if the probability is zero, then add zero, otherwise
-then compute the log and add that)
 The first term is why you need the single letter frequencies as well.
@@ Line 46: / Line 44: @@
 You can extend this to higher order Markov models, eg. for a 2nd
 order one you estimate <math>P(b|bc)</math> from counts of bcb, etc.
+== See also ==
+[http://en.wikipedia.org/wiki/Markov_chain Markov chain (Wikipedia)]
