Markov models

In a first order Markov model, you compute the letter transition probabilities, <math>P(b|a)</math>, etc., from the frequencies. For example, for every pair of letters in abbcab the counts are

* <math>C(ab) = 2</math>
* <math>C(bb) = 1</math>
* <math>C(bc) = 1</math>
* <math>C(ca) = 1</math>
* <math>C(a) = 2</math>
* <math>C(b) = 3</math>
* <math>C(c) = 1</math>

Then compute the probabilities:

* <math>P(b|a) = 2 / (2+1+1+1)</math>
* <math>P(b|b) = 1 / (2+1+1+1)</math>
* <math>P(b|c) = 1 / (2+1+1+1)</math>
* <math>P(c|a) = 1 / (2+1+1+1)</math>
* <math>P(\textrm{any other combination}) = 0</math>
* <math>P(a) = 2/(2+3+1)</math>
* <math>P(b) = 3/(2+3+1)</math>
* <math>P(c) = 1/(2+3+1)</math>
* <math>P(d) = \cdots = P(z) = 0</math>
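
A minimal Java sketch of this counting step follows (the class and variable names are only for illustration, they are not part of the original notes): it tallies adjacent letter pairs and single letters in abbcab, then divides by the total number of pairs and by the total number of letters, exactly as in the lists above.

 import java.util.HashMap;
 import java.util.Map;
 
 public class FirstOrderCounts {
 
     public static void main(String[] args) {
         String sample = "abbcab";
 
         // Adjacent pair counts, e.g. C(ab) = 2, and single letter counts, e.g. C(a) = 2
         Map<String, Integer> pairCounts = new HashMap<>();
         Map<Character, Integer> letterCounts = new HashMap<>();
 
         for (int i = 0; i < sample.length(); i++) {
             letterCounts.merge(sample.charAt(i), 1, Integer::sum);
             if (i + 1 < sample.length()) {
                 pairCounts.merge(sample.substring(i, i + 2), 1, Integer::sum);
             }
         }
 
         int totalPairs = sample.length() - 1;   // 2+1+1+1 = 5 for abbcab
         int totalLetters = sample.length();     // 2+3+1 = 6 for abbcab
 
         // Transition probabilities, e.g. P(b|a) = C(ab) / total pairs
         for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
             System.out.printf("P(%c|%c) = %d/%d%n",
                     e.getKey().charAt(1), e.getKey().charAt(0), e.getValue(), totalPairs);
         }
 
         // Single letter probabilities, e.g. P(a) = C(a) / total letters
         for (Map.Entry<Character, Integer> e : letterCounts.entrySet()) {
             System.out.printf("P(%c) = %d/%d%n", e.getKey(), e.getValue(), totalLetters);
         }
     }
 }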

Then for some sequence of letters <math>x_1,\ldots ,x_n</math> (the code), the likelihood that the code came from the language (or the initial letters of the language) that our sequence abbcab came from is

<math>P(x_1)P(x_2 | x_1)\cdots P(x_n | x_{n-1}),</math>

a product of probabilities, assuming independence (since you are treating it as a first order Markov model). Note that you should first take logs (using Java doubles), compute the sum of the log probabilities, and then raise back as a power of the base of the logarithm; otherwise you'll get too many numerical errors. If a probability is zero, treat its log as -9999 (in the log-transformed version). The first term is why you need the single letter frequencies as well.
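
The following Java sketch shows that log-sum trick, assuming the probabilities estimated above are already stored in two maps (one keyed by single letters, one by letter pairs); the class and method names are hypothetical.

 import java.util.Map;
 
 public class MarkovLikelihood {
 
     // Log-probability with the convention above: a zero probability becomes -9999
     static double logP(double p) {
         return p > 0 ? Math.log(p) : -9999;
     }
 
     // Likelihood P(x1) P(x2|x1) ... P(xn|x(n-1)), computed by summing
     // log probabilities and only exponentiating at the end.
     static double likelihood(String code,
                              Map<Character, Double> letterProbs,
                              Map<String, Double> pairProbs) {
         double logSum = logP(letterProbs.getOrDefault(code.charAt(0), 0.0));
         for (int i = 1; i < code.length(); i++) {
             String pair = code.substring(i - 1, i + 1);
             logSum += logP(pairProbs.getOrDefault(pair, 0.0));
         }
         return Math.exp(logSum);   // raise back as a power of e, the log base
     }
 }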

Doing this for the case where you have fifty samples of a text gives you a confidence level in your final p-value, but it makes the estimates of the probabilities wrong; 50 samples may be too many for short texts.

You can extend this to higher order Markov models, e.g. for a 2nd order one you estimate <math>P(b|bc)</math> from counts of bcb, etc.
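
For the 2nd order case, here is a small Java sketch (again with illustrative names) that counts the three-letter windows from which <math>P(b|bc)</math> and the like would be estimated:

 import java.util.HashMap;
 import java.util.Map;
 
 public class SecondOrderCounts {
 
     // Counts every three-letter window, e.g. C(bcb), which can then be
     // normalised to estimate P(b|bc) for a 2nd order model.
     static Map<String, Integer> tripleCounts(String text) {
         Map<String, Integer> counts = new HashMap<>();
         for (int i = 0; i + 3 <= text.length(); i++) {
             counts.merge(text.substring(i, i + 3), 1, Integer::sum);
         }
         return counts;
     }
 }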

== See also ==

[http://en.wikipedia.org/wiki/Markov_chain Markov chain (Wikipedia)]