Markov models

In a first-order Markov model, you compute the letter transition probabilities, [math]P(a|b)[/math], etc., from the frequencies; for example, counting every pair of letters (and every single letter) in abbcab gives

  • [math]C(ab) = 2[/math]
  • [math]C(bb) = 1[/math]
  • [math]C(bc) = 1[/math]
  • [math]C(ca) = 1[/math]
  • [math]C(a) = 2[/math]
  • [math]C(b) = 3[/math]
  • [math]C(c) = 1[/math]

Then compute the probabilities:

  • [math]P(b|a) = 2 / (2+1+1+1)[/math]
  • [math]P(b|b) = 1 / (2+1+1+1)[/math]
  • [math]P(c|b) = 1 / (2+1+1+1)[/math]
  • [math]P(a|c) = 1 / (2+1+1+1)[/math]
  • [math]P(\textrm{any other combinations}) = 0[/math]
  • [math]P(a) = 2/(2+3+1)[/math]
  • [math]P(b) = 3/(2+3+1)[/math]
  • [math]P(c) = 1/(2+3+1)[/math]
  • [math]P(d) = \ldots = P(z) = 0[/math]
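A minimal Java sketch of this counting and estimation step is below. The class name MarkovCounts and the use of string-keyed maps are illustrative choices, not part of the original page; the counts are turned into probabilities by dividing by the total pair count and the total letter count, exactly as in the example above.

  import java.util.HashMap;
  import java.util.Map;

  public class MarkovCounts {

      // Count every adjacent pair of letters, e.g. "abbcab" -> {ab=2, bb=1, bc=1, ca=1}.
      public static Map<String, Integer> pairCounts(String text) {
          Map<String, Integer> counts = new HashMap<>();
          for (int i = 0; i + 1 < text.length(); i++) {
              counts.merge(text.substring(i, i + 2), 1, Integer::sum);
          }
          return counts;
      }

      // Count every single letter, e.g. "abbcab" -> {a=2, b=3, c=1}.
      public static Map<String, Integer> letterCounts(String text) {
          Map<String, Integer> counts = new HashMap<>();
          for (int i = 0; i < text.length(); i++) {
              counts.merge(text.substring(i, i + 1), 1, Integer::sum);
          }
          return counts;
      }

      // Divide each count by the total, as in the worked example above; any pair or
      // letter that never occurs simply has no entry (probability 0).
      public static Map<String, Double> toProbabilities(Map<String, Integer> counts) {
          int total = counts.values().stream().mapToInt(Integer::intValue).sum();
          Map<String, Double> probs = new HashMap<>();
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              probs.put(e.getKey(), e.getValue() / (double) total);
          }
          return probs;
      }
  }

For abbcab this reproduces the numbers above, e.g. 2/5 for the pair ab and 3/6 for the letter b.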

Then, for some sequence of letters [math]x_1,\ldots ,x_n[/math] (the code), the likelihood that the code came from the same language (or initial letters of a language) as our sequence abbcab is [math]P(x_1)P(x_2 | x_1)\ldots P(x_n | x_{n-1})[/math], a product of probabilities, assuming independence (since you are treating it as a first-order model). Note that you should first take logs (using Java doubles), then compute the sum of the log probabilities, then raise back as a power of the base of the logarithm; otherwise you'll get too many numerical errors. You also need to define [math]\log 0 = 0[/math] (i.e. in the coding logic, if the probability is zero, then add zero; otherwise compute the log and add that). The first term, [math]P(x_1)[/math], is why you need the single letter frequencies as well.
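As a concrete illustration of that numerical advice, here is a short Java sketch (the method and parameter names are illustrative, and it assumes the string-keyed probability maps from the sketch above): it adds up log probabilities, treats a zero probability as contributing zero, and only raises back to a power of the base (here the natural logarithm) at the end.

  import java.util.Map;

  public class MarkovScore {

      // Likelihood of the code x_1..x_n: P(x_1) * P(x_2|x_1) * ... * P(x_n|x_{n-1}),
      // computed as a sum of logs to avoid numerical errors.
      public static double likelihood(String code,
                                      Map<String, Double> letterProbs,
                                      Map<String, Double> pairProbs) {
          double logSum = 0.0;

          // First term: single-letter probability of x_1.
          double first = letterProbs.getOrDefault(code.substring(0, 1), 0.0);
          if (first > 0.0) {                 // log 0 is treated as 0: add nothing
              logSum += Math.log(first);
          }

          // One term per adjacent pair of letters.
          for (int i = 0; i + 1 < code.length(); i++) {
              double p = pairProbs.getOrDefault(code.substring(i, i + 2), 0.0);
              if (p > 0.0) {                 // if the probability is zero, add zero
                  logSum += Math.log(p);
              }
          }

          // Raise back as a power of the base of the logarithm (e, for Math.log).
          return Math.exp(logSum);
      }
  }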

Doing this for the case where you have fifty samples of a text gives you a confidence level in your final p-value, but it makes the estimates of the probabilities wrong. Fifty may be too many samples for short texts.

You can extend this to higher-order Markov models, e.g. for a second-order one you estimate [math]P(b|bc)[/math] from counts of bcb, etc.
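Only the counting step changes; a sketch for a second-order model (again with illustrative names, not from the original page) counts runs of three letters, so that [math]P(b|bc)[/math] can be estimated from the count of bcb in the same way the pair counts were used above.

  import java.util.HashMap;
  import java.util.Map;

  public class SecondOrderCounts {

      // Count every run of three letters; e.g. the count of "bcb" is what
      // P(b|bc) is estimated from.
      public static Map<String, Integer> tripleCounts(String text) {
          Map<String, Integer> counts = new HashMap<>();
          for (int i = 0; i + 2 < text.length(); i++) {
              counts.merge(text.substring(i, i + 3), 1, Integer::sum);
          }
          return counts;
      }
  }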