Markov models

In a first-order Markov model, you compute the letter transition probabilities, P(b|a), etc., from frequencies; for example, for every pair of adjacent letters in abbcab (a counting sketch in Java follows the list) the counts are

  • C(ab)=2
  • C(bb)=1
  • C(bc)=1
  • C(ca)=1
  • C(a)=2
  • C(b)=3
  • C(c)=1
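
A minimal Java sketch of this counting step (class and map names are illustrative, not from the original):

  import java.util.HashMap;
  import java.util.Map;

  public class BigramCounts {
      public static void main(String[] args) {
          String text = "abbcab";
          Map<String, Integer> pairCounts = new HashMap<>();
          Map<Character, Integer> letterCounts = new HashMap<>();
          for (int i = 0; i < text.length(); i++) {
              letterCounts.merge(text.charAt(i), 1, Integer::sum);
              if (i + 1 < text.length()) {
                  pairCounts.merge(text.substring(i, i + 2), 1, Integer::sum);
              }
          }
          System.out.println(pairCounts);   // {ab=2, bb=1, bc=1, ca=1} (order may vary)
          System.out.println(letterCounts); // {a=2, b=3, c=1} (order may vary)
      }
  }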

Then compute the probabilities. Each transition probability divides a pair count by the total count of pairs starting with the same first letter, and each single-letter probability divides a letter count by the total number of letters (a sketch follows the list):

  • P(b|a)=2/2=1 (ab is the only pair starting with a)
  • P(b|b)=1/(1+1)=1/2 (pairs starting with b: bb, bc)
  • P(c|b)=1/(1+1)=1/2
  • P(a|c)=1/1=1 (ca is the only pair starting with c)
  • P(any other combination)=0
  • P(a)=2/(2+3+1)
  • P(b)=3/(2+3+1)
  • P(c)=1/(2+3+1)
  • P(d)…P(z)=0
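
Continuing the sketch above, one way to turn the counts into probabilities (the method name is illustrative):

  // P(second|first) = C(pair) / total count of pairs starting with the same first letter
  static Map<String, Double> transitionProbs(Map<String, Integer> pairCounts) {
      Map<Character, Integer> firstTotals = new HashMap<>();
      for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
          firstTotals.merge(e.getKey().charAt(0), e.getValue(), Integer::sum);
      }
      Map<String, Double> probs = new HashMap<>();
      for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
          probs.put(e.getKey(), e.getValue() / (double) firstTotals.get(e.getKey().charAt(0)));
      }
      return probs; // for abbcab: {ab=1.0, bb=0.5, bc=0.5, ca=1.0}
  }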

Then, for some sequence of letters x1, …, xn (the code), the likelihood that the code came from the language (or the initial letters of the language) that our sequence abbcab came from is

  P(x1)·P(x2|x1)·…·P(xn|xn-1),

a product of probabilities, since a first-order Markov model assumes each letter depends only on the letter before it. Note that you should first take logs (using Java doubles), then compute the sum of the log probabilities, and only at the end raise the base of the logarithm back to that power if you need the raw probability; otherwise you'll get too many numerical errors (the direct product underflows). If a probability is zero, treat it as -9999 in the log-transformed version. The first term is why you need the single-letter frequencies as well.
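
A sketch of the log-likelihood computation just described, assuming probability maps built as in the earlier sketches (single maps a letter to P(letter), transition maps a pair like "ab" to P(b|a); both names are illustrative):

  // log10 of a probability, with the zero-probability floor described above
  static double logProb(double p) {
      return p > 0 ? Math.log10(p) : -9999;
  }

  static double logLikelihood(String code, Map<Character, Double> single,
                              Map<String, Double> transition) {
      double sum = logProb(single.getOrDefault(code.charAt(0), 0.0));
      for (int i = 1; i < code.length(); i++) {
          sum += logProb(transition.getOrDefault(code.substring(i - 1, i + 1), 0.0));
      }
      return sum; // raise 10 to this power only if you need the raw probability
  }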

Doing this for the case where you have fifty samples of a text gives you a confidence level in your final p-value, but it also makes the estimates of the probabilities less reliable. Fifty may be too many samples for short texts.

You can extend this to higher-order Markov models; e.g., for a second-order one you estimate P(b|bc) from counts of the triple bcb, etc.
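
A minimal sketch of the corresponding counting step for a second-order model (variable names are illustrative):

  // C(xyz): counts of letter triples; P(b|bc) = C(bcb) / total triples starting with bc
  Map<String, Integer> tripleCounts = new HashMap<>();
  for (int i = 0; i + 3 <= text.length(); i++) {
      tripleCounts.merge(text.substring(i, i + 3), 1, Integer::sum);
  }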

See also

Markov chain (Wikipedia)