[Solved] Build your own Hidden Markov Model So far

Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 26, 2024

Build your own Hidden Markov Model So far we have discussed Markov Chain and its relation with HMM. Now its time to construct your own

Build your own Hidden Markov Model

So far we have discussed Markov Chain and its relation with HMM. Now its time to construct your own HMM to solve specific problems.

Scenario:

Suppose you are a bioinformatist working in a microbiology research group. The microbiologists in your lab identified new bacteria that possibly causes stomach cancer. Your job is to annotate the genome of this new bacteria. The first step of genome annotation is to identify the coding- and non-coding regions. You decide to achieve that by constructing an HMM model.

Construct an HMM model by answering following questions:

What are the hidden states? What is the observation?

Hidden state: The genome (nucleotides) that causes stomach cancer.

Observable state: The gene expression. (cancer/ no cancer)

In the Urn and Ball problem, each hidden state (urn) is linked with one observation (a ball). If you apply the same rule to this problem, you may find every neighboring nucleotide coming from a different hidden state. It is obviously a wrong assumption for genome. Therefore, you need to link one hidden state to a window of nucleotides.

To simplify the problem, lets assume that you have in total K windows of length L (), and the position of each ( is fixed on the genome. Consequently, you will have a sequence of hidden states .

Draw the hidden states and nucleotide windows as the Urn and Ball example in Wiki (https://en.wikipedia.org/wiki/Hidden_Markov_model#A_concrete_example)

In Urn and Ball example, the hidden states and observation is connected by probability, e.g. the probability of ball drawn from urn . How would you connect the hidden states and observations in this problem? (Hint: as we spent so much time discussing Markov Chain, it shouldnt be hard for you to guess the answer.)

Observe your graph. How many Markov Chains are needed in this problem? What are they? (Hint: neighboring nucleotide windows may be from different hidden states.

How would you model the relationship of the neighboring hidden states? Are they independent from each other?)

Assume you will use first-order Markov Chains. Describe your strategy of calculating the transition matrix and initial probability. (Hint: Obviously, you need the data to calculate these parameters. In the previous exercise, you calculated the transition matrix using a given DNA sequence (Q5 on Wednesday). However, this strategy will not work for this problem as you dont know where the coding regions are. Therefore, what data do you want to use? Where can you find the data?)

Now you have set up your HMM. It is obvious that for each window , you will have several options of hidden states. The amount of computation is enormous if using a brute-force strategy (Does this problem sound familiar to you?). However, statisticians developed a really smart method called Viterbi algorithm. It is based on Dynamic Programming (our old friend!)

Review what youve learnt about dynamic programming in the global alignment problem:

for each cell, the score is the max of () gap penalty, score() gap penalty, score() mismatch penalty or score() + matching reward). The backtracking direction is determined accordingly. The score for the base case (the very top left cell in the grids) is 0.

Fill the following table to adapt HMM into a dynamic programming problem:

What are the counterparts?
Global alignment

You are interested in probabilities rather than the scores. How would you apply the same rule to HMM? Lets work on the probability for base case (i.e. the beginning window ). What parameters do you need? How to connect the next case with base case ?

Now you have an HMM model. Remember that gene structure is complicated, as we discussed it in Unit 1. Do you think this model is good in identifying genes? If not, what changes do you want to make? (Hint: the setting of HMM is adjustable, e.g. the number of hidden states, length of the nucleotide window, positions of the window, Markov Chains, etc.)