Entity recognition is commonly used to parse unstructured text document and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing to understand a text document.

I am planning to write a blog series on text processing. In this first blog of a series of basic text processing algorithm, I will introduce some basic algorithm for entity recognition.

Given the sentence: Ricky is living in CA USA.

Output: Ricky/SP is/NA living/NA in/NA CA/SL USA/CL

Basically we want to tag each word with the entity, whose definition is domain specific. In this example, we define the following tags

- NA - Not Applicable
- SP - Start of a person name
- CP - Continue of a person name
- SL - Start of a location name
- CL - Continue of a location name

### Hidden Markov Model

Lets say there is a sequence of state, lets look at multiple probabilistic graph.However, in our tagging example, we don't directly observe the tag. Instead, we only observe the words. In this case, we can use a hidden markov model (ie: HMM).

Now the tagging problem can be structured as follows.

Given a sequence of words, we want to predict the most likely tag sequence.

Find a tag sequence t1, t2, ... tn that maximize the probability of P(t1, t2, .... | w1, w2 ...)

Using Bayes rules,

P(t1, t2, .... | w1, w2 ...) = P(t1, t2, ... tn, w1, w2, ... wn) / P(w1, w2, ... wn)

Since the sequence w1, w2 ... wn is observed and constant among all tag sequence. This is equivalent to maximize P(t1, t2, ... tn, w1, w2, ... wn) which is equal to P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

Now, P(t1|S), P(t2|t1) ... can be estimated by counting the occurrence within the training data.

P(t2|t1) = count(t1, t2) / count(*, t2)

Similarly, P(w1|t1) = count(w1, t1) / count(*, t1)

### Viterbi Algorithm

Now the problem is find a tag sequence t1, ... tn to maximizeP(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…

A naive method is to find all possible combination of tag sequence and then evaluate the above probability. The order of complexity will be O(|T|^n) where T is the number of possible tag values. Notice that this is exponential complexity with respect to the length of the sentence.

However, there is a more effective Viterbi algorithm that leverage the Markov chain properties as well as dynamic programming technique.

The key element is M(k, L) which indicates the max probability of any length k sequence that ends at tk = L. On the other hand, M(k, L) is computed by looking back different choices of S of the length k-1 sequence, and pick the one that gives the maximum M(k-1, S)*P(tk=L | tk-1=S)*P(wk|tk=L). The complexity of this algorithm is O(n*|T|^2).

To find the actual tag sequence, we also maintain a back pointer from every cell to S which leads to the cell. Then we can trace back the path from the max cell M(n, STOP) where STOP is the end of sentence.

Notice that for some rare words that is not observed from the training data, P(wk|tk=L) will be zero and cause the whole term to be zero. Such words can be numbers, dates. One way to solve this problem is to group these rare words into patterns (e.g. 3 digits, year2012 ... etc) and then compute P(group(wk) | tk=L). However, such grouping is domain specific and has to be hand-tuned.