LSTM model for NER Tagging

Ganga
7 min read · Apr 22, 2021

As we read an article or a novel, we understand every new word/phrase based on our understanding of the previous context. We don’t choose to forget everything and analyse from scratch again. We continue building on past inputs i.e., our thoughts have persistence!

This is exactly how a recurrent neural network (RNN) works: it has loops that allow information to persist. This bridges the gap between traditional neural networks and well-performing language models.

However, RNNs too come with a limitation of their own. For example, consider the sentence, “I have a pet dog. My pet ___, Ozzy, is asleep.”
We don’t need any further context; it is clear that the missing word should be dog. In scenarios like this, where the gap between the relevant word and the place where it is needed is small, RNNs come in handy.
There are also cases where we need more context. Say we need to predict the blank in the text, “I grew up in Spain … I speak fluent _______.”
The most recent chunk of information suggests that the missing word is the name of a language, but to narrow it down further we need the context Spain, from much earlier in the text.
Unfortunately, as the gap between the relevant word and the word in question grows, RNNs become unable to preserve the context and learn the connection.

RNNs are unable to retain the context

What do we need now? A longer short-term memory. Yes, LSTM!

Long Short Term Memory — while this may sound like an oxymoron, LSTM networks are among the most promising advancements in the field of deep learning. They are a special variant of RNN, capable of learning long-term dependencies.

LSTMs can preserve long-term dependencies

What sets LSTM networks apart?

LSTM networks have a gated structure capable of adding information to or removing information from the cell state. They use sigmoid activations in combination with three gates:

  1. Input Gate — Decides what information is relevant to add from the current step
  2. Forget Gate — Decides what is relevant to keep from prior steps
  3. Output Gate — Determines what the next hidden state should be

A tanh activation is used alongside the sigmoids: its zero-centred output range of (-1, 1) keeps the candidate values well distributed, and because the cell state is updated by addition rather than by repeated multiplication, information can flow across many time steps — which is what supports long-term dependencies.
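To make the gates concrete, here is a rough sketch of a single LSTM time step written out in plain PyTorch. The weight matrices W, U and the bias b are illustrative stand-ins; nn.LSTM implements the same computation internally with its own parameter layout.

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step, written out to show the three gates."""
    gates = x_t @ W + h_prev @ U + b   # shape: (batch, 4 * hidden)
    i, f, g, o = gates.chunk(4, dim=-1)
    i = torch.sigmoid(i)               # input gate: what to add from the current step
    f = torch.sigmoid(f)               # forget gate: what to keep from prior steps
    o = torch.sigmoid(o)               # output gate: what to expose as the hidden state
    g = torch.tanh(g)                  # candidate values, zero-centred in (-1, 1)
    c_t = f * c_prev + i * g           # additive cell-state update
    h_t = o * torch.tanh(c_t)
    return h_t, c_t
```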

Here’s a tutorial that explains the internal mechanism of LSTM networks with some equations and illustrations.

To know more about what happens within a LSTM cell, refer to this article.

Let’s take a look at a key takeaway that goes into training an LSTM layer.
What should the inputs look like?
The input is a 3-dimensional tensor with the shape
(batch size, sequence length, embedding dimension)

— Batch size is the number of samples processed together in one training step.
— Sequence length is the number of time steps that each sample spans.
— Embedding dimension is the size of a single time step within a sample, much like an embedding of each of these data points.
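For instance, here is a minimal sketch with assumed sizes: a batch of 32 sequences, 50 tokens each, 100-dimensional embeddings and a hidden state of size 128.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=128, batch_first=True)
inputs = torch.randn(32, 50, 100)   # (batch size, sequence length, embedding dimension)
outputs, (h_n, c_n) = lstm(inputs)
print(outputs.shape)                # torch.Size([32, 50, 128]) -> one hidden state per time step
```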

We have covered the fundamentals of LSTM networks. Now, let’s touch upon NER tagging.

Named-entity recognition

Named-entity recognition (NER) is a subtask of information extraction that is used to identify tokens and group them into a predefined set of named-entity categories.

Here is an example of a sentence tagged using IOB tags

An example of IOB tagging

The IOB format (short for inside, outside, beginning) is a tagging format used for labelling tokens in a chunking task such as named-entity recognition. These tags are similar to part-of-speech tags but give us information about the location of the word within the chunk. The IOB tagging system contains tags of the form:

  • B-{CHUNK_TYPE} — for the word at the beginning of a chunk
  • I-{CHUNK_TYPE} — for words inside a chunk
  • O — for words outside of / not part of any chunk
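For instance, in the sentence “Alex moved to New York”, Alex is tagged B-PER, New is tagged B-LOC, York is tagged I-LOC, and the remaining words are tagged O.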

There are other tagging schemes as well (such as BIOES/BILOU), and the choice depends on the use case and on how precisely multi-word chunks and their boundaries need to be marked.

Now, our goal is to train an LSTM model to predict IOB tags for any given text, using a pre-tagged set of tokens. The implementation will be carried out with PyTorch.

This is the use case we will be tackling:
Extraction of skills from job descriptions (JDs)

Let’s get started!

Implementation

Step 1: Preprocess the dataframe containing JDs and tokenise them

Preprocessing may include a variety of steps such as:

— Removal of HTML tags
— Removal of stray spaces and punctuations
— Removal of URLs
— Removal of email IDs
— Conversion to lowercase

Tokenisation is carried out using the nltk library.
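Here is a minimal sketch of these two steps; the exact regular expressions in the original notebook may differ, and the sample JD string is only for illustration.

```python
import re
import nltk

# nltk.download("punkt")  # required once for word_tokenize

def preprocess(text):
    """Clean a raw JD string following the steps listed above."""
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)           # remove email IDs
    text = re.sub(r"[^\w\s.]", " ", text)          # remove stray punctuation (keep '.', used later as a delimiter)
    text = re.sub(r"\s+", " ", text).strip()       # remove stray spaces
    return text.lower()                            # convert to lowercase

tokens = nltk.word_tokenize(preprocess("<p>Senior Data Scientist. Email hr@example.com</p>"))
print(tokens)   # ['senior', 'data', 'scientist', '.', 'email']
```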

Step 2: Use FlashText to tag the JD using a list of skills

Consider the given skillset:

List of skills

Consider this JD for a Senior Data Scientist from Kaggle’s Monster Job Postings dataset:

Sample JD

The words in red indicate the identified skills.
Note that converting both the JDs and the skills to lowercase ensures uniformity and thus eases the tagging process.

The tagged dataframe with tokens should look like this. Let’s call it labelled.
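A rough sketch of the tagging step with FlashText is shown below. The skill list, the SKILL chunk type and the helper name are illustrative; the original notebook builds the labelled dataframe from the full skill set shown above.

```python
from flashtext import KeywordProcessor

skills = ["python", "machine learning", "sql"]   # placeholder for the full skill list

keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(skills)

def iob_tag_tokens(tokens):
    """Assign B-SKILL / I-SKILL / O tags to a list of lowercased tokens."""
    text = " ".join(tokens)
    matches = keyword_processor.extract_keywords(text, span_info=True)

    # Map character offsets of the joined text back to token indices.
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1                      # +1 for the joining space

    tags = ["O"] * len(tokens)
    for _, start, end in matches:
        inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        if inside:
            tags[inside[0]] = "B-SKILL"
            for i in inside[1:]:
                tags[i] = "I-SKILL"
    return tags

print(iob_tag_tokens("experience with machine learning and python".split()))
# ['O', 'O', 'B-SKILL', 'I-SKILL', 'O', 'B-SKILL']
```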

Step 3: Group the labelled tokens into sentences using ‘.’ as the delimiter
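A minimal sketch, assuming labelled has token and tag columns:

```python
# Group the labelled tokens into sentences, splitting on the '.' token.
sentences, current = [], []
for token, tag in zip(labelled["token"], labelled["tag"]):
    current.append((token, tag))
    if token == ".":
        sentences.append(current)
        current = []
if current:   # keep a trailing sentence without a final full stop
    sentences.append(current)
```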

Step 4: Create a dictionary with unique key-value pairs — for tokens and tags.
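A minimal sketch of these lookups built from the grouped sentences; reserving index 0 for padding and adding an <UNK> entry for unseen tokens are assumptions on my part:

```python
# Token -> index lookup; 0 is reserved for padding and 1 for unknown tokens.
token2idx = {"<PAD>": 0, "<UNK>": 1}
for sentence in sentences:
    for token, _ in sentence:
        token2idx.setdefault(token, len(token2idx))

# Tag -> index lookup, e.g. {'O': 0, 'B-SKILL': 1, 'I-SKILL': 2}.
tag2idx = {}
for sentence in sentences:
    for _, tag in sentence:
        tag2idx.setdefault(tag, len(tag2idx))
```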

Step 5: Write a generator function to generate batches for training

Prior to this, the data can be split into two sets, for training and validation. A 7:3 ratio is ideal.

Here we will be using a batch size of 32.
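A minimal sketch of the split and the generator:

```python
import random

# 7:3 split into training and validation sets.
split = int(0.7 * len(sentences))
train_data, val_data = sentences[:split], sentences[split:]

def batch_generator(data, batch_size=32, shuffle=True):
    """Yield lists of `batch_size` sentences (each a list of (token, tag) pairs)."""
    data = list(data)
    if shuffle:
        random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```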

Step 6: Write a function to prepare sequences in order to pass to the model.

This function will convert the tokens and tags in every batch into integer indices and pad them to form a matrix representation of the inputs. These matrices are then converted into tensors for training.
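A minimal sketch of such a function, padding tokens with index 0 and tags with -1 so that the padded tag positions can be ignored by the loss later:

```python
import torch

def prepare_batch(batch, token2idx, tag2idx):
    """Convert a batch of (token, tag) sentences into padded LongTensors."""
    max_len = max(len(sentence) for sentence in batch)
    token_ids = torch.zeros(len(batch), max_len, dtype=torch.long)      # 0 = <PAD>
    tag_ids = torch.full((len(batch), max_len), -1, dtype=torch.long)   # -1 marks padded tags
    for i, sentence in enumerate(batch):
        for j, (token, tag) in enumerate(sentence):
            token_ids[i, j] = token2idx.get(token, token2idx["<UNK>"])
            tag_ids[i, j] = tag2idx[tag]
    return token_ids, tag_ids
```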

Step 7: Define the model class

A softmax activation along the last dimension ensures that the values in each row add up to 1, giving a probability distribution over the tags.

Refer to the official documentation hosted by PyTorch, to get a clearer picture of the layers used and maintain consistency of dimensions throughout.
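Here is a minimal sketch of such a model class; the layer sizes are assumptions, and this version returns raw per-token scores because nn.CrossEntropyLoss (used in the next step) applies the softmax internally:

```python
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Embedding -> LSTM -> linear projection to per-token tag scores."""

    def __init__(self, vocab_size, tagset_size, embedding_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq_len, hidden_dim)
        return self.fc(lstm_out)               # (batch, seq_len, tagset_size)
```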

Step 8: Instantiate the model

This is a crucial step, especially for a skewed or unbalanced dataset.

Let’s take a look at the loss function. Cross-entropy loss is used here as there are multiple target classes. ignore_index is set to -1, so that the padding applied to the tags is ignored, and weight is initialised to class weights calculated as follows.
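Below is a minimal sketch that computes inverse-frequency weights and instantiates the model, loss and an optimizer. The exact weight formula, the choice of Adam and the learning rate are assumptions and may differ from the original gist.

```python
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim

# Inverse-frequency weights: rarer tags such as B-SKILL get a larger weight than O.
tag_counts = Counter(tag for sentence in train_data for _, tag in sentence)
ordered_tags = sorted(tag2idx, key=tag2idx.get)
weights = torch.tensor(
    [sum(tag_counts.values()) / max(tag_counts[t], 1) for t in ordered_tags],
    dtype=torch.float,
)

model = LSTMTagger(vocab_size=len(token2idx), tagset_size=len(tag2idx))
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-1)   # padded tags (-1) are ignored
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```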

This gives more weight to the rarer, sparsely represented classes in the dataset.

Step 9: Train the model

Note that we will be calculating the accuracy by disregarding the padding.
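A minimal sketch of the training loop; the number of epochs is an assumption:

```python
import torch

for epoch in range(10):
    model.train()
    total_loss = 0.0
    for batch in batch_generator(train_data, batch_size=32):
        token_ids, tag_ids = prepare_batch(batch, token2idx, tag2idx)
        optimizer.zero_grad()
        scores = model(token_ids)   # (batch, seq_len, n_tags)
        loss = criterion(scores.reshape(-1, scores.size(-1)), tag_ids.reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Validation accuracy, disregarding the padded positions (tag id == -1).
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in batch_generator(val_data, batch_size=32, shuffle=False):
            token_ids, tag_ids = prepare_batch(batch, token2idx, tag2idx)
            preds = model(token_ids).argmax(dim=-1)
            mask = tag_ids != -1
            correct += (preds[mask] == tag_ids[mask]).sum().item()
            total += mask.sum().item()
    print(f"epoch {epoch + 1}: train loss {total_loss:.3f}, val accuracy {correct / total:.3f}")
```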

This model can now be used to predict the tags for any given JD.

Since most of the available datasets for this task will be skewed, a better practice is to calculate the F1 score for each class rather than a straightforward accuracy score. Scikit-learn provides built-in functionality for this.
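A minimal sketch using scikit-learn's classification_report on the validation predictions:

```python
import torch
from sklearn.metrics import classification_report

idx2tag = {i: t for t, i in tag2idx.items()}
y_true, y_pred = [], []

model.eval()
with torch.no_grad():
    for batch in batch_generator(val_data, batch_size=32, shuffle=False):
        token_ids, tag_ids = prepare_batch(batch, token2idx, tag2idx)
        preds = model(token_ids).argmax(dim=-1)
        mask = tag_ids != -1                                  # drop the padded positions
        y_true.extend(idx2tag[i] for i in tag_ids[mask].tolist())
        y_pred.extend(idx2tag[i] for i in preds[mask].tolist())

print(classification_report(y_true, y_pred))                  # per-class precision, recall, F1
```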

Scope for experimentation
- Word2Vec embeddings can be used instead of the index-based embedding layer.
- Sentence embeddings may also help capture more semantics.
- Hyperparameters such as the number of epochs, the hidden dimension and the embedding dimension can be tweaked.
- Another variant of the LSTM is the bidirectional LSTM, which also processes the sequence in reverse order.
- Attention is another concept worth exploring for this use case.

Essentially all of the remarkable results achieved with traditional RNNs can also be achieved with LSTMs, and they work a lot better for most tasks!

Written down as a set of equations, LSTMs can look pretty intimidating. Hopefully, a hands-on approach through snippets of code and illustrations has made them easier to digest!

LSTMs are the way to go if our language models require the context to persist over a longer span; they can be used for text generation and sentiment analysis tasks as well.
